Data and Computation Transformations for Multiprocessors
In Proceedings of the Fifth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 1995
Cited by 156 (14 self)
Effective memory hierarchy utilization is critical to the performance of modern multiprocessor architectures. We have developed the first compiler system that fully automatically parallelizes sequential programs and changes the original array layouts to improve memory system performance. Our optimization algorithm consists of two steps. The first step chooses the parallelization and computation assignment such that synchronization and data sharing are minimized. The second step then restructures the layout of the data in the shared address space with an algorithm that is based on a new data transformation framework. We ran our compiler on a set of application programs and measured their performance on the Stanford DASH multiprocessor. Our results show that the compiler can effectively optimize parallelism in conjunction with memory subsystem performance. 1 Introduction In the last decade, microprocessor speeds have been steadily improving at a rate of 50% to 100% every year [16]. Meanwh...
Unifying Data and Control Transformations for Distributed Shared-Memory Machines
1994
Cited by 150 (10 self)
We present a unified approach to locality optimization that employs both data and control transformations. Data transformations include changing the array layout in memory. Control transformations involve changing the execution order of programs. We have developed new techniques for compiler optimizations for distributed shared-memory machines, although the same techniques can be used for sequential machines with a memory hierarchy. Our compiler optimizations are based on an algebraic representation of data mappings and a new data locality model. We present a pure data transformation algorithm and an algorithm unifying data and control transformations. While there has been much work on control transformations, the opportunities for data transformations have been largely neglected. In fact, data transformations have the advantage of being applicable to programs that cannot be optimized with control transformations. The unified algorithm, which performs data and control transformations s...
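The distinction between the two kinds of transformation in this abstract can be illustrated with a minimal sketch. It is not the paper's algorithm: the helper names (`linear_offset`, `inner_stride`), the 4 x 8 array shape, and the two-loop nest over `A[i][j]` are all assumptions chosen for illustration. The sketch measures the distance in memory between accesses on consecutive innermost-loop iterations, then shows that either a control transformation (loop interchange) or a data transformation (switching the array layout) brings that distance down to one element.

```python
def linear_offset(index, shape, layout):
    """Element offset of a 2-D index under a row-major or column-major layout."""
    i, j = index
    rows, cols = shape
    if layout == "row_major":
        return i * cols + j
    if layout == "col_major":
        return j * rows + i
    raise ValueError(f"unknown layout: {layout}")


def inner_stride(loop_order, shape, layout):
    """Distance (in elements) between accesses to A[i][j] on consecutive
    iterations of the innermost loop. loop_order lists loops outermost first."""
    inner = loop_order[-1]
    first = {"i": 0, "j": 0}
    second = dict(first)
    second[inner] += 1  # advance only the innermost loop variable
    a = linear_offset((first["i"], first["j"]), shape, layout)
    b = linear_offset((second["i"], second["j"]), shape, layout)
    return b - a


SHAPE = (4, 8)  # hypothetical array dimensions (rows, cols)

# Original program: j outer, i inner, row-major array.
original = inner_stride(("j", "i"), SHAPE, "row_major")

# Control transformation: interchange the loops, keep the layout.
interchanged = inner_stride(("i", "j"), SHAPE, "row_major")

# Data transformation: keep the loop order, change the layout.
relaid = inner_stride(("j", "i"), SHAPE, "col_major")

print(original, interchanged, relaid)
```

In this toy case both routes reach unit stride. The abstract's point is that the data route remains available when dependences prevent the loops from being reordered.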
Reducing False Sharing on Shared Memory Multiprocessors through Compile Time Data Transformations.
In Proceedings of the Fifth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 1994
Cited by 113 (1 self)
We have developed compiler algorithms that analyze coarse-grained, explicitly parallel programs and restructure their shared data to minimize the number of false sharing misses. The algorithms analyze the per-process data accesses to shared data, use this information to pinpoint the data structures that are prone to false sharing and choose an appropriate transformation to reduce it. The algorithms eliminated an average (across the entire workload) of 64% of false sharing misses, and in two programs more than 90%. However, how well the reduction in false sharing misses translated into improved execution time depended heavily on the memory subsystem architecture and previous programmer efforts to optimize for locality. On a multiprocessor with a large cache configuration and high cache miss penalty, the transformations improved the execution time of programmer-unoptimized applications by as much as 60%. However, on programs where previous programmer efforts to improve data locality had ...
Performance Analysis Using the MIPS R10000 Performance Counters
1996
Cited by 91 (0 self)
Tuning supercomputer application performance often requires analyzing the interaction of the application and the underlying architecture. In this paper, we describe support in the MIPS R10000 for non-intrusively monitoring a variety of processor events -- support that is particularly useful for characterizing the dynamic behavior of multi-level memory hierarchies, hardware-based cache coherence, and speculative execution. We first explain how performance data is collected using an integrated set of hardware mechanisms, operating system abstractions, and performance tools. We then describe several examples drawn from scientific applications, which illustrate how the counters and profiling tools provide information that helps developers analyze and tune applications. Keywords: performance analysis, profiling tools, hardware performance counters, MIPS R10000, SGI Power Challenge 1. Introduction A fundamental question asked by HPC application developers is: "Where is the time spent?"...
The detection and elimination of useless misses in multiprocessors
In Proceedings of the 20th International Symposium on Computer Architecture, 1993
Cited by 64 (3 self)
In this paper we introduce a classification of misses in shared-memory multiprocessors based on interprocessor communication. We identify the set of essential misses, i.e., the smallest set of misses necessary for correct execution. Essential misses include cold misses and true sharing misses. All other misses are useless misses and can be ignored without affecting program execution. Based on the new classification we evaluate miss reduction techniques in hardware, based on delaying and combining invalidations. We compare the effectiveness of five different protocols for combining invalidations leading to useless misses for cache-based multiprocessors and for multiprocessors with virtual shared memory. In cache-based systems these techniques are very effective and lead to miss rates which are close to the minimum. In virtual shared memory systems, the techniques are also effective but leave room for additional improvements.
Parallel data mining for association rules on shared-memory multiprocessors
In Proc. Supercomputing’96, 1996
Cited by 62 (19 self)
In this paper we present a new parallel algorithm for data mining of association rules on shared-memory multiprocessors. We study the degree of parallelism, synchronization, and data locality issues, and present optimizations for fast frequency computation. Experiments show that a significant improvement of performance is achieved using our proposed optimizations. We also achieved good speed-up for the parallel algorithm. A lot of data-mining tasks (e.g. association rules, sequential patterns) use complex pointer-based data structures (e.g. hash trees) that typically suffer from suboptimal data locality. In the multiprocessor case shared access to these data structures may also result in false sharing. For these tasks it is commonly observed that the recursive data structure is built once and accessed multiple times during each iteration. Furthermore, the access patterns after the build phase are highly ordered. In such cases locality and false sharing sensitive memory placement of these structures can enhance performance significantly. We evaluate a set of placement policies for parallel association discovery, and show that simple placement schemes can improve execution time by more than a factor of two. More complex schemes yield additional gains.
On the value locality of store instructions
In Proceedings of the 27th Annual International Symposium on Computer Architecture, 2000
Cited by 60 (8 self)
Value locality, a recently discovered program attribute that describes the likelihood of the recurrence of previously-seen program values, has been studied enthusiastically in the recent published literature. Much of the energy has focused on refining the initial efforts at predicting load instruction outcomes, with the balance of the effort examining the value locality of either all register-writing instructions, or a focused subset of them. Surprisingly, there has been very little published characterization of or effort to exploit the value locality of data words stored to memory by computer programs. This paper presents such a characterization, proposes both memory-centric (based on message passing) and producer-centric (based on program structure) predic-
Adjustable Block Size Coherent Caches
In Proceedings of the 19th Annual International Symposium on Computer Architecture, 1992
Cited by 55 (0 self)
caches depends on the relationship between the granularity of sharing and locality exhibited by the program and the cache block size. Large cache blocks exploit processor and spatial locality, but may cause unnecessary cache invalidations due to false sharing. Small cache blocks can reduce the number of cache invalidations, but increase the number of bus or network transactions required to load data into the cache. In this paper we describe a cache organization that dynamically adjusts the cache block size according to recently observed reference behavior. Cache blocks are split across cache lines when false sharing occurs, and merged back into a single cache line to exploit spatial locality. To evaluate this cache organization, we simulate a scalable multiprocessor with coherent caches, using a suite of memory reference traces to model program behavior. We show that for every fixed block size, some program suffers a 33% increase in the average waiting time per reference, and a factor of 2 increase in the average number of words transferred per reference, when compared against the performance of an adjustable block size cache. In the few cases where adjusting the block size does not provide superior performance, it comes within 7% of the best fixed block size alternative. We conclude that an adjustable block size cache offers significantly better performance than every fixed block size cache, especially when there is variability in the granularity of sharing exhibited by applications.
False Sharing and its Effect on Shared Memory Performance
In Proceedings of the USENIX Symposium on Experiences with Distributed and Multiprocessor Systems (SEDMS IV), 1993
Cited by 50 (3 self)
False sharing occurs when processors in a shared-memory parallel system make references to different data objects within the same coherence block (cache line or page), thereby inducing "unnecessary" coherence operations. False sharing is widely believed to be a serious problem for parallel program performance, but a precise definition and quantification of the problem has proven to be elusive. We explain why. In the process, we present a variety of possible definitions for false sharing, and discuss the merits and drawbacks of each. Our discussion is based on experience gained during a four-year study of multiprocessor memory architecture and its effect on the behavior of applications in a sixteen-program suite. Using
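The informal definition in the first sentence of this abstract can be turned into a small check. This is only a sketch under simplifying assumptions, and it is not one of the paper's definitions (the abstract stresses that a precise definition is elusive): each data object is identified with its byte address, the order and timing of accesses are ignored, and the function name `falsely_shared_blocks`, the 64-byte block size, and the 8-byte counters are hypothetical choices for illustration.

```python
from collections import defaultdict


def falsely_shared_blocks(accesses, block_size):
    """Return the sorted coherence-block numbers that are touched by two or
    more processors even though no single address in the block is touched by
    more than one processor.

    accesses: iterable of (processor_id, byte_address) pairs.
    This is a crude, timing-free approximation (an assumption of this sketch).
    """
    procs_by_block = defaultdict(set)
    procs_by_addr = defaultdict(set)
    for proc, addr in accesses:
        procs_by_block[addr // block_size].add(proc)
        procs_by_addr[addr].add(proc)

    result = []
    for block, procs in procs_by_block.items():
        if len(procs) < 2:
            continue  # block is private to one processor
        addrs = [a for a in procs_by_addr if a // block_size == block]
        if all(len(procs_by_addr[a]) == 1 for a in addrs):
            result.append(block)  # shared block, but no shared object
    return sorted(result)


BLOCK = 64  # hypothetical coherence block size in bytes

# Four processors, each updating its own 8-byte counter, packed contiguously.
packed = [(p, p * 8) for p in range(4)]

# Same counters, each padded out to its own block.
padded = [(p, p * BLOCK) for p in range(4)]

print(falsely_shared_blocks(packed, BLOCK))  # [0]
print(falsely_shared_blocks(padded, BLOCK))  # []
```

Padding removes the reported block here at the cost of extra memory. A block in which some addresses are truly shared and others are not is deliberately left unreported by this check, which is one example of why simple definitions fall short.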
Optimizing Data Locality by Array Restructuring
1995
Cited by 41 (0 self)
It is increasingly important that optimizing compilers restructure programs for data locality to obtain high performance on today's powerful architectures. In this paper, we focus on array restructuring, a technique that improves the spatial locality exhibited by array accesses in nested loops. Specifically, we address the following question: Given a set of such accesses, how should the array elements be laid out in memory to match the access pattern and thus maximize locality? Our approach is based on an invertible linear transformation of array index vectors. We present algorithms to choose a suitable transformation, and hence array layout, given the set of array accesses. Our analysis places no restrictions on the loop's nesting structure or dependence pattern. Although we focus on cases where the array indexing expressions are affine functions of loop variables, our techniques can be applied to the non-affine case as well. We have implemented our technique in the SUIF compiler [17...
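The idea of an invertible linear transformation of array index vectors can be sketched with a toy example. It does not reproduce the paper's algorithm for choosing the transformation: the access `A[i + j][j]`, the matrix `SKEW`, and the helper names are assumptions made for illustration, and array bounds after the transformation are not handled. The sketch computes how the (possibly transformed) index vector moves when the inner loop variable `j` advances by one. In a row-major array, a move of `(0, 1)` means the next access is the adjacent element.

```python
def apply(T, v):
    """Multiply a square matrix T (list of rows) by an index vector v."""
    return tuple(sum(T[r][c] * v[c] for c in range(len(v))) for r in range(len(T)))


def det2(T):
    """Determinant of a 2x2 matrix; nonzero means T is invertible."""
    return T[0][0] * T[1][1] - T[0][1] * T[1][0]


def access(i, j):
    """Hypothetical affine access: the loop body reads A[i + j][j]."""
    return (i + j, j)


def inner_step(T):
    """Change in the transformed index vector when the inner loop j advances
    by one. The access is linear, so the step is the same at every iteration."""
    before = apply(T, access(0, 0))
    after = apply(T, access(0, 1))
    return tuple(b - a for a, b in zip(before, after))


IDENTITY = [[1, 0], [0, 1]]   # original layout
SKEW = [[1, -1], [0, 1]]      # candidate transformation: (a, b) -> (a - b, b)

print(inner_step(IDENTITY))   # (1, 1): each step jumps a full row plus one
print(inner_step(SKEW))       # (0, 1): each step moves to the adjacent element
print(det2(SKEW))             # 1, so the transformation is invertible
```

Storing element `(a, b)` of the original array at position `(a - b, b)` of the restructured array makes the inner loop walk along a row. Because the matrix is invertible, every original element maps to exactly one new position.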

