Results 11 - 20
of
504
Cache-Conscious Structure Layout
, 1999
"... Hardware trends have produced an increasing disparity between processor speeds and memory access times. While a variety of techniques for tolerating or reducing memory latency have been proposed, these are rarely successful for pointer-manipulating programs. This paper explores a complementary appro ..."
Abstract
-
Cited by 164 (8 self)
- Add to MetaCart
Hardware trends have produced an increasing disparity between processor speeds and memory access times. While a variety of techniques for tolerating or reducing memory latency have been proposed, these are rarely successful for pointer-manipulating programs. This paper explores a complementary approach that attacks the source (poor reference locality) of the problem rather than its manifestation (memory latency). It demonstrates that careful data organization and layout provides an essential mechanism to improve the cache locality of pointer-manipulating programs and consequently, their performance. It explores two placement technique-lustering and colorinet improve cache performance by increasing a pointer structure’s spatial and temporal locality, and by reducing cache-conflicts. To reduce the cost of applying these techniques, this paper discusses two strategies-cache-conscious reorganization and cacheconscious allocation--and describes two semi-automatic toolsccmorph and ccmalloc-that use these strategies to produce cache-conscious pointer structure layouts. ccmorph is a transparent tree reorganizer that utilizes topology information to cluster and color the structure. ccmalloc is a cache-conscious heap allocator that attempts to co-locate contemporaneously accessed data elements in the same physical cache block. Our evaluations, with microbenchmarks, several small benchmarks, and a couple of large real-world applications, demonstrate that the cache-conscious structure layouts produced by ccmorph and ccmalloc offer large performance benefit-n most cases, significantly outperforming state-of-the-art prefetching.
Data and Computation Transformations for Multiprocessors
- In Proceedings of the Fifth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
, 1995
"... Effective memory hierarchy utilization is critical to the performance of modern multiprocessor architectures. We havedeveloped the first compiler system that fully automatically parallelizes sequential programs and changes the original array layouts to improve memory system performance. Our optimiza ..."
Abstract
-
Cited by 156 (14 self)
- Add to MetaCart
Effective memory hierarchy utilization is critical to the performance of modern multiprocessor architectures. We havedeveloped the first compiler system that fully automatically parallelizes sequential programs and changes the original array layouts to improve memory system performance. Our optimization algorithm consists of two steps. The first step chooses the parallelization and computation assignment such that synchronization and data sharing are minimized. The second step then restructures the layout of the data in the shared address space with an algorithm that is based on a new data transformation framework. We ran our compiler on a set of application programs and measured their performance on the Stanford DASH multiprocessor. Our results show that the compiler can effectively optimize parallelism in conjunction with memory subsystem performance. 1 Introduction In the last decade, microprocessor speeds have been steadily improving at a rate of 50% to 100% every year[16]. Meanwh...
Unifying Data and Control Transformations for Distributed Shared-Memory Machines
, 1994
"... We present a unified approach to locality optimization that employs both data and control transformations. Data transformations include changing the array layout in memory. Control transformations involve changing the execution order of programs. We have developed new techniques for compiler optimiz ..."
Abstract
-
Cited by 150 (10 self)
- Add to MetaCart
We present a unified approach to locality optimization that employs both data and control transformations. Data transformations include changing the array layout in memory. Control transformations involve changing the execution order of programs. We have developed new techniques for compiler optimizations for distributed shared-memory machines, although the same techniques can be used for sequential machines with a memory hierarchy. Our compiler optimizations are based on an algebraic representation of data mappings and a new data locality model. We present a pure data transformation algorithm and an algorithm unifying data and control transformations. While there has been much work on control transformations, the opportunities for data transformations have been largely neglected. In fact, data transformations have the advantage of being applicable to programs that cannot be optimized with control transformations. The unified algorithm, which performs data and control transformations s...
Data-centric Multi-level Blocking
, 1997
"... We present a simple and novel framework for generating blocked codes for high-performance machines with a memory hierarchy. Unlike traditional compiler techniques like tiling, which are based on reasoning about the control flow of programs, our techniques are based on reasoning directly about the fl ..."
Abstract
-
Cited by 133 (9 self)
- Add to MetaCart
We present a simple and novel framework for generating blocked codes for high-performance machines with a memory hierarchy. Unlike traditional compiler techniques like tiling, which are based on reasoning about the control flow of programs, our techniques are based on reasoning directly about the flow of data through the memory hierarchy. Our data-centric transformations permit a more direct solution to the problem of enhancing data locality than current control-centric techniques do, and generalize easily to multiple levels of memory hierarchy. We buttress these claims with performance numbers for standard benchmarks from the problem domain of dense numerical linear algebra. The simplicity and intuitive appeal of our approach should make it attractive to compiler writers as well as to library writers. 1 Introduction Data reuse is imperative for good performance on modern high-performance computers because the memory architecture of these machines is a hierarchy in which the cost of ...
Lifetime-Sensitive Modulo Scheduling
- In Proc. of the ACM SIGPLAN '93 Conf. on Programming Language Design and Implementation
, 1993
"... This paper shows how to software pipeline a loop for minimal register pressure without sacrificing the loop's minimum execution time. This novel bidirectional slack-scheduling method has been implemented in a FORTRAN compiler and tested on many scientific benchmarks. The empirical results---when me ..."
Abstract
-
Cited by 129 (0 self)
- Add to MetaCart
This paper shows how to software pipeline a loop for minimal register pressure without sacrificing the loop's minimum execution time. This novel bidirectional slack-scheduling method has been implemented in a FORTRAN compiler and tested on many scientific benchmarks. The empirical results---when measured against an absolute lower bound on execution time, and against a novel schedule-independent absolute lower bound on register pressure---indicate nearoptimal performance. 1 Introduction Software pipelining increases a loop's throughput by overlapping the loop's iterations; that is, by initiating successive iterations before prior iterations complete. With sufficient overlap, a functional unit can be saturated, at which point the loop initiates iterations at the maximum possible rate. To find an overlapped schedule, a compiler must represent the complex resource constraints that can arise. Efficiently representing these constraints is especially difficult when adjacent iterations do n...
Cache Miss Equations: A Compiler Framework for Analyzing and Tuning Memory Behavior
- ACM Transactions on Programming Languages and Systems
, 1999
"... This article describes methods for generating and solving Cache Miss Equations (CMEs) that give a detailed representation of cache behavior, including conflict misses, in loop-oriented scientific code. Implemented within the SUIF compiler framework, our approach extends traditional compiler reuse an ..."
Abstract
-
Cited by 127 (1 self)
- Add to MetaCart
This article describes methods for generating and solving Cache Miss Equations (CMEs) that give a detailed representation of cache behavior, including conflict misses, in loop-oriented scientific code. Implemented within the SUIF compiler framework, our approach extends traditional compiler reuse analysis to generate linear Diophantine equations that summarize each loop's memory behavior. While solving these equations is in general di#- cult, we show that is also unnecessary, as mathematical techniques for manipulating Diophantine equations allow us to relatively easily compute and/or reduce the number of possible solutions, where each solution corresponds to a potential cache miss. The mathematical precision of CMEs allows us to find true optimal solutions for transformations such as blocking or padding. The generality of CMEs also allows us to reason about interactions between transformations applied in concert. The article also gives examples of their use to determine array padding and o#set amounts that minimize cache misses, and to determine optimal blocking factors for tiled code. Overall, these equations represent an analysis framework that o#ers the generality and precision needed for detailed compiler optimizations
Data Transformations for Eliminating Conflict Misses
- In Proceedings of the SIGPLAN '98 Conference on Programming Language Design and Implementation
, 1998
"... Many cache misses in scientific programs are due to conflicts caused by limited set associativity. We examine two compile-time data-layout transformations for eliminating conflict misses, concentrating on misses occuring on every loop iteration. Inter-variable padding adjusts variable base addresses ..."
Abstract
-
Cited by 118 (12 self)
- Add to MetaCart
Many cache misses in scientific programs are due to conflicts caused by limited set associativity. We examine two compile-time data-layout transformations for eliminating conflict misses, concentrating on misses occuring on every loop iteration. Inter-variable padding adjusts variable base addresses, while intra-variable padding modifies array dimension sizes. Two levels of precision are evaluated. PadLite only uses array and column dimension sizes, relying on assumptions about common array reference patterns. Pad analyzes programs, detecting conflict misses by linearizing array references and calculating conflict distances between uniformly-generated references. The Euclidean algorithm for computing the gcd of two numbers is used to predict conflicts between different array columns for linear algebra codes. Experiments on a range of programs indicate PadLite can eliminate conflicts for benchmarks, but Pad is more effective over a range of cache and problem sizes. Padding reduces c...
False Sharing and Spatial Locality in Multiprocessor Caches
- IEEE Transactions on Computers
, 1992
"... The performance of the data cache in shared-memory multiprocessors has been shown to be different from that in uniprocessors. In particular, cache miss rates in multiprocessors do not show the sharp drop typical of uniprocessors when the size of the cache block increases. The resulting high cache mi ..."
Abstract
-
Cited by 114 (4 self)
- Add to MetaCart
The performance of the data cache in shared-memory multiprocessors has been shown to be different from that in uniprocessors. In particular, cache miss rates in multiprocessors do not show the sharp drop typical of uniprocessors when the size of the cache block increases. The resulting high cache miss rate is a cause of concern, since it can significantly limit the performance of multiprocessors. Some researchers have speculated that this effect is due to false sharing, the coherence transactions that result when different processors update different words of the same cache block in an interleaved fashion. While the analysis of six applications in this paper confirms that false sharing has a significant impact on the miss rate, the measurements also show that poor spatial locality among accesses to shared data has an even larger impact. To mitigate false sharing and to enhance spatial locality, we optimize the layout of shared data in cache blocks in a programmer-transparent manner. We...
Reducing False Sharing on Shared Memory Multiprocessors through Compile Time Data Transformations.
- In Proceedings of the Fifth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
, 1994
"... We have developed compiler algorithms that analyze coarse-grained, explicitly parallel programs and restructure their shared data to minimize the number of false sharing misses. The algorithms analyze the per-process data accesses to shared data, use this information to pinpoint the data structures ..."
Abstract
-
Cited by 112 (1 self)
- Add to MetaCart
We have developed compiler algorithms that analyze coarse-grained, explicitly parallel programs and restructure their shared data to minimize the number of false sharing misses. The algorithms analyze the per-process data accesses to shared data, use this information to pinpoint the data structures that are prone to false sharing and choose an appropriate transformation to reduce it. The algorithms eliminated an average (across the entire workload) of 64% of false sharing misses, and in two programs more than 90%. However, how well the reduction in false sharing misses translated into improved execution time depended heavily on the memory subsystem architecture and previous programmer efforts to optimize for locality. On a multiprocessor with a large cache configuration and high cache miss penalty, the transformations improved the execution time of programmer-unoptimized applications by as much as 60%. However, on programs where previous programmer efforts to improve data locality had ...
To Copy or Not to Copy: A Compile-Time Technique for Assessing When Data Copying Should be Used to Eliminate Cache Conflicts
- In Proceedings of Supercomputing '93
, 1993
"... this paper, we present a compile-time technique for making this determination, and present a selective copying strategy based on this methodology. Preliminary experimental results demonstrate that, because of the sensitivity of cache conflicts to small changes in problem size and base addresses, sel ..."
Abstract
-
Cited by 107 (5 self)
- Add to MetaCart
this paper, we present a compile-time technique for making this determination, and present a selective copying strategy based on this methodology. Preliminary experimental results demonstrate that, because of the sensitivity of cache conflicts to small changes in problem size and base addresses, selective copying can lead to better overall performance than either no copying, complete copying or manually applied heuristics.

