Results 1  10
of
57
Improving Data Locality with Loop Transformations
 ACM TRANSACTIONS ON PROGRAMMING LANGUAGES AND SYSTEMS
, 1996
"... ..."
Compiler Optimizations for Improving Data Locality
 IN PROCEEDINGS OF THE SIXTH INTERNATIONAL CONFERENCE ON ARCHITECTURAL SUPPORT FOR PROGRAMMING LANGUAGES AND OPERATING SYSTEMS
, 1994
"... In the past decade, processor speed has become significantly faster than memory speed. Small, fast cache memories are designed to overcome this discrepancy, but they are only effective when programs exhibit data locality. In this paper, we present compiler optimizations to improve data locality base ..."
Abstract

Cited by 187 (17 self)
 Add to MetaCart
In the past decade, processor speed has become significantly faster than memory speed. Small, fast cache memories are designed to overcome this discrepancy, but they are only effective when programs exhibit data locality. In this paper, we present compiler optimizations to improve data locality basedon a simple yet accurate cost model. The model computes both temporal and spatial reuse of cache lines to find desirable loop organizations. The cost model drives the application of compound transformations consisting of loop permutation, loop fusion, loop distribution, and loop reversal. We demonstrate that these program transformations are useful for optimizing many programs. To validate our optimization strategy, we implemented our algorithms and ran experiments on a large collection of scientific programs and kernels. Experiments with kernels illustrate that our model and algorithm can select and achieve the best performance. For over thirty complete applications, we executed the origi...
Unifying Data and Control Transformations for Distributed SharedMemory Machines
, 1994
"... We present a unified approach to locality optimization that employs both data and control transformations. Data transformations include changing the array layout in memory. Control transformations involve changing the execution order of programs. We have developed new techniques for compiler optimiz ..."
Abstract

Cited by 152 (10 self)
 Add to MetaCart
We present a unified approach to locality optimization that employs both data and control transformations. Data transformations include changing the array layout in memory. Control transformations involve changing the execution order of programs. We have developed new techniques for compiler optimizations for distributed sharedmemory machines, although the same techniques can be used for sequential machines with a memory hierarchy. Our compiler optimizations are based on an algebraic representation of data mappings and a new data locality model. We present a pure data transformation algorithm and an algorithm unifying data and control transformations. While there has been much work on control transformations, the opportunities for data transformations have been largely neglected. In fact, data transformations have the advantage of being applicable to programs that cannot be optimized with control transformations. The unified algorithm, which performs data and control transformations s...
Datacentric Multilevel Blocking
, 1997
"... We present a simple and novel framework for generating blocked codes for highperformance machines with a memory hierarchy. Unlike traditional compiler techniques like tiling, which are based on reasoning about the control flow of programs, our techniques are based on reasoning directly about the fl ..."
Abstract

Cited by 139 (9 self)
 Add to MetaCart
We present a simple and novel framework for generating blocked codes for highperformance machines with a memory hierarchy. Unlike traditional compiler techniques like tiling, which are based on reasoning about the control flow of programs, our techniques are based on reasoning directly about the flow of data through the memory hierarchy. Our datacentric transformations permit a more direct solution to the problem of enhancing data locality than current controlcentric techniques do, and generalize easily to multiple levels of memory hierarchy. We buttress these claims with performance numbers for standard benchmarks from the problem domain of dense numerical linear algebra. The simplicity and intuitive appeal of our approach should make it attractive to compiler writers as well as to library writers. 1 Introduction Data reuse is imperative for good performance on modern highperformance computers because the memory architecture of these machines is a hierarchy in which the cost of ...
Cache Miss Equations: A Compiler Framework for Analyzing and Tuning Memory Behavior
 ACM Transactions on Programming Languages and Systems
, 1999
"... This article describes methods for generating and solving Cache Miss Equations (CMEs) that give a detailed representation of cache behavior, including conflict misses, in looporiented scientific code. Implemented within the SUIF compiler framework, our approach extends traditional compiler reuse an ..."
Abstract

Cited by 134 (1 self)
 Add to MetaCart
This article describes methods for generating and solving Cache Miss Equations (CMEs) that give a detailed representation of cache behavior, including conflict misses, in looporiented scientific code. Implemented within the SUIF compiler framework, our approach extends traditional compiler reuse analysis to generate linear Diophantine equations that summarize each loop's memory behavior. While solving these equations is in general di# cult, we show that is also unnecessary, as mathematical techniques for manipulating Diophantine equations allow us to relatively easily compute and/or reduce the number of possible solutions, where each solution corresponds to a potential cache miss. The mathematical precision of CMEs allows us to find true optimal solutions for transformations such as blocking or padding. The generality of CMEs also allows us to reason about interactions between transformations applied in concert. The article also gives examples of their use to determine array padding and o#set amounts that minimize cache misses, and to determine optimal blocking factors for tiled code. Overall, these equations represent an analysis framework that o#ers the generality and precision needed for detailed compiler optimizations
Customized Dynamic Load Balancing for a Network of Workstations
, 1997
"... this paper we show that different load balancing schemes are best for different applications under varying program and system parameters. Therefore, applicationdriven customized dynamic load balancing becomes essential for good performance. We present a hybrid compiletime and runtime modeling and ..."
Abstract

Cited by 70 (0 self)
 Add to MetaCart
this paper we show that different load balancing schemes are best for different applications under varying program and system parameters. Therefore, applicationdriven customized dynamic load balancing becomes essential for good performance. We present a hybrid compiletime and runtime modeling and decision process which selects (customizes) the best scheme, along with automatic generation of parallel code with calls to a runtime library for load balancing. 1997 Academic Press 1.
Synthesizing transformations for locality enhancement of imperfectlynested loop nests
 In Proceedings of the 2000 ACM International Conference on Supercomputing
, 2000
"... We present an approach for synthesizing transformations to enhance locality in imperfectlynested loops. The key idea is to embed the iteration space of every statement in a loop nest into a special iteration space called the product space. The product space can be viewed as a perfectlynested loop ..."
Abstract

Cited by 56 (3 self)
 Add to MetaCart
We present an approach for synthesizing transformations to enhance locality in imperfectlynested loops. The key idea is to embed the iteration space of every statement in a loop nest into a special iteration space called the product space. The product space can be viewed as a perfectlynested loop nest, so embedding generalizes techniques like code sinking and loop fusion that are used in ad hoc ways in current compilers to produce perfectlynested loops from imperfectlynested ones. In contrast to these ad hoc techniques however, our embeddings are chosen carefully to enhance locality. The product space is then transformed further to enhance locality, after which fully permutable loops are tiled, and code is generated. We evaluate the effectiveness of this approach for dense numerical linear algebra benchmarks, relaxation codes, and the tomcatv code from the SPEC benchmarks. 1. BACKGROUND AND PREVIOUSWORK Sophisticated algorithms based on polyhedral algebra have been developed for determining good sequences of linear loop transformations (permutation, skewing, reversal and scaling) for enhancing locality in perfectlynested loops 1. Highlights of this technology are the following. The iterations of the loop nest are modeled as points in an integer lattice, and linear loop transformations are modeled as nonsingular matrices mapping one lattice to another. A sequence of loop transformations is modeled by the product of matrices representing the individual transformations; since the set of nonsingular matrices is closed under matrix product, this means that a sequence of linear loop transformations can be represented by a nonsingular matrix. The problem of finding an optimal sequence of linear loop transformations is thus reduced to the problem of finding an integer matrix that satisfies some desired property, permitting the full machinery of matrix methods and lattice theory to ¢ This work was supported by NSF grants CCR9720211, EIA9726388, ACI9870687,EIA9972853. £ A perfectlynested loop is a set of loops in which all assignment statements are contained in the innermost loop.
A memory model for scientific algorithms on graphics processors
 in Proc. of the ACM/IEEE Conference on Supercomputing (SC’06
, 2006
"... We present a memory model to analyze and improve the performance of scientific algorithms on graphics processing units (GPUs). Our memory model is based on texturing hardware, which uses a 2D blockbased array representation to perform the underlying computations. We incorporate many characteristics ..."
Abstract

Cited by 50 (3 self)
 Add to MetaCart
We present a memory model to analyze and improve the performance of scientific algorithms on graphics processing units (GPUs). Our memory model is based on texturing hardware, which uses a 2D blockbased array representation to perform the underlying computations. We incorporate many characteristics of GPU architectures including smaller cache sizes, 2D block representations, and use the 3C’s model to analyze the cache misses. Moreover, we present techniques to improve the performance of nested loops on GPUs. In order to demonstrate the effectiveness of our model, we highlight its performance on three memoryintensive scientific applications – sorting, fast Fourier transform and dense matrixmultiplication. In practice, our cacheefficient algorithms for these applications are able to achieve memory throughput of 30–50 GB/s on a NVIDIA 7900 GTX GPU. We also compare our results with prior GPUbased and CPUbased implementations on highend processors. In practice, we are able to achieve 2–5× performance improvement.
Is Search Really Necessary to Generate HighPerformance BLAS?
, 2005
"... Abstract — A key step in program optimization is the estimation of optimal values for parameters such as tile sizes and loop unrolling factors. Traditional compilers use simple analytical models to compute these values. In contrast, library generators like ATLAS use global search over the space of p ..."
Abstract

Cited by 42 (8 self)
 Add to MetaCart
Abstract — A key step in program optimization is the estimation of optimal values for parameters such as tile sizes and loop unrolling factors. Traditional compilers use simple analytical models to compute these values. In contrast, library generators like ATLAS use global search over the space of parameter values by generating programs with many different combinations of parameter values, and running them on the actual hardware to determine which values give the best performance. It is widely believed that traditional modeldriven optimization cannot compete with searchbased empirical optimization because tractable analytical models cannot capture all the complexities of modern highperformance architectures, but few quantitative comparisons have been done to date. To make such a comparison, we replaced the global search engine in ATLAS with a modeldriven optimization engine, and measured the relative performance of the code produced by the two systems on a variety of architectures. Since both systems use the same code generator, any differences in the performance of the code produced by the two systems can come only from differences in optimization parameter values. Our experiments show that modeldriven optimization can be surprisingly effective, and can generate code with performance comparable to that of code generated by ATLAS using global search. Index Terms — program optimization, empirical optimization, modeldriven optimization, compilers, library generators, BLAS, highperformance computing
Quantifying Loop Nest Locality Using SPEC'95 and the Perfect Benchmarks
, 1999
"... This paper analyzes and quantifies the locality characteristics of numerical loop nests in order to suggest future directions for architecture and software cache optimizations. Since most programs spend the majority of their time in nests, the vast majority of cache optimization techniques target lo ..."
Abstract

Cited by 38 (6 self)
 Add to MetaCart
This paper analyzes and quantifies the locality characteristics of numerical loop nests in order to suggest future directions for architecture and software cache optimizations. Since most programs spend the majority of their time in nests, the vast majority of cache optimization techniques target loop nests. In contrast, the locality characteristics that drive these optimizations are usually collected across the entire application rather than the nest level. Researchers have studied numerical codes for so long that a number of commonly held assertions have emerged on their locality characteristics. In light of these assertions, we use the SPEC'95 and Perfect Benchmarks to take a new look at measuring locality on numerical codes based on references, loop nests, and program locality properties. Our results show that several popular assertions are at best overstatements. For example, although most reuse is within a loop nest, in line with popular assertions, most misses are internest cap...