Results 1  10
of
79
Design and Evaluation of a Compiler Algorithm for Prefetching
 in Proceedings of the Fifth International Conference on Architectural Support for Programming Languages and Operating Systems
, 1992
"... Softwarecontrolled data prefetching is a promising technique for improving the performance of the memory subsystem to match today's highperformance processors. While prefetching is useful in hiding the latency, issuing prefetches incurs an instruction overhead and can increase the load on the memo ..."
Abstract

Cited by 467 (22 self)
 Add to MetaCart
Softwarecontrolled data prefetching is a promising technique for improving the performance of the memory subsystem to match today's highperformance processors. While prefetching is useful in hiding the latency, issuing prefetches incurs an instruction overhead and can increase the load on the memory subsystem. As a result, care must be taken to ensure that such overheads do not exceed the benefits.
Improving Data Locality with Loop Transformations
 ACM TRANSACTIONS ON PROGRAMMING LANGUAGES AND SYSTEMS
, 1996
"... ..."
Compiler Optimizations for Improving Data Locality
 IN PROCEEDINGS OF THE SIXTH INTERNATIONAL CONFERENCE ON ARCHITECTURAL SUPPORT FOR PROGRAMMING LANGUAGES AND OPERATING SYSTEMS
, 1994
"... In the past decade, processor speed has become significantly faster than memory speed. Small, fast cache memories are designed to overcome this discrepancy, but they are only effective when programs exhibit data locality. In this paper, we present compiler optimizations to improve data locality base ..."
Abstract

Cited by 187 (17 self)
 Add to MetaCart
In the past decade, processor speed has become significantly faster than memory speed. Small, fast cache memories are designed to overcome this discrepancy, but they are only effective when programs exhibit data locality. In this paper, we present compiler optimizations to improve data locality basedon a simple yet accurate cost model. The model computes both temporal and spatial reuse of cache lines to find desirable loop organizations. The cost model drives the application of compound transformations consisting of loop permutation, loop fusion, loop distribution, and loop reversal. We demonstrate that these program transformations are useful for optimizing many programs. To validate our optimization strategy, we implemented our algorithms and ran experiments on a large collection of scientific programs and kernels. Experiments with kernels illustrate that our model and algorithm can select and achieve the best performance. For over thirty complete applications, we executed the origi...
Data Transformations for Eliminating Conflict Misses
 In Proceedings of the SIGPLAN '98 Conference on Programming Language Design and Implementation
, 1998
"... Many cache misses in scientific programs are due to conflicts caused by limited set associativity. We examine two compiletime datalayout transformations for eliminating conflict misses, concentrating on misses occuring on every loop iteration. Intervariable padding adjusts variable base addresses ..."
Abstract

Cited by 125 (12 self)
 Add to MetaCart
Many cache misses in scientific programs are due to conflicts caused by limited set associativity. We examine two compiletime datalayout transformations for eliminating conflict misses, concentrating on misses occuring on every loop iteration. Intervariable padding adjusts variable base addresses, while intravariable padding modifies array dimension sizes. Two levels of precision are evaluated. PadLite only uses array and column dimension sizes, relying on assumptions about common array reference patterns. Pad analyzes programs, detecting conflict misses by linearizing array references and calculating conflict distances between uniformlygenerated references. The Euclidean algorithm for computing the gcd of two numbers is used to predict conflicts between different array columns for linear algebra codes. Experiments on a range of programs indicate PadLite can eliminate conflicts for benchmarks, but Pad is more effective over a range of cache and problem sizes. Padding reduces c...
Symbolic Analysis for Parallelizing Compilers
, 1994
"... Symbolic Domain The objects in our abstract symbolic domain are canonical symbolic expressions. A canonical symbolic expression is a lexicographically ordered sequence of symbolic terms. Each symbolic term is in turn a pair of an integer coefficient and a sequence of pairs of pointers to program va ..."
Abstract

Cited by 105 (4 self)
 Add to MetaCart
Symbolic Domain The objects in our abstract symbolic domain are canonical symbolic expressions. A canonical symbolic expression is a lexicographically ordered sequence of symbolic terms. Each symbolic term is in turn a pair of an integer coefficient and a sequence of pairs of pointers to program variables in the program symbol table and their exponents. The latter sequence is also lexicographically ordered. For example, the abstract value of the symbolic expression 2ij+3jk in an environment that i is bound to (1; (( " i ; 1))), j is bound to (1; (( " j ; 1))), and k is bound to (1; (( " k ; 1))) is ((2; (( " i ; 1); ( " j ; 1))); (3; (( " j ; 1); ( " k ; 1)))). In our framework, environment is the abstract analogous of state concept; an environment is a function from program variables to abstract symbolic values. Each environment e associates a canonical symbolic value e x for each variable x 2 V ; it is said that x is bound to e x. An environment might be represented by...
Counting Solutions to Linear and Nonlinear Constraints through Ehrhart Polynomials: Applications to Analyze and Transform Scientific Programs
, 1996
"... In order to produce efficient parallel programs, optimizing compilers need to include an analysis of the initial sequential code. When analyzing loops with affine loop bounds, many computations are relevant to the same general problem: counting the number of integer solutions of selected free variab ..."
Abstract

Cited by 96 (0 self)
 Add to MetaCart
In order to produce efficient parallel programs, optimizing compilers need to include an analysis of the initial sequential code. When analyzing loops with affine loop bounds, many computations are relevant to the same general problem: counting the number of integer solutions of selected free variables in a set of linear and/or nonlinear parameterized constraints. For example, computing the number of flops executed by a loop, of memory locations touched by a loop, of cache lines touched by a loop, or of array elements that need to be transmitted from a processor to another during the execution of a loop, is useful to determine if a loop is load balanced, evaluate message traffic and allocate message buffers. The objective of the presented method is to evaluate symbolically, in terms of symbolic constants (the size parameters) , this number of integer solutions. By modeling the considered counting problem as a union of rational convex polytopes, the number of included integer points is ...
Optimizing for Parallelism and Data Locality
 In Proceedings of the 1992 ACM International Conference on Supercomputing
, 1992
"... Previous research has used program transformation to introduce parallelism and to exploit data locality. Unfortunately, these two objectives have usually been considered independently. This work explores the tradeoffs between effectively utilizing parallelism and memory hierarchy on sharedmemory mu ..."
Abstract

Cited by 94 (14 self)
 Add to MetaCart
Previous research has used program transformation to introduce parallelism and to exploit data locality. Unfortunately, these two objectives have usually been considered independently. This work explores the tradeoffs between effectively utilizing parallelism and memory hierarchy on sharedmemory multiprocessors. We present a simple, but surprisingly accurate, memory model to determine cache line reuse from both multiple accesses to the same memory location and from consecutive memory access. The model is used in memory optimizing and loop parallelization algorithms that effectively exploit data locality and parallelism in concert. We demonstrate the efficacy of this approach with very encouraging experimental results. 1 Introduction Transformations to exploit parallelism and to improve data locality are two of the most valuable compiler techniques in use today. Independently, each of these optimizations has been shown to result in dramatic improvements. This paper seeks to combine t...
Improving Memory Hierarchy Performance for Irregular Applications Using Data and Computation Reorderings
 International Journal of Parallel Programming
, 2001
"... The performance of irregular applications on modern computer systems is hurt by the wide gap between CPU and memory speeds because these applications typically underutilize multilevel memory hierarchies, which help hide this gap. This paper investigates using data and computation reorderings to i ..."
Abstract

Cited by 89 (2 self)
 Add to MetaCart
The performance of irregular applications on modern computer systems is hurt by the wide gap between CPU and memory speeds because these applications typically underutilize multilevel memory hierarchies, which help hide this gap. This paper investigates using data and computation reorderings to improve memory hierarchy utilization for irregular applications. We evaluate the impact of reordering on data reuse at different levels in the memory hierarchy. We focus on coordinated data and computation reordering based on spacefilling curves and we introduce a new architectureindependent multilevel blocking strategy for irregular applications. For two particle codes we studied, the most effective reorderings reduced overall execution time by a factor of two and four, respectively. Preliminary experience with a scatter benchmark derived from a large unstructured mesh application showed that careful data and computation ordering reduced primary cache misses by a factor of two compared to a random ordering.
Counting Solutions to Presburger Formulas: How and Why
, 1994
"... We describe methods that are able to count the number of integer solutions to selected free variables of a Presburger formula, or sum a polynomial over all integer solutions of selected free variables of a Presburger formula. This answer is given symbolically, in terms of symbolic constants (the rem ..."
Abstract

Cited by 83 (2 self)
 Add to MetaCart
We describe methods that are able to count the number of integer solutions to selected free variables of a Presburger formula, or sum a polynomial over all integer solutions of selected free variables of a Presburger formula. This answer is given symbolically, in terms of symbolic constants (the remaining free variables in the Presburger formula). For example...
Parametric Analysis of Polyhedral Iteration Spaces
 JOURNAL OF VLSI SIGNAL PROCESSING
, 1998
"... In the area of automatic parallelization of programs, analyzing and transforming loop nests with parametric affine loop bounds requires fundamental mathematical results. The most common geometrical model of iteration spaces, called the polytope model, is based on mathematics dealing with convex and ..."
Abstract

Cited by 68 (13 self)
 Add to MetaCart
In the area of automatic parallelization of programs, analyzing and transforming loop nests with parametric affine loop bounds requires fundamental mathematical results. The most common geometrical model of iteration spaces, called the polytope model, is based on mathematics dealing with convex and discrete geometry, linear programming, combinatorics and geometry of numbers. In this paper, we present automatic methods for computing the parametric vertices and the Ehrhart polynomial, i.e. a parametric expression of the number of integer points, of a polytope defined by a set of parametric linear constraints. These methods have many applications in analysis and transformations of nested loop programs. The paper is illustrated with exact symbolic array dataflow analysis, estimation of execution time, and with the computation of the maximum available parallelism of given loop nests.