Results 1-10 of 128
Scanning Polyhedra with DO Loops
, 1991
Cited by 195 (5 self)
Supercompilers perform complex program transformations which often result in new loop bounds. This paper shows that, under the usual assumptions in automatic parallelization, most transformations on loop nests can be expressed as affine transformations on integer sets defined by polyhedra and that the new loop bounds can be computed with algorithms using Fourier's pairwise elimination method, although it is not exact for integer sets. Sufficient conditions to use pairwise elimination on integer sets and to extend it to pseudolinear constraints are also given. A tradeoff has to be made between dynamic overhead due to some bound slackness and compilation complexity, but the resulting code is always correct. These algorithms can be used to interchange or block loops regardless of the loop bounds or the blocking strategy and to safely exchange array parts between two levels of a memory hierarchy or between neighboring processors in a distributed memory machine.
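To illustrate the pairwise elimination this abstract refers to, the following Python sketch projects one variable out of a small system of linear inequalities and uses the result to interchange a triangular loop nest. It is a minimal sketch over rational constraints, not the paper's algorithm: the function name `eliminate`, the constraint encoding, and the fixed upper bound of 4 are assumptions made for the example, and the sketch does not address the integer-exactness conditions the abstract mentions.

```python
def eliminate(constraints, k):
    """Project variable k out of a system of inequalities.

    Each constraint is (coeffs, bound) and means
    sum(coeffs[m] * x[m] for all m) <= bound.
    """
    pos, neg, rest = [], [], []
    for coeffs, bound in constraints:
        if coeffs[k] > 0:
            pos.append((coeffs, bound))   # gives an upper bound on x[k]
        elif coeffs[k] < 0:
            neg.append((coeffs, bound))   # gives a lower bound on x[k]
        else:
            rest.append((coeffs, bound))  # does not involve x[k]

    result = list(rest)
    for pc, pb in pos:
        for nc, nb in neg:
            # Scale both constraints by positive factors so x[k] cancels.
            s, t = -nc[k], pc[k]
            combined = tuple(s * a + t * b for a, b in zip(pc, nc))
            result.append((combined, s * pb + t * nb))
    return result


# Triangular nest over x = (i, j): 1 <= i <= 4 and i <= j <= 4.
# The bound 4 is an illustrative assumption.
nest = [
    ((-1, 0), -1),  # -i <= -1
    ((1, 0), 4),    #  i <= 4
    ((1, -1), 0),   #  i - j <= 0
    ((0, 1), 4),    #  j <= 4
]

# Eliminating i leaves the bounds of j for the new outer loop: 1 <= j <= 4.
j_only = eliminate(nest, 0)

# The original constraints on i give the new inner loop: 1 <= i <= min(4, j).
original = [(i, j) for i in range(1, 5) for j in range(i, 5)]
interchanged = [(i, j) for j in range(1, 5) for i in range(1, min(4, j) + 1)]
```

Both loop nests visit the same set of iterations, only in a different order. The elimination may also emit trivially true constraints (here one with all-zero coefficients), which a real implementation would prune.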
Data-centric Multi-level Blocking
, 1997
Cited by 139 (9 self)
We present a simple and novel framework for generating blocked codes for high-performance machines with a memory hierarchy. Unlike traditional compiler techniques like tiling, which are based on reasoning about the control flow of programs, our techniques are based on reasoning directly about the flow of data through the memory hierarchy. Our data-centric transformations permit a more direct solution to the problem of enhancing data locality than current control-centric techniques do, and generalize easily to multiple levels of memory hierarchy. We buttress these claims with performance numbers for standard benchmarks from the problem domain of dense numerical linear algebra. The simplicity and intuitive appeal of our approach should make it attractive to compiler writers as well as to library writers. 1 Introduction Data reuse is imperative for good performance on modern high-performance computers because the memory architecture of these machines is a hierarchy in which the cost of ...
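For readers unfamiliar with what "blocked code" looks like, the sketch below shows a conventionally tiled matrix multiplication in Python next to the unblocked version. It illustrates the kind of output such frameworks aim for, not the data-centric derivation the abstract describes; the function names and the block-size parameter `bs` are assumptions made for the example.

```python
def matmul_naive(A, B, n):
    """Plain triple loop: C = A * B for n x n lists of lists."""
    C = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                C[i][j] += A[i][k] * B[k][j]
    return C


def matmul_blocked(A, B, n, bs):
    """Same computation, visiting the iteration space in bs x bs x bs blocks.

    Each block touches only small sub-matrices of A, B and C, which is what
    lets the working set fit in a faster level of the memory hierarchy.
    """
    C = [[0] * n for _ in range(n)]
    for ii in range(0, n, bs):
        for jj in range(0, n, bs):
            for kk in range(0, n, bs):
                # min(...) handles a partial block when bs does not divide n.
                for i in range(ii, min(ii + bs, n)):
                    for j in range(jj, min(jj + bs, n)):
                        acc = C[i][j]
                        for k in range(kk, min(kk + bs, n)):
                            acc += A[i][k] * B[k][j]
                        C[i][j] = acc
    return C
```

Blocking for several levels of the hierarchy would nest this pattern again with a smaller block size inside each block.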
Maximizing Parallelism and Minimizing Synchronization with Affine Transforms
 Parallel Computing
, 1997
Cited by 129 (7 self)
This paper presents the first algorithm to find the optimal affine transform that maximizes the degree of parallelism while minimizing the degree of synchronization in a program with arbitrary loop nestings and affine data accesses. The problem is formulated without the use of imprecise data dependence abstractions such as data dependence vectors. The algorithm presented subsumes previously proposed program transformation algorithms that are based on unimodular transformations, loop fusion, fission, scaling, reindexing and/or statement reordering. 1 Introduction As multiprocessors become popular, it is important to develop compilers that can automatically translate sequential programs into efficient parallel code. Getting high performance on a multiprocessor requires not only finding parallelism in the program but also minimizing the synchronization overhead. Synchronization is expensive on a multiprocessor. The cost of synchronization goes far beyond just the operations that manipul...
Automatic Program Parallelization
, 1993
Cited by 105 (8 self)
This paper presents an overview of automatic program parallelization techniques. It covers dependence analysis techniques, followed by a discussion of program transformations, including straight-line code parallelization, do loop transformations, and parallelization of recursive routines. The last section of the paper surveys several experimental studies on the effectiveness of parallelizing compilers.
Tiling Multidimensional Iteration Spaces for Multicomputers
, 1992
Cited by 103 (20 self)
This paper addresses the problem of compiling perfectly nested loops for multicomputers (distributed memory machines). The relatively high communication startup costs in these machines render frequent communication very expensive. Motivated by this, we present a method of aggregating a number of loop iterations into tiles where the tiles execute atomically: a processor executing the iterations belonging to a tile receives all the data it needs before executing any one of the iterations in the tile, executes all the iterations in the tile and then sends the data needed by other processors. Since synchronization is not allowed during the execution of a tile, partitioning the iteration space into tiles must not result in deadlock. We first show the equivalence between the problem of finding partitions and the problem of determining the cone for a given set of dependence vectors. We then present an approach to partitioning the iteration space into deadlock-free tiles so that communicati...
Automatic Data Partitioning on Distributed Memory Multiprocessors
, 1991
Cited by 102 (6 self)
An important problem facing numerous research projects on parallelizing compilers for distributed memory machines is that of automatically determining a suitable data partitioning scheme for a program. Most of the current projects leave this tedious problem almost entirely to the user. In this paper, we present a novel approach to the problem of automatic data partitioning. We introduce the notion of constraints on data distribution, and show how, based on performance considerations, a compiler identifies constraints to be imposed on the distribution of various data structures. These constraints are then combined by the compiler to obtain a complete and consistent picture of the data distribution scheme, one that offers good performance in terms of the overall execution time.
Optimizing for Parallelism and Data Locality
 In Proceedings of the 1992 ACM International Conference on Supercomputing
, 1992
Cited by 94 (14 self)
Previous research has used program transformation to introduce parallelism and to exploit data locality. Unfortunately, these two objectives have usually been considered independently. This work explores the tradeoffs between effectively utilizing parallelism and memory hierarchy on shared-memory multiprocessors. We present a simple, but surprisingly accurate, memory model to determine cache line reuse from both multiple accesses to the same memory location and from consecutive memory access. The model is used in memory optimizing and loop parallelization algorithms that effectively exploit data locality and parallelism in concert. We demonstrate the efficacy of this approach with very encouraging experimental results. 1 Introduction Transformations to exploit parallelism and to improve data locality are two of the most valuable compiler techniques in use today. Independently, each of these optimizations has been shown to result in dramatic improvements. This paper seeks to combine t...
Beyond Induction Variables
, 1992
Cited by 90 (6 self)
Induction variable detection is usually closely tied to the strength reduction optimization. This paper studies induction variable analysis from a different perspective, that of finding induction variables for data dependence analysis. While classical induction variable analysis techniques have been used successfully up to now, we have found a simple algorithm based on the Static Single Assignment form of a program that finds all linear induction variables in a loop. Moreover, this algorithm is easily extended to find induction variables in multiple nested loops, to find nonlinear induction variables, and to classify other integer scalar assignments in loops, such as monotonic, periodic and wrap-around variables. Some of these other variables are now classified using ad hoc pattern recognition, while others are not analyzed by current compilers. Giving a unified approach improves the speed of compilers and allows a more general classification scheme. We also show how to use these va...
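The variable classes named in this abstract can be made concrete with a small Python loop. The sketch below does not implement the SSA-based detection algorithm; it only shows one variable of each of three classes (linear, wrap-around, periodic) together with the closed form a dependence analyzer would want to derive. The function names and all constants are assumptions made for the example.

```python
def run_loop(n):
    """Record (h, j, w, p) at the top of each of n iterations."""
    j = 5        # linear induction variable: increases by 3 per iteration
    w = 100      # wrap-around: unrelated first value, then trails h by one
    p, q = 1, 2  # periodic: p and q swap every iteration
    rows = []
    for h in range(n):
        rows.append((h, j, w, p))
        j = j + 3
        w = h
        p, q = q, p
    return rows


def closed_form(h):
    """Values of (j, w, p) at the top of iteration h, without running the loop."""
    j = 5 + 3 * h
    w = 100 if h == 0 else h - 1
    p = 1 if h % 2 == 0 else 2
    return (j, w, p)
```

Once such closed forms are known, a subscript like `a[j]` can be treated as the affine expression `a[5 + 3*h]` during dependence testing.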
Code Generation for Multiple Mappings
 In Frontiers '95: The 5th Symposium on the Frontiers of Massively Parallel Computation
, 1994
Cited by 75 (2 self)
There has been a great amount of recent work toward unifying iteration reordering transformations. Many of these approaches represent transformations as affine mappings from the original iteration space to a new iteration space. These approaches show a great deal of promise, but they all rely on the ability to generate code that iterates over the points in these new iteration spaces in the appropriate order. This problem has been fairly well-studied in the case where all statements use the same mapping. We have developed an algorithm for the less well-studied case where each statement uses a potentially different mapping. Unlike many other approaches, our algorithm can also generate code from mappings corresponding to loop blocking. We address the important tradeoff between reducing control overhead and duplicating code.
A Framework for Unifying Reordering Transformations
, 1993
Cited by 72 (10 self)
We present a framework for unifying iteration reordering transformations such as loop interchange, loop distribution, skewing, tiling, index set splitting and statement reordering. The framework is based on the idea that a transformation can be represented as a schedule that maps the original iteration space to a new iteration space. The framework is designed to provide a uniform way to represent and reason about transformations. As part of the framework, we provide algorithms to assist in the building and use of schedules. In particular, we provide algorithms to test the legality of schedules, to align schedules and to generate optimized code for schedules. This work is supported by an NSF PYI grant CCR-9157384 and by a Packard Fellowship. 1 Introduction Optimizing compilers reorder iterations of statements to improve instruction scheduling, register use, and cache utilization, and to expose parallelism. Many different reordering transformations have been developed and studied, su...
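The idea of a transformation as a mapping between iteration spaces, with a legality test, can be sketched in Python for the simplest case. The sketch assumes a single statement, a linear mapping given as a matrix, and dependences summarized as constant distance vectors; the framework in the abstract is more general than this, and the names `apply_map`, `lex_positive`, `is_legal`, `INTERCHANGE` and `SKEW` are hypothetical.

```python
def apply_map(T, v):
    """Apply the linear mapping T (a tuple of rows) to the vector v."""
    return tuple(sum(t * x for t, x in zip(row, v)) for row in T)


def lex_positive(v):
    """True if the first nonzero component of v is positive."""
    for x in v:
        if x != 0:
            return x > 0
    return False


def is_legal(T, distances):
    """A mapping is legal here if every dependence still points forward in
    the new execution order, i.e. every mapped distance is lexicographically
    positive."""
    return all(lex_positive(apply_map(T, d)) for d in distances)


INTERCHANGE = ((0, 1), (1, 0))  # (i, j) -> (j, i)
SKEW = ((1, 0), (1, 1))         # (i, j) -> (i, i + j)
```

With a dependence distance of `(1, -1)`, `is_legal(INTERCHANGE, [(1, -1)])` is false because the mapped distance `(-1, 1)` would run the sink before the source, while `is_legal(SKEW, [(1, -1)])` is true because the mapped distance is `(1, 0)`.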