Results 1–10 of 58
Maximizing Parallelism and Minimizing Synchronization with Affine Transforms
 Parallel Computing
, 1997
Abstract

Cited by 148 (6 self)
This paper presents the first algorithm to find the optimal affine transform that maximizes the degree of parallelism while minimizing the degree of synchronization in a program with arbitrary loop nestings and affine data accesses. The problem is formulated without the use of imprecise data dependence abstractions such as data dependence vectors. The algorithm presented subsumes previously proposed program transformation algorithms that are based on unimodular transformations, loop fusion, fission, scaling, reindexing and/or statement reordering.

1 Introduction

As multiprocessors become popular, it is important to develop compilers that can automatically translate sequential programs into efficient parallel code. Getting high performance on a multiprocessor requires not only finding parallelism in the program but also minimizing the synchronization overhead. Synchronization is expensive on a multiprocessor. The cost of synchronization goes far beyond just the operations that manipul...
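As a toy illustration of what such an affine partition buys (a hypothetical two-deep loop nest, not an example drawn from the paper): if the only dependence is (i, j) -> (i+1, j), the affine mapping p = j places every dependence chain on a single processor, so the partitions can run with no synchronization at all.

```python
# Hypothetical loop nest: for i, for j: A[i+1][j] = f(A[i][j]).
# The only flow dependence is (i, j) -> (i+1, j).

def dependent(src, dst):
    """Dependence relation of the assumed loop body."""
    return dst == (src[0] + 1, src[1])

def partition(i, j):
    """Affine partition [0 1] . (i, j)^T: processor index is j."""
    return j

N = 4
iters = [(i, j) for i in range(N) for j in range(N)]

# Every dependence stays inside one partition, so the partitions are
# fully parallel and need zero synchronization between them.
ok = all(partition(*s) == partition(*d)
         for s in iters for d in iters if dependent(s, d))
print(ok)  # True
```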
Generation of Efficient Nested Loops from Polyhedra
 International Journal of Parallel Programming
, 2000
Abstract

Cited by 89 (5 self)
Automatic parallelization in the polyhedral model is based on affine transformations from an original computation domain (iteration space) to a target space-time domain, often with a different transformation for each variable. Code generation is an often ignored step in this process that has a significant impact on the quality of the final code. It involves making a tradeoff between code size and control code simplification/optimization. Previous methods of doing code generation are based on loop splitting; however, they behave non-optimally on parameterized programs. We present a general parameterized method for code generation based on the dual representation of polyhedra. Our algorithm uses a simple recursion on the dimensions of the domains, and enables fine control over the tradeoff between code size and control overhead.
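To make the code-generation step concrete (a minimal sketch with an assumed triangular domain, not the paper's dual-representation algorithm): generating loops that scan the polyhedron {(i, j) : 0 <= j <= i <= N} amounts to emitting one loop per dimension, with bounds obtained by projecting the constraints onto the outer dimensions.

```python
# Assumed polyhedron: {(i, j) : 0 <= j <= i <= N}.
N = 3
points = []
for i in range(0, N + 1):        # outer bounds: 0 <= i <= N
    for j in range(0, i + 1):    # inner bounds after projection: 0 <= j <= i
        points.append((i, j))

# The generated loop nest enumerates exactly the integer points of the
# polyhedron, in lexicographic order.
reference = {(i, j) for i in range(N + 1) for j in range(N + 1) if j <= i}
print(points == sorted(reference))  # True
```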
SUIF Explorer: an interactive and interprocedural parallelizer
, 1999
Abstract

Cited by 76 (5 self)
The SUIF Explorer is an interactive parallelization tool that is more effective than previous systems in minimizing the number of lines of code that require programmer assistance. First, the interprocedural analyses in the SUIF system are successful in parallelizing many coarse-grain loops, thus minimizing the number of spurious dependences requiring attention. Second, the system uses dynamic execution analyzers to identify those important loops that are likely to be parallelizable. Third, the SUIF Explorer is the first to apply program slicing to aid programmers in interactive parallelization. The system guides the programmer in the parallelization process using a set of sophisticated visualization techniques. This paper demonstrates the effectiveness of the SUIF Explorer with three case studies. The programmer was able to speed up all three programs by examining only a small fraction of the program and privatizing a few variables.

1. Introduction

Exploiting coarse-grain parallelism i...
Automatic Parallelization in the Polytope Model
 Laboratoire PRiSM, Université de Versailles Saint-Quentin-en-Yvelines, 45 avenue des États-Unis, F-78035 Versailles Cedex
, 1996
Abstract

Cited by 58 (3 self)
The aim of this paper is to explain the importance of polytopes and polyhedra in automatic parallelization. We show that the semantics of parallel programs is best described geometrically, as properties of sets of integral points in n-dimensional spaces, where n is related to the maximum nesting depth of DO loops. The needed properties translate nicely to properties of polyhedra, for which many algorithms have been designed for the needs of optimization and operations research. We show how these ideas apply to scheduling, placement and parallel code generation.

Résumé: The aim of this article is to explain the role played by polyhedra and polytopes in automatic parallelization. We show that the semantics of a parallel program is naturally described in geometric form, the properties of the program being formalized as properties of sets of points in an n-dimensional space; n is related to the maximum nesting depth of DO loops. ...
An Affine Partitioning Algorithm to Maximize Parallelism and Minimize Communication
 In Proceedings of the 13th ACM SIGARCH International Conference on Supercomputing
, 1999
Abstract

Cited by 56 (3 self)
An affine partitioning framework unifies many useful program transforms such as unimodular transformations (interchange, reversal, skewing), loop fusion, fission, scaling, reindexing, and statement reordering. This paper presents an algorithm, based on this unified framework, that maximizes parallelism while minimizing communication in programs with arbitrary loop nestings and affine data accesses. Our algorithm can find the optimal affine partition that maximizes the degree of parallelism with the minimum degree of synchronization. In addition, it uses a greedy algorithm to minimize communication between loops heuristically by aligning the computation partitions for different loops, trading off excess degrees of parallelism, and choosing pipelined parallelism over doall parallelism if it can significantly reduce the communication. The algorithm is optimal in maximizing the degrees of parallelism that require (1) no communication, (2) near-neighbor communication and a constant number ...
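A small illustration of the pipelined-parallelism trade-off (an assumed stencil-like nest, not the paper's algorithm): with dependence distances (1, 0) and (0, 1) no partition is communication-free, but the affine schedule t = i + j forms wavefronts whose iterations are mutually independent, leaving only near-neighbor communication between successive time steps.

```python
# Assumed dependence distance vectors of a stencil-like loop nest.
deps = [(1, 0), (0, 1)]

def schedule(i, j):
    """Affine wavefront schedule [1 1] . (i, j)^T."""
    return i + j

N = 4
# Legality check: every dependence must cross to a strictly later wavefront,
# so all iterations sharing a time step t = i + j can run in parallel.
valid = all(schedule(i + di, j + dj) > schedule(i, j)
            for (di, dj) in deps
            for i in range(N) for j in range(N))
print(valid)  # True
```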
Solving Alignment using Elementary Linear Algebra
 IN PROCEEDINGS OF THE 7TH ANNUAL WORKSHOP ON LANGUAGES AND COMPILERS FOR PARALLEL COMPUTING
, 1994
Abstract

Cited by 32 (5 self)
Data and computation alignment is an important part of compiling sequential programs to architectures with non-uniform memory access times. In this paper, we show that elementary matrix methods can be used to determine communication-free alignment of code and data. We also solve the problem of replicating read-only data to eliminate communication. Our matrix-based approach leads to algorithms which are simpler and faster than existing algorithms for the alignment problem.

1 Introduction

A key problem in generating code for non-uniform memory access (NUMA) parallel machines is data and computation placement, that is, determining what work each processor must do, and what data must reside in each local memory. The goal of placement is to exploit parallelism by spreading the work across the processors, and to exploit locality by spreading data so that memory accesses are local whenever possible. The problem of determining a good placement for a program is usually solved in two phases called alignment and distribution.
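As a toy instance of communication-free alignment (a hypothetical statement, not taken from the paper): suppose S(i, j) writes A[i][j] and reads A[i][j-1]. The two access functions differ only in the second index, so mapping both the computation and the array rows by the first index makes every access local.

```python
def comp_place(i, j):
    """Computation mapping: iteration (i, j) runs on processor i."""
    return i

def data_place(a, b):
    """Data mapping: array element A[a][b] lives on processor a."""
    return a

N = 5
# Hypothetical statement S(i, j): A[i][j] = g(A[i][j-1]).
# Both its write and its read must touch locally stored elements.
local = all(comp_place(i, j) == data_place(i, j)          # write A[i][j]
            and comp_place(i, j) == data_place(i, j - 1)  # read A[i][j-1]
            for i in range(N) for j in range(1, N))
print(local)  # True
```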
CommunicationFree Parallelization via Affine Transformations
 In 24th ACM Symp. on Principles of Programming Languages
, 1994
Abstract

Cited by 23 (2 self)
The paper describes a parallelization algorithm for programs consisting of arbitrary nestings of loops and sequences of loops. The code produced by our algorithm yields all the degrees of communication-free parallelism that can be obtained via loop fission, fusion, interchange, reversal, skewing, scaling, reindexing and statement reordering. The algorithm first assigns the iterations of instructions in the program to processors via affine processor mappings, then generates the correct code by ensuring that the code executed by each processor is a subsequence of the original sequential execution sequence.

1 Introduction

Previous research in vectorizing and parallelizing compilers has shown that parallelization can be improved by a host of high-level loop transformations. These loop transformations include loop fission (or loop distribution), loop fusion, loop interchange, loop reversal, loop skewing, loop scaling, loop reindexing (also known as loop alignment or index set shifting), ...
Minimizing Communication while Preserving Parallelism
 In Proceedings of the 10th ACM International Conference on Supercomputing
, 1995
Abstract

Cited by 22 (0 self)
To compile programs for message-passing architectures and to obtain good performance on NUMA architectures it is necessary to control how computations and data are mapped to processors. Languages such as High Performance Fortran use data distributions supplied by the programmer and the owner-computes rule to specify this. However, the best data and computation decomposition may differ from machine to machine and require substantial expertise to determine. Therefore, automated decomposition is desirable. All existing methods for automated data/computation decomposition share a common failing: they are very sensitive to the original loop structure of the program. While they find a good decomposition for that loop structure, it may be possible to apply transformations (such as loop interchange and distribution) so that a different decomposition gives even better results. We have developed automatic data/computation decomposition methods that are not sensitive to the original program struc...