Results 1-10 of 41
A practical automatic polyhedral parallelizer and locality optimizer
 In PLDI '08: Proceedings of the ACM SIGPLAN 2008 Conference on Programming Language Design and Implementation
, 2008
Cited by 110 (7 self)
We present the design and implementation of an automatic polyhedral source-to-source transformation framework that can optimize regular programs (sequences of possibly imperfectly nested loops) for parallelism and locality simultaneously. Through this work, we show the practicality of analytical model-driven automatic transformation in the polyhedral model. Unlike previous polyhedral frameworks, our approach is an end-to-end fully automatic one driven by an integer linear optimization framework that takes an explicit view of finding good ways of tiling for parallelism and locality using affine transformations. The framework has been implemented into a tool to automatically generate OpenMP parallel code from C program sections. Experimental results from the tool show very high performance for local and parallel execution on multicores, when compared with state-of-the-art compiler frameworks from the research community as well as the best native production compilers. The system also enables the easy use of powerful empirical/iterative optimization for general arbitrarily nested loop sequences.
On Tiling as a Loop Transformation
, 1997
Cited by 40 (10 self)
This paper is a follow-up to Irigoin and Triolet's earlier work and our recent work on tiling. In this paper, tiling is discussed in terms of its effects on the dependences between tiles, the dependences within a tile, and the required dependence test for legality. A necessary and sufficient condition is given for enforcing the data dependences of the program, while Irigoin and Triolet's atomic tile constraint is only sufficient. A condition is identified under which both Irigoin and Triolet's and our constraints are equivalent. The results of this paper are discussed in terms of their impact on dependence abstractions suitable for the legality test and on tiling to optimise a certain given goal. Keywords: Tiling, loop transformation, dependence analysis, code generation. 1. Introduction. Blocked algorithms are widely known to achieve high performance on parallel computers with a memory hierarchy [8, 9]. Tiling is a loop transformation that a parallelising compiler can use to automatically ...
Minimizing Completion Time for Loop Tiling with Computation and Communication Overlapping
 In Proceedings of the IEEE Int'l Parallel and Distributed Processing Symposium (IPDPS'01)
, 2001
Cited by 29 (14 self)
This paper proposes a new method for the problem of minimizing the execution time of nested for-loops using a tiling transformation. In our approach, we are interested not only in tile size and shape according to the required communication-to-computation ratio, but also in overall completion time. We select a time hyperplane that executes different tiles much more efficiently by exploiting the inherent overlap between the communication and computation phases of successive, atomic tile executions. We assign tiles to processors according to the tile space boundaries, thus taking the iteration space bounds into account. Our schedule considerably reduces overall completion time under the assumption that some part of every communication phase can be efficiently overlapped with atomic, pure tile computations. The overall schedule resembles a pipelined datapath where computations are no longer interleaved with sends and receives to non-local processors. Experimental results on a cluster of Pentiums using various MPI send primitives show that the total completion time is significantly reduced.
PLuTo: A practical and fully automatic polyhedral program optimization system
 In Proceedings of the ACM SIGPLAN 2008 Conference on Programming Language Design and Implementation (PLDI '08)
, 2008
Cited by 25 (7 self)
We present the design and implementation of a fully automatic polyhedral source-to-source transformation framework that can optimize regular programs (sequences of possibly imperfectly nested loops) for parallelism and locality simultaneously. Through this work, we show the practicality of analytical model-driven automatic transformation in the polyhedral model – far beyond what is possible by current production compilers. Unlike previous works, our approach is an end-to-end fully automatic one driven by an integer linear optimization framework that takes an explicit view of finding good ways of tiling for parallelism and locality using affine transformations. We also address generation of tiled code for multiple statement domains of arbitrary dimensionalities under (statement-wise) affine transformations – an issue that has not been addressed previously. Experimental results from the implemented system show very high speedups for local and parallel execution on multicores over state-of-the-art compiler frameworks from the research community as well as the best native compilers. The system also enables the easy use of powerful empirical/iterative optimization for general arbitrarily nested loop sequences.
Automatic Transformations for Communication-Minimized Parallelization and Locality Optimization in the Polyhedral Model
Cited by 24 (10 self)
Many compute-intensive applications spend a significant fraction of their time in nested loops. The polyhedral model provides powerful abstractions to optimize loop nests with regular accesses for parallel execution. Affine transformations in this model capture a complex sequence of execution-reordering loop transformations that can improve performance by parallelization as well as locality enhancement. Although a significant amount of research has addressed affine scheduling and partitioning, the problem of automatically finding good affine transforms for communication-optimized coarse-grained parallelization along with locality optimization for the general case of arbitrarily-nested loop sequences remains a challenging problem. In this paper, we propose an automatic transformation framework to optimize arbitrarily-nested loop sequences with affine dependences for parallelism and locality simultaneously. The approach finds good tiling hyperplanes by embedding a powerful and versatile cost function into an integer linear programming formulation. These tiling hyperplanes are used for communication-minimized coarse-grained parallelization as well as locality optimization. The approach enables the minimization of inter-tile communication volume in the processor space, and the minimization of reuse distances for local execution at each node. Programs requiring one-dimensional versus multi-dimensional time schedules (with scheduling-based approaches) are all handled with the same algorithm. Synchronization-free parallelism, permutable loops, or pipelined parallelism at various levels can be detected. Preliminary results from the implemented framework show promising performance and scalability with input size.
On Time Optimal Supernode Shape
 IEEE Transactions on Parallel and Distributed Systems
, 1999
Cited by 17 (0 self)
This paper discusses the selection of an optimal supernode shape for a supernode transformation (also known as tiling). We assume that the communication cost is dominated by the startup penalty and can therefore be approximated by a constant. We identify three parameters of a supernode transformation: supernode size, relative side lengths, and cutting hyperplane directions. For algorithms with perfectly nested loops and uniform dependencies, we give a closed-form expression for an optimal linear schedule vector, and a necessary and sufficient condition for optimal relative side lengths. We prove that the total running time is minimized by a cutting hyperplane direction matrix whose rows are taken from the surface of the polar cone of the cone spanned by the dependence vectors, also known as the tiling cone. The results are derived in continuous space and should for that reason be considered approximate.
An Efficient Code Generation Technique for Tiled Iteration Spaces
 IEEE Transactions on Parallel and Distributed Systems
, 2003
Cited by 13 (6 self)
This paper presents a novel approach to the problem of generating tiled code for nested for-loops transformed by a tiling transformation. Tiling or supernode transformation has been widely used to improve locality in multi-level memory hierarchies, as well as to efficiently execute loops on parallel architectures. However, automatic code generation for tiled loops can be a very complex compiler task, especially when non-rectangular tile shapes and iteration space bounds are concerned. Our method considerably enhances previous work on rewriting tiled loops by considering parallelepiped tiles and arbitrary iteration space shapes. In order to generate tiled code, we first enumerate all tiles containing points within the iteration space and second sweep all points within each tile. For the first subproblem, we refine upon previous results concerning the computation of new loop bounds of an iteration space that has been transformed by a non-unimodular transformation. For the second subproblem, we transform the initial parallelepiped tile into a rectangular one, in order to generate efficient code with the aid of a non-unimodular transformation matrix and its Hermite Normal Form (HNF). Experimental results show that the proposed method significantly accelerates the compilation process and generates much more efficient code.
Automatic Data and Computation Decomposition on Distributed Memory Parallel Computers
 ACM Trans. Programming Languages and Systems
, 2002
Cited by 12 (0 self)
On shared memory parallel computers (SMPCs), it is natural to focus on decomposing the computation (mainly by distributing the iterations of the nested do-loops). In contrast, on distributed memory parallel computers (DMPCs), the decomposition of computation and the distribution of data must both be handled in order to balance the computation load and to minimize the migration of data. We propose and validate experimentally a method for handling computations and data synergistically to optimize the overall execution time. The method relies on a number of novel techniques, also presented in this paper. The core idea is to rank the "importance" of data arrays in a program and designate some of them as dominant. The intuition is that the dominant arrays are the ones whose migration would be the most expensive. Using the correspondence between iteration space mapping vectors and the distributed dimensions of the dominant data array in each nested do-loop, we are able to design algorithms for determin...
Generating Efficient Tiled Code for Distributed Memory Machines
 Parallel Computing
, 2000
Cited by 11 (2 self)
Tiling can improve the performance of nested loops on distributed memory machines by exploiting coarse-grain parallelism and reducing communication overhead and frequency. Tiling calls for a compilation approach that performs first computation distribution and then data distribution, both possibly on a skewed iteration space. This paper presents a suite of compiler techniques for generating efficient SPMD programs to execute rectangularly tiled iteration spaces on distributed memory machines. The following issues are addressed: computation and data distribution, message-passing code generation, memory management and optimisations, and global-to-local address translation. Methods are developed for partitioning arbitrary iteration spaces and skewed data spaces. Techniques for generating efficient message-passing code for both arbitrary and rectangular iteration spaces are presented. A storage scheme for managing both local and non-local references is developed, which leads to SPMD code with high locality of references. Two memory optimisations are given to reduce the amount of memory usage for skewed iteration spaces and expanded arrays, respectively. The proposed compiler techniques are illustrated using a simple running example and finally analysed and evaluated based on experimental results on a Fujitsu AP1000 consisting of 128 processors.
Affine transformation for communication minimal parallelization and locality optimization of arbitrarily nested loop sequences
, 2007
Cited by 10 (7 self)
A long-running program often spends most of its time in nested loops. The polyhedral model provides powerful abstractions to optimize loop nests with regular accesses for parallel execution. Affine transformations in this model capture a complex sequence of execution-reordering loop transformations that improve performance by parallelization as well as better locality. Although a significant amount of research has addressed affine scheduling and partitioning, the problem of automatically finding good affine transforms for communication-optimized coarse-grained parallelization along with locality optimization for the general case of arbitrarily-nested loop sequences remains a challenging problem: most frameworks do not treat parallelization and locality optimization in an integrated manner, and/or do not optimize across a sequence of producer-consumer loops. In this paper, we develop an approach to communication minimization and locality optimization in tiling of arbitrarily nested loop sequences with affine dependences. We address the minimization of inter-tile communication volume in the processor space, and the minimization of reuse distances for local execution at each node. The approach can also fuse across a long sequence of loop nests that have a producer/consumer relationship. Programs requiring one-dimensional versus multi-dimensional time schedules are all handled with the same algorithm. Synchronization-free parallelism, permutable loops or pipelined parallelism, and inner parallel loops can be detected. Examples are provided that demonstrate the power of the framework. The algorithm has been incorporated into a tool chain to generate transformations from C/Fortran code in a fully automatic fashion.