Results 1 - 10 of 84
Data-centric Multi-level Blocking
, 1997
Cited by 139 (9 self)
We present a simple and novel framework for generating blocked codes for high-performance machines with a memory hierarchy. Unlike traditional compiler techniques like tiling, which are based on reasoning about the control flow of programs, our techniques are based on reasoning directly about the flow of data through the memory hierarchy. Our data-centric transformations permit a more direct solution to the problem of enhancing data locality than current control-centric techniques do, and generalize easily to multiple levels of memory hierarchy. We buttress these claims with performance numbers for standard benchmarks from the problem domain of dense numerical linear algebra. The simplicity and intuitive appeal of our approach should make it attractive to compiler writers as well as to library writers. 1 Introduction Data reuse is imperative for good performance on modern high-performance computers because the memory architecture of these machines is a hierarchy in which the cost of ...
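For readers unfamiliar with blocked codes, the following is a minimal sketch of classical loop tiling applied to matrix multiplication. It illustrates the control-centric baseline the abstract contrasts itself with, not the paper's data-centric method. The function names and the `block` parameter are assumptions made for illustration.

```python
def matmul_naive(a, b):
    """Plain triple loop: C = A * B for lists of lists."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            for k in range(m):
                c[i][j] += a[i][k] * b[k][j]
    return c


def matmul_blocked(a, b, block=2):
    """Same computation, iterated tile by tile.

    `block` is a hypothetical tile edge length; on real hardware it would
    be chosen so that the tiles being worked on fit in a cache level.
    """
    n, m, p = len(a), len(b), len(b[0])
    c = [[0] * p for _ in range(n)]
    # The three outer loops walk over tiles of the iteration space.
    for ii in range(0, n, block):
        for jj in range(0, p, block):
            for kk in range(0, m, block):
                # The three inner loops stay inside one tile; min() handles
                # matrix sizes that are not a multiple of `block`.
                for i in range(ii, min(ii + block, n)):
                    for j in range(jj, min(jj + block, p)):
                        acc = c[i][j]
                        for k in range(kk, min(kk + block, m)):
                            acc += a[i][k] * b[k][j]
                        c[i][j] = acc
    return c
```

Both functions compute the same product; only the order in which array elements are touched differs, which is what affects data reuse in a memory hierarchy.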
A comparison of empirical and model-driven optimization
 In ACM Symp. on Programming Language Design and Implementation (PLDI’03)
, 2003
Cited by 81 (9 self)
Empirical program optimizers estimate the values of key optimization parameters by generating different program versions and running them on the actual hardware to determine which values give the best performance. In contrast, conventional compilers use models of programs and machines to choose these parameters. It is widely believed that empirical optimization is more effective than model-driven optimization, but few quantitative comparisons have been done to date. To make such a comparison, we replaced the empirical optimization engine in ATLAS (a system for generating dense numerical linear algebra libraries) with a model-based optimization engine that used detailed models to estimate values for optimization parameters, and then measured the relative performance of the two systems on three different hardware platforms. Our experiments show that although model-based optimization can be surprisingly effective, useful models may have to consider not only hardware parameters but also the ability of back-end compilers to exploit hardware resources.
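The contrast between the two strategies can be sketched as follows. This is a toy illustration only: the capacity rule `3 * NB * NB <= cache_elems` (three square tiles fit in cache) is an assumed simplification, not the model used in the paper or in ATLAS, and `kernel` is a hypothetical callable standing in for a generated program version.

```python
import time


def model_tile_size(cache_elems):
    """Model-driven choice: largest NB with 3 * NB * NB <= cache_elems.

    The rule is an assumed toy model; it needs no program runs at all.
    """
    nb = 1
    while 3 * (nb + 1) * (nb + 1) <= cache_elems:
        nb += 1
    return nb


def empirical_tile_size(candidates, kernel, repeats=3):
    """Empirical choice: run every candidate and keep the fastest.

    `kernel(nb)` is a hypothetical function that executes the program
    version generated for tile size `nb` on the actual machine.
    """
    best_nb, best_time = None, float("inf")
    for nb in candidates:
        elapsed = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            kernel(nb)
            # Keep the minimum over repeats to reduce timing noise.
            elapsed = min(elapsed, time.perf_counter() - start)
        if elapsed < best_time:
            best_nb, best_time = nb, elapsed
    return best_nb
```

The model-driven function is deterministic and cheap, while the empirical one depends on measurements taken on the machine at hand, so its answer may differ from run to run.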
A practical automatic polyhedral parallelizer and locality optimizer
 In PLDI ’08: Proceedings of the ACM SIGPLAN 2008 conference on Programming language design and implementation
, 2008
Cited by 62 (2 self)
We present the design and implementation of an automatic polyhedral source-to-source transformation framework that can optimize regular programs (sequences of possibly imperfectly nested loops) for parallelism and locality simultaneously. Through this work, we show the practicality of analytical model-driven automatic transformation in the polyhedral model. Unlike previous polyhedral frameworks, our approach is an end-to-end fully automatic one driven by an integer linear optimization framework that takes an explicit view of finding good ways of tiling for parallelism and locality using affine transformations. The framework has been implemented into a tool to automatically generate OpenMP parallel code from C program sections. Experimental results from the tool show very high performance for local and parallel execution on multicores, when compared with state-of-the-art compiler frameworks from the research community as well as the best native production compilers. The system also enables the easy use of powerful empirical/iterative optimization for general arbitrarily nested loop sequences.
Synthesizing transformations for locality enhancement of imperfectly-nested loop nests
 In Proceedings of the 2000 ACM International Conference on Supercomputing
, 2000
Cited by 56 (3 self)
We present an approach for synthesizing transformations to enhance locality in imperfectly-nested loops. The key idea is to embed the iteration space of every statement in a loop nest into a special iteration space called the product space. The product space can be viewed as a perfectly-nested loop nest, so embedding generalizes techniques like code sinking and loop fusion that are used in ad hoc ways in current compilers to produce perfectly-nested loops from imperfectly-nested ones. In contrast to these ad hoc techniques however, our embeddings are chosen carefully to enhance locality. The product space is then transformed further to enhance locality, after which fully permutable loops are tiled, and code is generated. We evaluate the effectiveness of this approach for dense numerical linear algebra benchmarks, relaxation codes, and the tomcatv code from the SPEC benchmarks. 1. BACKGROUND AND PREVIOUS WORK Sophisticated algorithms based on polyhedral algebra have been developed for determining good sequences of linear loop transformations (permutation, skewing, reversal and scaling) for enhancing locality in perfectly-nested loops (see footnote 1). Highlights of this technology are the following. The iterations of the loop nest are modeled as points in an integer lattice, and linear loop transformations are modeled as nonsingular matrices mapping one lattice to another. A sequence of loop transformations is modeled by the product of matrices representing the individual transformations; since the set of nonsingular matrices is closed under matrix product, this means that a sequence of linear loop transformations can be represented by a nonsingular matrix. The problem of finding an optimal sequence of linear loop transformations is thus reduced to the problem of finding an integer matrix that satisfies some desired property, permitting the full machinery of matrix methods and lattice theory to ...
Footnote: This work was supported by NSF grants CCR-9720211, EIA-9726388, ACI-9870687, EIA-9972853.
Footnote 1: A perfectly-nested loop is a set of loops in which all assignment statements are contained in the innermost loop.
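The matrix view of loop transformations described in this entry can be made concrete with a small sketch for a two-deep loop nest. The matrices below are the standard textbook forms of interchange, skewing and reversal; the helper names are assumptions made for illustration, and no dependence (legality) check is performed.

```python
def mat_mul(x, y):
    """Product of two small integer matrices given as lists of rows."""
    return [
        [sum(x[i][k] * y[k][j] for k in range(len(y))) for j in range(len(y[0]))]
        for i in range(len(x))
    ]


def det2(m):
    """Determinant of a 2x2 matrix; nonzero means nonsingular."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]


def apply(t, point):
    """Map an iteration (i, j) to its position in the transformed nest."""
    return tuple(sum(t[r][c] * point[c] for c in range(len(point))) for r in range(len(t)))


INTERCHANGE = [[0, 1], [1, 0]]   # (i, j) -> (j, i)
SKEW = [[1, 0], [1, 1]]          # (i, j) -> (i, i + j)
REVERSAL = [[1, 0], [0, -1]]     # (i, j) -> (i, -j)

# Skewing followed by interchange is the single matrix INTERCHANGE * SKEW.
COMBINED = mat_mul(INTERCHANGE, SKEW)
```

Here `COMBINED` is `[[1, 1], [1, 0]]` with determinant -1, so the two-step sequence is again a single nonsingular matrix, as the text states.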
On Tiling as a Loop Transformation
, 1997
Cited by 37 (9 self)
This paper is a follow-up to Irigoin and Triolet's earlier work and our recent work on tiling. In this paper, tiling is discussed in terms of its effects on the dependences between tiles, the dependences within a tile and the required dependence test for legality. A necessary and sufficient condition is given for enforcing the data dependences of the program, while Irigoin and Triolet's atomic tile constraint is only sufficient. A condition is identified under which both Irigoin and Triolet's and our constraints are equivalent. The results of this paper are discussed in terms of their impact on dependence abstractions suitable for legality test and on tiling to optimise a certain given goal. Keywords: Tiling, loop transformation, dependence analysis, code generation. 1. Introduction Blocked algorithms are widely known to achieve high performance on parallel computers with a memory hierarchy [8, 9]. Tiling is a loop transformation that a parallelising compiler can use to automatically ...
Communication-Minimal Tiling of Uniform Dependence Loops
, 1996
Cited by 37 (4 self)
Tiling is a loop transformation that a compiler uses to create automatically blocked algorithms in order to improve the benefits of the memory hierarchy and reduce the communication overhead between processors. Motivated by existing results, this paper presents a conceptually simple approach to finding tilings with a minimal amount of communication between tiles. The development of almost all results is based primarily on the inequality of arithmetic and geometric means and the concept of extremal rays from convex cones. The key insight is that a tiling that is communication-minimal must induce the same amount of communication through all faces of a tile, which restricts the search space for optimal tilings to those tiling matrices whose rows are all extremal rays in a cone. For nested loops with several special forms of dependences, closed-form optimal tilings are derived. In the general case, a procedure is given that always returns optimal tilings. An efficient implementation of t...
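The "equal communication through all faces" insight can be checked on a deliberately simplified case. The sketch below assumes a two-dimensional rectangular tile with integer sides `a` and `b`, a fixed volume `a * b`, and per-face costs `d1 * b` and `d2 * a`; this cost model and all names are assumptions for illustration, not the paper's general formulation with tiling matrices and extremal rays. Under this model the arithmetic-geometric mean inequality gives `d1*b + d2*a >= 2*sqrt(d1*d2*a*b)`, with equality exactly when the two face costs match.

```python
def face_costs(a, b, d1, d2):
    """Communication through the two kinds of tile faces (assumed model)."""
    return d1 * b, d2 * a


def best_rectangular_tile(volume, d1, d2):
    """Brute-force search over integer tile shapes with a * b == volume.

    Returns (total_communication, a, b) for the cheapest shape.
    """
    best = None
    for a in range(1, volume + 1):
        if volume % a:
            continue  # only shapes that give exactly the requested volume
        b = volume // a
        total = sum(face_costs(a, b, d1, d2))
        if best is None or total < best[0]:
            best = (total, a, b)
    return best
```

For `volume=64`, `d1=1`, `d2=4` the search returns the 4 x 16 tile, whose two face costs are both 16, matching the bound `2 * sqrt(1 * 4 * 64) = 32`.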
Synthesis of High-Performance Parallel Programs for a Class of Ab Initio Quantum Chemistry Models
 PROCEEDINGS OF THE IEEE
, 2005
Tiling Imperfectly-nested Loop Nests
 In Proc. of SC 2000
, 2000
Cited by 33 (0 self)
Tiling is one of the more important transformations for enhancing locality of reference in programs. Tiling of perfectly-nested loop nests (which are loop nests in which all assignment statements are contained in the innermost loop) is well understood. In practice, most loop nests are imperfectly-nested, so existing compilers heuristically try to find a sequence of transformations that convert such loop nests into perfectly-nested ones, but they do not always succeed. In this paper, we propose a novel approach to tiling imperfectly-nested loop nests. The key idea is to embed the iteration space of every statement in the imperfectly-nested loop nest into a special space called the product space. The set of possible embeddings is constrained so that the resulting product space can be legally tiled. From this set we choose embeddings that enhance data reuse. We evaluate the effectiveness of this approach for dense numerical linear algebra benchmarks, relaxation codes, and the tomcatv code from the...
Minimizing Completion Time for Loop Tiling with Computation and Communication Overlapping
 In Proceedings of IEEE Int’l Parallel and Distributed Processing Symposium (IPDPS’01)
, 2001
Cited by 26 (12 self)
This paper proposes a new method for the problem of minimizing the execution time of nested for-loops using a tiling transformation. In our approach, we are interested not only in tile size and shape according to the required communication-to-computation ratio, but also in overall completion time. We select a time hyperplane to execute different tiles much more efficiently by exploiting the inherent overlapping between communication and computation phases among successive, atomic tile executions. We assign tiles to processors according to the tile space boundaries, thus considering the iteration space bounds. Our schedule considerably reduces overall completion time under the assumption that some part from every communication phase can be efficiently overlapped with atomic, pure tile computations. The overall schedule resembles a pipelined datapath where computations are not anymore interleaved with sends and receives to nonlocal processors. Experimental results in a cluster of Pentiums by using various MPI send primitives show that the total completion time is significantly reduced.
On Supernode Transformation with Minimized Total Running Time
Cited by 26 (3 self)
With the objective of minimizing the total execution time of a parallel program on a distributed memory parallel computer, this paper discusses how to find an optimal supernode size and optimal supernode relative side lengths of a supernode transformation (also known as tiling). We identify three parameters of supernode transformation: supernode size, relative side lengths, and cutting hyperplane directions. For algorithms with perfectly nested loops and uniform dependencies, for sufficiently large supernodes and number of processors, and for the case where multiple supernodes are mapped to a single processor, we give an order n polynomial whose real positive roots include the optimal supernode size. For two special cases: (1) two-dimensional algorithm problems and (2) n-dimensional algorithm problems where the communication cost is dominated by the startup penalty and therefore, can be approximated by a constant, we give a closed form expression for the optimal supernode s...