Results 1 - 10 of 64
A practical automatic polyhedral parallelizer and locality optimizer
- In PLDI ’08: Proceedings of the ACM SIGPLAN 2008 Conference on Programming Language Design and Implementation, 2008
"... We present the design and implementation of an automatic polyhedral source-to-source transformation framework that can optimize regular programs (sequences of possibly imperfectly nested loops) for parallelism and locality simultaneously. Through this work, we show the practicality of analytical mod ..."
Abstract
-
Cited by 122 (8 self)
- Add to MetaCart
(Show Context)
We present the design and implementation of an automatic polyhedral source-to-source transformation framework that can optimize regular programs (sequences of possibly imperfectly nested loops) for parallelism and locality simultaneously. Through this work, we show the practicality of analytical model-driven automatic transformation in the polyhedral model. Unlike previous polyhedral frameworks, our approach is an end-to-end fully automatic one driven by an integer linear optimization framework that takes an explicit view of finding good ways of tiling for parallelism and locality using affine transformations. The framework has been implemented in a tool to automatically generate OpenMP parallel code from C program sections. Experimental results from the tool show very high performance for local and parallel execution on multi-cores, when compared with state-of-the-art compiler frameworks from the research community as well as the best native production compilers. The system also enables the easy use of powerful empirical/iterative optimization for general arbitrarily nested loop sequences.
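For readers unfamiliar with what such tools produce, the following is a minimal hand-written sketch, not output of the framework described above, of the kind of tiled, OpenMP-parallel C such a polyhedral optimizer emits. The kernel, problem size N, and tile size T are illustrative assumptions.

/* Hand-written sketch (not tool output): a 1-D Jacobi-like sweep,
 * tiled for locality and parallelized over tiles with OpenMP.
 * N, T, and the kernel itself are illustrative assumptions. */
#include <omp.h>

#define N 4096
#define T 256            /* tile size: an assumed, tunable parameter */

void smooth(const double a[N], double b[N])
{
    /* Parallelize across tiles; each tile is a contiguous chunk that
     * fits comfortably in cache and is processed by one thread. */
    #pragma omp parallel for schedule(static)
    for (int tt = 1; tt < N - 1; tt += T) {
        int hi = tt + T < N - 1 ? tt + T : N - 1;
        for (int i = tt; i < hi; i++)
            b[i] = 0.5 * (a[i - 1] + a[i + 1]);
    }
}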
Semi-automatic composition of loop transformations for deep parallelism and memory hierarchies.
- Int. J. Parallel Program., 2006
"... Modern compilers are responsible for translating the idealistic operational semantics of the source program into a form that makes efficient use of a highly complex heterogeneous machine. Since optimization problems are associated with huge and unstructured search spaces, this combinational task is ..."
Abstract
-
Cited by 82 (24 self)
- Add to MetaCart
(Show Context)
Modern compilers are responsible for translating the idealistic operational semantics of the source program into a form that makes efficient use of a highly complex heterogeneous machine. Since optimization problems are associated with huge and unstructured search spaces, this combinatorial task is poorly achieved in general, resulting in weak scalability and disappointing sustained performance. We address this challenge by working on the program representation itself, using a semi-automatic optimization approach to demonstrate that current compilers often suffer from unnecessary constraints and intricacies that can be avoided in a semantically richer transformation framework. Technically, the purpose of this paper is threefold: (1) to show that syntactic code representations close to the operational semantics lead to rigid phase ordering and cumbersome expression of architecture-aware loop transformations, (2) to illustrate how complex transformation sequences may be needed to achieve significant performance benefits, (3) to facilitate the automatic search for program transformation sequences, improving on classical polyhedral representations to better support operations research strategies in a simpler, structured search space. The proposed framework relies on a unified polyhedral representation of loops and statements, using normalization rules to allow flexible and expressive transformation sequencing. This representation makes it possible to extend the scalability of polyhedral dependence analysis and to delay the (automatic) legality checks until the end of a transformation sequence. Our work builds on algorithmic advances in polyhedral code generation and has been implemented in a modern research compiler.
Blocking and Array Contraction Across Arbitrarily Nested Loops Using Affine Partitioning
2001
"... Applicable to arbitrary sequences and nests of loops, affine partitioning is a program transformation framework that unifies many previously proposed loop transformations, including unimodular transforms, fusion, fission, reindexing, scaling and statement reordering. Algorithms based on affine parti ..."
Abstract
-
Cited by 80 (1 self)
- Add to MetaCart
(Show Context)
Applicable to arbitrary sequences and nests of loops, affine partitioning is a program transformation framework that unifies many previously proposed loop transformations, including unimodular transforms, fusion, fission, reindexing, scaling and statement reordering. Algorithms based on affine partitioning have been shown to be effective for parallelization and communication minimization. This paper presents algorithms that improve data locality using affine partitioning. Blocking and array contraction are two important optimizations that have been shown to be useful for data locality. Blocking creates a set of inner loops so that data brought into the faster levels of the memory hierarchy can be reused. Array contraction reduces an array to a scalar variable and thereby reduces the number of memory operations executed and the memory footprint. Loop transforms are often necessary to make blocking and array contraction possible.
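As a small hand-worked illustration of array contraction, the sketch below fuses a producer/consumer loop pair so that a full-size temporary array can be contracted to a scalar; the code and names are illustrative, not taken from the paper, and blocking would additionally wrap the fused loop in a tile loop.

/* Illustrative only: a producer/consumer pair over a temporary array t. */

/* Before: t[] is a full-size temporary, and each loop streams over memory. */
void before(const double *a, double *c, double *t, int n)
{
    for (int i = 1; i < n - 1; i++)
        t[i] = a[i - 1] + a[i + 1];      /* producer */
    for (int i = 1; i < n - 1; i++)
        c[i] = 0.5 * t[i];               /* consumer */
}

/* After: the loops are fused so each t[i] is consumed right after it is
 * produced, which lets the whole array t be contracted to one scalar. */
void after(const double *a, double *c, int n)
{
    for (int i = 1; i < n - 1; i++) {
        double t = a[i - 1] + a[i + 1];  /* contracted temporary */
        c[i] = 0.5 * t;
    }
}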
Compile-time Composition of Run-time Data and Iteration Reorderings
2003
"... Many important applications, such as those using sparse data structures, have memory reference patterns that are unknown at compile-time. Prior work has developed runtime reorderings of data and computation that enhance locality in such applications. ..."
Abstract
-
Cited by 54 (9 self)
- Add to MetaCart
Many important applications, such as those using sparse data structures, have memory reference patterns that are unknown at compile-time. Prior work has developed runtime reorderings of data and computation that enhance locality in such applications.
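One common run-time reordering of this kind, sketched here under illustrative assumptions (the function and variable names are hypothetical, and this is not necessarily the paper's algorithm), is first-touch data reordering: the data array is permuted into the order in which an index array first references it, improving spatial locality of indirect accesses such as y[i] += x[idx[i]].

/* Illustrative sketch of first-touch data reordering for an indirect
 * access pattern y[i] += x[idx[i]].  All names here are hypothetical. */
#include <stdlib.h>

/* Build a permutation that places elements of x in the order idx first
 * touches them, then remap both x and idx accordingly. */
void first_touch_reorder(double *x, int *idx, int n_x, int n_idx)
{
    int    *newpos = malloc(n_x * sizeof *newpos);  /* old index -> new index */
    double *xr     = malloc(n_x * sizeof *xr);
    for (int i = 0; i < n_x; i++) newpos[i] = -1;

    int next = 0;
    for (int i = 0; i < n_idx; i++)                 /* first-touch order */
        if (newpos[idx[i]] < 0)
            newpos[idx[i]] = next++;
    for (int i = 0; i < n_x; i++)                   /* untouched elements last */
        if (newpos[i] < 0)
            newpos[i] = next++;

    for (int i = 0; i < n_x; i++)                   /* permute the data */
        xr[newpos[i]] = x[i];
    for (int i = 0; i < n_x; i++) x[i] = xr[i];
    for (int i = 0; i < n_idx; i++)                 /* rewrite the index array */
        idx[i] = newpos[idx[i]];

    free(newpos);
    free(xr);
}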
Estimating Cache Misses and Locality Using Stack Distances
2003
"... Cache behavior modeling is an important part of modern optimizing compilers. In this paper we present a method to estimate the number of cache misses, at compile time, using a machine independent model based on stack algorithms. Our algorithm computes the stack histograms symbolically, using data de ..."
Abstract
-
Cited by 53 (0 self)
- Add to MetaCart
(Show Context)
Cache behavior modeling is an important part of modern optimizing compilers. In this paper we present a method to estimate the number of cache misses, at compile time, using a machine-independent model based on stack algorithms. Our algorithm computes the stack histograms symbolically, using data dependence distance vectors, and is exact when dependence distances are uniformly generated. The stack histogram accurately models fully associative caches with an LRU replacement policy, and provides a very good approximation for set-associative caches and for programs with non-constant dependence distances.
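For concreteness: the stack (reuse) distance of an access is the number of distinct addresses touched since the previous access to the same address, and a fully associative LRU cache of C lines hits exactly when that distance is below C. The paper derives these histograms symbolically at compile time; the sketch below instead measures them from a concrete address trace, assuming a trace small enough that a quadratic scan is acceptable.

/* Measure stack (reuse) distances from an address trace.
 * A runtime illustration of the metric only; the paper computes the
 * same histograms symbolically from dependence distance vectors. */
#include <stdio.h>

/* Distance of access t = number of distinct addresses seen strictly between
 * the previous access to trace[t] and position t (or -1 on a cold miss). */
static int stack_distance(const int *trace, int t)
{
    for (int p = t - 1; p >= 0; p--) {
        if (trace[p] == trace[t]) {
            int distinct = 0;
            for (int q = p + 1; q < t; q++) {        /* count distinct addrs */
                int seen_before = 0;
                for (int r = p + 1; r < q; r++)
                    if (trace[r] == trace[q]) { seen_before = 1; break; }
                if (!seen_before) distinct++;
            }
            return distinct;
        }
    }
    return -1;                                       /* cold (first) access */
}

int main(void)
{
    int trace[] = { 1, 2, 3, 1, 2, 1 };
    int n = sizeof trace / sizeof trace[0];
    for (int t = 0; t < n; t++)
        printf("access %d (addr %d): distance %d\n", t, trace[t],
               stack_distance(trace, t));
    /* With a fully associative LRU cache of C lines, access t hits
     * exactly when stack_distance(trace, t) is in [0, C-1]. */
    return 0;
}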
Synthesis of High-Performance Parallel Programs for a Class of Ab Initio Quantum Chemistry Models
- Proceedings of the IEEE, 2005
"... ..."
(Show Context)
Tiling Imperfectly-nested Loop Nests
- In Proc. of SC 2000, 2000
"... Tiling is one of the more important transformations for enhancing locality of reference in programs. Tiling of perfectly-nested loop nests (which are loop nests in which all assignment statements are contained in the innermost loop) is well understood. In practice, most loop nests are imperfectly-ne ..."
Abstract
-
Cited by 39 (0 self)
- Add to MetaCart
Tiling is one of the more important transformations for enhancing locality of reference in programs. Tiling of perfectly-nested loop nests (loop nests in which all assignment statements are contained in the innermost loop) is well understood. In practice, most loop nests are imperfectly-nested, so existing compilers heuristically try to find a sequence of transformations that converts such loop nests into perfectly-nested ones, but they do not always succeed. In this paper, we propose a novel approach to tiling imperfectly-nested loop nests. The key idea is to embed the iteration space of every statement in the imperfectly-nested loop nest into a special space called the product space. The set of possible embeddings is constrained so that the resulting product space can be legally tiled. From this set we choose embeddings that enhance data reuse. We evaluate the effectiveness of this approach for dense numerical linear algebra benchmarks, relaxation codes, and the tomcatv code from the ...
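A small hand-worked instance of the problem (illustrative only, and not the paper's product-space embedding): a matrix-vector product whose initialization statement sits outside the innermost loop, together with one legal tiling that guards the initialization so the j dimension can still be blocked to reuse a tile of x. N and T are assumed parameters.

/* Illustrative only: hand-tiling an imperfectly nested loop nest.
 * Original form:
 *   for (i) { y[i] = 0.0;  for (j) y[i] += A[i][j] * x[j]; }
 * The initialization is not inside the innermost loop, so the nest is
 * imperfectly nested; below, it is guarded to the first j-tile. */
#define N 1024
#define T 64

void matvec_tiled(double A[N][N], const double x[N], double y[N])
{
    for (int jj = 0; jj < N; jj += T) {         /* tile of the j dimension */
        for (int i = 0; i < N; i++) {
            if (jj == 0)
                y[i] = 0.0;                     /* runs once per i, on the first tile */
            int hi = jj + T < N ? jj + T : N;
            for (int j = jj; j < hi; j++)
                y[i] += A[i][j] * x[j];         /* reuses the x[jj..hi-1] tile across all i */
        }
    }
}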
Facilitating the Search for Compositions of Program Transformations
- In ACM Int. Conf. on Supercomputing (ICS’05), 2005
"... Static compiler optimizations can hardly cope with the complex run-time behavior and hardware components interplay of modern processor architectures. Multiple architectural phenomena occur and interact simultaneously, which requires the optimizer to combine multiple program transformations. Whether ..."
Abstract
-
Cited by 37 (11 self)
- Add to MetaCart
(Show Context)
Static compiler optimizations can hardly cope with the complex run-time behavior of modern processor architectures and the interplay of their hardware components. Multiple architectural phenomena occur and interact simultaneously, which requires the optimizer to combine multiple program transformations. Whether these transformations are selected through static analysis and models, runtime feedback, or both, the underlying infrastructure must have the ability to perform long and complex compositions of program transformations in a flexible manner. Existing compilers are ill-equipped to perform that task because of rigid phase ordering, fragile selection rules using pattern matching, and cumbersome expression of loop transformations on syntax trees. Moreover, iterative optimization emerges as a pragmatic and general means to select an optimization strategy via machine learning and operations research. Searching for the composition of dozens of complex, dependent, parameterized transformations is a challenge for iterative approaches.
A High-Level Approach to Synthesis of High-Performance Codes for Quantum Chemistry
- In Proc. of Supercomputing 2002, 2002
"... This paper discusses an approach to the synthesis of high-performance parallel programs for a class of computations encountered in quantum chemistry and physics. These computations are expressible as a set of tensor contractions and arise in electronic structure modeling. An overview is provided of ..."
Abstract
-
Cited by 37 (15 self)
- Add to MetaCart
(Show Context)
This paper discusses an approach to the synthesis of high-performance parallel programs for a class of computations encountered in quantum chemistry and physics. These computations are expressible as a set of tensor contractions and arise in electronic structure modeling. An overview is provided of the synthesis system, which transforms a high-level specification of the computation into high-performance parallel code tailored to the characteristics of the target architecture. An example from computational chemistry is used to illustrate how different code structures are generated under different assumptions about the memory available on the target computer.
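To fix ideas, a tensor contraction is a generalized matrix product: a sum over shared indices. The sketch below (illustrative only; the contraction, names, and sizes are assumptions, not taken from the paper) writes one small contraction two ways, showing the memory/operation trade-off the abstract alludes to: materializing an intermediate array costs extra memory but saves arithmetic operations.

/* Illustrative only: C[i][j] = sum over k,l of A[i][k]*B[k][l]*D[l][j],
 * written two ways.  N is an assumed problem size. */
#define N 64

/* Version 1: materialize the intermediate Tmp = A*B (needs N*N extra
 * storage, but only O(N^3) multiply-adds overall). */
void contract_with_intermediate(double A[N][N], double B[N][N],
                                double D[N][N], double C[N][N])
{
    static double Tmp[N][N];
    for (int i = 0; i < N; i++)
        for (int l = 0; l < N; l++) {
            Tmp[i][l] = 0.0;
            for (int k = 0; k < N; k++)
                Tmp[i][l] += A[i][k] * B[k][l];
        }
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            C[i][j] = 0.0;
            for (int l = 0; l < N; l++)
                C[i][j] += Tmp[i][l] * D[l][j];
        }
}

/* Version 2: no intermediate array, at the cost of O(N^4) multiply-adds. */
void contract_direct(double A[N][N], double B[N][N],
                     double D[N][N], double C[N][N])
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            C[i][j] = 0.0;
            for (int k = 0; k < N; k++)
                for (int l = 0; l < N; l++)
                    C[i][j] += A[i][k] * B[k][l] * D[l][j];
        }
}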
Generating Cache Hints for Improved Program Efficiency
- Journal of Systems Architecture, 2004
"... One of the new extensions in EPIC architectures are cache hints. On each memory instruction, two kinds of hints can be attached: a source cache hint and a target cache hint. The source hint indicates the true latency of the instruction, which is used by the compiler to improve the instruction schedu ..."
Abstract
-
Cited by 34 (4 self)
- Add to MetaCart
One of the new extensions in EPIC architectures is cache hints. Two kinds of hints can be attached to each memory instruction: a source cache hint and a target cache hint. The source hint indicates the true latency of the instruction, which is used by the compiler to improve the instruction schedule. The target hint indicates at which cache levels it is profitable to retain data, making it possible to improve cache replacement decisions at run time. A compile-time method is presented which calculates appropriate cache hints. Both kinds of hints are based on the locality of the instruction, measured by the reuse distance metric. Two ...
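As a sketch of the decision logic only, under hypothetical cache capacities not taken from the paper: a predicted reuse distance for a memory instruction is compared against each cache level's capacity in lines, and the target hint names the smallest level in which the next reuse still fits.

/* Illustrative only: map a reuse distance (in cache lines) to the cache
 * level worth keeping the data in.  Capacities are hypothetical. */
#include <stdio.h>

enum cache_hint { KEEP_IN_L1, KEEP_IN_L2, KEEP_IN_L3, NON_TEMPORAL };

static enum cache_hint hint_from_reuse_distance(long dist_in_lines)
{
    const long l1_lines = 512;      /* assumed:  32 KiB / 64 B lines */
    const long l2_lines = 4096;     /* assumed: 256 KiB / 64 B lines */
    const long l3_lines = 131072;   /* assumed:   8 MiB / 64 B lines */

    if (dist_in_lines < l1_lines) return KEEP_IN_L1;   /* reuse fits in L1 */
    if (dist_in_lines < l2_lines) return KEEP_IN_L2;
    if (dist_in_lines < l3_lines) return KEEP_IN_L3;
    return NON_TEMPORAL;            /* reuse too far away: bypass the caches */
}

int main(void)
{
    printf("%d\n", hint_from_reuse_distance(300));      /* -> KEEP_IN_L1 */
    printf("%d\n", hint_from_reuse_distance(1000000));  /* -> NON_TEMPORAL */
    return 0;
}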