Results 1 -
6 of
6
Loop Transformation Recipes for Code Generation and Auto-Tuning
"... Abstract. In this paper, we describe transformation recipes, which provide a high-level interface to the code transformation and code generation capability of a compiler. These recipes can be generated by compiler decision algorithms or savvy software developers. This interface is part of an auto-tu ..."
Abstract
-
Cited by 18 (2 self)
- Add to MetaCart
(Show Context)
Abstract. In this paper, we describe transformation recipes, which provide a high-level interface to the code transformation and code generation capability of a compiler. These recipes can be generated by compiler decision algorithms or savvy software developers. This interface is part of an auto-tuning framework that explores a set of different implementations of the same computation and automatically selects the best-performing implementation. Along with the original computation, a transformation recipe specifies a range of implementations of the computation resulting from composing a set of high-level code transformations. In our system, an underlying polyhedral framework coupled with transformation algorithms takes this set of transformations, composes them and automatically generates correct code. We first describe an abstract interface for transformation recipes, which we propose to facilitate interoperability with other transformation frameworks. We then focus on the specific transformation recipe interface used in our compiler and present performance results on its application to kernel and library tuning and tuning of key computations in high-end applications. We also show how this framework can be used to generate and auto-tune parallel OpenMP or CUDA code from a high-level specification. 1
Generating SIMD Vectorized Permutations
"... Abstract. This paper introduces a method to generate efficient vectorized implementations of small stride permutations using only vector load and vector shuffle instructions. These permutations are crucial for highperformance numerical kernels including the fast Fourier transform. Our generator take ..."
Abstract
-
Cited by 8 (6 self)
- Add to MetaCart
(Show Context)
Abstract. This paper introduces a method to generate efficient vectorized implementations of small stride permutations using only vector load and vector shuffle instructions. These permutations are crucial for highperformance numerical kernels including the fast Fourier transform. Our generator takes as input only the specification of the target platform’s SIMD vector ISA and the desired permutation. The basic idea underlying our generator is to model vector instructions as matrices and sequences of vector instructions as matrix formulas using the Kronecker product formalism. We design a rewriting system and a search mechanism that applies matrix identities to generate those matrix formulas that have vector structure and minimize a cost measure that we define. The formula is then translated into the actual vector program for the specified permutation. For three important classes of permutations, we show that our method yields a solution with the minimal number of vector shuffles. Inserting into a fast Fourier transform yields a significant speedup. 1
Performance Optimization of Tensor Contraction Expressions for Many Body Methods in Quantum Chemistry
"... Complex tensor contraction expressions arise in accurate electronic structure models in quantum chemistry, such as the coupled cluster method. This paper addresses two complementary aspects of performance optimization of such tensor contraction expressions. Transformations using algebraic properties ..."
Abstract
-
Cited by 1 (1 self)
- Add to MetaCart
Complex tensor contraction expressions arise in accurate electronic structure models in quantum chemistry, such as the coupled cluster method. This paper addresses two complementary aspects of performance optimization of such tensor contraction expressions. Transformations using algebraic properties of commutativity and associativity can be used to significantly decrease the number of arithmetic operations required for evaluation of these expressions. The identification of common subexpressions among a set of tensor contraction expressions can result in a reduction of the total number of operations required to evaluate the tensor contractions. The first part of the paper describes an effective algorithm for operation minimization
Transitive Closure on the Cell Broadband Engine: A study on Self-Scheduling in a Multicore Processor ∗
"... In this paper, we present a mapping methodology and optimizations for solving transitive closure on the Cell multicore processor. Using our approach, it is possible to achieve near peak performance for transitive closure on the Cell processor. We first parallelize the Standard Floyd Warshall algorit ..."
Abstract
-
Cited by 1 (0 self)
- Add to MetaCart
(Show Context)
In this paper, we present a mapping methodology and optimizations for solving transitive closure on the Cell multicore processor. Using our approach, it is possible to achieve near peak performance for transitive closure on the Cell processor. We first parallelize the Standard Floyd Warshall algorithm and show through analysis and experimental results that data communication is a bottleneck for performance and scalability. We parallelize a cache optimized version of Floyd Warshall algorithm to remove the memory bottleneck. As is the case with several scientific computing and industrial applications on a multicore processor, synchronization and scheduling of the cores plays a crucial role in determining the performance of this algorithm. We define a self-scheduling mechanism for the cores of a multicore processor and design a self-scheduler for Blocked Floyd Warshall algorithm on the Cell multicore processor to remove the scheduling bottleneck. We also present optimizations in scheduling order to remove synchronization points. Our implementations achieved up to 78GFLOPS. 1
ACKNOWLEDGEMENTS
, 2001
"... I dedicate this dissertation to my wife Muntaha and our two sons Hay them and Hani who have been always supportive, patient, and understanding. ii ..."
Abstract
- Add to MetaCart
(Show Context)
I dedicate this dissertation to my wife Muntaha and our two sons Hay them and Hani who have been always supportive, patient, and understanding. ii
Performance Optimization of Tensor Contraction Expressions for Many-Body Methods in Quantum Chemistry†
, 2009
"... Complex tensor contraction expressions arise in accurate electronic structure models in quantum chemistry, such as the coupled cluster method. This paper addresses two complementary aspects of performance optimization of such tensor contraction expressions. Transformations using algebraic properties ..."
Abstract
- Add to MetaCart
Complex tensor contraction expressions arise in accurate electronic structure models in quantum chemistry, such as the coupled cluster method. This paper addresses two complementary aspects of performance optimization of such tensor contraction expressions. Transformations using algebraic properties of commutativity and associativity can be used to significantly decrease the number of arithmetic operations required for evaluation of these expressions. The identification of common subexpressions among a set of tensor contraction expressions can result in a reduction of the total number of operations required to evaluate the tensor contractions. The first part of the paper describes an effective algorithm for operation minimization with common subexpression identification and demonstrates its effectiveness on tensor contraction expressions for coupled cluster equations. The second part of the paper highlights the importance of data layout transformation in the optimization of tensor contraction computations on modern processors. A number of considerations, such as minimization of cache misses and utilization of multimedia vector instructions, are discussed. A library for efficient index permutation of multidimensional tensors is described, and experimental performance data is provided that demonstrates its effectiveness.