Results 1–10 of 37
Statistical models for empirical search-based performance tuning
INTERNATIONAL JOURNAL OF HIGH PERFORMANCE COMPUTING APPLICATIONS, 2004
"... Achieving peak performance from the computational kernels that dominate application performance often requires extensive machinedependent tuning by hand. Automatic tuning systems have emerged in response, and they typically operate by (1) generating a large number of possible, reasonable implementa ..."
Cited by 33 (2 self)
Achieving peak performance from the computational kernels that dominate application performance often requires extensive machine-dependent tuning by hand. Automatic tuning systems have emerged in response, and they typically operate by (1) generating a large number of possible, reasonable implementations of a kernel, and (2) selecting the fastest implementation by a combination of heuristic modeling, heuristic pruning, and empirical search (i.e., actually running the code). This paper presents quantitative data that motivates the development of such a search-based system, using dense matrix multiply as a case study. The statistical distributions of performance within spaces of reasonable implementations, when observed on a variety of hardware platforms, lead us to pose and address two general problems which arise during the search process. First, we develop a heuristic for stopping an exhaustive compile-time search early if a near-optimal implementation is found. Second, we show how to construct …
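A minimal sketch of one such early-stopping rule (a standard order-statistics argument, not necessarily the exact criterion used in the paper): sample implementations uniformly at random and stop once the best implementation seen so far lies in the top alpha fraction of the space with probability at least 1 - delta. The names `candidates` and `measure` are illustrative placeholders.

```python
import math
import random

def search_with_early_stop(candidates, measure, alpha=0.05, delta=0.1):
    """Randomly sample implementations; stop once the best seen so far lies
    in the top alpha fraction with probability >= 1 - delta. After t uniform
    samples, P(best not in top alpha) = (1 - alpha)**t, so we can stop as
    soon as (1 - alpha)**t <= delta."""
    t_stop = math.ceil(math.log(delta) / math.log(1 - alpha))
    pool = list(candidates)
    random.shuffle(pool)
    best, best_time = None, float("inf")
    for t, impl in enumerate(pool, start=1):
        runtime = measure(impl)          # empirical search: actually run it
        if runtime < best_time:
            best, best_time = impl, runtime
        if t >= t_stop:                  # probabilistic stopping rule
            break
    return best, best_time
```

With the defaults (alpha=0.05, delta=0.1) the rule stops after at most 45 measured implementations, regardless of the size of the space.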
Predictive Modeling in a Polyhedral Optimization Space
INTERNATIONAL SYMPOSIUM ON CODE GENERATION AND OPTIMIZATION (CGO'11), 2011
"... Significant advances in compiler optimization have been made in recent years, enabling many transformations such as tiling, fusion, parallelization and vectorization on imperfectly nested loops. Nevertheless, the problem of finding the best combination of loop transformations remains a major challen ..."
Cited by 18 (3 self)
Significant advances in compiler optimization have been made in recent years, enabling many transformations such as tiling, fusion, parallelization and vectorization on imperfectly nested loops. Nevertheless, the problem of finding the best combination of loop transformations remains a major challenge. Polyhedral models for compiler optimization have demonstrated strong potential for enhancing program performance, in particular for compute-intensive applications. But existing static cost models to optimize polyhedral transformations have significant limitations, and iterative compilation has become a very promising alternative to these models to find the most effective transformations. However, since the number of polyhedral optimization alternatives can be enormous, it is often impractical to iterate over a significant fraction of the entire space of polyhedrally transformed variants. Recent research has focused on iterating over this search space either with manually-constructed heuristics or with automatic but very expensive search algorithms (e.g., genetic algorithms) that can eventually find good points in the polyhedral space. In this paper, we propose the use of machine learning to address the problem of selecting the best polyhedral optimizations. We show that these models can quickly find high-performance program variants in the polyhedral space, without resorting to extensive empirical search. We introduce models that take as input a characterization of a program based on its dynamic behavior, and predict the performance of aggressive high-level polyhedral transformations that include tiling, parallelization and vectorization. We allow for a minimal empirical search on the target machine, discovering on average 83% of the search-space-optimal combinations in at most 5 runs. Our end-to-end framework is validated using numerous benchmarks on two multicore platforms.
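The core mechanism can be illustrated with a toy version of such a model (a plain least-squares fit in pure Python; the paper's actual models, features, and training procedure differ): learn weights from variants whose performance has been measured, then rank unseen variants by predicted performance so that only the top few need to be run.

```python
def fit_linear(X, y):
    """Least-squares fit of y ~ X @ w via the normal equations, solved with
    Gaussian elimination (tiny dense solver; no external libraries)."""
    n = len(X[0])
    A = [[sum(r[i] * r[j] for r in X) for j in range(n)] for i in range(n)]
    b = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(n)]
    for col in range(n):                       # forward elimination w/ pivoting
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    w = [0.0] * n
    for i in reversed(range(n)):               # back substitution
        w[i] = (b[i] - sum(A[i][j] * w[j] for j in range(i + 1, n))) / A[i][i]
    return w

def rank_variants(w, variants):
    """Sort candidate feature vectors by predicted performance, best first;
    only the leading few would then be measured empirically."""
    return sorted(variants, key=lambda v: -sum(wi * fi for wi, fi in zip(w, v)))
```

The "minimal empirical search" of the abstract corresponds to running only the handful of variants that `rank_variants` places first.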
Data locality optimization for synthesis of efficient out-of-core algorithms
In Proc. of the Intl. Conf. on High Performance Computing, 2003
"... Abstract. This paper describes an approach to synthesis of efficient outofcore code for a class of imperfectly nested loops that represent tensor contraction computations. Tensor contraction expressions arise in many accurate computational models of electronic structure. The developed approach co ..."
Cited by 14 (12 self)
This paper describes an approach to synthesis of efficient out-of-core code for a class of imperfectly nested loops that represent tensor contraction computations. Tensor contraction expressions arise in many accurate computational models of electronic structure. The developed approach combines loop fusion with loop tiling and uses a performance-model-driven approach to loop tiling for the generation of out-of-core code. Experimental measurements are provided that show a good match with model-based predictions and demonstrate the effectiveness of the proposed algorithm.
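As a rough illustration of the tiled, out-of-core execution style (a generic blocked matrix product, not the paper's synthesis algorithm), one can simulate operand tiles residing on disk and count tile loads, the quantity a performance model would trade off against tile size:

```python
def tile(M, T):
    """Split a dense matrix (list of lists, size divisible by T) into a dict
    of T x T tiles keyed by tile coordinates, standing in for disk files."""
    n = len(M)
    return {(i, j): [row[j*T:(j+1)*T] for row in M[i*T:(i+1)*T]]
            for i in range(n // T) for j in range(n // T)}

def ooc_matmul(A_tiles, B_tiles, n, T):
    """Out-of-core style multiply C = A @ B: loop over tiles, 'loading' each
    operand tile as needed; returns (result matrix, number of tile loads)."""
    nb = n // T
    C = [[0.0] * n for _ in range(n)]
    loads = 0
    for i in range(nb):
        for j in range(nb):
            for k in range(nb):
                a, b = A_tiles[(i, k)], B_tiles[(k, j)]   # simulated disk reads
                loads += 2
                for ii in range(T):
                    for jj in range(T):
                        C[i*T + ii][j*T + jj] += sum(
                            a[ii][kk] * b[kk][jj] for kk in range(T))
    return C, loads
```

Fusing loops and enlarging T reduces `loads` at the cost of a bigger in-memory working set, which is exactly the tension the paper's model resolves.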
Raising the Level of Programming Abstraction in Scalable Programming Models
In IEEE International Conference on High Performance Computer Architecture (HPCA), Workshop on Productivity and Performance in High-End Computing (PPHEC), 2004
"... The complexity of modern scientific simulations combined with the complexity of the highperformance computer hardware on which they run place an everincreasing burden on scientific software developers, with clear impacts on both productivity and performance. We argue that raising the level of abstr ..."
Cited by 12 (3 self)
The complexity of modern scientific simulations combined with the complexity of the high-performance computer hardware on which they run place an ever-increasing burden on scientific software developers, with clear impacts on both productivity and performance. We argue that raising the level of abstraction of the programming model/environment is a key element of addressing this situation. We present examples of two distinctly different approaches to raising the level of abstraction of the programming model while maintaining or increasing performance: the Tensor Contraction Engine, a narrowly-focused domain-specific language together with an optimizing compiler; and Extended Global Arrays, a programming framework that integrates programming models dealing with different layers of the memory/storage hierarchy using compiler analysis and code transformation techniques.
Hypergraph partitioning for automatic memory hierarchy management
In Supercomputing (SC06), 2006
"... In this paper, we present a mechanism for automatic management of the memory hierarchy, including secondary storage, in the context of a global address space parallel programming framework. The programmer specifies the parallelism and locality in the computation. The scheduling of the computation in ..."
Cited by 5 (3 self)
In this paper, we present a mechanism for automatic management of the memory hierarchy, including secondary storage, in the context of a global address space parallel programming framework. The programmer specifies the parallelism and locality in the computation. The scheduling of the computation into stages, together with the movement of the associated data between secondary storage and global memory, and between global memory and local memory, is automatically managed. A novel formulation of hypergraph partitioning is used to model the optimization problem of minimizing disk I/O. Experimental evaluation of the proposed approach using a subcomputation from the quantum chemistry domain shows a reduction in the disk I/O cost by up to a factor of 11, and a reduction in turnaround time by up to 49%, as compared to alternative approaches used in state-of-the-art quantum chemistry codes.
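A toy version of the underlying idea (far simpler than the paper's formulation, and with an invented balance rule): treat each data block as a hyperedge over the tasks that touch it, use the standard connectivity metric as a proxy for disk I/O (each extra stage touching a block re-reads it), and greedily move tasks between stages while the proxy improves:

```python
import math

def io_cost(blocks, part):
    """blocks: {name: (weight, [tasks])}; part: {task: stage id}.
    Connectivity metric: sum over blocks of weight * (stages touched - 1)."""
    cost = 0
    for w, tasks in blocks.values():
        stages = {part[t] for t in tasks}
        cost += w * (len(stages) - 1)
    return cost

def greedy_refine(blocks, part, stages):
    """Move single tasks between stages while the I/O proxy improves,
    subject to a crude balance cap (real partitioners do far better)."""
    cap = math.ceil(len(part) / len(stages))
    improved = True
    while improved:
        improved = False
        for t in list(part):
            cur = part[t]
            best_s, best_c = cur, io_cost(blocks, part)
            for s in stages:
                if s == cur or sum(1 for u in part if part[u] == s) >= cap:
                    continue                     # stage full: keep balance
                part[t] = s
                c = io_cost(blocks, part)
                part[t] = cur
                if c < best_c:
                    best_s, best_c = s, c
            if best_s != cur:
                part[t] = best_s
                improved = True
    return part
```

On a chain of five tasks sharing four blocks, the refinement collapses an alternating assignment (every block cut) down to a single cut block.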
On efficient out-of-core matrix transposition
2003
"... This paper addresses the problem of transposition of large outofcore arrays. Although algorithms for outofcore matrix transposition have been widely studied, previously proposed algorithms have sought to minimize the number of I/O operations and the inmemory permutation time. We propose an alg ..."
Cited by 5 (3 self)
This paper addresses the problem of transposition of large out-of-core arrays. Although algorithms for out-of-core matrix transposition have been widely studied, previously proposed algorithms have sought to minimize the number of I/O operations and the in-memory permutation time. We propose an algorithm that directly targets the improvement of overall transposition time. The proposed algorithm is decoupled from the matrix dimensions and tied instead to the I/O characteristics of the system, which are used to determine the read and write block sizes. These I/O block sizes are chosen in order to optimize the total execution time. Experimental results are provided that demonstrate the …
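The blocked-transposition idea can be sketched as follows (a simplified simulation, not the paper's algorithm; here the read block size `r` is a free parameter, whereas the paper derives the block sizes from measured I/O characteristics):

```python
def blocked_transpose(disk_in, n, r):
    """Transpose an n x n row-major array stored as a flat list ('on disk')
    by reading r rows at a time and writing n column pieces per panel.
    Returns (transposed flat array, count of simulated I/O operations)."""
    disk_out = [0] * (n * n)
    io_ops = 0
    for i0 in range(0, n, r):
        h = min(r, n - i0)                       # rows in this panel
        rows = disk_in[i0 * n:(i0 + h) * n]      # one sequential panel read
        io_ops += 1
        for j in range(n):                       # one strided write per column
            for di in range(h):
                disk_out[j * n + (i0 + di)] = rows[di * n + j]
            io_ops += 1
    return disk_out, io_ops
```

Larger `r` means fewer, bigger reads but a bigger in-memory buffer; choosing it from the system's measured read/write costs is the point of the paper.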
Efficient parallel out-of-core matrix transposition
In: Proceedings of the International Conference on Cluster Computing, IEEE Computer Society Press, 2003
"... This paper addresses the problem of parallel transposition of large outofcore arrays. Although algorithms for outofcore matrix transposition have been widely studied, previously proposed algorithms have sought to minimize the number of I/O operations and the inmemory permutation time. We propose ..."
Cited by 5 (3 self)
This paper addresses the problem of parallel transposition of large out-of-core arrays. Although algorithms for out-of-core matrix transposition have been widely studied, previously proposed algorithms have sought to minimize the number of I/O operations and the in-memory permutation time. We propose an algorithm that directly targets the improvement of overall transposition time. The I/O characteristics of the system are used to determine the read, write and communication block sizes such that the total execution time is minimized. We also provide a solution to the array redistribution problem for arrays on disk. The solutions to the sequential transposition problem and the parallel array redistribution problem are then combined to obtain an algorithm for the parallel out-of-core transposition problem.
Automated operation minimization of tensor contraction expressions in electronic structure calculations
In Proc. ICCS 2005, 5th International Conference, volume 3514 of Lecture Notes in Computer Science, 2005
"... Abstract. Complex tensor contraction expressions arise in accurate electronic structure models in quantum chemistry, such as the Coupled Cluster method. Transformations using algebraic properties of commutativity and associativity can be used to significantly decrease the number of arithmetic operat ..."
Cited by 5 (2 self)
Complex tensor contraction expressions arise in accurate electronic structure models in quantum chemistry, such as the Coupled Cluster method. Transformations using algebraic properties of commutativity and associativity can be used to significantly decrease the number of arithmetic operations required for evaluation of these expressions, but the optimization problem is NP-hard. Operation minimization is an important optimization step for the Tensor Contraction Engine, a tool being developed for the automatic transformation of high-level tensor contraction expressions into efficient programs. In this paper, we develop an effective heuristic approach to the operation minimization problem, and demonstrate its effectiveness on tensor contraction expressions for coupled cluster equations.
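While the general problem over arbitrary tensor expressions is NP-hard (hence the paper's heuristic), the flavor of operation minimization via associativity is captured by the classical matrix-chain special case, which dynamic programming solves exactly:

```python
def matrix_chain_cost(dims):
    """dims[i] x dims[i+1] are the dimensions of matrix i. Returns the minimum
    number of scalar multiplications over all parenthesizations of the chain
    product M0 @ M1 @ ... (the classic O(n^3) dynamic program)."""
    n = len(dims) - 1
    m = [[0] * n for _ in range(n)]
    for length in range(2, n + 1):               # subchain length
        for i in range(n - length + 1):
            j = i + length - 1
            m[i][j] = min(
                m[i][k] + m[k + 1][j] + dims[i] * dims[k + 1] * dims[j + 1]
                for k in range(i, j))            # split point
    return m[0][n - 1]
```

For a 10x30 by 30x5 by 5x60 chain, ((AB)C) costs 4500 multiplications versus 27000 for (A(BC)), which is the kind of algebraic choice the paper's heuristic makes over far richer expression trees.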
Efficient Search-Space Pruning for Integrated Fusion and Tiling Transformations
"... Abstract. Compiletime optimizations involve a number of transformations such as loop permutation, fusion, tiling, array contraction, etc. Determination of the choice of these transformations that minimizes the execution time is a challenging task. We address this problem in the context of tensor co ..."
Cited by 4 (1 self)
Compile-time optimizations involve a number of transformations such as loop permutation, fusion, tiling, array contraction, etc. Determining the choice of these transformations that minimizes execution time is a challenging task. We address this problem in the context of tensor contraction expressions involving arrays too large to fit in main memory. Domain-specific features of the computation are exploited to develop an integrated framework that facilitates the exploration of the entire search space of optimizations. In this paper, we discuss the exploration of the space of loop fusion and tiling transformations in order to minimize the disk I/O cost. These two transformations are integrated and pruning strategies are presented that significantly reduce the number of loop structures to be evaluated for subsequent transformations. The evaluation of the framework using representative contraction expressions from quantum chemistry shows a dramatic reduction in the size of the search space using the strategies presented.
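One generic pruning idea of this kind (a hedged sketch, not the paper's specific strategies) is to discard tile-size candidates whose resident working set cannot fit in memory before any I/O-cost evaluation is attempted:

```python
from itertools import product

def prune_tile_sizes(candidates, memory_words):
    """candidates: iterable of (Ti, Tj, Tk) tile-size tuples. For a fused
    contraction that keeps one tile each of A (Ti*Tk), B (Tk*Tj) and
    C (Ti*Tj) resident, keep only tuples satisfying the capacity bound."""
    keep = []
    for ti, tj, tk in candidates:
        if ti * tk + tk * tj + ti * tj <= memory_words:
            keep.append((ti, tj, tk))
    return keep

# Hypothetical example: power-of-two tiles up to 64, a 4K-word memory.
cands = list(product([8, 16, 32, 64], repeat=3))
```

Only the survivors of this cheap feasibility test would be passed on to disk-I/O cost evaluation, which is the spirit of evaluating far fewer loop structures.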
Using Machine Learning to Improve Automatic Vectorization
"... Automatic vectorization is critical to enhancing performance of computeintensive programs on modern processors. However, there is much room for improvement over the autovectorization capabilities of current production compilers, through careful vectorcode synthesis that utilizes a variety of loo ..."
Cited by 4 (1 self)
Automatic vectorization is critical to enhancing performance of compute-intensive programs on modern processors. However, there is much room for improvement over the auto-vectorization capabilities of current production compilers, through careful vector-code synthesis that utilizes a variety of loop transformations (e.g., unroll-and-jam, interchange, etc.). As the set of transformations considered is increased, the selection of the most effective combination of transformations becomes a significant challenge: the cost models currently used in vectorizing compilers are often unable to identify the best choices. In this paper, we address this problem using machine learning models to predict the performance of SIMD codes. In contrast to existing approaches that have used high-level features of the program, we develop machine learning models based on features extracted from the generated assembly code. The models are trained offline on a number of benchmarks, and used at compile-time to discriminate between numerous possible vectorized variants generated from the input code. We demonstrate the effectiveness of the machine learning model by using it to guide automatic vectorization on a variety of tensor contraction kernels, with improvements ranging from 2× to 8× over Intel ICC's auto-vectorized code. We also evaluate the effectiveness of the model on a number of stencil computations and show good improvement over auto-vectorized code.
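The assembly-feature idea can be sketched as follows (the opcode lists and weights here are invented for illustration; the paper's features and trained models differ): count vector and scalar instructions in each compiled variant's assembly and score variants with a linear model, keeping the best-scoring one for actual execution.

```python
import re

# Small, illustrative subsets of x86 AVX/SSE mnemonics (not exhaustive).
VECTOR_OPS = re.compile(r"\b(vmulps|vaddps|vfmadd\w*|movaps|movups)\b")
SCALAR_OPS = re.compile(r"\b(mulss|addss|movss)\b")

def asm_features(asm_text):
    """Return (vector-op count, scalar-op count, instruction-line count)
    extracted from a variant's assembly listing."""
    lines = [l for l in asm_text.splitlines()
             if l.strip() and not l.strip().endswith(":")]   # skip labels
    return (len(VECTOR_OPS.findall(asm_text)),
            len(SCALAR_OPS.findall(asm_text)),
            len(lines))

def score(features, weights=(1.0, -0.5, -0.01)):
    """Higher is predicted-faster; real weights would come from offline
    training on benchmark measurements."""
    return sum(w * f for w, f in zip(weights, features))
```

A variant whose inner loop compiled to packed `vmulps`/`vaddps` instructions would outscore one that fell back to scalar `mulss`/`addss`, steering the search toward genuinely vectorized code.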