Results 1 - 9 of 9
Annotation-Based Empirical Performance Tuning Using Orio
Cited by 8 (4 self)
In many scientific applications, significant time is spent tuning codes for a particular high-performance architecture. Tuning approaches range from the relatively nonintrusive (e.g., by using compiler options) to extensive code modifications that attempt to exploit specific architecture features. Intrusive techniques often result in code changes that are not easily reversible, which can negatively impact readability, maintainability, and performance on different architectures. We introduce an extensible annotation-based empirical tuning system called Orio, which is aimed at improving both performance and productivity by enabling software developers to insert annotations in the form of structured comments into their source code that trigger a number of low-level performance optimizations on a specified code fragment. To maximize the performance tuning opportunities, we have designed the annotation processing infrastructure to support both architecture-independent and architecture-specific code optimizations. Given the annotated code as input, Orio generates many tuned versions of the same operation and empirically evaluates the versions to select the best performing one for production use. We have also enabled the use of the PLuTo automatic parallelization tool in conjunction with Orio to generate efficient OpenMP-based parallel code. We describe our experimental results involving a number of computational kernels, including dense array and sparse matrix operations.
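The generate-and-measure loop described in this abstract can be sketched roughly as follows. This is only an illustration of empirical variant selection, not Orio's annotation syntax or API: the helper names `generate_variant` and `tune`, the axpy kernel, and the choice of unroll factor as the single tuning parameter are all assumptions made for the example.

```python
import timeit


def generate_variant(unroll):
    """Generate source for an axpy kernel unrolled `unroll` times and compile it.

    Hypothetical helper: it stands in for a code generator that emits one
    tuned version of the same operation per parameter value.
    """
    body = "\n".join(
        f"        y[i + {k}] += a * x[i + {k}]" for k in range(unroll)
    )
    src = (
        "def axpy(a, x, y):\n"
        "    n = len(x)\n"
        f"    main = n - n % {unroll}\n"
        f"    for i in range(0, main, {unroll}):\n"
        f"{body}\n"
        "    for i in range(main, n):\n"  # remainder iterations
        "        y[i] += a * x[i]\n"
        "    return y\n"
    )
    namespace = {}
    exec(src, namespace)
    return namespace["axpy"]


def tune(unroll_factors, n=1000, repeats=5):
    """Time every generated variant and return the fastest unroll factor."""
    x = [float(i) for i in range(n)]
    timings = {}
    for u in unroll_factors:
        variant = generate_variant(u)
        # Keep the minimum over several repeats to reduce timing noise.
        timings[u] = min(
            timeit.repeat(lambda: variant(2.0, x, [0.0] * n),
                          number=10, repeat=repeats)
        )
    best = min(timings, key=timings.get)
    return best, timings


if __name__ == "__main__":
    best, timings = tune([1, 2, 4, 8])
    print("fastest unroll factor on this machine:", best)
```

The selected factor depends on the machine and interpreter, which is the point of measuring rather than predicting.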
Automatic correction of loop transformations
In IEEE Intl. Conf. on Parallel Architectures and Compilation Techniques (PACT’07), 2007
Cited by 4 (3 self)
Loop nest optimization is a combinatorial problem. Due to the growing complexity of modern architectures, it involves two increasingly difficult tasks: (1) analyzing the profitability of sequences of transformations to enhance parallelism, locality, and resource usage, which amounts to a hard problem on a non-linear objective function; (2) the construction and exploration of the search space of legal transformation sequences. Practical optimizing and parallelizing compilers decouple these tasks, resorting to a predefined set of enabling transformations to eliminate all sorts of optimization-limiting semantical constraints. State-of-the-art optimization heuristics face a hard decision problem on the selection of enabling transformations only remotely related to performance. We propose a new design where optimization heuristics first address the main performance anomalies, then correct potentially illegal loop transformations a posteriori, attempting to minimize the performance impact of the necessary adjustments. We propose a general method to correct any sequence of loop transformations through a combination of loop shifting, code motion and index-set splitting. Sequences of transformations are modeled by compositions of geometric transformations on multidimensional affine schedules. We provide experimental evidence of the scalability of the algorithms on real loop optimizations.
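To make the idea of correcting an illegal transformation by loop shifting concrete, here is a small hand-worked sketch. It is not the paper's algorithm, which operates on affine schedules; the two statements, the function names, and the shift amount are assumptions chosen for illustration. Fusing the two loops directly breaks a dependence, and delaying the second statement by one iteration restores it.

```python
def original(b):
    """Two separate loops: S1 writes a[i], S2 reads a[i + 1]."""
    n = len(b)
    a = [0] * n
    c = [0] * (n - 1)
    for i in range(n):            # S1
        a[i] = b[i] + 1
    for i in range(n - 1):        # S2
        c[i] = a[i + 1] * 2
    return a, c


def fused_naive(b):
    """Direct fusion: illegal, S2 reads a[i + 1] before S1 has written it."""
    n = len(b)
    a = [0] * n
    c = [0] * (n - 1)
    for i in range(n):
        a[i] = b[i] + 1
        if i < n - 1:
            c[i] = a[i + 1] * 2   # reads the stale initial value 0
    return a, c


def fused_shifted(b):
    """Fusion corrected by shifting S2 one iteration later."""
    n = len(b)
    a = [0] * n
    c = [0] * (n - 1)
    for i in range(n):
        a[i] = b[i] + 1
        if i >= 1:
            c[i - 1] = a[i] * 2   # instance i - 1 of S2, now after its producer
    return a, c


if __name__ == "__main__":
    data = [1, 2, 3, 4]
    print(original(data) == fused_shifted(data))
    print(original(data) == fused_naive(data))
```

The guard `if i >= 1` plays the role that index-set splitting or peeling would play in generated code: the first iteration has no shifted instance of S2 to run.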
Split Compilation: an Application to Just-in-Time Vectorization
2007
Cited by 3 (2 self)
In a world of ubiquitous, heterogeneous parallelism, achieving portable performance is a challenge. It requires finely tuned coordination, from the programming language to the hardware, through the compiler and multiple layers of the run-time system. This document presents our work in split compilation and parallelization. Split compilation relies on automatically generated semantical annotations to enrich the intermediate format, decoupling costly offline analyses from lighter, online or just-in-time program transformations. Our work focuses on automatic vectorization, a key optimization playing an increasing role in modern, power-efficient architectures. Our research platform uses GCC’s support for the Common Language Infrastructure (CLI ECMA-335 [8] and ISO 23271:2006 [10]); this choice is motivated by the unique combination of optimizations and portability of GCC, through a semantically rich and performance-friendly intermediate format. Implementation is still in progress.
Iterative Compilation with Kernel Exploration
Cited by 1 (0 self)
The increasing complexity of hardware mechanisms for recent processors makes high performance code generation very challenging. One of the main issues for high performance is the optimization of memory accesses. General purpose compilers, with no knowledge of the application context and an approximate memory model, seem inappropriate for this task. Combining application-dependent optimizations on the source code with exploration of optimization parameters, as is done in ATLAS, has been shown to be one way to improve performance. Yet hand-tuned codes such as those in the MKL library still outperform ATLAS by a significant margin, and some effort is needed to bridge the gap between the performance obtained by automatic and manual optimizations. In this paper, a new iterative compilation approach for the generation of high performance codes is proposed. Unlike ATLAS, this approach is not application-dependent. The idea is to separate the memory optimization phase from the computation optimization phase. The first step automatically finds all possible decompositions of the code into kernels. With datasets that fit into the cache and simplified memory accesses, these kernels are simpler to optimize, either with the compiler, at source level, or with a dedicated code generator. The best decomposition is then found by a model-guided approach, performing on the source code the required memory optimizations. Exploration of optimization sequences and their parameters is achieved with a meta-compilation language, X language. The first results on linear algebra codes for Itanium show that the performance obtained reduces the gap with that of highly optimized hand-tuned codes.
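The separation between a memory optimization phase and a computation phase can be pictured with a tiled matrix multiply, sketched below. This is an assumption-laden illustration rather than the paper's decomposition method or its X language: the outer loops only pick tiles small enough to stay in cache, and a separate `tile_kernel` function (a hypothetical name) does the arithmetic on one tile and could be optimized on its own.

```python
def matmul_reference(A, B):
    """Plain triple loop, used as the correctness baseline."""
    n, m, p = len(A), len(B), len(B[0])
    C = [[0.0] * p for _ in range(n)]
    for i in range(n):
        for k in range(m):
            for j in range(p):
                C[i][j] += A[i][k] * B[k][j]
    return C


def tile_kernel(A, B, C, i0, i1, k0, k1, j0, j1):
    """Computation phase: multiply one tile with simple, regular accesses."""
    for i in range(i0, i1):
        for k in range(k0, k1):
            aik = A[i][k]
            for j in range(j0, j1):
                C[i][j] += aik * B[k][j]


def matmul_tiled(A, B, tile=2):
    """Memory phase: walk the iteration space tile by tile, then call the kernel."""
    n, m, p = len(A), len(B), len(B[0])
    C = [[0.0] * p for _ in range(n)]
    for i0 in range(0, n, tile):
        for k0 in range(0, m, tile):
            for j0 in range(0, p, tile):
                tile_kernel(A, B, C,
                            i0, min(i0 + tile, n),
                            k0, min(k0 + tile, m),
                            j0, min(j0 + tile, p))
    return C


if __name__ == "__main__":
    A = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
    B = [[1.0, 0.0], [0.0, 1.0], [2.0, 3.0]]
    print(matmul_tiled(A, B) == matmul_reference(A, B))
```

In a real tuning setting the tile size would be one of the explored parameters, and the kernel would be the unit handed to the compiler or a dedicated code generator.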
Automatic Library Generation for BLAS3 on GPUs
Cited by 1 (0 self)
High-performance libraries, the performance-critical building blocks for high-level applications, will assume greater importance on modern processors as they become more complex and diverse. However, automatic library generators are still immature, forcing library developers to manually tune libraries to meet their performance objectives. We are developing a new script-controlled compilation framework to help domain experts reduce much of the tedious and error-prone nature of manual tuning, by enabling them to leverage their expertise and reuse past optimization experiences. We focus on demonstrating improved performance and productivity obtained through using our framework to tune BLAS3 routines on three GPU platforms: up to 5.4x speedups over CUBLAS on NVIDIA GeForce 9800, 2.8x on GTX285, and 3.4x on Fermi Tesla C2050. Our results highlight the potential benefits of exploiting domain expertise and the relations between different routines (in terms of their algorithms and data structures).
Automated Transformation for Performance-Critical Kernels
The performance of many scientific applications depends on a small number of key computational kernels which require a level of efficiency rarely satisfied by existing native compilers. We present a new approach to high performance kernel optimization, where a general-purpose transformation engine automates the production of highly efficient library routines. The library routines are then empirically tested until an implementation with a satisfactory performance level is found. Our framework requires an annotated kernel specification and can automatically produce optimized implementations based on tuning parameters controlled by a search driver. The transformation engine includes an extensive suite of optimizations which can be easily expanded using a custom transformation language. We have applied our framework to generate code for key linear algebra kernels and have achieved performance similar to that of ATLAS’s highly tuned kernels. In several cases, our kernels were faster than ATLAS’s native kernels; we have made these kernels available to ATLAS, which results in speedups for the ATLAS library, as we show.
Extendable Pattern-Oriented Optimization Directives
Current programming models and compiler technologies for multi-core processors do not exploit well the performance benefits obtainable by applying algorithm-specific, i.e., semantic-specific optimizations to a particular application. In this work, we propose a pattern-making methodology that allows algorithm-specific optimizations to be encapsulated into “optimization patterns” that are expressed in terms of pre-processor directives so that simple annotations can result in significant performance improvements. To validate this new methodology, a framework, named EPOD, is developed to map such directives to the underlying optimization schemes. We have identified and implemented a number of optimization patterns for three representative computer platforms. Our experimental results show that a pattern-guided compiler can outperform the state-of-the-art compilers and even achieve performance as competitive as hand-tuned code. Thus, such a pattern-making methodology represents an encouraging direction for domain experts’ experience and knowledge to be integrated into general-purpose compilers.
Tools for Performance Optimizations and Tuning . . .
2009
Multicore processors have become mainstream and the number of cores in a chip will continue to increase every year. Programming these architectures to effectively exploit their very high computation power is a nontrivial task. First, an application program needs to be explicitly restructured using a set of code transformation techniques to optimize for specific architectural features, especially for parallelism and data locality. Then a significant amount of time is spent on tuning the optimized code to find the best optimization parameter values. However, high performance often means lower productivity as the optimized codes become difficult to understand, maintain and modify. In this dissertation, we present techniques to address these issues by automatic generation of efficient parallel programs, and by the use of empirical search for tuning. The research from this dissertation has been implemented and made publicly available as two useful software tools: one for parameterized tiled loop generation, and one for empirical performance tuning using annotations. Tiling is a critical loop transformation that optimizes both for data locality enhancement

