Results 11 - 20 of 22
Extendable Pattern-Oriented Optimization Directives
Abstract
Cited by 4 (2 self)
Current programming models and compiler technologies for multi-core processors do not fully exploit the performance benefits obtainable by applying algorithm-specific (i.e., semantic-specific) optimizations to a particular application. In this work, we propose a pattern-making methodology that allows algorithm-specific optimizations to be encapsulated into “optimization patterns” expressed as pre-processor directives, so that simple annotations can yield significant performance improvements. To validate this new methodology, a framework named EPOD is developed to map such directives to the underlying optimization schemes. We have identified and implemented a number of optimization patterns for three representative computer platforms. Our experimental results show that a pattern-guided compiler can outperform state-of-the-art compilers and even achieve performance competitive with hand-tuned code. Thus, such a pattern-making methodology represents an encouraging direction for integrating domain experts’ experience and knowledge into general-purpose compilers.
Iterative Compilation with Kernel Exploration
Abstract
Cited by 1 (0 self)
Abstract. The increasing complexity of hardware mechanisms in recent processors makes high-performance code generation very challenging. One of the main issues for high performance is the optimization of memory accesses. General-purpose compilers, with no knowledge of the application context and only an approximate memory model, seem inappropriate for this task. Combining application-dependent optimizations on the source code with exploration of optimization parameters, as done in ATLAS, has been shown to be one way to improve performance. Yet hand-tuned codes such as those in the MKL library still outperform ATLAS by a significant margin, and some effort has to be made to bridge the gap between the performance obtained by automatic and manual optimizations. In this paper, a new iterative compilation approach for the generation of high-performance codes is proposed. Unlike ATLAS, this approach is not application-dependent. The idea is to separate the memory optimization phase from the computation optimization phase. The first step automatically finds all possible decompositions of the code into kernels. With datasets that fit into the cache and simplified memory accesses, these kernels are simpler to optimize, either with the compiler, at source level, or with a dedicated code generator. The best decomposition is then found by a model-guided approach, performing the required memory optimizations on the source code. Exploration of optimization sequences and their parameters is achieved with a meta-compilation language, X language. First results on linear algebra codes for Itanium show that the performance obtained narrows the gap with that of highly optimized hand-tuned codes.
Support of Collective Effort Towards Performance Portability
Abstract
Performance portability, in the sense that a single source can run with good performance across a wide variation of parallel hardware platforms, is strongly desired by industry and actively being researched. However, evidence is mounting that performance portability cannot be realized at just the toolchain level, just the runtime level, or just the hardware abstraction level. This is a position paper, making a suggestion for how the groups involved can more efficiently solve the performance portability problem together. We do not propose a solution itself, but rather a support system for the players to self-organize and collectively find one. The support system is based on a new extendable virtualization mechanism called VMS (Virtualized Master-Slave), which fulfills the needs of an organizing principle and provides focus that may increase research efficiency. The difficult work will be the ongoing research efforts on parallel language design, compilers, source-to-source transformation tools, binary optimization, run-time schedulers, and hardware support for parallelism. Although it does not in itself solve the problem, such an organizing principle may be a valuable step towards a solution: the problem may be too complex, and require the cooperation of too many real-world entities, for a single-entity solution. We briefly review VMS and illustrate how it could be used to give rise to an ecosystem in which performance portability is collectively realized. To support the suggestion, we give measurements of the time to implement three parallelism-construct libraries, and performance numbers for them, along with measurements of the basic overhead of VMS.
Extendable Pattern-Oriented Optimization Directives
Abstract
Algorithm-specific, i.e., semantic-specific optimizations have been observed to bring significant performance gains, especially for a diverse set of multi/many-core architectures. However, current programming models and compiler technologies for the state-of-the-art architectures do not fully exploit these performance opportunities. In this paper, we propose a pattern-making methodology that enables algorithm-specific optimizations to be encapsulated into “optimization patterns”. Such optimization patterns are expressed in terms of preprocessor directives so that simple annotations can result in significant performance improvements. To validate this new methodology, a framework, named EPOD, is developed to map these directives into the underlying optimization schemes for a particular architecture. It is difficult to create an exact performance model to determine an optimal or near-optimal optimization scheme (including which optimizations to apply and in which order) for a specific application, due to the complexity of applications and architectures. However, it is tractable to build individual optimization components and let compiler developers synthesize an optimization scheme from these components. Therefore, our EPOD framework provides an Optimization Programming Interface (OPI) for compiler developers to define new optimization schemes. Thus, new patterns can be integrated into EPOD in a flexible manner. We have identified and implemented a number of optimization patterns for three representative computer platforms ...
Doctoral thesis, Université de Versailles Saint-Quentin-en-Yvelines
"... Offloading of different phases of ..."
Generating Empirically Optimized Numerical Software from MATLAB Prototypes
, 2008
Abstract
The growing demand for higher levels of detail and accuracy in results means that the size and complexity of scientific computations is increasing at least as fast as the improvements in processor technology. Programming scientific applications is hard, and optimizing them for high performance is even harder. The development of optimized codes requires extensive knowledge, not only of the costs of floating-point arithmetic but also of memory access issues and compiler optimizations. Experiments show that the complexity of this hardware-software system means that performance is difficult to predict fully. Therefore, computational scientists are often forced to choose between investing too much time in tuning code or accepting performance that is significantly lower than the best achievable performance on a given architecture. In this paper, we describe the first steps toward a fully automated system for the optimization of the matrix algebra kernels that are a foundational part of many scientific applications. To generate highly optimized code from a high-level MATLAB prototype, we define a three-step approach. To begin, we have developed a compiler that converts a MATLAB script into simple C code. We then use the polyhedral optimization system PLuTo to optimize that code for coarse-grained parallelism and locality simultaneously. Finally, we annotate the resulting code with performance tuning directives and ...
Habilitation à Diriger des Recherches (habilitation to supervise research), Université de Versailles St-Quentin-en-Yvelines
, 2011
Abstract
Contributions to code optimization and to the generation of high-performance libraries. Defended on 18 February 2008 before a jury composed of: ...
Deliverable 5.5 - Methodology for Performance Analysis and Code Optimization with Experimental Search and Performance Model for Adaptive Optimization
Abstract
Abstract. The increasing complexity of hardware features in modern processors makes compilation for high performance very challenging. Performance models used by compilers are too simple to take this complexity into account and choose, accordingly, the most effective optimization sequence. Adaptive compilation is now a widespread approach relying on an exploration of optimization sequences or compiler flags, and on code execution in order to evaluate performance precisely. The main drawback of this approach is its very high cost, which is partially addressed by efficient search techniques based on genetic or machine learning algorithms. This paper presents a novel approach for adaptive compilation, relying on performance evaluation of only fragments of the code, named constant-performance codelets, and on a simple performance model. The search for transformations leading to these codelets is user-defined through pragmas. We show on three large applications (two numerical simulations and a genomic application) that the performance prediction for the optimized function is quite accurate and that substantial speed-ups can be reached on the Itanium 2 architecture.
Combining Experimental Search and Performance Model for Adaptive Optimization
Abstract
Abstract. The increasing complexity of hardware features in modern processors makes compilation for high performance very challenging. Performance models used by compilers are too simple to take this complexity into account and choose, accordingly, the most effective optimization sequence. Adaptive compilation is now a widespread approach relying on an exploration of optimization sequences or compiler flags, and on code execution in order to evaluate performance precisely. The main drawback of this approach is its very high cost, which is partially addressed by efficient search techniques based on genetic or machine learning algorithms. This paper presents a novel approach for adaptive compilation, relying on performance evaluation of only fragments of the code, named constant-performance codelets, and on a simple performance model. The search for transformations leading to these codelets is user-defined through pragmas. We show on three large applications (two numerical simulations and a genomic application) that the performance prediction for the optimized function is quite accurate and that substantial speed-ups can be reached on the Itanium 2 architecture.