Results 1 - 10 of 22
Rapidly selecting good compiler optimizations using performance counters
- In Proceedings of the 5th Annual International Symposium on Code Generation and Optimization (CGO)
, 2007
"... Applying the right compiler optimizations to a particular program can have a significant impact on program performance. Due to the non-linear interaction of compiler optimizations, however, determining the best setting is nontrivial. There have been several proposed techniques that search the space ..."
Cited by 66 (24 self)
Applying the right compiler optimizations to a particular program can have a significant impact on program performance. Due to the non-linear interaction of compiler optimizations, however, determining the best setting is nontrivial. There have been several proposed techniques that search the space of compiler options to find good solutions; however, such approaches can be expensive. This paper proposes a different approach using performance counters as a means of determining good compiler optimization settings. This is achieved by learning a model off-line which can then be used to determine good settings for any new program. We show that such an approach outperforms the state-of-the-art and is two orders of magnitude faster on average. Furthermore, we show that our performance-counter-based approach outperforms techniques based on static code features. Finally, we show that such improvements are stable across varying input data sets. Using our technique we achieve a 10% improvement over the highest optimization setting of the commercial PathScale EKOPath 2.3.1 optimizing compiler on the SPEC benchmark suite on a recent AMD Athlon 64 3700+ platform in just three evaluations.
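A minimal sketch of the general approach this abstract describes, under assumed details: the counter names, the tiny offline database, and the flag strings below are hypothetical placeholders, not the paper's actual features, model, or settings.

```python
# Hypothetical sketch: predict good optimization flags from performance counters
# using a nearest-neighbor lookup into an offline-trained database.
import math

# Offline "model": counter signatures of training programs -> best flags found for each.
TRAINING_DB = [
    ({"l1d_miss_rate": 0.021, "branch_miss_rate": 0.004, "ipc": 1.8},
     "-O3 -funroll-loops -fprefetch-loop-arrays"),
    ({"l1d_miss_rate": 0.002, "branch_miss_rate": 0.031, "ipc": 0.9},
     "-O2 -fno-strict-aliasing"),
]

def normalize(sig):
    """Turn a counter dictionary into a unit-length feature vector (fixed key order)."""
    keys = sorted(sig)
    norm = math.sqrt(sum(sig[k] ** 2 for k in keys)) or 1.0
    return [sig[k] / norm for k in keys]

def predict_flags(new_counters, k=1):
    """Return the flag string of the nearest training program in counter space."""
    target = normalize(new_counters)
    def dist(entry):
        vec = normalize(entry[0])
        return sum((a - b) ** 2 for a, b in zip(vec, target))
    nearest = sorted(TRAINING_DB, key=dist)[:k]
    return nearest[0][1]

# One profiling run of the new program yields its counter signature; the model then
# proposes a handful of candidate settings to evaluate.
print(predict_flags({"l1d_miss_rate": 0.018, "branch_miss_rate": 0.005, "ipc": 1.7}))
```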
CHiLL: A framework for composing high-level loop transformations
, 2008
"... This paper describes a general and robust loop transformation framework that enables compilers to generate efficient code on complex loop nests. Despite two decades of prior research on loop optimization, performance of compiler-generated code often falls short of manually optimized versions, even f ..."
Cited by 38 (5 self)
This paper describes a general and robust loop transformation framework that enables compilers to generate efficient code on complex loop nests. Despite two decades of prior research on loop optimization, the performance of compiler-generated code often falls short of manually optimized versions, even for some well-studied BLAS kernels. There are two primary reasons for this. First, today’s compilers employ fixed transformation strategies, making it difficult to adapt to different optimization requirements for different application codes. Second, code transformations are treated in isolation, not taking into account the interactions between different transformations. This paper addresses such limitations in a unified framework that supports a broad collection of transformations (permutation, tiling, unroll-and-jam, data copying, iteration space splitting, fusion, distribution and others), which go beyond existing polyhedral transformation models. This framework is a key element of a compiler we are developing which performs empirical optimization to evaluate a collection of alternative optimized variants of a code segment. A script interface to code generation and empirical search permits transformation parameters to be adjusted independently and tested; alternative scripts are used to represent different code variants. By applying this framework to example codes, we show performance results on automatically generated code for the Pentium M and MIPS R10000 that are comparable to the best hand-tuned codes, and significantly better (up to a 14x speedup) than the native compilers.
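A rough illustration of the script-driven empirical search idea, not CHiLL's actual script interface (its commands and syntax are not reproduced here): each candidate parameter setting stands in for one transformation script, and the fastest generated variant wins.

```python
# Hypothetical sketch of script-driven empirical search over code variants.
# The parameter names, build step, and timing harness are placeholders.
import itertools
import time

def build_variant(tile, unroll):
    """Stand-in for running a transformation script with the given parameters
    and compiling the result; returns a path to the generated binary."""
    return f"./variant_t{tile}_u{unroll}"  # placeholder path

def time_variant(binary):
    """Stand-in for executing and timing the candidate binary."""
    start = time.perf_counter()
    # a real driver would run the generated binary here (e.g. via subprocess)
    return time.perf_counter() - start

def empirical_search(tiles=(16, 32, 64, 128), unrolls=(1, 2, 4)):
    best = None
    for tile, unroll in itertools.product(tiles, unrolls):
        elapsed = time_variant(build_variant(tile, unroll))
        if best is None or elapsed < best[0]:
            best = (elapsed, tile, unroll)
    return best  # (time, tile, unroll) of the fastest variant tried

print(empirical_search())
```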
Automatic performance model construction for the fast software exploration of new hardware designs
- In ACM International Conference on Compilers, Architecture and Synthesis for Embedded Systems
, 2006
"... ..."
(Show Context)
Predictive Modeling in a Polyhedral Optimization Space
- In International Symposium on Code Generation and Optimization (CGO'11)
, 2011
"... Significant advances in compiler optimization have been made in recent years, enabling many transformations such as tiling, fusion, parallelization and vectorization on imperfectly nested loops. Nevertheless, the problem of finding the best combination of loop transformations remains a major challen ..."
Cited by 18 (3 self)
Significant advances in compiler optimization have been made in recent years, enabling many transformations such as tiling, fusion, parallelization and vectorization on imperfectly nested loops. Nevertheless, the problem of finding the best combination of loop transformations remains a major challenge. Polyhedral models for compiler optimization have demonstrated strong potential for enhancing program performance, in particular for compute-intensive applications. But existing static cost models to optimize polyhedral transformations have significant limitations, and iterative compilation has become a very promising alternative for finding the most effective transformations. Since the number of polyhedral optimization alternatives can be enormous, however, it is often impractical to iterate over a significant fraction of the entire space of polyhedrally transformed variants. Recent research has focused on iterating over this search space either with manually constructed heuristics or with automatic but very expensive search algorithms (e.g., genetic algorithms) that can eventually find good points in the polyhedral space. In this paper, we propose the use of machine learning to address the problem of selecting the best polyhedral optimizations. We show that these models can quickly find high-performance program variants in the polyhedral space, without resorting to extensive empirical search. We introduce models that take as input a characterization of a program based on its dynamic behavior, and predict the performance of aggressive high-level polyhedral transformations that include tiling, parallelization and vectorization. We allow for a minimal empirical search on the target machine, discovering on average 83% of the search-space-optimal combinations in at most 5 runs. Our end-to-end framework is validated using numerous benchmarks on two multi-core platforms.
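A hedged sketch of the kind of learned predictor described above: dynamic features plus an encoding of a transformation sequence go in, a predicted speedup comes out, and only the top-ranked candidates are actually run. The feature names, toy training data, and the choice of a random-forest regressor are illustrative assumptions, not the paper's model.

```python
# Hypothetical sketch: rank candidate transformation sequences with a learned model,
# then empirically test only the few best predictions on the target machine.
from sklearn.ensemble import RandomForestRegressor

# Offline training data: [cache miss rate, ipc, vector ratio, tile?, parallelize?, vectorize?]
X_train = [
    [0.02, 1.7, 0.4, 1, 1, 0],
    [0.02, 1.7, 0.4, 1, 1, 1],
    [0.10, 0.8, 0.1, 0, 1, 0],
]
y_train = [2.1, 3.4, 1.2]  # measured speedups over the baseline (toy values)

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_train, y_train)

def rank_candidates(program_features, candidates, top_k=5):
    """Return the top_k transformation encodings by predicted speedup."""
    scored = [(model.predict([program_features + list(c)])[0], c) for c in candidates]
    return [c for _, c in sorted(scored, reverse=True)[:top_k]]

# Enumerate a tiny candidate space of (tile, parallelize, vectorize) choices.
candidates = [(t, p, v) for t in (0, 1) for p in (0, 1) for v in (0, 1)]
print(rank_candidates([0.03, 1.5, 0.3], candidates, top_k=3))
```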
Analytical bounds for optimal tile size selection
- In Proc. 21st Int. Conf. on Compiler Construction, CC’12
, 2012
"... Abstract. In this paper, we introduce a novel approach to guide tile size se-lection by employing analytical models to limit empirical search within a sub-space of the full search space. Two analytical models are used together: 1) an existing conservative model, based on the data footprint of a tile ..."
Cited by 6 (1 self)
In this paper, we introduce a novel approach to guide tile size selection by employing analytical models to limit empirical search within a subspace of the full search space. Two analytical models are used together: 1) an existing conservative model, based on the data footprint of a tile, which ignores intra-tile cache block replacement, and 2) an aggressive new model that assumes optimal cache block replacement within a tile. Experimental results on multiple platforms demonstrate the practical effectiveness of the approach by reducing the search space for the optimal tile size by 1,307× to 11,879× for an Intel Core 2 Quad system; 358× to 1,978× for an Intel Nehalem system; and 45× to 1,142× for an IBM Power7 system. The execution of rectangularly tiled code tuned by a search of the subspace identified by our model achieves speedups of up to 1.40× (Intel Core 2 Quad), 1.28× (Nehalem) and 1.19× (Power7) relative to the best possible square tile sizes on these different processor architectures. We also demonstrate the integration of the analytical bounds with existing search optimization algorithms. Our approach not only reduces the total search time of the Nelder-Mead Simplex and Parallel Rank Ordering methods by factors of up to 4.95× and 4.33×, respectively, but also finds better tile sizes that yield higher performance in the tuned tiled code.
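A toy sketch of how two analytical cache-capacity models can bound a tile-size search subspace; the footprint formulas and cache parameters below are simplified stand-ins for the paper's actual models.

```python
# Hypothetical sketch: two cache models bracket the interesting tile sizes for a
# square T x T tiling of a three-operand kernel (e.g. matrix multiply).
CACHE_BYTES = 32 * 1024  # example L1 data cache capacity
ELEM = 8                 # bytes per double-precision element

def conservative_footprint(t):
    # ignores intra-tile replacement: all three operand tiles must fit at once
    return 3 * t * t * ELEM

def optimistic_footprint(t):
    # assumes optimal replacement: roughly one tile plus a row and a column of the others
    return (t * t + 2 * t) * ELEM

def tile_size_bounds(max_t=512):
    """Tile sizes accepted by the conservative model are safe; sizes rejected even by
    the optimistic model are hopeless; the empirical search runs between the two bounds."""
    safe = [t for t in range(4, max_t) if conservative_footprint(t) <= CACHE_BYTES]
    plausible = [t for t in range(4, max_t) if optimistic_footprint(t) <= CACHE_BYTES]
    return max(safe), max(plausible)

print(tile_size_bounds())  # e.g. search only tile sizes between these two values
```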
Combining analytical and empirical approaches in tuning matrix transposition
- In Proc. Parallel Architectures and Compilation Techniques (PACT)
, 2006
"... Matrix transposition is an important kernel used in many applications. Even though its optimization has been the subject of many studies, an optimization procedure that targets the characteristics of current processor architectures has not been developed. In this paper, we develop an integrated opti ..."
Cited by 6 (1 self)
Matrix transposition is an important kernel used in many applications. Even though its optimization has been the subject of many studies, an optimization procedure that targets the characteristics of current processor architectures has not been developed. In this paper, we develop an integrated optimization framework that addresses a number of issues, including tiling for the memory hierarchy, effective handling of memory misalignment, utilizing memory subsystem characteristics, and the exploitation of the parallelism provided by the vector instruction sets in current processors. A judicious combination of analytical and empirical approaches is used to determine the most appropriate optimizations. The absence of problem information until execution time is handled by generating multiple versions of the code; the best version is chosen at runtime, with assistance from minimal-overhead inspectors. The approach highlights aspects of empirical optimization that are important for similar computations with little temporal reuse. Experimental results on the PowerPC G5 and Intel Pentium 4 demonstrate the effectiveness of the developed framework.
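A minimal sketch of the multi-versioning idea, assuming an alignment-based inspector; the two variants are NumPy stand-ins for the paper's hand-vectorized transpose kernels, and the dispatch criterion is illustrative.

```python
# Hypothetical sketch: generate multiple code versions, then let a minimal-overhead
# runtime inspector pick the version that matches the actual input.
import numpy as np

def transpose_aligned(a):
    """Stand-in for the variant tuned for aligned, vector-friendly data."""
    return np.ascontiguousarray(a.T)

def transpose_general(a):
    """Stand-in for the fallback variant that handles arbitrary layouts."""
    return a.T.copy()

def transpose(a, vector_bytes=16):
    """Inspector: check properties known only at run time (base alignment,
    unit-stride rows) and dispatch to the matching variant."""
    aligned = (a.ctypes.data % vector_bytes == 0) and (a.strides[-1] == a.itemsize)
    return transpose_aligned(a) if aligned else transpose_general(a)

print(transpose(np.arange(12.0).reshape(3, 4)))
```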
Intelligent Compilers
"... Abstract—The industry is now in agreement that the future of architecture design lies in multiple cores. As a consequence, all computer systems today, from embedded devices to petascale computing systems, are being developed using multicore processors. Although researchers in industry and academia a ..."
Cited by 4 (0 self)
The industry is now in agreement that the future of architecture design lies in multiple cores. As a consequence, all computer systems today, from embedded devices to petascale computing systems, are being developed using multicore processors. Although researchers in industry and academia are exploring many different multicore hardware design choices, most agree that developing portable software that achieves high performance on multicore processors is a major unsolved problem. We now see a plethora of architectural features, with little consensus on how the computation, memory, and communication structures in multicore systems will be organized. The wide disparity in hardware systems available has made it nearly impossible to write code that is portable in functionality while still taking advantage of the performance potential of each system. In this paper, we propose exploring the viability of developing intelligent compilers, focusing on key components that will allow application portability while still achieving high performance.
Compiler-Assisted Performance Tuning
- In Proceedings of SciDAC 2007, Journal of Physics: Conference Series
, 2007
"... ..."
(Show Context)
Model-Guided Empirical Optimization for Multimedia Extension Architectures: A Case Study
- Proceedings of the Workshop on Performance Optimization of High-Level Languages
, 2007
"... Compiler technology for multimedia extensions must effectively utilize not only the SIMD compute engines but also the various levels of the memory hierarchy: superword registers, multi-level caches and TLB. In this paper, we describe a compiler that combines optimization across all levels of the mem ..."
Cited by 2 (1 self)
Compiler technology for multimedia extensions must effectively utilize not only the SIMD compute engines but also the various levels of the memory hierarchy: superword registers, multi-level caches and the TLB. In this paper, we describe a compiler that combines optimization across all levels of the memory hierarchy with automatic generation of SIMD code for multimedia extensions. At the high level, model-guided empirical optimization is used to transform code to optimize for all levels of the memory hierarchy. This compiler interacts with a backend compiler exploiting superword-level parallelism that takes sequential code as input and produces SIMD code. This paper discusses how we have combined these technologies into a single framework. Through a case study with matrix multiply, we observe performance results that outperform the hand-tuned Intel MKL library, and achieve performance that is within 4% of the ATLAS self-tuning library with architectural defaults and more than 4x faster than the native Intel compiler.
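To make the high-level memory-hierarchy transformation concrete, here is a sketch of a cache-tiled matrix multiply; the tile size is a placeholder that a model-guided empirical search would choose, and the real compiler would additionally generate SIMD code for the inner loops.

```python
# Hypothetical sketch: cache tiling of matrix multiply, the kind of high-level
# loop transformation the memory-hierarchy optimizer would apply.
import numpy as np

def matmul_tiled(a, b, tile=64):
    """Blocked matrix multiply; each block triple is sized to stay cache-resident."""
    n, k = a.shape
    k2, m = b.shape
    assert k == k2
    c = np.zeros((n, m), dtype=a.dtype)
    for ii in range(0, n, tile):
        for kk in range(0, k, tile):
            for jj in range(0, m, tile):
                # operate on cache-resident sub-blocks of A, B, and C
                c[ii:ii+tile, jj:jj+tile] += a[ii:ii+tile, kk:kk+tile] @ b[kk:kk+tile, jj:jj+tile]
    return c

a = np.random.rand(128, 96)
b = np.random.rand(96, 160)
assert np.allclose(matmul_tiled(a, b), a @ b)  # sanity check against the reference result
```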