Results 1 - 10
of
27
Combined Selection of Tile Sizes and Unroll Factors Using Iterative Compilation
, 2000
"... Loop tiling and unrolling are two important program transformations to exploit locality and expose instruction level parallelism, respectively. However, these transformations are not independent and each can adversely affect the goal of the other. Furthermore, the best combination will vary drama ..."
Abstract
-
Cited by 78 (9 self)
- Add to MetaCart
Loop tiling and unrolling are two important program transformations to exploit locality and expose instruction level parallelism, respectively. However, these transformations are not independent and each can adversely affect the goal of the other. Furthermore, the best combination will vary dramatically from one processor to the next. In this paper, we therefore address the problem of how to select tile sizes and unroll factors simultaneously. We approach this problem in an architecturally adaptive manner by means of iterative compilation, where we generate many versions of a program and decide upon the best by actually executing them and measuring their execution time. We evaluate several iterative strategies based on genetic algorithms, random sampling and simulated annealing. We compare the levels of optimization obtained by iterative compilation to several well-known static techniques and show that we outperform each of them on a range of benchmarks across a variety of ar...
Evaluating Iterative Compilation
, 2002
"... This paper describes a platform independent optimisation approach based on feedback-directed program restructuring. We have developed two strategies that search the optimisation space by means of profiling to find the best possible program variant. These strategies have no a priori knowledge of the ..."
Abstract
-
Cited by 43 (10 self)
- Add to MetaCart
This paper describes a platform independent optimisation approach based on feedback-directed program restructuring. We have developed two strategies that search the optimisation space by means of profiling to find the best possible program variant. These strategies have no a priori knowledge of the target machine and can be run on any platform. In this paper our approach is evaluated on three full SPEC benchmarks, rather than the kernels evaluated in earlier studies where the optimisation space is relatively small. This approach was evaluated on six di#erent platforms, where it is shown that we obtain on average a 20.5% reduction in execution time compared to the native compiler with full optimisation. By using training data instead of reference data for the search procedure, we can reduce compilation time and still give on average a 16.5% reduction in time when running on reference data. We show that our approach is able to give similar significant reductions in execution time over a state of the art high level restructurer based on static analysis and a platform specific profile feedback directed compiler that employs the same transformations as our iterative system. 1.
Iterative optimization in the polyhedral model: Part I, one-dimensional time
- In IEEE/ACM Intl. Conf. on Code Generation and Optimization (CGO’07
, 2007
"... Emerging microprocessors offer unprecedented parallel computing capabilities and deeper memory hierarchies, increasing the importance of loop transformations in optimizing compilers. Because compiler heuristics rely on simplistic performance models, and because they are bound to a limited set of tra ..."
Abstract
-
Cited by 25 (6 self)
- Add to MetaCart
Emerging microprocessors offer unprecedented parallel computing capabilities and deeper memory hierarchies, increasing the importance of loop transformations in optimizing compilers. Because compiler heuristics rely on simplistic performance models, and because they are bound to a limited set of transformations sequences, they only uncover a fraction of the peak performance on typical benchmarks. Iterative optimization is a maturing framework to address these limitations, but so far, it was not successfully applied complex loop transformation sequences because of the combinatorics of the optimization search space. We focus on the class of loop transformation which can be expressed as one-dimensional affine schedules. We define a systematic exploration method to enumerate the space of all legal, distinct transformations in this class. This method is based on an upstream characterization, as opposed to state-of-the-art downstream filtering approaches. Our results demonstrate orders of magnitude improvements in the size of the search space and in the convergence speed of a dedicated iterative optimization heuristic. 1.
Fast and effective orchestration of compiler optimizations for automatic performance tuning
- In Proceedings of the International Symposium on Code Generation and Optimization (CGO
, 2006
"... Although compile-time optimizations generally improve program performance, degradations caused by individual techniques are to be expected. One promising research direction to overcome this problem is the development of dynamic, feedback-directed optimization orchestration algorithms, which automati ..."
Abstract
-
Cited by 23 (1 self)
- Add to MetaCart
Although compile-time optimizations generally improve program performance, degradations caused by individual techniques are to be expected. One promising research direction to overcome this problem is the development of dynamic, feedback-directed optimization orchestration algorithms, which automatically search for the combination of optimization techniques that achieves the best program performance. The challenge is to develop an orchestration algorithm that finds, in an exponential search space, a solution that is close to the best, in acceptable time. In this paper, we build such a fast and effective algorithm, called Combined Elimination (CE). The key advance of CE over existing techniques is that it takes the least tuning time (57% of the closest alternative), while achieving the same program performance. We conduct the experiments on both a Pentium IV machine and a SPARC II machine, by measuring performance of SPEC CPU2000 benchmarks under a large set of 38 GCC compiler options. Furthermore, through orchestrating a small set of optimizations causing the most degradation, we show that the performance achieved by CE is close to the upper bound obtained by an exhaustive search algorithm. The gap is less than 0.2 % on average. 1
Probabilistic source-level optimisation of embedded programs
- In Proceedings of the Conference on Languages, Compilers, and Tools for Embedded Systems (LCTES
, 2005
"... Efficient implementation of DSP applications is critical for many embedded systems. Optimising C compilers for embedded processors largely focus on code generation and instruction scheduling which, with their growing maturity, are providing diminishing returns. This paper empirically evaluates anoth ..."
Abstract
-
Cited by 22 (14 self)
- Add to MetaCart
Efficient implementation of DSP applications is critical for many embedded systems. Optimising C compilers for embedded processors largely focus on code generation and instruction scheduling which, with their growing maturity, are providing diminishing returns. This paper empirically evaluates another approach, namely source-level transformations and the probabilistic feedback-driven search for “good ” transformation sequences within a large optimisation space. This novel approach combines two selection methods: one based on exploring the optimisation space, the other focused on localised search of good areas. This technique was applied to the UTDSP benchmark suite on two digital signal and multimedia processors (Analog Devices TigerSHARC TS-101, Philips TriMedia TM-1100) and an embedded processor derived from a popular general-purpose processor architecture (Intel Celeron 400). On average, our approach gave a factor of 1.71 times improvement across all platforms equivalent to an average 41 % reduction in execution time, outperforming existing approaches. In certain cases a speedup of up to ≈ 7 was found for individual benchmarks.
The effect of cache models on iterative compilation for combined tiling and unrolling
, 2004
"... ..."
Exhaustive optimization phase order space exploration
- In The International Symposium on Code Generation and Optimization
, 2006
"... The phase-ordering problem is a long standing issue for compiler writers. Most optimizing compilers typically have numerous different code-improving phases, many of which can be applied in any order. These phases interact by enabling or disabling opportunities for other optimization phases to be app ..."
Abstract
-
Cited by 15 (4 self)
- Add to MetaCart
The phase-ordering problem is a long standing issue for compiler writers. Most optimizing compilers typically have numerous different code-improving phases, many of which can be applied in any order. These phases interact by enabling or disabling opportunities for other optimization phases to be applied. As a result, varying the order of applying optimization phases to a program can produce different code, with potentially significant performance variation amongst them. Complicating this problem further is the fact that there is no universal optimization phase order that will produce the best code, since the best phase order depends on the function being compiled, the compiler, and the target architecture characteristics. Moreover, finding the optimal optimization sequence for even a single function is hard
ABSTRACT COLE: Compiler Optimization Level Exploration
"... Modern compilers implement a large number of optimizations which all interact in complex ways, and which all have a different impact on code quality, compilation time, code size, energy consumption, etc. For this reason, compilers typically provide a limited number of standard optimization levels, s ..."
Abstract
-
Cited by 13 (2 self)
- Add to MetaCart
Modern compilers implement a large number of optimizations which all interact in complex ways, and which all have a different impact on code quality, compilation time, code size, energy consumption, etc. For this reason, compilers typically provide a limited number of standard optimization levels, such as-O1,-O2,-O3 and-Os, that combine various optimizations providing a number of trade-offs between multiple objective functions (such as code quality, compilation time and code size). The construction of these optimization levels, i.e., choosing which optimizations to activate at each level, is a manual process typically done using high-level heuristics based on the compiler developer’s experience. This paper proposes COLE, Compiler Optimization Level Exploration, a framework for automatically finding Pareto optimal optimization levels through multi-objective evolutionary searching. Our experimental results using GCC and the SPEC CPU benchmarks show that the automatic construction of optimization levels is feasible in practice, and in addition, yields better optimization levels than GCC’s manually derived (-Os,-O1,-O2 and-O3) optimization levels, as well as the optimization levels obtained through random sampling. We also demonstrate that COLE can be used to gain insight into the effectiveness of compiler optimizations as well as to better understand a benchmark’s sensitivity to compiler optimizations.
Fast and efficient searches for effective optimization-phase sequences
- ACM Trans. Archit. Code Optim
"... It has long been known that a fixed ordering of optimization phases will not produce the best code for every application. One approach for addressing this phase-ordering problem is to use an evolutionary algorithm to search for a specific sequence of phases for each module or function. While such se ..."
Abstract
-
Cited by 12 (5 self)
- Add to MetaCart
It has long been known that a fixed ordering of optimization phases will not produce the best code for every application. One approach for addressing this phase-ordering problem is to use an evolutionary algorithm to search for a specific sequence of phases for each module or function. While such searches have been shown to produce more efficient code, the approach can be extremely slow because the application is compiled and possibly executed to evaluate each sequence’s effectiveness. Consequently, evolutionary or iterative compilation schemes have been promoted for compilation systems targeting embedded applications where meeting strict constraints on execution time, code size, and power consumption is paramount and longer compilation times may be tolerated in the final stage of development, when an application is compiled one last time and embedded in a product. Unfortunately, even for small embedded applications, the search process can take many hours or even days making the approach less attractive to developers. In this paper, we describe two complementary general approaches for achieving faster searches for effective optimization sequences when using a genetic algorithm. The first approach reduces the search time by avoiding unnecessary executions of the application when possible. Results indicate search time reductions of 62%, on average, often reducing searches from hours to minutes. The second approach
Midatasets: Creating the conditions for a more realistic evaluation of iterative optimization
- In Proceedings of the International Conference on High Performance Embedded Architectures & Compilers (HiPEAC
, 2007
"... Abstract. Iterative optimization has become a popular technique to obtain improvements over the default settings in a compiler for performance-critical applications, such as embedded applications. An implicit assumption, however, is that the best configuration found for any arbitrary data set will w ..."
Abstract
-
Cited by 11 (7 self)
- Add to MetaCart
Abstract. Iterative optimization has become a popular technique to obtain improvements over the default settings in a compiler for performance-critical applications, such as embedded applications. An implicit assumption, however, is that the best configuration found for any arbitrary data set will work well with other data sets that a program uses. In this article, we evaluate that assumption based on 20 data sets per benchmark of the MiBench suite. We find that, though a majority of programs exhibit stable performance across data sets, the variability can significantly increase with many optimizations. However, for the best optimization configurations, we find that this variability is in fact small. Furthermore, we show that it is possible to find a compromise optimization configuration across data sets which is often within 5 % of the best possible configuration for most data sets, and that the iterative process can converge in less than 20 iterations (for a population of 200 optimization configurations). All these conclusions have significant and positive implications for the practical utilization of iterative optimization. 1

