Evaluating heuristic optimization phase order search algorithms
- In Proceedings of the International Symposium on Code Generation and Optimization (CGO’07), 2007
- Cited by 7 (2 self)
Program-specific or function-specific optimization phase sequences are universally accepted to achieve better overall performance than any fixed optimization phase ordering. A number of heuristic phase order space search algorithms have been devised to find customized phase orderings achieving high performance for each function. However, to make this approach of iterative compilation more widely accepted and deployed in mainstream compilers, it is essential to modify existing algorithms, or develop new ones, that find near-optimal solutions quickly. As a step in this direction, in this paper we attempt to identify and understand the important properties of some commonly employed heuristic search methods by using information collected during an exhaustive exploration of the phase order search space. We compare the performance obtained by each algorithm with all others, as well as with the optimal phase ordering performance. Finally, we show how we can use the features of the phase order space to improve existing algorithms as well as devise new and better-performing search algorithms.
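The comparison described in this abstract, a heuristic search measured against the optimum found by exhaustive enumeration, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the phase names, the fixed sequence length, and the `toy_cost` function are hypothetical stand-ins, and the hill climber is a generic heuristic rather than one of the algorithms evaluated in the paper.

```python
import itertools

PHASES = "abcd"   # hypothetical phase names
SEQ_LEN = 4       # hypothetical fixed sequence length


def toy_cost(seq):
    """Stand-in for a measured cost such as a dynamic instruction count."""
    cost = 100
    for i, phase in enumerate(seq):
        prev = seq[i - 1] if i > 0 else None
        if phase == "a" and prev != "a":
            cost -= 10   # 'a' helps unless it repeats back to back
        if phase == "c" and prev == "b":
            cost -= 15   # 'c' only pays off right after 'b'
        if phase == "d" and prev == "c":
            cost += 5    # 'd' right after 'c' hurts
    return cost


def exhaustive_best():
    """Enumerate every sequence and return one with minimal cost."""
    return min(itertools.product(PHASES, repeat=SEQ_LEN), key=toy_cost)


def hill_climb(start, max_steps=100):
    """Move to the best single-position change until no change improves."""
    current = tuple(start)
    for _ in range(max_steps):
        neighbors = [
            current[:i] + (p,) + current[i + 1:]
            for i in range(SEQ_LEN)
            for p in PHASES
            if p != current[i]
        ]
        best = min(neighbors, key=toy_cost)
        if toy_cost(best) >= toy_cost(current):
            break   # local minimum reached
        current = best
    return current


optimal = exhaustive_best()
found = hill_climb("dddd")
gap = toy_cost(found) - toy_cost(optimal)   # 0 means the heuristic hit the optimum
```

Exhaustive enumeration is only feasible here because the toy space has 4^4 sequences; the value of `gap` shows how far the heuristic lands from the true optimum for a given starting point.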
VISTA: VPO interactive system for tuning applications
- ACM Transactions on Embedded Computing Systems, 2005
- Cited by 5 (0 self)
Software designers face many challenges when developing applications for embedded systems. One major challenge is meeting the conflicting constraints of speed, code size and power consumption. Embedded application developers often resort to hand-coded assembly language to meet these constraints since traditional optimizing compiler technology is usually of little help in addressing this challenge. The results are software systems that are not portable, less robust and more costly to develop and maintain. Another limitation is that compilers traditionally apply the optimizations to a program in a fixed order. However, it has long been known that a single ordering of optimization phases will not produce the best code for every application. In fact, the smallest unit of compilation in most compilers is typically a function and the programmer has no control over the code improvement process other than setting flags to enable or disable certain optimization phases. This paper describes a new code improvement paradigm implemented in a system called VISTA that can help achieve the cost/performance trade-offs that embedded applications demand. The VISTA system opens the code improvement process and gives the application programmer, when necessary, the ability to finely control it. VISTA also provides support for finding effective sequences of optimization phases. This support includes the ability to interactively get
In search of near-optimal optimization phase orderings
- In Proceedings of the 2006 ACM Conference on Languages, Compilers, and Tools for Embedded Systems, 2006
- Cited by 3 (2 self)
Phase ordering is a long standing challenge for traditional optimizing compilers. Varying the order of applying optimization phases to a program can produce different code, with potentially significant performance variation amongst them. A key insight to addressing the phase ordering problem is that many different optimization sequences produce the same code. In an earlier study, we used this observation to restate the phase ordering problem to concentrate on finding all distinct function instances that can be produced due to different phase orderings, instead of attempting to generate code for all possible optimization sequences. Using a novel search algorithm we were able to show that it is possible to exhaustively enumerate the set of all possible function instances that can be produced by different phase orderings in our compiler for most of the functions in our benchmark suite [1]. Finding the optimal function instance within this set for almost any dynamic measure of performance still appears impractical since that would involve execution/simulation of all generated function instances. To find the dynamically optimal function instance we exploit the observation that the enumeration space for a function typically contains a very small number of distinct control flow paths. We simulate only one function instance from each group of function instances having the identical control flow, and use that information to estimate the dynamic performance of the remaining functions in that group. We further show that the estimated dynamic frequency counts obtained by using our method correlate extremely well to simulated processor cycle counts. Thus, by using our measure of dynamic frequencies to identify a small number of the best performing function instances we can often find the optimal phase ordering for a function within a reasonable amount of time. 
Finally, we perform a case study to evaluate how adept our genetic algorithm is for finding optimal phase orderings within our compiler, and demonstrate how the algorithm can be improved.
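The estimation idea in this abstract (simulate one function instance per control-flow group, then reuse its block execution counts for the other instances in that group) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the record layout (`name`, `cfg`, `block_sizes`), the `simulate` callback, and the sample data are all hypothetical.

```python
from collections import defaultdict


def estimate_dynamic_counts(instances, simulate):
    """Estimate a dynamic instruction count for every function instance.

    instances: list of dicts with a 'name', a hashable 'cfg' signature, and
               'block_sizes' mapping block name -> static instruction count.
    simulate:  callable returning block name -> execution count for one
               instance (a stand-in for a real simulator run).
    """
    groups = defaultdict(list)
    for inst in instances:
        groups[inst["cfg"]].append(inst)   # same control flow, same group

    estimates = {}
    simulations = 0
    for members in groups.values():
        block_counts = simulate(members[0])   # one simulation per group
        simulations += 1
        for inst in members:
            estimates[inst["name"]] = sum(
                block_counts[block] * size
                for block, size in inst["block_sizes"].items()
            )
    return estimates, simulations


def fake_simulate(inst):
    """Hypothetical simulator: fixed block execution counts per CFG shape."""
    if inst["cfg"] == "loop":
        return {"entry": 1, "body": 10, "exit": 1}
    return {"entry": 1}


sample = [
    {"name": "f1", "cfg": "loop", "block_sizes": {"entry": 3, "body": 5, "exit": 2}},
    {"name": "f2", "cfg": "loop", "block_sizes": {"entry": 3, "body": 4, "exit": 2}},
    {"name": "f3", "cfg": "straight", "block_sizes": {"entry": 6}},
]
estimates, simulations = estimate_dynamic_counts(sample, fake_simulate)
```

With this sample data, three instances need only two simulator runs, because `f1` and `f2` share a control-flow signature and differ only in the instructions inside each block.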
Compilation order matters: Exploring the structure of the space of compilation sequences using randomized search algorithms
- In Proceedings of the ACM SIGPLAN Symposium on Languages, Compilers, and Tools for Embedded Systems (LCTES), 2004
- Cited by 2 (0 self)
Most modern compilers operate by applying a fixed sequence of code optimizations, called a compilation sequence, to all programs. Compiler writers determine a small set of good, general-purpose, compilation sequences by extensive hand-tuning over particular benchmarks. The compilation sequence makes a significant difference in the quality of the generated code; in particular, we know that a single universal compilation sequence does not produce the best results over all programs [1, 2, 5, 6]. Three questions arise in customizing compilation sequences: (1) What is the incremental benefit of using a customized sequence instead of a universal sequence? (2) What is the average computational cost of constructing a customized sequence? (3) When does the benefit exceed the cost? To answer these questions, we must develop a good understanding of how quality of the generated code varies with the choice of compilation sequence over the entire space of sequences for a given program. In particular, we need to know (1) What percentage of the set of possible compilation sequences falls within a specified neighborhood of the true optimum sequence? (2) How are these nearly optimal sequences distributed in the sequence space? Do good sequences cluster in particular regions of the space? Or are they distributed evenly? (3) Is the sequence space riddled with shallow local minima? Can random sampling in the space reliably achieve near-optimal solutions? More importantly, we need to know if there are structural properties shared by compilation sequence spaces for a broad range of pro-
Practical Run-time Adaptation with Procedure Cloning to Enable Continuous Collective Compilation
- Cited by 2 (1 self)
Iterative feedback-directed optimization is now a popular technique to obtain better performance and code size improvements for statically compiled programs over the default settings in a compiler. The offline evaluation of multiple optimization strategies for a given program is a potentially costly operation. The number of iterations typically grows with the complexity of the program transformation search space, and with the number of input datasets used for performance assessment. In addition, as the behavior of a program can vary considerably across different datasets, it is often preferable to generate different optimization versions, covering the full spectrum of the program’s representative datasets. Continuous and collective optimization are targeted at these issues. Continuous optimization searches for the best program transformation at run-time, taking advantage of the phase behavior of programs to evaluate multiple optimization versions within a single run, and dynamically adapting to changing execution contexts. Collective optimization interleaves optimization iterations with program executions
Improving WCET by applying a WC code-positioning optimization
- ACM Transactions on Architecture and Code Optimization, 2005
- Cited by 1 (0 self)
Applications in embedded systems often need to meet specified timing constraints. It is advantageous to not only calculate the Worst-Case Execution Time (WCET) of an application, but to also perform transformations which reduce the WCET, since an application with a lower WCET will be less likely to violate its timing constraints. Some processors incur a pipeline delay whenever an instruction transfers control to a target that is not the next sequential instruction. Code positioning optimizations attempt to reduce these delays by positioning the basic blocks to minimize the number of unconditional jumps and taken conditional branches that occur. Traditional code positioning algorithms use profile data to find the frequently executed edges between basic blocks, then minimize the transfers of control along these edges to reduce the Average Case Execution Time (ACET). This paper introduces a WCET code positioning optimization, driven by the worst-case (WC) path information from a timing analyzer, to reduce the WCET instead of ACET. This WCET optimization changes the layout of the code in memory to reduce the branch penalties along the WC paths. Unlike the frequency of edges in traditional profile-driven code positioning, the WC path may change after code positioning decisions are made. Thus, WCET code positioning is inherently more challenging than ACET code positioning. The experimental results show that this optimization typically finds the optimal layout of the basic blocks with the minimal WCET. The results also show that a reduction in WCET of over 7% is achieved after code positioning is performed.
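The traditional, weight-driven code positioning that this abstract contrasts with can be sketched as a greedy chaining pass: the heaviest edges are made fall-through first. This is a generic illustration, not the paper's WCET algorithm, which must also cope with the worst-case path changing after each placement decision. The block names, edge weights, and helper names below are hypothetical.

```python
def greedy_layout(blocks, edges):
    """Order basic blocks so that heavy edges become fall-through.

    blocks: block names in original order; blocks[0] is the entry block.
    edges:  dict mapping (src, dst) -> weight (for example a profile count).
    """
    chain_of = {b: [b] for b in blocks}   # block -> the chain containing it
    for (src, dst), _weight in sorted(edges.items(), key=lambda kv: -kv[1]):
        a, b = chain_of[src], chain_of[dst]
        # Merge only when src ends one chain, dst starts another, and the
        # entry block would not be pushed away from the front.
        if a is not b and a[-1] == src and b[0] == dst and dst != blocks[0]:
            a.extend(b)
            for blk in b:
                chain_of[blk] = a

    layout, seen = [], set()
    for blk in blocks:   # emit chains by first appearance, entry chain first
        chain = chain_of[blk]
        if id(chain) not in seen:
            seen.add(id(chain))
            layout.extend(chain)
    return layout


def fallthrough_weight(layout, edges):
    """Total weight of edges whose target directly follows their source."""
    return sum(edges.get(pair, 0) for pair in zip(layout, layout[1:]))


blocks = ["A", "B", "C", "D"]
edges = {("A", "B"): 1, ("A", "C"): 9, ("B", "D"): 1, ("C", "D"): 9}
new_layout = greedy_layout(blocks, edges)
```

In this example the heavy path A, C, D is laid out contiguously and the rarely taken block B moves to the end, which raises the fall-through weight from 10 in the original order to 18.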
Using de-optimization to re-optimize code
- In Proceedings of the EMSOFT Conference, 2004
- Cited by 1 (0 self)
To Mom, Dad, and Frank... ACKNOWLEDGMENTS I am very grateful for the help of my advisor, Dr. David Whalley. Without you, this thesis would not have been possible. Thank you for believing in me, as well as inspiring me to work hard to achieve my goals. I would also like to thank the other members of the Compilers Group (Prasad Kulkarni, Bill Kreahling, Clint Whaley, Wankang Zhao) for their assistance. This work would have been extraordinarily difficult without your insight and your friendship. I would like to extend a big thanks to my family and friends for their unwavering love and support. You may not understand all of the complexities involved in my research, but I certainly learned that you are always willing to listen to me. I am truly blessed to have each of you in my life.
Automatic Library Generation for BLAS3 on GPUs
- Cited by 1 (0 self)
High-performance libraries, the performance-critical building blocks for high-level applications, will assume greater importance on modern processors as they become more complex and diverse. However, automatic library generators are still immature, forcing library developers to manually tune libraries to meet their performance objectives. We are developing a new script-controlled compilation framework to help domain experts reduce much of the tedious and error-prone nature of manual tuning, by enabling them to leverage their expertise and reuse past optimization experiences. We focus on demonstrating improved performance and productivity obtained through using our framework to tune BLAS3 routines on three GPU platforms: up to 5.4x speedups over CUBLAS achieved on NVIDIA GeForce 9800, 2.8x on GTX285, and 3.4x on Fermi Tesla C2050. Our results highlight the potential benefits of exploiting domain expertise and the relations between different routines (in terms of their algorithms and data structures).
Improving WCET by Applying Worst-Case Path Optimizations
- Cited by 1 (0 self)
It is advantageous to perform compiler optimizations that attempt to lower the worst-case execution time (WCET) of an embedded application, since tasks with lower WCETs are easier to schedule and more likely to meet their deadlines. Compiler writers in recent years have used profile information to detect the frequently executed paths in a program, and there has been considerable effort to develop compiler optimizations to improve these paths in order to reduce the average-case execution time (ACET). In this paper, we describe an approach to reduce the WCET by adapting and applying optimizations designed for frequent paths to the worst-case (WC) paths in an application. Instead of profiling to find the frequent paths, our WCET path optimization uses feedback from a timing analyzer to detect the WC paths in a function. Since these path-based optimizations may increase code size, the subsequent effects on the WCET due to these optimizations are measured to ensure that the worst-case path optimizations actually improve the WCET before committing to a code size increase. We evaluate these WC path optimizations and present results showing the decrease in WCET versus the increase in code size.

