Results 1 - 10
of
31
Compiler optimization-space exploration
- In Proceedings of the international symposium on Code generation and optimization
, 2003
"... To meet the demands of modern architectures, optimizing compilers must incorporate an ever larger number of increasingly complex transformation algorithms. Since code transformations may often degrade performance or interfere with subsequent transformations, compilers employ predictive heuristics to ..."
Abstract
-
Cited by 87 (1 self)
- Add to MetaCart
To meet the demands of modern architectures, optimizing compilers must incorporate an ever larger number of increasingly complex transformation algorithms. Since code transformations may often degrade performance or interfere with subsequent transformations, compilers employ predictive heuristics to guide optimizations by predicting their effects a priori. Unfortunately, the unpredictability of optimization interaction and the irregularity of today’s wide-issue machines severely limit the accuracy of these heuristics. As a result, compiler writers may temper high variance optimizations with overly conservative heuristics or may exclude these optimizations entirely. While this process results in a compiler capable of generating good average code quality across the target benchmark set, it is at the cost of missed optimization opportunities in individual code segments. To replace predictive heuristics, researchers have proposed compilers which explore many optimization options, selecting the best one a posteriori. Unfortunately, these existing iterative compilation techniques are not practical for reasons of compile time and applicability. In this paper, we present the Optimization-Space Exploration (OSE) compiler organization, the first practical iterative compilation strategy applicable to optimizations in general-purpose compilers. Instead of replacing predictive heuristics, OSE uses the compiler writer’s knowledge encoded in the heuristics to select a small number of promising optimization alternatives for a given code segment. Compile time is limited by evaluating only these alternatives for hot code segments using a general compiletime performance estimator. An OSE-enhanced version of Intel’s highly-tuned, aggressively optimizing production compiler for IA-64 yields a significant performance improvement, more than 20 % in some cases, on Itanium for SPEC codes. 1.
A Practical Method for Quickly Evaluating Program Optimizations
- In Proceedings of the International Conference on High Performance Embedded Architectures & Compilers (HiPEAC 2005
, 2005
"... This article aims at making iterative optimization practical and usable by speeding up the evaluation of a large range of optimizations. Instead of using a full run to evaluate a single program optimization, we take advantage of periods of stable performance, called phases. For that purpose, we prop ..."
Abstract
-
Cited by 25 (9 self)
- Add to MetaCart
This article aims at making iterative optimization practical and usable by speeding up the evaluation of a large range of optimizations. Instead of using a full run to evaluate a single program optimization, we take advantage of periods of stable performance, called phases. For that purpose, we propose a low-overhead phase detection scheme geared toward fast optimization space pruning, using code instrumentation and versioning implemented in a production compiler. Our approach is driven by simplicity and practicality. We show that a simple phase detection scheme can be sufficient for optimization space pruning. We also show it is possible to search for complex optimizations at run-time without resorting to sophisticated dynamic compilation frameworks. Beyond iterative optimization, our approach also enables one to quickly design selftuned applications.
Probabilistic source-level optimisation of embedded programs
- In Proceedings of the Conference on Languages, Compilers, and Tools for Embedded Systems (LCTES
, 2005
"... Efficient implementation of DSP applications is critical for many embedded systems. Optimising C compilers for embedded processors largely focus on code generation and instruction scheduling which, with their growing maturity, are providing diminishing returns. This paper empirically evaluates anoth ..."
Abstract
-
Cited by 22 (14 self)
- Add to MetaCart
Efficient implementation of DSP applications is critical for many embedded systems. Optimising C compilers for embedded processors largely focus on code generation and instruction scheduling which, with their growing maturity, are providing diminishing returns. This paper empirically evaluates another approach, namely source-level transformations and the probabilistic feedback-driven search for “good ” transformation sequences within a large optimisation space. This novel approach combines two selection methods: one based on exploring the optimisation space, the other focused on localised search of good areas. This technique was applied to the UTDSP benchmark suite on two digital signal and multimedia processors (Analog Devices TigerSHARC TS-101, Philips TriMedia TM-1100) and an embedded processor derived from a popular general-purpose processor architecture (Intel Celeron 400). On average, our approach gave a factor of 1.71 times improvement across all platforms equivalent to an average 41 % reduction in execution time, outperforming existing approaches. In certain cases a speedup of up to ≈ 7 was found for individual benchmarks.
Automatic tuning of whole applications using direct search and a performance-based transformation system
- In Proceedings of the Los Alamos Computer Science Institute Second Annual Symposium
, 2004
"... Abstract. In many cases, simple analytical models used by traditional compilers are no longer able to yield effectively optimized code for complex programs because of the enormous complexity of processor architectures. A promising alternative approach for optimizing applications effectively has been ..."
Abstract
-
Cited by 15 (3 self)
- Add to MetaCart
Abstract. In many cases, simple analytical models used by traditional compilers are no longer able to yield effectively optimized code for complex programs because of the enormous complexity of processor architectures. A promising alternative approach for optimizing applications effectively has been the use of search-based empirical methods. The success of empirically tuned library generators such as ATLAS has shown that this strategy can be effective for domain-specific programs. However, to date there has been no general-purpose tool for effective empirical optimization of whole programs. The main obstacle to this approach has been the need for evaluating a prohibitively large number of alternative program variants. To address this problem, we have developed a prototype tool for automatic application tuning that uses loop-level performance feedback and a direct search strategy to guide search for the best set of optimization parameters. Experiments on four different architectures show that direct search can be an effective technique for finding good values for transformation parameters in a reasonable time. 1
Qilin: exploiting parallelism on heterogeneous multiprocessors with adaptive mapping
- In Micro-42
, 2009
"... Heterogeneous multiprocessors are growingly important in the multi-core era due to their potential for high performance and energy efficiency. In order for software to fully realize this potential, the step that maps computations to processing elements must be as automated as possible. However, the ..."
Abstract
-
Cited by 15 (2 self)
- Add to MetaCart
Heterogeneous multiprocessors are growingly important in the multi-core era due to their potential for high performance and energy efficiency. In order for software to fully realize this potential, the step that maps computations to processing elements must be as automated as possible. However, the state-of-the-art approach is to rely on the programmer to specify this mapping manually and statically. This approach is not only labor intensive but also not adaptable to changes in runtime environments like problem sizes and hardware configurations. In this study, we propose adaptive mapping, a fully automatic technique to map computations to processing elements on heterogeneous multiprocessors. We have implemented it in our experimental heterogeneous programming system called Qilin. Our results demonstrate that, for a set of important computation kernels, automatic adaptive mapping achieves a speedup of 9.3x on average over the best serial implementation by judiciously distributing works over the CPU and GPU, which is 69 % and 33 % faster than using the CPU or GPU alone, respectively. In addition, adaptive mapping is within 94 % of the speedup of the best manual mapping found via exhaustive searching. To the best of our knowledge, Qilin is the first and only system to date that has such capability. 1.
Toward a Systematic, Pragmatic and Architecture-Aware Program Optimization Process for Complex Processors
- In ACM Supercomputing’04
, 2004
"... Because processor architectures are increasingly complex, it is increasingly difficult to embed accurate machine models within compilers. As a result, compiler efficiency tends to decrease. Currently, the trend is on top-down approaches: static compilers are progressively augmented with information ..."
Abstract
-
Cited by 12 (10 self)
- Add to MetaCart
Because processor architectures are increasingly complex, it is increasingly difficult to embed accurate machine models within compilers. As a result, compiler efficiency tends to decrease. Currently, the trend is on top-down approaches: static compilers are progressively augmented with information from the architecture as in profile-based, iterative or dynamic compilation techniques. However, for the moment, fairly elementary architectural information is used. In this article, we adopt a bottom-up approach to the architecture complexity issue: we assume we know everything about the behavior of the program on the architecture. We present a manual but systematic process for optimizing a program on a complex processor architecture using extensive dynamic analysis, and we find that a small set of run-time information is sufficient to drive an efficient process. We have experimentally observed on an Alpha 21264 that this approach can yield significant performance improvement on Spec benchmarks, beyond peak Spec. We are currently using this approach for optimizing customer applications.
A Polyhedral Approach to Ease the Composition of Program Transformations
- in: Euro-Par’04, no. 3149 in LNCS
, 2004
"... We wish to extend the effectiveness of loop-restructuring compilers by improving the robustness of loop transformations and easing their composition in long sequences. We propose a formal and practical framework for program transformation. Our framework is very well suited for iterative optimization ..."
Abstract
-
Cited by 11 (3 self)
- Add to MetaCart
We wish to extend the effectiveness of loop-restructuring compilers by improving the robustness of loop transformations and easing their composition in long sequences. We propose a formal and practical framework for program transformation. Our framework is very well suited for iterative optimization techniques searching not only for the appropriate parameters of a given transformation, but for the program transformations themselves, and especially for compositions of program transformations. This framework is based on a unified polyhedral representation of loops and statements. The key to our framework is to clearly separate the impact of each program transformation on the following three components: the iteration domain, the iteration schedule and the memory access functions. Our techniques have been implemented on top of Open64/ORC.
Midatasets: Creating the conditions for a more realistic evaluation of iterative optimization
- In Proceedings of the International Conference on High Performance Embedded Architectures & Compilers (HiPEAC
, 2007
"... Abstract. Iterative optimization has become a popular technique to obtain improvements over the default settings in a compiler for performance-critical applications, such as embedded applications. An implicit assumption, however, is that the best configuration found for any arbitrary data set will w ..."
Abstract
-
Cited by 11 (7 self)
- Add to MetaCart
Abstract. Iterative optimization has become a popular technique to obtain improvements over the default settings in a compiler for performance-critical applications, such as embedded applications. An implicit assumption, however, is that the best configuration found for any arbitrary data set will work well with other data sets that a program uses. In this article, we evaluate that assumption based on 20 data sets per benchmark of the MiBench suite. We find that, though a majority of programs exhibit stable performance across data sets, the variability can significantly increase with many optimizations. However, for the best optimization configurations, we find that this variability is in fact small. Furthermore, we show that it is possible to find a compromise optimization configuration across data sets which is often within 5 % of the best possible configuration for most data sets, and that the iterative process can converge in less than 20 iterations (for a population of 200 optimization configurations). All these conclusions have significant and positive implications for the practical utilization of iterative optimization. 1
Fast compiler optimisation evaluation using code-feature based performance prediction
- In CF
, 2007
"... Performance tuning is an important and time consuming task which may have to be repeated for each new application and platform. Although iterative optimisation can automate this process, it still requires many executions of different versions of the program. As execution time is frequently the limit ..."
Abstract
-
Cited by 10 (2 self)
- Add to MetaCart
Performance tuning is an important and time consuming task which may have to be repeated for each new application and platform. Although iterative optimisation can automate this process, it still requires many executions of different versions of the program. As execution time is frequently the limiting factor in the number of versions or transformed programs that can be considered, what is needed is a mechanism that can automatically predict the performance of a modified program without actually having to run it. This paper presents a new machine learning based technique to automatically predict the speedup of a modified program using a performance model based on the code features of the tuned programs. Unlike previous approaches it does not require any prior learning over a benchmark suite. Furthermore, it can be used to predict the performance of any tuning and is not restricted to a prior seen transformation space. We show that it can deliver predictions with a high correlation coefficient and can be used to dramatically reduce the cost of search.
A Tuning Framework for Software-Managed Memory Hierarchies
"... Achieving good performance on a modern machine with a multi-level memory hierarchy, and in particular on a machine with software-managed memories, requires precise tuning of programs to the machine’s particular characteristics. A large program on a multi-level machine can easily expose tens or hundr ..."
Abstract
-
Cited by 10 (1 self)
- Add to MetaCart
Achieving good performance on a modern machine with a multi-level memory hierarchy, and in particular on a machine with software-managed memories, requires precise tuning of programs to the machine’s particular characteristics. A large program on a multi-level machine can easily expose tens or hundreds of inter-dependent parameters which require tuning, and manually searching the resultant large, non-linear space of program parameters is a tedious process of trial-and-error. In this paper we present a general framework for automatically tuning general applications to machines with software-managed memory hierarchies. We evaluate our framework by measuring the performance of benchmarks that are tuned for a range of machines with different memory hierarchy configurations: a cluster of Intel P4 Xeon processors, a single Cell processor, and a cluster of Sony Playstation3’s.

