A practical automatic polyhedral parallelizer and locality optimizer
- In PLDI ’08: Proceedings of the ACM SIGPLAN 2008 Conference on Programming Language Design and Implementation, 2008
Cited by 33 (1 self)
We present the design and implementation of an automatic polyhedral source-to-source transformation framework that can optimize regular programs (sequences of possibly imperfectly nested loops) for parallelism and locality simultaneously. Through this work, we show the practicality of analytical model-driven automatic transformation in the polyhedral model. Unlike previous polyhedral frameworks, our approach is an end-to-end fully automatic one driven by an integer linear optimization framework that takes an explicit view of finding good ways of tiling for parallelism and locality using affine transformations. The framework has been implemented into a tool to automatically generate OpenMP parallel code from C program sections. Experimental results from the tool show very high performance for local and parallel execution on multi-cores, when compared with state-of-the-art compiler frameworks from the research community as well as the best native production compilers. The system also enables the easy use of powerful empirical/iterative optimization for general arbitrarily nested loop sequences.
Towards a holistic approach to auto-parallelization: integrating profile-driven parallelism detection and machine-learning based mapping
- In PLDI, 2009
Cited by 11 (0 self)
Compiler-based auto-parallelization is a much studied area, yet has still not found wide-spread application. This is largely due to the poor exploitation of application parallelism, subsequently resulting in performance levels far below those which a skilled expert programmer could achieve. We have identified two weaknesses in traditional parallelizing compilers and propose a novel, integrated approach, resulting in significant performance improvements of the generated parallel code. Using profile-driven parallelism detection we overcome the limitations of static analysis, enabling us to identify more application parallelism and only rely on the user for final approval. In addition, we replace the traditional target-specific and inflexible mapping heuristics with a machine-learning based prediction mechanism, resulting in better mapping decisions while providing more scope for adaptation to different target architectures.
A Tuning Framework for Software-Managed Memory Hierarchies
Cited by 10 (1 self)
Achieving good performance on a modern machine with a multi-level memory hierarchy, and in particular on a machine with software-managed memories, requires precise tuning of programs to the machine’s particular characteristics. A large program on a multi-level machine can easily expose tens or hundreds of inter-dependent parameters which require tuning, and manually searching the resultant large, non-linear space of program parameters is a tedious process of trial-and-error. In this paper we present a general framework for automatically tuning general applications to machines with software-managed memory hierarchies. We evaluate our framework by measuring the performance of benchmarks that are tuned for a range of machines with different memory hierarchy configurations: a cluster of Intel P4 Xeon processors, a single Cell processor, and a cluster of Sony Playstation3’s.
The Polyhedral Model Is More Widely Applicable Than You Think
Cited by 8 (7 self)
The polyhedral model is a powerful framework for automatic optimization and parallelization. It is based on an algebraic representation of programs, allowing to construct and search for complex sequences of optimizations. This model is now mature and reaches production compilers. The main limitation of the polyhedral model is known to be its restriction to statically predictable, loop-based program parts. This paper removes this limitation, allowing to operate on general data-dependent control-flow. We embed control and exit predicates as first-class citizens of the algebraic representation, from program analysis to code generation. Complementing previous (partial) attempts in this direction, our work concentrates on extending the code generation step and does not compromise the expressiveness of the model. We present experimental evidence that our extension is relevant for program optimization and parallelization, showing performance improvements on benchmarks that were thought to be out of reach of the polyhedral model.
A Note on the Performance Distribution of Affine Schedules
Cited by 4 (1 self)
Iterative optimization has been shown to improve the performance of benchmarks significantly, but its application involves challenges such as the requirement of an expressive search space and the design of efficient search techniques. In this paper, we apply iterative optimization to the problem of optimizing in the polyhedral model, a powerful algebraic representation of any static control program, by using affine multidimensional schedules to represent arbitrarily complex transformation sequences. We propose to study the performance distribution of the search space of affine multidimensional schedules built specifically to guarantee legality and uniqueness of each program version. We extensively study the optimization of 5 representative benchmarks in this representation, and highlight a series of static and dynamic characteristics of the search space. We show how the space can be decoupled into subspaces, which can be statically ordered with respect to their impact on performance. Finally, we present a practical search method leveraging these properties to traverse the search space, yielding a 32.56% speedup on eight representative kernels.
Loop Transformation Recipes for Code Generation and Auto-Tuning
Cited by 4 (1 self)
In this paper, we describe transformation recipes, which provide a high-level interface to the code transformation and code generation capability of a compiler. These recipes can be generated by compiler decision algorithms or savvy software developers. This interface is part of an auto-tuning framework that explores a set of different implementations of the same computation and automatically selects the best-performing implementation. Along with the original computation, a transformation recipe specifies a range of implementations of the computation resulting from composing a set of high-level code transformations. In our system, an underlying polyhedral framework coupled with transformation algorithms takes this set of transformations, composes them and automatically generates correct code. We first describe an abstract interface for transformation recipes, which we propose to facilitate interoperability with other transformation frameworks. We then focus on the specific transformation recipe interface used in our compiler and present performance results on its application to kernel and library tuning and tuning of key computations in high-end applications. We also show how this framework can be used to generate and auto-tune parallel OpenMP or CUDA code from a high-level specification.
Automating the Generation of Composed Linear Algebra Kernels
Cited by 2 (1 self)
Memory bandwidth limits the performance of important kernels in many scientific applications. Such applications often use sequences of Basic Linear Algebra Subprograms (BLAS), and highly efficient implementations of those routines enable scientists to achieve high performance at little cost. However, tuning the BLAS in isolation misses opportunities for memory optimization that result from composing multiple subprograms. Because it is not practical to create a library of all BLAS combinations, we have developed a domain-specific compiler that generates them on demand. In this paper, we describe a novel algorithm for compiling linear algebra kernels and searching for the best combination of optimization choices. We also present a new hybrid analytic/empirical method for quickly evaluating the profitability of each optimization. We report experimental results showing speedups of up to 130% relative to the GotoBLAS on an AMD Opteron and up to 137% relative to MKL on an Intel Core 2.
Intelligent Compilers
Cited by 1 (0 self)
The industry is now in agreement that the future of architecture design lies in multiple cores. As a consequence, all computer systems today, from embedded devices to petascale computing systems, are being developed using multicore processors. Although researchers in industry and academia are exploring many different multicore hardware design choices, most agree that developing portable software that achieves high performance on multicore processors is a major unsolved problem. We now see a plethora of architectural features, with little consensus on how the computation, memory, and communication structures in multicore systems will be organized. The wide disparity in hardware systems available has made it nearly impossible to write code that is portable in functionality while still taking advantage of the performance potential of each system. In this paper, we propose exploring the viability of developing intelligent compilers, focusing on key components that will allow application portability while still achieving high performance.
ACOTES Project: Advanced Compiler Technologies for Embedded Streaming
- Int J Parallel Prog, DOI 10.1007/s10766-010-0132-7, 2009
© Springer Science+Business Media, LLC 2010
Streaming applications are built of data-driven, computational components, consuming and producing unbounded data streams. Streaming oriented systems have become dominant in a wide range of domains, including embedded applications and DSPs. However, programming efficiently for streaming architectures is a challenging task, having to carefully partition the computation and map it to processes
Speeding Up Nek5000 with Autotuning and Specialization
- Jaewook Shin
Autotuning technology has emerged recently as a systematic process for evaluating alternative implementations of a computation, in order to select the best-performing solution for a particular architecture. Specialization optimizes code customized to a particular class of input data set. In this paper, we demonstrate how compiler-based autotuning that incorporates specialization for expected data set sizes of key computations can be used to speed up Nek5000, a spectral-element code. Nek5000 makes heavy use of what are effectively Basic Linear Algebra Subroutine (BLAS) calls, but for very small matrices. Through autotuning and specialization, we can achieve significant performance gains over hand-tuned libraries (e.g., Goto, ATLAS, and ACML BLAS). Additional performance gains are obtained from using higher-level compiler optimizations that aggregate multiple BLAS calls. We demonstrate more than 2.2X performance gains on an Opteron over the original manually tuned implementation, and speedups of up to 1.26X on the entire

