Results 1 - 10
of
29
A practical automatic polyhedral parallelizer and locality optimizer
- In PLDI ’08: Proceedings of the ACM SIGPLAN 2008 conference on Programming language design and implementation
, 2008
"... We present the design and implementation of an automatic polyhedral source-to-source transformation framework that can optimize regular programs (sequences of possibly imperfectly nested loops) for parallelism and locality simultaneously. Through this work, we show the practicality of analytical mod ..."
Abstract
-
Cited by 33 (1 self)
- Add to MetaCart
We present the design and implementation of an automatic polyhedral source-to-source transformation framework that can optimize regular programs (sequences of possibly imperfectly nested loops) for parallelism and locality simultaneously. Through this work, we show the practicality of analytical model-driven automatic transformation in the polyhedral model.Unlike previous polyhedral frameworks, our approach is an end-to-end fully automatic one driven by an integer linear optimization framework that takes an explicit view of finding good ways of tiling for parallelism and locality using affine transformations. The framework has been implemented into a tool to automatically generate OpenMP parallel code from C program sections. Experimental results from the tool show very high performance for local and parallel execution on multi-cores, when compared with state-of-the-art compiler frameworks from the research community as well as the best native production compilers. The system also enables the easy use of powerful empirical/iterative optimization for general arbitrarily nested loop sequences.
Code and Data Transformations for Improving Shared Cache Performance on SMT Processors
- In Proceedings of the 5th International Symposium on High Performance Computing
, 2003
"... Abstract. Simultaneous multithreaded processors use shared on-chip caches, which yield better cost-performance ratios. Sharing a cache between simultaneously executing threads causes excessive conflict misses. This paper proposes software solutions for dynamically partitioning the shared cache of an ..."
Abstract
-
Cited by 11 (0 self)
- Add to MetaCart
Abstract. Simultaneous multithreaded processors use shared on-chip caches, which yield better cost-performance ratios. Sharing a cache between simultaneously executing threads causes excessive conflict misses. This paper proposes software solutions for dynamically partitioning the shared cache of an SMT processor, via the use of three methods originating in the optimizing compilers literature: dynamic tiling, copying and block data layouts. The paper presents an algorithm that combines these transformations and two runtime mechanisms to detect cache sharing between threads and react to it at runtime. The first mechanism uses minimal kernel extensions and the second mechanism uses information collected from the processor hardware counters. Our experimental results show that for regular, perfect loop nests, these transformations are very effective in coping with shared caches. When the caches are shared between threads from the same address space, performance is improved by 16–29 % on average. Similar improvements are observed when the caches are shared between threads from different address spaces. To our knowledge, this is the first work to present an all-software approach for managing shared caches on SMT processors. It is also one of the first performance and program optimization studies conducted on a commercial SMT-based multiprocessor using Intel’s hyperthreading technology.
PLuTo: A practical and fully automatic polyhedral program optimization system
- IN: PROCEEDINGS OF THE ACM SIGPLAN 2008 CONFERENCE ON PROGRAMMING LANGUAGE DESIGN AND IMPLEMENTATION (PLDI 08
, 2008
"... We present the design and implementation of a fully automatic polyhedral source-to-source transformation framework that can optimize regular programs (sequences of possibly imperfectly nested loops) for parallelism and locality simultaneously. Through this work, we show the practicality of analytica ..."
Abstract
-
Cited by 9 (4 self)
- Add to MetaCart
We present the design and implementation of a fully automatic polyhedral source-to-source transformation framework that can optimize regular programs (sequences of possibly imperfectly nested loops) for parallelism and locality simultaneously. Through this work, we show the practicality of analytical model-driven automatic transformation in the polyhedral model – far beyond what is possible by current production compilers. Unlike previous works, our approach is an end-to-end fully automatic one driven by an integer linear optimization framework that takes an explicit view of finding good ways of tiling for parallelism and locality using affine transformations. We also address generation of tiled code for multiple statement domains of arbitrary dimensionalities under (statement-wise) affine transformations – an issue that has not been addressed previously. Experimental results from the implemented system show very high speedups for local and parallel execution on multi-cores over state-of-the-art compiler frameworks from the research community as well as the best native compilers. The system also enables the easy use of powerful empirical/iterative optimization for general arbitrarily nested loop sequences.
Dynamic Multi Phase Scheduling for Heterogeneous Clusters
- Proc. of the 20th IEEE Int. Parallel & Distributed Processing Symposium (IPDPS 2006
, 2006
"... Distributed computing systems are a viable and less expensive alternative to parallel computers. However, concurrent programming methods in distributed systems have not been studied as extensively as for parallel computers. Some of the main research issues are how to deal with scheduling and load ba ..."
Abstract
-
Cited by 7 (3 self)
- Add to MetaCart
Distributed computing systems are a viable and less expensive alternative to parallel computers. However, concurrent programming methods in distributed systems have not been studied as extensively as for parallel computers. Some of the main research issues are how to deal with scheduling and load balancing of such a system which may consist of heterogeneous computers. In the past, a variety of dynamic scheduling schemes suitable for parallel loops (with independent iterations) on heterogeneous computer clusters have been obtained and studied. However, no study of dynamic schemes for loops with iteration dependencies has been reported so far. In this work we study the problem of scheduling loops with iteration dependencies for heterogeneous (dedicated and non-dedicated) clusters. The presence of iteration dependencies incurs an extra degree of difficulty and makes the development of such schemes quite a challenge. We extend three well known dynamic schemes (CS, TSS and DTSS) by introducing synchronization points at certain intervals so that processors compute in pipelined fashion. Our scheme is called dynamic multi-phase scheduling (DMPS) and we apply it to loops with iteration dependencies. We implemented our new scheme on a network of heterogeneous computers and study its performance. Through extensive testing on two real-life applications (heat equation and Floyd-Steinberg algorithm), we show that the proposed method is efficient for parallelizing nested loops with dependencies on heterogeneous systems. Index Terms- heterogeneous distributed systems, loop scheduling, dynamic algorithms, dependence loops, pipelined execution. 1
Towards optimal multi-level tiling for stencil computations
- 21st IEEE International Parallel and Distributed Processing Symposium (IPDPS
, 2007
"... Stencil computations form the performance-critical core of many applications. Tiling and parallelization are two important optimizations to speed up stencil computations. Many tiling and parallelization strategies are applicable to a given stencil computation. The best strategy depends not only on t ..."
Abstract
-
Cited by 6 (0 self)
- Add to MetaCart
Stencil computations form the performance-critical core of many applications. Tiling and parallelization are two important optimizations to speed up stencil computations. Many tiling and parallelization strategies are applicable to a given stencil computation. The best strategy depends not only on the combination of the two techniques, but also on many parameters: tile and loop sizes in each dimension; computation-communication balance of the code; processor architecture; message startup costs; etc. The best choices can only be determined through design-space exploration, which is extremely tedious and error prone to do via exhaustive experimentation. We characterize the space of multi-level tilings and parallelizations for 2D/3D Gauss-Siedel stencil computation. A systematic exploration of a part of this space enabled us to derive a design which is up to a factor of two faster than the standard implementation. 1.
PrimeTile: A Parametric Multi-Level Tiler for Imperfect Loop Nests
"... Tiling is a crucial loop transformation for generating high performance code on modern architectures. Efficient generation of multi-level tiled code is essential for maximizing data reuse in systems with deep memory hierarchies. Tiled loops with parametric tile sizes (not compile-time constants) fac ..."
Abstract
-
Cited by 4 (4 self)
- Add to MetaCart
Tiling is a crucial loop transformation for generating high performance code on modern architectures. Efficient generation of multi-level tiled code is essential for maximizing data reuse in systems with deep memory hierarchies. Tiled loops with parametric tile sizes (not compile-time constants) facilitate runtime feedback and dynamic optimizations used in iterative compilation and automatic tuning. Previous parametric multi-level tiling approaches have been restricted to perfectly nested loops, where all assignment statements are contained inside the innermost loop of a loop nest. Previous solutions to tiling for imperfect loop nests have only handled fixed tile sizes. In this paper, we present an approach to parametric multi-level tiling of imperfectly nested loops. The tiling technique generates loops that iterate over full rectangular tiles, making them amenable to compiler optimizations such as register tiling. Experimental results using a number of computational benchmarks demonstrate the effectiveness of the developed tiling approach. 1 1
Optimizaiton of dense matrix multiplication on ibm cyclops-64: Challenges and experiences
- In proceedings of the 12th International European Conference on Parallel Processing (Euro-Par
, 2006
"... Abstract. This paper presents a study of performance optimization of dense matrix multiplication on IBM Cyclops-64(C64) chip architecture. Although much has been published on how to optimize dense matrix applications on shared memory architecture with multi-level caches, little has been reported on ..."
Abstract
-
Cited by 3 (3 self)
- Add to MetaCart
Abstract. This paper presents a study of performance optimization of dense matrix multiplication on IBM Cyclops-64(C64) chip architecture. Although much has been published on how to optimize dense matrix applications on shared memory architecture with multi-level caches, little has been reported on the applicability of the existing methods to the new generation of multi-core architectures like C64. For such architectures a more economical use of on-chip storage resources appears to discourage the use of caches, while providing tremendous on-chip memory bandwidth per storage area. This paper presents an in-depth case study of a collection of well known optimization methods and tries to re-engineer them to address the new challenges and opportunities provided by this emerging class of multi-core chip architectures. Our study demonstrates that efficiently exploiting the memory hierarchy is the key to achieving good performance. The main contributions of this paper include: (a) identifying a set of key optimizations for C64-like architectures, and (b) exploring a practical order of the optimizations, which yields good performance for applications like matrix multiplication. 1
Parameterized Tiled Loops for Free
, 2007
"... Parameterized tiled loops—where the tile sizes are not fixed at compile time, but remain symbolic parameters until later—are quite useful for iterative compilers and “auto-tuners” that produce highly optimized libraries and codes. Tile size parameterization could also enable optimizations such as re ..."
Abstract
-
Cited by 3 (0 self)
- Add to MetaCart
Parameterized tiled loops—where the tile sizes are not fixed at compile time, but remain symbolic parameters until later—are quite useful for iterative compilers and “auto-tuners” that produce highly optimized libraries and codes. Tile size parameterization could also enable optimizations such as register tiling to become dynamic optimizations. Although it is easy to generate such loops for (hyper) rectangular iteration spaces tiled with (hyper) rectangular tiles, many important computations do not fall into this restricted domain. Parameterized tile code generation for the general case of convex iteration spaces being tiled by (hyper) rectangular tiles has in the past been solved with bounding box approaches or symbolic Fourier Motzkin approaches. However, both approaches have less than ideal code generation efficiency and resulting code quality. We present the theoretical foundations, implementation, and experimental validation of a simple, unified technique for generating parameterized tiled code. Our code generation efficiency is comparable to all existing code generation techniques including those for fixed tile sizes, and the resulting code is as efficient as, if not more than, all previous techniques. Thus the technique provides parameterized tiled loops for free! Our “one-size-fits-all” solution, which is available as open source software can be adapted for use in production compilers.
Data Cache Locking for Tight Timing Calculations
"... Caches have become increasingly important with the widening gap between main memory and processor speeds. Small and fast cache memories are designed to bridge this discrepancy. However, they are only effective when programs exhibit sufficient data locality. In addition, caches are a source of unpred ..."
Abstract
-
Cited by 3 (0 self)
- Add to MetaCart
Caches have become increasingly important with the widening gap between main memory and processor speeds. Small and fast cache memories are designed to bridge this discrepancy. However, they are only effective when programs exhibit sufficient data locality. In addition, caches are a source of unpredictability, resulting in programs sometimes behaving in a different way than expected. Detailed information about the number of cache misses and their causes allows us to predict cache behavior and to detect bottlenecks. Small modifications in the source code may change memory patterns, thereby altering the cache behavior. Code transformations which take the cache behavior into account might result in a high cache performance improvement. However, cache memory behavior is very hard to predict, thus making the task of optimizing and timing cache behavior very difficult. This article proposes and evaluates a new compiler framework that times cache behavior for multitasking systems. Our method explores the use of cache partitioning and dynamic cache locking to provide worst-case performance estimates in a safe and tight way for multitasking systems. We use cache partitioning, which divides the cache among tasks to eliminate inter-task
Mobile pipelines: Parallelizing left-looking algorithms using navigational programming
- In Proc., 12th Int’l Conf. on High Perf. Comp. - HiPC 2005
, 2005
"... Abstract. Parallelizing a sequential algorithm—i.e., manually or automatically converting it into an equivalent parallel distributed algorithm—is an important problem. Ideally, the parallel algorithm should preserve the computational structure of the original sequential algorithm, display a high deg ..."
Abstract
-
Cited by 2 (2 self)
- Add to MetaCart
Abstract. Parallelizing a sequential algorithm—i.e., manually or automatically converting it into an equivalent parallel distributed algorithm—is an important problem. Ideally, the parallel algorithm should preserve the computational structure of the original sequential algorithm, display a high degree of parallelism, have low communication overhead, and be scalable. The difficulty of accomplishing this for a particular sequential algorithm depends on the nature of the algorithm and the specific model of parallelism under consideration. Generally, the so called “right-looking ” algorithms are easy to parallelize because they tend to have a significant amount of data parallelism. In these algorithms, data is “eagerly ” propagated to and consumed by subsequent computations immediately after it is produced, and so it does not need to be kept in temporary storage for extended periods of time. In this paper we consider the class of “left-looking ” sequential algorithms [1], which cannot be easily parallelized using the message-passing or distributed shared memory models. These algorithms are characterized by “lazy ” propagation of data. We show that these algorithms can be directly parallelized using the Navigational Programming model. We present performance data demonstrating the effectiveness of our approach. 1

