Results 1 - 10 of 16
Compiler Optimizations for Eliminating Barrier Synchronization
In Proceedings of the Fifth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 1995
"... This paper presents novel compiler optimizations for reducing synchronization overhead in compiler-parallelized scientific codes. A hybrid programming model is employed to combine the flexibility of the fork-join model with the precision and power of the singleprogram, multiple data (SPMD) model. By ..."
Abstract
-
Cited by 91 (13 self)
- Add to MetaCart
(Show Context)
This paper presents novel compiler optimizations for reducing synchronization overhead in compiler-parallelized scientific codes. A hybrid programming model is employed to combine the flexibility of the fork-join model with the precision and power of the single-program, multiple-data (SPMD) model. By exploiting compile-time computation partitions, communication analysis can eliminate barrier synchronization or replace it with less expensive forms of synchronization. We show computation partitions and data communication can be represented as systems of symbolic linear inequalities for high flexibility and precision. These optimizations have been implemented in the Stanford SUIF compiler. We extensively evaluate their performance using standard benchmark suites. Experimental results show barrier synchronization is reduced by 29% on average and by several orders of magnitude for certain programs.
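A minimal OpenMP sketch of the basic idea (not the paper's SUIF implementation, which reasons over symbolic linear inequalities): when two parallel loops share the same static iteration partition and each thread consumes only data it produced, the barrier separating them carries no cross-processor data flow and can be dropped, expressed here with OpenMP's nowait clause.

#include <stdio.h>
#define N 1024

int main(void) {
    static double a[N], b[N];

    #pragma omp parallel
    {
        /* Producer loop: each thread writes its own static chunk of a[]. */
        #pragma omp for schedule(static) nowait   /* barrier eliminated */
        for (int i = 0; i < N; i++)
            a[i] = 0.5 * i;

        /* Consumer loop: with the same schedule(static) partition, each
         * thread reads only the elements of a[] it just produced. */
        #pragma omp for schedule(static)
        for (int i = 0; i < N; i++)
            b[i] = 2.0 * a[i];
    }
    printf("b[%d] = %f\n", N - 1, b[N - 1]);
    return 0;
}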
Program and Data Transformations for Efficient Execution on Distributed Memory Architectures
1993
"... This report is concerned with the efficient execution of array computation on Distributed Memory Architectures by applying compiler-directed program and data transformations. By translating a sub-set of a single-assignment language, Sisal, into a linear algebraic framework it is possible to transfor ..."
Abstract
-
Cited by 21 (7 self)
- Add to MetaCart
This report is concerned with the efficient execution of array computation on distributed-memory architectures by applying compiler-directed program and data transformations. By translating a subset of a single-assignment language, Sisal, into a linear algebraic framework, it is possible to transform a program so as to reduce load imbalance and non-local memory access. A new test is presented which allows the construction of transformations to reduce load imbalance. By a new expression of data alignment, transformations to reduce non-local access are derived. A new pre-fetching procedure, which prevents redundant non-local accesses, is presented and forms the basis of a new data partitioning methodology. By applying these transformations in a straightforward manner to some well-known scientific programs, it is shown that this approach is competitive with hand-crafted methods.
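The report's linear-algebraic transformation framework is not reproduced here; the following C/OpenMP sketch only illustrates the load-imbalance problem it targets, using a triangular loop whose work grows with the outer index. A block (contiguous-chunk) distribution leaves the last processor with most of the work, while a cyclic distribution of iterations evens the load.

#include <stdio.h>
#define N 1000

int main(void) {
    static double a[N][N];
    double sum = 0.0;

    /* schedule(static)    -> block chunks of i: badly imbalanced here  */
    /* schedule(static, 1) -> cyclic assignment of i: roughly balanced  */
    #pragma omp parallel for schedule(static, 1) reduction(+:sum)
    for (int i = 0; i < N; i++)
        for (int j = 0; j <= i; j++) {   /* inner trip count grows with i */
            a[i][j] = (double)(i - j);
            sum += a[i][j];
        }

    printf("checksum = %f\n", sum);
    return 0;
}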
Static Analysis to Reduce Synchronization Costs in Data-Parallel Programs
In Proc. ACM Conference on Principles of Programming Languages, 1996
"... For a program with sufficient parallelism, reducing synchronization costs is one of the most important objectives for achieving efficient execution on any parallel machine. This paper presents a novel methodology for reducing synchronization costs of programs compiled for SPMD execution. This method ..."
Abstract
-
Cited by 16 (1 self)
- Add to MetaCart
For a program with sufficient parallelism, reducing synchronization costs is one of the most important objectives for achieving efficient execution on any parallel machine. This paper presents a novel methodology for reducing the synchronization costs of programs compiled for SPMD execution. This methodology combines data flow analysis with communication analysis to determine the ordering between production and consumption of data on different processors, which helps in identifying redundant synchronization. The resulting framework is more powerful than any previously presented, as it provides the first algorithm that can eliminate synchronization messages even from computations that need communication. We show that several commonly occurring computation patterns, such as reductions and stencil computations with a reciprocal producer-consumer relationship between processors, lend themselves well to this optimization, an observation that is confirmed by an examination of some HPF...
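As a loose illustration of redundant synchronization (written in MPI rather than the HPF-style setting of the paper): the blocking receive in this shift communication already orders each consumer after its producer, so a barrier following the exchange would add no ordering and could be removed.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, size;
    double halo_out, halo_in = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    halo_out = (double)rank;                 /* value produced locally      */
    int right = (rank + 1) % size;           /* neighbor that consumes it   */
    int left  = (rank - 1 + size) % size;    /* neighbor that produces mine */

    /* The producer-consumer ordering is enforced by the message itself. */
    MPI_Sendrecv(&halo_out, 1, MPI_DOUBLE, right, 0,
                 &halo_in,  1, MPI_DOUBLE, left,  0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    /* MPI_Barrier(MPI_COMM_WORLD);  <- would be redundant here */

    printf("rank %d received %f from rank %d\n", rank, halo_in, left);
    MPI_Finalize();
    return 0;
}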
Theory, Techniques, and Experiments in Solving Recurrences in Computer Programs
1997
"... ... work. In the sixth chapter, we consider the application of these same techniques focused on obtaining parallelism in outer time-stepping loops. In the final chapter, we draw this work to a conclusion and discuss future directions in parallelizing compiler technology. ..."
Abstract
-
Cited by 16 (2 self)
- Add to MetaCart
... work. In the sixth chapter, we consider the application of these same techniques focused on obtaining parallelism in outer time-stepping loops. In the final chapter, we draw this work to a conclusion and discuss future directions in parallelizing compiler technology.
The Role of Associativity and Commutativity in the Detection and Transformation of Loop-Level Parallelism
In ICS '98: Proceedings of the 12th International Conference on Supercomputing, 1998
"... The study of theoretical and practical issues in automatic parallelization across application and language boundaries is an appropriate and timely task. In this paper, we discuss theory and techniques that we have determined useful in parallelizing recurrences and reductions in computer pro ..."
Abstract
-
Cited by 15 (4 self)
- Add to MetaCart
The study of theoretical and practical issues in automatic parallelization across application and language boundaries is an appropriate and timely task. In this paper, we discuss theory and techniques that we have found useful in parallelizing recurrences and reductions in computer programs. We present a framework for understanding such parallelism based on an approach that models loop bodies as coalescing loop operators. Within this framework we distinguish between associative coalescing loop operators and associative and commutative coalescing loop operators. We present the results of applying this theory in a case study of a modern C++ semantic retrieval application drawn from the digital library field.
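A small C/OpenMP sketch, not drawn from the paper, of the role associativity and commutativity play in such transformations: reassociating a sum recurrence lets each thread fold its own block into a partial result, and commutativity lets the partial results be combined in any order.

#include <omp.h>
#include <stdio.h>
#include <stdlib.h>
#define N 1000000

int main(void) {
    double *x = malloc(N * sizeof *x);
    int maxt = omp_get_max_threads();
    double *partial = calloc((size_t)maxt, sizeof *partial);

    for (int i = 0; i < N; i++) x[i] = 1.0 / (i + 1);

    #pragma omp parallel
    {
        int t = omp_get_thread_num();
        /* Reassociation: each thread folds its own block into a partial
         * sum; legal only because + is treated as associative here. */
        #pragma omp for schedule(static)
        for (int i = 0; i < N; i++)
            partial[t] += x[i];
    }

    /* Commutativity lets the partial sums be combined in any order. */
    double total = 0.0;
    for (int t = 0; t < maxt; t++)
        total += partial[t];

    printf("sum = %.6f\n", total);
    free(partial);
    free(x);
    return 0;
}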
Compiler Algorithms for Event Variable Synchronization
In Proceedings of the 1991 ACM International Conference on Supercomputing, 1991
"... Event variable synchronization is a well-known mechanism for enforcing data dependences in a program that runs in parallel on a shared memory multiprocessor. This paper presents compiler algorithms to automatically generate event variable synchronization code. Previously published algorithms dealt w ..."
Abstract
-
Cited by 10 (2 self)
- Add to MetaCart
(Show Context)
Event variable synchronization is a well-known mechanism for enforcing data dependences in a program that runs in parallel on a shared-memory multiprocessor. This paper presents compiler algorithms to automatically generate event variable synchronization code. Previously published algorithms dealt with single parallel loops in which dependence distances are constant and known by the compiler. However, loops in real application programs are often arbitrarily nested. Moreover, compilers are often unable to determine dependence distances. In contrast, our algorithms generate synchronization code based directly on array subscripts and do not require constant distances in data dependences. The algorithms are designed for arbitrarily nested loops, including triangular or trapezoidal loops.
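A hedged sketch of the post/wait style of event-variable synchronization such algorithms generate code for, written here with C11 atomics and two pthreads rather than compiler output; the flag array done[] plays the role of the event variables enforcing the loop-carried dependence a[i] = a[i-1] + 1.

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#define N 16

static double a[N + 1];
static atomic_int done[N + 1];          /* the "event variables" */

static void post(int i)    { atomic_store(&done[i], 1); }
static void wait_on(int i) { while (!atomic_load(&done[i])) { /* spin */ } }

static void *worker(void *arg) {
    int start = (int)(long)arg;          /* thread 1 does odd i, thread 2 even i */
    for (int i = start; i <= N; i += 2) {
        wait_on(i - 1);                  /* wait until a[i-1] has been produced  */
        a[i] = a[i - 1] + 1.0;           /* the loop-carried dependence          */
        post(i);                         /* signal consumers of a[i]             */
    }
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    a[0] = 0.0;
    post(0);                             /* "iteration 0" is already complete    */
    pthread_create(&t1, NULL, worker, (void *)1L);
    pthread_create(&t2, NULL, worker, (void *)2L);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("a[%d] = %f (expected %d)\n", N, a[N], N);
    return 0;
}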
Compile-time Synchronization Optimizations for Software DSMs
1998
"... Software distributed-shared-memory (DSM) systems provide a desirable target for parallelizing compilers due to their flexibility. However, studies show synchronization and load imbalance are significant sources of overhead. In this paper, we investigate the impact of compilation techniques for elimi ..."
Abstract
-
Cited by 10 (4 self)
- Add to MetaCart
(Show Context)
Software distributed-shared-memory (DSM) systems provide a desirable target for parallelizing compilers due to their flexibility. However, studies show synchronization and load imbalance are significant sources of overhead. In this paper, we investigate the impact of compilation techniques for eliminating synchronization overhead in software DSMs, developing new algorithms to handle situations found in practice. We evaluate the contributions of synchronization elimination algorithms based on 1) dependence analysis, 2) communication analysis, 3) exploiting coherence protocols in software DSMs, and 4) aggressive expansion of parallel SPMD regions. We also found suppressing expensive parallelism to be useful for one application. Experiments indicate these techniques eliminate almost all parallel task invocations, and reduce the number of barriers executed by 66% on average. On a 16 processor IBM SP-2, speedups are improved on average by 35%, and are tripled for some applications.
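An illustrative OpenMP sketch (not the paper's compiler output) of expanding the parallel SPMD region: enclosing both worksharing loops in a single parallel region replaces two fork/join task invocations with one, leaving only the barrier between the loops, which is still required here because the second loop reads a neighboring element.

#include <stdio.h>
#define N 4096

int main(void) {
    static double a[N], b[N];

    #pragma omp parallel               /* one task invocation instead of two */
    {
        #pragma omp for schedule(static)
        for (int i = 0; i < N; i++)
            a[i] = 0.25 * i;
        /* The implicit barrier here is kept: the next loop reads a
         * neighboring element that another thread may have written. */
        #pragma omp for schedule(static)
        for (int i = 0; i < N; i++)
            b[i] = a[i] + a[(i + 1) % N];
    }
    printf("b[0] = %f\n", b[0]);
    return 0;
}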
Eliminating Barrier Synchronization for Compiler-Parallelized Codes on Software DSMs
International Journal of Parallel Programming, 1998
"... Software distributed-shared-memory (DSM) systems provide an appealing target for parallelizing compilers due to their flexibility. Previous studies demonstrate such systems can provide performance comparable to message-passing compilers for dense-matrix kernels. However, synchronization and load imb ..."
Abstract
-
Cited by 8 (0 self)
- Add to MetaCart
(Show Context)
Software distributed-shared-memory (DSM) systems provide an appealing target for parallelizing compilers due to their flexibility. Previous studies demonstrate such systems can provide performance comparable to message-passing compilers for dense-matrix kernels. However, synchronization and load imbalance are significant sources of overhead. In this paper, we investigate the impact of compilation techniques for eliminating barrier synchronization overhead in software DSMs. Our compile-time barrier elimination algorithm extends previous techniques in three ways: 1) we perform inexpensive communication analysis through local subscript analysis when using chunk iteration partitioning for parallel loops, 2) we exploit delayed updates in lazy-release-consistency DSMs to eliminate barriers guarding only anti-dependences, 3) when possible we replace barriers with customized nearest-neighbor synchronization. Experiments on an IBM SP-2 indicate these techniques can improve parallel performance ...
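A hedged C/OpenMP sketch of the third technique, replacing a barrier with customized nearest-neighbor synchronization: in a 1-D stencil, thread t only needs its neighbors t-1 and t+1 to have finished each phase, so each thread posts a monotonically increasing per-thread counter and waits on its two neighbors instead of joining a global barrier. The flag layout and spin-waits are illustrative choices, not the paper's runtime.

#include <omp.h>
#include <stdatomic.h>
#include <stdio.h>
#define N 1024
#define STEPS 10
#define MAX_T 64

static double u[N], v[N];
static atomic_int progress[MAX_T];      /* per-thread phase counters */

static void neighbor_sync(int t, int nthreads, int phase) {
    atomic_store(&progress[t], phase);                       /* post progress */
    if (t > 0)
        while (atomic_load(&progress[t - 1]) < phase) { }    /* wait on left  */
    if (t < nthreads - 1)
        while (atomic_load(&progress[t + 1]) < phase) { }    /* wait on right */
}

int main(void) {
    for (int i = 0; i < N; i++) u[i] = (double)i;

    #pragma omp parallel num_threads(4)
    {
        int t = omp_get_thread_num(), nt = omp_get_num_threads();
        int lo = t * N / nt, hi = (t + 1) * N / nt, ph = 0;

        for (int step = 0; step < STEPS; step++) {
            for (int i = lo; i < hi; i++) {                   /* 1-D stencil   */
                int l = (i == 0) ? 0 : i - 1, r = (i == N - 1) ? N - 1 : i + 1;
                v[i] = 0.5 * (u[l] + u[r]);
            }
            neighbor_sync(t, nt, ++ph);   /* neighbors finished reading u  */
            for (int i = lo; i < hi; i++)
                u[i] = v[i];
            neighbor_sync(t, nt, ++ph);   /* neighbors finished updating u */
        }
    }
    printf("u[%d] = %f\n", N / 2, u[N / 2]);
    return 0;
}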
Reducing Synchronization Overhead for Compiler-Parallelized Codes on Software DSMs
In Languages and Compilers for Parallel Computing, Tenth International Workshop, LCPC'97, Volume 1366 of Lecture Notes in Computer Science, 1997
"... Software distributed-shared-memory (DSM) systems provide an appealing target for parallelizing compilers due to their flexibility. Previous studies demonstrate such systems can provide performance comparable to message-passing compilers for dense-matrix kernels. However, synchronizationand load imba ..."
Abstract
-
Cited by 7 (6 self)
- Add to MetaCart
(Show Context)
Software distributed-shared-memory (DSM) systems provide an appealing target for parallelizing compilers due to their flexibility. Previous studies demonstrate such systems can provide performance comparable to message-passing compilers for dense-matrix kernels. However, synchronization and load imbalance are significant sources of overhead. In this paper, we investigate the impact of compilation techniques for eliminating barrier synchronization overhead in software DSMs. Our compile-time barrier elimination algorithm extends previous techniques in three ways: 1) we perform inexpensive communication analysis through local subscript analysis when using chunk iteration partitioning for parallel loops, 2) we exploit delayed updates in lazy-release-consistency DSMs to eliminate barriers guarding only anti-dependences, 3) when possible we replace barriers with customized nearest-neighbor synchronization. Experiments on an IBM SP-2 indicate these techniques can improve parallel performance by 20% on average and by up to 60% for some applications.
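The second technique above relies on delayed updates in a lazy-release-consistency DSM, which a plain shared-memory threads sketch cannot reproduce; the loosely related renaming transformation below instead shows why a barrier that guards only an anti-dependence is removable once the overwritten array is renamed, expressed with OpenMP's nowait clause. The array a_new is an illustrative stand-in, not part of the paper.

#include <stdio.h>
#define N 4096

int main(void) {
    static double a[N], sums[N], a_new[N];

    for (int i = 0; i < N; i++) a[i] = (double)i;

    #pragma omp parallel
    {
        /* This loop reads a[i] and a[i+1]; the i+1 read can fall in the
         * next thread's chunk, so a write to a[] in the following loop
         * would require the barrier. */
        #pragma omp for schedule(static) nowait        /* barrier removed */
        for (int i = 0; i < N - 1; i++)
            sums[i] = 0.5 * (a[i] + a[i + 1]);

        /* Renaming the written array to a_new[] removes the cross-thread
         * anti-dependence on a[], so no barrier is needed between loops. */
        #pragma omp for schedule(static)
        for (int i = 0; i < N; i++)
            a_new[i] = 2.0 * a[i];
    }
    printf("%f %f\n", sums[N - 2], a_new[N - 1]);
    return 0;
}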
LAMPView: A Loop-Aware Toolset for Facilitating Parallelization
"... A continual growth of the number of transistors per unit area coupled with diminishing returns from traditional microarchitectural and clock frequency improvements has led processor manufacturers to place multiple cores on a single chip. However, only multi-threaded code can fully take advantage of ..."
Abstract
-
Cited by 5 (0 self)
- Add to MetaCart
(Show Context)
A continual growth of the number of transistors per unit area coupled with diminishing returns from traditional microarchitectural and clock frequency improvements has led processor manufacturers to place multiple cores on a single chip. However, only multi-threaded code can fully take advantage of the new multicore processors; legacy single-threaded code does not benefit. Many approaches to parallelization have been explored, including both manual and automatic techniques. Unfortunately, research in this area is impeded by the innate difficulty of exploring code by hand for new possible parallelization schemes. Whether it is a researcher attempting to discover possible automatic techniques or a programmer attempting manual parallelization, the benefits of good dependence information are substantial. This thesis provides a profiling and analysis toolset aimed at easing a programmer or researcher's effort in finding parallelism. The toolset, the Loop-Aware Memory Profile Viewing System (LAMPView), is developed in three parts. The first part is a multi-frontend, multi-target compiler pass written to instrument the code with calls to the Loop-Aware Memory Profiling (LAMP) library. The compile-time ...
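A purely hypothetical sketch of what the inserted instrumentation might look like; the LAMP library's actual interface is not given in this abstract, so the hook names below (lamp_loop_enter, lamp_mem_access, and so on) are invented for illustration. Each load and store is reported with its enclosing loop and iteration so a later analysis can detect loop-carried dependences.

#include <stdio.h>

/* Hypothetical profiling hooks; a real pass would emit calls like these. */
static void lamp_loop_enter(int loop_id) { printf("enter loop %d\n", loop_id); }
static void lamp_loop_exit(int loop_id)  { printf("exit loop %d\n", loop_id); }
static void lamp_mem_access(int loop_id, long iter, void *addr, int is_write) {
    printf("loop %d iter %ld %s %p\n",
           loop_id, iter, is_write ? "store" : "load", addr);
}

int main(void) {
    double a[8] = {0.0};

    /* Original loop: a[i] = a[i-1] + 1 (a loop-carried flow dependence). */
    lamp_loop_enter(1);
    for (long i = 1; i < 8; i++) {
        lamp_mem_access(1, i, &a[i - 1], 0);   /* load  a[i-1] */
        a[i] = a[i - 1] + 1.0;
        lamp_mem_access(1, i, &a[i], 1);       /* store a[i]   */
    }
    lamp_loop_exit(1);
    return 0;
}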