Experience in the Automatic Parallelization of Four Perfect-Benchmark Programs
- Lecture Notes in Computer Science 589. Proceedings of the Fourth Workshop on Languages and Compilers for Parallel Computing
, 1991
Cited by 88 (25 self)
This paper discusses the techniques used to hand-parallelize, for the Alliant FX/80, four Fortran programs from the Perfect-Benchmark suite. The paper also includes the execution times of the programs before and after the transformations. The four programs considered here were not effectively parallelized by the automatic translators available to the authors. However, most of the techniques used for hand parallelization, and perhaps all of them, have wide applicability and can be incorporated into existing translators. 1. Introduction It is by now widely accepted that in many real-life applications, supercomputers have been unable to deliver a reasonable fraction of their peak performance. An illustration of this is provided by the Perfect Benchmark programs [3], many of which effectively use less than 1% of the computational resources available in the most powerful supercomputers. While it is apparent that the reason for this dismal behavior is sometimes the result of the machine ...
Compiler Optimizations for Eliminating Barrier Synchronization
- In Proceedings of the Fifth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
, 1995
Cited by 75 (13 self)
This paper presents novel compiler optimizations for reducing synchronization overhead in compiler-parallelized scientific codes. A hybrid programming model is employed to combine the flexibility of the fork-join model with the precision and power of the single-program, multiple-data (SPMD) model. By exploiting compile-time computation partitions, communication analysis can eliminate barrier synchronization or replace it with less expensive forms of synchronization. We show that computation partitions and data communication can be represented as systems of symbolic linear inequalities for high flexibility and precision. These optimizations have been implemented in the Stanford SUIF compiler. We extensively evaluate their performance using standard benchmark suites. Experimental results show barrier synchronization is reduced by 29% on average and by several orders of magnitude for certain programs. 1 Introduction Parallel machines with shared address spaces and coherent caches provide an attracti...
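To make the idea of partition-based communication analysis concrete, here is a minimal Python sketch. It is not the SUIF implementation: the abstract describes symbolic linear inequalities, whereas this sketch simply enumerates iterations for a concrete problem size. The function names, the block partition, and the single-offset access pattern are all assumptions made for illustration. Loop 1 is assumed to write `a[i]` and loop 2 to read `a[i + read_offset]`, both partitioned identically.

```python
def owner(i, n, nprocs):
    """Processor that runs iteration i under a block partition of range(n) (assumed scheme)."""
    block = -(-n // nprocs)  # ceiling division
    return i // block


def cross_processor_reads(n, nprocs, read_offset):
    """(reader, writer) processor pairs for which loop 2 reads data that a
    different processor wrote in loop 1."""
    pairs = set()
    for i in range(n):
        j = i + read_offset  # element read by iteration i of loop 2
        if 0 <= j < n:
            reader, writer = owner(i, n, nprocs), owner(j, n, nprocs)
            if reader != writer:
                pairs.add((reader, writer))
    return pairs


def barrier_needed(n, nprocs, read_offset):
    """A barrier between the loops is only needed if some data crosses processors."""
    return bool(cross_processor_reads(n, nprocs, read_offset))
```

With `n = 16` and four processors, an offset of 0 produces no cross-processor reads, so the barrier between the two loops could be dropped. An offset of 1 produces only the pairs (0, 1), (1, 2) and (2, 3), which suggests that a cheaper pairwise synchronization might replace the full barrier.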
Compile-time Synchronization Optimizations for Software DSMs
, 1998
Cited by 9 (4 self)
Software distributed-shared-memory (DSM) systems provide a desirable target for parallelizing compilers due to their flexibility. However, studies show synchronization and load imbalance are significant sources of overhead. In this paper, we investigate the impact of compilation techniques for eliminating synchronization overhead in software DSMs, developing new algorithms to handle situations found in practice. We evaluate the contributions of synchronization elimination algorithms based on 1) dependence analysis, 2) communication analysis, 3) exploiting coherence protocols in software DSMs, and 4) aggressive expansion of parallel SPMD regions. We also found suppressing expensive parallelism to be useful for one application. Experiments indicate these techniques eliminate almost all parallel task invocations and reduce the number of barriers executed by 66% on average. On a 16-processor IBM SP-2, speedups are improved on average by 35% and are tripled for some applications.
Reducing synchronization overhead for compiler-parallelized codes on software DSMs
- Languages and Compilers for Parallel Computing, Tenth International Workshop, LCPC'97, volume 1366 of Lecture Notes in Computer Science
, 1997
Cited by 7 (6 self)
Software distributed-shared-memory (DSM) systems provide an appealing target for parallelizing compilers due to their flexibility. Previous studies demonstrate such systems can provide performance comparable to message-passing compilers for dense-matrix kernels. However, synchronization and load imbalance are significant sources of overhead. In this paper, we investigate the impact of compilation techniques for eliminating barrier synchronization overhead in software DSMs. Our compile-time barrier elimination algorithm extends previous techniques in three ways: 1) we perform inexpensive communication analysis through local subscript analysis when using chunk iteration partitioning for parallel loops, 2) we exploit delayed updates in lazy-release-consistency DSMs to eliminate barriers guarding only anti-dependences, 3) when possible we replace barriers with customized nearest-neighbor synchronization. Experiments on an IBM SP-2 indicate these techniques can improve parallel performance by 20% on average and by up to 60% for some applications.
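As a rough illustration of how subscript offsets might drive the choice between no synchronization, nearest-neighbor synchronization, and a full barrier, consider the Python sketch below. It is an assumption-laden simplification rather than the algorithm from the paper: `classify_sync`, its arguments, and the three result labels are hypothetical. It assumes a producer loop that writes `a[i]`, a consumer loop that reads `a[i + d]` for each constant offset `d`, and identical chunk partitioning of both loops. It does not model anti-dependences or the lazy-release-consistency optimization.

```python
def classify_sync(offsets, chunk_size):
    """Pick the cheapest synchronization that is still safe between two
    chunk-partitioned loops (illustrative rule, not the published algorithm).

    offsets    -- constant subscript offsets d in the consumer's reads a[i + d]
    chunk_size -- number of consecutive iterations assigned to each processor
    """
    if all(d == 0 for d in offsets):
        # Every processor reads only what it wrote itself.
        return "none"
    if all(abs(d) <= chunk_size for d in offsets):
        # Reads stay within the processor's own chunk or an adjacent chunk.
        return "nearest-neighbor"
    # Data may come from processors further away.
    return "barrier"
```

For example, a three-point stencil with offsets `[-1, 0, 1]` and chunks of eight iterations is classified as `"nearest-neighbor"`, while an offset of 20 with the same chunk size falls back to `"barrier"`.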
An Empirical Study on DOACROSS Loops
- In Supercomputing '91
, 1991
Cited by 6 (2 self)
Loop-iteration-level parallelism is one of the most common forms of parallelism exploited by optimizing compilers and parallel machines. In this study, we selected 6 large application programs and used an execution-driven simulation technique from MaxPar [5] to identify and measure the effectiveness of executing DOACROSS loops concurrently. It was found that executing DOACROSS loops serially can significantly degrade performance for some of the programs. We also measured and studied the characteristics of the cross-iteration dependences in DOACROSS loops and measured the capability of a state-of-the-art parallelizing compiler, KAP, in identifying and eliminating cross-iteration dependences. 1 Introduction Experience has shown that applying parallelizing techniques to application programs can yield drastically different results, ranging from excellent to embarrassingly poor. It is quite natural that we ask the following questions: Does the poor performance come fro...
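For readers unfamiliar with the term, a DOACROSS loop is one whose iterations can overlap in time even though some statement depends on an earlier iteration. The Python sketch below shows the general shape under stated assumptions: the function `doacross_prefix`, the recurrence `a[i] = a[i-1] + work(b[i])`, and the one-thread-per-iteration layout are invented for illustration and are not taken from the study. Each iteration does its independent work first, then waits for its predecessor before touching the shared recurrence (the classic post/wait pattern). Because of Python's global interpreter lock, this shows the synchronization structure rather than a real speedup.

```python
import threading


def doacross_prefix(b, work):
    """Run the loop  a[i] = a[i-1] + work(b[i])  with one thread per iteration."""
    n = len(b)
    a = [0] * n
    done = [threading.Event() for _ in range(n)]

    def iteration(i):
        local = work(b[i])       # independent part: may overlap across iterations
        if i > 0:
            done[i - 1].wait()   # wait: the cross-iteration dependence on a[i-1]
        a[i] = (a[i - 1] if i > 0 else 0) + local
        done[i].set()            # post: a[i] is now available to iteration i+1

    threads = [threading.Thread(target=iteration, args=(i,)) for i in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return a
```

Calling `doacross_prefix([1, 2, 3, 4], lambda x: x * x)` returns the same result as the serial loop. Executing such a loop serially would also serialize every call to `work`, which is the cost the abstract refers to.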
Eliminating Barrier Synchronization for Compiler-Parallelized Codes on Software DSMs
- International Journal of Parallel Programming
, 1998
Cited by 6 (0 self)
Software distributed-shared-memory (DSM) systems provide an appealing target for parallelizing compilers due to their flexibility. Previous studies demonstrate such systems can provide performance comparable to message-passing compilers for dense-matrix kernels. However, synchronization and load imbalance are significant sources of overhead. In this paper, we investigate the impact of compilation techniques for eliminating barrier synchronization overhead in software DSMs. Our compile-time barrier elimination algorithm extends previous techniques in three ways: 1) we perform inexpensive communication analysis through local subscript analysis when using chunk iteration partitioning for parallel loops, 2) we exploit delayed updates in lazy-release-consistency DSMs to eliminate barriers guarding only anti-dependences, 3) when possible we replace barriers with customized nearest-neighbor synchronization. Experiments on an IBM SP-2 indicate these techniques can improve parallel performance ...
Compiler techniques for concurrent multithreading with hardware speculation support
- In Proceedings of the 9th Workshop on Languages and Compilers for Parallel Computing
, 1996
Cited by 5 (1 self)
Recently proposed concurrent multithreading architectures employ sophisticated hardware to support speculation on control and data dependences as well as run-time data dependence checks, enabling parallelization of program regions, such as while-loops, that were previously ignored. The new architectures require compilers to put more emphasis on the formation and selection of parallel threads. Compilers also play an important role in reducing the cost of run-time data dependence checks. This paper discusses these new issues.
Tests des Dépendances et Transformations de Programme
, 1993
Cited by 4 (1 self)
The parallelization of sequential programs requires several stages: analysis of dependence relations, representation of these dependences, and application of transformations using this representation to find a parallel schedule for the program instructions. The success of parallelization depends on the precision of the dependence test and the dependence representation used. In this thesis, we present and compare different dependence test algorithms and different data dependence abstractions. The algorithm of the PIPS parallelizer is based on an approximate feasibility test using Fourier-Motzkin elimination. Our experiments show that, in practice, it is accurate enough for treating dependence systems, and that its practical complexity is polynomial. Different dependence abstractions have different precision. For deciding whether a transformation is legal, several abstractions are admissible, meaning they contain enough information for knowing if this transformation is legal. The minimal a...
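The following Python sketch shows how Fourier-Motzkin elimination can serve as an approximate dependence test. It is a textbook-style illustration, not the PIPS algorithm: the names `eliminate`, `feasible`, and `dependence_system` are hypothetical, and the example loop (a write to `a[i1 + offset]` and a read of `a[i2]` with both indices in `[lo, hi]`) is an assumption. The test is approximate because it checks for rational solutions; a system with a rational but no integer solution would be reported as a possible dependence.

```python
from fractions import Fraction


def eliminate(constraints, k):
    """Eliminate variable k from constraints of the form sum(c[j] * x[j]) <= b."""
    pos, neg, rest = [], [], []
    for c, b in constraints:
        (pos if c[k] > 0 else neg if c[k] < 0 else rest).append((c, b))
    for cp, bp in pos:
        for cn, bn in neg:
            # Scale so x[k] has coefficients +1 and -1, then add the two rows.
            sp, sn = Fraction(1) / cp[k], Fraction(1) / -cn[k]
            row = [sp * u + sn * v for u, v in zip(cp, cn)]
            rest.append((row, sp * bp + sn * bn))
    return rest


def feasible(constraints, nvars):
    """True if the system has a rational solution (dependence cannot be ruled out)."""
    cs = [([Fraction(x) for x in c], Fraction(b)) for c, b in constraints]
    for k in range(nvars):
        cs = eliminate(cs, k)
    # Only constraints of the form 0 <= b remain.
    return all(b >= 0 for _, b in cs)


def dependence_system(offset, lo, hi):
    """Constraints over (i1, i2): i1 + offset == i2, lo <= i1 <= hi, lo <= i2 <= hi."""
    return [
        ([1, -1], -offset),  # i1 - i2 <= -offset
        ([-1, 1], offset),   # i2 - i1 <= offset
        ([-1, 0], -lo), ([1, 0], hi),
        ([0, -1], -lo), ([0, 1], hi),
    ]
```

For a loop over 1..10, `feasible(dependence_system(10, 1, 10), 2)` is false, so no dependence exists and the loop iterations are independent with respect to this access pair. With an offset of 1 the system is feasible, so a dependence must be assumed.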
Efficient Machine-Independent Programming of High-Performance Multiprocessors
, 1995
Cited by 2 (2 self)
mainly because the cost of interprocessor communication is too great compared to computation and local memory accesses [74, 77]. To achieve high performance, COSMIC will perform communication analysis and apply optimizations for locality, synchronization, communication, and memory system effects. COSMIC follows two basic guidelines. First, it uses compilation techniques for message-passing machines to retain most of the benefits of explicit messages. Second, it exploits architectural and operating system support available in shared-memory multiprocessors to improve flexibility and performance. A novel characteristic of COSMIC will be its ability to take advantage of the multiple coherence protocols and hybrid message-passing support found in software Distributed-Shared-Memory (DSM) systems and Flexible-Shared-Memory (FSM) machines. To evaluate the impact on the performance of scientific applications, I will test COSM

