Data and Computation Transformations for Multiprocessors
- In Proceedings of the Fifth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
, 1995
Cited by 156 (14 self)
Effective memory hierarchy utilization is critical to the performance of modern multiprocessor architectures. We have developed the first compiler system that fully automatically parallelizes sequential programs and changes the original array layouts to improve memory system performance. Our optimization algorithm consists of two steps. The first step chooses the parallelization and computation assignment such that synchronization and data sharing are minimized. The second step then restructures the layout of the data in the shared address space with an algorithm that is based on a new data transformation framework. We ran our compiler on a set of application programs and measured their performance on the Stanford DASH multiprocessor. Our results show that the compiler can effectively optimize parallelism in conjunction with memory subsystem performance. 1 Introduction In the last decade, microprocessor speeds have been steadily improving at a rate of 50% to 100% every year [16]. Meanwh...
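The second step in the abstract above, restructuring array layout so that the data each processor touches sits together in the shared address space, can be pictured with a small sketch. This is not the paper's algorithm: the column-block ownership rule, the function names, and the assumption that the column count divides evenly among processors are all hypothetical choices made for illustration.

```python
def row_major_addr(i, j, n_cols):
    """Linear address of element (i, j) in the original row-major layout."""
    return i * n_cols + j


def blocked_addr(i, j, n_rows, n_cols, n_procs):
    """Linear address of (i, j) after grouping columns by owning processor.

    Assumption (not from the paper): n_cols is divisible by n_procs and
    column j is owned by processor j // block.
    """
    block = n_cols // n_procs          # columns per processor
    owner = j // block                 # processor that owns column j
    local_j = j % block                # column index inside the owner's block
    return owner * (n_rows * block) + i * block + local_j


def is_contiguous(addrs):
    """True if the addresses form one unbroken range."""
    ordered = sorted(addrs)
    return ordered == list(range(ordered[0], ordered[0] + len(ordered)))


if __name__ == "__main__":
    n_rows, n_cols, n_procs = 4, 8, 2
    block = n_cols // n_procs
    cols = range(block, 2 * block)     # columns owned by processor 1
    before = [row_major_addr(i, j, n_cols) for i in range(n_rows) for j in cols]
    after = [blocked_addr(i, j, n_rows, n_cols, n_procs)
             for i in range(n_rows) for j in cols]
    print("row-major contiguous:", is_contiguous(before))
    print("blocked contiguous:  ", is_contiguous(after))
```

In the row-major layout, processor 1's elements are interleaved with processor 0's on every row; after the remapping they occupy a single address range, which is the kind of property a layout transformation aims for.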
SUIF Explorer: an interactive and interprocedural parallelizer
, 1999
Cited by 55 (5 self)
The SUIF Explorer is an interactive parallelization tool that is more effective than previous systems in minimizing the number of lines of code that require programmer assistance. First, the interprocedural analyses in the SUIF system are successful in parallelizing many coarse-grain loops, thus minimizing the number of spurious dependences requiring attention. Second, the system uses dynamic execution analyzers to identify those important loops that are likely to be parallelizable. Third, the SUIF Explorer is the first to apply program slicing to aid programmers in interactive parallelization. The system guides the programmer in the parallelization process using a set of sophisticated visualization techniques. This paper demonstrates the effectiveness of the SUIF Explorer with three case studies. The programmer was able to speed up all three programs by examining only a small fraction of the program and privatizing a few variables. 1. Introduction Exploiting coarse-grain parallelism i...
Automatic Selection of Dynamic Data Partitioning Schemes for Distributed-Memory Multicomputers
- in Proceedings of the 8th Workshop on Languages and Compilers for Parallel Computing
, 1995
Cited by 25 (3 self)
For distributed-memory multicomputers such as the Intel Paragon, the IBM SP-1/SP-2, the NCUBE/2, and the Thinking Machines CM-5, the quality of the data partitioning for a given application is crucial to obtaining high performance. This task has traditionally been the user's responsibility, but in recent years much effort has been directed to automating the selection of data partitioning schemes. Several researchers have proposed systems that are able to produce data distributions that remain in effect for the entire execution of an application. For complex programs, however, such static data distributions may be insufficient to obtain acceptable performance. The selection of distributions that dynamically change over the course of a program's execution adds another dimension to the data partitioning problem. In this paper, we present a technique that can be used to automatically determine which partitionings are most beneficial over specific sections of a program while taking into a...
Compiler Algorithms for Optimizing Locality and Parallelism on Shared and Distributed Memory Machines
- PACT '97
, 1997
Cited by 15 (9 self)
Distributed memory message passing machines can deliver scalable performance but are difficult to program. Shared memory machines, on the other hand, are easier to program, but obtaining scalable performance with a large number of processors is difficult. Recently, some scalable architectures based on logically-shared physically-distributed memory have been designed and implemented. While some of the performance issues like parallelism and locality are common to the different parallel architectures, issues such as data decomposition are unique to specific types of architectures. One of the most important challenges compiler writers face is to design compilation techniques that can work on a variety of architectures. In this paper, we propose an algorithm that can be employed by optimizing compilers for different types of parallel architectures. Our optimization algorithm does the following: (1) transforms loop nests such that, where possible, the outermost loops can be run in parallel across processors; (2) decomposes each array across processors; (3) optimizes interprocessor communication by vectorizing it whenever possible; and (4) optimizes locality (cache performance) by assigning an appropriate storage layout for each array. Depending on the underlying hardware system, some or all of these steps can be applied in a unified framework. We present simulation results for cache miss rates, and empirical results on SUN SPARCstation 5, IBM SP-2, SGI Challenge and Convex Exemplar to validate the effectiveness of our approach on different architectures.
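Steps (1) and (2) of the abstract above, running the outermost loop in parallel and decomposing each array across processors, can be illustrated with a simple block decomposition. This is only a sketch under stated assumptions: the block rule, the helper names `block_bounds` and `parallel_row_sums`, and the sequential emulation of processors are hypothetical and are not taken from the paper.

```python
def block_bounds(n, n_procs, rank):
    """Half-open index range [lo, hi) owned by processor `rank`.

    Assumed rule: blocks differ in size by at most one element, with the
    first n % n_procs processors receiving the extra element.
    """
    base, extra = divmod(n, n_procs)
    lo = rank * base + min(rank, extra)
    hi = lo + base + (1 if rank < extra else 0)
    return lo, hi


def parallel_row_sums(matrix, n_procs):
    """Emulate an outermost parallel loop over rows.

    Each emulated processor handles only the rows it owns; the iterations
    are independent, so the result matches a sequential loop.
    """
    sums = [0] * len(matrix)
    for rank in range(n_procs):            # stands in for concurrent processors
        lo, hi = block_bounds(len(matrix), n_procs, rank)
        for i in range(lo, hi):            # rows local to this processor
            sums[i] = sum(matrix[i])
    return sums


if __name__ == "__main__":
    data = [[r * 10 + c for c in range(4)] for r in range(10)]
    print([block_bounds(10, 3, rank) for rank in range(3)])
    print(parallel_row_sums(data, 3) == [sum(row) for row in data])
```

Because each row is read and written by exactly one owner, this decomposition needs no interprocessor communication for the row sums; loops with cross-row dependences would be the cases where steps (3) and (4) matter.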
Automatic Alignment of Array Data and Processes To Reduce Communication Time on DMPPs
, 1995
Cited by 11 (0 self)
This paper investigates the problem of aligning data and processes in a distributed-memory implementation. We present complete algorithms for compile-time analysis, the necessary program restructuring, and subsequent code generation, and discuss their complexity. Finally, we evaluate the practical usefulness by quantitative experimentation. The technique presented analyzes complete programs, including branches, loops, and nested parallelism. Alignment is determined with respect to offset, stride, and general axis relations. The placement of both data and processes is computed in a unifying framework based on an extended preference graph and its analysis. Furthermore, dynamic redistribution and replication are considered in the same technique. The experimental results are very encouraging. The optimization algorithms implemented in the Modula-2* compiler improved the execution times of the programs by over 40% on a MasPar MP-1 with 16384 processors. This paper appeared in: Proceedings of th...
Unified Compilation Techniques for Shared and Distributed Address Space Machines
- In Proceedings of the 1995 ACM International Conference on Supercomputing
, 1995
Cited by 9 (4 self)
Parallel machines with shared address spaces are easy to program because they provide hardware support that allows each processor to transparently access non-local data. However, obtaining scalable performance can be difficult due to memory access and synchronization overhead. In this paper, we use profiling and simulation studies to identify the sources of parallel overhead. We demonstrate that compilation techniques for distributed address space machines can be very effective when used in compilers for shared address space machines. Automatic data decomposition can co-locate data and computation to improve locality. Data reorganization transformations can reduce harmful cache effects. Communication analysis can eliminate barrier synchronization. We present a set of unified compilation techniques that exemplify this convergence in compilers for shared and distributed address space machines, and illustrate their effectiveness using two example applications.
Modeling Data-Parallel Programs with the Alignment-Distribution Graph
, 1994
Cited by 8 (4 self)
We present an intermediate representation of a program called the Alignment-Distribution Graph that exposes the communication requirements of the program. The representation exploits ideas developed in the static single assignment form of programs, but is tailored for communication optimization. It serves as the basis for algorithms that map the array data and program computation to the nodes of a distributed-memory parallel computer so as to minimize completion time. We describe the details of the representation, explain its construction from source text, show its use in modeling communication cost, outline several algorithms for determining mappings that approximately minimize residual communication, and compare it with other related intermediate representations of programs.
Compiler Optimizations for Parallel Sparse Programs with Array Intrinsics of Fortran 90
, 1999
Cited by 6 (2 self)
In our recent work, we have been providing parallel sparse support for the array intrinsics of Fortran 90. Our supporting library uses a two-level design. In the low-level routines, it requires the input sparse matrices to be specified with compression/distribution schemes for array functions. In the high-level representations, sparse array functions are overloaded with Fortran 90 array intrinsic interfaces so that programmers need not be concerned with low-level details. This raises a very interesting optimization problem: how to transform high-level representations into low-level routines by automatically selecting and supplying distribution and compression schemes for sparse arrays. In this paper, we propose solutions to this optimization problem. The optimization problem is shown to be NP-hard. We develop a heuristic algorithm based on annotated program graphs, and the algorithm is shown to be practical. Experimental results on an IBM SP-2 show that the selection algo...
Array Operation Synthesis to Optimize HPF programs
- In Proceedings of the 25th International Conference on Parallel Processing (ICPP'96)
Cited by 4 (2 self)
An increasing number of programming languages, such as Fortran 90, HPF, and APL, are providing a rich set of intrinsic array functions and array expressions. These constructs, which constitute an important part of data parallel languages, provide excellent opportunities for compiler optimizations. The synthesis of consecutive array operations or array expressions into a composite access function of the source arrays at compile time has been shown [2] to reduce redundant data movement, temporary storage usage, and loop synchronization overhead on flat shared memory parallel machines with uniform memory accesses. However, it remains open how the synthesis scheme can be incorporated into optimizing HPF programs on distributed memory parallel machines by taking into account communication costs. In this paper, we propose solutions to address this open problem. We first apply the array synthesis scheme (developed earlier by us for Fortran 90 programs) to HPF programs and demonstrat...
pC*: Efficient and Portable Runtime Support for Data-Parallel Languages
, 1996
Cited by 1 (0 self)
CHAPTER 1: SO WHAT'S THIS ALL ABOUT
CHAPTER 2: INTRODUCTION TO C* AND PC*
2.1 Overview of C*
2.1.1 Shape and Parallel Execution
2.1.2 Communication and Position Addressing
2.1.3 Contextualization
2.1.4 Summary
2.2 The pC* Implementation of C*
2.2.1 Genesis
2.2.2 Basic Implementation Model
2.2.3 Current Status
2.3 Related Parallel and Data-Parallel Systems
CHAP...

