Results 1 - 10 of 73
Maximizing Multiprocessor Performance with the SUIF Compiler
, 1996
Cited by 229 (22 self)
This paper presents an overview of the SUIF compiler, which automatically parallelizes and optimizes sequential programs for shared-memory multiprocessors. We describe new technology in this system for locating coarse-grain parallelism and for optimizing multiprocessor memory behavior essential to obtaining good multiprocessor performance. These techniques have a significant impact on the performance of half of the NAS and SPECfp95 benchmark suites. In particular, we achieve the highest SPECfp95 ratio to date of 63.9 on an eight-processor 440MHz Digital AlphaServer.
1 Introduction
Affordable shared-memory multiprocessors can potentially deliver supercomputer-like performance to the general public. Today, these machines are mainly used in a multiprogramming mode, increasing system throughput by running several independent applications in parallel. The multiple processors can also be used together to accelerate the execution of single applications. Automatic parallelization is a promis...
Using Integer Sets for Data-Parallel Program Analysis and Optimization
- In Proceedings of the SIGPLAN '98 Conference on Programming Language Design and Implementation
, 1998
Cited by 55 (30 self)
In this paper, we describe our experience with using an abstract integer-set framework to develop the Rice dHPF compiler, a compiler for High Performance Fortran. We present simple, yet general formulations of the major computation partitioning and communication analysis tasks as well as a number of important optimizations in terms of abstract operations on sets of integer tuples. This approach has made it possible to implement a comprehensive collection of advanced optimizations in dHPF, and to do so in the context of a more general computation partitioning model than previous compilers. One potential limitation of the approach is that the underlying class of integer set problems is fundamentally unable to represent HPF data distributions on a symbolic number of processors. We describe how we extend the approach to compile codes for a symbolic number of processors, without requiring any changes to the set formulations for the above optimizations. We show experimentally that the set re...
Approaches for Integrating Task and Data Parallelism
- IEEE Concurrency
, 1998
Cited by 38 (4 self)
Languages that support both task and data parallelism are highly general and can exploit both forms of parallelism within a single application. However, integrating the two forms of parallelism cleanly and within a coherent programming model is difficult. This paper describes four languages (Fx, Opus, Orca, and Braid) that try to achieve such an integration and identifies several problems. The main problems are how to support both SPMD and MIMD style programs, how to organize the address space of a parallel program, and how to design the integrated model such that it can be implemented efficiently.
Keywords: Parallel programming systems, task parallelism, data parallelism, shared objects, coordination languages, Fx, Opus, Braid, Orca, HPF.
Introduction
Most parallel programming systems are based either on task parallelism or on data parallelism. Task parallelism (also known as control or process parallelism) allows the programmer to define different types of processes. These ...
Advanced Compilation Techniques in the PARADIGM Compiler for Distributed-Memory Multicomputers
, 1995
Cited by 34 (2 self)
The PARADIGM compiler project provides an automated means to parallelize programs, written in a serial programming model, for efficient execution on distributed-memory multicomputers. A previous implementation of the compiler based on the PTD representation allowed symbolic array sizes, affine loop bounds and array subscripts, and a variable number of processors, provided that arrays were single- or multi-dimensionally block distributed. The techniques presented here extend the compiler to also accept multidimensional cyclic and block-cyclic distributions within a uniform symbolic framework. These extensions demand more sophisticated symbolic manipulation capabilities. A novel aspect of our approach is to meet this demand by interfacing PARADIGM with a powerful off-the-shelf symbolic package, Mathematica(TM). This paper describes some of the Mathematica(TM) routines that perform various transformations, shows how they are invoked and used by the compiler to overcome the new challenges, and presents experimental results for code involving cyclic and block-cyclic arrays as evidence of the feasibility of the approach.
Efficient Symbolic Analysis for Parallelizing Compilers and Performance Estimators
, 1998
Cited by 24 (7 self)
Symbolic analysis is of paramount importance for parallelizing compilers and performance estimators to examine symbolic expressions with program unknowns such as machine and problem sizes and to solve queries based on systems of constraints (equalities and inequalities). This paper describes novel techniques for counting the number of solutions to a system of constraints, simplifying systems of constraints, computing lower and upper bounds of symbolic expressions, and determining the relationship between symbolic expressions. All techniques target wide classes of linear and non-linear symbolic expressions and systems of constraints. Our techniques have been implemented and are used as part of a parallelizing compiler and a performance estimator to support analysis and optimization of parallel programs. Various examples and experiments demonstrate the effectiveness of our symbolic analysis techniques.
Keywords: Program analysis, symbolic expressions, comparing symbolic expressions, ...
Transformations to parallel codes for communication-computation overlap
- In Supercomputing 2005
, 2005
Cited by 19 (3 self)
This paper presents program transformations directed toward improving communication-computation overlap in parallel programs that use MPI’s collective operations. Our transformations target a wide variety of applications focusing on scientific codes with computation loops that exhibit limited dependence among iterations. We include guidance for developers for transforming an application code in order to exploit the communication-computation overlap available in the underlying cluster, as well as a discussion of the performance improvements achieved by our transformations. We present results from a detailed study of the effect of the problem and message size, level of communication-computation overlap, and amount of communication aggregation on runtime performance in a cluster environment based on an RDMA-enabled network. The targets of our study are two scientific codes written by domain scientists, but the applicability of our work extends far beyond the scope of these two applications.
Algorithms for Supporting Compiled Communication
- IEEE Transactions on Parallel and Distributed Systems
, 2003
Cited by 15 (5 self)
In this paper, we investigate the compiler algorithms to support compiled communication in multiprocessor environments and study the benefits of compiled communication assuming that the underlying network is an all-optical Time-Division-Multiplexing (TDM) network. We present an experimental compiler, E-SUIF, that supports compiled communication for High Performance Fortran (HPF)-like programs on all-optical TDM networks, and we describe and evaluate the compiler algorithms used in E-SUIF. We further demonstrate the effectiveness of compiled communication on all-optical TDM networks by comparing the performance of compiled communication with that of a traditional communication method using a number of application programs.
Integrating Task and Data Parallelism Using Shared Objects
, 1996
Cited by 14 (8 self)
Supporting both task and data parallelism in one programming system is useful, since many applications need both types of parallelism. We present a programming model that integrates task and data parallelism using shared objects. The model is a generalization of shared objects in Orca. Orca is a task parallel language that uses shared objects for communication between processes and for storing shared (possibly replicated) data. Our new model also uses shared objects for partitioning of shared data and for distribution of work in a data parallel way. Data parallelism is introduced by executing operations on a partitioned object in parallel. The paper describes the design of the new model, its implementation, and its usage for parallel applications that use mixed task and data parallelism.
1 Introduction
Most parallel programming systems are based either on data parallelism or on task parallelism. The advantage of data parallelism is that it is easy to use. The programmer merely specifi...
A Global Communication Optimization Technique Based on Data-Flow Analysis and Linear Algebra
, 1998
Cited by 14 (1 self)
Reducing communication overhead is extremely important in distributed-memory message-passing architectures. In this paper, we present a technique to improve communication that considers data access patterns of the entire program. Our approach is based on a combination of traditional data-flow analysis and a linear algebra framework, and works on structured programs with conditional statements and nested loops but without arbitrary goto statements. The distinctive features of the solution are the accuracy in keeping communication set information, support for general alignments and distributions (including block-cyclic distributions), and the ability to simulate some of the previous approaches with suitable modifications. We also show how optimizations such as message vectorization, message coalescing, and redundancy elimination are supported by our framework. Experimental results on several benchmarks show that our technique is effective in reducing the number of messages (an average of 32%...

