Results 1 - 10 of 70
The PARADIGM Compiler for Distributed-Memory Message Passing Multicomputers
- IEEE Computer
, 1994
Cited by 98 (9 self)
The PARADIGM compiler project provides an automated means to parallelize programs, written in a serial programming model, for efficient execution on distributed-memory multicomputers. In addition to performing traditional compiler optimizations, PARADIGM is unique in that it addresses many other issues within a unified platform: automatic data distribution, synthesis of high-level communication, communication optimizations, irregular computations, functional and data parallelism, and multithreaded execution. This paper describes the techniques used and provides experimental evidence of their effectiveness.
1 Introduction
Distributed-memory massively parallel multicomputers can provide the high levels of performance required to solve the Grand Challenge computational science problems [16]. Distributed-memory multicomputers such as the Intel iPSC/860, the Intel Paragon, the IBM SP-1 and the Thinking Machines CM-5 offer significant advantages over shared-memory multiprocessors in terms...
Task Parallelism in a High Performance Fortran Framework
- IEEE Parallel and Distributed Technology
, 1994
Cited by 83 (18 self)
High Performance Fortran (HPF) has emerged as a standard dialect of Fortran for data parallel computing. However, for a wide variety of applications, both task and data parallelism must be exploited to achieve the best possible performance on a multicomputer. We present the design and implementation of a Fortran compiler that integrates task and data parallelism in an HPF framework. A small set of simple directives allow users to express task parallel programs in a variety of domains. The user identifies opportunities for task parallelism, and the compiler handles task creation and management, as well as communication between tasks. Since a unified compiler handles both task parallelism and data parallelism, existing data parallel programs and libraries can serve as the building blocks for constructing larger task parallel programs. This paper concludes with a description of several parallel application kernels that were developed with the compiler. The examples demonstrate that exploi...
An Implementation Of The Hamlyn Sender-Managed Interface Architecture
- In Proc. of the 2nd Symp. on Operating Systems Design and Implementation
, 1996
Cited by 72 (1 self)
Introduction
Processors are rapidly getting faster, and message-passing multicomputer interconnections are doing the same, thanks to recent developments in Gb/s links and low-latency packet switches. But the cost of passing messages between applications also includes the overhead of crossing interfaces between the operating system (OS), a device driver, and the hardware, which can be orders of magnitude more than the cost of moving a message's bits across the wires. Hamlyn is an architecture for processor-interconnection interfaces that addresses this difficulty. It achieves both low latency and high bandwidth, isolates applications from each other's mistakes, and supplies a rich set of message-delivery semantics. It does so by exploiting several techniques: Sender-based memory management. Senders, not receivers, choose the destination memory address at which messages are deposited. This means that messages are sent only when the sender knows th
Generating Communication for Array Statements: Design, Implementation, and Evaluation
- J. Parallel and Distributed Computing
, 1994
Cited by 65 (12 self)
Array statements as included in Fortran 90 or High Performance Fortran (HPF) are a well-accepted way to specify data parallelism in programs. When generating code for such a data parallel program for a private memory parallel system, the compiler must determine when array elements must be moved from one processor to another. This paper describes a practical method to compute the set of array elements that are to be moved; it covers all the distributions that are included in HPF: block, cyclic, and block-cyclic. This method is the foundation for an efficient protocol for modern private memory parallel systems: for each block of data to be sent, the sender processor computes the local address in the receiver's address space, and the address is then transmitted together with the data. This strategy increases the communication load but reduces the overhead on the receiving processor. We implemented this optimization in an experimental Fortran compiler, and this paper reports an empirical ev...
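To make the three HPF distribution kinds named in this abstract concrete, here is a minimal Python sketch that maps a global array index to its owning processor and to the local offset on that processor. It is an illustration only, not the paper's method: the one-dimensional array, the 0-based indexing, and the names `owner_and_local`, `n`, `p`, and `b` are all assumptions introduced here.

```python
def owner_and_local(i, n, p, kind="block", b=1):
    """Return (owner, local_index) for global index i of an n-element array
    spread over p processors. Hypothetical helper; all indices are 0-based."""
    if not 0 <= i < n:
        raise IndexError("global index out of range")
    if kind == "block":
        # One contiguous chunk per processor, of size ceil(n / p).
        size = -(-n // p)
        return i // size, i % size
    if kind == "cyclic":
        # Round-robin by element: the same as block-cyclic with b = 1.
        b = 1
    elif kind != "block-cyclic":
        raise ValueError("unknown distribution kind")
    # Blocks of b consecutive elements are dealt out round-robin.
    block = i // b
    owner = block % p
    local = (block // p) * b + i % b
    return owner, local
```

Under these assumptions, a sender that knows the receiver's distribution could evaluate such a mapping itself and ship the resulting local offset alongside the data, which is the general idea the abstract describes.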
Parallel Performance Prediction using Lost Cycles Analysis
- In Proceedings of Supercomputing '94
, 1994
Cited by 62 (1 self)
Most performance debugging and tuning of parallel programs is based on the "measure-modify" approach, which is heavily dependent on detailed measurements of programs during execution. This approach is extremely time-consuming and does not lend itself to predicting performance under varying conditions. Analytic modeling and scalability analysis provide predictive power, but are not widely used in practice, due primarily to their emphasis on asymptotic behavior and the difficulty of developing accurate models that work for real-world programs. In this paper we describe a set of tools for performance tuning of parallel programs that bridges this gap between measurement and modeling. Our approach is based on lost cycles analysis, which involves measurement and modeling of all sources of overhead in a parallel program. We first describe a tool for measuring overheads in parallel programs that we have incorporated into the runtime environment for Fortran programs on the Kendall Square KSR1. We then describe a tool that fits these overhead measurements to analytic forms. We illustrate the use of these tools by analyzing the performance tradeoffs among parallel implementations of 2D FFT. These examples show how our tools enable programmers to develop accurate performance models of parallel applications without requiring extensive performance modeling expertise.
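As a rough illustration of accounting for overhead in a parallel run, the Python sketch below uses the textbook definition of total parallel overhead, p * T_p - T_1, and reports how much of it a set of measured categories explains. That definition, the function names, and the category names are assumptions made for this example; the abstract does not spell out the paper's exact formulation.

```python
def lost_cycles(t_serial, t_parallel, p):
    """Total overhead in processor-time units: p * T_p - T_1 (assumed definition)."""
    return p * t_parallel - t_serial


def overhead_breakdown(measured, t_serial, t_parallel, p):
    """Return the measured overhead categories plus the unexplained remainder.

    `measured` maps a hypothetical category name to processor-time units.
    """
    total = lost_cycles(t_serial, t_parallel, p)
    unexplained = total - sum(measured.values())
    return {**measured, "unexplained": unexplained}
```

A large "unexplained" entry would suggest that some source of overhead is not yet being measured.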
Optimal Latency-Throughput Tradeoffs for Data Parallel Pipelines
- In Eighth Annual ACM Symposium on Parallel Algorithms and Architectures (Padua
, 1996
Cited by 55 (7 self)
This paper addresses optimal mapping of parallel programs composed of a chain of data parallel tasks onto the processors of a parallel system. The input to this class of programs is a stream of data sets, each of which is processed in order by the chain of tasks. This computation structure, also referred to as a data parallel pipeline, is common in several application domains including digital signal processing, image processing, and computer vision. The parameters of the performance of stream processing are latency (the time to process an individual data set) and throughput (the aggregate rate at which the data sets are processed). These two criteria are distinct since multiple data sets can be pipelined or processed in parallel. We present a new algorithm to determine a processor mapping of a chain of tasks that optimizes the latency in the presence of throughput constraints, and discuss optimization of the throughput with latency constraints. The problem formulation uses a general ...
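The distinction between latency and throughput can be seen in a minimal Python sketch of a simple pipeline. It assumes that each stage handles one data set at a time, that communication between stages is free, and that stage times are fixed; these simplifications and the function names are assumptions for illustration, not the paper's problem formulation.

```python
def pipeline_latency(stage_times):
    """Time for one data set to pass through every stage in order."""
    return sum(stage_times)


def pipeline_throughput(stage_times):
    """Steady-state data sets per unit time, limited by the slowest stage."""
    return 1.0 / max(stage_times)
```

With stage times of 2, 4, and 1 time units, one data set takes 7 units end to end, yet a new result emerges every 4 units. Speeding up only the slowest stage improves throughput more than it improves latency, which is why the two measures have to be optimized separately.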
A Compilation System That Integrates High Performance Fortran and Fortran M
- In Proceeding of 1994 Scalable High Performance Computing Conference (Knoxville, TN
, 1994
Cited by 55 (12 self)
Task parallelism and data parallelism are often seen as mutually exclusive approaches to parallel programming. Yet there are important classes of application, for example in multidisciplinary simulation and command and control, that would benefit from an integration of the two approaches. In this paper, we describe a programming system that we are developing to explore this sort of integration. This system builds on previous work on task-parallel and data-parallel Fortran compilers to provide an environment in which the task-parallel language Fortran M can be used to coordinate data-parallel High Performance Fortran tasks. We use an image-processing problem to illustrate the issues that arise when building an integrated compilation system of this sort.
1 Introduction
In data-parallel programming, programs apply a sequence of operations identically to all or most elements of a large data structure; in task-parallel programming, programs consist of a set of (potentially dissimilar) para...
Automatic Generation of Efficient Array Redistribution Routines for Distributed Memory Multicomputers
, 1995
Cited by 53 (4 self)
Appropriate data distribution has been found to be critical for obtaining good performance on Distributed Memory Multicomputers like the CM-5, Intel Paragon and IBM SP-1. It has also been found that some programs need to change their distributions during execution for better performance (redistribution). This work focuses on automatically generating efficient routines for redistribution. We present a new mathematical representation for regular distributions called PITFALLS and then discuss algorithms for redistribution based on this representation. One of the significant contributions of this work is being able to handle arbitrary source and target processor sets while performing redistribution. Another important contribution is the ability to handle an arbitrary number of dimensions for the array involved in the redistribution in a scalable manner. Our implementation of these techniques is based on an MPI-like communication library. The results presented show the low overheads for our redistribution algorithm as compared to naive runtime methods.
A Software Architecture for Multidisciplinary Applications: Integrating Task and Data Parallelism
, 1994
Cited by 52 (8 self)
Data parallel languages such as Vienna Fortran and HPF can be successfully applied to a wide range of numerical applications. However, many advanced scientific and engineering applications are of a multidisciplinary and heterogeneous nature and thus do not fit well into the data parallel paradigm. In this paper we present new Fortran 90 language extensions to fill this gap. Tasks can be spawned as asynchronous activities in a homogeneous or heterogeneous computing environment; they interact by sharing access to Shared Data Abstractions (SDAs). SDAs are an extension of Fortran 90 modules, representing a pool of common data, together with a set of methods for controlled access to these data and a mechanism for providing persistent storage. Our language supports the integration of data and task parallelism as well as nested task parallelism and thus can be used to express multidisciplinary applications in a natural and efficient way.

