The PARADIGM Compiler for Distributed-Memory Message Passing Multicomputers
- IEEE Computer
, 1994
Cited by 98 (9 self)
The PARADIGM compiler project provides an automated means to parallelize programs, written in a serial programming model, for efficient execution on distributed-memory multicomputers. In addition to performing traditional compiler optimizations, PARADIGM is unique in that it addresses many other issues within a unified platform: automatic data distribution, synthesis of high-level communication, communication optimizations, irregular computations, functional and data parallelism, and multithreaded execution. This paper describes the techniques used and provides experimental evidence of their effectiveness.
1 Introduction
Distributed-memory massively parallel multicomputers can provide the high levels of performance required to solve the Grand Challenge computational science problems [16]. Distributed-memory multicomputers such as the Intel iPSC/860, the Intel Paragon, the IBM SP-1 and the Thinking Machines CM-5 offer significant advantages over shared-memory multiprocessors in terms...
Automatic Generation of Efficient Array Redistribution Routines for Distributed Memory Multicomputers
, 1995
Cited by 53 (4 self)
Appropriate data distribution has been found to be critical for obtaining good performance on Distributed Memory Multicomputers like the CM-5, Intel Paragon and IBM SP-1. It has also been found that some programs need to change their distributions during execution for better performance (redistribution). This work focuses on automatically generating efficient routines for redistribution. We present a new mathematical representation for regular distributions called PITFALLS and then discuss algorithms for redistribution based on this representation. One of the significant contributions of this work is being able to handle arbitrary source and target processor sets while performing redistribution. Another important contribution is the ability to handle an arbitrary number of dimensions for the array involved in the redistribution in a scalable manner. Our implementation of these techniques is based on an MPI-like communication library. The results presented show the low overheads for our redistribution algorithm as compared to naive runtime methods.
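As a rough illustration of what a redistribution routine has to compute, the sketch below builds the send schedule for moving a one-dimensional array from a BLOCK distribution over one processor set to a CYCLIC distribution over another. It enumerates elements one at a time, which is closer to the naive runtime methods the abstract compares against, and it does not use the PITFALLS representation. The function names and the choice of distributions are assumptions made for this example.

```python
from collections import defaultdict

def block_owner(i, n, p):
    """Owner of element i when n elements are split into p contiguous blocks."""
    block = -(-n // p)  # ceiling division: size of each block
    return i // block

def cyclic_owner(i, p):
    """Owner of element i when elements are dealt round-robin to p processors."""
    return i % p

def redistribution_schedule(n, p_src, p_dst):
    """Map (source, target) processor pairs to the global indices to transfer.

    Hypothetical helper: the source set uses BLOCK, the target set uses CYCLIC,
    and the two sets may have different sizes.
    """
    schedule = defaultdict(list)
    for i in range(n):
        pair = (block_owner(i, n, p_src), cyclic_owner(i, p_dst))
        schedule[pair].append(i)
    return dict(schedule)

if __name__ == "__main__":
    for (src, dst), idx in sorted(redistribution_schedule(8, 2, 4).items()):
        print(f"source {src} -> target {dst}: {idx}")
```

Each entry of the schedule corresponds to one message in a message-passing implementation. The cost of building it this way grows with the array size, which is the overhead a closed-form representation of regular distributions is meant to avoid.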
A Convex Programming Approach for Exploiting Data and Functional Parallelism on Distributed Memory Multicomputers
, 1994
Cited by 29 (8 self)
Compilers have focussed on the exploitation of one of functional or data parallelism in the past. The PARADIGM compiler project at the University of Illinois is among the first to incorporate techniques for simultaneous exploitation of both. The work in this paper describes the techniques used in the PARADIGM compiler and analyzes the optimality of these techniques. It is the first of its kind to use realistic cost models and includes data transfer costs which all previous researchers have neglected. Preliminary results on the CM-5 show the efficacy of our methods and the significant advantages of using functional and data parallelism together for execution of real applications.
1. INTRODUCTION
Distributed memory multicomputers such as the Intel Paragon, the IBM SP-1 and the Thinking Machines CM-5 offer significant advantages over shared memory multiprocessors in terms of cost and scalability. Unfortunately, to extract all that computational power from these machines, users have to write eff...
Automatic Selection of Dynamic Data Partitioning Schemes for Distributed-Memory Multicomputers
- in Proceedings of the 8th Workshop on Languages and Compilers for Parallel Computing
, 1995
Cited by 25 (3 self)
For distributed-memory multicomputers such as the Intel Paragon, the IBM SP-1/SP-2, the NCUBE/2, and the Thinking Machines CM-5, the quality of the data partitioning for a given application is crucial to obtaining high performance. This task has traditionally been the user's responsibility, but in recent years much effort has been directed to automating the selection of data partitioning schemes. Several researchers have proposed systems that are able to produce data distributions that remain in effect for the entire execution of an application. For complex programs, however, such static data distributions may be insufficient to obtain acceptable performance. The selection of distributions that dynamically change over the course of a program's execution adds another dimension to the data partitioning problem. In this paper, we present a technique that can be used to automatically determine which partitionings are most beneficial over specific sections of a program while taking into a...
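The truncated abstract does not say how the partitionings are selected, so the following is only a generic sketch of the trade-off it describes: a distribution that suits one program section may be worth switching to only if the gain outweighs the cost of redistributing the data. It uses a simple dynamic program over a linear sequence of phases. The function name, the cost tables, and the assumption that phases run strictly in sequence are all hypothetical and are not taken from the paper.

```python
def choose_distributions(phase_costs, redist_cost):
    """Pick one distribution per phase, minimizing execution plus redistribution cost.

    phase_costs: list of dicts; phase_costs[k][d] is the (assumed known) cost of
                 running phase k under distribution d.
    redist_cost: function (d_from, d_to) -> cost of switching between phases.
    Returns (total_cost, list of chosen distributions, one per phase).
    """
    # best[d] = (cheapest cost of the phases so far when ending in d, choices made)
    best = {d: (c, [d]) for d, c in phase_costs[0].items()}
    for costs in phase_costs[1:]:
        nxt = {}
        for d, c in costs.items():
            # Cheapest way to arrive at distribution d from any previous choice.
            prev_cost, prev_path = min(
                ((pc + redist_cost(pd, d), path) for pd, (pc, path) in best.items()),
                key=lambda t: t[0],
            )
            nxt[d] = (prev_cost + c, prev_path + [d])
        best = nxt
    return min(best.values(), key=lambda t: t[0])

if __name__ == "__main__":
    phases = [
        {"row": 10, "col": 30},
        {"row": 40, "col": 10},
        {"row": 12, "col": 28},
    ]
    cheap_switch = lambda a, b: 0 if a == b else 5
    print(choose_distributions(phases, cheap_switch))
```

With a cheap switch the dynamic choice alternates between distributions; raising the redistribution cost makes a single static distribution the better plan, which is the tension the abstract points to.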
Processor Tagged Descriptors: A Data Structure for Compiling for Distributed-Memory Multicomputers
- the Proceedings of the Parallel Architectures and Compiler Technology Conference
, 1994
Cited by 14 (9 self)
The computation partitioning, communication analysis, and optimization phases performed during compilation for distributed-memory multicomputers require an efficient way of describing distributed sets of iterations and regions of data. Processor Tagged Descriptors (PTDs) provide these capabilities through a single set representation parameterized by the processor location for each dimension of a virtual mesh. A uniform representation is maintained for every processor in the mesh, whether it is a boundary or an interior node. As a result, operations on the sets are very efficient because the effect on all processors in a dimension can be captured in a single symbolic operation. In addition, PTDs are easily extended to an arbitrary number of dimensions, necessary for describing iteration sets in multiply nested loops as well as sections of multidimensional arrays. Using the symbolic features of PTDs it is also possible to generate code for variable numbers of processors, thereby allowi...
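The idea of one set description parameterized by processor position can be illustrated with a small sketch. It is not the PTD data structure itself: it assumes a one-dimensional block distribution and the owner-computes rule, and the function names are invented for the example. The point is that a single formula in the processor index `p` covers boundary and interior processors alike.

```python
def owned_range(p, n, nprocs):
    """Half-open range [lo, hi) of indices owned by processor p (block distribution)."""
    block = -(-n // nprocs)  # ceiling division: block size
    lo = min(p * block, n)
    hi = min(lo + block, n)
    return lo, hi

def local_iterations(p, n, nprocs, loop_lo, loop_hi):
    """Iterations of `for i in range(loop_lo, loop_hi)` that processor p executes.

    Intersects the loop bounds with the owned range; an empty result is
    returned as a range whose two ends are equal.
    """
    lo, hi = owned_range(p, n, nprocs)
    lo, hi = max(lo, loop_lo), min(hi, loop_hi)
    return (lo, hi) if lo < hi else (lo, lo)

if __name__ == "__main__":
    for p in range(4):
        print(p, owned_range(p, 10, 4), local_iterations(p, 10, 4, 1, 9))
```

Because the processor index stays a parameter, the same expressions work for any processor count chosen at run time, in the spirit of the abstract's remark about generating code for variable numbers of processors.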
A Framework for Exploiting Data and Functional Parallelism on Distributed Memory Multicomputers
, 1994
Cited by 9 (2 self)
Recent research efforts have shown the benefits of integrating functional and data parallelism over using either pure data parallelism or pure functional parallelism. The work in this paper presents a theoretical framework for deciding on a good execution strategy for a given program based on the available functional and data parallelism in the program. The framework is based on assumptions about the form of computation and communication cost functions for multicomputer systems. We present mathematical functions for these costs and show that these functions are realistic. The framework also requires specification of the available functional and data parallelism for a given problem. For this purpose, we have developed a graphical programming tool. Currently, we have tested our approach using three benchmark programs on the Thinking Machines CM-5 and Intel Paragon. Results presented show that the approach is very effective and can provide a two- to three-fold increase in speedups over ap...
Run-Time Selection of Block Size in Pipelined Parallel Programs
- In Proceedings 13th International Parallel Processing Symposium and 10th Symposium on Parallel and Distributed Processing
, 1999
Cited by 7 (3 self)
Parallelizing compiler technology has improved in recent years. One area in which compilers have made progress is in handling DOACROSS loops, where cross-processor data dependencies can inhibit efficient parallelization. In regular DOACROSS loops, where dependencies can be determined at compile time, a useful parallelization technique is pipelining, where each processor (node) performs its computation in blocks; after each, it sends data to the next processor in the pipeline. The amount of computation before sending a message is called the block size; its choice, although difficult for a compiler to make, is critical to the efficiency of the program. Compilers typically use a static estimation of workload, which cannot always produce an effective block size. This paper describes a flexible run-time approach to choosing the block size. Our system takes measurements during the first iteration of the program and then uses the results to build an execution model and choose an appropriate bl...
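The block-size trade-off can be made concrete with a simplified cost model. The sketch below is not the paper's execution model: it assumes a uniform workload, a fixed per-message cost, and a linear pipeline, and the parameter names `t_comp` and `t_msg` stand in for whatever values a run-time system would actually measure. Small blocks let downstream processors start early but pay the message cost many times; large blocks do the opposite.

```python
import math

def pipeline_time(b, n, nprocs, t_comp, t_msg):
    """Estimated time for a linear pipeline using block size b.

    n:       work units each processor must complete
    t_comp:  assumed time per work unit
    t_msg:   assumed fixed cost of sending one block to the next processor
    The last processor starts after (nprocs - 1) stages and then handles every block.
    """
    blocks = math.ceil(n / b)
    return (nprocs - 1 + blocks) * (b * t_comp + t_msg)

def best_block_size(n, nprocs, t_comp, t_msg):
    """Exhaustively pick the block size with the lowest estimated time."""
    return min(range(1, n + 1),
               key=lambda b: pipeline_time(b, n, nprocs, t_comp, t_msg))

if __name__ == "__main__":
    print(best_block_size(100, 4, t_comp=1.0, t_msg=5.0))
```

When messages are free the model favors the smallest block, and when they dominate it favors one large block. A run-time approach can plug in measured costs instead of compile-time estimates.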
Accurately Selecting Block Size At Run Time in Pipelined Parallel Programs
- International Journal of Parallel Programming
, 2000
Cited by 1 (0 self)
Loops that contain cross-processor data dependencies, known as DOACROSS loops, are often found in scientific programs. Efficiently parallelizing such loops is important yet nontrivial. One useful parallelization technique for DOACROSS loops is pipelining, where each processor (node) performs its computation in blocks; after each, it sends data to the next node in the pipeline. The amount of computation before sending a message is called the block size; its choice, although difficult to make statically, is important for efficient execution. This paper describes a flexible run-time approach to choosing the block size. Rather than rely on static estimation of workload, our system takes measurements during the first two iterations of a program and then uses the results to build an execution model and choose an appropriate block size which, unlike a static choice, may be nonuniform. To increase accuracy of the chosen block size, our execution model takes intra- and inter-node performance into account. It is important to note that our system finds an effective block size automatically, without experimentation that is necessary when using a statically chosen block size. Performance on a network of workstations shows that programs that use our run-time analysis outperform those that use static block sizes by as much as 18% when the workload is unbalanced. When the workload is balanced, competitive performance is achieved as long as the initial overhead is sufficiently amortized.
An Overview of the PARADIGM Compiler for Distributed-Memory Multicomputers
, 1995
Distributed-memory multicomputers such as the Intel Paragon, the IBM SP-2, and the Thinking Machines CM-5 offer significant advantages over shared-memory multiprocessors in terms of cost and scalability. Unfortunately, extracting all the computational power from these machines requires users to write efficient software for them, which is a laborious process. The PARADIGM compiler project provides an automated means to parallelize programs, written using a serial programming model, for efficient execution on distributed-memory multicomputers. In addition to performing traditional compiler optimizations, PARADIGM is unique in that it addresses many other issues within a unified platform: automatic data distribution, communication optimizations, support for irregular computations, exploitation of functional and data parallelism, and multithreaded execution. This paper describes the techniques used and provides experimental evidence of their effectiveness on the Intel Paragon, the IBM ...

