Results 1 - 10 of 25
The PARADIGM Compiler for Distributed-Memory Message Passing Multicomputers
IEEE Computer, 1994
Cited by 98 (9 self)
The PARADIGM compiler project provides an automated means to parallelize programs, written in a serial programming model, for efficient execution on distributed-memory multicomputers. In addition to performing traditional compiler optimizations, PARADIGM is unique in that it addresses many other issues within a unified platform: automatic data distribution, synthesis of high-level communication, communication optimizations, irregular computations, functional and data parallelism, and multithreaded execution. This paper describes the techniques used and provides experimental evidence of their effectiveness.
1 Introduction: Distributed-memory massively parallel multicomputers can provide the high levels of performance required to solve the Grand Challenge computational science problems [16]. Distributed-memory multicomputers such as the Intel iPSC/860, the Intel Paragon, the IBM SP-1 and the Thinking Machines CM-5 offer significant advantages over shared-memory multiprocessors in terms...
Optimal Latency-Throughput Tradeoffs for Data Parallel Pipelines
In Eighth Annual ACM Symposium on Parallel Algorithms and Architectures (Padua, 1996
Cited by 55 (7 self)
This paper addresses optimal mapping of parallel programs composed of a chain of data parallel tasks onto the processors of a parallel system. The input to this class of programs is a stream of data sets, each of which is processed in order by the chain of tasks. This computation structure, also referred to as a data parallel pipeline, is common in several application domains including digital signal processing, image processing, and computer vision. The parameters of the performance of stream processing are latency (the time to process an individual data set) and throughput (the aggregate rate at which the data sets are processed). These two criteria are distinct since multiple data sets can be pipelined or processed in parallel. We present a new algorithm to determine a processor mapping of a chain of tasks that optimizes the latency in the presence of throughput constraints, and discuss optimization of the throughput with latency constraints. The problem formulation uses a general ...
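To make the distinction between latency and throughput concrete, here is a minimal sketch under a toy cost model. The `Stage` class, the `serial + parallel / procs` timing formula, and all numbers are illustrative assumptions; they are not the problem formulation or the mapping algorithm of the paper.

```python
from dataclasses import dataclass


@dataclass
class Stage:
    """One task in the chain (hypothetical cost model, not the paper's)."""
    serial: float    # seconds per data set that do not speed up
    parallel: float  # seconds per data set that divide evenly across processors


def stage_time(stage, procs):
    """Time for one data set to pass through a stage on `procs` processors."""
    return stage.serial + stage.parallel / procs


def latency(stages, alloc):
    """Time for a single data set to traverse the whole pipeline."""
    return sum(stage_time(s, p) for s, p in zip(stages, alloc))


def throughput(stages, alloc):
    """Data sets per second in steady state; the slowest stage is the bottleneck."""
    return 1.0 / max(stage_time(s, p) for s, p in zip(stages, alloc))


def data_parallel_only(stages, total_procs):
    """Every stage uses all processors, so data sets are not pipelined."""
    per_data_set = sum(stage_time(s, total_procs) for s in stages)
    return per_data_set, 1.0 / per_data_set


stages = [Stage(serial=1.0, parallel=8.0), Stage(serial=2.0, parallel=4.0)]
print(latency(stages, [4, 2]), throughput(stages, [4, 2]))
print(data_parallel_only(stages, 6))
```

With six processors in this toy model, the pipelined mapping `[4, 2]` has a latency of 7 s and a throughput of 0.25 data sets per second, while running each stage on all six processors gives a latency of about 5 s but a throughput of only about 0.2. The mapping that is better for one criterion is worse for the other, which is why the two have to be traded off.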
Automatic Generation of Efficient Array Redistribution Routines for Distributed Memory Multicomputers
1995
Cited by 53 (4 self)
Appropriate data distribution has been found to be critical for obtaining good performance on Distributed Memory Multicomputers like the CM-5, Intel Paragon and IBM SP-1. It has also been found that some programs need to change their distributions during execution for better performance (redistribution). This work focuses on automatically generating efficient routines for redistribution. We present a new mathematical representation for regular distributions called PITFALLS and then discuss algorithms for redistribution based on this representation. One of the significant contributions of this work is being able to handle arbitrary source and target processor sets while performing redistribution. Another important contribution is the ability to handle an arbitrary number of dimensions for the array involved in the redistribution in a scalable manner. Our implementation of these techniques is based on an MPI-like communication library. The results presented show the low overheads for our redistribution algorithm as compared to naive runtime methods.
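For orientation, the following sketch shows the kind of naive, element-by-element redistribution that such work is compared against. It assumes a one-dimensional array with a block-cyclic layout; the function names are hypothetical, and this is not the PITFALLS representation or the redistribution algorithm described in the abstract.

```python
from collections import defaultdict


def owner(index, block, procs):
    """Processor that owns element `index` under a block-cyclic layout
    with block size `block` over `procs` processors."""
    return (index // block) % procs


def naive_redistribution_plan(n, src_block, src_procs, dst_block, dst_procs):
    """Map (source processor, target processor) to the element indices that
    must move between them. Visits every element once, so the cost grows
    with the array size."""
    plan = defaultdict(list)
    for i in range(n):
        src = owner(i, src_block, src_procs)
        dst = owner(i, dst_block, dst_procs)
        plan[(src, dst)].append(i)
    return dict(plan)


# 8 elements: blocks of 2 over 2 processors -> blocks of 1 over 4 processors
for pair, indices in sorted(naive_redistribution_plan(8, 2, 2, 1, 4).items()):
    print(pair, indices)
```

The source and target processor counts differ in the example (2 and 4), which corresponds to the arbitrary source and target processor sets mentioned in the abstract. A compact mathematical description of regular distributions avoids the per-element loop used here.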
The Internet Backplane Protocol: Storage in the Network
1999
Cited by 47 (9 self)
For distributed and network applications, efficient management of program state is critical to performance and functionality. To support domain- and application-specific optimization of data movement, we have developed the Internet Backplane Protocol (IBP) for controlling storage that is implemented as part of the network fabric itself. IBP allows an application to control intermediate data staging operations explicitly as data is communicated between processes. As such, the application can exploit locality and manage scarce buffer resources effectively. In this paper, we discuss the development of IBP, the implementation of a prototype system for managing network storage, and a preliminary deployment as part of the Internet-2 Distributed Storage Initiative.
1 Introduction: The proliferation of applications that are performance limited by network speeds leads us to explore new ways to exploit data locality in distributed settings. Currently, standard networking protocols (such as TCP/IP)...
Optimal Mapping of Sequences of Data Parallel Tasks
In Proceedings of the Fifth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 1995
Cited by 47 (7 self)
Many applications in a variety of domains including digital signal processing, image processing, and computer vision are composed of a sequence of tasks that act on a stream of input data sets in a pipelined manner. Recent research has established that these applications are best mapped to a massively parallel machine by dividing the tasks into modules and assigning a subset of the available processors to each module. This paper addresses the problem of optimally mapping such applications onto a massively parallel machine. We formulate the problem of optimizing throughput in task pipelines and present two new solution algorithms. The formulation uses a general and realistic model for inter-task communication, takes memory constraints into account, and addresses the entire problem of mapping which includes clustering tasks into modules, assignment of processors to modules, and possible replication of modules. The first algorithm is based on dynamic programming and finds the optimal mappin...
A Framework for Exploiting Task- and Data-Parallelism on Distributed Memory Multicomputers
IEEE Transactions on Parallel and Distributed Systems, 1997
Cited by 30 (0 self)
offer significant advantages over shared memory multiprocessors in terms of cost and scalability. Unfortunately, the utilization of all the available computational power in these machines involves a tremendous programming effort on the part of users, which creates a need for sophisticated compiler and run-time support for distributed memory machines. In this paper, we explore a new compiler optimization for regular scientific applications: the simultaneous exploitation of task and data parallelism. Our optimization is implemented as part of the PARADIGM HPF compiler framework we have developed. The intuitive idea behind the optimization is the use of task parallelism to control the degree of data parallelism of individual tasks. The reason this provides increased performance is that data parallelism provides diminishing returns as the number of processors used is increased. By controlling the number of processors used for each data parallel task in an application and by concurrently executing these tasks, we make program execution more efficient and, therefore, faster. A practical implementation of a task and data parallel scheme of execution for an application on a distributed memory multicomputer also involves data redistribution. This data redistribution causes an overhead. However, as our experimental results show, this overhead is not a problem; execution of a program using task and data parallelism together can be significantly faster than its execution using data parallelism alone. This makes our proposed optimization practical and extremely useful.
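The diminishing-returns argument can be illustrated with a small sketch. It assumes two independent tasks, each described by a hypothetical `(serial, parallel)` cost pair, and it ignores the redistribution overhead that the abstract mentions; it is not the allocation method used in PARADIGM.

```python
def run_time(serial, parallel, procs):
    """Toy cost model: only the `parallel` part shrinks as processors are added."""
    return serial + parallel / procs


def sequential_data_parallel(tasks, total_procs):
    """Pure data parallelism: each task in turn uses every processor."""
    return sum(run_time(s, p, total_procs) for s, p in tasks)


def best_two_task_split(task_a, task_b, total_procs):
    """Run two independent tasks concurrently on disjoint processor groups.
    Returns (finish time, processors given to task_a) for the best split."""
    best = None
    for procs_a in range(1, total_procs):
        finish = max(run_time(*task_a, procs_a),
                     run_time(*task_b, total_procs - procs_a))
        if best is None or finish < best[0]:
            best = (finish, procs_a)
    return best


task_a = (1.0, 12.0)  # (serial seconds, parallel seconds), assumed values
task_b = (1.0, 4.0)
print(sequential_data_parallel([task_a, task_b], 8))
print(best_two_task_split(task_a, task_b, 8))
```

With eight processors and these assumed costs, running the tasks one after another on all processors takes 4.0 s, while giving six processors to the first task and two to the second finishes both in 3.0 s. The gain comes from the serial part of each task, which extra processors cannot shrink, and a real comparison would also have to charge for data redistribution.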
A New Model for Integrated Nested Task and Data Parallel Programming
1997
Cited by 28 (8 self)
High Performance Fortran (HPF) has emerged as a standard language for data parallel computing. However, a wide variety of scientific applications are best programmed by a combination of task and data parallelism. Therefore, a good model of task parallelism is important for continued success of HPF for parallel programming. This paper presents a task parallelism model that is simple, elegant, and relatively easy to implement in an HPF environment. Task parallelism is exploited by mechanisms for dividing processors into subgroups and mapping computations and data onto processor subgroups. This model of task parallelism has been implemented in the Fx compiler at Carnegie Mellon University. The paper addresses the main issues in compiling integrated task and data parallel programs and reports on the use of this model for programming various flat and nested task structures. Performance results are presented for a set of programs spanning signal processing, image processing, computer vision ...
Optimal Use of Mixed Task and Data Parallelism for Pipelined Computations
Journal of Parallel and Distributed Computing, 2000
Cited by 21 (0 self)
This paper addresses optimal mapping of parallel programs composed of a chain of data parallel tasks onto the processors of a parallel system. The input to the programs is a stream of data sets, each of which is processed in order by the chain of tasks. This computation structure, also referred to as a data parallel pipeline, is common in several application domains, including digital signal processing, image processing, and computer vision. The parameters of the performance for such stream processing are latency (the time to process an individual data set) and throughput (the aggregate rate at which data sets are processed). These two criteria are distinct since multiple data sets can be pipelined or processed in parallel. The central contribution of this research is a new algorithm to determine a processor mapping for a chain of tasks that optimizes latency in the presence of a throughput constraint. We also discuss how this algorithm can be applied to solve the converse problem of o...
Processor Tagged Descriptors: A Data Structure for Compiling for Distributed-Memory Multicomputers
Proceedings of the Parallel Architectures and Compiler Technology Conference, 1994
Cited by 14 (9 self)
The computation partitioning, communication analysis, and optimization phases performed during compilation for distributed-memory multicomputers require an efficient way of describing distributed sets of iterations and regions of data. Processor Tagged Descriptors (PTDs) provide these capabilities through a single set representation parameterized by the processor location for each dimension of a virtual mesh. A uniform representation is maintained for every processor in the mesh, whether it is a boundary or an interior node. As a result, operations on the sets are very efficient because the effect on all processors in a dimension can be captured in a single symbolic operation. In addition, PTDs are easily extended to an arbitrary number of dimensions, necessary for describing iteration sets in multiply nested loops as well as sections of multidimensional arrays. Using the symbolic features of PTDs it is also possible to generate code for variable numbers of processors, thereby allowi...
A Framework for Exploiting Data and Functional Parallelism on Distributed Memory Multicomputers
1994
Cited by 9 (2 self)
Recent research efforts have shown the benefits of integrating functional and data parallelism over using either pure data parallelism or pure functional parallelism. The work in this paper presents a theoretical framework for deciding on a good execution strategy for a given program based on the available functional and data parallelism in the program. The framework is based on assumptions about the form of computation and communication cost functions for multicomputer systems. We present mathematical functions for these costs and show that these functions are realistic. The framework also requires specification of the available functional and data parallelism for a given problem. For this purpose, we have developed a graphical programming tool. Currently, we have tested our approach using three benchmark programs on the Thinking Machines CM-5 and Intel Paragon. Results presented show that the approach is very effective and can provide a two- to three-fold increase in speedups over ap...

