Results 1 - 10 of 19
Task Parallelism in a High Performance Fortran Framework
IEEE Parallel and Distributed Technology, 1994
Cited by 83 (18 self)
High Performance Fortran (HPF) has emerged as a standard dialect of Fortran for data parallel computing. However, for a wide variety of applications, both task and data parallelism must be exploited to achieve the best possible performance on a multicomputer. We present the design and implementation of a Fortran compiler that integrates task and data parallelism in an HPF framework. A small set of simple directives allows users to express task parallel programs in a variety of domains. The user identifies opportunities for task parallelism, and the compiler handles task creation and management, as well as communication between tasks. Since a unified compiler handles both task parallelism and data parallelism, existing data parallel programs and libraries can serve as the building blocks for constructing larger task parallel programs. This paper concludes with a description of several parallel application kernels that were developed with the compiler. The examples demonstrate that exploi...
Optimal Latency-Throughput Tradeoffs for Data Parallel Pipelines
In Eighth Annual ACM Symposium on Parallel Algorithms and Architectures (Padua, 1996
Cited by 55 (7 self)
This paper addresses optimal mapping of parallel programs composed of a chain of data parallel tasks onto the processors of a parallel system. The input to this class of programs is a stream of data sets, each of which is processed in order by the chain of tasks. This computation structure, also referred to as a data parallel pipeline, is common in several application domains including digital signal processing, image processing, and computer vision. The parameters of the performance of stream processing are latency (the time to process an individual data set) and throughput (the aggregate rate at which the data sets are processed). These two criteria are distinct since multiple data sets can be pipelined or processed in parallel. We present a new algorithm to determine a processor mapping of a chain of tasks that optimizes the latency in the presence of throughput constraints, and discuss optimization of the throughput with latency constraints. The problem formulation uses a general ...
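To make the two metrics in this abstract concrete, here is a minimal sketch under a deliberately simple assumed model: each stage processes one data set at a time, stages are not replicated, and communication cost is ignored. The function names, the cost model `work / p + overhead * p`, and the exhaustive search are all illustrative assumptions; the search is a brute-force stand-in, not the algorithm the paper presents.

```python
from itertools import product


def latency(stage_times):
    """Time for one data set to pass through every stage of the chain."""
    return sum(stage_times)


def throughput(stage_times):
    """Data sets completed per unit time; the slowest stage is the bottleneck."""
    return 1.0 / max(stage_times)


def best_mapping(work, total_procs, min_throughput, overhead=0.0):
    """Brute-force search over processors-per-stage allocations.

    Assumed (hypothetical) cost model: a stage with `w` units of work on
    `p` processors takes w / p + overhead * p time per data set.
    Returns (latency, allocation) with minimum latency among allocations
    that meet `min_throughput`, or None if no allocation qualifies.
    """
    best = None
    for alloc in product(range(1, total_procs + 1), repeat=len(work)):
        if sum(alloc) > total_procs:
            continue  # uses more processors than are available
        times = [w / p + overhead * p for w, p in zip(work, alloc)]
        if throughput(times) < min_throughput:
            continue  # violates the throughput constraint
        if best is None or latency(times) < best[0]:
            best = (latency(times), alloc)
    return best
```

With stage times 2, 4, and 2, the latency is 8 while the throughput is 1/4, which shows why the two criteria are distinct: the pipeline finishes a data set every 4 time units even though each one spends 8 units in flight.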
A Compilation System That Integrates High Performance Fortran and Fortran M
In Proceedings of 1994 Scalable High Performance Computing Conference (Knoxville, TN, 1994
Cited by 55 (12 self)
Task parallelism and data parallelism are often seen as mutually exclusive approaches to parallel programming. Yet there are important classes of application, for example in multidisciplinary simulation and command and control, that would benefit from an integration of the two approaches. In this paper, we describe a programming system that we are developing to explore this sort of integration. This system builds on previous work on task-parallel and data-parallel Fortran compilers to provide an environment in which the task-parallel language Fortran M can be used to coordinate data-parallel High Performance Fortran tasks. We use an image-processing problem to illustrate the issues that arise when building an integrated compilation system of this sort.
1 Introduction
In data-parallel programming, programs apply a sequence of operations identically to all or most elements of a large data structure; in task-parallel programming, programs consist of a set of (potentially dissimilar) para...
Optimal Mapping of Sequences of Data Parallel Tasks
In Proceedings of the Fifth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 1995
Cited by 47 (7 self)
Many applications in a variety of domains including digital signal processing, image processing, and computer vision are composed of a sequence of tasks that act on a stream of input data sets in a pipelined manner. Recent research has established that these applications are best mapped to a massively parallel machine by dividing the tasks into modules and assigning a subset of the available processors to each module. This paper addresses the problem of optimally mapping such applications onto a massively parallel machine. We formulate the problem of optimizing throughput in task pipelines and present two new solution algorithms. The formulation uses a general and realistic model for inter-task communication, takes memory constraints into account, and addresses the entire problem of mapping which includes clustering tasks into modules, assignment of processors to modules, and possible replication of modules. The first algorithm is based on dynamic programming and finds the optimal mappin...
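The general shape of a dynamic-programming mapping can be sketched as follows. This is a simplified illustration rather than the paper's algorithm: it clusters a chain of tasks into contiguous modules and assigns processors to each module so that the slowest module is as fast as possible, but it leaves out replication, memory constraints, and any explicit communication model. The function name and the caller-supplied `module_time(i, j, p)` cost function are hypothetical.

```python
from functools import lru_cache


def max_throughput_mapping(n_tasks, total_procs, module_time):
    """Cluster tasks 0..n_tasks-1 into contiguous modules and assign processors.

    module_time(i, j, p) is an assumed, caller-supplied estimate of the time
    per data set for a module holding tasks i..j-1 on p processors.
    Returns (bottleneck_time, ((i, j, p), ...)), or None if infeasible.
    Throughput of the resulting pipeline is 1 / bottleneck_time.
    """

    @lru_cache(maxsize=None)
    def solve(i, procs):
        # Best mapping for tasks i..n_tasks-1 using at most `procs` processors.
        if i == n_tasks:
            return (0.0, ())
        best = None
        for j in range(i + 1, n_tasks + 1):
            # Reserve one processor if tasks remain after this module.
            max_p = procs - (1 if j < n_tasks else 0)
            for p in range(1, max_p + 1):
                rest = solve(j, procs - p)
                if rest is None:
                    continue
                bottleneck = max(module_time(i, j, p), rest[0])
                if best is None or bottleneck < best[0]:
                    best = (bottleneck, ((i, j, p),) + rest[1])
        return best

    return solve(0, total_procs)
```

For two tasks of 6 work units each, two processors, and an assumed cost of `sum(work[i:j]) / p + 0.5 * (p - 1)`, the sketch prefers two single-processor modules (bottleneck 6.0) over one two-processor module (6.5), because the per-processor overhead outweighs the data parallel speedup.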
Communication and Memory Requirements as the Basis for Mapping Task and Data Parallel Programs
1994
Cited by 30 (7 self)
For a wide variety of applications, both task and data parallelism must be exploited to achieve the best possible performance on a multicomputer. Recent research has underlined the importance of exploiting task and data parallelism in a single compiler framework, and such a compiler can map a single source program in many different ways onto a parallel machine. The tradeoffs between task and data parallelism are complex and depend on the characteristics of the program to be executed, most significantly the memory and communication requirements, and the performance parameters of the target parallel machine. In this paper, we present a framework to isolate and examine the specific characteristics of programs that determine the performance for different mappings. Our focus is on applications that process a stream of input, and whose computation structure is fairly static and predictable. We describe three such applications that were developed with our compiler: fast Fourier transforms, nar...
Optimal Use of Mixed Task and Data Parallelism for Pipelined Computations
Journal of Parallel and Distributed Computing, 2000
Cited by 21 (0 self)
This paper addresses optimal mapping of parallel programs composed of a chain of data parallel tasks onto the processors of a parallel system. The input to the programs is a stream of data sets, each of which is processed in order by the chain of tasks. This computation structure, also referred to as a data parallel pipeline, is common in several application domains, including digital signal processing, image processing, and computer vision. The parameters of the performance for such stream processing are latency (the time to process an individual data set) and throughput (the aggregate rate at which data sets are processed). These two criteria are distinct since multiple data sets can be pipelined or processed in parallel. The central contribution of this research is a new algorithm to determine a processor mapping for a chain of tasks that optimizes latency in the presence of a throughput constraint. We also discuss how this algorithm can be applied to solve the converse problem of o...
TOP-C: A Task-Oriented Parallel C Interface
In 5th International Symposium on High Performance Distributed Computing (HPDC-5), 1996
Cited by 20 (12 self)
The goal of this work is to simplify parallel applications development, and thus ease the learning barriers faced by non-experts. It is especially useful where there is little data-parallelism to be recognized by a compiler. The applications programmer need learn the intricacies of only one primary subroutine in order to get the full benefits of the parallel interface. The applications programmer defines a high level concept, the task, that depends only on his application, and not on any particular parallel library. The task is defined by its three phases: (a) the task input, (b) sequential code to execute the task, and (c) any modifications of global variables that occur as a result of the task. In particular, side effects (which change global variable values) must not occur in phase (b). Forcing the user to re-organize his computation in these terms allows us to present the applications programmer with a single global environment visible to all processors (whether on an SMP or a NOW architecture), in the context of a master-slave architecture. Both a
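The three-phase task structure described in this abstract can be illustrated with a small sketch. It is written in Python with a thread pool and is only an analogy: it does not use or reproduce the TOP-C interface, and `run_tasks`, `do_task`, and `update_shared` are hypothetical names chosen for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def run_tasks(task_inputs, do_task, update_shared, shared, workers=4):
    """Run tasks in the three phases named in the abstract.

    (a) task_inputs: the input for each task, produced by the master.
    (b) do_task(task_input, shared): sequential code run on a worker;
        by convention it may read `shared` but must not modify it.
    (c) update_shared(shared, task_input, result): applied by the master
        only, so all changes to shared state happen in one place.
    """
    inputs = list(task_inputs)  # phase (a): materialize the task inputs
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Phase (b): workers compute results without side effects.
        results = list(pool.map(lambda x: do_task(x, shared), inputs))
    # Phase (c): the master applies each update sequentially.
    for task_input, result in zip(inputs, results):
        update_shared(shared, task_input, result)
    return shared
```

Keeping all writes in phase (c) is what lets every worker see a single consistent view of the shared data: no worker can observe a half-applied change made by another.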
Integrating Task and Data Parallelism with the Collective Communication Archetype
1994
Cited by 7 (3 self)
A parallel program archetype aids in the development of reliable, efficient parallel applications with common computation/communication structures by providing stepwise refinement methods and code libraries specific to the structure. The methods and libraries help in transforming a sequential program into a parallel program via a sequence of refinement steps that help maintain correctness while refining the program to obtain the appropriate level of granularity for a target machine. The specific archetype discussed here deals with the integration of task and data parallelism by using collective (or group) communication. This archetype has been used to develop several applications.
1 Introduction
Archetypes. Many parallel applications share common features in design, testing, debugging, performance tuning, and program structuring. A parallel program archetype is an abstraction that embodies common features shared by parallel applications within a domain. An archetype aids the develop...
Practical Task-Oriented Parallelism for Gaussian Elimination in Distributed Memory
Linear Algebra and its Applications, 1998
Cited by 6 (6 self)
This paper discusses a methodology for easily and efficiently parallelizing sequential algorithms in linear algebra using cost-effective networks of workstations, where the algorithm lends itself to parallelism. A particular target architecture of interest is the academic student laboratory, which typically contains many networked computers that lie idle at night. A case is made for why a task-oriented approach lends itself to the twin goals of programming ease and run-time efficiency. The approach is then described in the context of TOP-C (Task-Oriented Parallel C), an example of a system to support task-oriented parallelism. In this system, the programmer is relieved of lower level concerns such as latency, bandwidth, and message passing protocols, so as to better concentrate on higher level issues of task granularity and reduction of communication traffic. Gaussian elimination is chosen as the main example, since this algorithm is both widely used and sufficiently interesting to req...

