Optimal Latency-Throughput Tradeoffs for Data Parallel Pipelines
 In Eighth Annual ACM Symposium on Parallel Algorithms and Architectures (Padua)
, 1996
Cited by 58 (7 self)
Abstract
This paper addresses optimal mapping of parallel programs composed of a chain of data parallel tasks onto the processors of a parallel system. The input to this class of programs is a stream of data sets, each of which is processed in order by the chain of tasks. This computation structure, also referred to as a data parallel pipeline, is common in several application domains including digital signal processing, image processing, and computer vision. The performance parameters of stream processing are latency (the time to process an individual data set) and throughput (the aggregate rate at which the data sets are processed). These two criteria are distinct since multiple data sets can be pipelined or processed in parallel. We present a new algorithm to determine a processor mapping of a chain of tasks that optimizes the latency in the presence of throughput constraints, and discuss optimization of the throughput with latency constraints. The problem formulation uses a general ...
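The mapping problem this abstract describes can be sketched as a small dynamic program under simplifying assumptions (one task per pipeline stage, no communication cost, a known per-task cost table). The names `map_chain` and `exec_time` and the ideal-speedup toy model below are illustrative inventions, not the paper's actual algorithm.

```python
import math

def map_chain(exec_time, P, min_throughput):
    """Allocate processors to a chain of tasks, minimizing latency
    (sum of stage times) subject to a throughput constraint (no stage
    may run longer than 1 / min_throughput). exec_time[i][k-1] is the
    assumed time of task i on k processors. Returns (latency, allocation)."""
    n = len(exec_time)
    period = 1.0 / min_throughput            # max allowed time per stage
    INF = math.inf
    # best[p] = (min latency so far, allocation) using exactly p processors
    best = [(0.0, [])] + [(INF, None)] * P
    for i in range(n):
        nxt = [(INF, None)] * (P + 1)
        for p in range(P + 1):
            lat, alloc = best[p]
            if lat == INF:
                continue
            for k in range(1, P - p + 1):
                t = exec_time[i][k - 1]      # time of task i on k processors
                if t > period:               # would violate throughput bound
                    continue
                if lat + t < nxt[p + k][0]:
                    nxt[p + k] = (lat + t, alloc + [k])
        best = nxt
    return min(best, key=lambda s: s[0])

# Toy cost model: task i on k processors takes work[i] / k (ideal speedup).
work = [4.0, 2.0, 6.0]
times = [[w / k for k in range(1, 9)] for w in work]
print(map_chain(times, 8, min_throughput=0.5))
```

With 8 processors and a required throughput of 0.5 data sets per unit time, the sketch allocates 3, 2, and 3 processors to the three tasks: every stage fits within the period of 2.0, and the latency 4/3 + 1 + 2 is minimal.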
A Framework for Exploiting Task and Data-Parallelism on Distributed Memory Multicomputers
 IEEE Transactions on Parallel and Distributed Systems
, 1997
Cited by 34 (0 self)
Abstract
Distributed memory multicomputers offer significant advantages over shared memory multiprocessors in terms of cost and scalability. Unfortunately, the utilization of all the available computational power in these machines involves a tremendous programming effort on the part of users, which creates a need for sophisticated compiler and runtime support for distributed memory machines. In this paper, we explore a new compiler optimization for regular scientific applications: the simultaneous exploitation of task and data parallelism. Our optimization is implemented as part of the PARADIGM HPF compiler framework we have developed. The intuitive idea behind the optimization is the use of task parallelism to control the degree of data parallelism of individual tasks. The reason this provides increased performance is that data parallelism provides diminishing returns as the number of processors used is increased. By controlling the number of processors used for each data parallel task in an application and by concurrently executing these tasks, we make program execution more efficient and, therefore, faster. A practical implementation of a task and data parallel scheme of execution for an application on a distributed memory multicomputer also involves data redistribution. This data redistribution causes an overhead. However, as our experimental results show, this overhead is not a problem; execution of a program using task and data parallelism together can be significantly faster than its execution using data parallelism alone. This makes our proposed optimization practical and extremely useful.
A New Model for Integrated Nested Task and Data Parallel Programming
, 1997
Cited by 30 (8 self)
Abstract
High Performance Fortran (HPF) has emerged as a standard language for data parallel computing. However, a wide variety of scientific applications are best programmed by a combination of task and data parallelism. Therefore, a good model of task parallelism is important for the continued success of HPF for parallel programming. This paper presents a task parallelism model that is simple, elegant, and relatively easy to implement in an HPF environment. Task parallelism is exploited by mechanisms for dividing processors into subgroups and mapping computations and data onto processor subgroups. This model of task parallelism has been implemented in the Fx compiler at Carnegie Mellon University. The paper addresses the main issues in compiling integrated task and data parallel programs and reports on the use of this model for programming various flat and nested task structures. Performance results are presented for a set of programs spanning signal processing, image processing, computer vision ...
Optimal Use of Mixed Task and Data Parallelism for Pipelined Computations
 JOURNAL OF PARALLEL AND DISTRIBUTED COMPUTING
, 2000
Cited by 24 (0 self)
Abstract
This paper addresses optimal mapping of parallel programs composed of a chain of data parallel tasks onto the processors of a parallel system. The input to the programs is a stream of data sets, each of which is processed in order by the chain of tasks. This computation structure, also referred to as a data parallel pipeline, is common in several application domains, including digital signal processing, image processing, and computer vision. The performance parameters for such stream processing are latency (the time to process an individual data set) and throughput (the aggregate rate at which data sets are processed). These two criteria are distinct since multiple data sets can be pipelined or processed in parallel. The central contribution of this research is a new algorithm to determine a processor mapping for a chain of tasks that optimizes latency in the presence of a throughput constraint. We also discuss how this algorithm can be applied to solve the converse problem of o...
Minimizing Execution Time in MPI Programs on an Energy-Constrained, Power-Scalable Cluster
, 2006
Cited by 23 (5 self)
Abstract
Recently, the high-performance computing community has realized that power is a performance-limiting factor. One reason for this is that supercomputing centers have limited power capacity and machines are starting to hit that limit. In addition, the cost of energy has become increasingly significant, and the heat produced by higher-energy components tends to reduce their reliability. One way to reduce power (and therefore energy) requirements is to use high-performance cluster nodes that are frequency- and voltage-scalable (e.g., AMD64 processors). The problem we address in this paper is: given a target program, a power-scalable cluster, and an upper limit for energy consumption, choose a schedule (number of nodes and CPU frequency) that simultaneously (1) satisfies an external upper limit for energy consumption and (2) minimizes execution time. There are too many schedules for an exhaustive search. Therefore, we find a schedule through a novel combination of performance modeling, performance prediction, and program execution. Using our technique, we are able to find a near-optimal schedule for all of our benchmarks in just a handful of partial program executions.
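The schedule-selection problem stated in this abstract can be illustrated with a brute-force search over a toy model. The function `pick_schedule` and its linear-speedup and cubic frequency-power assumptions are stand-ins for illustration only; the paper's point is precisely that real programs need measured models and partial executions rather than such closed-form guesses.

```python
def pick_schedule(nodes_options, freq_options, energy_budget,
                  serial_time, base_freq, power_per_node):
    """Enumerate (nodes, frequency) schedules; keep the fastest one whose
    predicted energy fits the budget. All models here are assumptions:
    linear speedup in nodes, runtime inversely proportional to frequency,
    dynamic power scaling with f^3 (a CMOS rule of thumb)."""
    best = None
    for n in nodes_options:
        for f in freq_options:
            time = serial_time / n * (base_freq / f)          # predicted runtime
            power = n * power_per_node * (f / base_freq) ** 3  # predicted power
            energy = power * time
            if energy <= energy_budget and (best is None or time < best[0]):
                best = (time, n, f)
    return best  # (predicted time, nodes, GHz) or None if infeasible

sched = pick_schedule(nodes_options=[1, 2, 4, 8],
                      freq_options=[1.2, 1.6, 2.0],  # GHz
                      energy_budget=6000.0,          # Joules
                      serial_time=800.0, base_freq=2.0, power_per_node=20.0)
print(sched)
```

Under this toy model only the lowest frequency fits the 6000 J budget, and the search then adds nodes to recover speed: it picks 8 nodes at 1.2 GHz. A real search space is far too large to enumerate per-schedule via full runs, which is why the paper combines modeling, prediction, and partial executions.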
Complexity Results for Throughput and Latency Optimization of Replicated and Data-parallel Workflows
 ALGORITHMICA
, 2007
Cited by 14 (11 self)
Abstract
Mapping applications onto parallel platforms is a challenging problem, even for simple application patterns such as pipeline or fork graphs. Several antagonist criteria should be optimized for workflow applications, such as throughput and latency (or a combination). In this paper, we consider a simplified model with no communication cost, and we provide an exhaustive list of complexity results for different problem instances. Pipeline or fork stages can be replicated in order to increase the throughput by sending consecutive data sets onto different processors. In some cases, stages can also be data-parallelized, i.e., the computation of one single data set is shared between several processors. This leads to a decrease of the latency and an increase of the throughput. Some instances of this simple model are shown to be NP-hard, thereby exposing the inherent complexity of the mapping problem. We provide polynomial algorithms for other problem instances. Altogether, we provide solid theoretical foundations for the study of mono-criterion or bi-criteria mapping optimization problems.
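The distinction the abstract draws between replication and data-parallelization can be made concrete with a small calculator over the same communication-free model. The helper `period_and_latency` is a hypothetical illustration: replicating a stage r-fold spreads consecutive data sets over r copies, dividing its contribution to the period (inverse throughput) by r but leaving latency untouched, while data-parallelizing it d-fold shortens the stage time itself, improving both metrics.

```python
def period_and_latency(stages):
    """stages: list of (time, replication, data_parallelism) per stage.
    Period (inverse throughput) is set by the slowest stage, with both
    replication and data-parallelism helping; latency is the time one
    data set spends in the chain, which only data-parallelism reduces."""
    period = max(t / (r * d) for t, r, d in stages)
    latency = sum(t / d for t, r, d in stages)
    return period, latency

# One slow 6.0-unit stage, handled three ways:
base       = [(2.0, 1, 1), (6.0, 1, 1)]
replicated = [(2.0, 1, 1), (6.0, 3, 1)]   # 3 round-robin copies
data_par   = [(2.0, 1, 1), (6.0, 1, 3)]   # split across 3 processors
print(period_and_latency(base))        # (6.0, 8.0)
print(period_and_latency(replicated))  # (2.0, 8.0)
print(period_and_latency(data_par))    # (2.0, 4.0)
```

Both transformations triple the throughput here, but only data-parallelization halves the latency, which is why the paper's bi-criteria instances treat the two mechanisms differently.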
A Programming Environment for Packet-processing Systems: Design Considerations
 In Workshop on Network Processors & Applications - NP3
, 2004
Cited by 13 (7 self)
Abstract
In this paper, we describe the vision and the design of a programming environment, called Shangri-La, aimed at making future generations of packet-processing systems – multi-core, lightweight threaded hardware in general, and network processor (NP)-based systems in particular – as easily programmable as today’s workstations and servers. Our environment consists of: (1) a domain-specific programming language for specifying packet-processing applications, (2) a compiler that incorporates profile-guided techniques for mapping packet-processing applications onto complex packet-processing system architectures, and (3) a runtime system that dynamically adapts resource allocations to create systems that are robust against attacks and that optimize performance and power consumption for the current network conditions. We justify our design and articulate the challenges in designing each of these components.
A Mapping Methodology for Designing Software Task Pipelines for Embedded Signal Processing
 In the 3rd International Workshop on Embedded HPC Systems and Applications (EHPC’98) at the 12th International Parallel Processing Symposium (IPPS’98)
, 1998
Cited by 9 (7 self)
Abstract
In this paper, we present a methodology for mapping an Embedded Signal Processing (ESP) application onto HPC platforms such that the throughput performance is maximized. Previous approaches used a linear pipelined execution model, which restricts the mapping choices. We show that the "optimal" solution obtained under that model can be improved using the proposed execution model. Based on the new model, a three-step task mapping methodology is developed. The methodology is demonstrated by designing Software Task Pipelines for modern radar and sonar signal processing applications. Experimental results show improved performance using our approach over those obtained by previous approaches.
Multicriteria scheduling of pipeline workflows
 IN HETEROPAR’07, THE 6TH INTERNATIONAL WORKSHOP ON ALGORITHMS, MODELS AND TOOLS FOR PARALLEL COMPUTING ON HETEROGENEOUS NETWORKS
, 2007
Mapping Linear Workflows with Computation/Communication Overlap
Cited by 7 (4 self)
Abstract
This paper presents theoretical results related to mapping and scheduling linear workflows onto heterogeneous platforms. We use a realistic architectural model with bounded communication capabilities and full computation/communication overlap. This model is representative of current multithreaded systems. In these workflow applications, the goal is often to maximize throughput or to minimize latency. We present several complexity results related to both these criteria. To be precise, we prove that maximizing the throughput is NP-complete even for homogeneous platforms, and minimizing the latency is NP-complete for heterogeneous platforms. Moreover, we present an approximation algorithm for throughput maximization for linear chain applications on homogeneous platforms, and an approximation algorithm for latency minimization for linear chain applications on all platforms where communication is homogeneous (the processor speeds can differ). In addition, we present algorithms for several important special cases for linear chain applications. Finally, we consider the implications of adding feedback loops to linear chain applications.