Results 1 - 10
of
25
Complexity Results for Throughput and Latency Optimization of Replicated and Data-parallel Workflow
- ALGORITHMICA
, 2007
"... Mapping applications onto parallel platforms is a challenging problem, even for simple application patterns such as pipeline or fork graphs. Several antagonist criteria should be optimized for workflow applications, such as throughput and latency (or a combination). In this paper, we consider a si ..."
Abstract
-
Cited by 15 (12 self)
- Add to MetaCart
Mapping applications onto parallel platforms is a challenging problem, even for simple application patterns such as pipeline or fork graphs. Several antagonist criteria should be optimized for workflow applications, such as throughput and latency (or a combination). In this paper, we consider a simplified model with no communication cost, and we provide an exhaustive list of complexity results for different problem instances. Pipeline or fork stages can be replicated in order to increase the throughput by sending consecutive data sets onto different processors. In some cases, stages can also be data-parallelized, i.e. the computation of one single data set is shared between several processors. This leads to a decrease of the latency and an increase of the throughput. Some instances of this simple model are shown to be NP-hard, thereby exposing the inherent complexity of the mapping problem. We provide polynomial algorithms for other problem instances. Altogether, we provide solid theoretical foundations for the study of mono-criterion or bi-criteria mapping optimization problems.
Broadcast trees for heterogeneous platforms
- 19th International Parallel and Distributed Processing Symposium (IPDPS’05
, 2005
"... Laboratoire de l'Informatique du Paralle'lisme E'cole Normale Supe'rieure de LyonUnite ' Mixte de Recherche CNRS-INRIA-ENS LYON-UCBL no 5668 ..."
Abstract
-
Cited by 13 (2 self)
- Add to MetaCart
Laboratoire de l'Informatique du Paralle'lisme E'cole Normale Supe'rieure de LyonUnite ' Mixte de Recherche CNRS-INRIA-ENS LYON-UCBL no 5668
Optimizing Latency and Reliability of Pipeline Workflow Applications
, 2008
"... Mapping applications onto heterogeneous platforms is a difficult challenge, even for simple application patterns such as pipeline graphs. The problem is even more complex when processors are subject to failure during the execution of the application. In this paper, we study the complexity of a bi-cr ..."
Abstract
-
Cited by 7 (6 self)
- Add to MetaCart
Mapping applications onto heterogeneous platforms is a difficult challenge, even for simple application patterns such as pipeline graphs. The problem is even more complex when processors are subject to failure during the execution of the application. In this paper, we study the complexity of a bi-criteria mapping which aims at optimizing the latency (i.e., the response time) and the reliability (i.e., the probability that the computation will be successful) of the application. Latency is minimized by using faster processors, while reliability is increased by replicating computations on a set of processors. However, replication increases latency (additional communications, slower processors). The application fails to be executed only if all the processors fail during execution. While simple polynomial algorithms can be found for fully homogeneous platforms, the problem becomes NP-hard when tackling heterogeneous platforms. This is yet another illustration of the additional complexity added by heterogeneity.
Multi-criteria scheduling of pipeline workflows
- In HeteroPar’07, the 6th International Workshop on Algorithms, Models and Tools for Parallel Computing on Heterogeneous Networks
, 2007
"... apport de recherche ISSN 0249-6399 ISRN INRIA/RR--6232--FR+ENGMulti-criteria scheduling of pipeline workflows ..."
Abstract
-
Cited by 7 (7 self)
- Add to MetaCart
apport de recherche ISSN 0249-6399 ISRN INRIA/RR--6232--FR+ENGMulti-criteria scheduling of pipeline workflows
MatrixProduct on Heterogeneous Master-Worker Platforms
"... This paper is focused on designing efficient parallel matrix-product algorithms for heterogeneous master-worker platforms. While matrix-product is well-understood for homogeneous 2D-arrays of processors (e.g., Cannon algorithm and ScaLAPACK outer product algorithm), there are three key hypotheses th ..."
Abstract
-
Cited by 6 (5 self)
- Add to MetaCart
This paper is focused on designing efficient parallel matrix-product algorithms for heterogeneous master-worker platforms. While matrix-product is well-understood for homogeneous 2D-arrays of processors (e.g., Cannon algorithm and ScaLAPACK outer product algorithm), there are three key hypotheses that render our work original and innovative:- Centralized data. We assume that all matrix files originate from, and must be returned to, the master. The master distributes data and computations to the workers (while in ScaLAPACK, input and output matrices are supposed to be equally distributed among participating resources beforehand). Typically, our approach is useful in the context of speeding up MATLAB or SCILAB clients running on a server (which acts as the master and initial repository of files).- Heterogeneous star-shaped platforms. We target fully heterogeneous platforms, where computational resources have different computing powers. Also, the workers are connected to the master by links of different capacities. This framework is realistic when deploying the application from the server, which is responsible for enrolling authorized resources.- Limited memory. As we investigate the parallelization of large problems, we cannot assume that full matrix column blocks can be stored in the worker memories and be re-used for subsequent updates (as in ScaLAPACK). We have devised efficient algorithms for resource selection (deciding which workers to enroll) and communication ordering (both for input and result messages), and we report a set of numerical experiments on a platform at our site. The experiments show that our matrix-product algorithm has smaller execution times than existing ones, while it also uses fewer resources.
Mapping Linear Workflows with Computation/Communication Overlap
"... This paper presents theoretical results related to mapping and scheduling linear workflows onto heterogeneous platforms. We use a realistic architectural model with bounded communication capabilities and full computation/communication overlap. This model is representative of current multi-threaded s ..."
Abstract
-
Cited by 5 (3 self)
- Add to MetaCart
This paper presents theoretical results related to mapping and scheduling linear workflows onto heterogeneous platforms. We use a realistic architectural model with bounded communication capabilities and full computation/communication overlap. This model is representative of current multi-threaded systems. In these workflow applications, the goal is often to maximize throughput or to minimize latency. We present several complexity results related to both these criteria. To be precise, we prove that maximizing the throughput is NP-complete even for homogeneous platforms and minimizing the latency is NP-complete for heterogeneous platforms. Moreover, we present an approximation algorithm for throughput maximization for linear chain applications on homogeneous platforms, and an approximation algorithm for latency minimization for linear chain applications on all platforms where communication is homogeneous (the processor speeds can differ). In addition, we present algorithms for several important special cases for linear chain applications. Finally, we consider the implications of adding feedback loops to linear chain applications.
On the complexity of mapping linear chain applications onto heterogeneous platforms
- Parallel Processing Letters (PPL
, 2009
"... In this paper, we explore the problem of mapping simple application patterns onto large-scale heterogeneous platforms. An important optimization criteria that should be considered in such a framework is the latency, or makespan, which measures the response time of the system in order to process one ..."
Abstract
-
Cited by 3 (3 self)
- Add to MetaCart
In this paper, we explore the problem of mapping simple application patterns onto large-scale heterogeneous platforms. An important optimization criteria that should be considered in such a framework is the latency, or makespan, which measures the response time of the system in order to process one single data set entirely. We focus in this work on linear chain applications, which are representative of a broad class of real-life applications. For such applications, we can consider one-to-one mappings, in which each stage is mapped onto a single processor. However, in order to reduce the communication cost, it seems natural to group stages into intervals. The interval mapping problem can be solved in a straightforward way if the platform has homogeneous communications: the whole chain is grouped into a single interval, which in turn is mapped onto the fastest processor. But the problem becomes harder when considering a fully heterogeneous platform. Indeed, we prove the NP-completeness of this problem. Furthermore, we prove that neither the interval mapping problem nor the similar one-to-one mapping problem can be approximated by any constant factor (unless P=NP).
Realistic Models and Efficient Algorithms for Fault Tolerant Scheduling on Heterogeneous Platforms
, 2008
"... Most list scheduling heuristics rely on a simple platform model where communication contention is not taken into account. In addition, it is generally assumed that processors in the systems are completely safe. To schedule precedence graphs in a more realistic framework, we introduce an efficient fa ..."
Abstract
-
Cited by 2 (2 self)
- Add to MetaCart
Most list scheduling heuristics rely on a simple platform model where communication contention is not taken into account. In addition, it is generally assumed that processors in the systems are completely safe. To schedule precedence graphs in a more realistic framework, we introduce an efficient fault tolerant scheduling algorithm that is both contentionaware and capable of supporting ε arbitrary fail-silent (fail-stop) processor failures. We focus on a bi-criteria approach, where we aim at minimizing the total execution time, or latency, given a fixed number of failures supported in the system. Our algorithm has a low time complexity, and drastically reduces the number of additional communications induced by the replication mechanism. Experimental results fully demonstrate the usefulness of the proposed algorithm, which leads to efficient execution schemes while guaranteeing a prescribed level of fault tolerance.
Revisiting matrix product on master-worker platforms
, 2006
"... This paper is aimed at designing efficient parallel matrix-product algorithms for heterogeneous master-worker platforms. While matrix-product is well-understood for homogeneous 2D-arrays of processors (e.g., Cannon algorithm and ScaLAPACK outer product algorithm), there are three key hypotheses that ..."
Abstract
-
Cited by 2 (2 self)
- Add to MetaCart
This paper is aimed at designing efficient parallel matrix-product algorithms for heterogeneous master-worker platforms. While matrix-product is well-understood for homogeneous 2D-arrays of processors (e.g., Cannon algorithm and ScaLAPACK outer product algorithm), there are three key hypotheses that render our work original and innovative:- Centralized data. We assume that all matrix files originate from, and must be returned to, the master. The master distributes both data and computations to the workers (while in ScaLA-PACK, input and output matrices are initially distributed among participating resources). Typically, our approach is useful in the context of speeding up MATLAB or SCILAB clients running on a server (which acts as the master and initial repository of files).- Heterogeneous star-shaped platforms. We target fully heterogeneous platforms, where computational resources have different computing powers. Also, the workers are connected to the master by links of different capacities. This framework is realistic when deploying the application from the server, which is responsible for enrolling authorized resources.- Limited memory. Because we investigate the parallelization of large problems, we cannot assume that full matrix panels can be stored in the worker memories and re-used for subsequent updates (as in ScaLAPACK). The amount of memory available in each worker is expressed as a given number mi of buffers, where a buffer can store a square block of matrix elements. The size q of these square blocks is chosen so as to harness the power of Level 3 BLAS routines: q = 80 or 100 on most platforms. We have devised efficient algorithms for resource selection (deciding which workers to enroll) and communication ordering (both for input and result messages), and we report a set of numerical experiments on various platforms at École Normale Supérieure de Lyon and the University of Tennessee. However, we point out that in this first version of the report, experiments are limited to homogeneous platforms. 1 1

