## A Fast Static Scheduling Algorithm for DAGs on an Unbounded Number of Processors (1991)

Citations: 26 (3 self)

### BibTeX

```bibtex
@INPROCEEDINGS{Yang91afast,
  author    = {Tao Yang and Apostolos Gerasoulis},
  title     = {A Fast Static Scheduling Algorithm for DAGs on an Unbounded Number of Processors},
  booktitle = {},
  year      = {1991},
  pages     = {633--642}
}
```

### Abstract

Scheduling parallel tasks on an unbounded number of completely connected processors, when communication overhead is taken into account, is NP-complete. Assuming that task duplication is not allowed, we propose a fast heuristic algorithm, called the dominant sequence clustering algorithm (DSC), for this scheduling problem. The DSC algorithm is superior to several other algorithms from the literature in terms of both computational complexity and parallel time. We present experimental results for scheduling general directed acyclic task graphs (DAGs) and compare the performance of several algorithms. Moreover, we show that DSC is optimum for special classes of DAGs such as join, fork and coarse grain tree graphs.

1 Introduction: Scheduling parallel tasks with precedence relations over distributed memory multiprocessors has been found to be much more difficult than the classical scheduling problem, see Graham [14] and Lenstra and Kan [15]. This is because data transferring between processor...
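The clustering view behind DSC can be illustrated with a small sketch: given task computation costs, edge communication costs, and a cluster assignment, intra-cluster edges incur zero communication, and the parallel time is approximated by the longest path through the clustered DAG. This is a simplified illustration (it ignores the execution ordering of tasks within a cluster, so it is a lower bound), not the paper's DSC implementation; all names are mine.

```python
# Illustrative sketch (not the paper's DSC): parallel time of a clustered DAG,
# approximated as the longest path after zeroing intra-cluster edge costs.
# Ignores sequentialization of tasks inside a cluster, so it is a lower bound.
from collections import defaultdict

def parallel_time(tasks, edges, cluster):
    # tasks: {name: computation cost}; edges: {(u, v): communication cost}
    # cluster: {name: cluster id}
    succ = defaultdict(list)
    indeg = defaultdict(int)
    for (u, v), _ in edges.items():
        succ[u].append(v)
        indeg[v] += 1
    finish = {}
    ready = [t for t in tasks if indeg[t] == 0]
    while ready:                          # Kahn-style topological traversal
        u = ready.pop()
        start = 0
        for (a, b), c in edges.items():
            if b == u:
                delay = 0 if cluster[a] == cluster[b] else c
                start = max(start, finish[a] + delay)
        finish[u] = start + tasks[u]
        for v in succ[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                ready.append(v)
    return max(finish.values())

# Fork DAG: a -> b, a -> c; clustering a with b zeroes that edge's cost.
tasks = {"a": 1, "b": 2, "c": 2}
edges = {("a", "b"): 5, ("a", "c"): 5}
print(parallel_time(tasks, edges, {"a": 0, "b": 0, "c": 1}))  # 8
```

Clustering `a` with `c` instead would be symmetric; leaving every task in its own cluster pays both communication delays.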

### Citations

10953 | Computers and Intractability: A Guide to the Theory of NP-Completeness - Garey, Johnson
Citation Context: ...t increase. The parallel time for a clustering with r clusters is determined by allocating r clusters onto r processors and finding the best execution ordering. This computation is still NP-complete [7]. Sarkar gives an approximation algorithm which uses the level information, i.e. the longest path length from a task to the exit node, to execute the critical tasks first. For the example shown in Fig...
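The level information mentioned in this context — the longest path length from a task to the exit node — can be computed in a single reverse-topological pass. A minimal sketch with made-up weights (the function and graph names are illustrative, not from the paper):

```python
# Sketch: level of each task = longest path (computation + communication)
# from the task to an exit node, computed in reverse topological order.
def levels(tasks, edges):
    # tasks: {name: computation cost}; edges: {(u, v): communication cost}
    order = []
    seen = set()
    def visit(u):                     # depth-first topological sort
        if u in seen:
            return
        seen.add(u)
        for (a, b), _ in edges.items():
            if a == u:
                visit(b)
        order.append(u)               # post-order: exits come first
    for t in tasks:
        visit(t)
    level = {}
    for u in order:                   # successors are already resolved
        succs = [c + level[b] for (a, b), c in edges.items() if a == u]
        level[u] = tasks[u] + (max(succs) if succs else 0)
    return level

# Join DAG: b -> a, c -> a
print(levels({"a": 1, "b": 2, "c": 3},
             {("b", "a"): 4, ("c", "a"): 1}))  # {'a': 1, 'b': 7, 'c': 5}
```

Executing tasks in decreasing level order prioritizes the critical path, which is the idea behind the level heuristic described above.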

404 | Bounds on multiprocessing timing anomalies - Graham - 1969
Citation Context: ...1 Introduction Scheduling parallel tasks with precedence relations over distributed memory multiprocessors has been found to be much more difficult than the classical scheduling problem, see Graham [14] and Lenstra and Kan [15]. This is because data transferring between processors requires substantial transmission delays. Without imposing a limitation on the number of processors, the scheduling prob...

166 | High-Performance Computer Architecture - Stone
Citation Context: ...AG coarse grain if g(G) ≥ 1, otherwise fine grain. If τ_k = R and c_{i,k} = C then the grain of every task and the granularity of the DAG reduce to the ratio R/C, which is the same as Stone's definition [26]. For coarse grain DAGs each task receives or sends a small amount of communication compared to the computation of its adjacent tasks. In [11], we prove the following theorem: Theorem 4.1 For any nonl...
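The granularity idea in this context can be sketched in code. The formula below is a simplified stand-in, not the paper's exact definition: take the grain of a task as the smallest computation cost among its neighbors divided by the largest communication cost on its incident edges, and g(G) as the minimum grain over all tasks. In the uniform case quoted above (every task costs R, every edge costs C), this reduces to Stone's ratio R/C.

```python
# Simplified stand-in for DAG granularity (not the paper's exact formula):
# grain of a task = smallest adjacent computation cost divided by the largest
# incident communication cost; g(G) = minimum grain over all tasks.
def granularity(tasks, edges):
    grains = []
    for x in tasks:
        comm = [c for (u, v), c in edges.items() if x in (u, v)]
        comp = [tasks[v] if u == x else tasks[u]
                for (u, v), _ in edges.items() if x in (u, v)]
        if comm:
            grains.append(min(comp) / max(comm))
    return min(grains)

# Uniform case from the context: every task costs R, every edge costs C,
# so the granularity reduces to R / C.
R, C = 4.0, 2.0
tasks = {"a": R, "b": R, "c": R}
edges = {("a", "b"): C, ("a", "c"): C}
print(granularity(tasks, edges))  # 2.0  (= R / C)
```

A value of at least 1 marks the DAG coarse grain: each task computes at least as much as it communicates with its neighbors.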

142 | Introduction to Parallel and Vector Solution of Linear Systems - Ortega - 1988
Citation Context: ...ly connected, [9, 13, 18, 25, 28]. The two-step method is: 1. Perform task clustering 2. Schedule the clusters on p physical processors. Such an approach is widely used in parallel numerical computing [8, 9, 21, 24]. Clustering has also been used in VLSI processor array design where it is known as the processor projection, [20]. A common cost function for clustering is the minimization of the parallel time. A co...
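Step 2 of the two-step method quoted above can be sketched as a simple greedy load-balancing pass: place each cluster, heaviest first, on the currently least-loaded of the p physical processors. This is an illustration of the idea, not the assignment procedure any of the cited papers prescribe; the names are mine.

```python
import heapq

# Sketch of step 2 of the two-step method: greedy assignment of clusters to
# p processors, placing the next-heaviest cluster on the least-loaded one.
def map_clusters(cluster_loads, p):
    heap = [(0.0, proc) for proc in range(p)]   # (current load, processor id)
    heapq.heapify(heap)
    assignment = {}
    for cid, load in sorted(cluster_loads.items(),
                            key=lambda kv: -kv[1]):   # heaviest first
        total, proc = heapq.heappop(heap)
        assignment[cid] = proc
        heapq.heappush(heap, (total + load, proc))
    return assignment

print(map_clusters({"c0": 5, "c1": 3, "c2": 3, "c3": 1}, 2))
# {'c0': 0, 'c1': 1, 'c2': 1, 'c3': 0}
```

Note that balancing load alone ignores inter-cluster communication, which is why the clustering step (step 1) matters first.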

137 | Towards an architecture-independent analysis of parallel algorithms - Papadimitriou, Yannakakis - 1990
Citation Context: ...unication overhead is solvable in a polynomial time. When communication overhead is present, however, the problem becomes NP-complete, see Sarkar [25], Chretienne [2] and Papadimitriou and Yannakakis [22]. When task duplication is allowed and there is a sufficient number of completely-connected processors, Papadimitriou and Yannakakis [22] have proposed an approximate algorithm whose performance is wi...

120 | A comparison of clustering heuristics for scheduling DAGs on multiprocessors - Gerasoulis, Yang - 1992

119 | A General Approach to Mapping of Parallel Computation upon Multiprocessor Architectures - Kim, Browne - 1988
Citation Context: ...at the completion of its execution the data are sent in parallel to successor tasks. Heuristic scheduling algorithms for arbitrary task graphs have been proposed in the literature, see Kim and Browne [18], Sarkar [25], Wu and Gajski [28]. However, the computational complexity for most of these algorithms is too high and also those algorithms cannot determine the optimum schedule for primitive DAGs. Ou...

102 | Practical multiprocessor scheduling algorithms for efficient parallel processing - Kasahara, Narita - 1984
Citation Context: ...node lists, a partial free list PFL and a free list FL sorted in a descending order of their task priorities. The tie resolution strategy follows the most immediate successors first (MISF) principle [16]. Function head(L) returns the first node in the sorted list L, which is the task with the highest priority. If L = {}, head(L) = NULL and priority(NULL) = 0. Notice that when a free task n_x is sched...
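The free list described in this context can be sketched with a heap whose key encodes both the descending priority and the MISF tie-break (more immediate successors first). `head(L)` then inspects the top without removing it. This is an illustrative data-structure sketch; the names `priority` and `n_succ` are mine.

```python
import heapq

# Sketch of a free list FL sorted by descending task priority, breaking ties
# by most immediate successors first (MISF). heapq is a min-heap, so both
# keys are negated.
class FreeList:
    def __init__(self):
        self._heap = []

    def insert(self, task, priority, n_succ):
        heapq.heappush(self._heap, (-priority, -n_succ, task))

    def head(self):
        # Like head(L): the highest-priority task, or None when the list is
        # empty (standing in for head(L) = NULL when L = {}).
        return self._heap[0][2] if self._heap else None

    def pop(self):
        return heapq.heappop(self._heap)[2] if self._heap else None

fl = FreeList()
fl.insert("x", priority=7, n_succ=1)
fl.insert("y", priority=7, n_succ=3)   # same priority, more successors
fl.insert("z", priority=9, n_succ=0)
print(fl.pop(), fl.pop(), fl.pop())    # z y x
```

With the heap, both insertion and removal cost O(log n), consistent with the low-complexity list operations the heuristic relies on.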

100 | Partitioning and Scheduling Parallel Programs for Execution on Multiprocessors - Sarkar - 1989
Citation Context: ...f processors, the scheduling problem that ignores communication overhead is solvable in a polynomial time. When communication overhead is present, however, the problem becomes NP-complete, see Sarkar [25], Chretienne [2] and Papadimitriou and Yannakakis [22]. When task duplication is allowed and there is a sufficient number of completely-connected processors, Papadimitriou and Yannakakis [22] have pro...

98 | On the granularity and clustering of directed acyclic task graphs - Gerasoulis, Yang - 1993
Citation Context: ...hat linear clustering preserves the parallelism embedded in the DAG, but nonlinear clustering reduces parallelism by sequentializing parallel tasks. A further discussion of this issue can be found in [11]. 2.1 Successive clustering refinements: Most of the previously developed clustering algorithms can be considered as performing a sequence of clustering refinements [10, 12]. At the initial step, each ...

80 | Grain size determination for parallel processing - Kruatrachue, Lewis - 1988
Citation Context: ...ng and Chow [1], Colin and Chretienne [5] have also proposed polynomial optimal algorithms for special classes of task graphs in which communication is smaller than computation. Kruatrachue and Lewis [19] have given an algorithm that uses task duplication. It is not clear, however, whether the task duplication assumption is practical because it could result in a considerable increase in the space comp...

26 | Clustering Task Graphs for Message Passing Architectures - Gerasoulis, Venugopal, et al. - 1990
Citation Context: ...Our approach for solving the scheduling problem is to view the scheduling process as a clustering procedure with the goal of minimizing the overall parallel time, see Gerasoulis, Venugopal and Yang [10] and Gerasoulis and Yang [12]. Section 2 introduces the concept of clustering. Section 2.1 discusses a framework that considers a clustering procedure as performing a sequence of successive clustering...

25 | A polynomial algorithm to optimally schedule tasks on a virtual distributed system under tree-like precedence constraints - Chretienne - 1989
Citation Context: ...n assumption is practical because it could result in a considerable increase in the space complexity. (Supported by Grant No. DMS-8706122 from NSF.) When task duplication is not allowed, Chretienne [3], Anger, Hwang and Chow [1] have proposed special algorithms to derive an optimal schedule for a coarse grain tree DAG. Less progress has been made in developing an optimum algorithm for graphs other ...

25 | Matrix Factorization on a Hypercube Multiprocessor - Geist, Heath - 1985
Citation Context: ...ly connected, [9, 13, 18, 25, 28]. The two-step method is: 1. Perform task clustering 2. Schedule the clusters on p physical processors. Such an approach is widely used in parallel numerical computing [8, 9, 21, 24]. Clustering has also been used in VLSI processor array design where it is known as the processor projection, [20]. A common cost function for clustering is the minimization of the parallel time. A co...

23 | Scheduling with sufficient loosely coupled processors - Anger, Hwang, et al. - 1990
Citation Context: ...a sufficient number of completely-connected processors, Papadimitriou and Yannakakis [22] have proposed an approximate algorithm whose performance is within 50% of the optimum. Anger, Hwang and Chow [1], Colin and Chretienne [5] have also proposed polynomial optimal algorithms for special classes of task graphs in which communication is smaller than computation. Kruatrachue and Lewis [19] have given...

20 | Task Scheduling Over Distributed Memory Machines - Chretienne - 1988
Citation Context: ...scheduling problem that ignores communication overhead is solvable in a polynomial time. When communication overhead is present, however, the problem becomes NP-complete, see Sarkar [25], Chretienne [2] and Papadimitriou and Yannakakis [22]. When task duplication is allowed and there is a sufficient number of completely-connected processors, Papadimitriou and Yannakakis [22] have proposed an approxi...

20 | CPM scheduling with small communication delays and task duplication - Colin, Chrétienne - 1991
Citation Context: ...mpletely-connected processors, Papadimitriou and Yannakakis [22] have proposed an approximate algorithm whose performance is within 50% of the optimum. Anger, Hwang and Chow [1], Colin and Chretienne [5] have also proposed polynomial optimal algorithms for special classes of task graphs in which communication is smaller than computation. Kruatrachue and Lewis [19] have given an algorithm that uses ta...

14 | A Programming Aid for Hypercube Architectures - Wu, Gajski
Citation Context: ...ation is not allowed 2. The number of available processors is unlimited 3. The processors are completely connected 4. The static macro dataflow model of computation, see Sarkar [25] and Wu and Gajski [28]. The task execution is triggered by the arrival of all data and at the completion of its execution the data are sent in parallel to successor tasks. Heuristic scheduling algorithms for arbitrary task...

12 | Two new NP-complete scheduling problems with communication delays and unlimited number of processors - Picouleau - 1995
Citation Context: ...em shows that an optimal linear clustering will give the optimum solution for scheduling a coarse grain DAG. Using the above theorem and the NP-completeness results for coarse grain DAGs by Picouleau [23], we can show that the linear clustering problem is NP-complete for the minimization of the parallel time cost function. Fortunately, for coarse grain DAGs, DSC or any other linear clustering algorith...

11 | Partitioning programs for parallel execution - Girkar, Polychronopoulos
Citation Context: ...e same processor. Task clustering has been used in the two-step method for scheduling tasks on parallel architectures with a bounded number of processors which may or may not be completely connected, [9, 13, 18, 25, 28]. The two-step method is: 1. Perform task clustering 2. Schedule the clusters on p physical processors. Such an approach is widely used in parallel numerical computing [8, 9, 21, 24]. Clustering has al...

10 | Parallel Gaussian elimination on an MIMD computer - Cosnard, Marrakchi, et al. - 1988
Citation Context: ...or Sarkar's vs. O((v + e) log v) for DSC. 5.2 Choleski Decomposition DAG: In the second example we use a well known numerical computing DAG, the Choleski decomposition (CD) DAG given in Cosnard et al. [6], Gerasoulis and Nelken [9]. The natural clustering is a special clustering widely used in the literature for the solution of CD in hypercube architectures, see Saad [24], Ortega [21], Geist and Heath...

8 | A General Approach to Multiprocessor Scheduling - Kim - 1988
Citation Context: ...s of a DAG are in and out-trees. Therefore, by studying the DSC performance on such structures we can further understand its behavior. Other general clustering algorithms proposed by Sarkar [25], Kim [17], Kim and Browne [18], and Wu and Gajski [28] have complexity O(v^2) or higher. As we will show DSC attains the optimum for the above mentioned primitive structures. As far as we know no other genera...

7 | Complexity of tree scheduling with interprocessor communication delays - Chretienne - 1990
Citation Context: ...thm is for forks only. For a comparison, the DSC costs O(m log m) for the fork. 4.3 Performances on in and out-trees: Scheduling in and out-trees is still NP-complete in general as shown by Chretienne [4] and DSC will not give the optimal solution. However, DSC will yield the optimal solution for a coarse grain in-tree. Theorem 4.3 DSC gives the optimal solution for a coarse grain in-tree. Proof: We c...

7 | Static scheduling for linear algebra DAGs - Gerasoulis, Nelken - 1989
Citation Context: ...e same processor. Task clustering has been used in the two-step method for scheduling tasks on parallel architectures with a bounded number of processors which may or may not be completely connected, [9, 13, 18, 25, 28]. The two-step method is: 1. Perform task clustering 2. Schedule the clusters on p physical processors. Such an approach is widely used in parallel numerical computing [8, 9, 21, 24]. Clustering has al...

6 | Dominant Sequence Clustering Heuristic Algorithm for Multiprocessors (Report, 1990); "DSC: Scheduling Parallel Tasks on an Unbounded Number of Processors" - Gerasoulis, Yang - 1994
Citation Context: ...here and simply show its correctness for the initial step (i = 0) where SG = PFL = {} and FL = {all entry nodes}. Then PT_0 = max{priority(n_x)} = level(n_x) = the length of the critical path. See [27] for details. The last theorem is used by DSC to identify edges in DS incrementally at each step. There are two cases: either there is a DS that goes through USG or there is not. If not, that indic...

3 | Complexity of Scheduling under Precedence Constraints - Lenstra, Rinnooy Kan - 1978
Citation Context: ...ng parallel tasks with precedence relations over distributed memory multiprocessors has been found to be much more difficult than the classical scheduling problem, see Graham [14] and Lenstra and Kan [15]. This is because data transferring between processors requires substantial transmission delays. Without imposing a limitation on the number of processors, the scheduling problem that ignores communic...

2 | Gaussian Elimination on Hypercubes (Parallel Algorithms and Architectures) - Saad - 1986
Citation Context: ...ly connected, [9, 13, 18, 25, 28]. The two-step method is: 1. Perform task clustering 2. Schedule the clusters on p physical processors. Such an approach is widely used in parallel numerical computing [8, 9, 21, 24]. Clustering has also been used in VLSI processor array design where it is known as the processor projection, [20]. A common cost function for clustering is the minimization of the parallel time. A co...