Results 1  10
of
21
Static Scheduling Algorithms for Allocating Directed Task Graphs to Multiprocessors
, 1999
"... Devices]: Modes of ComputationParallelism and concurrency General Terms: Algorithms, Design, Performance, Theory Additional Key Words and Phrases: Automatic parallelization, DAG, multiprocessors, parallel processing, software tools, static scheduling, task graphs This research was supported ..."
Abstract

Cited by 208 (4 self)
 Add to MetaCart
Devices]: Modes of ComputationParallelism and concurrency General Terms: Algorithms, Design, Performance, Theory Additional Key Words and Phrases: Automatic parallelization, DAG, multiprocessors, parallel processing, software tools, static scheduling, task graphs This research was supported by the Hong Kong Research Grants Council under contract numbers HKUST 734/96E, HKUST 6076/97E, and HKU 7124/99E. Authors' addresses: Y.K. Kwok, Department of Electrical and Electronic Engineering, The University of Hong Kong, Pokfulam Road, Hong Kong; email: ykwok@eee.hku.hk; I. Ahmad, Department of Computer Science, The Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong. Permission to make digital / hard copy of part or all of this work for personal or classroom use is granted without fee provided that the copies are not made or distributed for profit or commercial advantage, the copyright notice, the title of the publication, and its date appear, and notice is given that copying is by permission of the ACM, Inc. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and / or a fee. 2000 ACM 03600300/99/12000406 $5.00 ACM Computing Surveys, Vol. 31, No. 4, December 1999 1.
Provably efficient scheduling for languages with finegrained parallelism
 IN PROC. SYMPOSIUM ON PARALLEL ALGORITHMS AND ARCHITECTURES
, 1995
"... Many highlevel parallel programming languages allow for finegrained parallelism. As in the popular worktime framework for parallel algorithm design, programs written in such languages can express the full parallelism in the program without specifying the mapping of program tasks to processors. A ..."
Abstract

Cited by 81 (23 self)
 Add to MetaCart
Many highlevel parallel programming languages allow for finegrained parallelism. As in the popular worktime framework for parallel algorithm design, programs written in such languages can express the full parallelism in the program without specifying the mapping of program tasks to processors. A common concern in executing such programs is to schedule tasks to processors dynamically so as to minimize not only the execution time, but also the amount of space (memory) needed. Without careful scheduling, the parallel execution on p processors can use a factor of p or larger more space than a sequential implementation of the same program. This paper first identifies a class of parallel schedules that are provably efficient in both time and space. For any
Models of Machines and Computation for Mapping in Multicomputers
, 1993
"... It is now more than a quarter of a century since researchers started publishing papers on mapping strategies for distributing computation across the computation resource of multiprocessor systems. There exists a large body of literature on the subject, but there is no commonlyaccepted framework ..."
Abstract

Cited by 78 (1 self)
 Add to MetaCart
It is now more than a quarter of a century since researchers started publishing papers on mapping strategies for distributing computation across the computation resource of multiprocessor systems. There exists a large body of literature on the subject, but there is no commonlyaccepted framework whereby results in the field can be compared. Nor is it always easy to assess the relevance of a new result to a particular problem. Furthermore, changes in parallel computing technology have made some of the earlier work of less relevance to current multiprocessor systems. Versions of the mapping problem are classified, and research in the field is considered in terms of its relevance to the problem of programming currently available hardware in the form of a distributed memory multiple instruction stream multiple data stream computer: a multicomputer.
Communication Complexity for Parallel DivideandConquer
 In Proceedings of the 32nd Annual Symposium on Foundations of Computer Science
, 1991
"... This paper studies the relationship between parallel computation cost and communication cost for performing divideandconquer (D&C) computations on a parallel system of p processors. The parallel computation cost is the maximal number of the D&C nodes that any processor in the parallel syst ..."
Abstract

Cited by 29 (2 self)
 Add to MetaCart
This paper studies the relationship between parallel computation cost and communication cost for performing divideandconquer (D&C) computations on a parallel system of p processors. The parallel computation cost is the maximal number of the D&C nodes that any processor in the parallel system may expand, whereas the communication cost is the total number of cross nodes. A cross node is a node which is generated by one processor but expanded by another processor. A new scheduling algorithm is proposed, whose parallel computation cost and communication cost are at most dN=pe and pdh, respectively, for any D&C computation tree with N nodes, height h, and degree d. Also, lower bounds on the communication cost are derived. In particular, it is shown that for each scheduling algorithm and for each positive ffl C ! 1, which can be arbitrarily close to 0, there are values of N , h, d, p, and ffl T (? 0), for which if the parallel computation cost is between N=p (the minimum) and (1 + ffl T ...
A Comparison of Heuristics for Scheduling DAGs on Multiprocessors
 in Proceedings of the Eighth International Parallel Processing Symposium
, 1994
"... Many algorithms to schedule DAGs on multiprocessors have been proposed, but there has been little work done to determine their effectiveness. Since multiprocessor scheduling is an NPhard problem, no exact tractable algorithm exists, and no baseline is available from which to compare the resulting ..."
Abstract

Cited by 26 (1 self)
 Add to MetaCart
Many algorithms to schedule DAGs on multiprocessors have been proposed, but there has been little work done to determine their effectiveness. Since multiprocessor scheduling is an NPhard problem, no exact tractable algorithm exists, and no baseline is available from which to compare the resulting schedules. This paper is an attempt to quantify the differences in a few of the heuristics. The empiracle performance of five heuristics is compared when they are applied to ten specific DAGs which represent program dependence graphs of important applications. The comparison is made between a graph based method, a list scheduling technique and three critical path mathods. 1. Introduction One of the primary problems in creating efficient programs for multiprocessor systems with distributed memory is to partition the program into tasks that can be assigned to different processors for parallel execution. If a high degree of parallelism is the objective, a greater amount of communication will b...
Scheduling Problems in Parallel Query Optimization
, 1995
"... We introduce a class of novel multiprocessor scheduling problems that arise in the optimization of SQL queries for parallel machines. These consist of scheduling a tree of interdependent communicating operators while exploiting both interoperator and intraoperator parallelism. We develop algorithm ..."
Abstract

Cited by 21 (1 self)
 Add to MetaCart
We introduce a class of novel multiprocessor scheduling problems that arise in the optimization of SQL queries for parallel machines. These consist of scheduling a tree of interdependent communicating operators while exploiting both interoperator and intraoperator parallelism. We develop algorithms for the specific problem of scheduling a Pipelined Operator Tree in which all operators run in parallel using interoperator parallelism. Weights associated with nodes and edges represent respectively the cost of operators and communication. Communication cost is incurred only if adjacent operators are assigned different processors. The optimization problem is to assign operators to processors so as to minimize the maximum processor load. We develop two approximation algorithms for this NPhard problem. The faster algorithm has a performance ratio of 3.56 while the slower algorithm has a ratio of 2.87. 1 Introduction Exploiting parallel execution [DG92, Val93] to speed up database querie...
Competitive Implementation of Parallel Programs
"... We apply the methodology of competitive analysis of algorithms to the implementation of programs on parallel machines. We consider the problem of finding the best online distributed scheduling strategy that executes in parallel an unknown directed acyclic graph ..."
Abstract

Cited by 11 (3 self)
 Add to MetaCart
We apply the methodology of competitive analysis of algorithms to the implementation of programs on parallel machines. We consider the problem of finding the best online distributed scheduling strategy that executes in parallel an unknown directed acyclic graph
The Design and Analysis of BulkSynchronous Parallel Algorithms
, 1998
"... The model of bulksynchronous parallel (BSP) computation is an emerging paradigm of generalpurpose parallel computing. This thesis presents a systematic approach to the design and analysis of BSP algorithms. We introduce an extension of the BSP model, called BSPRAM, which reconciles sharedmemory s ..."
Abstract

Cited by 10 (1 self)
 Add to MetaCart
The model of bulksynchronous parallel (BSP) computation is an emerging paradigm of generalpurpose parallel computing. This thesis presents a systematic approach to the design and analysis of BSP algorithms. We introduce an extension of the BSP model, called BSPRAM, which reconciles sharedmemory style programming with efficient exploitation of data locality. The BSPRAM model can be optimally simulated by a BSP computer for a broad range of algorithms possessing certain characteristic properties: obliviousness, slackness, granularity. We use BSPRAM to design BSP algorithms for problems from three large, partially overlapping domains: combinatorial computation, dense matrix computation, graph computation. Some of the presented algorithms are adapted from known BSP algorithms (butterfly dag computation, cube dag computation, matrix multiplication). Other algorithms are obtained by application of established nonBSP techniques (sorting, randomised list contraction, Gaussian elimination without pivoting and with column pivoting, algebraic path computation), or use original techniques specific to the BSP model (deterministic list contraction, Gaussian elimination with nested block pivoting, communicationefficient multiplication of Boolean matrices, synchronisationefficient shortest paths computation). The asymptotic BSP cost of each algorithm is established, along with its BSPRAM characteristics. We conclude by outlining some directions for future research.
A ProcessorTimeMinimal Systolic Array for Cubical Mesh Algorithms
"... Using a directed acyclic graph (dag) model of algorithms, the paper focuses on timeminimal multiprocessor schedules that use as few processors as possible. Such a processortimeminimal scheduling of an algorithm’s dag first is illustrated using a triangular shaped 2D directed mesh (representing, f ..."
Abstract

Cited by 7 (3 self)
 Add to MetaCart
Using a directed acyclic graph (dag) model of algorithms, the paper focuses on timeminimal multiprocessor schedules that use as few processors as possible. Such a processortimeminimal scheduling of an algorithm’s dag first is illustrated using a triangular shaped 2D directed mesh (representing, for example, an algorithm for solving a triangular system of linear equations). Then, algorithms represented by an n × n × n directed mesh are investigated. This cubical directed mesh is fundamental; it represents the standard algorithm for computing matrix product as well as many other algorithms. Completion of the cubical mesh requires 3n − 2 steps. It is shown that the number of processing elements needed to achieve this time bound is at least ⌈3n 2 /4⌉. A systolic array for the cubical directed mesh is then presented. It completes the mesh using the minimum number of steps and exactly ⌈3n 2 /4 ⌉ processing elements: it is processortimeminimal. The systolic array’s topology is that of a hexagonally shaped, cylindrically connected 2D directed mesh.
Task Graph Performance Bounds Through Comparison Methods
, 2001
"... When a parallel computation is represented in a formalism that imposes seriesparallel structure on its task graph, it becomes amenable to automated analysis and scheduling. Unfortunately, its execution time will usually also increase as precedence constraints are added to ensure seriesparallel str ..."
Abstract

Cited by 5 (1 self)
 Add to MetaCart
When a parallel computation is represented in a formalism that imposes seriesparallel structure on its task graph, it becomes amenable to automated analysis and scheduling. Unfortunately, its execution time will usually also increase as precedence constraints are added to ensure seriesparallel structure. Bounding the slowdown ratio would allow an informed tradeoff between the benefits of a restrictive formalism and its cost in loss of performance. This dissertation deals with seriesparallelising task graphs by adding precedence constraints to a task graph, to make the resulting task graph seriesparallel. The weak bounded slowdown conjecture for seriesparallelising task graphs is introduced. This states that the slowdown is bounded if information about the workload can be used to guide the selection of which precedence constraints to add. A theory of best seriesparallelisations is developed to investigate this conjecture. Partial evidence is presented that the weak slowdown bound is likely to be 4/3, and this bound is shown to be tight.