Results 1–10 of 30
Static Scheduling Algorithms for Allocating Directed Task Graphs to Multiprocessors
, 1999
Abstract

Cited by 326 (5 self)
Devices]: Modes of Computation—Parallelism and concurrency. General Terms: Algorithms, Design, Performance, Theory. Additional Key Words and Phrases: Automatic parallelization, DAG, multiprocessors, parallel processing, software tools, static scheduling, task graphs. ACM Computing Surveys, Vol. 31, No. 4, December 1999.
Efficient Scheduling of Arbitrary Task Graphs to Multiprocessors using A Parallel Genetic Algorithm
 JOURNAL OF PARALLEL AND DISTRIBUTED COMPUTING
, 1997
Abstract

Cited by 44 (7 self)
Given a parallel program represented by a task graph, the objective of a scheduling algorithm is to minimize the overall execution time of the program by properly assigning the nodes of the graph to the processors. This multiprocessor scheduling problem is NP-complete even with simplifying assumptions, and becomes more complex under relaxed assumptions such as arbitrary precedence constraints and arbitrary task execution and communication times. The present literature on this topic is a large repertoire of heuristics that produce good solutions in a reasonable amount of time. These heuristics, however, have restricted applicability in a practical environment because they suffer from a number of fundamental problems, including high time complexity, lack of scalability, and no performance guarantee with respect to optimal solutions. Recently, genetic algorithms (GAs) have been widely regarded as a useful vehicle for obtaining high-quality or even optimal solutions for a broad range of combinato...
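The list-scheduling pattern shared by many of the heuristics this abstract refers to can be sketched in a few lines: visit tasks in topological order and place each on the processor that can finish it earliest, charging a communication delay whenever a predecessor ran on a different processor. All data and names below are hypothetical; this is a generic illustration, not any particular published algorithm.

```python
def list_schedule(tasks, deps, cost, comm, n_procs):
    """tasks: task ids in topological order; deps: {task: [predecessors]};
    cost: {task: execution time}; comm: {(pred, task): transfer time
    charged only if the two run on different processors}."""
    proc_free = [0.0] * n_procs          # earliest free time per processor
    placed = {}                          # task -> (processor, finish time)
    for t in tasks:
        best = None
        for p in range(n_procs):
            # a task may start once processor p is free and every
            # predecessor's data has arrived (comm cost if remote)
            ready = proc_free[p]
            for d in deps.get(t, []):
                dp, dfin = placed[d]
                arrive = dfin + (comm.get((d, t), 0) if dp != p else 0)
                ready = max(ready, arrive)
            finish = ready + cost[t]
            if best is None or finish < best[1]:
                best = (p, finish)
        placed[t] = best
        proc_free[best[0]] = best[1]
    return placed
```

With a fork-shaped graph and expensive communication, the heuristic keeps dependent tasks on one processor rather than paying the transfer cost, which is exactly the trade-off these heuristics navigate.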
High-throughput Bayesian computing machine with reconfigurable hardware
 in FPGA ’10: Proceedings of the 18th annual ACM/SIGDA international
Abstract

Cited by 9 (5 self)
We use reconfigurable hardware to construct a high-throughput Bayesian computing machine (BCM) capable of evaluating probabilistic networks with arbitrary DAG (directed acyclic graph) topology. Our BCM achieves high throughput by exploiting the FPGA’s distributed memories and abundant hardware structures (such as long carry-chains and registers), which enables us to 1) develop an innovative memory allocation scheme based on a maximal matching algorithm that completely avoids memory stalls, 2) optimize and deeply pipeline the logic design of each processing node, and 3) schedule the nodes optimally. The BCM architecture not only can be applied to many important algorithms in artificial intelligence, signal processing, and digital communications, but also has high reusability, i.e., a new application need not change a BCM’s hardware design; only new task graph processing and code compilation are necessary. Moreover, the throughput of a BCM scales almost linearly with the size of the FPGA on which it is implemented. A Bayesian computing machine with 16 processing nodes was implemented with a Virtex-5 FPGA (XCV5LX155T2) on a BEE3 (Berkeley Emulation Engine) platform. For a wide variety of sample Bayesian problems, comparing running the same network evaluation algorithm on a 2.4 GHz Intel Core 2 Duo processor and a GeForce 9400M using the CUDA software package, the BCM demonstrates 80x and 15x speedups, respectively, with a peak throughput of 20.4
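The stall-free memory allocation mentioned above rests on bipartite maximum matching: each per-cycle read request must be paired with a distinct memory bank that holds a copy of its operand. A generic augmenting-path routine (Kuhn's algorithm) illustrates the idea; the request/bank data below is hypothetical, and this is not the paper's actual allocator.

```python
def max_matching(requests, banks_for):
    """requests: list of request ids; banks_for: {request: iterable of
    banks that hold its operand}. Returns (matched count, bank->request)."""
    match = {}                           # bank -> request currently assigned

    def try_assign(r, seen):
        for b in banks_for[r]:
            if b in seen:
                continue
            seen.add(b)
            # take the bank if it is free, or if its current holder
            # can be rerouted to some other bank (augmenting path)
            if b not in match or try_assign(match[b], seen):
                match[b] = r
                return True
        return False

    matched = sum(try_assign(r, set()) for r in requests)
    return matched, match
```

If every request can be matched, every read proceeds in parallel and no cycle stalls; an unmatched request is what would force a stall (or, in hardware, an extra operand replica).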
A near-optimal solution for the heterogeneous multiprocessor single-level voltage setup problem
 in Proceedings of the International Parallel and Distributed Processing Symposium (IPDPS). Los Alamitos
Abstract

Cited by 8 (0 self)
A heterogeneous multiprocessor (HeMP) system consists of several heterogeneous processors, each of which is specially designed to deliver the best energy-saving performance for a particular category of applications. A low-power real-time scheduling algorithm is required to schedule tasks on such a system so as to minimize its energy consumption while completing all tasks by their deadline. The problem of determining the optimal speed for each processor to minimize the total energy consumption is called the voltage setup problem. This paper provides a near-optimal solution for the HeMP single-level voltage setup problem. To the best of our knowledge, this is the first work to address the problem. Initially, each task is assigned to a processor in a local-optimal manner. We next propose a couple of solutions to reduce energy by migrating tasks between processors. Finally, we determine each processor’s speed from its final workload and the deadline. We conducted a series of simulations to evaluate our algorithms. The results show that the local-optimal partition leads to a considerably better energy-saving schedule than a commonly used homogeneous multiprocessor scheduling algorithm. Furthermore, at all measured configurations, our energy consumption is at most 3% more than the optimal value obtained by an exhaustive iteration over all possible task-to-processor assignments. In summary, our work provides a near-optimal solution in polynomial time.
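The final speed-setting step described above can be sketched with the standard convex energy model: each processor runs at the lowest constant speed that finishes its workload by the deadline, and with power proportional to s³, energy per processor works out to s² × workload. The constants and model are the textbook approximation, not the paper's exact cost functions.

```python
def setup_speeds(workloads, deadline):
    """workloads: cycles assigned to each processor; deadline: seconds.
    Returns (per-processor speeds, total energy under E ∝ s^2 * w)."""
    speeds = [w / deadline for w in workloads]   # slowest feasible speed
    # power ∝ s^3, so energy = power * time = s^3 * (w / s) = s^2 * w
    energy = sum(s * s * w for s, w in zip(speeds, workloads))
    return speeds, energy
```

Because energy is quadratic in speed, moving work from a heavily loaded processor to a lightly loaded one lowers the total even when the cycle count is unchanged; that convexity is what makes the migration step in the abstract pay off.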
Runtime Data Flow Scheduling of Matrix Computations
, 2009
Abstract

Cited by 7 (3 self)
We investigate the scheduling of matrix computations expressed as directed acyclic graphs for shared-memory parallelism. Because of the data granularity in this problem domain, even slight variations in load balance or data locality can greatly affect performance. Well-known scheduling algorithms such as work stealing have proven time and space bounds, but these bounds do not provide a discernible indicator of relative performance between different scheduling algorithms and heuristics. We provide a flexible framework for scheduling matrix computations, which we use to empirically quantify different scheduling algorithms. By building software solutions based on hardware techniques, leveraging a cache coherence protocol, we develop a scheduling algorithm that addresses both load balance and data locality simultaneously, and we show its performance benefits.
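The runtime data-flow execution this abstract describes boils down to dependence counters: each task records how many inputs it still awaits, and completing a task decrements the counters of its successors, releasing them when they hit zero. A toy single-threaded version (illustrative task names from a Cholesky-style DAG; not the paper's framework):

```python
from collections import deque

def dataflow_execute(pending, succs):
    """pending: {task: number of unmet inputs}; succs: {task: [tasks that
    consume its output]}. Returns one valid dynamic execution order."""
    counts = dict(pending)
    ready = deque(t for t, c in counts.items() if c == 0)
    order = []
    while ready:
        t = ready.popleft()              # a real runtime picks per worker,
        order.append(t)                  # e.g. by locality or stealing
        for s in succs.get(t, []):
            counts[s] -= 1               # one input of s just became available
            if counts[s] == 0:
                ready.append(s)
    return order
```

The scheduling-policy question the paper studies lives entirely in which ready task each worker pops: FIFO, work stealing, or a locality-aware choice all share this same counter mechanism.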
Partitioning and Scheduling Using Graph Decomposition
 In Twenty-eighth Annual ACM Symposium on Theory of Computing
, 1993
Abstract

Cited by 5 (4 self)
Automated parallelization of source code is a goal on which many researchers in parallel computing have focused. The increasing availability of parallel computers, the difficulty of creating good parallel programs, and the vast amount of existing serial source code all contribute to the need for automated means of parallelization. This paper centers on the issues of partitioning and scheduling within automatic parallelization, i.e., the creation of appropriately sized tasks and their assignment to processors. An algorithm is introduced which uses the program dependence graph (PDG) representation of serial programs and relies on a prior graph decomposition, or parse, for identification of parallelism. The algorithm uses local heuristics to determine the cost effectiveness of each opportunity for parallelization, and creates and schedules tasks accordingly. Keywords: partitioning, scheduling, automatic parallelization, graph decomposition, program dependence graph.
A modular genetic algorithm for scheduling task graphs
, 2003
Abstract

Cited by 5 (1 self)
Several genetic algorithms have been designed for the problem of scheduling task graphs onto multiprocessors, the primary distinction among most of them being the chromosomal representation used for a schedule. However, these existing approaches are monolithic, as they attempt to scan the entire solution space without considering techniques that can reduce the complexity of the optimization. In this paper, a genetic algorithm based on a bi-chromosomal representation and capable of being incorporated into a clustering/merging optimization framework is proposed, and it is experimentally shown to outperform a leading genetic algorithm for scheduling.
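To make the chromosome idea concrete, here is a deliberately minimal GA where a chromosome is simply a task-to-processor assignment and fitness is the makespan under a toy model that ignores precedence and communication. This single-chromosome baseline is what such papers improve upon; it is not the bi-chromosomal scheme proposed above, and all parameters are illustrative.

```python
import random

def ga_assign(costs, n_procs, pop=20, gens=50, seed=0):
    """costs: per-task execution times; returns the best chromosome found,
    i.e. a list mapping task index -> processor id."""
    rng = random.Random(seed)
    n = len(costs)

    def makespan(chrom):
        load = [0.0] * n_procs
        for t, p in enumerate(chrom):
            load[p] += costs[t]
        return max(load)                 # finish time of the busiest processor

    population = [[rng.randrange(n_procs) for _ in range(n)]
                  for _ in range(pop)]
    for _ in range(gens):
        population.sort(key=makespan)
        survivors = population[: pop // 2]      # elitist selection
        children = []
        while len(survivors) + len(children) < pop:
            a, b = rng.sample(survivors, 2)
            cut = rng.randrange(1, n)           # one-point crossover
            child = a[:cut] + b[cut:]
            if rng.random() < 0.2:              # mutation: move one task
                child[rng.randrange(n)] = rng.randrange(n_procs)
            children.append(child)
        population = survivors + children
    return min(population, key=makespan)
```

A real task-graph GA would evaluate fitness with a list scheduler that respects precedence constraints; everything else (selection, crossover, mutation) carries over unchanged.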
On task scheduling accuracy: Evaluation methodology and results
 Journal of Supercomputing
, 2004
Abstract

Cited by 5 (2 self)
Many heuristics based on the directed acyclic graph (DAG) have been proposed for the static scheduling problem. Most of these algorithms apply a simple model of the target system that assumes fully connected processors, a dedicated communication subsystem, and no contention for the communication resources. Only a few algorithms consider the network topology and the contention for the communication resources. This article evaluates the accuracy of task scheduling algorithms and thus the appropriateness of the applied models. An evaluation methodology is proposed and applied to a representative set of scheduling algorithms. The obtained results show a significant inaccuracy of the produced schedules. Analyzing these results is important for the development of more appropriate models and more accurate scheduling algorithms.
Energy-Efficient Embedded Software Implementation on Multiprocessor System-on-Chip with Multiple Voltages
Abstract

Cited by 4 (0 self)
This paper develops energy-driven, completion-ratio-guaranteed scheduling techniques for the implementation of embedded software on multiprocessor systems with multiple supply voltages. We leverage the application’s performance requirements, uncertainties in execution time, and tolerance for reasonable execution failures to scale each processor’s supply voltage at run time and thereby reduce the multiprocessor system’s total energy consumption. Specifically, we study how to trade the difference between the system’s highest achievable completion ratio Qmax and the required completion ratio Q0 for energy saving. First, we propose a best-effort energy minimization algorithm (BEEM1) that achieves Qmax with the provably minimum energy consumption. We then relax its unrealistic assumption on the application’s real execution time and develop algorithm BEEM2, which requires only the application’s best- and worst-case execution times. Finally, we propose a hybrid offline/online completion-ratio-guaranteed energy minimization algorithm (QGEM) that provides the required Q0 with further energy reduction based on the probabilistic distribution of the application’s execution time. We implement the proposed algorithms and verify their energy efficiency on real-life DSP applications and the TGFF random benchmark suite. BEEM1, BEEM2, and QGEM all provide the required completion ratio, with average energy reductions of 28.7%, 26.4%, and 35.8%, respectively.
Extending IC-scheduling via the Sweep algorithm
 in: 16th Euromicro Conf. on Parallel, Distributed, and Network-Based Processing
Abstract

Cited by 3 (1 self)
A key challenge when scheduling computations over the Internet is temporal unpredictability: remote “workers” arrive and depart at unpredictable times and often provide unpredictable computational resources; the time for communication over the Internet is impossible to predict accurately. In response, earlier research has developed the underpinnings of a theory of how to schedule computations having inter-task dependencies in a way that renders tasks eligible for execution at the maximum possible rate. Simulation studies suggest that such scheduling: (a) utilizes resource providers’ computational resources well, by enhancing the likelihood of having work to allocate to an available client; and (b) lessens the likelihood of a computation’s stalling for lack of tasks that are eligible for execution. The applicability of the current version of the theory is limited by its demands on the structure of the dag that models the computation being scheduled, namely, that the dag be decomposable into connected bipartite “building-block” dags. The current paper extends the theory by developing the Sweep Algorithm, which takes a significant step toward removing this