Results 1-10 of 14
Space-Efficient Scheduling of Multithreaded Computations
 SIAM Journal on Computing
, 1993
Abstract

Cited by 81 (14 self)
This paper considers the problem of scheduling dynamic parallel computations to achieve linear speedup without using significantly more space per processor than that required for a single-processor execution. Utilizing a new graph-theoretic model of multithreaded computation, execution efficiency is quantified by three important measures: T_1 is the time required to execute the computation on 1 processor, T_∞ is the time required by an infinite number of processors, and S_1 is the space required to execute the computation on 1 processor. A computation executed on P processors is time-efficient if the time is O(T_1/P + T_∞), that is, it achieves linear speedup when P = O(T_1/T_∞), and it is space-efficient if it uses O(S_1·P) total space, that is, the space per processor is within a constant factor of that required for a 1-processor execution. The first result derived from this model shows that there exist multithreaded computations such that no execution schedule can simultan...
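The time- and space-efficiency conditions quoted above can be sanity-checked numerically. The sketch below is illustrative only: the constant factors are taken as 1, and the values of T_1, T_∞, and S_1 are made up, not figures from the paper.

```python
def time_bound(T1, Tinf, P):
    """Upper bound O(T1/P + Tinf) on P-processor execution time
    (constant factor taken as 1 for illustration)."""
    return T1 / P + Tinf

def space_bound(S1, P):
    """A space-efficient schedule uses O(S1 * P) total space."""
    return S1 * P

# Made-up instance: 10^6 units of work, critical path 10^3, serial space 64.
T1, Tinf, S1 = 1_000_000, 1_000, 64
for P in (1, 10, 100, 1_000):
    tb = time_bound(T1, Tinf, P)
    print(f"P={P:5d}  time <= {tb:9.0f}  "
          f"speedup >= {T1 / tb:6.1f}  total space <= {space_bound(S1, P)}")
```

As the loop shows, the speedup stays within a constant factor of P as long as P = O(T_1/T_∞), which is the linear-speedup condition stated in the abstract.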
Executing Multithreaded Programs Efficiently
, 1995
Abstract

Cited by 70 (8 self)
The data locality of work stealing
 Theory of Computing Systems
, 2000
Abstract

Cited by 68 (14 self)
This paper studies the data locality of the work-stealing scheduling algorithm on hardware-controlled shared-memory machines. We present lower and upper bounds on the number of cache misses using work stealing, and introduce a locality-guided work-stealing algorithm along with experimental validation. As a lower bound, we show that there is a family of multithreaded computations G_n, each member of which requires Θ(n) total instructions (work), for which, when using work stealing, the number of cache misses on one processor is constant, while even on two processors the total number of cache misses is Ω(n). This implies that for general computations there is no useful bound relating multiprocessor to uniprocessor cache misses. For nested-parallel computations, however, we show that on P processors the expected additional number of cache misses beyond those on a single processor is bounded by O(C⌈m/s⌉P·T_∞), where m is the execution time of an instruction incurring a cache miss, s is the steal time, C is the size of the cache, and T_∞ is the number of nodes on the longest chain of dependences. Based on this we give strong bounds on the total running time of nested-parallel computations using work stealing. For the second part of our results, we present a locality-guided work-stealing algorithm that improves the data locality of multithreaded computations by allowing a thread to have an affinity for a processor. Our initial experiments on iterative data-parallel applications show that the algorithm matches the performance of static partitioning under traditional workloads but improves performance by up to 50% over static partitioning under multiprogrammed workloads. Furthermore, locality-guided work stealing improves the performance of work stealing by up to 80%.
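The cache-miss bound for nested-parallel computations can be written out as a small helper. This is a sketch only: the hidden constant is taken as 1, and every parameter value below is invented rather than taken from the paper.

```python
import math

def extra_cache_misses(C, m, s, P, Tinf):
    """Expected additional misses beyond a 1-processor execution, per the
    quoted bound O(C * ceil(m/s) * P * Tinf); constant factor taken as 1."""
    return C * math.ceil(m / s) * P * Tinf

# Invented example: 1024-line cache, miss cost 100 cycles, steal cost 400
# cycles, 8 processors, critical path of 50 nodes.
print(extra_cache_misses(C=1024, m=100, s=400, P=8, Tinf=50))
```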
Communication Complexity for Parallel Divide-and-Conquer
 In Proceedings of the 32nd Annual Symposium on Foundations of Computer Science
, 1991
Abstract

Cited by 29 (2 self)
This paper studies the relationship between parallel computation cost and communication cost for performing divide-and-conquer (D&C) computations on a parallel system of p processors. The parallel computation cost is the maximal number of the D&C nodes that any processor in the parallel system may expand, whereas the communication cost is the total number of cross nodes. A cross node is a node which is generated by one processor but expanded by another processor. A new scheduling algorithm is proposed, whose parallel computation cost and communication cost are at most ⌈N/p⌉ and p·d·h, respectively, for any D&C computation tree with N nodes, height h, and degree d. Also, lower bounds on the communication cost are derived. In particular, it is shown that for each scheduling algorithm and for each positive ε_C < 1, which can be arbitrarily close to 0, there are values of N, h, d, p, and ε_T (> 0), for which if the parallel computation cost is between N/p (the minimum) and (1 + ε_T ...
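The two cost bounds quoted for the proposed scheduler, ⌈N/p⌉ expanded nodes per processor and p·d·h cross nodes, are easy to tabulate. The sketch below merely evaluates the bounds for an invented tree; it is not an implementation of the scheduling algorithm itself.

```python
import math

def dnc_cost_bounds(N, h, d, p):
    """Upper bounds from the abstract: per-processor computation cost
    ceil(N/p) and total communication cost p*d*h cross nodes."""
    return math.ceil(N / p), p * d * h

# Invented binary D&C tree: 10^6 nodes, height 20, degree 2, 64 processors.
comp, comm = dnc_cost_bounds(N=1_000_000, h=20, d=2, p=64)
print(comp, comm)
```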
Managing Storage for Multithreaded Computations
, 1992
Abstract

Cited by 11 (2 self)
Multithreading has become a dominant paradigm in general-purpose MIMD parallel computation. To execute a multithreaded computation on a parallel computer, a scheduler must order and allocate threads to run on the individual processors. The scheduling algorithm dramatically affects both the speedup attained and the space used when executing the computation. We consider the problem of scheduling multithreaded computations to achieve linear speedup without using significantly more space per processor than required for a single-processor execution. We show that for general multithreaded computations, no scheduling algorithm can simultaneously make efficient use of space and time. In particular, we show that there exist multithreaded computations such that any execution schedule X that achieves P-processor execution time T_P(X) ≤ T_1/ρ, where T_1 is the minimum possible serial execution time, must use space at least S_P(X) ≥ (1/4)(ρ − 1)√T_1 + S_1, where S_1 is the space use...
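The time/space trade-off in the lower bound can be made concrete by evaluating it directly. The constant 1/4 is the one stated in the abstract; the instance values below are invented for illustration.

```python
import math

def min_space(rho, T1, S1):
    """Lower bound S_P(X) >= (1/4)*(rho - 1)*sqrt(T1) + S1 for any
    schedule X achieving P-processor time T_P(X) <= T1/rho."""
    return 0.25 * (rho - 1) * math.sqrt(T1) + S1

# Invented instance: serial time 10^6, serial space 100. Demanding a larger
# speedup factor rho forces a larger space lower bound.
for rho in (2, 10, 100):
    print(rho, min_space(rho, T1=1_000_000, S1=100))
```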
Tight Bounds For On-Line Tree Embeddings
In Proceedings of the 2nd ACM-SIAM Symposium On Discrete Algorithms
, 1991
Abstract

Cited by 8 (0 self)
Tree-structured computations are relatively easy to process in parallel. As leaf processes are recursively spawned, they can be assigned to independent processors in a multicomputer network. However, to achieve good performance the on-line mapping algorithm must maintain load balance, i.e., distribute processes equitably among processors. Additionally, the algorithm itself must be distributed in nature, and process allocation must be completed via message-passing with minimal communication overhead. This paper investigates bounds on the performance of deterministic and randomized algorithms for on-line tree embeddings. In particular, we study trade-offs between computation overhead (load imbalance) and communication overhead (message congestion). We give a simple technique to derive lower bounds on the congestion that any on-line allocation algorithm must incur in order to guarantee load balance. This technique works for both randomized and deterministic algorithms. We prove that the a...
Efficient Parallel Divide-and-Conquer for a Class of Interconnection Topologies.
 In Proceedings of the 2nd International Symposium on Algorithms, number 557 in Lecture Notes in Computer Science
, 1991
Abstract

Cited by 7 (1 self)
In this paper, we propose an efficient scheduling algorithm for expanding any divide-and-conquer (D&C) computation tree on k-dimensional mesh, hypercube, and perfect shuffle networks with p processors. Assume that it takes t_n time steps to expand one node of the tree and t_c time steps to transmit one datum or convey one node. For any D&C computation tree with N nodes, height h, and degree d (the maximal number of children of any node), our algorithm requires at most (N/p + h)·t_n + φ·d·h·t_c time steps, where φ is O(log² p) on a hypercube or perfect shuffle network and is O(p^{1/k}) on an n_{k−1} × ··· × n_0 mesh network, where n_{k−1} = ··· = n_0 = p^{1/k}. This algorithm is general in the sense that it does not know the values of N, h, and d, nor the shape of the computation tree, a priori. Most importantly, we can easily obtain a linear speedup by nearly a factor of p, especially when N ≫ p·h·(1 + φ·d·t_c/t_n).
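The quoted running-time bound can likewise be evaluated for a hypothetical tree. The network-dependent factor φ and every other number below are invented for illustration; this only evaluates the bound, it does not implement the scheduler.

```python
def dnc_time_bound(N, h, d, p, t_n, t_c, phi):
    """Upper bound (N/p + h)*t_n + phi*d*h*t_c from the abstract;
    phi depends on the interconnection network."""
    return (N / p + h) * t_n + phi * d * h * t_c

# Invented: 10^4-node binary tree, height 10, 16 processors,
# t_n = 1 step per expansion, t_c = 5 steps per transfer, phi = 16.
print(dnc_time_bound(N=10_000, h=10, d=2, p=16, t_n=1, t_c=5, phi=16))
```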
Algorithms for Combinatorial Optimization in Real Time and their Automated Refinement by Genetic Programming
University of Illinois at Urbana-Champaign
, 1994
Abstract

Cited by 7 (1 self)
The goal of this research is to develop a systematic, integrated method of designing efficient search algorithms that solve optimization problems in real time. Search algorithms studied in this thesis comprise meta-control and primitive search. The class of optimization problems addressed are called combinatorial optimization problems, examples of which include many NP-hard scheduling and planning problems, and problems in operations-research and artificial-intelligence applications. The problems we have addressed have a well-defined problem objective and a finite set of well-defined problem constraints. In this research, we use state-space trees as problem representations. The approach we have undertaken in designing efficient search algorithms is an engineering approach and consists of two phases: (a) designing generic search algorithms, and (b) improving, by genetics-based machine-learning methods, the parametric heuristics used in the search algorithms designed. Our approach is a systematic method that integrates domain knowledge, search techniques, and automated learning techniques for designing better search algorithms. Knowledge captured in designing one search algorithm can be carried over for designing new ones.
Efficient Scheduling of Strict Multithreaded Computations
, 1999
Abstract

Cited by 2 (1 self)
In this paper we study the problem of efficiently scheduling a wide class of multithreaded computations, called strict; that is, computations in which all dependencies from a thread go to the thread's ancestors in the computation tree. Strict multithreaded computations allow the limited use of synchronization primitives. We present the first fully distributed scheduling algorithm which applies to any strict multithreaded computation. The algorithm is asynchronous, online, and follows the work-stealing paradigm. We prove that our algorithm is efficient not only in terms of its memory requirements and its execution time, but also in terms of its communication complexity. Our analysis applies to both shared- and distributed-memory machines. More specifically, the expected execution time of our algorithm is O(T_1/P + h·T_∞), where T_1 is the minimum serial execution time, T_∞ is the minimum execution time with an infinite number of processors, P is the number of processors, and h is the maxi...
unknown title
Abstract
This paper studies the data locality of the work-stealing scheduling algorithm on hardware-controlled shared-memory machines. We present lower and upper bounds on the number of cache misses using work stealing, and introduce a locality-guided work-stealing algorithm along with experimental validation. As a lower bound, we show that there is a family of multithreaded computations G_n, each member of which requires Θ(n) total operations (work), for which, when using work stealing, the total number of cache misses on one processor is constant, while even on two processors the total number of cache misses is Ω(n). For nested-parallel computations, however, we show that on P processors the expected additional number of cache misses beyond those on a single processor is bounded by O(C⌈m/s⌉P·T_∞), where m is the execution time of an instruction incurring a cache miss, s is the steal time, C is the size of the cache, and T_∞ is the number of nodes on the longest chain of dependences. Based on this we give strong bounds on the total running time of nested-parallel computations using work stealing. For the second part of our results, we present a locality-guided work-stealing algorithm that improves the data locality of multithreaded computations by allowing a thread to have an affinity for a processor. Our initial experiments on iterative data-parallel applications show that the algorithm matches the performance of static partitioning under traditional workloads but improves performance by up to 50% over static partitioning under multiprogrammed workloads. Furthermore, locality-guided work stealing improves the performance of work stealing by up to 80%.