Results 1 - 10
of
13
Space-Efficient Scheduling of Multithreaded Computations
- SIAM Journal on Computing
, 1993
"... . This paper considers the problem of scheduling dynamic parallel computations to achieve linear speedup without using significantly more space per processor than that required for a single-processor execution. Utilizing a new graph-theoretic model of multithreaded computation, execution efficiency ..."
Abstract
-
Cited by 72 (11 self)
- Add to MetaCart
. This paper considers the problem of scheduling dynamic parallel computations to achieve linear speedup without using significantly more space per processor than that required for a single-processor execution. Utilizing a new graph-theoretic model of multithreaded computation, execution efficiency is quantified by three important measures: T 1 is the time required for executing the computation on 1 processor, T1 is the time required by an infinite number of processors, and S 1 is the space required to execute the computation on 1 processor. A computation executed on P processors is time-efficient if the time is O(T 1 =P + T1 ), that is, it achieves linear speedup when P = O(T 1 =T1 ), and it is space-efficient if it uses O(S 1 P ) total space, that is, the space per processor is within a constant factor of that required for a 1-processor execution. The first result derived from this model shows that there exist multithreaded computations such that no execution schedule can simultan...
Executing Multithreaded Programs Efficiently
, 1995
"... right to do so. by:::::::::::::::::::::::::::::::::::::::::::::::::::::::: ..."
Abstract
-
Cited by 62 (7 self)
- Add to MetaCart
right to do so. by::::::::::::::::::::::::::::::::::::::::::::::::::::::::
The data locality of work stealing
- Theory of Computing Systems
, 2000
"... This paper studies the data locality of the work-stealing scheduling algorithm on hardware-controlled shared-memory machines. We present lower and upper bounds on the number of cache misses using work stealing, and introduce a locality-guided work-stealing algorithm along with experimental validatio ..."
Abstract
-
Cited by 48 (11 self)
- Add to MetaCart
This paper studies the data locality of the work-stealing scheduling algorithm on hardware-controlled shared-memory machines. We present lower and upper bounds on the number of cache misses using work stealing, and introduce a locality-guided work-stealing algorithm along with experimental validation. As a lower bound, we show that there is a family of multithreaded computations Gn each member of which requires (n) total instructions (work), for which when using work-stealing the number of cache misses on one processor is constant, while even on two processors the total number of cache misses is (n). This implies that for general computations there is no useful bound relating multiprocessor to uninprocessor cache misses. For nested-parallel computations, however, we show that on P processors the expected additional number of cache misses beyond those on a single proces-sor is bounded by O(Cd m e PT1), where m is the execution time s of an instruction incurring a cache miss, s is the steal time, C is the size of cache, and T1 is the number of nodes on the longest chain of dependences. Based on this we give strong bounds on the total running time of nested-parallel computations using work stealing. For the second part of our results, we present a locality-guided work stealing algorithm that improves the data locality of multithreaded computations by allowing a thread to have an affinity for a processor. Our initial experiments on iterative data-parallel applications show that the algorithm matches the performance of staticpartitioning under traditional work loads but improves the performance up to 50 % over static partitioning under multiprogrammed work loads. Furthermore, the locality-guided work stealing improves the performance of work-stealing up to 80%. 1
Communication Complexity for Parallel Divide-and-Conquer
- In Proceedings of the 32nd Annual Symposium on Foundations of Computer Science
, 1991
"... This paper studies the relationship between parallel computation cost and communication cost for performing divide-and-conquer (D&C) computations on a parallel system of p processors. The parallel computation cost is the maximal number of the D&C nodes that any processor in the parallel system may e ..."
Abstract
-
Cited by 29 (2 self)
- Add to MetaCart
This paper studies the relationship between parallel computation cost and communication cost for performing divide-and-conquer (D&C) computations on a parallel system of p processors. The parallel computation cost is the maximal number of the D&C nodes that any processor in the parallel system may expand, whereas the communication cost is the total number of cross nodes. A cross node is a node which is generated by one processor but expanded by another processor. A new scheduling algorithm is proposed, whose parallel computation cost and communication cost are at most dN=pe and pdh, respectively, for any D&C computation tree with N nodes, height h, and degree d. Also, lower bounds on the communication cost are derived. In particular, it is shown that for each scheduling algorithm and for each positive ffl C ! 1, which can be arbitrarily close to 0, there are values of N , h, d, p, and ffl T (? 0), for which if the parallel computation cost is between N=p (the minimum) and (1 + ffl T ...
Managing Storage for Multithreaded Computations
, 1992
"... Multithreading has become a dominant paradigm in general purpose MIMD parallel computation. To execute a multithreaded computation on a parallel computer, a scheduler must order and allocate threads to run on the individual processors. The scheduling algorithm dramatically affects both the speedup a ..."
Abstract
-
Cited by 11 (2 self)
- Add to MetaCart
Multithreading has become a dominant paradigm in general purpose MIMD parallel computation. To execute a multithreaded computation on a parallel computer, a scheduler must order and allocate threads to run on the individual processors. The scheduling algorithm dramatically affects both the speedup attained and the space used when executing the computation. We consider the problem of scheduling multithreaded computations to achieve linear speedup without using significantly more space-per-processor than required for a single-processor execution. We show that for general multithreaded computations, no scheduling algorithm can simultaneously make efficient use of space and time. In particular, we show that there exist multithreaded computations such that any execution schedule X that achieves P-processor execution time T P (X ) T 1 =ae, where T 1 is the minimum possible serial execution time, must use space at least S P (X ) 1 4 (ae \Gamma 1) p T 1 + S 1 , where S 1 is the space use...
Tight Bounds For On-Line Tree Embeddings
- In Proceedings of the 2nd ACM-SIAM Symposium On Discrete Algorithms
, 1991
"... . Tree-structured computations are relatively easy to process in parallel. As leaf processes are recursively spawned they can be assigned to independent processors in a multicomputer network. However, to achieve good performance the on-line mapping algorithm must maintain load balance, i.e., distrib ..."
Abstract
-
Cited by 8 (0 self)
- Add to MetaCart
. Tree-structured computations are relatively easy to process in parallel. As leaf processes are recursively spawned they can be assigned to independent processors in a multicomputer network. However, to achieve good performance the on-line mapping algorithm must maintain load balance, i.e., distribute processes equitably among processors. Additionally, the algorithm itself must be distributed in nature, and process allocation must be completed via message-passing with minimal communication overhead. This paper investigates bounds on the performance of deterministic and randomized algorithms for on-line tree embeddings. In particular, we study trade-o#s between computation overhead (load imbalance) and communication overhead (message congestion). We give a simple technique to derive lower bounds on the congestion that any on-line allocation algorithm must incur in order to guarantee load balance. This technique works for both randomized and deterministic algorithms. We prove that the a...
Efficient Parallel Divide-and-Conquer for a Class of Interconnection Topologies.
- In Proceedings of the 2nd International Symposium on Algorithms, number 557 in Lecture Notes in Computer Science
, 1991
"... : In this paper, we propose an efficient scheduling algorithm for expanding any divide-andconquer (D&C) computation tree on k-dimensional mesh, hypercube, and perfect shuffle networks with p processors. Assume that it takes t n time steps to expand one node of the tree and t c time steps to transmit ..."
Abstract
-
Cited by 7 (1 self)
- Add to MetaCart
: In this paper, we propose an efficient scheduling algorithm for expanding any divide-andconquer (D&C) computation tree on k-dimensional mesh, hypercube, and perfect shuffle networks with p processors. Assume that it takes t n time steps to expand one node of the tree and t c time steps to transmit one datum or convey one node. For any D&C computation tree with N nodes, height h, and degree d (maximal number of children of any node), our algorithm requires at most (N=p + h)t n + 'dht c time steps, where ' is O(log 2 p) on a hypercube or perfect shuffle network and is O( k p p) on a n k\Gamma1 \Theta \Delta \Delta \Delta \Theta n 0 mesh network, where n k\Gamma1 = \Delta \Delta \Delta = n 0 = k p p. This algorithm is general in the sense that it does not know the values of N , h, and d, and the shape of the computation tree as well, a priori. Most importantly, we can easily obtain a linear speedup by nearly a factor of p, especially when N AE ph(1 + 'dt c =t n ). 1. Introduction ...
Algorithms for Combinatorial Optimization in Real Time and their Automated Refinement by Genetic Programming
- University of Illinois at Urbana-Champaign
, 1994
"... The goal of this research is to develop a systematic, integrated method of designing efficient search algorithms that solve optimization problems in real time. Search algorithms studied in this thesis comprise meta-control and primitive search. The class of optimization problems addressed are called ..."
Abstract
-
Cited by 7 (1 self)
- Add to MetaCart
The goal of this research is to develop a systematic, integrated method of designing efficient search algorithms that solve optimization problems in real time. Search algorithms studied in this thesis comprise meta-control and primitive search. The class of optimization problems addressed are called combinatorial optimization problems, examples of which include many NP-hard scheduling and planning problems, and problems in operations research and artificial-intelligence applications. The problems we have addressed have a well-defined problem objective and a finite set of well-defined problem constraints. In this research, we use state-space trees as problem representations. The approach we have undertaken in designing efficient search algorithms is an engineering approach and consists of two phases: (a) designing generic search algorithms, and (b) improving by genetics-based machine learning methods parametric heuristics used in the search algorithms designed. Our approach is a systematic method that integrates domain knowledge, search techniques, and automated learning techniques for designing better search algorithms. Knowledge captured in designing one search algorithm can be carried over for designing new ones. iv ACKNOWLEDGEMENTS I express my sincere gratitude to all the people who have helped me in the course of my graduate study. My thesis advisor, Professor Benjamin W. Wah, was always available for discussions and encouraged me to explore new ideas. I am deeply grateful to the committee
Efficient Scheduling of Strict Multithreaded Computations
, 1999
"... In this paper we study the problem of eciently scheduling a wide class of multithreaded computations, called strict; that is, computations in which all dependencies from a thread go to the thread's ancestors in the computation tree. Strict multithreaded computations allow the limited use of synch ..."
Abstract
-
Cited by 2 (1 self)
- Add to MetaCart
In this paper we study the problem of eciently scheduling a wide class of multithreaded computations, called strict; that is, computations in which all dependencies from a thread go to the thread's ancestors in the computation tree. Strict multithreaded computations allow the limited use of synchronization primitives. We present the rst fully distributed scheduling algorithm which applies to any strict multithreaded computation. The algorithm is asynchronous, on-line and follows the work-stealing paradigm. We prove that our algorithm is ecient not only in terms of its memory requirements and its execution time, but also in terms of its communication complexity. Our analysis applies to both shared and distributed memory machines. More specically, the expected execution time of our algorithm is O(T 1 =P +hT1 ), where T 1 is the minimum serial execution time, T1 is the minimum execution time with an innite number of processors, P is the number of processors and h is the maxi...
unknown title
"... Abstract This paper studies the data locality of the work-stealing schedulingalgorithm on hardware-controlled shared-memory machines. We present lower and upper bounds on the number of cache missesusing work stealing, and introduce a locality-guided work-stealing algorithm along with experimental va ..."
Abstract
- Add to MetaCart
Abstract This paper studies the data locality of the work-stealing schedulingalgorithm on hardware-controlled shared-memory machines. We present lower and upper bounds on the number of cache missesusing work stealing, and introduce a locality-guided work-stealing algorithm along with experimental validation.As a lower bound, we show that there is a family of multithreaded computations Gn each member of which requires \Theta (n)total operations (work), for which when using work-stealing the total number of cache misses on one processor is constant, while evenon two processors the total number of cache misses is \Omega (n). Fornested-parallel computations, however, we show that on P proces-sors the expected additional number of cache misses beyond those on a single processor is bounded by O(Cd ms e P T1), where m isthe execution time of an instruction incurring a cache miss, s is thesteal time, C is the size of cache, and T1 is the number of nodeson the longest chain of dependences. Based on this we give strong bounds on the total running time of nested-parallel computationsusing work stealing. For the second part of our results, we present a locality-guidedwork stealing algorithm that improves the data locality of multithreaded computations by allowing a thread to have an affinity fora processor. Our initial experiments on iterative data-parallel applications show that the algorithm matches the performance of static-partitioning under traditional work loads but improves the performance up to 50 % over static partitioning under multiprogrammedwork loads. Furthermore, the locality-guided work stealing improves the performance of work-stealing up to 80%.

