Results 1-10 of 55
The data locality of work stealing
 Theory of Computing Systems
, 2000
"... This paper studies the data locality of the work-stealing scheduling algorithm on hardware-controlled shared-memory machines. We present lower and upper bounds on the number of cache misses using work stealing, and introduce a locality-guided work-stealing algorithm along with experimental validatio ..."
Abstract

Cited by 113 (17 self)
This paper studies the data locality of the work-stealing scheduling algorithm on hardware-controlled shared-memory machines. We present lower and upper bounds on the number of cache misses using work stealing, and introduce a locality-guided work-stealing algorithm along with experimental validation. As a lower bound, we show that there is a family of multithreaded computations Gn, each member of which requires Θ(n) total instructions (work), for which when using work stealing the number of cache misses on one processor is constant, while even on two processors the total number of cache misses is Ω(n). This implies that for general computations there is no useful bound relating multiprocessor to uniprocessor cache misses. For nested-parallel computations, however, we show that on P processors the expected additional number of cache misses beyond those on a single processor is bounded by O(C⌈m/s⌉PT∞), where m is the execution time of an instruction incurring a cache miss, s is the steal time, C is the size of the cache, and T∞ is the number of nodes on the longest chain of dependences. Based on this we give strong bounds on the total running time of nested-parallel computations using work stealing. For the second part of our results, we present a locality-guided work-stealing algorithm that improves the data locality of multithreaded computations by allowing a thread to have an affinity for a processor. Our initial experiments on iterative data-parallel applications show that the algorithm matches the performance of static partitioning under traditional workloads but improves performance by up to 50% over static partitioning under multiprogrammed workloads. Furthermore, locality-guided work stealing improves the performance of work stealing by up to 80%.
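The locality-guided mechanism described in this abstract is not spelled out in the excerpt; the following Python sketch only illustrates the general idea under stated assumptions: each processor owns a deque (the owner works LIFO from the bottom, thieves steal FIFO from the top), and a per-processor "mailbox" holds tasks with affinity for that processor, which the owner serves first. `Worker`, `schedule`, and the task-dictionary shape are hypothetical names for illustration, not the paper's API.

```python
from collections import deque
import random

class Worker:
    """One processor in a minimal work-stealing scheduler (illustrative only)."""
    def __init__(self, wid):
        self.id = wid
        self.deque = deque()    # owner pushes/pops at the bottom (right end)
        self.mailbox = deque()  # locality-guided: tasks with affinity for this worker

def schedule(workers, tasks):
    # Locality-guided placement: a task with an affinity is mailed to that
    # worker; tasks without affinity start on worker 0's deque for simplicity.
    for t in tasks:
        w = workers[t.get("affinity", 0) % len(workers)]
        (w.mailbox if "affinity" in t else w.deque).append(t)
    done = []
    while any(w.deque or w.mailbox for w in workers):
        for w in workers:
            if w.mailbox:       # prefer tasks presumed "hot" in this cache
                done.append((w.id, w.mailbox.popleft()["name"]))
            elif w.deque:       # otherwise work from own deque (LIFO end)
                done.append((w.id, w.deque.pop()["name"]))
            else:               # steal from the top (FIFO end) of a random victim
                victims = [v for v in workers if v.deque]
                if victims:
                    v = random.choice(victims)
                    done.append((w.id, v.deque.popleft()["name"]))
    return done
```

Running `schedule` over a few tasks shows affinity tasks executing on their preferred worker while the rest are balanced by stealing.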
Space-Efficient Scheduling of Multithreaded Computations
 SIAM Journal on Computing
, 1993
"... This paper considers the problem of scheduling dynamic parallel computations to achieve linear speedup without using significantly more space per processor than that required for a single-processor execution. Utilizing a new graph-theoretic model of multithreaded computation, execution efficiency ..."
Abstract

Cited by 113 (19 self)
This paper considers the problem of scheduling dynamic parallel computations to achieve linear speedup without using significantly more space per processor than that required for a single-processor execution. Utilizing a new graph-theoretic model of multithreaded computation, execution efficiency is quantified by three important measures: T1 is the time required for executing the computation on 1 processor, T∞ is the time required by an infinite number of processors, and S1 is the space required to execute the computation on 1 processor. A computation executed on P processors is time-efficient if the time is O(T1/P + T∞), that is, it achieves linear speedup when P = O(T1/T∞), and it is space-efficient if it uses O(S1·P) total space, that is, the space per processor is within a constant factor of that required for a 1-processor execution. The first result derived from this model shows that there exist multithreaded computations such that no execution schedule can simultan...
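The time-efficiency condition in this abstract can be made concrete with a small numeric check. `greedy_time_bound` is a hypothetical helper name, and the constants hidden by the O-notation are ignored; this is a sketch of the arithmetic, not the paper's analysis.

```python
def greedy_time_bound(t1, t_inf, p):
    """Upper bound of the form T1/P + Tinf on P-processor execution time,
    matching the abstract's time-efficiency condition O(T1/P + Tinf)."""
    return t1 / p + t_inf

# Linear speedup holds while P = O(T1/Tinf): with T1 = 1e6 and Tinf = 1e3,
# the T1/P term dominates up to roughly 1000 processors.
bound_small = greedy_time_bound(1e6, 1e3, 10)      # T1/P term dominates
bound_large = greedy_time_bound(1e6, 1e3, 10000)   # Tinf term dominates
```

With 10 processors the bound is 101000 steps (near-linear speedup); with 10000 processors it is 1100 steps, dominated by the critical path.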
Parallel Programming using Functional Languages
, 1991
"... I am greatly indebted to Simon Peyton Jones, my supervisor, for his encouragement and technical assistance. His overwhelming enthusiasm was of great support to me. I particularly want to thank Simon and Geoff Burn for commenting on earlier drafts of this thesis. Through his excellent lecturing Colin ..."
Abstract

Cited by 54 (3 self)
I am greatly indebted to Simon Peyton Jones, my supervisor, for his encouragement and technical assistance. His overwhelming enthusiasm was of great support to me. I particularly want to thank Simon and Geoff Burn for commenting on earlier drafts of this thesis. Through his excellent lecturing, Colin Runciman initiated my interest in functional programming. I am grateful to Phil Trinder for his simulator, on which mine is based, and Will Partain for his help with LaTeX and graphs. I would like to thank the Science and Engineering Research Council of Great Britain for their financial support. Finally, I would like to thank Michelle, whose culinary skills supported me whilst I was writing up. The Imagination the only nation worth defending a nation without alienation a nation whose flag is invisible and whose borders are forever beyond the horizon a nation whose motto is why have one or the other when you can have one the other and both
Tight Analyses of Two Local Load Balancing Algorithms
 SIAM J. COMPUT.
, 1999
"... This paper presents an analysis of the following load balancing algorithm. At each step, each node in a network examines the number of tokens at each of its neighbors and sends a token to each neighbor with at least 2d + 1 fewer tokens, where d is the maximum degree of any node in the network. We ..."
Abstract

Cited by 53 (5 self)
This paper presents an analysis of the following load balancing algorithm. At each step, each node in a network examines the number of tokens at each of its neighbors and sends a token to each neighbor with at least 2d + 1 fewer tokens, where d is the maximum degree of any node in the network. We show that within O(∆/α) steps, the algorithm reduces the maximum difference in tokens between any two nodes to at most O((d² log n)/α), where ∆ is the global imbalance in tokens (i.e., the maximum difference between the number of tokens at any node initially and the average number of tokens), n is the number of nodes in the network, and α is the edge expansion of the network. The time bound is tight in the sense that for any graph with edge expansion α, and for any value ∆, there exists an initial distribution of tokens with imbalance ∆ for which the time to reduce the imbalance to even ∆/2 is at least Ω(∆/α). The bound on the final imbalance is tight in the sense that there exists a class of networks that can be locally balanced everywhere (i.e., the maximum difference in tokens between any two neighbors is at most 2d), while the global imbalance remains Ω((d² log n)/α). Furthermore, we show that upon reaching a state with a global imbalance of O((d² log n)/α), the time for this algorithm to locally balance the network can be as large as Ω(n^(1/2)). We extend our analysis to a variant of this algorithm for dynamic and asynchronous
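The balancing rule in this abstract is concrete enough to sketch. The following Python function performs one synchronous step of the stated rule: a token moves from u to each neighbor v holding at least 2d + 1 fewer tokens. The dictionary-based graph representation is an assumption for illustration.

```python
def balance_step(tokens, adjacency, d):
    """One synchronous step of the local balancing rule: each node u sends
    one token to every neighbor v with tokens[u] - tokens[v] >= 2d + 1,
    where d is the maximum degree of any node in the network.
    adjacency lists each node's neighbors (both directions for undirected
    graphs)."""
    sent = {u: 0 for u in tokens}
    recv = {u: 0 for u in tokens}
    for u, nbrs in adjacency.items():
        for v in nbrs:
            if tokens[u] - tokens[v] >= 2 * d + 1:
                sent[u] += 1
                recv[v] += 1
    return {u: tokens[u] - sent[u] + recv[u] for u in tokens}
```

Iterating `balance_step` until the distribution stops changing leaves every pair of neighbors within 2d tokens of each other, matching the abstract's notion of a locally balanced network.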
Adaptive Work Stealing with Parallelism Feedback
"... We present an adaptive work-stealing thread scheduler, A-STEAL, for fork-join multithreaded jobs, like those written using the Cilk multithreaded language or the Hood work-stealing library. The A-STEAL algorithm is appropriate for large parallel servers where many jobs share a common multipr ..."
Abstract

Cited by 37 (8 self)
We present an adaptive work-stealing thread scheduler, A-STEAL, for fork-join multithreaded jobs, like those written using the Cilk multithreaded language or the Hood work-stealing library. The A-STEAL algorithm is appropriate for large parallel servers where many jobs share a common multiprocessor resource and in which the number of processors available to a particular job may vary during the job's execution. A-STEAL provides continual parallelism feedback to a job scheduler in the form of processor requests, and the job must adapt its execution to the processors allotted to it. Assuming that the job scheduler never allots any job more processors than requested by the job's thread scheduler, A-STEAL guarantees that the job completes in near-optimal time while utilizing at least a constant fraction of the allotted processors. Our analysis models the job scheduler as the thread scheduler's adversary, challenging the thread scheduler to be robust to the system environment and the job scheduler's administrative policies. We analyze the performance of A-STEAL using "trim analysis," which allows us to prove that our thread scheduler performs poorly on at most a small number of time steps, while exhibiting near-optimal behavior on the vast majority. To be precise, suppose that a job has work T1 and critical-path length T∞. On a machine with P processors, A-STEAL completes the job in expected O(T1/P̃ + T∞ + L lg P) time steps, where L is the length of a scheduling quantum and P̃ denotes the O(T∞ + L lg P)-trimmed availability. This quantity is the average of the processor availability over all but
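The "trimmed availability" that this abstract's bound depends on can be sketched as a trimmed mean. `trimmed_availability` is a hypothetical helper name, and the assumption (consistent with trim analysis as described above) is that the r time steps with the highest availability are discarded before averaging.

```python
def trimmed_availability(avail, r):
    """Average processor availability over all but the r time steps with the
    highest availability -- a sketch of the r-trimmed availability used in
    trim analysis. avail is a list of per-step processor availabilities."""
    kept = sorted(avail)[: max(len(avail) - r, 0)]
    return sum(kept) / len(kept) if kept else 0.0
```

For example, one step of unusually high availability among many steps of availability 4 does not inflate the 1-trimmed average, which matches the intent of charging the scheduler only for the steps it can reasonably exploit.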
An introduction to randomized algorithms
 Discrete Appl Math
, 1991
"... Research conducted over the past fifteen years has amply demonstrated the advantages of algorithms that make random choices in the course of their execution. This paper presents a wide variety of examples intended to illustrate the range of applications of randomized algorithms, and the general prin ..."
Abstract

Cited by 34 (0 self)
Research conducted over the past fifteen years has amply demonstrated the advantages of algorithms that make random choices in the course of their execution. This paper presents a wide variety of examples intended to illustrate the range of applications of randomized algorithms, and the general principles and approaches that are of greatest use in their construction. The examples are drawn from many areas, including number theory, algebra, graph theory, pattern matching, selection, sorting, searching, computational geometry, combinatorial enumeration, and parallel and distributed computation.
1. Foreword
This paper is derived from a series of three lectures on randomized algorithms presented by the author at a conference on combinatorial mathematics and algorithms held at George Washington University in May, 1989. The purpose of the paper is to convey, through carefully selected examples, an understanding of the nature of randomized algorithms, the range of their applications and the principles underlying their construction. It is not our goal to be encyclopedic, and thus the paper should not be regarded as a comprehensive survey of the subject. This paper would not have come into existence without the magnificent efforts of Professor Rodica Simion, the organizer of the conference at George Washington University. Working from the tape-recorded lectures, she created a splendid transcript that served as the first draft of the paper. Were it not for her own reluctance she would be listed as my co-author.
Randomized load balancing for tree-structured computation
 In Proceedings of the Scalable High Performance Computing Conference
, 1994
"... ..."
Scheduling Threads for Low Space Requirement and Good Locality
 In Proceedings of the Eleventh Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA
, 1999
"... The running time and memory requirement of a parallel program with dynamic, lightweight threads depend heavily on the underlying thread scheduler. In this paper, we present a simple, asynchronous, space-efficient scheduling algorithm for shared-memory machines that combines the low scheduling overh ..."
Abstract

Cited by 26 (1 self)
The running time and memory requirement of a parallel program with dynamic, lightweight threads depend heavily on the underlying thread scheduler. In this paper, we present a simple, asynchronous, space-efficient scheduling algorithm for shared-memory machines that combines the low scheduling overheads and good locality of work stealing with the low space requirements of depth-first schedulers. For a nested-parallel program with depth D and serial space requirement S1, we show that the expected space requirement is S1 + O(K·p·D) on p processors. Here, K is a user-adjustable runtime parameter, which provides a tradeoff between running time and space requirement. Our algorithm achieves good locality and low scheduling overheads by automatically increasing the granularity of the work scheduled on each processor. We have implemented the new scheduling algorithm in the context of a native, user-level implementation of POSIX standard threads, or Pthreads, and evaluated its p...
An atomic model for message-passing
 In Proceedings of the Fifth Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA
, 1993
"... ..."
Automated Parallelization of Discrete State-space Generation
 Journal of Parallel and Distributed Computing
, 1997
"... We consider the problem of generating a large state-space in a distributed fashion. Unlike previously proposed solutions that partition the set of reachable states according to a hashing function provided by the user, we explore heuristic methods that completely automate the process. The first step ..."
Abstract

Cited by 25 (3 self)
We consider the problem of generating a large state-space in a distributed fashion. Unlike previously proposed solutions that partition the set of reachable states according to a hashing function provided by the user, we explore heuristic methods that completely automate the process. The first step is an initial random walk through the state space to initialize a search tree, duplicated in each processor. Then, the reachability graph is built in a distributed way, using the search tree to assign each newly found state to classes assigned to the available processors. Furthermore, we explore two remapping criteria that attempt to balance memory usage or future workload, respectively. We show how the cost of computing the global snapshot required for remapping will scale up for system sizes in the foreseeable future. An extensive set of results is presented to support our conclusions that remapping is extremely beneficial.
1 Introduction
Discrete systems are frequently analyzed by genera...