Results 1 
6 of
6
The data locality of work stealing
 Theory of Computing Systems
, 2000
"... This paper studies the data locality of the workstealing scheduling algorithm on hardwarecontrolled sharedmemory machines. We present lower and upper bounds on the number of cache misses using work stealing, and introduce a localityguided workstealing algorithm along with experimental validatio ..."
Abstract

Cited by 65 (12 self)
 Add to MetaCart
This paper studies the data locality of the workstealing scheduling algorithm on hardwarecontrolled sharedmemory machines. We present lower and upper bounds on the number of cache misses using work stealing, and introduce a localityguided workstealing algorithm along with experimental validation. As a lower bound, we show that there is a family of multithreaded computations Gn each member of which requires (n) total instructions (work), for which when using workstealing the number of cache misses on one processor is constant, while even on two processors the total number of cache misses is (n). This implies that for general computations there is no useful bound relating multiprocessor to uninprocessor cache misses. For nestedparallel computations, however, we show that on P processors the expected additional number of cache misses beyond those on a single processor is bounded by O(Cd m e PT1), where m is the execution time s of an instruction incurring a cache miss, s is the steal time, C is the size of cache, and T1 is the number of nodes on the longest chain of dependences. Based on this we give strong bounds on the total running time of nestedparallel computations using work stealing. For the second part of our results, we present a localityguided work stealing algorithm that improves the data locality of multithreaded computations by allowing a thread to have an affinity for a processor. Our initial experiments on iterative dataparallel applications show that the algorithm matches the performance of staticpartitioning under traditional work loads but improves the performance up to 50 % over static partitioning under multiprogrammed work loads. Furthermore, the localityguided work stealing improves the performance of workstealing up to 80%. 1
SpaceEfficient Scheduling of Parallelism with Synchronization Variables
"... Recent work on scheduling algorithms has resulted in provable bounds on the space taken by parallel computations in relation to the space taken by sequential computations. The results for online versions of these algorithms, however, have been limited to computations in which threads can only synchr ..."
Abstract

Cited by 27 (9 self)
 Add to MetaCart
Recent work on scheduling algorithms has resulted in provable bounds on the space taken by parallel computations in relation to the space taken by sequential computations. The results for online versions of these algorithms, however, have been limited to computations in which threads can only synchronize with ancestor or sibling threads. Such computations do not include languages with futures or userspecified synchronization constraints. Here we extend the results to languages with synchronization variables. Such languages include languages with futures, such as Multilisp and Cool, as well as other languages such asid. The main result is an online scheduling algorithm which, given a computation with w work (total operations), synchronizations, d depth (critical path) and s1 sequential space, will run in O(w=p + log(pd)=p + d log(pd)) time and s1 + O(pd log(pd)) space, on a pprocessor crcw pram with a fetchandadd primitive. This includes all time and space costs for both the computation and the scheduler. The scheduler is nonpreemptive in the sense that it will only move a thread if the thread suspends on a synchronization, forks a new thread, or exceeds a threshold when allocating space. For the special case where the computation is a planar graph with lefttoright synchronization edges, the scheduling algorithm can be implemented in O(w=p+d log p) time and s1 + O(pd log p) space. These are the first nontrivial space bounds described for such languages.
A Provably TimeEfficient Parallel Implementation of Full Speculation
 In Proceedings of the 23rd ACM Symposium on Principles of Programming Languages
, 1996
"... Speculative evaluation, including leniency and futures, is often used to produce high degrees of parallelism. Existing speculative implementations, however, may serialize computation because of their implementation of queues of suspended threads. We give a provably efficient parallel implementation ..."
Abstract

Cited by 17 (5 self)
 Add to MetaCart
Speculative evaluation, including leniency and futures, is often used to produce high degrees of parallelism. Existing speculative implementations, however, may serialize computation because of their implementation of queues of suspended threads. We give a provably efficient parallel implementation of a speculative functional language on various machine models. The implementation includes proper parallelization of the necessary queuing operations on suspended threads. Our target machine models are a butterfly network, hypercube, and PRAM. To prove the efficiency of our implementation, we provide a cost model using a profiling semantics and relate the cost model to implementations on the parallel machine models. 1 Introduction Futures, lenient languages, and several implementations of graph reduction for lazy languages all use speculative evaluation (callbyspeculation [15]) to expose parallelism. The basic idea of speculative evaluation, in this context, is that the evaluation of a...
Scheduling Deterministic Parallel Programs
, 2009
"... are those of the author and should not be interpreted as representing the official policies, either expressed or implied, Deterministic parallel programs yield the same results regardless of how parallel tasks are interleaved or assigned to processors. This drastically simplifies reasoning about the ..."
Abstract

Cited by 2 (0 self)
 Add to MetaCart
are those of the author and should not be interpreted as representing the official policies, either expressed or implied, Deterministic parallel programs yield the same results regardless of how parallel tasks are interleaved or assigned to processors. This drastically simplifies reasoning about the correctness of these programs. However, the performance of parallel programs still depends upon this assignment of tasks, as determined by a part of the language implementation called the scheduling policy. In this thesis, I define a novel cost semantics for a parallel language that enables programmers to reason formally about different scheduling policies. This cost semantics forms a basis for a suite of prototype profiling tools. These tools allow programmers to simulate and visualize program execution under different scheduling policies and understand how the choice of policy affects application memory use. My cost semantics also provides a specification for implementations of the language. As an example of such an implementation, I have extended MLton, a compiler
Beyond Nested Parallelism: Tight Bounds on WorkStealing Overheads for Parallel Futures
"... Work stealing is a popular method of scheduling finegrained parallel tasks. The performance of work stealing has been extensively studied, both theoretically and empirically, but primarily for the restricted class of nestedparallel (or fully strict) computations. We extend this prior work by consi ..."
Abstract

Cited by 1 (0 self)
 Add to MetaCart
Work stealing is a popular method of scheduling finegrained parallel tasks. The performance of work stealing has been extensively studied, both theoretically and empirically, but primarily for the restricted class of nestedparallel (or fully strict) computations. We extend this prior work by considering a broader class of programs that also supports pipelined parallelism through the use of parallel futures. Though the overhead of workstealing schedulers is often quantified in terms of the number of steals, we show that a broader metric, the number of deviations, is a better way to quantify workstealing overhead for less restrictive forms of parallelism, including parallel futures. For such parallelism, we prove bounds on workstealing overheads—scheduler time and cache misses—as a function of the number of deviations. Deviations can occur, for example, when work is stolen or when a future is touched. We also show instances where deviations can occur independently of steals and touches. Next, we prove that, under work stealing, the expected number of deviations is O(P d+td) in a Pprocessor execution of a computation with span d and t touches of futures. Moreover, this bound is existentially tight for any workstealing scheduler that is parsimonious (those where processors steal only when their queues are empty); this class includes all prior workstealing schedulers. We also present empirical measurements of the number of deviations incurred by a classic application of futures, Halstead’s quicksort, using our parallel implementation of ML. Finally, we identify a family of applications that use futures and, in contrast to quicksort, incur significantly smaller overheads.
unknown title
"... Abstract This paper studies the data locality of the workstealing schedulingalgorithm on hardwarecontrolled sharedmemory machines. We present lower and upper bounds on the number of cache missesusing work stealing, and introduce a localityguided workstealing algorithm along with experimental va ..."
Abstract
 Add to MetaCart
Abstract This paper studies the data locality of the workstealing schedulingalgorithm on hardwarecontrolled sharedmemory machines. We present lower and upper bounds on the number of cache missesusing work stealing, and introduce a localityguided workstealing algorithm along with experimental validation.As a lower bound, we show that there is a family of multithreaded computations Gn each member of which requires \Theta (n)total operations (work), for which when using workstealing the total number of cache misses on one processor is constant, while evenon two processors the total number of cache misses is \Omega (n). Fornestedparallel computations, however, we show that on P processors the expected additional number of cache misses beyond those on a single processor is bounded by O(Cd ms e P T1), where m isthe execution time of an instruction incurring a cache miss, s is thesteal time, C is the size of cache, and T1 is the number of nodeson the longest chain of dependences. Based on this we give strong bounds on the total running time of nestedparallel computationsusing work stealing. For the second part of our results, we present a localityguidedwork stealing algorithm that improves the data locality of multithreaded computations by allowing a thread to have an affinity fora processor. Our initial experiments on iterative dataparallel applications show that the algorithm matches the performance of staticpartitioning under traditional work loads but improves the performance up to 50 % over static partitioning under multiprogrammedwork loads. Furthermore, the localityguided work stealing improves the performance of workstealing up to 80%.