Results 1 -
5 of
5
Space-Efficient Scheduling of Parallelism with Synchronization Variables
"... Recent work on scheduling algorithms has resulted in provable bounds on the space taken by parallel computations in relation to the space taken by sequential computations. The results for online versions of these algorithms, however, have been limited to computations in which threads can only synchr ..."
Abstract
-
Cited by 28 (10 self)
- Add to MetaCart
Recent work on scheduling algorithms has resulted in provable bounds on the space taken by parallel computations in relation to the space taken by sequential computations. The results for online versions of these algorithms, however, have been limited to computations in which threads can only synchronize with ancestor or sibling threads. Such computations do not include languages with futures or user-specified synchronization constraints. Here we extend the results to languages with synchronization variables. Such languages include languages with futures, such as Multilisp and Cool, as well as other languages such asid. The main result is an online scheduling algorithm which, given a computation with w work (total operations), synchronizations, d depth (critical path) and s1 sequential space, will run in O(w=p + log(pd)=p + d log(pd)) time and s1 + O(pd log(pd)) space, on a p-processor crcw pram with a fetch-and-add primitive. This includes all time and space costs for both the computation and the scheduler. The scheduler is non-preemptive in the sense that it will only move a thread if the thread suspends on a synchronization, forks a new thread, or exceeds a threshold when allocating space. For the special case where the computation is a planar graph with left-to-right synchronization edges, the scheduling algorithm can be implemented in O(w=p+d log p) time and s1 + O(pd log p) space. These are the first nontrivial space bounds described for such languages.
A Provably Time-Efficient Parallel Implementation of Full Speculation
- In Proceedings of the 23rd ACM Symposium on Principles of Programming Languages
, 1996
"... Speculative evaluation, including leniency and futures, is often used to produce high degrees of parallelism. Existing speculative implementations, however, may serialize computation because of their implementation of queues of suspended threads. We give a provably efficient parallel implementation ..."
Abstract
-
Cited by 15 (4 self)
- Add to MetaCart
Speculative evaluation, including leniency and futures, is often used to produce high degrees of parallelism. Existing speculative implementations, however, may serialize computation because of their implementation of queues of suspended threads. We give a provably efficient parallel implementation of a speculative functional language on various machine models. The implementation includes proper parallelization of the necessary queuing operations on suspended threads. Our target machine models are a butterfly network, hypercube, and PRAM. To prove the efficiency of our implementation, we provide a cost model using a profiling semantics and relate the cost model to implementations on the parallel machine models. 1 Introduction Futures, lenient languages, and several implementations of graph reduction for lazy languages all use speculative evaluation (call-by-speculation [15]) to expose parallelism. The basic idea of speculative evaluation, in this context, is that the evaluation of a...
Scheduling Deterministic Parallel Programs
, 2009
"... are those of the author and should not be interpreted as representing the official policies, either expressed or implied, Deterministic parallel programs yield the same results regardless of how parallel tasks are interleaved or assigned to processors. This drastically simplifies reasoning about the ..."
Abstract
-
Cited by 1 (0 self)
- Add to MetaCart
are those of the author and should not be interpreted as representing the official policies, either expressed or implied, Deterministic parallel programs yield the same results regardless of how parallel tasks are interleaved or assigned to processors. This drastically simplifies reasoning about the correctness of these programs. However, the performance of parallel programs still depends upon this assignment of tasks, as determined by a part of the language implementation called the scheduling policy. In this thesis, I define a novel cost semantics for a parallel language that enables programmers to reason formally about different scheduling policies. This cost semantics forms a basis for a suite of prototype profiling tools. These tools allow programmers to simulate and visualize program execution under different scheduling policies and understand how the choice of policy affects application memory use. My cost semantics also provides a specification for implementations of the language. As an example of such an implementation, I have extended MLton, a compiler
unknown title
"... Abstract This paper studies the data locality of the work-stealing schedulingalgorithm on hardware-controlled shared-memory machines. We present lower and upper bounds on the number of cache missesusing work stealing, and introduce a locality-guided work-stealing algorithm along with experimental va ..."
Abstract
- Add to MetaCart
Abstract This paper studies the data locality of the work-stealing schedulingalgorithm on hardware-controlled shared-memory machines. We present lower and upper bounds on the number of cache missesusing work stealing, and introduce a locality-guided work-stealing algorithm along with experimental validation.As a lower bound, we show that there is a family of multithreaded computations Gn each member of which requires \Theta (n)total operations (work), for which when using work-stealing the total number of cache misses on one processor is constant, while evenon two processors the total number of cache misses is \Omega (n). Fornested-parallel computations, however, we show that on P proces-sors the expected additional number of cache misses beyond those on a single processor is bounded by O(Cd ms e P T1), where m isthe execution time of an instruction incurring a cache miss, s is thesteal time, C is the size of cache, and T1 is the number of nodeson the longest chain of dependences. Based on this we give strong bounds on the total running time of nested-parallel computationsusing work stealing. For the second part of our results, we present a locality-guidedwork stealing algorithm that improves the data locality of multithreaded computations by allowing a thread to have an affinity fora processor. Our initial experiments on iterative data-parallel applications show that the algorithm matches the performance of static-partitioning under traditional work loads but improves the performance up to 50 % over static partitioning under multiprogrammedwork loads. Furthermore, the locality-guided work stealing improves the performance of work-stealing up to 80%.
Beyond Nested Parallelism: Tight Bounds on Work-Stealing Overheads for Parallel Futures
"... Work stealing is a popular method of scheduling fine-grained parallel tasks. The performance of work stealing has been extensively studied, both theoretically and empirically, but primarily for the restricted class of nested-parallel (or fully strict) computations. We extend this prior work by consi ..."
Abstract
- Add to MetaCart
Work stealing is a popular method of scheduling fine-grained parallel tasks. The performance of work stealing has been extensively studied, both theoretically and empirically, but primarily for the restricted class of nested-parallel (or fully strict) computations. We extend this prior work by considering a broader class of programs that also supports pipelined parallelism through the use of parallel futures. Though the overhead of work-stealing schedulers is often quantified in terms of the number of steals, we show that a broader metric, the number of deviations, is a better way to quantify work-stealing overhead for less restrictive forms of parallelism, including parallel futures. For such parallelism, we prove bounds on work-stealing overheads—scheduler time and cache misses—as a function of the number of deviations. Deviations can occur, for example, when work is stolen or when a future is touched. We also show instances where deviations can occur independently of steals and touches. Next, we prove that, under work stealing, the expected number of deviations is O(P d+td) in a P-processor execution of a computation with span d and t touches of futures. Moreover, this bound is existentially tight for any work-stealing scheduler that is parsimonious (those where processors steal only when their queues are empty); this class includes all prior work-stealing schedulers. We also present empirical measurements of the number of deviations incurred by a classic application of futures, Halstead’s quicksort, using our parallel implementation of ML. Finally, we identify a family of applications that use futures and, in contrast to quicksort, incur significantly smaller overheads.

