Results 1 - 10
of
208
Scheduling Multithreaded Computations by Work Stealing
, 1994
"... This paper studies the problem of efficiently scheduling fully strict (i.e., well-structured) multithreaded computations on parallel computers. A popular and practical method of scheduling this kind of dynamic MIMD-style computation is “work stealing," in which processors needing work steal com ..."
Abstract
-
Cited by 568 (34 self)
- Add to MetaCart
This paper studies the problem of efficiently scheduling fully strict (i.e., well-structured) multithreaded computations on parallel computers. A popular and practical method of scheduling this kind of dynamic MIMD-style computation is “work stealing," in which processors needing work steal computational threads from other processors. In this paper, we give the first provably good work-stealing scheduler for multithreaded computations with dependencies. Specifically, our analysis shows that the ezpected time Tp to execute a fully strict computation on P processors using our work-stealing scheduler is Tp = O(TI/P + Tm), where TI is the minimum serial eze-cution time of the multithreaded computation and T, is the minimum ezecution time with an infinite number of processors. Moreover, the space Sp required by the execution satisfies Sp 5 SIP. We also show that the ezpected total communication of the algorithm is at most O(TmS,,,P), where S, is the site of the largest activation record of any thread, thereby justify-ing the folk wisdom that work-stealing schedulers are more communication eficient than their work-sharing counterparts. All three of these bounds are existentially optimal to within a constant factor.
The Implementation of the Cilk-5 Multithreaded Language
, 1998
"... The fifth release of the multithreaded language Cilk uses a provably good "work-stealing " scheduling algorithm similar to the rst system, but the language has been completely redesigned and the runtime system completely reengineered. The efficiency of the new implementation was aided ..."
Abstract
-
Cited by 489 (28 self)
- Add to MetaCart
(Show Context)
The fifth release of the multithreaded language Cilk uses a provably good "work-stealing " scheduling algorithm similar to the rst system, but the language has been completely redesigned and the runtime system completely reengineered. The efficiency of the new implementation was aided by a clear strategy that arose from a theoretical analysis of the scheduling algorithm: concentrate on minimizing overheads that contribute to the work, even at the expense of overheads that contribute to the critical path. Although it may seem counterintuitive to move overheads onto the critical path, this "work-first" principle has led to a portable Cilk-5 implementation in which the typical cost of spawning a parallel thread is only between 2 and 6 times the cost of a C function call on a variety of contemporary machines. Many Cilk programs run on one processor with virtually no degradation compared to equivalent C programs. This paper describes how the work-first principle was exploited in the design of Cilk-5's compiler and its runtime system. In particular, we present Cilk-5's novel "two-clone" compilation strategy and its Dijkstra-like mutual-exclusion protocol for implementing the ready deque in the work-stealing scheduler.
Obstruction-free synchronization: Double-ended queues as an example
- In preparation
, 2003
"... We introduce obstruction-freedom, a new nonblocking property for shared data structure implementations. This property is strong enough to avoid the problems associated with locks, but it is weaker than previous nonblocking properties—specifically lock-freedom and wait-freedom— allowing greater flexi ..."
Abstract
-
Cited by 211 (18 self)
- Add to MetaCart
(Show Context)
We introduce obstruction-freedom, a new nonblocking property for shared data structure implementations. This property is strong enough to avoid the problems associated with locks, but it is weaker than previous nonblocking properties—specifically lock-freedom and wait-freedom— allowing greater flexibility in the design of efficient implementations. Obstruction-freedom admits substantially simpler implementations, and we believe that in practice it provides the benefits of wait-free and lock-free implementations. To illustrate the benefits of obstruction-freedom, we present two obstruction-free CAS-based implementations of double-ended queues (deques); the first is implemented on a linear array, the second on a circular array. To our knowledge, all previous nonblocking deque implementations are based on unrealistic assumptions about hardware support for synchronization, have restricted functionality, or have operations that interfere with operations at the opposite end of the deque even when the deque has many elements in it. Our obstruction-free implementations have none of these drawbacks, and thus suggest that it is much easier to design obstruction-free implementations than lock-free and waitfree ones. We also briefly discuss other obstruction-free data structures and operations that we have implemented. 1.
The Power of Two Random Choices: A Survey of Techniques and Results
- in Handbook of Randomized Computing
, 2000
"... ITo motivate this survey, we begin with a simple problem that demonstrates a powerful fundamental idea. Suppose that n balls are thrown into n bins, with each ball choosing a bin independently and uniformly at random. Then the maximum load, or the largest number of balls in any bin, is approximately ..."
Abstract
-
Cited by 137 (6 self)
- Add to MetaCart
ITo motivate this survey, we begin with a simple problem that demonstrates a powerful fundamental idea. Suppose that n balls are thrown into n bins, with each ball choosing a bin independently and uniformly at random. Then the maximum load, or the largest number of balls in any bin, is approximately log n= log log n with high probability. Now suppose instead that the balls are placed sequentially, and each ball is placed in the least loaded of d 2 bins chosen independently and uniformly at random. Azar, Broder, Karlin, and Upfal showed that in this case, the maximum load is log log n= log d + (1) with high probability [ABKU99]. The important implication of this result is that even a small amount of choice can lead to drastically different results in load balancing. Indeed, having just two random choices (i.e.,...
A Java Fork/Join Framework
, 2000
"... This paper describes the design, implementation, and performance of a Java framework for supporting a style of parallel programming in which problems are solved by (recursively) splitting them into subtasks that are solved in parallel, waiting for them to complete, and then composing results. The ge ..."
Abstract
-
Cited by 117 (0 self)
- Add to MetaCart
This paper describes the design, implementation, and performance of a Java framework for supporting a style of parallel programming in which problems are solved by (recursively) splitting them into subtasks that are solved in parallel, waiting for them to complete, and then composing results. The general design is a variant of the work-stealing framework devised for Cilk. The main implementation techniques surround efficient construction and management of tasks queues and worker threads. The measured performance shows good parallel speedups for most programs, but also suggests possible improvements. 1. INTRODUCTION Fork/Join parallelism is among the simplest and most effective design techniques for obtaining good parallel performance. Fork/join algorithms are parallel versions of familiar divideand -conquer algorithms, taking the typical form: Result solve(Problem problem) { if (problem is small) directly solve problem else { split problem into independent parts fork new subtas...
The data locality of work stealing
- SPAA
, 2000
"... This paper studies the data locality of the work-stealing scheduling algorithm on hardware-controlled shared-memory machines. We present lower and upper bounds on the number of cache misses using work stealing, and introduce a locality-guided work-stealing algorithm along with experimental validatio ..."
Abstract
-
Cited by 112 (18 self)
- Add to MetaCart
(Show Context)
This paper studies the data locality of the work-stealing scheduling algorithm on hardware-controlled shared-memory machines. We present lower and upper bounds on the number of cache misses using work stealing, and introduce a locality-guided work-stealing algorithm along with experimental validation. As a lower bound, we show that there is a family of multithreaded computations Gn each member of which requires (n) total instructions (work), for which when using work-stealing the number of cache misses on one processor is constant, while even on two processors the total number of cache misses is (n). This implies that for general computations there is no useful bound relating multiprocessor to uninprocessor cache misses. For nested-parallel computations, however, we show that on P processors the expected additional number of cache misses beyond those on a single proces-sor is bounded by O(Cd m e PT1), where m is the execution time s of an instruction incurring a cache miss, s is the steal time, C is the size of cache, and T1 is the number of nodes on the longest chain of dependences. Based on this we give strong bounds on the total running time of nested-parallel computations using work stealing. For the second part of our results, we present a locality-guided work stealing algorithm that improves the data locality of multithreaded computations by allowing a thread to have an affinity for a processor. Our initial experiments on iterative data-parallel applications show that the algorithm matches the performance of static partitioning under traditional work loads but improves the performance up to 50 % over static partitioning under multiprogrammed work loads. Furthermore, the locality-guided work stealing improves the performance of work-stealing up to 80%.
A Pragmatic Implementation of Non-Blocking Linked-Lists
- Lecture Notes in Computer Science
, 2001
"... We present a new non-blocking implementation of concurrent linked-lists supporting linearizable insertion and deletion operations. ..."
Abstract
-
Cited by 96 (1 self)
- Add to MetaCart
We present a new non-blocking implementation of concurrent linked-lists supporting linearizable insertion and deletion operations.
Garbage-First Garbage Collection
, 2004
"... Garbage-First is a server-style garbage collector, targeted for multi-processors with large memories, that meets a soft real-time goal with high probability, while achieving high throughput. Whole-heap operations, such as global marking, are performed concurrently with mutation, to prevent interrupt ..."
Abstract
-
Cited by 73 (4 self)
- Add to MetaCart
Garbage-First is a server-style garbage collector, targeted for multi-processors with large memories, that meets a soft real-time goal with high probability, while achieving high throughput. Whole-heap operations, such as global marking, are performed concurrently with mutation, to prevent interruptions proportional to heap or live-data size. Concurrent marking both provides collection ”completeness ” and identifies regions ripe for reclamation via compacting evacuation. This evacuation is performed in parallel on multiprocessors, to increase throughput.
The Design of a Task Parallel Library
, 2008
"... The Task Parallel Library (TPL) is a library for.NET that makes it easy to expose potential parallelism in a program. The library can be seen as an embedded domain specific language, and relies heavily on generics and delegate expressions to provide a convenient interface with custom control structu ..."
Abstract
-
Cited by 57 (3 self)
- Add to MetaCart
(Show Context)
The Task Parallel Library (TPL) is a library for.NET that makes it easy to expose potential parallelism in a program. The library can be seen as an embedded domain specific language, and relies heavily on generics and delegate expressions to provide a convenient interface with custom control structures for parallelism. In this article, we describe the design and implementation of the library. In particular, we show the use of ‘replicable tasks ’ as an abstraction for implementing parallel iteration and aggregation, and the use of ‘duplicating queues ’ as an alternative to the regular task queues based on the THE protocol. 1.