Scheduling Multithreaded Computations by Work Stealing
Cited by 398 (38 self)
This paper studies the problem of efficiently scheduling fully strict (i.e., well-structured) multithreaded computations on parallel computers. A popular and practical method of scheduling this kind of dynamic MIMD-style computation is "work stealing," in which processors needing work steal computational threads from other processors. In this paper, we give the first provably good work-stealing scheduler for multithreaded computations with dependencies. Specifically,
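The basic idea in the abstract above, idle processors taking work from busy ones, can be illustrated with a toy simulation. This is a minimal sketch under stated assumptions, not the scheduler analyzed in the paper: the `Worker` class, the `run` function, the lock-step rounds, and the convention that owners pop the newest task while thieves take the oldest are all illustrative choices that do not come from the abstract.

```python
import random
from collections import deque


class Worker:
    """One simulated processor with its own double-ended queue of ready tasks."""

    def __init__(self, wid):
        self.wid = wid
        self.tasks = deque()
        self.executed = 0  # tasks this worker has run
        self.steals = 0    # successful steals by this worker


def run(num_workers=4, root_size=64, seed=0):
    """Simulate work stealing in lock-step rounds (no real threads).

    A task is just an integer size: size 1 is a leaf, and a larger task
    splits into two halves when executed. Both are hypothetical stand-ins
    for real computational threads.
    """
    rng = random.Random(seed)
    workers = [Worker(i) for i in range(num_workers)]
    workers[0].tasks.append(root_size)
    leaves_done = 0
    while leaves_done < root_size:
        for w in workers:
            if w.tasks:
                size = w.tasks.pop()  # owner takes its newest task
                w.executed += 1
                if size == 1:
                    leaves_done += 1
                else:
                    half = size // 2
                    w.tasks.append(size - half)
                    w.tasks.append(half)
            else:
                # An idle worker picks a random victim and tries to steal
                # that victim's oldest task.
                victim = rng.choice(workers)
                if victim is not w and victim.tasks:
                    w.tasks.append(victim.tasks.popleft())
                    w.steals += 1
    return workers
```

Calling `run()` returns the workers, so the per-worker `executed` and `steals` counters show how the work spread out from the single worker that started with the root task.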
The Implementation of the Cilk-5 Multithreaded Language
In Proceedings of the SIGPLAN '98 Conference on Programming Language Design and Implementation, 1998
Cited by 320 (25 self)
The fifth release of the multithreaded language Cilk uses a provably good "work-stealing" scheduling algorithm similar to the first system, but the language has been completely redesigned and the runtime system completely reengineered. The efficiency of the new implementation was aided by a clear strategy that arose from a theoretical analysis of the scheduling algorithm: concentrate on minimizing overheads that contribute to the work, even at the expense of overheads that contribute to the critical path. Although it may seem counterintuitive to move overheads onto the critical path, this "work-first" principle has led to a portable Cilk-5 implementation in which the typical cost of spawning a parallel thread is only between 2 and 6 times the cost of a C function call on a variety of contemporary machines. Many Cilk programs run on one processor with virtually no degradation compared to equivalent C programs. This paper describes how the work-first principle was exploited in the design...
Obstruction-free synchronization: Double-ended queues as an example
In preparation, 2003
Cited by 167 (17 self)
We introduce obstruction-freedom, a new nonblocking property for shared data structure implementations. This property is strong enough to avoid the problems associated with locks, but it is weaker than previous nonblocking properties (specifically lock-freedom and wait-freedom), allowing greater flexibility in the design of efficient implementations. Obstruction-freedom admits substantially simpler implementations, and we believe that in practice it provides the benefits of wait-free and lock-free implementations. To illustrate the benefits of obstruction-freedom, we present two obstruction-free CAS-based implementations of double-ended queues (deques); the first is implemented on a linear array, the second on a circular array. To our knowledge, all previous nonblocking deque implementations are based on unrealistic assumptions about hardware support for synchronization, have restricted functionality, or have operations that interfere with operations at the opposite end of the deque even when the deque has many elements in it. Our obstruction-free implementations have none of these drawbacks, and thus suggest that it is much easier to design obstruction-free implementations than lock-free and wait-free ones. We also briefly discuss other obstruction-free data structures and operations that we have implemented.
The Power of Two Random Choices: A Survey of Techniques and Results
In Handbook of Randomized Computing, 2000
Cited by 99 (2 self)
To motivate this survey, we begin with a simple problem that demonstrates a powerful fundamental idea. Suppose that n balls are thrown into n bins, with each ball choosing a bin independently and uniformly at random. Then the maximum load, or the largest number of balls in any bin, is approximately $\log n / \log\log n$ with high probability. Now suppose instead that the balls are placed sequentially, and each ball is placed in the least loaded of $d \ge 2$ bins chosen independently and uniformly at random. Azar, Broder, Karlin, and Upfal showed that in this case, the maximum load is $\log\log n / \log d + \Theta(1)$ with high probability [ABKU99]. The important implication of this result is that even a small amount of choice can lead to drastically different results in load balancing. Indeed, having just two random choices (i.e.,...
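The balls-and-bins process described above is easy to simulate. The following is a minimal sketch, not code from the survey: the function name `max_load` and its parameters are hypothetical, and ties among candidate bins are broken by taking the first candidate drawn, which is an arbitrary choice.

```python
import random


def max_load(n, d, seed=0):
    """Throw n balls into n bins and return the largest bin load.

    Each ball draws d candidate bins uniformly at random (with
    replacement) and goes into the least loaded candidate. With d = 1
    this is the plain single-choice process.
    """
    rng = random.Random(seed)
    bins = [0] * n
    for _ in range(n):
        candidates = [rng.randrange(n) for _ in range(d)]
        target = min(candidates, key=lambda i: bins[i])
        bins[target] += 1
    return max(bins)
```

Comparing `max_load(n, 1)` with `max_load(n, 2)` for a moderately large `n` (say 10000) gives a feel for the gap between the single-choice and two-choice bounds quoted in the abstract.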
The data locality of work stealing
Theory of Computing Systems, 2000
Cited by 68 (14 self)
This paper studies the data locality of the work-stealing scheduling algorithm on hardware-controlled shared-memory machines. We present lower and upper bounds on the number of cache misses using work stealing, and introduce a locality-guided work-stealing algorithm along with experimental validation. As a lower bound, we show that there is a family of multithreaded computations $G_n$, each member of which requires $\Theta(n)$ total instructions (work), for which when using work stealing the number of cache misses on one processor is constant, while even on two processors the total number of cache misses is $\Omega(n)$. This implies that for general computations there is no useful bound relating multiprocessor to uniprocessor cache misses. For nested-parallel computations, however, we show that on P processors the expected additional number of cache misses beyond those on a single processor is bounded by $O(C \lceil m/s \rceil P T_\infty)$, where m is the execution time of an instruction incurring a cache miss, s is the steal time, C is the size of cache, and $T_\infty$ is the number of nodes on the longest chain of dependences. Based on this we give strong bounds on the total running time of nested-parallel computations using work stealing. For the second part of our results, we present a locality-guided work-stealing algorithm that improves the data locality of multithreaded computations by allowing a thread to have an affinity for a processor. Our initial experiments on iterative data-parallel applications show that the algorithm matches the performance of static partitioning under traditional workloads but improves the performance up to 50% over static partitioning under multiprogrammed workloads. Furthermore, locality-guided work stealing improves the performance of work stealing up to 80%.
A Java Fork/Join Framework
2000
Cited by 68 (1 self)
This paper describes the design, implementation, and performance of a Java framework for supporting a style of parallel programming in which problems are solved by (recursively) splitting them into subtasks that are solved in parallel, waiting for them to complete, and then composing results. The general design is a variant of the work-stealing framework devised for Cilk. The main implementation techniques surround efficient construction and management of tasks queues and worker threads. The measured performance shows good parallel speedups for most programs, but also suggests possible improvements.
1. INTRODUCTION
Fork/Join parallelism is among the simplest and most effective design techniques for obtaining good parallel performance. Fork/join algorithms are parallel versions of familiar divide-and-conquer algorithms, taking the typical form:
Result solve(Problem problem) {
  if (problem is small)
    directly solve problem
  else {
    split problem into independent parts
    fork new subtas...
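The split, fork, join, and compose pattern described above can be sketched in a few lines. This is an illustration only, written in Python rather than Java, and it is not the framework from the paper: it forks a plain thread per subtask instead of using work-stealing task queues, and the names `solve`, `THRESHOLD`, and `max_depth` are hypothetical. In CPython the global interpreter lock also means this shows the control structure, not a speedup.

```python
import threading

THRESHOLD = 1_000  # hypothetical cutoff below which a problem counts as "small"


def solve(data, lo, hi, depth=0, max_depth=3):
    """Sum data[lo:hi] in fork/join style."""
    # Small problem (or fork budget exhausted): solve directly.
    if hi - lo <= THRESHOLD or depth >= max_depth:
        return sum(data[lo:hi])

    # Split the problem into two independent halves.
    mid = (lo + hi) // 2
    result = {}

    def run_left():
        result["left"] = solve(data, lo, mid, depth + 1, max_depth)

    # Fork: the left half runs in a new thread.
    left_thread = threading.Thread(target=run_left)
    left_thread.start()

    # The current thread works on the right half in the meantime.
    right = solve(data, mid, hi, depth + 1, max_depth)

    # Join: wait for the forked subtask, then compose the results.
    left_thread.join()
    return result["left"] + right
```

For example, `solve(list(range(10_000)), 0, 10_000)` computes the same value as the built-in `sum` over that list. The `max_depth` limit keeps the number of threads small, which a pool-based framework would handle through its task queues instead.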
A Pragmatic Implementation of Non-Blocking Linked-Lists
Lecture Notes in Computer Science, 2001
Cited by 62 (1 self)
We present a new non-blocking implementation of concurrent linked-lists supporting linearizable insertion and deletion operations.
The Design of a Task Parallel Library
2008
Cited by 31 (2 self)
The Task Parallel Library (TPL) is a library for .NET that makes it easy to expose potential parallelism in a program. The library can be seen as an embedded domain-specific language, and relies heavily on generics and delegate expressions to provide a convenient interface with custom control structures for parallelism. In this article, we describe the design and implementation of the library. In particular, we show the use of 'replicable tasks' as an abstraction for implementing parallel iteration and aggregation, and the use of 'duplicating queues' as an alternative to the regular task queues based on the THE protocol.
Runtime Support for Multicore Haskell
Cited by 29 (6 self)
Purely functional programs should run well on parallel hardware because of the absence of side effects, but it has proved hard to realise this potential in practice. Plenty of papers describe promising ideas, but vastly fewer describe real implementations with good wall-clock performance. We describe just such an implementation, and quantitatively explore some of the complex design tradeoffs that make such implementations hard to build. Our measurements are necessarily detailed and specific, but they are reproducible, and we believe that they offer some general insights.