Cilk: An Efficient Multithreaded Runtime System
 JOURNAL OF PARALLEL AND DISTRIBUTED COMPUTING
, 1995
Cited by 534 (39 self)
Cilk (pronounced "silk") is a C-based runtime system for multithreaded parallel programming. In this paper, we document the efficiency of the Cilk work-stealing scheduler, both empirically and analytically. We show that on real and synthetic applications, the "work" and "critical-path length" of a Cilk computation can be used to model performance accurately. Consequently, a Cilk programmer can focus on reducing the computation's work and critical-path length, insulated from load balancing and other runtime scheduling issues. We also prove that for the class of "fully strict" (well-structured) programs, the Cilk scheduler achieves space, time, and communication bounds all within a constant factor of optimal. The Cilk
Scheduling Multithreaded Computations by Work Stealing
Cited by 398 (38 self)
This paper studies the problem of efficiently scheduling fully strict (i.e., well-structured) multithreaded computations on parallel computers. A popular and practical method of scheduling this kind of dynamic MIMD-style computation is "work stealing," in which processors needing work steal computational threads from other processors. In this paper, we give the first provably good work-stealing scheduler for multithreaded computations with dependencies. Specifically,
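The stealing rule described in this abstract can be illustrated with a small round-based simulation. This is only a sketch under stated assumptions, not the scheduler analyzed in the paper: the function name `simulate_work_stealing`, the Fibonacci-shaped task tree, and the sequential round structure are all hypothetical choices made for illustration. It assumes each worker keeps its own deque, works from the bottom of it, and, when idle, picks a random victim and takes the task at the top of the victim's deque.

```python
import random
from collections import deque

def simulate_work_stealing(n, num_workers, seed=0):
    """Simulate work stealing on a Fibonacci-shaped task tree rooted at n.

    A task k >= 2 spawns the tasks k-1 and k-2; tasks 0 and 1 are leaves.
    Returns (tasks executed per worker, number of successful steals).
    """
    rng = random.Random(seed)
    deques = [deque() for _ in range(num_workers)]
    deques[0].append(n)              # all work starts on worker 0
    executed = [0] * num_workers
    steals = 0

    def run(worker, task):
        # Executing a task may spawn children onto the worker's own deque.
        executed[worker] += 1
        if task >= 2:
            deques[worker].append(task - 1)
            deques[worker].append(task - 2)

    while any(deques):               # some deque still holds work
        for w in range(num_workers):
            if deques[w]:
                run(w, deques[w].pop())                  # owner: bottom of own deque
            else:
                victim = rng.randrange(num_workers)      # idle: pick a random victim
                if victim != w and deques[victim]:
                    steals += 1
                    run(w, deques[victim].popleft())     # thief: top of victim's deque
    return executed, steals

if __name__ == "__main__":
    per_worker, steals = simulate_work_stealing(20, 4)
    print("tasks per worker:", per_worker, "steals:", steals)
```

Stealing from the top means a thief tends to take the oldest, and in this task tree the largest, piece of outstanding work, which is one intuition for why few steals are needed to keep workers busy.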
The Power of Two Choices in Randomized Load Balancing
 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS
, 1996
Cited by 201 (23 self)
Suppose that $n$ balls are placed into $n$ bins, each ball being placed into a bin chosen independently and uniformly at random. Then, with high probability, the maximum load in any bin is approximately $\log n / \log\log n$. Suppose instead that each ball is placed sequentially into the least full of $d$ bins chosen independently and uniformly at random. It has recently been shown that the maximum load is then only $\log\log n / \log d + O(1)$ with high probability. Thus giving each ball two choices instead of just one leads to an exponential improvement in the maximum load. This result demonstrates the power of two choices, and it has several applications to load balancing in distributed systems. In this thesis, we expand upon this result by examining related models and by developing techniques for stu...
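The gap between one choice and two can be observed empirically with a short simulation. The following is a minimal sketch, assuming bins are sampled with replacement and ties go to the first sampled bin; the function name `max_load` and the seed parameter are hypothetical and not taken from the thesis.

```python
import random

def max_load(n, d, seed=0):
    """Place n balls into n bins, each into the least full of d random bins.

    Returns the maximum number of balls in any bin.
    """
    rng = random.Random(seed)
    bins = [0] * n
    for _ in range(n):
        # Sample d candidate bins independently and uniformly at random.
        candidates = [rng.randrange(n) for _ in range(d)]
        # Put the ball in the least loaded candidate (first one wins ties).
        target = min(candidates, key=lambda b: bins[b])
        bins[target] += 1
    return max(bins)

if __name__ == "__main__":
    n = 10_000
    for d in (1, 2, 3):
        print("d =", d, "max load =", max_load(n, d))
```

With d = 1 this reduces to the classical single-choice process; raising d to 2 should give a visibly smaller maximum load, while going from 2 to 3 changes it much less.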
The Power of Two Random Choices: A Survey of Techniques and Results
 in Handbook of Randomized Computing
, 2000
Cited by 99 (2 self)
To motivate this survey, we begin with a simple problem that demonstrates a powerful fundamental idea. Suppose that $n$ balls are thrown into $n$ bins, with each ball choosing a bin independently and uniformly at random. Then the maximum load, or the largest number of balls in any bin, is approximately $\log n / \log\log n$ with high probability. Now suppose instead that the balls are placed sequentially, and each ball is placed in the least loaded of $d \ge 2$ bins chosen independently and uniformly at random. Azar, Broder, Karlin, and Upfal showed that in this case, the maximum load is $\log\log n / \log d + \Theta(1)$ with high probability [ABKU99]. The important implication of this result is that even a small amount of choice can lead to drastically different results in load balancing. Indeed, having just two random choices (i.e.,...
Space-Efficient Scheduling of Multithreaded Computations
 SIAM Journal on Computing
, 1993
Cited by 81 (14 self)
This paper considers the problem of scheduling dynamic parallel computations to achieve linear speedup without using significantly more space per processor than that required for a single-processor execution. Utilizing a new graph-theoretic model of multithreaded computation, execution efficiency is quantified by three important measures: $T_1$ is the time required for executing the computation on 1 processor, $T_\infty$ is the time required by an infinite number of processors, and $S_1$ is the space required to execute the computation on 1 processor. A computation executed on $P$ processors is time-efficient if the time is $O(T_1/P + T_\infty)$, that is, it achieves linear speedup when $P = O(T_1/T_\infty)$, and it is space-efficient if it uses $O(S_1 P)$ total space, that is, the space per processor is within a constant factor of that required for a 1-processor execution. The first result derived from this model shows that there exist multithreaded computations such that no execution schedule can simultan...
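A small numeric example may help show why the time bound gives linear speedup only while P stays below roughly T_1/T_inf. The sketch below simply evaluates the expression T_1/P + T_inf, ignoring the constant hidden by the O-notation; the function names and the sample values of T_1 and T_inf are hypothetical.

```python
def time_bound(t1, t_inf, p):
    """Return T_1/P + T_inf, the quantity inside the O(...) time bound."""
    return t1 / p + t_inf

def parallelism(t1, t_inf):
    """Return T_1/T_inf, the processor count up to which speedup stays linear."""
    return t1 / t_inf

if __name__ == "__main__":
    t1, t_inf = 1_000_000, 1_000      # hypothetical work and critical-path length
    print("parallelism:", parallelism(t1, t_inf))
    for p in (1, 10, 100, 1_000, 10_000):
        bound = time_bound(t1, t_inf, p)
        # Speedup relative to the one-processor time T_1.
        print("P =", p, "bound =", bound, "speedup =", round(t1 / bound, 1))
```

For P well below T_1/T_inf the T_1/P term dominates and the speedup tracks P; beyond that point the T_inf term dominates and adding processors yields little.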
Diffracting trees
 In Proceedings of the 5th Annual ACM Symposium on Parallel Algorithms and Architectures. ACM
, 1994
Cited by 58 (11 self)
Shared counters are among the most basic coordination structures in multiprocessor computation, with applications ranging from barrier synchronization to concurrent-data-structure design. This article introduces diffracting trees, novel data structures for shared counting and load balancing in a distributed/parallel environment. Empirical evidence, collected on a simulated distributed shared-memory machine and several simulated message-passing architectures, shows that diffracting trees scale better and are more robust than both combining trees and counting networks, currently the most effective known methods for implementing concurrent counters in software. The use of a randomized coordination method together with a combinatorial data structure overcomes the resiliency drawbacks of combining trees. Our simulations show that to handle the same load, diffracting trees and counting networks should have a similar width $w$, yet the depth of a diffracting tree is $O(\log w)$, whereas counting networks have depth $O(\log^2 w)$. Diffracting trees have already been used to implement highly efficient producer/consumer queues, and we believe diffraction will prove to be an effective alternative paradigm to combining and queue-locking in the design of many concurrent data structures.
A Dynamic Distributed Load Balancing Algorithm with Provable Good Performance
 In Proceedings of the 5th Annual ACM Symposium on Parallel Algorithms and Architectures
, 1993
Cited by 44 (5 self)
The overall efficiency of parallel algorithms is most decisively affected by the strategy applied for the mapping of workload. Strategies for balancing dynamically generated workload on a processor network which are also useful for practical applications have been investigated intensively by simulations and by direct applications. This paper presents the complete theoretical analysis of a dynamically distributed load balancing strategy. The algorithm is adaptive by nature and is therefore useful for a broad range of applications. A similar algorithmic principle has already been implemented for a number of applications in the areas of combinatorial optimization, parallel programming languages and graphical animation. The algorithm performed convincingly for all these applications. In our analysis we will prove that the expected number of packets on each processor varies only by a constant factor compared with that on any other processor, independent of the generation and consumption of ...
Elimination Trees and the Construction of Pools and Stacks
, 1996
Cited by 42 (12 self)
Shared pools and stacks are two coordination structures with a history of applications ranging from simple producer/consumer buffers to job schedulers and procedure stacks. This paper introduces elimination trees, a novel form of diffracting trees that offer pool and stack implementations with superior response (on average constant) under high loads, while guaranteeing logarithmic time "deterministic" termination under sparse request patterns. A preliminary version of this paper appeared in the proceedings of the 7th Annual Symposium on Parallel Algorithms and Architectures (SPAA). Contact author email: shanir@theory.lcs.mit.edu. 1 Introduction: As multiprocessing breaks away from its traditional number-crunching role, we are likely to see a growing need for highly distributed and parallel coordination structures. A real-time application such as a system of sensors and actuators will require fast response under both sparse and intense activity levels (typical examples could be a ra...
The Cilk System for Parallel Multithreaded Computing
, 1996
Cited by 42 (2 self)
Although cost-effective parallel machines are now commercially available, the widespread use of parallel processing is still being held back, due mainly to the troublesome nature of parallel programming. In particular, it is still difficult to build efficient implementations of parallel applications whose communication patterns are either highly irregular or dependent upon dynamic information. Multithreading has become an increasingly popular way to implement these dynamic, asynchronous, concurrent programs. Cilk (pronounced "silk") is our C-based multithreaded computing system that provides provably good performance guarantees. This thesis describes the evolution of the Cilk language and runtime system, and describes applications which affected the evolution of the system.
Load-Sharing in Heterogeneous Systems via Weighted Factoring
 in Proceedings of the 8th Annual ACM Symposium on Parallel Algorithms and Architectures
, 1997
Cited by 31 (0 self)
We consider the problem of scheduling a parallel loop with independent iterations on a network of heterogeneous workstations, and demonstrate the effectiveness of a variant of factoring, a scheduling policy originating in the context of shared address-space homogeneous multiprocessors. In the new scheme, weighted factoring, processors are dynamically assigned decreasing-size chunks of iterations in proportion to their processing speeds. Through experiments on a network of SUN Sparc workstations we show that weighted factoring significantly outperforms variants of a work-stealing load-balancing algorithm and on certain applications dramatically outperforms factoring as well. We then study weighted work assignment analytically, giving upper and lower bounds on its performance under the assumption that the processor iteration execution times can be modeled as weighted random variables. Department of Computer Science, Polytechnic University, Brooklyn, NY, 11201. Research supported by AR...
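The chunking idea in this abstract can be sketched as follows. This is an illustrative approximation, not the paper's algorithm: it assumes the common factoring rule of handing out half of the remaining iterations per batch (the abstract does not state the fraction), it splits each batch in proportion to fixed per-processor weights, and it assigns chunks in a fixed round-robin order rather than on demand as processors become idle. The function name `weighted_factoring_chunks` and the `batch_fraction` parameter are hypothetical.

```python
import math

def weighted_factoring_chunks(total_iters, weights, batch_fraction=0.5):
    """Return a list of (processor index, chunk size) pairs covering total_iters.

    Each batch takes batch_fraction of the remaining iterations (an assumed
    rule) and splits it among processors in proportion to their weights.
    """
    remaining = total_iters
    weight_sum = sum(weights)
    schedule = []
    while remaining > 0:
        # Size of this batch; always at least one iteration.
        batch = max(1, math.ceil(remaining * batch_fraction))
        for proc, weight in enumerate(weights):
            if remaining == 0:
                break
            # Faster processors (larger weight) get proportionally larger chunks.
            chunk = max(1, math.ceil(batch * weight / weight_sum))
            chunk = min(chunk, remaining)
            schedule.append((proc, chunk))
            remaining -= chunk
    return schedule

if __name__ == "__main__":
    # Hypothetical setup: processor 1 is three times as fast as processor 0.
    for proc, chunk in weighted_factoring_chunks(100, [1, 3]):
        print("processor", proc, "gets", chunk, "iterations")
```

Chunk sizes shrink from batch to batch, so early chunks keep scheduling overhead low while the small late chunks absorb differences in finishing times.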