Cilk: An Efficient Multithreaded Runtime System
 JOURNAL OF PARALLEL AND DISTRIBUTED COMPUTING
, 1995
Cited by 751 (40 self)
Cilk (pronounced "silk") is a C-based runtime system for multithreaded parallel programming. In this paper, we document the efficiency of the Cilk work-stealing scheduler, both empirically and analytically. We show that on real and synthetic applications, the "work" and "critical-path length" of a Cilk computation can be used to model performance accurately. Consequently, a Cilk programmer can focus on reducing the computation's work and critical-path length, insulated from load balancing and other runtime scheduling issues. We also prove that for the class of "fully strict" (well-structured) programs, the Cilk scheduler achieves space, time, and communication bounds all within a constant factor of optimal. The Cilk ...
Software Transactional Memory
, 1995
Cited by 692 (14 self)
As we learn from the literature, flexibility in choosing synchronization operations greatly simplifies the task of designing highly concurrent programs. Unfortunately, existing hardware is inflexible and is at best on the level of a Load Linked/Store Conditional operation on a single word. Building on the hardware-based transactional synchronization methodology of Herlihy and Moss, we offer software transactional memory (STM), a novel software method for supporting flexible transactional programming of synchronization operations. STM is non-blocking, and can be implemented on existing machines using only a Load Linked/Store Conditional operation. We use STM to provide a general highly concurrent method for translating sequential object implementations to lock-free ones based on implementing a k-word compare&swap STM-transaction. Empirical evidence collected on simulated multiprocessor architectures shows that our method always outperforms all the lock-free translation methods in ...
Scheduling Multithreaded Computations by Work Stealing
, 1994
Cited by 574 (44 self)
This paper studies the problem of efficiently scheduling fully strict (i.e., well-structured) multithreaded computations on parallel computers. A popular and practical method of scheduling this kind of dynamic MIMD-style computation is "work stealing," in which processors needing work steal computational threads from other processors. In this paper, we give the first provably good work-stealing scheduler for multithreaded computations with dependencies. Specifically, our analysis shows that the expected time $T_P$ to execute a fully strict computation on $P$ processors using our work-stealing scheduler is $T_P = O(T_1/P + T_\infty)$, where $T_1$ is the minimum serial execution time of the multithreaded computation and $T_\infty$ is the minimum execution time with an infinite number of processors. Moreover, the space $S_P$ required by the execution satisfies $S_P \le S_1 P$. We also show that the expected total communication of the algorithm is at most $O(T_\infty S_{\max} P)$, where $S_{\max}$ is the size of the largest activation record of any thread, thereby justifying the folk wisdom that work-stealing schedulers are more communication efficient than their work-sharing counterparts. All three of these bounds are existentially optimal to within a constant factor.
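As a rough illustration of the time bound in this abstract, the sketch below evaluates T_1/P + T_inf for a range of processor counts. It assumes the constant hidden by the O-notation is 1, which the abstract does not state; the function names and the sample values of T_1 and T_inf are hypothetical and chosen only for illustration.

```python
def predicted_time(t1: float, t_inf: float, p: int) -> float:
    """Estimate T_P as T_1/P + T_inf.

    Assumption: the constant factor hidden by O(.) is taken to be 1.
    """
    if p < 1:
        raise ValueError("p must be at least 1")
    return t1 / p + t_inf


def predicted_speedup(t1: float, t_inf: float, p: int) -> float:
    """Speedup T_1 / T_P under the same simplified model."""
    return t1 / predicted_time(t1, t_inf, p)


if __name__ == "__main__":
    # Hypothetical computation: 10^6 units of work, critical path of 10^3.
    t1, t_inf = 1_000_000.0, 1_000.0
    for p in (1, 10, 100, 1_000, 10_000):
        print(p, round(predicted_speedup(t1, t_inf, p), 1))
```

Under this model the speedup is close to P while P is well below T_1/T_inf, and it can never exceed T_1/T_inf no matter how many processors are added.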
The Power of Two Choices in Randomized Load Balancing
 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS
, 1996
Cited by 290 (24 self)
Suppose that $n$ balls are placed into $n$ bins, each ball being placed into a bin chosen independently and uniformly at random. Then, with high probability, the maximum load in any bin is approximately $\log n / \log\log n$. Suppose instead that each ball is placed sequentially into the least full of $d$ bins chosen independently and uniformly at random. It has recently been shown that the maximum load is then only $\log\log n / \log d + O(1)$ with high probability. Thus giving each ball two choices instead of just one leads to an exponential improvement in the maximum load. This result demonstrates the power of two choices, and it has several applications to load balancing in distributed systems. In this thesis, we expand upon this result by examining related models and by developing techniques for stu...
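The balls-and-bins process described in this abstract can be simulated in a few lines. The sketch below is illustrative only: the function name `max_load`, the choice to sample the d candidate bins with replacement, and the tie-breaking rule (first candidate among equally loaded bins) are assumptions not specified in the abstract.

```python
import random


def max_load(n: int, d: int, rng: random.Random) -> int:
    """Place n balls into n bins and return the maximum bin load.

    Each ball looks at d bins chosen independently and uniformly at
    random (with replacement, an assumption) and goes to the least
    full of them. d = 1 is the classical single-choice process.
    """
    if n < 1 or d < 1:
        raise ValueError("n and d must be at least 1")
    loads = [0] * n
    for _ in range(n):
        candidates = [rng.randrange(n) for _ in range(d)]
        # min() returns the first least-loaded candidate on ties.
        target = min(candidates, key=loads.__getitem__)
        loads[target] += 1
    return max(loads)


if __name__ == "__main__":
    rng = random.Random(0)
    n = 100_000
    for d in (1, 2, 3):
        print(f"d={d}: maximum load {max_load(n, d, rng)}")
```

Running it for d = 1 and d = 2 at the same n gives a feel for the gap between the two regimes, although a single run is only one sample of a random process.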
The Power of Two Random Choices: A Survey of Techniques and Results
 in Handbook of Randomized Computing
, 2000
Cited by 140 (6 self)
To motivate this survey, we begin with a simple problem that demonstrates a powerful fundamental idea. Suppose that $n$ balls are thrown into $n$ bins, with each ball choosing a bin independently and uniformly at random. Then the maximum load, or the largest number of balls in any bin, is approximately $\log n / \log\log n$ with high probability. Now suppose instead that the balls are placed sequentially, and each ball is placed in the least loaded of $d \ge 2$ bins chosen independently and uniformly at random. Azar, Broder, Karlin, and Upfal showed that in this case, the maximum load is $\log\log n / \log d + \Theta(1)$ with high probability [ABKU99]. The important implication of this result is that even a small amount of choice can lead to drastically different results in load balancing. Indeed, having just two random choices (i.e.,...
Space-Efficient Scheduling of Multithreaded Computations
 SIAM Journal on Computing
, 1993
Cited by 113 (19 self)
This paper considers the problem of scheduling dynamic parallel computations to achieve linear speedup without using significantly more space per processor than that required for a single-processor execution. Utilizing a new graph-theoretic model of multithreaded computation, execution efficiency is quantified by three important measures: $T_1$ is the time required for executing the computation on 1 processor, $T_\infty$ is the time required by an infinite number of processors, and $S_1$ is the space required to execute the computation on 1 processor. A computation executed on $P$ processors is time-efficient if the time is $O(T_1/P + T_\infty)$, that is, it achieves linear speedup when $P = O(T_1/T_\infty)$, and it is space-efficient if it uses $O(S_1 P)$ total space, that is, the space per processor is within a constant factor of that required for a 1-processor execution. The first result derived from this model shows that there exist multithreaded computations such that no execution schedule can simultan...
Diffracting trees
 In Proceedings of the 5th Annual ACM Symposium on Parallel Algorithms and Architectures. ACM
, 1994
Cited by 62 (13 self)
Shared counters are among the most basic coordination structures in multiprocessor computation, with applications ranging from barrier synchronization to concurrent-data-structure design. This article introduces diffracting trees, novel data structures for shared counting and load balancing in a distributed/parallel environment. Empirical evidence, collected on a simulated distributed shared-memory machine and several simulated message-passing architectures, shows that diffracting trees scale better and are more robust than both combining trees and counting networks, currently the most effective known methods for implementing concurrent counters in software. The use of a randomized coordination method together with a combinatorial data structure overcomes the resiliency drawbacks of combining trees. Our simulations show that to handle the same load, diffracting trees and counting networks should have a similar width $w$, yet the depth of a diffracting tree is $O(\log w)$, whereas counting networks have depth $O(\log^2 w)$. Diffracting trees have already been used to implement highly efficient producer/consumer queues, and we believe diffraction will prove to be an effective alternative paradigm to combining and queue-locking in the design of many concurrent data structures.
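To make the width and depth figures in this abstract concrete, here is a sequential sketch of the toggle-bit counting tree that underlies a tree-shaped shared counter of width w and depth log2(w). It is a simplification under stated assumptions: it deliberately omits the randomized "diffraction" step and all concurrency control, so it shows only how tokens are routed to leaf counters. The class name `CountingTree` and its method names are hypothetical.

```python
class CountingTree:
    """Sequential model of a binary tree of toggle balancers.

    Assumptions: width is a power of two; no concurrency and no
    diffraction are modelled. Leaf i hands out i, i + w, i + 2w, ...
    """

    def __init__(self, width: int) -> None:
        if width < 1 or width & (width - 1):
            raise ValueError("width must be a power of two")
        self.width = width
        self.depth = width.bit_length() - 1  # log2(width) levels
        # One toggle bit per internal node, keyed by (level, path so far).
        self.toggles: dict[tuple[int, int], int] = {}
        self.counters = list(range(width))

    def next_value(self) -> int:
        leaf = 0
        for level in range(self.depth):
            key = (level, leaf)
            bit = self.toggles.get(key, 0)
            self.toggles[key] = bit ^ 1   # flip the toggle for the next token
            leaf |= bit << level          # root decides the lowest-order bit
        value = self.counters[leaf]
        self.counters[leaf] += self.width
        return value


if __name__ == "__main__":
    tree = CountingTree(8)
    print([tree.next_value() for _ in range(12)])
```

Each token touches only log2(w) toggles and one leaf counter, which is the depth property the abstract contrasts with counting networks.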
A Dynamic Distributed Load Balancing Algorithm with Provable Good Performance
 In Proceedings of the 5th Annual ACM Symposium on Parallel Algorithms and Architectures
, 1993
Cited by 47 (5 self)
The overall efficiency of parallel algorithms is most decisively affected by the strategy applied for the mapping of workload. Strategies for balancing dynamically generated workload on a processor network which are also useful for practical applications have been intensively investigated by simulations and by direct applications. This paper presents the complete theoretical analysis of a dynamically distributed load balancing strategy. The algorithm is adaptive by nature and is therefore useful for a broad range of applications. A similar algorithmic principle has already been implemented for a number of applications in the areas of combinatorial optimization, parallel programming languages and graphical animation. The algorithm performed convincingly for all these applications. In our analysis we will prove that the expected number of packets on each processor varies only by a constant factor compared with that on any other processor, independent of the generation and consumption of ...
The Cilk System for Parallel Multithreaded Computing
, 1996
Cited by 44 (2 self)
Although cost-effective parallel machines are now commercially available, the widespread use of parallel processing is still being held back, due mainly to the troublesome nature of parallel programming. In particular, it is still difficult to build efficient implementations of parallel applications whose communication patterns are either highly irregular or dependent upon dynamic information. Multithreading has become an increasingly popular way to implement these dynamic, asynchronous, concurrent programs. Cilk (pronounced "silk") is our C-based multithreaded computing system that provides provably good performance guarantees. This thesis describes the evolution of the Cilk language and runtime system, and describes applications which affected the evolution of the system.
Elimination Trees and the Construction of Pools and Stacks
, 1996
Cited by 44 (12 self)
Shared pools and stacks are two coordination structures with a history of applications ranging from simple producer/consumer buffers to job-schedulers and procedure stacks. This paper introduces elimination trees, a novel form of diffracting trees that offer pool and stack implementations with superior response (on average constant) under high loads, while guaranteeing logarithmic time "deterministic" termination under sparse request patterns.
Footnote 1: A preliminary version of this paper appeared in the proceedings of the 7th Annual Symposium on Parallel Algorithms and Architectures (SPAA). Contact author email: shanir@theory.lcs.mit.edu
1 Introduction
As multiprocessing breaks away from its traditional number-crunching role, we are likely to see a growing need for highly distributed and parallel coordination structures. A real-time application such as a system of sensors and actuators will require fast response under both sparse and intense activity levels (typical examples could be a ra...