Results 1-10 of 43
Cilk: An Efficient Multithreaded Runtime System
Journal of Parallel and Distributed Computing, 1995
Cited by 534 (39 self)
Cilk (pronounced "silk") is a C-based runtime system for multithreaded parallel programming. In this paper, we document the efficiency of the Cilk work-stealing scheduler, both empirically and analytically. We show that on real and synthetic applications, the "work" and "critical-path length" of a Cilk computation can be used to model performance accurately. Consequently, a Cilk programmer can focus on reducing the computation's work and critical-path length, insulated from load balancing and other runtime scheduling issues. We also prove that for the class of "fully strict" (well-structured) programs, the Cilk scheduler achieves space, time, and communication bounds all within a constant factor of optimal. The Cilk
Scheduling Multithreaded Computations by Work Stealing
Cited by 398 (38 self)
This paper studies the problem of efficiently scheduling fully strict (i.e., well-structured) multithreaded computations on parallel computers. A popular and practical method of scheduling this kind of dynamic MIMD-style computation is "work stealing," in which processors needing work steal computational threads from other processors. In this paper, we give the first provably good work-stealing scheduler for multithreaded computations with dependencies. Specifically,
The Power of Two Random Choices: A Survey of Techniques and Results
In Handbook of Randomized Computing, 2000
Cited by 99 (2 self)
To motivate this survey, we begin with a simple problem that demonstrates a powerful fundamental idea. Suppose that $n$ balls are thrown into $n$ bins, with each ball choosing a bin independently and uniformly at random. Then the maximum load, or the largest number of balls in any bin, is approximately $\log n / \log \log n$ with high probability. Now suppose instead that the balls are placed sequentially, and each ball is placed in the least loaded of $d \ge 2$ bins chosen independently and uniformly at random. Azar, Broder, Karlin, and Upfal showed that in this case, the maximum load is $\log \log n / \log d + \Theta(1)$ with high probability [ABKU99]. The important implication of this result is that even a small amount of choice can lead to drastically different results in load balancing. Indeed, having just two random choices (i.e.,...
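The balls-and-bins process described in this abstract can be tried out with a small simulation. The following is a minimal sketch, not code from the survey: the function name `max_load`, the choice to sample the `d` candidate bins with replacement, and breaking ties in favor of the first sampled bin are all assumptions made for illustration.

```python
import random


def max_load(n_balls, n_bins, d, rng):
    """Place n_balls sequentially into n_bins and return the maximum load.

    Each ball samples d bins independently and uniformly at random
    (with replacement, an assumption of this sketch) and is placed in
    the least loaded of them. With d = 1 this is plain random placement.
    """
    loads = [0] * n_bins
    for _ in range(n_balls):
        candidates = [rng.randrange(n_bins) for _ in range(d)]
        # min() returns the first least-loaded candidate, so ties go to
        # the bin that was sampled first.
        target = min(candidates, key=loads.__getitem__)
        loads[target] += 1
    return max(loads)


if __name__ == "__main__":
    n = 50_000
    for d in (1, 2, 3):
        # A fresh generator with the same seed makes each run repeatable.
        print(f"d={d}: max load = {max_load(n, n, d, random.Random(0))}")
```

Running it for a fixed `n` and increasing `d` should show the sharp drop in maximum load between `d = 1` and `d = 2` that the abstract describes, with much smaller gains from further choices.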
The data locality of work stealing
Theory of Computing Systems, 2000
Cited by 68 (14 self)
This paper studies the data locality of the work-stealing scheduling algorithm on hardware-controlled shared-memory machines. We present lower and upper bounds on the number of cache misses using work stealing, and introduce a locality-guided work-stealing algorithm along with experimental validation. As a lower bound, we show that there is a family of multithreaded computations $G_n$, each member of which requires $\Theta(n)$ total instructions (work), for which when using work stealing the number of cache misses on one processor is constant, while even on two processors the total number of cache misses is $\Omega(n)$. This implies that for general computations there is no useful bound relating multiprocessor to uniprocessor cache misses. For nested-parallel computations, however, we show that on $P$ processors the expected additional number of cache misses beyond those on a single processor is bounded by $O(C \lceil m/s \rceil P T_\infty)$, where $m$ is the execution time of an instruction incurring a cache miss, $s$ is the steal time, $C$ is the size of cache, and $T_\infty$ is the number of nodes on the longest chain of dependences. Based on this we give strong bounds on the total running time of nested-parallel computations using work stealing. For the second part of our results, we present a locality-guided work-stealing algorithm that improves the data locality of multithreaded computations by allowing a thread to have an affinity for a processor. Our initial experiments on iterative data-parallel applications show that the algorithm matches the performance of static partitioning under traditional workloads but improves performance by up to 50% over static partitioning under multiprogrammed workloads. Furthermore, locality-guided work stealing improves the performance of work stealing by up to 80%.
The Cilk System for Parallel Multithreaded Computing
1996
Cited by 42 (2 self)
Although cost-effective parallel machines are now commercially available, the widespread use of parallel processing is still being held back, due mainly to the troublesome nature of parallel programming. In particular, it is still difficult to build efficient implementations of parallel applications whose communication patterns are either highly irregular or dependent upon dynamic information. Multithreading has become an increasingly popular way to implement these dynamic, asynchronous, concurrent programs. Cilk (pronounced "silk") is our C-based multithreaded computing system that provides provably good performance guarantees. This thesis describes the evolution of the Cilk language and runtime system, and describes applications which affected the evolution of the system.
Towards a Taxonomy of Parallel Tabu Search Heuristics
1997
Cited by 41 (8 self)
In this paper we present a classification of parallel tabu search metaheuristics based, on the one hand, on the control and communication strategies used in the design of the parallel tabu search procedures and, on the other hand, on how the search space is partitioned. These criteria are then used to review the parallel tabu search implementations described in the literature. The taxonomy is further illustrated by the results of several parallelization implementations of a tabu search procedure for multicommodity location-allocation problems with balancing requirements. Key words: tabu search metaheuristics, parallelization strategies, taxonomy. Résumé: We present a classification scheme for parallel tabu search algorithms. The taxonomy is based, on the one hand, on the control and communication strategies of parallel tabu search algorithms and, on the other hand, on the rules for partitioning the domain. These criteria are then...
Load-Sharing in Heterogeneous Systems via Weighted Factoring
In Proceedings of the 8th Annual ACM Symposium on Parallel Algorithms and Architectures, 1997
Cited by 31 (0 self)
We consider the problem of scheduling a parallel loop with independent iterations on a network of heterogeneous workstations, and demonstrate the effectiveness of a variant of factoring, a scheduling policy originating in the context of shared address-space homogeneous multiprocessors. In the new scheme, weighted factoring, processors are dynamically assigned decreasing-size chunks of iterations in proportion to their processing speeds. Through experiments on a network of SUN Sparc workstations we show that weighted factoring significantly outperforms variants of a work-stealing load-balancing algorithm and on certain applications dramatically outperforms factoring as well. We then study weighted work assignment analytically, giving upper and lower bounds on its performance under the assumption that the processor iteration execution times can be modeled as weighted random variables. Department of Computer Science, Polytechnic University, Brooklyn, NY, 11201. Research supported by AR...
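The chunk-assignment idea in this abstract (decreasing-size chunks, proportional to processor speed) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the paper's algorithm: the abstract does not give the batching rule, so the code assumes the common factoring convention that each batch hands out a fixed fraction (here half) of the remaining iterations. The function name `weighted_factoring_chunks` and the parameter `batch_fraction` are hypothetical.

```python
import math


def weighted_factoring_chunks(total_iters, weights, batch_fraction=0.5):
    """Return a list of batches; each batch lists one chunk size per processor.

    Assumptions of this sketch (not taken from the paper):
      * each batch distributes `batch_fraction` of the remaining iterations;
      * within a batch, processor i receives a share proportional to
        weights[i], its relative processing speed, rounded up.
    """
    if total_iters < 0:
        raise ValueError("total_iters must be non-negative")
    if not weights or any(w <= 0 for w in weights):
        raise ValueError("weights must be a non-empty list of positive numbers")
    if not 0 < batch_fraction <= 1:
        raise ValueError("batch_fraction must be in (0, 1]")

    total_weight = sum(weights)
    remaining = total_iters
    batches = []
    while remaining > 0:
        # The batch is sized once, before any chunk in it is handed out.
        batch_size = max(1, math.ceil(remaining * batch_fraction))
        batch = []
        for w in weights:
            share = math.ceil(batch_size * w / total_weight)
            chunk = min(share, remaining)  # never hand out more than is left
            batch.append(chunk)
            remaining -= chunk
        batches.append(batch)
    return batches


if __name__ == "__main__":
    # Three workstations; the first is assumed to be twice as fast.
    for batch in weighted_factoring_chunks(100, [2, 1, 1]):
        print(batch)
```

Because every batch is sized from the iterations still remaining, chunk sizes shrink from batch to batch, and faster processors receive larger chunks within each batch. In a real scheduler the chunks would be handed out dynamically as processors become idle; this sketch only computes the sizes.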
DNA computers in vitro and vivo
1996
Cited by 29 (0 self)
We show how DNA molecules and standard lab techniques may be used to create a nondeterministic Turing machine. This is the first scheme that shows how to make a universal computer with DNA. We claim that both our scheme and previous ones will work but they probably cannot be scaled up to be of practical computational importance. In vivo,
Parallel propositional satisfiability checking with distributed dynamic learning
Parallel Computing, 2003
Cited by 18 (4 self)
We address the parallelization and distributed execution of an algorithm from the area of symbolic computation: propositional satisfiability (SAT) checking with dynamic learning. Our parallel programming models are strict multithreading for the core SAT checking procedure, complemented by mobile agents realizing a distributed dynamic learning process. Individual threads treat dynamically created subproblems, while mobile agents collect and distribute pertinent knowledge obtained during the learning process. The parallel algorithm runs on top of our parallel system platform DOTS (Distributed Object-Oriented Threads System), which provides support for our parallel programming models in highly heterogeneous distributed systems. We present performance measurements evaluating the performance gains by our approach in different application domains with practical significance. Key words: parallel symbolic computation, parallel propositional satisfiability checking, distributed multithreading.
Fast Priority Queues for Parallel Branch-and-Bound
In Workshop on Algorithms for Irregularly Structured Problems, number 980 in LNCS, 1995
Cited by 15 (2 self)
Currently used parallel best-first branch-and-bound algorithms either suffer from contention at a centralized priority queue or can only approximate the best-first strategy. Bottleneck-free algorithms for parallel priority queues are known but they cannot be implemented very efficiently on contemporary machines. We present quite simple randomized algorithms for parallel priority queues on distributed memory machines. For branch-and-bound they are asymptotically as efficient as previously known PRAM algorithms with high probability. The simplest versions require not much more communication than the approximated branch-and-bound algorithm of Karp and Zhang. Keywords: analysis of randomized algorithms, distributed memory, load balancing, median selection, parallel best-first branch-and-bound, parallel priority queue. 1 Introduction. Branch-and-bound search is an important technique for many combinatorial optimization problems. Since it can be a quite time-consuming technique, paralleli...