Results 1 - 10 of 43
Scheduling Multithreaded Computations by Work Stealing
, 1994
"... This paper studies the problem of efficiently scheduling fully strict (i.e., well-structured) multithreaded computations on parallel computers. A popular and practical method of scheduling this kind of dynamic MIMD-style computation is “work stealing," in which processors needing work steal com ..."
Abstract
-
Cited by 568 (34 self)
- Add to MetaCart
This paper studies the problem of efficiently scheduling fully strict (i.e., well-structured) multithreaded computations on parallel computers. A popular and practical method of scheduling this kind of dynamic MIMD-style computation is "work stealing," in which processors needing work steal computational threads from other processors. In this paper, we give the first provably good work-stealing scheduler for multithreaded computations with dependencies. Specifically, our analysis shows that the expected time T_P to execute a fully strict computation on P processors using our work-stealing scheduler is T_P = O(T_1/P + T_∞), where T_1 is the minimum serial execution time of the multithreaded computation and T_∞ is the minimum execution time with an infinite number of processors. Moreover, the space S_P required by the execution satisfies S_P ≤ S_1 P. We also show that the expected total communication of the algorithm is at most O(T_∞ S_max P), where S_max is the size of the largest activation record of any thread, thereby justifying the folk wisdom that work-stealing schedulers are more communication efficient than their work-sharing counterparts. All three of these bounds are existentially optimal to within a constant factor.
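Set in LaTeX, the abstract's three bounds read as follows; the comment records a standard consequence of the first bound (parallel slackness gives linear speedup), which is a reading of the formula rather than a sentence from the abstract:

    T_P = O\!\left(\frac{T_1}{P} + T_\infty\right), \qquad
    S_P \le S_1 P, \qquad
    \text{communication} = O(T_\infty \, S_{\max} \, P)
    % When P = O(T_1 / T_\infty), the first term dominates and
    % T_P = O(T_1 / P): speedup linear in the processor count.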
The Implementation of the Cilk-5 Multithreaded Language
, 1998
"... The fifth release of the multithreaded language Cilk uses a provably good "work-stealing " scheduling algorithm similar to the rst system, but the language has been completely redesigned and the runtime system completely reengineered. The efficiency of the new implementation was aided ..."
Abstract
-
Cited by 489 (28 self)
- Add to MetaCart
(Show Context)
The fifth release of the multithreaded language Cilk uses a provably good "work-stealing" scheduling algorithm similar to the first system, but the language has been completely redesigned and the runtime system completely reengineered. The efficiency of the new implementation was aided by a clear strategy that arose from a theoretical analysis of the scheduling algorithm: concentrate on minimizing overheads that contribute to the work, even at the expense of overheads that contribute to the critical path. Although it may seem counterintuitive to move overheads onto the critical path, this "work-first" principle has led to a portable Cilk-5 implementation in which the typical cost of spawning a parallel thread is only between 2 and 6 times the cost of a C function call on a variety of contemporary machines. Many Cilk programs run on one processor with virtually no degradation compared to equivalent C programs. This paper describes how the work-first principle was exploited in the design of Cilk-5's compiler and its runtime system. In particular, we present Cilk-5's novel "two-clone" compilation strategy and its Dijkstra-like mutual-exclusion protocol for implementing the ready deque in the work-stealing scheduler.
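For readers unfamiliar with the language, a minimal sketch of Cilk-5 syntax: a spawned call may run in parallel with its continuation, and sync joins all outstanding children. The recursive fib below is the customary example, written from general Cilk-5 usage rather than quoted from this paper:

    /* Cilk-5 is C plus spawn/sync; a spawned call may execute in
       parallel with the caller's continuation. */
    cilk int fib(int n)
    {
        if (n < 2)
            return n;
        else {
            int x, y;
            x = spawn fib(n - 1);  /* usually runs as the "fast clone" */
            y = spawn fib(n - 2);
            sync;                  /* wait for both children */
            return x + y;
        }
    }

Under the two-clone strategy, each such function compiles to a fast clone executed in the common local case and a slow clone executed only after the frame has been stolen, which is how the spawn overhead stays within a small constant factor of a C function call.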
Parallel Programmability and the Chapel Language
- Int. J. High Perform. Comput. Appl
"... It is an increasingly common belief that the programmability of parallel machines is lacking, and that the high-end computing (HEC) community is suffering as a result of it. The population of users who can effectively program parallel machines comprises only a small fraction of those who can effecti ..."
Abstract
-
Cited by 193 (8 self)
- Add to MetaCart
(Show Context)
It is an increasingly common belief that the programmability of parallel machines is lacking, and that the high-end computing (HEC) community is suffering as a result of it. The population of users who can effectively program parallel machines comprises only a small fraction of those who can effectively program traditional sequential computers, and this gap seems only to be widening as time passes. The parallel computing community's inability to tap the skills …
An analysis of dag-consistent distributed shared-memory algorithms
- in Proceedings of the Eighth Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA
, 1996
"... In this paper, we analyze the performance of parallel multi-threaded algorithms that use dag-consistent distributed shared memory. Specifically, we analyze execution time, page faults, and space requirements for multithreaded algorithms executed by a FP(C) workstealing thread scheduler and the BACKE ..."
Abstract
-
Cited by 98 (19 self)
- Add to MetaCart
(Show Context)
In this paper, we analyze the performance of parallel multithreaded algorithms that use dag-consistent distributed shared memory. Specifically, we analyze execution time, page faults, and space requirements for multithreaded algorithms executed by a work-stealing thread scheduler and the BACKER algorithm for maintaining dag consistency. We prove that if the accesses to the backing store are random and independent (the BACKER algorithm actually uses hashing), the expected execution time T_P(C) of a "fully strict" multithreaded computation on P processors, each with an LRU cache of C pages, is O(T_1(C)/P + mCT_∞), where T_1(C) is the total work of the computation including page faults, T_∞ is its critical-path length excluding page faults, and m is the minimum page transfer time. …
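Set in LaTeX, with a one-line reading that is a consequence of the bound rather than a sentence from the abstract:

    T_P(C) = O\!\left(\frac{T_1(C)}{P} + m\,C\,T_\infty\right)
    % Linear speedup on the fault-inclusive work T_1(C) survives as
    % long as P = O(T_1(C) / (m C T_\infty)), i.e., while there is
    % enough parallel slackness to absorb the caching term.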
Detecting Data Races in Cilk Programs that Use Locks
, 1998
"... When two parallel threads holding no locks in common access the same memory location and at least one of the threads modifies the location, a “data race ” occurs, which is usually a bug. This paper describes the algorithms and strategies used by a debugging tool, called the Nondeterminator-2, which ..."
Abstract
-
Cited by 89 (9 self)
- Add to MetaCart
(Show Context)
When two parallel threads holding no locks in common access the same memory location and at least one of the threads modifies the location, a "data race" occurs, which is usually a bug. This paper describes the algorithms and strategies used by a debugging tool, called the Nondeterminator-2, which checks for data races in programs coded in the Cilk multithreaded language. Like its predecessor, the Nondeterminator, which checks for simple "determinacy" races, the Nondeterminator-2 is a debugging tool, not a verifier, since it checks for data races only in the computation generated by a serial execution of the program on a given input. We give an algorithm, ALL-SETS, that determines whether the computation generated by a serial execution of a Cilk program on a given input contains a race. For a program that runs serially in time T, accesses V shared memory locations, uses a total of n locks, and holds at most k ≪ n locks simultaneously, ALL-SETS runs in …
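The race condition in the opening sentence reduces to a set test: two accesses to the same location race iff they are logically parallel, at least one writes, and their lock sets are disjoint. A minimal C sketch of that test follows; the access_t layout and the in_parallel placeholder are illustrative assumptions, not the Nondeterminator-2's actual data structures, which derive parallelism from Cilk's series-parallel dag:

    #include <stdbool.h>
    #include <stdint.h>

    /* One memory access: issuing thread, read/write, and the set of
       locks held, represented as a bitmask over at most 64 locks. */
    typedef struct {
        int      thread;
        bool     is_write;
        uint64_t locks_held;   /* bit i set => lock i held */
    } access_t;

    /* Placeholder: treat accesses by different threads as parallel.
       The real tool decides this from the computation's dag. */
    static bool in_parallel(const access_t *a, const access_t *b)
    {
        return a->thread != b->thread;
    }

    /* Two accesses to the same location form a data race iff they are
       parallel, at least one writes, and they hold no lock in common. */
    bool is_data_race(const access_t *a, const access_t *b)
    {
        return in_parallel(a, b)
            && (a->is_write || b->is_write)
            && (a->locks_held & b->locks_held) == 0;
    }

    int main(void)
    {
        access_t w = { .thread = 1, .is_write = true,  .locks_held = 0x1 };
        access_t r = { .thread = 2, .is_write = false, .locks_held = 0x2 };
        /* Disjoint lock sets and one write: reported as a race. */
        return is_data_race(&w, &r) ? 0 : 1;
    }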
Adaptive and Reliable Parallel Computing on Networks of Workstations
, 1996
"... In this paper, we present the design of Cilk-NOW, a runtime system that adaptively and reliably executes functional Cilk programs in parallel on a network of UNIX workstations. Cilk (pronounced “silk”) is a parallel multithreaded extension of the C language, and all Cilk runtime systems employ a pro ..."
Abstract
-
Cited by 80 (1 self)
- Add to MetaCart
(Show Context)
In this paper, we present the design of Cilk-NOW, a runtime system that adaptively and reliably executes functional Cilk programs in parallel on a network of UNIX workstations. Cilk (pronounced "silk") is a parallel multithreaded extension of the C language, and all Cilk runtime systems employ a provably efficient thread-scheduling algorithm. Cilk-NOW is such a runtime system, and in addition, Cilk-NOW automatically delivers adaptive and reliable execution for a functional subset of Cilk programs. By adaptive execution, we mean that each Cilk program dynamically utilizes a changing set of otherwise-idle workstations. By reliable execution, we mean that the Cilk-NOW system as a whole and each executing Cilk program are able to tolerate machine and network faults. Cilk-NOW provides these features while programs remain fault oblivious, meaning that Cilk programmers need not code for fault tolerance. Throughout this paper, we focus on end-to-end design decisions, and we show how these decisions allow the design to exploit high-level algorithmic properties of the Cilk programming model in order to simplify and streamline the implementation.
Dag-consistent distributed shared memory
- IN PROCEEDINGS OF THE 10TH INTERNATIONAL PARALLEL PROCESSING SYMPOSIUM (IPPS
, 1996
"... We introduce dag consistency, a relaxed consistency model for distributed shared memory which is suitable for multithreaded programming. We have implemented dag-consistency in software for the Cilk multithreaded runtime system running on a Connection Machine CM5. Our implementation includes a dag-co ..."
Abstract
-
Cited by 62 (10 self)
- Add to MetaCart
(Show Context)
We introduce dag consistency, a relaxed consistency model for distributed shared memory which is suitable for multithreaded programming. We have implemented dag consistency in software for the Cilk multithreaded runtime system running on a Connection Machine CM5. Our implementation includes a dag-consistent distributed cactus stack for storage allocation. We provide empirical evidence of the flexibility and efficiency of dag consistency for applications that include blocked matrix multiplication, Strassen's matrix multiplication algorithm, and a Barnes-Hut code. Although Cilk schedules the executions of these programs dynamically, their performances are competitive with statically scheduled implementations in the literature. We also prove that the number F_P of page faults incurred by a user program running on P processors can be related to the number F_1 of page faults running serially by the formula F_P ≤ F_1 + 2Cs, where C is the cache size and s is the number of thread migrations executed by Cilk's scheduler.
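The page-fault bound in LaTeX, with an interpretation that follows from the formula rather than being quoted from the abstract:

    F_P \;\le\; F_1 + 2\,C\,s
    % Each of the s thread migrations adds at most 2C faults over the
    % serial count F_1 -- roughly the cost of refilling an LRU cache
    % of C pages around a migration.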
Athapascan-1: On-Line Building Data Flow Graph in a Parallel Language
- In PACT
, 1998
"... In order to achieve practical efficient execution on a parallel architecture, a knowledge of the data dependencies related to the application appears as the key point for building an efficient schedule. By restricting accesses in shared memory, we show that such a data dependency graph can be comput ..."
Abstract
-
Cited by 34 (6 self)
- Add to MetaCart
(Show Context)
In order to achieve practically efficient execution on a parallel architecture, knowledge of the data dependencies of the application is the key to building an efficient schedule. By restricting accesses to shared memory, we show that such a data-dependency graph can be computed on-line on a distributed architecture. The overhead introduced is bounded with respect to the parallelism expressed by the user: each basic computation corresponds to a user-defined task, and each data dependency to a user-defined data structure. We introduce a language named Athapascan-1 that builds a graph of dependencies from a strong typing of shared-memory accesses. We detail the compilation and implementation of the language. Moreover, the performance of a code (parallel time, communication and arithmetic work, memory space) is defined from a cost model without the need for a machine model. We exhibit efficient scheduling with respect to these costs on theoretical machine models.
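To illustrate the general principle rather than Athapascan-1's actual API (which is C++-based), the C sketch below shows how declaring each task's shared accesses as reads or writes lets a runtime build the dependency dag on-line: each reader depends on the last writer of the datum, and a new writer becomes the current producer. Only flow (producer-to-consumer) dependencies are tracked here; all names are hypothetical:

    #include <stdio.h>

    #define MAX_TASKS 64
    #define MAX_DATA  16

    /* Hypothetical runtime state: for each shared datum, the task that
       last wrote it; accumulated edges form the on-line dependency dag. */
    static int last_writer[MAX_DATA];            /* -1 = no writer yet */
    static int edges[MAX_TASKS][2];
    static int n_edges;

    /* A task declares each shared access with a mode, as a strongly
       typed shared-memory access would. */
    typedef enum { READ, WRITE } access_mode;

    static void declare_access(int task, int datum, access_mode mode)
    {
        if (last_writer[datum] >= 0) {           /* depend on producer */
            edges[n_edges][0] = last_writer[datum];
            edges[n_edges][1] = task;
            n_edges++;
        }
        if (mode == WRITE)
            last_writer[datum] = task;           /* new producer */
    }

    int main(void)
    {
        for (int d = 0; d < MAX_DATA; d++) last_writer[d] = -1;

        declare_access(0, 0, WRITE);   /* task 0 produces datum 0 */
        declare_access(1, 0, READ);    /* task 1 consumes it: edge 0->1 */
        declare_access(2, 0, READ);    /* task 2 consumes it: edge 0->2 */

        for (int e = 0; e < n_edges; e++)
            printf("task %d -> task %d\n", edges[e][0], edges[e][1]);
        return 0;
    }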