Scheduling Multithreaded Computations by Work Stealing
Cited by 316 (32 self)
This paper studies the problem of efficiently scheduling fully strict (i.e., well-structured) multithreaded computations on parallel computers. A popular and practical method of scheduling this kind of dynamic MIMD-style computation is "work stealing," in which processors needing work steal computational threads from other processors. In this paper, we give the first provably good work-stealing scheduler for multithreaded computations with dependencies. Specifically,
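The stealing rule described in this abstract can be illustrated with a small single-threaded simulation. This is only a sketch under stated assumptions, not the paper's scheduler: the names `simulate_work_stealing` and `make_split_task` are hypothetical, tasks are plain callables that return the child tasks they spawn, workers take turns in rounds, and join dependencies between threads are ignored. Each worker runs tasks from the bottom of its own deque and, when idle, tries to steal the oldest task from the top of a randomly chosen victim.

```python
import random
from collections import deque


def make_split_task(lo, hi, out):
    """Hypothetical example task: split [lo, hi) in half until single elements remain."""
    def task():
        if hi - lo <= 1:
            out.append(lo)          # leaf: record one element
            return []               # no children spawned
        mid = (lo + hi) // 2
        # Spawn two children; they go onto the executing worker's deque.
        return [make_split_task(lo, mid, out), make_split_task(mid, hi, out)]
    return task


def simulate_work_stealing(root, num_workers, seed=0):
    """Round-based simulation of work stealing (assumption: no real parallelism)."""
    rng = random.Random(seed)
    deques = [deque() for _ in range(num_workers)]
    deques[0].append(root)          # all work starts on worker 0
    executed = [0] * num_workers
    steals = 0
    while any(deques):
        for w in range(num_workers):
            if deques[w]:
                task = deques[w].pop()              # owner works at the bottom
            else:
                others = [v for v in range(num_workers) if v != w]
                if not others:
                    continue
                victim = rng.choice(others)         # random victim, may be empty
                if not deques[victim]:
                    continue                        # failed steal attempt
                task = deques[victim].popleft()     # thief takes from the top
                steals += 1
            deques[w].extend(task())                # run task, push its children
            executed[w] += 1
    return executed, steals
```

Stealing from the top takes the oldest task, which in a divide-and-conquer computation tends to be the largest remaining piece of work, so a successful steal usually keeps the thief busy for a while.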
The Implementation of the Cilk-5 Multithreaded Language
- In Proceedings of the SIGPLAN '98 Conference on Programming Language Design and Implementation, 1998
Cited by 248 (20 self)
The fifth release of the multithreaded language Cilk uses a provably good "work-stealing" scheduling algorithm similar to the first system, but the language has been completely redesigned and the runtime system completely reengineered. The efficiency of the new implementation was aided by a clear strategy that arose from a theoretical analysis of the scheduling algorithm: concentrate on minimizing overheads that contribute to the work, even at the expense of overheads that contribute to the critical path. Although it may seem counterintuitive to move overheads onto the critical path, this "work-first" principle has led to a portable Cilk-5 implementation in which the typical cost of spawning a parallel thread is only between 2 and 6 times the cost of a C function call on a variety of contemporary machines. Many Cilk programs run on one processor with virtually no degradation compared to equivalent C programs. This paper describes how the work-first principle was exploited in the design...
Detecting Data Races in Cilk Programs that Use Locks
, 1998
Cited by 75 (10 self)
When two parallel threads holding no locks in common access the same memory location and at least one of the threads modifies the location, a “data race” occurs, which is usually a bug. This paper describes the algorithms and strategies used by a debugging tool, called the Nondeterminator-2, which checks for data races in programs coded in the Cilk multithreaded language. Like its predecessor, the Nondeterminator, which checks for simple “determinacy” races, the Nondeterminator-2 is a debugging tool, not a verifier, since it checks for data races only in the computation generated by a serial execution of the program on a given input. We give an algorithm, ALL-SETS, that determines whether the computation generated by a serial execution of a Cilk program on a given input contains a race. For a program that runs serially in time $T$, accesses $V$ shared memory locations, uses a total of $n$ locks, and holds at most $k \le n$ locks simultaneously, ALL-SETS runs in $O(n^k T \alpha(V, V))$ time and $O(n^k V)$ space, where $\alpha$ is Tarjan’s functional inverse of Ackermann’s function. Since ALL-SETS may be too inefficient in the worst case, we propose a much more efficient algorithm which can be used to detect races in programs that obey the “umbrella” locking discipline, a programming methodology that is more flexible than similar disciplines proposed in the literature. We present an algorithm, BRELLY, which detects violations of the umbrella discipline in $O(kT \alpha(V, V))$ time using $O(kV)$ space. We also prove that any “abelian” Cilk program, one whose critical sections commute, produces a determinate final state if it is deadlock free and if it generates any computation which is data-race free. Thus, the Nondeterminator-2’s two algorithms can verify the determinacy of a deadlock-free abelian program running on a given input.
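The definition of a data race in the first sentence of this abstract can be checked directly on a recorded list of accesses. The sketch below is a brute-force, quadratic check of that definition only; it is not ALL-SETS or BRELLY, and the `Access` record, the `find_races` function, and the caller-supplied `parallel` predicate are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Access:
    thread: str          # identifier of the thread performing the access
    location: str        # shared memory location touched
    is_write: bool       # True if the access modifies the location
    locks: frozenset     # locks held at the time of the access


def find_races(accesses, parallel):
    """Return pairs of accesses that satisfy the data-race definition.

    `parallel(t1, t2)` is an assumed predicate telling whether two threads
    may run in parallel; how to compute it is outside this sketch.
    """
    races = []
    for a, b in combinations(accesses, 2):
        if (a.location == b.location
                and (a.is_write or b.is_write)      # at least one writer
                and not (a.locks & b.locks)         # no lock held in common
                and parallel(a.thread, b.thread)):
            races.append((a, b))
    return races
```

Comparing every pair of accesses costs time quadratic in the number of accesses, which is the kind of cost the algorithms named in the abstract are designed to avoid.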
Adaptive and Reliable Parallel Computing on Networks of Workstations
, 1996
Cited by 60 (1 self)
In this paper, we present the design of Cilk-NOW, a runtime system that adaptively and reliably executes functional Cilk programs in parallel on a network of UNIX workstations. Cilk (pronounced “silk”) is a parallel multithreaded extension of the C language, and all Cilk runtime systems employ a provably efficient thread-scheduling algorithm. Cilk-NOW is such a runtime system, and in addition, Cilk-NOW automatically delivers adaptive and reliable execution for a functional subset of Cilk programs. By adaptive execution, we mean that each Cilk program dynamically utilizes a changing set of otherwise-idle workstations. By reliable execution, we mean that the Cilk-NOW system as a whole and each executing Cilk program are able to tolerate machine and network faults. Cilk-NOW provides these features while programs remain fault oblivious, meaning that Cilk programmers need not code for fault tolerance. Throughout this paper, we focus on end-to-end design decisions, and we show how these decisions allow the design to exploit high-level algorithmic properties of the Cilk programming model in order to simplify and streamline the implementation.
Dag-consistent distributed shared memory
- In Proceedings of the 10th International Parallel Processing Symposium (IPPS), 1996
Cited by 57 (13 self)
We introduce dag consistency, a relaxed consistency model for distributed shared memory which is suitable for multithreaded programming. We have implemented dag consistency in software for the Cilk multithreaded runtime system running on a Connection Machine CM5. Our implementation includes a dag-consistent distributed cactus stack for storage allocation. We provide empirical evidence of the flexibility and efficiency of dag consistency for applications that include blocked matrix multiplication, Strassen’s matrix multiplication algorithm, and a Barnes-Hut code. Although Cilk schedules the executions of these programs dynamically, their performances are competitive with statically scheduled implementations in the literature. We also prove that the number $F_P$ of page faults incurred by a user program running on $P$ processors can be related to the number $F_1$ of page faults running serially by the formula $F_P \le F_1 + 2Cs$, where $C$ is the cache size and $s$ is the number of thread migrations executed by Cilk’s scheduler.
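As a worked illustration of the page-fault bound stated at the end of this abstract, read as $F_P \le F_1 + 2Cs$, the following plugs in made-up numbers. The values of $F_1$, $C$, and $s$ are assumptions chosen only for the arithmetic; they are not measurements from the paper.

```latex
\begin{align*}
F_1 &= 10\,000 \text{ page faults (serial run, assumed)}, \quad
C = 256 \text{ pages (assumed)}, \quad
s = 50 \text{ migrations (assumed)} \\
F_P &\le F_1 + 2Cs = 10\,000 + 2 \cdot 256 \cdot 50 = 10\,000 + 25\,600 = 35\,600
\end{align*}
```

The extra term grows with the number of thread migrations $s$, so a schedule with few migrations keeps the parallel page-fault count close to the serial one.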
Parallel Programmability and the Chapel Language
- Int. J. High Perform. Comput. Appl
Cited by 55 (5 self)
It is an increasingly common belief that the programmability of parallel machines is lacking, and that the high-end computing (HEC) community is suffering as a result of it. The population of users who can effectively program parallel machines comprises only a small fraction of those who can effectively program traditional sequential computers, and this gap seems only to be widening as time passes. The parallel computing community’s inability to tap the skills
Precedence-based memory models
- In Eleventh International Workshop on Distributed Algorithms, number 1320 in Lecture Notes in Computer Science
, 1997
Cited by 25 (1 self)
We present a computation-centric theory of memory models. Unlike traditional processor-centric models, computation-centric models focus on the logical dependencies among instructions rather than the processor that happens to execute them. This theory allows us to define what a memory model is, and to investigate abstract properties of memory models. In particular, we focus on constructibility, which is a necessary property of those models that can be implemented exactly by an online algorithm. For a nonconstructible model, we show that there is a natural way to define the constructible version of that model. We explore the implications of constructibility in the context of dag-consistent memory models, which do not require that memory locations be serialized. The strongest dag-consistent model, called NN-dag consistency, is not constructible. However, its constructible version is equivalent to a model that we call location consistency, in which each location is serialized independently.
The Weakest Reasonable Memory Model
, 1998
Cited by 23 (3 self)
A memory model is some description of how memory behaves in a parallel computer system. While there is consensus that sequential consistency [Lamport 1979] is the strongest memory model, nobody seems to have tried to identify the weakest memory model. This thesis concerns itself with precisely this problem. We cannot hope to identify the weakest memory model unless we specify a minimal set of properties we want it to obey. In this thesis, we identify five such properties: completeness, monotonicity, constructibility, nondeterminism confinement, and classicality. Constructibility is especially interesting, because a nonconstructible model cannot be implemented exactly, and hence every implementation necessarily supports a stronger model. One nonconstructible model is, for example, dag consistency [Blumofe et al. 1996a]. We argue (with some caveats) that if one wants the five properties, then location consistency is the weakest reasonable memory model. In location consistency, every memo...
Data movement and control substrate for parallel scientific computing
- of Lecture Notes in Computer Science
, 1997
Cited by 21 (10 self)
In this paper, we describe the design and implementation of a data-movement and control substrate (DMCS) for network-based, homogeneous communication within a single multiprocessor. DMCS is an implementation of an API for communication and computation that has been proposed by the PORTS consortium. One of the goals of this consortium is to define an API that can support heterogeneous computing without undue performance penalties for homogeneous computing. Preliminary results in our implementation suggest that this is quite feasible. The DMCS implementation seeks to minimize the assumptions made about the homogeneous nature of its target architecture. Finally, we present some extensions to the API for PORTS that will improve the performance of sparse, adaptive, and irregular types of numeric computations.

