Cilk: An Efficient Multithreaded Runtime System
- JOURNAL OF PARALLEL AND DISTRIBUTED COMPUTING
, 1995
Cited by 431 (34 self)
Cilk (pronounced "silk") is a C-based runtime system for multithreaded parallel programming. In this paper, we document the efficiency of the Cilk work-stealing scheduler, both empirically and analytically. We show that on real and synthetic applications, the "work" and "critical-path length" of a Cilk computation can be used to model performance accurately. Consequently, a Cilk programmer can focus on reducing the computation's work and critical-path length, insulated from load balancing and other runtime scheduling issues. We also prove that for the class of "fully strict" (well-structured) programs, the Cilk scheduler achieves space, time, and communication bounds all within a constant factor of optimal. The Cilk
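The idea that work and critical-path length are enough to model performance can be illustrated with a small calculation. The sketch below assumes a model of the form T_P ≈ T_1/P + c·T_∞, where T_1 is the work, T_∞ the critical-path length, P the processor count, and c a fitted constant. The function names, the default c = 1.0, and the sample numbers are hypothetical and are not taken from the paper.

```python
def predicted_runtime(work, critical_path, processors, c_inf=1.0):
    """Estimate T_P as T_1 / P + c_inf * T_inf (assumed model form)."""
    if processors < 1:
        raise ValueError("processors must be >= 1")
    return work / processors + c_inf * critical_path


def parallelism(work, critical_path):
    """Average parallelism T_1 / T_inf, an upper limit on useful speedup."""
    return work / critical_path


if __name__ == "__main__":
    work, span = 1000.0, 10.0  # hypothetical measurements, arbitrary units
    for p in (1, 4, 16, 64):
        t = predicted_runtime(work, span, p)
        print(f"P={p:3d}  predicted T_P={t:8.2f}  speedup={work / t:6.2f}")
```

Under this assumed model, speedup is close to linear while P stays well below the parallelism T_1/T_∞, and it flattens once the critical-path term dominates.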
Scheduling Multithreaded Computations by Work Stealing
Cited by 316 (32 self)
This paper studies the problem of efficiently scheduling fully strict (i.e., well-structured) multithreaded computations on parallel computers. A popular and practical method of scheduling this kind of dynamic MIMD-style computation is "work stealing," in which processors needing work steal computational threads from other processors. In this paper, we give the first provably good work-stealing scheduler for multithreaded computations with dependencies. Specifically,
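The work-stealing idea described above, in which processors that run out of work take threads from other processors, can be sketched as a small step-by-step simulation. This is an illustrative toy, not the scheduler analyzed in the paper: the task model (an integer size that splits in half until it reaches 1), the two-phase step, and the function name `simulate_work_stealing` are all assumptions made for the example.

```python
import random
from collections import deque


def simulate_work_stealing(root_size, num_workers, seed=0):
    """Simulate randomized work stealing on a binary divide-and-conquer task.

    A task is an integer size: size > 1 splits into two halves, size 1 is a
    leaf unit of work. Returns (leaves_done, steals, steps).
    """
    rng = random.Random(seed)
    deques = [deque() for _ in range(num_workers)]
    deques[0].append(root_size)  # all work starts on worker 0
    leaves_done = steals = steps = 0

    while any(deques):
        steps += 1
        idle = [w for w in range(num_workers) if not deques[w]]

        # Phase 1: each busy worker runs the newest task on its own deque.
        for w in range(num_workers):
            if w in idle:
                continue
            task = deques[w].pop()
            if task > 1:
                half = task // 2
                deques[w].append(task - half)
                deques[w].append(half)
            else:
                leaves_done += 1

        # Phase 2: each worker that was idle picks a random victim and
        # steals the oldest task from the other end of the victim's deque.
        for w in idle:
            victim = rng.choice([v for v in range(num_workers) if v != w])
            if deques[victim]:
                deques[w].append(deques[victim].popleft())
                steals += 1

    return leaves_done, steals, steps


if __name__ == "__main__":
    print(simulate_work_stealing(64, 1))
    print(simulate_work_stealing(64, 4))
```

Stealing from the opposite end of the deque tends to move the largest remaining subproblems, which is one common motivation for this design; the simulation makes no claim about the bounds proved in the paper.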
Executing Multithreaded Programs Efficiently
, 1995
Cited by 62 (7 self)
Efficient load balancing for wide-area divide-and-conquer applications
- In: Proc. PPoPP’01, Snowbird, UT (2001)
Cited by 46 (16 self)
Divide-and-conquer programs are easily parallelized by letting the programmer annotate potential parallelism in the form of spawn and sync constructs. To achieve efficient program execution, the generated work load has to be balanced evenly among the available CPUs. For single cluster systems, Random Stealing (RS) is known to achieve optimal load balancing. However, RS is inefficient when applied to hierarchical wide-area systems where multiple clusters are connected via wide-area networks (WANs) with high latency and low bandwidth. In this paper, we experimentally compare RS with existing load-balancing strategies that are believed to be efficient for multi-cluster systems, Random Pushing and two variants of Hierarchical Stealing. We demonstrate that, in practice, they obtain less than optimal results. We introduce a novel load-balancing algorithm, Cluster-aware Random Stealing (CRS), which is highly efficient and easy to implement. CRS adapts itself to network conditions and job granularities, and does not require manually-tuned parameters. Although CRS sends more data across the WANs, it is faster than its competitors for 11 out of 12 test applications with various WAN configurations. It has at most 4% overhead in run time compared to RS on a single, large cluster, even with high wide-area latencies and low wide-area bandwidths. These strong results suggest that divide-and-conquer parallelism is a useful model for writing distributed supercomputing applications on hierarchical wide-area systems.
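The spawn and sync annotations mentioned at the start of this abstract can be illustrated with a minimal sketch. The helpers `spawn` and `sync` below are hypothetical stand-ins built on Python threads, not the API of any system listed here, and the `cutoff` parameter is an assumption added to keep the number of threads small. In CPython the threads do not run Python code on several cores at once, so the example shows only the program structure, not a speedup.

```python
import threading


def spawn(fn, *args):
    """Start fn(*args) in a new thread; return a handle for sync()."""
    box = {}

    def run():
        box["value"] = fn(*args)

    thread = threading.Thread(target=run)
    thread.start()
    return thread, box


def sync(handle):
    """Wait for a spawned call to finish and return its result."""
    thread, box = handle
    thread.join()
    return box["value"]


def fib(n, cutoff=10):
    """Divide-and-conquer Fibonacci with one spawned branch per call."""
    if n < 2:
        return n
    if n <= cutoff:
        # Small subproblems run sequentially to limit spawn overhead.
        return fib(n - 1, cutoff) + fib(n - 2, cutoff)
    left = spawn(fib, n - 1, cutoff)   # potential parallelism
    right = fib(n - 2, cutoff)         # parent continues with other branch
    return sync(left) + right          # wait for the spawned child


if __name__ == "__main__":
    print(fib(15))
```

A load balancer such as the ones compared in the abstract would decide where each spawned branch runs; here every spawn simply gets its own thread.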
The Cilk System for Parallel Multithreaded Computing
, 1996
Cited by 39 (1 self)
Although cost-effective parallel machines are now commercially available, the widespread use of parallel processing is still being held back, due mainly to the troublesome nature of parallel programming. In particular, it is still difficult to build efficient implementations of parallel applications whose communication patterns are either highly irregular or dependent upon dynamic information. Multithreading has become an increasingly popular way to implement these dynamic, asynchronous, concurrent programs. Cilk (pronounced "silk") is our C-based multithreaded computing system that provides provably good performance guarantees. This thesis describes the evolution of the Cilk language and runtime system, and describes applications which affected the evolution of the system.
Randomized Load Balancing for Tree-structured Computation
- In Scalable High Performance Computing Conference
, 1994
Cited by 30 (7 self)
In this paper, we study the performance of a randomized algorithm for balancing load across a multiprocessor executing a dynamic irregular task tree. Specifically, we show that the time taken to explore a task tree is likely to be within a small constant factor of an inherent lower bound for the tree instance. Our model permits arbitrary task times and overlap between computation and load balance, and thus extends earlier work which assumed fixed cost tasks and used a bulk synchronous style in which the system alternated between distinct computing and load balancing steps. Our analysis is supported by experiments with application codes, demonstrating that the efficiency is high enough to make this method practical.
1 Introduction
In this paper we study a popular randomized strategy for load balancing dynamic tree-structured task graphs on large scale message passing multiprocessors. First, we show analytically that with high probability, the randomized strategy results in parallel run...
Satin: Simple and efficient Java-based grid programming
- In AGridM 2003 Workshop on Adaptive Grid Middleware
, 2005
Cited by 28 (9 self)
Grid programming environments need to be both portable and efficient to exploit the computational power of dynamically available resources. In previous work, we have presented the divide-and-conquer based Satin model for parallel computing on clustered wide-area systems. In this paper, we present the Satin implementation on top of our new Ibis platform which combines Java’s "write once, run everywhere" with efficient communication between JVMs. We evaluate Satin/Ibis on the testbed of the EU-funded GridLab project, showing that Satin’s load-balancing algorithm automatically adapts both to heterogeneous processor speeds and varying network performance, resulting in efficient utilization of the computing resources. Our results show that when the wide-area links suffer from congestion, Satin’s load-balancing algorithm can still achieve around 80% efficiency, while an algorithm that is not grid aware drops to 26% or less.
Programming Environments for High-Performance Grid Computing: the Albatross Project
, 2002
Cited by 12 (2 self)
The aim of the Albatross project is to study applications and programming environments for computational Grids. We focus on high performance applications, running in parallel on multiple clusters or MPPs that are connected by wide-area networks (WANs). We briefly present three Grid programming environments developed in the context of the Albatross project: the MagPIe library for collective communication with MPI, the Replicated Method Invocation mechanism for Java (RepMI), and the Java-based Satin system for running divide-and-conquer programs on Grid platforms.
Competitive Implementation of Parallel Programs
Cited by 11 (3 self)
We apply the methodology of competitive analysis of algorithms to the implementation of programs on parallel machines. We consider the problem of finding the best on-line distributed scheduling strategy that executes in parallel an unknown directed acyclic graph
Network-Based Multicomputers: A Practical Supercomputer Architecture
- IEEE Transactions on Parallel and Distributed Systems
, 1996
Cited by 8 (0 self)
Multicomputers built around a general network are an attractive architecture for a wide class of applications. The architecture provides many benefits compared with special-purpose approaches, including heterogeneity, reuse of application and system code, and sharing of resources. The architecture also poses new challenges to both computer system implementors and users. First, traditional local-area networks do not have enough bandwidth and create a communication bottleneck, thus seriously limiting the set of applications that can be run effectively. Second, programmers have to deal with large bodies of code distributed over a variety of architectures, and work in an environment where both the network and nodes are shared with other users. Our experience in the Nectar project shows that it is possible to overcome these problems. We show how networks based on high-speed crossbar switches and efficient protocol implementations can support high bandwidth and low latency communication while still enjoying the flexibility of general networks, and we use three applications to demonstrate that network-based multicomputers are a practical architecture. We also show how the network traffic generated by this new class of applications poses severe requirements for networks.

