StackThreads/MP: Integrating Futures into Calling Standards
PPOPP'99, 1999
Cited by 19 (5 self)
An implementation scheme of fine-grain multithreading that needs no changes to current calling standards for sequential languages and modest extensions to sequential compilers is described. Like previous similar systems, it performs an asynchronous call as if it were an ordinary procedure call, and detaches the callee from the caller when the callee suspends or either of them migrates to another processor. Unlike previous similar systems, it detaches and connects arbitrary frames generated by off-the-shelf sequential compilers obeying calling standards. As a consequence, it requires neither a frontend preprocessor nor a native code generator that has a builtin notion of parallelism. The system practically works with the unmodified GNU C compiler (GCC). Desirable extensions to sequential compilers for guaranteeing portability and correctness of the scheme are clarified and claimed modest. Experiments indicate that sequential performance is not sacrificed for practical applications and both sequential and parallel performance are comparable to Cilk[B], whose current implementation requires a fairly sophisticated preprocessor to C. These results show that efficient asynchronous calls (a.k.a. future calls) can be integrated into current calling standards with a very small impact both on sequential performance and compiler engineering.
An Effective Garbage Collection Strategy for Parallel Programming Languages on Large Scale Distributed-Memory Machines
1997
Cited by 13 (2 self)
This paper describes the design and implementation of a garbage collection scheme on large-scale distributed-memory computers and reports various experimental results. The collector is based on the conservative GC library by Boehm & Weiser. Each processor traces local pointers using the GC library while traversing remote pointers by exchanging "mark messages" between processors. It exhibits promising performance: in the most space-intensive settings we tested, the total collection overhead ranges from 5% up to 15% of the application running time (excluding idle time). We not only examine basic performance figures such as the total overhead or latency of a global collection, but also demonstrate how local collection scheduling strategies affect application performance. In our collector, a local collection is scheduled either independently or synchronously. Experimental results show that the benefit of independent local collections has been overstated in the literature. Independent l...
An Architecture for Highly Concurrent, Well-Conditioned Internet Services
2002
Cited by 12 (1 self)
By Matthew David Welsh. Doctor of Philosophy in Computer Science, University of California at Berkeley. Professor David Culler, Chair.
This dissertation presents an architecture for handling the massive concurrency and load conditioning demands of busy Internet services. Our thesis is that existing programming models and operating system structures do not adequately meet the needs of complex, dynamic Internet servers, which must support extreme concurrency (on the order of tens of thousands of client connections) and experience load spikes that are orders of magnitude greater than the average. We propose a new software framework, called the staged event-driven architecture (or SEDA), in which applications are constructed as a network of event-driven stages connected with explicit queues. In this model, each stage embodies a robust, reusable software component that performs a subset of request processing. By performing admission control on each event queue, the service can be well-conditioned to load, preventing resources from being overcommitted when demand exceeds service capacity. SEDA employs dynamic control to tune runtime parameters (such as the scheduling parameters of each stage) automatically, as well as to manage load, for example, by performing adaptive load shedding.
An Efficient Compilation Framework for Languages Based on a Concurrent Process Calculus
1997
Cited by 10 (7 self)
We propose a framework for compiling programming languages based on concurrent process calculi, in which computation is expressed by a combination of processes and communication channels. Our framework realizes compile-time process scheduling and unboxed channels. The compile-time scheduling enables us to execute multiple independent processes without a scheduling pool operation. Unboxed channels allow us to create a channel without memory allocations and to communicate values on registers. The framework is given as a set of translation rules from a concurrent calculus to an ML-like sequential program. Experimental results show that our compiler can execute sequential programs written in the process calculus only a few times slower than equivalent C programs. This indicates that pure process calculi like ours and programming languages based on them can be implemented efficiently, without losing their simplicity, purity, and elegance.
Performance Evaluation of OpenMP Applications with Nested Parallelism
2000
Cited by 9 (0 self)
Many existing OpenMP systems do not sufficiently implement nested parallelism. This is supposedly because nested parallelism is believed to require a significant implementation effort, incur a large overhead, or lack applications. This paper demonstrates Omni/ST, a simple and efficient implementation of OpenMP nested parallelism using StackThreads/MP, which is a fine-grain thread library. Thanks to StackThreads/MP, OpenMP parallel constructs are simply mapped onto thread creation primitives of StackThreads/MP, yet they are efficiently managed with a fixed number of threads in the underlying thread package (e.g., Pthreads). Experimental results on Sun Ultra Enterprise 10000 with up to 60 processors show that overhead imposed by nested parallelism is very small (1-3% in five out of six applications, and 8% for the other), and there is a significant scalability benefit for applications with nested parallelism.
Language and Virtual Machine Support for Efficient Fine-Grained Futures in Java
In International Conference on Parallel Architectures and Compilation Techniques (PACT), 2007
Cited by 6 (4 self)
In this work, we investigate the implementation of futures in Java J2SE v5.0. Java 5.0 provides an interface-based implementation of futures that enables users to encapsulate potentially asynchronous computation and to define their own execution engines for futures. Although this methodology decouples thread scheduling from application logic, for applications with fine-grained parallelism, this model imposes an undue burden on the average users and introduces significant performance overhead. To address these issues, we investigate the use of lazy futures and offer an alternative implementation to the Java 5.0 approach. In particular, we present a directive-based programming model for using futures in Java that uses annotations in Java 5.0 (as opposed to interfaces) and a lazy future implementation to significantly simplify programmer effort. Our directive-based future system employs novel compilation and runtime techniques that transparently and adaptively split and spawn futures for parallel execution. All such decisions are automatic and guided by dynamically determined future granularity and underlying resource availability. We empirically evaluate our future implementation using different Java Virtual Machine configurations and common Java benchmarks that implement fine-grained parallelism. We compare directive-based lazy futures with lazy and Java 5.0 futures and show that our approach is significantly more scalable.
An Implementation and Performance Evaluation of Language with Fine-Grain Thread Creation on Shared Memory Parallel Computer
1998
Cited by 4 (4 self)
We implemented two applications with irregular parallelism in (1) C and a thread library and (2) our concurrent language Schematic, which supports efficient fine-grain dynamic thread creation and its dynamic load balance. We compared the two approaches focusing on program description cost and performance. Schematic not only achieves common programming practices seen in C such as task queue management with much smaller description cost, but incorporates some advanced optimizations for synchronization such as inter-thread communication on register. The case study shows that Schematic can describe irregular applications more naturally and can achieve high performance: Schematic is executed about 2.8 times slower than C on sequential environment and its speedup on 64 processor environment is comparable to C.
A Comparative Analysis of Fine-Grain Threads Packages
Journal of Parallel and Distributed Computing, 2000
Cited by 2 (1 self)
The rising availability of multiprocessing platforms has increased the importance of providing programming models that allow users to express parallelism simply, portably, and efficiently. One popular way to write parallel programs is to use threads for concurrent sections of code. User-level threads packages allow programmers to implement multithreaded programs in which thread creation, thread management, and thread synchronization are relatively inexpensive. Fine-grain programs are multithreaded programs in which the work is divided into a large number of threads, where each thread contains a relatively small amount of work. The potential benefits of large numbers of threads include easier load balancing, better scalability, greater potential for overlapping communication and computation, and improved platform-independence. However, fine-grain programs are largely considered inefficient due to the overheads involved in managing numerous threads. In this paper, we survey several thre...
Dynamic Load Balancing Issues In The Earth Runtime System
1999
Cited by 2 (0 self)
Multithreading is a promising approach to address the problems inherent in multiprocessor systems, such as network and synchronization latencies. Moreover, the benefits of multithreading are not limited to loop-based algorithms but apply also to irregular parallelism. EARTH (Efficient Architecture for Running THreads) is a multithreaded model supporting fine-grain, non-preemptive threads. This model is supported by a C-based runtime system which provides the multithreaded environment for the execution of concurrent programs. This thesis describes the design and implementation of a set of dynamic load balancing algorithms, and an in-depth study of their behavior with divide-and-conquer, regular, and irregular classes of applications. The results described in this thesis are based on EARTH-SP2, an implementation of the EARTH program execution model on the IBM SP-2, a distributed memory multiprocessor system. The main results of this study are as follows: A randomizing load balance...
A SCOOPP Evaluation on Packing Parallel Objects in Run-time
Proceedings of the 4th International Meeting of Vector and Parallel Processing (VecPar-2000)
Cited by 1 (0 self)
system is a hybrid compile and run-time system. SCOOPP dynamically scales OO applications on a wide range of target platforms, including a novel feature to perform a run-time packing of excess parallel tasks. This communication details the methodology and policies to pack parallel objects into grains and method calls into messages. The SCOOPP evaluation focuses on a pipelined parallel algorithm, the Eratosthenes sieve, which may dynamically generate a large number of fine-grained parallel tasks and messages. This case study shows how the parallelism grain-size, both computational and communication, has a strong impact on performance and on the programmer burden. The presented performance results show that the SCOOPP methodology is feasible and the proposed policies achieve efficient portability results across several target platforms.

