Job Scheduling in Multiprogrammed Parallel Systems
, 1997
Cited by 145 (15 self)
Scheduling in the context of parallel systems is often thought of in terms of assigning tasks in a program to processors, so as to minimize the makespan. This formulation assumes that the processors are dedicated to the program in question. But when the parallel system is shared by a number of users, this is not necessarily the case. In the context of multiprogrammed parallel machines, scheduling refers to the execution of threads from competing programs. This is an operating system issue, involved with resource allocation, not a program development issue. Scheduling schemes for multiprogrammed parallel systems can be classified as single-level or two-level. Single-level scheduling combines the allocation of processing power with the decision of which thread will use it. Two-level scheduling decouples the two issues: first, processors are allocated to the job, and then the job's threads are scheduled using this pool of processors. The processors of a parallel system can be shared i...
Dag-consistent distributed shared memory
- In Proceedings of the 10th International Parallel Processing Symposium (IPPS
, 1996
Cited by 57 (13 self)
We introduce dag consistency, a relaxed consistency model for distributed shared memory which is suitable for multithreaded programming. We have implemented dag consistency in software for the Cilk multithreaded runtime system running on a Connection Machine CM5. Our implementation includes a dag-consistent distributed cactus stack for storage allocation. We provide empirical evidence of the flexibility and efficiency of dag consistency for applications that include blocked matrix multiplication, Strassen’s matrix multiplication algorithm, and a Barnes-Hut code. Although Cilk schedules the executions of these programs dynamically, their performances are competitive with statically scheduled implementations in the literature. We also prove that the number $F_P$ of page faults incurred by a user program running on $P$ processors can be related to the number $F_1$ of page faults running serially by the formula $F_P \le F_1 + 2Cs$, where $C$ is the cache size and $s$ is the number of thread migrations executed by Cilk’s scheduler.
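To make the page-fault relation concrete, the following minimal sketch evaluates the right-hand side of the bound, read here as F_P ≤ F_1 + 2Cs. The function name, its parameter names, and the sample numbers are hypothetical and chosen only for illustration; they are not measurements from the paper.

```python
def page_fault_upper_bound(serial_faults, cache_pages, migrations):
    """Return F_1 + 2*C*s, the upper bound on parallel page faults F_P.

    serial_faults -- F_1, page faults of the serial execution
    cache_pages   -- C, the cache size in pages
    migrations    -- s, thread migrations performed by the scheduler
    """
    if min(serial_faults, cache_pages, migrations) < 0:
        raise ValueError("all quantities must be non-negative")
    return serial_faults + 2 * cache_pages * migrations


# Hypothetical inputs: 10,000 serial faults, a 256-page cache, 40 migrations.
print(page_fault_upper_bound(10_000, 256, 40))  # 10000 + 2*256*40 = 30480
```

With no migrations (s = 0) the bound collapses to F_1, so the extra faults are attributed entirely to migrated threads, each of which adds at most 2C.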
The Cilk System for Parallel Multithreaded Computing
, 1996
Cited by 39 (1 self)
Although cost-effective parallel machines are now commercially available, the widespread use of parallel processing is still being held back, due mainly to the troublesome nature of parallel programming. In particular, it is still difficult to build efficient implementations of parallel applications whose communication patterns are either highly irregular or dependent upon dynamic information. Multithreading has become an increasingly popular way to implement these dynamic, asynchronous, concurrent programs. Cilk (pronounced "silk") is our C-based multithreaded computing system that provides provably good performance guarantees. This thesis describes the evolution of the Cilk language and runtime system, and describes applications which affected the evolution of the system.
LCM: Memory System Support for Parallel Language Implementation
- In Proceedings of the Sixth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS VI
, 1994
Cited by 36 (5 self)
Higher-level parallel programming languages can be difficult to implement efficiently on parallel machines. This paper shows how a flexible, compiler-controlled memory system can help achieve good performance for language constructs that previously appeared too costly to be practical. Our compiler-controlled memory system is called Loosely Coherent Memory (LCM). It is an example of a larger class of Reconcilable Shared Memory (RSM) systems, which generalize the replication and merge policies of cache-coherent shared-memory. RSM protocols differ in the action taken by a processor in response to a request for a location and the way in which a processor reconciles multiple outstanding copies of a location. LCM memory becomes temporarily inconsistent to implement the semantics of C** parallel functions efficiently. RSM provides a compiler with control over memory-system policies, which it can use to implement a language’s semantics, improve performance, or detect errors. We illustrate the first two points with LCM and our compiler for the data-parallel language C**.
Efficient system-enforced deterministic parallelism
- In Proceedings of the USENIX Symposium on Operating System Design and Implementation (OSDI
, 2010
Cited by 25 (1 self)
Deterministic execution offers many benefits for debugging, fault tolerance, and security. Current methods of executing parallel programs deterministically, however, often incur high costs, allow misbehaved software to defeat repeatability, and transform time-dependent races into input- or path-dependent races without eliminating them. We introduce a new parallel programming model addressing these issues, and use Determinator, a proof-of-concept OS, to demonstrate the model’s practicality. Determinator’s microkernel API provides only “shared-nothing” address spaces and deterministic interprocess communication primitives to make execution of all unprivileged code, well-behaved or not, precisely repeatable. Atop this microkernel, Determinator’s user-level runtime adapts optimistic replication techniques to offer a private workspace model for both thread-level and process-level parallel programming. This model avoids the introduction of read/write data races, and converts write/write races into reliably detected conflicts. Coarse-grained parallel benchmarks perform and scale comparably to nondeterministic systems, on both multicore PCs and across nodes in a distributed cluster.
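The private workspace idea in this abstract can be illustrated with a small toy model. The sketch below is not Determinator's API or implementation: `run_in_private_workspaces` and `WriteConflict` are hypothetical names, the shared state is a plain dictionary of immutable values, tasks run one after another, and a "write" is detected by comparing each workspace with the original state. It only shows how giving every task its own copy removes read/write races and turns write/write races into detectable conflicts at merge time.

```python
class WriteConflict(Exception):
    """Raised when two tasks change the same key (a write/write race)."""


def run_in_private_workspaces(shared, tasks):
    """Run each task on a private copy of `shared`, then merge the changes."""
    base = dict(shared)      # snapshot that every task starts from
    merged = dict(shared)    # result of merging all private workspaces
    writer_of = {}           # key -> index of the task that changed it
    for index, task in enumerate(tasks):
        workspace = dict(base)   # private copy: sibling writes are invisible
        task(workspace)
        for key, value in workspace.items():
            if key in base and base[key] == value:
                continue         # unchanged relative to the snapshot
            if key in writer_of:
                raise WriteConflict(
                    f"tasks {writer_of[key]} and {index} both wrote {key!r}")
            writer_of[key] = index
            merged[key] = value
    return merged


def bump_x(ws):
    ws["x"] = ws["x"] + 1


def derive_y(ws):
    ws["y"] = ws["x"] * 10   # reads the snapshot value of x, not bump_x's update


print(run_in_private_workspaces({"x": 1, "y": 0}, [bump_x, derive_y]))
# {'x': 2, 'y': 10}
```

Because `derive_y` always reads the snapshot, the result does not depend on the order in which the tasks run. Passing `[bump_x, bump_x]` instead raises `WriteConflict`, since both tasks change `x`.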
Portable High-Performance Programs
, 1999
Cited by 16 (0 self)
right notice and this permission notice are preserved on all copies.
The Parallel Asynchronous Recursion Model
- Third IEEE Symp. on Parallel and Distributed Processing
, 1992
Cited by 5 (2 self)
This extended abstract introduces and evaluates a new model of parallel computation, called the Parallel Asynchronous Recursion (PAR) model. This model offers distinct advantages to the program designer and the parallel machine architect, while avoiding some of the PRAM's shortcomings. The PAR model can be thought of as a procedural programming language augmented with a process control structure that can, in parallel, recursively fork independent processes and merge their results. The unique aspect of the PAR model lies in its memory semantics, which differ substantially from both global and distributed memory models. It provides a high level of abstraction that removes the tasks of explicit processor scheduling and synchronization. Efficient simulations of the PAR model on well established models confirm that the PAR model's advantages can be obtained at a reasonable cost. 1. Introduction Though the PRAM is a widely studied model of parallel computation, it is not universally accepte...
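The control structure described above, recursively forking independent processes in parallel and merging their results, can be sketched as follows. This is only an illustration of the fork/merge shape using Python threads; it does not model the PAR memory semantics, which the abstract says differ from both global and distributed memory models. The function `par_sum` and its `depth` parameter are hypothetical names introduced for this example.

```python
import threading


def par_sum(values, depth=2):
    """Sum `values` by recursively forking two children and merging results."""
    if depth == 0 or len(values) <= 1:
        return sum(values)            # small enough: compute directly
    mid = len(values) // 2
    results = [None, None]            # one result slot per child

    def child(slot, part):
        # Each child works on its own slice and writes only its own slot.
        results[slot] = par_sum(part, depth - 1)

    threads = [
        threading.Thread(target=child, args=(0, values[:mid])),
        threading.Thread(target=child, args=(1, values[mid:])),
    ]
    for t in threads:                 # fork
        t.start()
    for t in threads:                 # wait for both children
        t.join()
    return results[0] + results[1]    # merge


print(par_sum(list(range(1, 101))))  # 5050
```

The caller never schedules or synchronizes the children explicitly beyond the fork and the merge, which is the level of abstraction the abstract attributes to the model.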
Trinitis: Implementing an OpenMP Execution Environment on Infiniband Clusters
- In Proc. of IWOMP’2005 (2005
, 2005
Cited by 3 (1 self)
Cluster systems interconnected via fast interconnection networks have been successfully applied to various research fields for parallel execution of large applications. Next to MPI, the conventional programming model, OpenMP is increasingly used for parallelizing sequential codes. Due to its easy programming interface and semantics similar to traditional programming languages, OpenMP is especially appropriate for non-professional users. For exploiting scalable parallel computation, we have established a PC cluster using InfiniBand, a high-performance, de facto standard interconnection technology. In order to support the users with a simple parallel programming model, we have implemented an OpenMP execution environment on top of this cluster. As a global memory abstraction is needed for shared data, we first built a software distributed shared memory implementing a kind of Home-based Lazy Release Consistency protocol. We then modified an existing OpenMP source-to-source compiler for mapping shared data on this DSM and for handling issues with respect to process/thread activities and task distribution. Experimental results based on a set of different OpenMP applications show a speedup of up to 5.22 on systems with 6 processor nodes.
Aurora: Scoped Behaviour for Per-Context Optimized Distributed Data Sharing
- In Proc. of the 11th Int’l Parallel Processing Symp. (IPPS’97
, 1997
Cited by 2 (0 self)
We introduce the all-software, standard C++-based Aurora distributed shared data system. As with related systems, it provides a shared data abstraction on distributed memory hardware. An innovation in Aurora is the use of scoped behaviour for per-context data sharing optimizations (i.e., portion of source code, such as a loop or phase). With scoped behaviour, a new language scope (e.g., nested braces) can be used to optimize the data sharing behaviour of the selected source code. Different scopes and different shared data can be optimized in different ways. Thus, scoped behaviour provides a novel level of flexibility to incrementally tune the parallel performance of an application. 1. Introduction Parallel programming systems based on shared memory and shared data models are becoming more popular and widespread. Accessing local and remote data using the same programming interface (i.e., reads and writes) is often more convenient than mixing local accesses with message passing. Conseq...

