A Lock-Free Multiprocessor OS Kernel
1991
Cited by 86 (2 self)
Typical shared-memory multiprocessor OS kernels use interlocking, implemented as spinlocks or waiting semaphores. We have implemented a complete multiprocessor OS kernel (including threads, virtual memory, and I/O including a window system and a file system) using only lock-free synchronization methods based on Compare-and-Swap. Lock-free synchronization avoids many serious problems caused by locks: considerable overhead, concurrency bottlenecks, deadlocks, and priority inversion in real-time scheduling. Measured numbers show the low overhead of our implementation, competitive with user-level thread management systems.
Contents: 1 Introduction; 2 Synchronization in OS Kernels; 2.1 Disabling Interrupts; 2.2 Locking Synchronization Methods; 2.3 Lock-Free Synchronization Methods; 3 Lock-Free Quajects; 3.1 LIFO ...
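The Compare-and-Swap retry pattern that this abstract relies on can be sketched with a LIFO stack. This is a minimal illustration, not the paper's kernel code. Python exposes no hardware Compare-and-Swap, so the `AtomicRef` class below is a hypothetical stand-in that emulates the primitive with an internal guard. Only the read, prepare, CAS, retry loop in `push` and `pop` reflects the technique. The names `AtomicRef`, `LockFreeStack`, and `run_demo` are assumptions made for this example.

```python
import threading


class AtomicRef:
    """Emulated atomic cell (assumption: stands in for a hardware CAS word)."""

    def __init__(self, value=None):
        self._value = value
        self._guard = threading.Lock()  # emulation only; real CAS is one instruction

    def load(self):
        return self._value

    def compare_and_swap(self, expected, new):
        """Store `new` only if the cell still holds `expected`; report success."""
        with self._guard:
            if self._value is expected:
                self._value = new
                return True
            return False


class Node:
    def __init__(self, item, next_node):
        self.item = item
        self.next = next_node


class LockFreeStack:
    """LIFO stack whose operations retry a CAS on the top pointer."""

    def __init__(self):
        self._top = AtomicRef(None)

    def push(self, item):
        while True:
            old_top = self._top.load()
            node = Node(item, old_top)
            if self._top.compare_and_swap(old_top, node):
                return  # nobody changed the top in between
            # otherwise another thread won the race; retry with the fresh top

    def pop(self):
        while True:
            old_top = self._top.load()
            if old_top is None:
                return None  # empty stack
            if self._top.compare_and_swap(old_top, old_top.next):
                return old_top.item


def run_demo(num_threads=4, per_thread=1000):
    """Push distinct integers from several threads, then drain the stack."""
    stack = LockFreeStack()

    def worker(base):
        for i in range(per_thread):
            stack.push(base + i)

    threads = [threading.Thread(target=worker, args=(t * per_thread,))
               for t in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    popped = []
    while True:
        item = stack.pop()
        if item is None:
            break
        popped.append(item)
    return popped
```

Because Python's garbage collector never reuses a node that is still referenced, this sketch sidesteps the ABA problem that an implementation with manual memory management would have to address.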
Lock-Free Linked Lists Using Compare-and-Swap
In Proceedings of the Fourteenth Annual ACM Symposium on Principles of Distributed Computing, 1995
Cited by 84 (1 self)
Lock-free data structures implement concurrent objects without the use of mutual exclusion. This approach can avoid performance problems due to unpredictable delays while processes are within critical sections. Although universal methods are known that give lock-free data structures for any abstract data type, the overhead of these methods makes them inefficient when compared to conventional techniques using mutual exclusion, such as spin locks. We give lock-free data structures and algorithms for implementing a shared singly-linked list, allowing concurrent traversal, insertion, and deletion by any number of processes. We also show how the basic data structure can be used as a building block for other lock-free data structures. Our algorithms use the single word Compare-and-Swap synchronization primitive to implement the linked list directly, avoiding the overhead of universal methods, and are thus a practical alternative to using spin locks. 1 Introduction A concurrent object is an...
Provably efficient scheduling for languages with fine-grained parallelism
In Proc. Symposium on Parallel Algorithms and Architectures, 1995
Cited by 68 (22 self)
Many high-level parallel programming languages allow for fine-grained parallelism. As in the popular work-time framework for parallel algorithm design, programs written in such languages can express the full parallelism in the program without specifying the mapping of program tasks to processors. A common concern in executing such programs is to schedule tasks to processors dynamically so as to minimize not only the execution time, but also the amount of space (memory) needed. Without careful scheduling, the parallel execution on p processors can use a factor of p or larger more space than a sequential implementation of the same program. This paper first identifies a class of parallel schedules that are provably efficient in both time and space. For any
Filaments: Efficient Support for Fine-Grain Parallelism
1993
Cited by 34 (5 self)
It has long been thought that coarse-grain parallelism is much more efficient than fine-grain parallelism due to the overhead of process (thread) creation, context switching, and synchronization. On the other hand, there are several advantages to fine-grain parallelism: architecture independence, ease of programming, ease of use as a target for code generation, and load-balancing potential. This paper describes a portable threads package, Filaments, that supports efficient execution of fine-grain parallel programs on shared-memory multiprocessors. Filaments supports three kinds of threads: run-to-completion, barrier (iterative), and fork/join, which appear to be sufficient for scientific computations. Filaments employs a unique combination of techniques to achieve efficiency: stateless threads, very small thread descriptors, optimized barrier synchronization, scheduling that enhances data locality, and automatic pruning of fork/join threads. The gains in performance are such that ...
Space-Efficient Scheduling of Nested Parallelism
ACM Transactions on Programming Languages and Systems, 1999
Cited by 28 (5 self)
This article presents an on-line scheduling algorithm that is provably space efficient and time efficient for nested-parallel languages. For a computation with depth $D$ and serial space requirement $S_1$, the algorithm generates a schedule that requires at most $S_1 + O(K \cdot D \cdot p)$ space (including scheduler space) on $p$ processors. Here, $K$ is a user-adjustable runtime parameter specifying the net amount of memory that a thread may allocate before it is preempted by the scheduler. Adjusting the value of $K$ provides a trade-off between the running time and the memory requirement of a parallel computation. To allow the scheduler to scale with the number of processors, we also parallelize the scheduler and analyze the space and time bounds of the computation to include scheduling costs. In addition to showing that the scheduling algorithm is space and time efficient in theory, we demonstrate that it is effective in practice. We have implemented a runtime system that uses our algorithm to schedule lightweight parallel threads. The results of executing parallel programs on this system show that our scheduling algorithm significantly reduces memory usage compared to previous techniques, without compromising performance.
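The role of the memory threshold K can be illustrated with a rough single-thread sketch. It is not the article's scheduler. The function name `count_preemptions`, the trace format, and the rule that the quota restarts after each preemption are assumptions made here for illustration only.

```python
def count_preemptions(allocation_trace, quota_k):
    """Count how often one thread would be preempted under a memory quota.

    allocation_trace: net bytes per step (positive = allocate, negative = free).
    quota_k: the threshold K; assumed here to start afresh after each preemption.
    """
    if quota_k <= 0:
        raise ValueError("quota_k must be positive")
    net_allocated = 0
    preemptions = 0
    for delta in allocation_trace:
        net_allocated += delta
        if net_allocated >= quota_k:
            preemptions += 1   # scheduler takes over the processor
            net_allocated = 0  # quota restarts when the thread resumes
    return preemptions


# Ten allocations of 4 bytes each, tried under three different thresholds.
trace = [4] * 10
results = {k: count_preemptions(trace, k) for k in (8, 20, 100)}
```

In this toy trace, K = 8 yields 5 preemptions, K = 20 yields 2, and K = 100 yields none. That mirrors the trade-off the abstract describes: a small K means more frequent scheduling work, while a large K loosens the space bound.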
Coscheduling Based on Run-Time Identification of Activity Working Sets
International Journal of Parallel Programming
Cited by 26 (3 self)
This paper introduces a method for runtime identification of sets of interacting activities ("working sets") with the purpose of coscheduling them, i.e. scheduling them so that all the activities in the set execute simultaneously on distinct processors. The identification is done by monitoring access rates to shared communication objects: activities that access the same objects at a high rate thereby interact frequently, and therefore would benefit from coscheduling. Simulation results show that coscheduling with our runtime identification scheme can give better performance than uncoordinated scheduling based on a single global activity queue. The finer-grained the interactions among the activities in a working set, the better the performance differential. Moreover, coscheduling based on automatic runtime identification achieves about the same performance as coscheduling based on manual identification of working sets by the programmer. Keywords: coscheduling, gang scheduling, on-line ...
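The identification step described in this abstract (grouping activities that access the same communication objects at a high rate) can be sketched as follows. This is an illustrative reconstruction, not the paper's algorithm. The function name `identify_working_sets`, the input format, and the fixed `rate_threshold` are assumptions. A union-find structure is used here simply as one convenient way to merge activities that share an object.

```python
from collections import defaultdict


def identify_working_sets(access_counts, interval_seconds, rate_threshold):
    """Group activities that frequently access the same shared object.

    access_counts: {(activity, shared_object): accesses during the interval}
    Returns a sorted list of working sets, each a sorted list of activities.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Collect, per object, the activities whose access rate is high enough.
    frequent_users = defaultdict(list)
    for (activity, obj), count in access_counts.items():
        find(activity)  # register every activity, even infrequent ones
        if count / interval_seconds >= rate_threshold:
            frequent_users[obj].append(activity)

    # Activities sharing an object at a high rate belong to one working set.
    for users in frequent_users.values():
        for other in users[1:]:
            union(users[0], other)

    groups = defaultdict(set)
    for activity in list(parent):
        groups[find(activity)].add(activity)
    return sorted(sorted(group) for group in groups.values())
```

Each returned set is a candidate for coscheduling. An activity that touches a shared object only rarely ends up in a singleton set and can be scheduled independently.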
Balancing Processor Loads and Exploiting Data Locality in N-Body Simulations
In Proceedings of Supercomputing '95 (CD-ROM), 1995
Cited by 26 (11 self)
Although N-body simulation algorithms are amenable to parallelization, performance gains from execution on parallel machines are difficult to obtain due to load imbalances caused by irregular distributions of bodies. In general, there is a tension between balancing processor loads and maintaining locality, as the dynamic re-assignment of work necessitates access to remote data. Fractiling is a dynamic scheduling scheme that simultaneously balances processor loads and maintains locality by exploiting the self-similarity properties of fractals. Fractiling is based on a probabilistic analysis, and thus accommodates load imbalances caused by predictable phenomena, such as irregular data, and unpredictable phenomena, such as data-access latencies. In experiments on a KSR1, performance of N-body simulation codes was improved by as much as 53% by fractiling. Performance improvements were obtained on uniform and nonuniform distributions of bodies, underscoring the need for a scheduling schem...
On The Implementation And Effectiveness Of Autoscheduling For Shared-Memory Multiprocessors
1995
Cited by 16 (2 self)
[Figure 3.4: HPF approach to data partition and distribution.]
... states that iteration i is to be executed by the processor to which A(i) is assigned. Therefore processor p1 executes iterations {1, 2, 3, 4}. The ON clause is a feature borrowed from the language Kali [25].
3.1.3 HPF
The High Performance Fortran (HPF) [6, 26, 27] language was designed as a set of extensions and modifications to Fortran 90 to support data parallel programming. The ability to achieve top performance on MIMD and SIMD computers with nonuniform memory access was one of the main goals of the project. The design of HPF was influenced by Fortran D and Vienna Fortran [28, 29]. Just as Fortran D approaches the problem of data partitioning and distribution in two stages, HPF uses three. First, arrays are aligned to each other. Second, arrays are distributed across a user-defined rectilinear arrangement of abstract processo...
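The rule quoted in this excerpt (iteration i runs on the processor that owns A(i)) can be sketched for a block distribution. The array size of 16 and the processor count of 4 are assumptions chosen so that the first processor owns elements 1 to 4, matching the excerpt. The helper names `block_owner` and `iterations_on` are hypothetical.

```python
def block_owner(index, n_elements, n_procs):
    """Map a 1-based array index to its 1-based owning processor.

    Assumes a block distribution with block size ceil(n_elements / n_procs).
    """
    block_size = -(-n_elements // n_procs)  # ceiling division
    return (index - 1) // block_size + 1


def iterations_on(proc, n_elements, n_procs):
    """List the loop iterations that `proc` executes when iteration i
    is assigned to the owner of A(i)."""
    return [i for i in range(1, n_elements + 1)
            if block_owner(i, n_elements, n_procs) == proc]


# Assumed example: A(1:16) distributed in blocks over 4 processors.
first_proc_iterations = iterations_on(1, 16, 4)
```

Under these assumptions, processor 1 executes iterations 1 through 4, and each remaining processor receives the next contiguous block of four.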
ParC - An Extension of C for Shared Memory Parallel Processing
Cited by 15 (11 self)
In this paper we describe the features and semantics of ParC. The rest of this section explains the motivation for designing a new language, the effect of the motivating forces on the design, and the structure of the software environment that surrounds it. The next section describes the parallel constructs and scoping rules. The exact semantics of parallel constructs when there are more activities than processors have been widely neglected in the literature. We discuss this issue and provide guidelines for acceptable implementations. We then describe the innovative instructions for forced termination, which are based on analogies with C instructions that break out of a construct, followed by a discussion of synchronization mechanisms. A discussion of the programming methodology of ParC is then given and is followed by a discussion of our experiences with ParC. A comparison of ParC with other parallel programming languages is delayed until the end of the paper, after we have described all of its features.
Space-Efficient Implementation of Nested Parallelism
In Proceedings of the Sixth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 1996
Cited by 15 (4 self)
Many of today's high level parallel languages support dynamic, fine-grained parallelism. These languages allow the user to expose all the parallelism in the program, which is typically of a much higher degree than the number of processors. Hence an efficient scheduling algorithm is required to assign computations to processors at runtime. Besides having low overheads and good load balancing, it is important for the scheduling algorithm to minimize the space usage of the parallel program. This paper presents a scheduling algorithm that is provably space-efficient and time-efficient for nested parallel languages. In addition to proving the space and time bounds of the parallel schedule generated by the algorithm, we demonstrate that it is efficient in practice. We have implemented a runtime system that uses our algorithm to schedule parallel threads. The results of executing parallel programs on this system show that our scheduling algorithm significantly reduces memory usage compared to...

