Results 1 -
8 of
8
Scheduler Activations: Effective Kernel Support for the User-Level Management of Parallelism
- ACM Transactions on Computer Systems
, 1992
"... Threads are the vehicle,for concurrency in many approaches to parallel programming. Threads separate the notion of a sequential execution stream from the other aspects of traditional UNIX-like processes, such as address spaces and I/O descriptors. The objective of this separation is to make the expr ..."
Abstract
-
Cited by 420 (18 self)
- Add to MetaCart
Threads are the vehicle,for concurrency in many approaches to parallel programming. Threads separate the notion of a sequential execution stream from the other aspects of traditional UNIX-like processes, such as address spaces and I/O descriptors. The objective of this separation is to make the expression and control of parallelism sufficiently cheap that the programmer or compiler can exploit even fine-grained parallelism with acceptable overhead. Threads can be supported either by the operating system kernel or by user-level library code in the application address space, but neither approach has been fully satisfactory. This paper addresses this dilemma. First, we argue that the performance of kernel threads is inherently worse than that of user-level threads, rather than this being an artifact of existing implementations; we thus argue that managing par- allelism at the user level is essential to high-performance parallel computing. Next, we argue that the lack of system integration exhibited by user-level threads is a consequence of the lack of kernel support for user-level threads provided by contemporary multiprocessor operating systems; we thus argue that kernel threads or processes, as currently conceived, are the wrong abstraction on which to support user- level management of parallelism. Finally, we describe the design, implementation, and performance of a new kernel interface and user-level thread package that together provide the same functionality as kernel threads without compromis- ing the performance and flexibility advantages of user-level management of parallelism.
Parallel Sorting by Regular Sampling
, 1992
"... A new parallel sorting algorithm suitable for MIMD multiprocessors is presented. The algorithm reduces memory and bus contention, which many parallel sorting algorithms suffer from, by using a regular sampling of the data to ensure good pivot selection. For n data elements to be sorted and p process ..."
Abstract
-
Cited by 93 (7 self)
- Add to MetaCart
A new parallel sorting algorithm suitable for MIMD multiprocessors is presented. The algorithm reduces memory and bus contention, which many parallel sorting algorithms suffer from, by using a regular sampling of the data to ensure good pivot selection. For n data elements to be sorted and p processors, when n p 3 the algorithm is shown to be asymptotically optimal. In theory, the algorithm is within a factor of two of achieving ideal load balancing. In practice, there is almost perfect partitioning of work. On a variety of shared and distributed memory machines, the algorithm achieves better than half-linear speedups. - 1. Introduction Sorting is one of most studied problems in computer science because of its theoretical interest and practical importance. With the advent of parallel processing, parallel sorting has become an important area for algorithm research. Although considerable work has been done on the theory of parallel sorting and efficient implementations on SIMD arch...
Parallel Performance Prediction using Lost Cycles Analysis
- IN PROCEEDINGS OF SUPERCOMPUTING '94
, 1994
"... Most performance debugging and tuning of parallel programs is based on the "measure-modify" approach, which is heavily dependent on detailed measurements of programs during execution. This approach is extremely time-consuming and does not lend itself to predicting performance under varying condition ..."
Abstract
-
Cited by 62 (1 self)
- Add to MetaCart
Most performance debugging and tuning of parallel programs is based on the "measure-modify" approach, which is heavily dependent on detailed measurements of programs during execution. This approach is extremely time-consuming and does not lend itself to predicting performance under varying conditions. Analytic modeling and scalability analysis provide predictive power, but are not widely used inpractice, due primarily to their emphasis on asymptotic behavior and the difficulty of developing accurate models that work for real-world programs. In this paper we describe a set of tools for performance tuning of parallel programs that bridges this gap between measurement and modeling. Our approach is based on lost cycles analysis, which involves measurement and modeling of all sources of overhead in a parallel program. We first describe a tool for measuring overheads in parallel programs that we have incorporated into the runtime environment for Fortran programs on the Kendall Square KSR1. We then describe a tool that ts these overhead measurements to analytic forms. We illustrate the use of these tools by analyzing the performance tradeoffs among parallel implementations of 2D FFT. These examples show how our tools enable programmers to develop accurate performance models of parallel applications without requiring extensive performance modeling expertise.
Implementing Lock-Free Queues
- In Proceedings of the Seventh International Conference on Parallel and Distributed Computing Systems, Las Vegas, NV
, 1994
"... We study practical techniques for implementing the FIFO queue abstract data type using lock-free data structures, which synchronize the operations of concurrent processes without the use of mutual exclusion. Two new algorithms based on linked lists and arrays are presented. We also propose a new sol ..."
Abstract
-
Cited by 49 (1 self)
- Add to MetaCart
We study practical techniques for implementing the FIFO queue abstract data type using lock-free data structures, which synchronize the operations of concurrent processes without the use of mutual exclusion. Two new algorithms based on linked lists and arrays are presented. We also propose a new solution to the ABA problem associated with the Compare&Swap instruction. The performance of our linked list algorithm is compared several other lock-free queue implementations, as well as more conventional locking techniques. 1 Introduction Concurrent access to a data structure shared among several processes must be synchronized in order to avoid conflicting updates. Conventionally this is done using mutual exclusion; processes modify the data structure only inside a critical section of code, within which the process is guaranteed exclusive access to the data structure. Typically, on a multiprocessor, critical sections are guarded with a spin-lock . We will refer to all methods using mutual ...
Maximizing Speedup Through Self-Tuning of Processor Allocation
- IN PROCEEDINGS OF THE 10TH INTERNATIONAL PARALLEL PROCESSING SYMPOSIUM
, 1996
"... We address the problem of maximizing the speedup of an individual parallel job through the selection of an appropriate number of processors on which to run it. If a parallel job exhibits speedup that increases monotonically in the number of processors, the solution is clearly to make use of all av ..."
Abstract
-
Cited by 31 (2 self)
- Add to MetaCart
We address the problem of maximizing the speedup of an individual parallel job through the selection of an appropriate number of processors on which to run it. If a parallel job exhibits speedup that increases monotonically in the number of processors, the solution is clearly to make use of all available processors. However, many applications do not have this characteristic: they reach a point beyond which the use of additional processors degrades performance. For these applications, it is important to choose a processor allocation carefully. Our approach to this problem is to provide a runtime system that adjusts the number of processors used by the application based on dynamic measurements of performance gathered during its execution. Our runtime system has a number of advantages over user specified fixed allocations, the currently most common approach to this problem: (1) we are resilient to changes in an application's speedup behavior due to the input data; and (2) we are...
Using Abstraction in Explicitly Parallel Programs
- Dept. of Electrical Engineering and Computer Science, MIT
, 1990
"... ion in Explicitly Parallel Programs by Katherine Anne Yelick c fl Massachusetts Institute of Technology, 1990 This report is a revised version of the author's thesis, which was submitted to the Department of Electrical Engineering and Computer Science on December 31, 1990 in partial fulfillment of ..."
Abstract
-
Cited by 8 (5 self)
- Add to MetaCart
ion in Explicitly Parallel Programs by Katherine Anne Yelick c fl Massachusetts Institute of Technology, 1990 This report is a revised version of the author's thesis, which was submitted to the Department of Electrical Engineering and Computer Science on December 31, 1990 in partial fulfillment of the requirements for the degree of Doctor of Philosophy at the Massachusetts Institute of Technology. The thesis was supervised by John V. Guttag. The author's current address is the Computer Science Division, University of California, Berkeley, CA 94720. 2 Abstract It is well-known that writing parallel programs that are both fast and correct is significantly harder than writing sequential ones. In this thesis we introduce a transition-based approach to the design and implementation of parallel programs. This approach is aimed at applications whose complex data and control structures make them hard to parallelize by conventional means. It is based on a programming model with explicit pa...
Searching Game Trees in Parallel *
, 1990
"... We present a new parallel game-tree search algorithm. Our approach classifies a processor's available work as either mandatory (necessary for the solution) or speculative (may be necessary for the solution). Due to the nature of parallel game tree search, it is not possible to keep all processors bu ..."
Abstract
-
Cited by 6 (0 self)
- Add to MetaCart
We present a new parallel game-tree search algorithm. Our approach classifies a processor's available work as either mandatory (necessary for the solution) or speculative (may be necessary for the solution). Due to the nature of parallel game tree search, it is not possible to keep all processors busy with mandatory work. Our algorithm ER allows potential speculative work to be dynamically ordered, thereby reducing starvation without incurring an equivalent increase in speculative loss. Measurements of ER's performance on both random trees and trees from an actual game show that at least 16 processors can be applied profitably to a single search. These results contrast with previously published studies, which report a rapid drop-off of efficiency as the number of processors increases. 1. Introduction Game playing programs require a great deal of computation and would thus appear to be an ideal candidate for parallel implementation. In fact, current champion chess-playing programs gene...
Stepwise Refinement of Reactive Processor Farms
, 1991
"... A processor farm is a distributed system that consists of a unique master pro- cessor together with a number of identical slave processors. The master processor interacts with some environment that generates tasks to be solved by the slave processors. The processors are connected via some communi ..."
Abstract
-
Cited by 1 (0 self)
- Add to MetaCart
A processor farm is a distributed system that consists of a unique master pro- cessor together with a number of identical slave processors. The master processor interacts with some environment that generates tasks to be solved by the slave processors. The processors are connected via some communication network that takes care of the communication of the tasks between the master and the slaves.

