Results 1 - 10 of 22
Using Processor Affinity in Loop Scheduling on Shared-Memory Multiprocessors
- IEEE Transactions on Parallel and Distributed Systems, 1994
- Cited by 133 (2 self)
Loops are the single largest source of parallelism in many applications. One way to exploit this parallelism is to execute loop iterations in parallel on different processors. Previous approaches to loop scheduling attempt to achieve the minimum completion time by distributing the workload as evenly as possible, while minimizing the number of synchronization operations required. In this paper we consider a third dimension to the problem of loop scheduling on shared-memory multiprocessors: communication overhead caused by accesses to non-local data. We show that traditional algorithms for loop scheduling, which ignore the location of data when assigning iterations to processors, incur a significant performance penalty on modern shared-memory multiprocessors. We propose a new loop scheduling algorithm that attempts to simultaneously balance the workload, minimize synchronization, and co-locate loop iterations with the necessary data. We compare the performance of this new algorithm to ot...
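The abstract is cut off before it describes the proposed algorithm, so the following is only a generic sketch of the idea it names: keep iterations near their data, and rebalance only when a processor runs dry. The function names `partition` and `next_chunk`, the `frac` parameter, and the steal-from-the-most-loaded policy are all assumptions made for illustration, not details taken from the paper.

```python
from collections import deque

def partition(n_iters, n_procs):
    """Split iterations 0..n_iters-1 into contiguous per-processor queues.

    Assumption: iteration i touches data that lives near the processor
    that owns i's block, so running i on its owner avoids remote accesses.
    """
    base, extra = divmod(n_iters, n_procs)
    queues, start = [], 0
    for p in range(n_procs):
        size = base + (1 if p < extra else 0)
        queues.append(deque(range(start, start + size)))
        start += size
    return queues

def next_chunk(queues, p, frac=2):
    """Return the next batch of iterations for processor p.

    p prefers its own queue (good locality). Only when that queue is
    empty does it take work from the most loaded processor (load balance).
    Taking a fraction of the queue, not one iteration, limits how often
    the queues must be synchronized.
    """
    if queues[p]:
        victim = p
    else:
        victim = max(range(len(queues)), key=lambda q: len(queues[q]))
    if not queues[victim]:
        return []  # no work left anywhere
    take = max(1, len(queues[victim]) // frac)
    if victim == p:
        return [queues[p].popleft() for _ in range(take)]
    # Steal from the far end so the owner keeps its nearest iterations.
    return [queues[victim].pop() for _ in range(take)]
```

With 10 iterations on 3 processors, processor 0 drains its own block in batches of 2, 1, and 1 before it takes anything from another queue.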
First-Class User-Level Threads
- In Proceedings of the Thirteenth ACM Symposium on Operating Systems Principles, 1991
- Cited by 114 (12 self)
It is often desirable, for reasons of clarity, portability, and efficiency, to write parallel programs in which the number of processes is independent of the number of available processors. Several modern operating systems support more than one process in an address space, but the overhead of creating and synchronizing kernel processes can be high. Many runtime environments implement lightweight processes (threads) in user space, but this approach usually results in second-class status for threads, making it difficult or impossible to perform scheduling operations at appropriate times (e.g., when the current thread blocks in the kernel). In addition, a lack of common assumptions may also make it difficult for parallel programs or library routines that use dissimilar thread packages to communicate with each other, or to synchronize access to shared data. We describe a set of kernel mechanisms and conventions designed to accord first-class status to user-level threads, allowing them to be used in any reasonable way that traditional kernel-provided processes can be used, while leaving the details of their implementation to user-level code. The key features of our approach are (1) shared memory for asynchronous communication between the kernel and the user, (2) software interrupts for events that might require action on the part of a user-level scheduler, and (3) a scheduler interface convention that facilitates interactions in user space between dissimilar kinds of threads. We have incorporated these mechanisms in the Psyche parallel operating system, and have used them to implement several different kinds of user-level threads. We argue for our approach in terms of both flexibility and performance.
Paradigms for process interaction in distributed programs
- ACM Computing Surveys, 1991
- Cited by 108 (0 self)
Distributed computations are concurrent programs in which processes communicate by message passing. Such programs typically execute on network architectures such as networks of workstations or distributed-memory parallel machines (i.e., multicomputers such as hypercubes). Several paradigms (examples or models) for process interaction
Distributed Filaments: Efficient Fine-Grain Parallelism on a Cluster of Workstations
- In First Symposium on Operating Systems Design and Implementation, 1994
- Cited by 71 (15 self)
A fine-grain parallel program is one in which processes are typically small, ranging from a few to a few hundred instructions. Fine-grain parallelism arises naturally in many situations, such as iterative grid computations, recursive fork/join programs, the bodies of parallel FOR loops, and the implicit parallelism in functional or dataflow languages. It is useful both to describe massively parallel computations and as a target for code generation by compilers. However, fine-grain parallelism has long been thought to be inefficient due to the overheads of process creation, context switching, and synchronization. This paper describes a software kernel, Distributed Filaments (DF), that implements fine-grain parallelism both portably and efficiently on a workstation cluster. DF runs on existing, off-the-shelf hardware and software. It has a simple interface, so it is easy to use. DF achieves efficiency by using stateless threads on each node, overlapping communication and computation, emp...
Multi-Model Parallel Programming In Psyche
- Proceedings of the Second ACM Symposium on Principles and Practice of Parallel Programming, 1990
- Cited by 37 (12 self)
Many different parallel programming models, including lightweight processes that communicate with shared memory and heavyweight processes that communicate with messages, have been used to implement parallel applications. Unfortunately, operating systems and languages designed for parallel programming typically support only one model. Multi-model parallel programming is the simultaneous use of several different models, both across programs and within a single program. This paper describes multi-model parallel programming in the Psyche multiprocessor operating system. We explain why multi-model programming is desirable and present an operating system interface designed to support it. Through a series of three examples, we illustrate how the Psyche operating system supports different models of parallelism and how the different models are able to interact.
Filaments: Efficient Support for Fine-Grain Parallelism
- 1993
- Cited by 34 (5 self)
It has long been thought that coarse-grain parallelism is much more efficient than fine-grain parallelism due to the overhead of process (thread) creation, context switching, and synchronization. On the other hand, there are several advantages to fine-grain parallelism: architecture independence, ease of programming, ease of use as a target for code generation, and load-balancing potential. This paper describes a portable threads package, Filaments, that supports efficient execution of fine-grain parallel programs on shared-memory multiprocessors. Filaments supports three kinds of threads, namely run-to-completion, barrier (iterative), and fork/join, which appear to be sufficient for scientific computations. Filaments employs a unique combination of techniques to achieve efficiency: stateless threads, very small thread descriptors, optimized barrier synchronization, scheduling that enhances data locality, and automatic pruning of fork/join threads. The gains in performance are such that ...
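The abstract lists "automatic pruning of fork/join threads" without explaining it. One common reading, assumed here, is that a fork is turned into an ordinary procedure call once enough parallel work already exists. The sketch below illustrates that reading with a fixed depth cutoff. The function `parallel_sum`, the `max_depth` parameter, and the use of Python's `threading` module are illustrative choices, not the Filaments interface or its actual pruning rule.

```python
import threading

def parallel_sum(data, lo, hi, depth=0, max_depth=2):
    """Sum data[lo:hi] by recursive fork/join with pruning.

    Above max_depth each split forks a real thread for the left half.
    At or below it, the "fork" is pruned into a plain recursive call,
    so thread-creation cost is paid only near the top of the tree.
    """
    if hi - lo <= 1:
        return data[lo] if hi > lo else 0
    mid = (lo + hi) // 2
    if depth >= max_depth:
        # Pruned: both halves run as ordinary calls in this thread.
        return (parallel_sum(data, lo, mid, depth + 1, max_depth)
                + parallel_sum(data, mid, hi, depth + 1, max_depth))
    result = {}

    def left_half():
        result["left"] = parallel_sum(data, lo, mid, depth + 1, max_depth)

    worker = threading.Thread(target=left_half)
    worker.start()                      # fork
    right = parallel_sum(data, mid, hi, depth + 1, max_depth)
    worker.join()                       # join
    return result["left"] + right
```

With `max_depth=2` at most three extra threads are created regardless of input size, while the recursion itself still reaches single elements.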
Scheduling and Resource Management Techniques for Multiprocessors
- 1990
- Cited by 32 (1 self)
and related areas. Application requirements motivated the major research areas, processor scheduling and non-uniform memory management, as these areas contain the most important problems raised by the changing design and use of multiprocessors.
Using fine-grain threads and run-time decision making in parallel computing
- Journal of Parallel and Distributed Computing, 1996
- Cited by 28 (12 self)
Programming distributed-memory multiprocessors and networks of workstations requires deciding what can execute concurrently, how processes communicate, and where data is placed. These decisions can be made statically by a programmer or compiler, or they can be made dynamically at run time. Using run-time decisions leads to a simpler interface, because decisions are implicit, and it can lead to better decisions, because more information is available. This paper examines the costs, benefits, and details of making decisions at run time. The starting point is explicit fine-grain parallelism with any number (even thousands) of threads. Five specific techniques are considered: (1) implicitly coarsening the granularity of parallelism, (2) using implicit communication implemented by a distributed shared memory, (3) overlapping computation and communication, (4) adaptively moving threads and data between nodes to minimize communication and balance load, and (5) dynamically remapping data to pages to avoid false sharing. Details are given on the performance of each of these techniques as well as their overall performance on several scientific applications.
Affinity Scheduling of Unbalanced Workloads
- 1994
- Cited by 20 (0 self)
Scheduling in a shared memory multiprocessor is often complicated by the fact that a unit of work may be processed more efficiently on one processor than on any other, due to factors such as the presence of required data in a local cache. The unit of work is said to have an "affinity" for the given processor, in such a case. The scheduling issue that has to be considered is the tradeoff between the goals of respecting processor affinities (so as to obtain improved efficiencies in execution) and of dynamically assigning each unit of work to whichever processor happens to be, at the time, least loaded (so as to obtain better load balance and decreased processor idle times). A specific context in which the above scheduling issue arises is that of shared memory multiprocessors with large, per-processor caches or cached main memories. The shared-memory programming paradigm of such machines permits the dynamic scheduling of work. The data required by a unit of work may, however, often reside mostly in the cache of one particular processor, to which that unit of work thus has affinity. In this paper, two new "affinity scheduling" algorithms are proposed for a context in which the units of work have widely varying execution times. An experimental study of these algorithms finds them to perform well in this context.
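The tradeoff described in this abstract can be made concrete with a small decision rule. This is a sketch under stated assumptions, not either of the two algorithms the paper proposes (the abstract does not describe them). The function `choose_processor` and the `penalty` parameter, which stands for the extra cost of running a unit of work away from its cached data, are hypothetical.

```python
def choose_processor(loads, affinity, penalty):
    """Pick a processor for one unit of work.

    loads    -- queued work per processor, in the same units as penalty
    affinity -- index of the processor whose cache holds the unit's data
    penalty  -- assumed extra cost of running anywhere else

    The unit stays on its affinity processor unless the queue there is
    longer than the least loaded queue by more than the penalty, in which
    case moving is cheaper despite the lost locality.
    """
    least = min(range(len(loads)), key=loads.__getitem__)
    if loads[affinity] <= loads[least] + penalty:
        return affinity
    return least
```

For example, with loads `[5, 1, 3]` and affinity for processor 0, a penalty of 2 sends the work to processor 1, while a penalty of 4 keeps it on processor 0.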
A Survey of Multiprocessor Operating System Kernels
- 1993
- Cited by 13 (3 self)
Multiprocessors have been accepted as vehicles for improved computing speeds, cost/performance, and enhanced reliability or availability. However, the added performance requirements of user programs and functional capabilities of parallel hardware introduce new challenges to operating system design and implementation. This paper reviews research and commercial developments in multiprocessor operating system kernels from the late 1970s to the early 1990s. The paper first discusses some common operating system structuring techniques and examines the advantages and disadvantages of using each technique. It then identifies some of the major design goals and key issues in multiprocessor operating systems. Issues and solution approaches are illustrated by review of a variety of research or commercial multiprocessor operating system kernels.
College of Computing, Georgia Institute of Technology, Atlanta, Georgia 30332-0280

