Results 1 - 10
of
33
Gang Scheduling Performance Benefits for Fine-Grain Synchronization
- Journal of Parallel and Distributed Computing
, 1992
"... Abstract Multiprogrammed multiprocessors executing fine-grained parallel programs appear to require new scheduling policies. A promising new idea is gang scheduling, where a set of threads are scheduled to execute simultaneously on a set of processors. This has the intuitive appeal of supplying the ..."
Abstract
-
Cited by 117 (12 self)
- Add to MetaCart
Abstract Multiprogrammed multiprocessors executing fine-grained parallel programs appear to require new scheduling policies. A promising new idea is gang scheduling, where a set of threads are scheduled to execute simultaneously on a set of processors. This has the intuitive appeal of supplying the threads with an environment that is very similar to a dedicated machine. It allows the threads to interact efficiently by using busy waiting, without the risk of waiting for a thread that currently is not running. Without gang scheduling, threads have to block in order to synchronize, thus suffering the overhead of a context switch. While this is tolerable in coarse grain computations, and might even lead to performance benefits if the threads are highly unbalanced, it causes severe performance degradation in the fine-grain case. We have developed a model to evaluate the performance of different combinations of synchronization mechanisms and scheduling policies, and validated it by an implementation on the Makbilan multiprocessor. The model leads to the conclusion that gang scheduling is required for efficient fine grain synchronization on multiprogrammed multiprocessors. 1 Introduction Multiprocessors are often dedicated to running a single application at a time. The program is allowed full control over what happens on each processor, and in fact it might be required to include instructions that regulate the mapping and scheduling of parallel threads. Much experience relating to these issues has been accumulated over the years, and automatic parallelization and compilation techniques have been developed. These techniques allow dedicated processors to be used efficiently by a single application.
Effective Distributed Scheduling of Parallel Workloads
, 1996
"... We present a distributed algorithm for time-sharing parallel workloads that is competitive with coscheduling. Implicit scheduling allows each local scheduler in the system to make independent decisions that dynamically coordinate the scheduling of cooperating processes across processors. Of particul ..."
Abstract
-
Cited by 106 (5 self)
- Add to MetaCart
We present a distributed algorithm for time-sharing parallel workloads that is competitive with coscheduling. Implicit scheduling allows each local scheduler in the system to make independent decisions that dynamically coordinate the scheduling of cooperating processes across processors. Of particular importance is the blocking algorithm which decides the action of a process waiting for a communication or synchronization event to complete. Through simulation of bulk-synchronous parallel applications, we find that a simple two-phase fixed-spin blocking algorithm performs well; a two-phase adaptive algorithm that gathers run-time data on barrier wait-times performs slightly better. Our results hold for a range of machine parameters and parallel program characteristics. These findings are in direct contrast to the literature that states explicit coscheduling is necessary for fine-grained programs. We show that the choice of the local scheduler is crucial, with a priority-based scheduler p...
Empirical Studies of Competitive Spinning for a Shared-Memory Multiprocessor
- IN PROCEEDINGS OF THE 13TH ACM SYMPOSIUM ON OPERATING SYSTEMS PRINCIPLES
, 1991
"... A common operation in multiprocessor programs is acquiring a lock to protect access to shared data. Typically, the requesting thread is blocked if the lock it needs is held by another thread. The cost of blocking one thread and activating another can be a substantial part of program execution time. ..."
Abstract
-
Cited by 99 (1 self)
- Add to MetaCart
A common operation in multiprocessor programs is acquiring a lock to protect access to shared data. Typically, the requesting thread is blocked if the lock it needs is held by another thread. The cost of blocking one thread and activating another can be a substantial part of program execution time. Alternatively, the thread could spin until the lock is free, or spin for a while and then block. This may avoid context-switch overhead, but processor cycles may be wasted in unproductive spinning. This paper studies seven strategies for determining whether and how long to spin before blocking. Of particular interest are competitive strategies, for which the performance can be shown to be no worse than some constant factor times an optimal off-line strategy. The performance of five competitive strategies is compared with that of always blocking, always spinning, or using the optimal off-line algorithm. Measurements of lock-waiting time distributions for five parallel programs were used to co...
The Stochastic Rendezvous Network Model for Performance of Synchronous Client-Server-like Distributed Software
- IEEE Transactions on Computers
, 1995
"... Distributed or parallel software with synchronous communication via rendezvous is found in client-server systems and in proposed Open Distributed Systems, in implementation environments such as Ada, V, Remote Procedure Call systems, in Transputer systems, and in specification techniques such as CSP, ..."
Abstract
-
Cited by 83 (23 self)
- Add to MetaCart
Distributed or parallel software with synchronous communication via rendezvous is found in client-server systems and in proposed Open Distributed Systems, in implementation environments such as Ada, V, Remote Procedure Call systems, in Transputer systems, and in specification techniques such as CSP, CCS and LOTOS. The delays induced by rendezvous can cause serious performance problems, which are not easy to estimate using conventional models which focus on hardware contention, or on a restricted view of the parallelism which ignores implementation constraints. Stochastic Rendezvous Networks are queueing networks of a new type which have been proposed as a modelling framework for these systems. They incorporate the two key phenomena of included service and a second phase of service. This paper extends the model to also incorporate different services or entries associated with each task. Approximations to arrival-instant probabilities are employed with a Mean-Value Analysis framework, to...
Scheduling with Implicit Information in Distributed Systems
- In Proceedings of the 1998 ACM Sigmetrics International Conference on Measurement and Modeling of Computer Systems
, 1998
"... Implicit coscheduling is a distributed algorithm for time-sharing communicating processes in a cluster of workstations. By observing and reacting to implicit information, local schedulers in the system make independent decisions that dynamically coordinate the scheduling of communicating processes. ..."
Abstract
-
Cited by 62 (4 self)
- Add to MetaCart
Implicit coscheduling is a distributed algorithm for time-sharing communicating processes in a cluster of workstations. By observing and reacting to implicit information, local schedulers in the system make independent decisions that dynamically coordinate the scheduling of communicating processes. The principal mechanism involved is two-phase spin-blocking: a process waiting for a message response spins for some amount of time, and then relinquishes the processor if the response does not arrive. In this paper, we describe our experience implementing implicit coscheduling on a cluster of 16 UltraSPARC I workstations; this has led to contributions in three main areas. First, we more rigorously analyze the two-phase spin-block algorithm and show that spin time should be increased when a process is receiving messages. Second, we present performance measurements for a wide range of synthetic benchmarks and for seven Split-C parallel applications. Finally, we show how implicit coscheduling ...
Reactive Synchronization Algorithms for Multiprocessors
"... Synchronization algorithms that are efficient across a wide range of applications and operating conditions are hard to design because their performance depends on unpredictable run-time factors. The designer of a synchronization algorithm has a choice of protocols to use for implementing the synchro ..."
Abstract
-
Cited by 49 (2 self)
- Add to MetaCart
Synchronization algorithms that are efficient across a wide range of applications and operating conditions are hard to design because their performance depends on unpredictable run-time factors. The designer of a synchronization algorithm has a choice of protocols to use for implementing the synchronization operation. For example, candidate protocols for locks include test-and-set protocols and queueing protocols. Frequently, the best choice of protocols depends on the level of contention: previous research has shown that test-and-set protocols for locks outperform queueing protocols at low contention, while the opposite is true at high contention. This paper investigates reactive synchronization algorithms that dynamically choose protocols in response to the level of contention. We describe reactive algorithms for spin locks and fetch-and-op that choose among several shared-memory and message-passing protocols. Dynamically choosing protocols presents a challenge: a reactive algorithm needs to select and change protocols efficiently, and has to allow for the possibility that multiple processes may be executing different protocols at the same time. We describe the notion of consensus objects that the reactive algorithms use to preserve correctness in the face of dynamic protocol changes. Experimental measurements demonstrate that reactive algorithms perform close to the best static choice of protocols at all levels of contention. Furthermore, with mixed levels of contention, reactive algorithms outperform passive algorithms with fixed protocols, provided that contention levels do not change too frequently. Measurements of several parallel applications show that reactive algorithms result in modest performance gains for spin locks and significant gains for fetch-and-op.
Implicit Coscheduling: Coordinated Scheduling with Implicit Information in Distributed Systems
- ACM TRANSACTIONS ON COMPUTER SYSTEMS
, 1998
"... In this thesis, we formalize the concept of an implicitly-controlled system, also referred to as an implicit system. In an implicit system, cooperating components do not explicitly contact other components for control or state information; instead, components infer remote state by observing natural ..."
Abstract
-
Cited by 44 (2 self)
- Add to MetaCart
In this thesis, we formalize the concept of an implicitly-controlled system, also referred to as an implicit system. In an implicit system, cooperating components do not explicitly contact other components for control or state information; instead, components infer remote state by observing naturally-occurring local events and their corresponding implicit information, i.e., information available outside of a defined interface. Many systems, particularly in distributed and networked environments, have leveraged implicit control to simplify the implementation of services with autonomous components. To concretely demonstrate the advantages of implicit control, we propose and implement implicit coscheduling, an algorithm for dynamically coordinating the time...
Processor Allocation Policies for Message-Passing Parallel Computers
, 1994
"... When multiple jobs compete for processing resources on a parallel computer, the operating system kernel's processor allocation policy determines how many and which processors to allocate to each. This dissertation investigates the issues involved in constructing a processor allocation policy for lar ..."
Abstract
-
Cited by 44 (2 self)
- Add to MetaCart
When multiple jobs compete for processing resources on a parallel computer, the operating system kernel's processor allocation policy determines how many and which processors to allocate to each. This dissertation investigates the issues involved in constructing a processor allocation policy for large scale, message-passing parallel computers supporting a scientific workload. First, the issues that affect the performance of scheduling policies for message-passing parallel systems are examined. We argue that reasonable policies must provide nearly equal resource allocation to all runnable jobs and allocate, to a single job, processors that are in close proximity to one another. Second, the concept of efficiency preservation is defined as a characteristic of processor allocation policie...
Waiting Algorithms for Synchronization in Large-Scale Multiprocessors
- ACM Transactions on Computer Systems
, 1991
"... Through analysis and experiments, this paper investigates two-phase waiting algorithms to minimize the cost of waiting for synchronization in large-scale multiprocessors. In a two-phase algorithm, a thread #rst waits by polling a synchronization variable. If the cost of polling reaches a limit L ..."
Abstract
-
Cited by 42 (4 self)
- Add to MetaCart
Through analysis and experiments, this paper investigates two-phase waiting algorithms to minimize the cost of waiting for synchronization in large-scale multiprocessors. In a two-phase algorithm, a thread #rst waits by polling a synchronization variable. If the cost of polling reaches a limit L poll and further waiting is necessary, the thread is blocked, incurring an additional #xed cost, B. The choice of L poll is a critical determinant of the performance of two-phase algorithms.
Scheduler-Conscious Synchronization
- ACM Transactions on Computer Systems
, 1994
"... Efficient synchronization is important for achieving good performance in parallel programs, especially on large-scale multiprocessors. Most synchronization algorithms have been designed to run on a dedicated machine, with one application process per processor, and can suffer serious performance degr ..."
Abstract
-
Cited by 35 (7 self)
- Add to MetaCart
Efficient synchronization is important for achieving good performance in parallel programs, especially on large-scale multiprocessors. Most synchronization algorithms have been designed to run on a dedicated machine, with one application process per processor, and can suffer serious performance degradation in the presence of multiprogramming. Problems arise when running processes block or, worse, busy-wait for action on the part of a process that the scheduler has chosen not to run. In this paper we describe and evaluate a set of scheduler-conscious synchronization algorithms that perform well in the presence of multiprogramming while maintaining good performance on dedicated machines. We consider both large and small machines, with a particular focus on scalability, and examine mutual-exclusion locks, reader-writer locks, and barriers. The algorithms we study fall into two classes: those that heuristically determine appropriate behavior and those that use scheduler information to guid...

