Results 1 - 10
of
44
Symbiotic Jobscheduling for a Simultaneous Multithreading Processor
- In Eighth International Conference on Architectural Support for Programming Languages and Operating Systems
, 2000
"... Simultaneous Multithreading machines fetch and execute instructions from multiple instruction streams to increase system utilization and speedup the execution of jobs. When there are more jobs in the system than there is hardware to support simultaneous execution, the operating system scheduler must ..."
Abstract
-
Cited by 179 (14 self)
- Add to MetaCart
Simultaneous Multithreading machines fetch and execute instructions from multiple instruction streams to increase system utilization and speedup the execution of jobs. When there are more jobs in the system than there is hardware to support simultaneous execution, the operating system scheduler must choose the set of jobs to coschedule This paper demonstrates that performance on a hardware multithreaded processor is sensitive to the set of jobs that are coscheduled by the operating system jobscheduler. Thus, the full benefits of SMT hardware can only be achieved if the scheduler is aware of thread interactions. Here, a mechanism is presented that allows the scheduler to significantly raise the performance of SMT architectures. This is done without any advance knowledge of a workload's characteristics, using sampling to identify jobs which run well together.
Thread scheduling for multiprogrammed multiprocessors
- In Proceedings of the Tenth Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA), Puerto Vallarta
, 1998
"... We present a user-level thread scheduler for shared-memory multiprocessors, and we analyze its performance under multiprogramming. We model multiprogramming with two scheduling levels: our scheduler runs at user-level and schedules threads onto a fixed collection of processes, while below, the opera ..."
Abstract
-
Cited by 130 (5 self)
- Add to MetaCart
We present a user-level thread scheduler for shared-memory multiprocessors, and we analyze its performance under multiprogramming. We model multiprogramming with two scheduling levels: our scheduler runs at user-level and schedules threads onto a fixed collection of processes, while below, the operating system kernel schedules processes onto a fixed collection of processors. We consider the kernel to be an adversary, and our goal is to schedule threads onto processes such that we make efficient use of whatever processor resources are provided by the kernel. Our thread scheduler is a non-blocking implementation of the work-stealing algorithm. For any multithreaded computation with work ¢¤ £ and critical-path length ¢¦ ¥ , and for any number § of processes, our scheduler executes the computation in expected time ¨�©�¢�£���§¤����¢�¥�§���§¤�� � , where §� � is the average number of processors allocated to the computation by the kernel. This time bound is optimal to within a constant factor, and achieves linear speedup whenever § is small relative to the parallelism 1
Effective Distributed Scheduling of Parallel Workloads
, 1996
"... We present a distributed algorithm for time-sharing parallel workloads that is competitive with coscheduling. Implicit scheduling allows each local scheduler in the system to make independent decisions that dynamically coordinate the scheduling of cooperating processes across processors. Of particul ..."
Abstract
-
Cited by 106 (5 self)
- Add to MetaCart
We present a distributed algorithm for time-sharing parallel workloads that is competitive with coscheduling. Implicit scheduling allows each local scheduler in the system to make independent decisions that dynamically coordinate the scheduling of cooperating processes across processors. Of particular importance is the blocking algorithm which decides the action of a process waiting for a communication or synchronization event to complete. Through simulation of bulk-synchronous parallel applications, we find that a simple two-phase fixed-spin blocking algorithm performs well; a two-phase adaptive algorithm that gathers run-time data on barrier wait-times performs slightly better. Our results hold for a range of machine parameters and parallel program characteristics. These findings are in direct contrast to the literature that states explicit coscheduling is necessary for fine-grained programs. We show that the choice of the local scheduler is crucial, with a priority-based scheduler p...
Symbiotic Jobscheduling with Priorities for a Simultaneous Multithreading Processor
, 2002
"... Simultaneous Multithreading machines benefit from jobscheduling software that monitors how well coscheduled jobs share CPU resources, and coschedules jobs that interact well to make more efficient use of those resources. As a result, informed coscheduling can yield significant performance gains over ..."
Abstract
-
Cited by 74 (1 self)
- Add to MetaCart
Simultaneous Multithreading machines benefit from jobscheduling software that monitors how well coscheduled jobs share CPU resources, and coschedules jobs that interact well to make more efficient use of those resources. As a result, informed coscheduling can yield significant performance gains over naive schedulers. However, prior work on coscheduling focused on equal-priority job mixes, which is an unrealistic assumption for modern operating systems.
Scheduling with Implicit Information in Distributed Systems
- In Proceedings of the 1998 ACM Sigmetrics International Conference on Measurement and Modeling of Computer Systems
, 1998
"... Implicit coscheduling is a distributed algorithm for time-sharing communicating processes in a cluster of workstations. By observing and reacting to implicit information, local schedulers in the system make independent decisions that dynamically coordinate the scheduling of communicating processes. ..."
Abstract
-
Cited by 62 (4 self)
- Add to MetaCart
Implicit coscheduling is a distributed algorithm for time-sharing communicating processes in a cluster of workstations. By observing and reacting to implicit information, local schedulers in the system make independent decisions that dynamically coordinate the scheduling of communicating processes. The principal mechanism involved is two-phase spin-blocking: a process waiting for a message response spins for some amount of time, and then relinquishes the processor if the response does not arrive. In this paper, we describe our experience implementing implicit coscheduling on a cluster of 16 UltraSPARC I workstations; this has led to contributions in three main areas. First, we more rigorously analyze the two-phase spin-block algorithm and show that spin time should be increased when a process is receiving messages. Second, we present performance measurements for a wide range of synthetic benchmarks and for seven Split-C parallel applications. Finally, we show how implicit coscheduling ...
Dynamic Coscheduling on Workstation Clusters
- Scheduling Strategies for Parallel Processing, volume 1459 of Lecture Notes in Computer Science
, 1998
"... Coscheduling has been shown to be a critical factor in achieving e#cient parallel execution in timeshared environments [12, 19, 4]. However, the most common approach, gang scheduling, has limitations in scaling, can compromise good interactive response, and requires that communicating processes be ..."
Abstract
-
Cited by 48 (2 self)
- Add to MetaCart
Coscheduling has been shown to be a critical factor in achieving e#cient parallel execution in timeshared environments [12, 19, 4]. However, the most common approach, gang scheduling, has limitations in scaling, can compromise good interactive response, and requires that communicating processes be identified in advance. We explore a technique called dynamic coscheduling (DCS) which produces emergent coscheduling of the processes constituting a parallel job. Experiments are performed in a workstation environment with high performance networks and autonomous timesharing schedulers for each CPU. The results demonstrate that DCS can achieve e#ective, robust coscheduling for a range of workloads and background loads. Empirical comparisons to implicit scheduling and uncoordinated scheduling are presented. Under spin-block synchronization, DCS reduces job response times by up to 20% over implicit scheduling while maintaining fairness; and under spinning synchronization, DCS reduces ...
Implicit Coscheduling: Coordinated Scheduling with Implicit Information in Distributed Systems
- ACM TRANSACTIONS ON COMPUTER SYSTEMS
, 1998
"... In this thesis, we formalize the concept of an implicitly-controlled system, also referred to as an implicit system. In an implicit system, cooperating components do not explicitly contact other components for control or state information; instead, components infer remote state by observing natural ..."
Abstract
-
Cited by 44 (2 self)
- Add to MetaCart
In this thesis, we formalize the concept of an implicitly-controlled system, also referred to as an implicit system. In an implicit system, cooperating components do not explicitly contact other components for control or state information; instead, components infer remote state by observing naturally-occurring local events and their corresponding implicit information, i.e., information available outside of a defined interface. Many systems, particularly in distributed and networked environments, have leveraged implicit control to simplify the implementation of services with autonomous components. To concretely demonstrate the advantages of implicit control, we propose and implement implicit coscheduling, an algorithm for dynamically coordinating the time...
Extending Proportional-Share Scheduling to a Network of Workstations
- In Proceedings of Parallel and Distributed Processing Techniques and Applications (PDPTA’97), Las Vegas, NV
, 1997
"... As networks of workstations (NOW) emerge as a viable platform for a wide range of workloads, a new scheduling approach is needed to allocate the collection of resources across competing users. In this paper, we show that extensions to a proportional-share scheduler for improving response time can st ..."
Abstract
-
Cited by 34 (4 self)
- Add to MetaCart
As networks of workstations (NOW) emerge as a viable platform for a wide range of workloads, a new scheduling approach is needed to allocate the collection of resources across competing users. In this paper, we show that extensions to a proportional-share scheduler for improving response time can still fairly allocate resources to a mix of sequential, interactive, and parallel jobs in this distributed environment. We find that a proportional-share scheduler, specifically a stride-scheduler, running on each node in the cluster is a good building-block. Simple extensions are implemented and analyzed which provide better response-times for interactive jobs by giving those jobs their share of resources over a longer time-interval. When scheduling jobs across the cluster, we show that fairness can be guaranteed if each local scheduler knows the number of tickets issued to each user and if the tickets are balanced across all workstations. Finally, we show that a proportional-share of resourc...
High performance virtual machines (HPVM): Clusters with supercomputing APIs and performance
- in: Proceedings of the 8th SIAM Conference on Parallel Processing for Scientific Computing
, 1997
"... The HPVM project provides software which enables high-performance computing on clusters of PCs and workstations using standard supercomputing APIs such as MPI, SHMEM Put/Get, and Global Arrays. HPVMs—High-Performance Virtual Machines—are surprisingly competitive with MPP systems, such as the IBM SP2 ..."
Abstract
-
Cited by 29 (4 self)
- Add to MetaCart
The HPVM project provides software which enables high-performance computing on clusters of PCs and workstations using standard supercomputing APIs such as MPI, SHMEM Put/Get, and Global Arrays. HPVMs—High-Performance Virtual Machines—are surprisingly competitive with MPP systems, such as the IBM SP2 and Cray T3D. The Illinois HPVM achieves impressive low-level communication performance across the cluster: one-way latencies of around 11 µsec and bandwidths> 50 MBytes/sec—even for small packets (< 256 bytes). Performance at higher levels, such as MPI, is expected to be approximately 17 µsec latency and also> 50 MByte/sec bandwidth.

