Job Scheduling in Multiprogrammed Parallel Systems
, 1997
Cited by 145 (15 self)
Scheduling in the context of parallel systems is often thought of in terms of assigning tasks in a program to processors, so as to minimize the makespan. This formulation assumes that the processors are dedicated to the program in question. But when the parallel system is shared by a number of users, this is not necessarily the case. In the context of multiprogrammed parallel machines, scheduling refers to the execution of threads from competing programs. This is an operating system issue, involved with resource allocation, not a program development issue. Scheduling schemes for multiprogrammed parallel systems can be classified as single-level or two-level. Single-level scheduling combines the allocation of processing power with the decision of which thread will use it. Two-level scheduling decouples the two issues: first, processors are allocated to the job, and then the job's threads are scheduled using this pool of processors. The processors of a parallel system can be shared i...
Predicting Queue Times on Space-Sharing Parallel Computers
, 1997
Cited by 59 (1 self)
We present statistical techniques for predicting the queue times experienced by jobs submitted to a space-sharing parallel machine with first-come-first-served (FCFS) scheduling. We apply these techniques to trace data from the Intel Paragon at the San Diego Supercomputer Center and the IBM SP2 at the Cornell Theory Center. We show that it is possible to predict queue times with accuracy that is acceptable for several intended applications. The coefficient of correlation between our predicted queue times and the actual queue times from simulated schedules is between 0.65 and 0.72.
1 Introduction
On space-sharing parallel computers, it is useful to be able to predict how long a submitted job will be queued before processors are allocated to it. Some of the applications of these predictions are:
Load metrics: They provide a measure of load that is more concrete than abstractions such as load average, allowing users to make decisions about what jobs to run, where to run them or what si...
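The coefficient of correlation mentioned in the abstract above compares predicted queue times with the queue times actually observed. The sketch below shows one common way to compute such a figure, the Pearson correlation coefficient. Whether the paper uses exactly this estimator is an assumption, and the function name `pearson` and the sample values are hypothetical, not taken from the paper's traces.

```python
import math


def pearson(predicted, actual):
    """Pearson correlation between two equal-length sequences of numbers."""
    n = len(predicted)
    if n != len(actual) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_p = sum(predicted) / n
    mean_a = sum(actual) / n
    # Sum of products of deviations from the means (unnormalized covariance).
    cov = sum((p - mean_p) * (a - mean_a) for p, a in zip(predicted, actual))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_a = sum((a - mean_a) ** 2 for a in actual)
    if var_p == 0 or var_a == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_p * var_a)


# Hypothetical queue times in minutes, for illustration only.
predicted_minutes = [5.0, 30.0, 12.0, 60.0, 45.0]
actual_minutes = [8.0, 25.0, 20.0, 90.0, 30.0]
r = pearson(predicted_minutes, actual_minutes)
```

A value of `r` near 1 means that long predicted waits tend to go with long actual waits; it does not by itself say how large the individual prediction errors are.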
A Comprehensive Model of the Supercomputer Workload
Cited by 47 (5 self)
... This paper attacks this problem by considering requested time (and its relation with execution time) and the possibility of job cancellation, two aspects of the supercomputer workload that have not been modeled yet. Moreover, we also improve upon existing models for the arrival instant and partition size.
The Cost of Doing Science on the Cloud: The Montage Example
, 2008
Cited by 45 (2 self)
Utility grids such as the Amazon EC2 cloud and Amazon S3 offer computational and storage resources that can be used on-demand for a fee by compute- and data-intensive applications. The cost of running an application on such a cloud depends on the compute, storage and communication resources it will provision and consume. Different execution plans of the same application may result in significantly different costs. Using the Amazon cloud fee structure and a real-life astronomy application, we study via simulation the cost-performance trade-offs of different execution and resource provisioning plans. We also study these trade-offs in the context of the storage and communication fees of Amazon S3 when used for long-term application data archival. Our results show that by provisioning the right amount of storage and compute resources, cost can be significantly reduced with no significant impact on application performance.
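The abstract above says that cost depends on the compute, storage and communication resources a run provisions and consumes. A minimal sketch of such a cost model follows. The `Rates` and `Plan` types, the `run_cost` function, and every fee value are hypothetical placeholders chosen for illustration; they are not Amazon's fee structure and not the paper's simulation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rates:
    """Placeholder per-unit fees in dollars (not real provider prices)."""
    cpu_hour: float
    storage_gb_month: float
    transfer_in_gb: float
    transfer_out_gb: float


@dataclass(frozen=True)
class Plan:
    """Resources that one execution plan provisions and consumes."""
    cpu_hours: float
    storage_gb_months: float
    gb_in: float
    gb_out: float


def run_cost(plan, rates):
    """Total fee: compute + storage + communication (inbound and outbound)."""
    compute = plan.cpu_hours * rates.cpu_hour
    storage = plan.storage_gb_months * rates.storage_gb_month
    transfer = (plan.gb_in * rates.transfer_in_gb
                + plan.gb_out * rates.transfer_out_gb)
    return compute + storage + transfer


# Two hypothetical plans for the same application: one keeps data in cloud
# storage between runs, the other re-uploads its input every time.
rates = Rates(cpu_hour=0.10, storage_gb_month=0.15,
              transfer_in_gb=0.10, transfer_out_gb=0.20)
keep_data = Plan(cpu_hours=10, storage_gb_months=2, gb_in=0, gb_out=1)
reupload = Plan(cpu_hours=10, storage_gb_months=0, gb_in=4, gb_out=1)
costs = {"keep_data": run_cost(keep_data, rates),
         "reupload": run_cost(reupload, rates)}
```

Comparing the entries of `costs` for different plans is the kind of trade-off the abstract describes; which plan is cheaper depends entirely on the fee values assumed.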
Workload characteristics of a multi-cluster supercomputer
, 2004
Cited by 40 (4 self)
This paper presents a comprehensive characterization of a multi-cluster supercomputer workload using twelve-month scientific research traces. Metrics that we characterize include system utilization, job arrival rate and interarrival time, job cancellation rate, job size (degree of parallelism), job run time, memory usage, and user/group behavior. Correlations between metrics (job runtime and memory usage, requested and actual runtime, etc.) are identified and extensively studied. Differences with previously reported workloads are recognized and statistical distributions are fitted for generating synthetic workloads with the same characteristics. This study provides a realistic basis for experiments in resource management and evaluations of different scheduling strategies in a multi-cluster research environment.
The Elusive Goal of Workload Characterization
- Perf. Eval. Rev
, 1999
Cited by 34 (6 self)
The study and design of computer systems requires good models of the workload to which these systems are subjected. Until recently, the data necessary to build these models (observations from production installations) were not available, especially for parallel computers. Instead, most models were based on assumptions and mathematical attributes that facilitate analysis. Recently a number of supercomputer sites have made accounting data available that make it possible to build realistic workload models. It is not clear, however, how to generalize from specific observations to an abstract model of the workload. This paper presents observations of workloads from several parallel supercomputers and discusses modeling issues that have caused problems for researchers in this area.
1 Introduction
We like to think of building computer systems as a systematic process of engineering: we define requirements, draw designs, analyze their properties, evaluate options, and finally construct a w...
Benchmarks and Standards for the Evaluation of Parallel Job Schedulers
, 1999
Cited by 30 (6 self)
The evaluation of parallel job schedulers hinges on the workloads used. It is suggested that this be standardized, in terms of both format and content, so as to ease the evaluation and comparison of different systems. The question remains whether this can encompass both traditional parallel systems and metacomputing systems. This paper is based on a panel on this subject that was held at the workshop, and the ensuing discussion; its authors are both the panel members and participants from the audience. Naturally, not all of us agree with all the opinions expressed here...
Predicting bounds on queuing delay for batch-scheduled parallel machines
- In Proceedings of PPoPP 2006
, 2006
Cited by 28 (8 self)
Most space-sharing parallel computers presently operated by high-performance computing centers use batch-queuing systems to manage processor allocation. In many cases, users wishing to use these batch-queued resources have accounts at multiple sites and have the option of choosing at which site or sites to submit a parallel job. In such a situation, the amount of time a user’s job will wait in any one batch queue can significantly impact the overall time a user waits from job submission to job completion. In this work, we explore a new method for providing end-users with predictions for the bounds on the queuing delay individual jobs will experience. We evaluate this method using batch scheduler logs for distributed-memory parallel machines that cover a 9-year period at 7 large HPC centers. Our results show that it is possible to predict delay bounds reliably for jobs in different queues, and for jobs requesting different ranges of processor counts. Using this information, scientific application developers can intelligently decide where to submit their parallel codes in order to minimize overall turnaround time.
An Integrated Approach to Parallel Scheduling Using Gang-Scheduling, Backfilling and Migration
Cited by 27 (0 self)
Effective scheduling strategies to improve response times, throughput, and utilization are an important consideration in large supercomputing environments. Such machines have traditionally used space-sharing strategies to accommodate multiple jobs at the same time. This approach, however, can result in low system utilization and large job wait times. This paper discusses three techniques that can be used beyond simple space-sharing to greatly improve the performance figures of large parallel systems. The first technique we analyze is backfilling, the second is gang-scheduling, and the third is migration. The main contribution of this paper is an evaluation of the benefits from combining the above techniques. We demonstrate that, under certain conditions, a strategy that combines backfilling, gang-scheduling, and migration is always better than the individual strategies for all quality of service parameters that we consider.
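Backfilling, the first technique named in the abstract above, lets jobs further back in the queue start early as long as they do not delay the job at the head. The sketch below illustrates one widely used variant in which only the head job holds a reservation. That choice is an assumption; the abstract does not say which variant the paper evaluates. The `Job` type and the function `backfill_candidates` are hypothetical names, and the function assumes the head job cannot start right now.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    name: str
    procs: int       # processors requested
    runtime: float   # user-supplied runtime estimate


def backfill_candidates(now, free, running, queue):
    """Return queued jobs (excluding the head) that may start at `now`
    without delaying the head job's earliest possible start.

    running: list of (estimated_end_time, procs) for jobs already executing.
    queue:   waiting jobs in FCFS order; queue[0] is assumed not to fit now.
    """
    head = queue[0]
    # Find the "shadow time": the earliest moment at which enough
    # processors will have been released for the head job to start.
    available = free
    shadow = None
    for end_time, procs in sorted(running):
        available += procs
        if available >= head.procs:
            shadow = end_time
            break
    if shadow is None:
        raise ValueError("head job can never fit on this machine")
    extra = available - head.procs  # processors the head will not need

    started = []
    for job in queue[1:]:
        if job.procs > free:
            continue  # does not fit in the currently idle processors
        ends_before_shadow = now + job.runtime <= shadow
        if ends_before_shadow or job.procs <= extra:
            started.append(job)
            free -= job.procs
            if not ends_before_shadow:
                extra -= job.procs  # still running when the head starts
    return started


# 2 idle processors; a 4-processor job is expected to end at t=10.
waiting = [Job("head", 6, 5.0), Job("short", 2, 8.0), Job("long", 2, 20.0)]
picked = [j.name for j in backfill_candidates(0.0, 2, [(10.0, 4)], waiting)]
```

In this usage example only the job named `short` is picked: it fits in the idle processors and is estimated to finish before the head job's reserved start, while `long` would still hold processors the head job needs.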
Combining Workstations and Supercomputers to Support Grid Applications: The Parallel Tomography Experience
, 2000
Cited by 25 (7 self)
Computational Grids are becoming an increasingly important and powerful platform for the execution of large-scale, resource-intensive applications. However, it remains a challenge for applications to tap into the potential of Grid resources in order to achieve performance. In this paper, we illustrate how work queue applications can leverage Grids to achieve performance through co-allocation. We describe our experiences developing a scheduling strategy for a production tomography application targeted to Grids that contain both workstations and parallel supercomputers. Our strategy uses dynamic information exported by a supercomputer's batch scheduler to simultaneously schedule tasks on workstations and immediately available supercomputer nodes. This strategy is of great practical interest because it combines resources available to the typical research lab: time-shared workstations and CPU time in remote space-shared supercomputers. We show that this strategy improves the performance of ...

