Processor allocation and checkpoint interval selection in cluster computing systems (2001)

by J S Plank, M G Thomason
Venue: JPDC

Results 1 - 10 of 15

Modeling Machine Availability in Enterprise and Wide-area Distributed Computing Environments

by Daniel Nurmi, John Brevik, Rich Wolski - In Euro-Par’05, 2003
Cited by 51 (7 self)
In this paper, we consider the problem of modeling machine availability in enterprise-area and wide-area distributed computing settings. Using availability data gathered from three different environments, we detail the suitability of four potential statistical distributions for each data set: exponential, Pareto, Weibull, and hyperexponential. In each case, we use software we have developed to determine the necessary parameters automatically from each data collection.

Fault-aware job scheduling for BlueGene/L systems

by A. J. Oliner, R. K. Sahoo, J. E. Moreira, M. Gupta, A. Sivasubramaniam - In IEEE IPDPS, Intl. Parallel and Distributed Processing Symposium, 2004
Cited by 23 (7 self)
Large-scale systems like BlueGene/L are susceptible to a number of software and hardware failures that can affect system performance. In this paper, we evaluate the effectiveness of a previously developed job scheduling algorithm for BlueGene/L in the presence of faults. We have developed two new job-scheduling algorithms that take failures into account while scheduling jobs. We have also evaluated the impact of these algorithms on average bounded slowdown, average response time, and system utilization, considering different levels of proactive failure prediction and prevention techniques reported in the literature. Our simulation studies show that the use of these new algorithms with even trivial fault prediction confidence or accuracy levels (as low as) can significantly improve the performance of the BlueGene/L system.

Automatic Methods for Predicting Machine Availability in Desktop Grid and Peer-to-peer Systems

by John Brevik - In Proceedings of the IEEE International Symposium on Cluster Computing and the Grid (CCGrid’04), 2004
Cited by 20 (1 self)
In this paper, we examine the problem of predicting machine availability in desktop and enterprise computing environments. Predicting the duration that a machine will run until it restarts (availability duration) is critically useful to application scheduling and resource characterization in federated systems. We describe one parametric model fitting technique and two non-parametric prediction techniques, comparing their accuracy in predicting the quantiles of empirically observed machine availability distributions. We describe each method analytically and evaluate its precision using a synthetic trace of machine availability constructed from a known distribution. To detail their practical efficacy, we apply them to machine availability traces from three separate desktop and enterprise computing environments, and evaluate each method in terms of the accuracy with which it predicts availability in a trace-driven simulation. Our results indicate that availability duration can be predicted with quantifiable confidence bounds and that these bounds can be used as conservative bounds on lifetime predictions. Moreover, a non-parametric method based on a binomial approach generates the most accurate estimates.
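As a rough illustration of what a binomial, non-parametric bound on an availability quantile can look like, the sketch below uses the standard order-statistic argument: the number of observations falling below the true q-quantile follows a Binomial(n, q) distribution, so the k-th smallest observation is a lower bound on that quantile with probability P(Binomial(n, q) >= k). This is a textbook construction and is not confirmed to be the exact method of the paper above; the function names are hypothetical.

```python
from math import comb

def binomial_tail(n, q, k):
    """Return P(Binomial(n, q) >= k)."""
    return 1.0 - sum(comb(n, i) * q**i * (1 - q)**(n - i) for i in range(k))

def lower_bound_rank(n, q, confidence):
    """Largest 1-based rank k whose order statistic lower-bounds the
    q-quantile with at least the requested confidence, or None if even
    the smallest observation does not qualify (too few samples)."""
    best = None
    for k in range(1, n + 1):
        if binomial_tail(n, q, k) >= confidence:
            best = k
        else:
            break  # the tail probability only shrinks as k grows
    return best

def quantile_lower_bound(samples, q, confidence):
    """Conservative lower bound on the q-quantile of availability durations."""
    k = lower_bound_rank(len(samples), q, confidence)
    return None if k is None else sorted(samples)[k - 1]
```

For example, with 100 observed availability durations, `quantile_lower_bound(durations, 0.05, 0.95)` returns the second-smallest duration as a value that the machine's 5th-percentile lifetime exceeds with at least 95% confidence. With only 10 observations no rank qualifies and the function returns `None`.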

Performance implications of periodic checkpointing on large-scale cluster systems

by A. J. Oliner, R. K. Sahoo, J. E. Moreira, M. Gupta - In IPDPS ’05: Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS’05) - Workshop 18, 2005
Cited by 14 (2 self)
Large-scale systems like BlueGene/L are susceptible to a number of software and hardware failures that can affect system performance. Periodic application checkpointing is a common technique for mitigating the amount of work lost due to job failures, but its effectiveness under realistic circumstances has not been studied. In this paper, we analyze the system-level performance of periodic application checkpointing using parameters similar to those projected for BlueGene/L systems. Our results reflect simulations on a toroidal interconnect architecture, using a real job log from a machine similar to BlueGene/L, and with a real failure distribution from a large-scale cluster. Our simulation studies investigate the impact of parameters such as checkpoint overhead and checkpoint interval on a number of performance metrics, including bounded slowdown, system utilization, and total work lost. The results suggest that periodic checkpointing may not be an effective way to improve the average bounded slowdown or average system utilization metrics, though it reduces the amount of work lost due to failures. We show that overzealous checkpointing with high overhead can amplify the effects of failures. The study also suggests that new metrics and checkpointing techniques may be required to effectively handle job failures on large-scale machines like BlueGene/L.

Model-based Checkpoint Scheduling for Volatile Resource Environments

by Daniel Nurmi, Rich Wolski, John Brevik - In Proceedings of Cluster 2005, 2004
Cited by 6 (3 self)
In this paper, we describe a system for application checkpoint scheduling in volatile resource environments. Our approach combines historical measurements of resource availability with an estimate of checkpoint/recovery delay to generate checkpoint intervals that minimize overhead. When executing in a desktop computing or resource harvesting context, long-running applications must checkpoint, since resources can be reclaimed by their owners without warning. Our system records the historical availability from each resource and fits a statistical model to the observations using either Maximum Likelihood Estimation (MLE) or Expectation Maximization (EM). When an application is initiated on a particular resource, the system uses the computed distribution to parameterize a Markov state-transition model for the application’s execution, evaluates the expected overhead as a function of the checkpoint interval, and numerically optimizes this quantity. Using Condor as a target platform, we investigate the effectiveness of this technique fitting exponential, Weibull, 2-phase hyperexponential and 3-phase hyperexponential distributions to observed availability data. To verify our method and compare the distributions each against the same conditions, we use observations taken from the Condor pool at the University of Wisconsin and trace-based simulation. We examine the practical value of our approach by observing an implementation of our system when applied to a test application that is then run on the “live” Condor system. Finally, we conclude with a verification of the simulated results against the experimental observations. Our results indicate that application efficiency is relatively insensitive to
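To make the idea of "evaluating expected overhead as a function of the checkpoint interval and numerically optimizing it" concrete, here is a minimal sketch under a much simpler model than the Markov state-transition model the abstract describes. It assumes exponentially distributed failures with rate `failure_rate`, a fixed checkpoint cost, and a fixed, failure-free restart cost; under those assumptions the expected wall-clock time to finish one work interval plus its checkpoint is (1/rate + restart) * (exp(rate * (interval + checkpoint)) - 1). All names and parameter values are hypothetical and are not taken from the paper.

```python
import math

def expected_segment_time(tau, ckpt_cost, restart_cost, failure_rate):
    """Expected wall-clock time to complete tau seconds of work plus one
    checkpoint, retrying from the last checkpoint after each failure.
    Assumes exponential failures and a failure-free restart."""
    segment = tau + ckpt_cost
    return (1.0 / failure_rate + restart_cost) * math.expm1(failure_rate * segment)

def overhead_ratio(tau, ckpt_cost, restart_cost, failure_rate):
    """Expected wall-clock seconds spent per second of useful work."""
    return expected_segment_time(tau, ckpt_cost, restart_cost, failure_rate) / tau

def optimal_interval(ckpt_cost, restart_cost, failure_rate, iters=200):
    """Ternary search for the interval minimizing overhead_ratio.
    The ratio is unimodal in tau and its minimum lies below 1/failure_rate."""
    hi = 1.0 / failure_rate
    lo = 1e-9 * hi
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if overhead_ratio(m1, ckpt_cost, restart_cost, failure_rate) < \
           overhead_ratio(m2, ckpt_cost, restart_cost, failure_rate):
            hi = m2
        else:
            lo = m1
    return 0.5 * (lo + hi)
```

With a mean time between failures of one day (`failure_rate = 1 / 86400`) and a 60-second checkpoint, `optimal_interval(60.0, 120.0, 1 / 86400)` lands within a few percent of the well-known first-order approximation sqrt(2 * checkpoint cost / failure rate), roughly 3220 seconds. In this simplified model the restart cost scales the overhead but does not move the optimum; heavier-tailed availability distributions such as Weibull or hyperexponential would need a different expected-time expression.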

Duplex: A Reusable Fault Tolerance Extension Framework for Network Access Devices

by Srikant Sharma, Jiawu Chen, Wei Li, Kartik Gopalan, Tzi-cker Chiueh - In Proceedings of 2003 International Conference on Dependable Systems and Networks (DSN), 2003
Cited by 5 (0 self)
A growing variety of edge network access devices appear on the marketplace that perform various functions which are meant to complement generic routers' capabilities, such as firewalling, intrusion detection, virus scanning, network address translation, traffic shaping and route optimization. Because these edge network access devices are deployed on the critical path between a user site and its Internet service provider, high availability is crucial to their design. This paper describes the design, construction and evaluation of a general implementation framework for supporting fault tolerance on edge network devices. This implementation framework, called Duplex, is designed to be independent of the functionality of the hosting edge network access device, such that only a minimal amount of programming is required to tailor this framework to a specific edge network access device implementation. Duplex can tolerate power failure, hardware failure, and software failure by supporting device mirroring and watchdog timer-based link bypassing. Empirical performance measurements of an instance of Duplex that is embedded in a commercial bandwidth management device show that the run-time overhead of its fault tolerance mechanisms is less than 1 msec 90% of the time, and the failure detection and recovery period is less than 1.3 sec when running at 100 Mbps.

Minimizing the Network Overhead of Checkpointing in Cycle-harvesting Cluster Environment

by Daniel Nurmi, John Brevik, Rich Wolski - In Proceedings of Cluster 2005, 2005
Cited by 4 (0 self)
Cycle-harvesting systems such as Condor have been developed to make desktop machines in a local area (which are often similar to clusters in hardware configuration) available as a compute platform. To provide a dual-use capability, opportunistic jobs harvesting cycles from the desktop must be checkpointed before the desktop resources are reclaimed by their owners and the job is evacuated. In this paper, we investigate a new system for computing efficient checkpoint schedules in cycle-harvesting environments. Our system records the historical availability from each resource and fits a statistical model to the observations. Because checkpointing must often traverse the network (i.e., the desktop hosts do not provide sufficient persistent storage for checkpoints), we combine this model with predictions of network performance to the storage site to compute a checkpoint schedule. When an application is initiated on a particular resource, the system uses the computed distribution to parameterize a Markov state-transition model for the application’s execution, evaluates the expected time and network overhead as a function of the checkpoint interval, and numerically optimizes with respect to time. We report on the performance and implementation of this system using the Condor cycle-harvesting environment at the University of Wisconsin. We also evaluate the efficiencies we achieve for a variety of network overheads using trace-based simulation. Finally, we validate our simulations against the observed performance with Condor. Our results indicate that while the choice of model distribution has a relatively small but positive effect on time efficiency, it has a substantial impact on network utilization.

Application-Driven Coordination-Free Distributed Checkpointing

by Adnan Agbaria, William H. Sanders
Cited by 4 (0 self)
Distributed checkpointing is an important concept in providing fault tolerance in distributed systems. In today’s applications, e.g., grid and massively parallel applications, the imposed overhead of taking a distributed checkpoint using the known approaches can often outweigh its benefits due to coordination and other overhead from the processes. This paper presents an innovative approach for distributed checkpointing. In this approach, the checkpoints are obtained using offline analysis based on the application level. During execution, no coordination is required. After presenting our approach, we prove its safety and present a performance analysis of it using stochastic models.

Using checkpointing to recover from poor multi-site parallel job scheduling decisions

by William M. Jones - In The 5th Workshop on Middleware for Grid Computing at the ACM/IFIP/USENIX 8th International Middleware Conference
Cited by 2 (2 self)
Recent research in multi-site parallel job scheduling leverages user-provided estimates of job communication characteristics to effectively partition the job across multiple clusters. Previous research addressed the impact of inaccuracies in these estimates on overall system performance and found that multi-site scheduling techniques benefit from these estimates, even in the presence of considerable inaccuracy. While these results are encouraging, there are many instances where these errors result in poor scheduling decisions that cause network over-subscription. This situation can lead to significantly degraded application runtime performance and turnaround time. In this paper, we explore the use of job checkpointing to selectively stop offending jobs in order to alleviate network congestion and subsequently restart them when (and where) sufficient network resources are available. We then characterize the conditions and the extent to which checkpointing improves overall performance. We demonstrate that checkpointing is beneficial even when the overhead of doing so is costly.

Evaluating Cooperative Checkpointing for Supercomputing Systems

by Adam Oliner, Ramendra Sahoo
Cited by 1 (0 self)
Cooperative checkpointing, in which the system dynamically skips checkpoints requested by applications at runtime, can exploit system-level information to improve performance and reliability in the face of failures. We evaluate the applicability of cooperative checkpointing to large-scale systems through simulation studies considering real workloads, failure logs, and different network topologies. We consider two cooperative checkpointing algorithms: work-based cooperative checkpointing uses a heuristic based on the amount of unsaved work, and risk-based cooperative checkpointing leverages failure event prediction. Our results demonstrate that, compared to periodic checkpointing, risk-based checkpointing with event prediction accuracy as low as 10% is able to significantly improve system utilization and reduce average bounded slowdown by a factor of 9, without losing any additional work to failures. Similarly, work-based checkpointing conferred tremendous performance benefits in the face of large checkpoint overheads.