Disk failures in the real world: What does an MTTF of 1,000,000 hours mean to you?
, 2007
Cited by 108 (7 self)
Component failure in large-scale IT installations is becoming an ever larger problem as the number of components in a single cluster approaches a million. In this paper, we present and analyze field-gathered disk replacement data from a number of large production systems, including high-performance computing sites and internet services sites. About 100,000 disks are covered by this data, some for an entire lifetime of five years. The data include drives with SCSI and FC, as well as SATA interfaces. The mean time to failure (MTTF) of those drives, as specified in their datasheets, ranges from 1,000,000 to 1,500,000 hours, suggesting a nominal annual failure rate of at most 0.88%. We find that in the field, annual disk replacement rates typically exceed 1%, with 2-4 % common and up to 13% observed on some systems. This suggests that field replacement is a fairly different process than one might predict based on datasheet MTTF. We also find evidence, based on records of disk replacements in the field, that failure rate is not constant with age, and that, rather than a significant infant mortality effect, we see a significant early onset of wear-out degradation. That is, replacement rates in our data grew constantly with age, an effect often assumed not to set in until after a nominal lifetime of 5 years. Interestingly, we observe little difference in replacement rates between SCSI, FC and SATA drives, potentially an indication that disk-independent factors, such as operating conditions, affect replacement rates more than component specific factors. On the other hand, we see only one instance of a customer rejecting an entire population of disks as a bad batch, in this case because of media error rates, and this instance involved SATA disks. Time between replacement, a proxy for time between failure, is not well modeled by an exponential distribution and exhibits significant levels of correlation, including autocorrelation and long-range dependence.
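The datasheet figures quoted in this abstract can be checked with a short calculation. The sketch below assumes the usual first-order conversion, in which the nominal annual failure rate is the number of powered-on hours in a year divided by the MTTF; the function name `nominal_afr` and the constant `HOURS_PER_YEAR` are illustrative and not taken from the paper.

```python
# Minimal sketch: nominal annual failure rate (AFR) implied by a datasheet MTTF.
# Assumption: AFR ~= hours per year / MTTF, with the drive powered on all year.

HOURS_PER_YEAR = 8760  # 365 days * 24 hours


def nominal_afr(mttf_hours: float) -> float:
    """Return the nominal AFR as a fraction (0.01 means 1% per year)."""
    if mttf_hours <= 0:
        raise ValueError("MTTF must be positive")
    return HOURS_PER_YEAR / mttf_hours


if __name__ == "__main__":
    for mttf in (1_000_000, 1_500_000):
        print(f"MTTF {mttf:,} h -> nominal AFR {nominal_afr(mttf):.2%}")
```

Under this assumption an MTTF of 1,000,000 hours corresponds to about 0.88% per year, which matches the upper bound stated in the abstract, while the field replacement rates it reports (2-4% common, up to 13%) are several times higher.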
A large-scale study of failures in high-performance computing systems
- In Proc. of the 2006 International Conference on Dependable Systems and Networks (DSN’06)
, 2006
"... systems ..."
Fluid stochastic Petri nets: Theory, applications, and solution techniques
- European Journal of Operational Research
, 1998
Cited by 49 (10 self)
In this paper we introduce a new class of stochastic Petri nets in which one or more places can hold fluid rather than discrete tokens. We define a class of fluid stochastic Petri nets in such a way that the discrete and continuous portions may affect each other. Following this definition we provide equations for their transient and steady-state behavior. We present several examples showing the utility of the construct in communication network modeling and reliability analysis, and discuss important special cases. We then discuss numerical methods for computing the transient behavior of such nets. Finally, some numerical examples are presented.
Networked Windows NT System Field Failure Data Analysis
, 1999
Cited by 46 (0 self)
This paper presents a measurement-based dependability study of a Networked Windows NT system based on field data collected from NT System Logs from 503 servers running in a production environment over a four-month period. The event logs at hand contain only system reboot information. We study individual server failures and domain behavior in order to characterize failure behavior and explore error propagation between servers. The key observations from this study are: (1) system software and hardware failures are the two major contributors to the total system downtime (22% and 10%), (2) recovery from application software failures is usually quick, (3) in many cases, more than one reboot is required to recover from a failure, (4) the average availability of an individual server is over 99%, (5) there is a strong indication of error dependency or error propagation across the network, (6) most (58%) reboots are unclassified, indicating the need for better logging techniques, (7) maintenance and configuration contribute to 24% of system downtime.
Failure data analysis of a large-scale heterogeneous server environment
- In Proceedings of the 2004 International Conference on Dependable Systems and Networks
, 2004
Cited by 36 (4 self)
The growing complexity of hardware and software mandates the recognition of fault occurrence in system deployment and management. While there are several techniques to prevent and/or handle faults, there continues to be a growing need for an in-depth understanding of system errors and failures and their empirical and statistical properties. This understanding can help evaluate the effectiveness of different techniques for improving system availability, in addition to developing new solutions. In this paper, we analyze the empirical and statistical properties of system errors and failures from a network of nearly 400 heterogeneous servers running a diverse workload over a year. While improvements in system robustness continue to limit the number of actual failures to a very small fraction of the recorded errors, the failure rates are significant and highly variable. Our results also show that the system error and failure patterns are comprised of time-varying behavior containing long stationary intervals. These stationary intervals exhibit various strong correlation structures and periodic patterns, which impact performance but also can be exploited to address such performance issues.
Fault-Tolerant Rate-Monotonic Scheduling
- Journal of Real-Time Systems
, 1998
Cited by 29 (12 self)
Due to the critical nature of the tasks in hard real-time systems, it is essential that faults be tolerated. Several studies have shown that space applications, which have very high reliability requirements, also have a very high frequency of transient faults. Therefore, tolerance to this type of fault is essential in such applications. In this paper, we present a scheme which can be used to tolerate faults during the execution of preemptive real-time tasks. We describe a recovery scheme which can be used to re-execute tasks in the event of single and multiple transient faults and discuss conditions that must be met by any such recovery scheme. We then extend the Rate Monotonic Scheduling (RMS) scheme to provide tolerance for single and multiple transient faults. We derive schedulability bounds for sets of real-time tasks given the desired level of fault tolerance for each task or subset of tasks. Finally, we analyze and compare the bounds derived as a function of the amount of processing ...
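For context on what a schedulability bound looks like, the sketch below implements the classic Liu and Layland utilization test for plain RMS without faults. It is not the fault-tolerant bound derived in this paper, which the truncated abstract does not state; the function names and the example task set are illustrative assumptions.

```python
# Minimal sketch: the classic (fault-free) Liu-Layland utilization test for RMS.
# Assumption: independent periodic tasks with deadlines equal to their periods.
# This is a baseline only, NOT the fault-tolerant bound from the paper above.


def rms_utilization_bound(n: int) -> float:
    """Least upper bound on total utilization for n tasks: n * (2^(1/n) - 1)."""
    if n < 1:
        raise ValueError("need at least one task")
    return n * (2 ** (1 / n) - 1)


def passes_rms_bound(tasks: list[tuple[float, float]]) -> bool:
    """tasks holds (execution_time, period) pairs.

    True means the set is guaranteed schedulable under RMS. The test is
    sufficient but not necessary, so False does not prove infeasibility.
    """
    utilization = sum(c / t for c, t in tasks)
    return utilization <= rms_utilization_bound(len(tasks))


if __name__ == "__main__":
    example = [(1, 4), (1, 5), (2, 10)]  # hypothetical task set, utilization 0.65
    print(rms_utilization_bound(len(example)))
    print(passes_rms_bound(example))
```

A fault-tolerant variant has to leave room for re-executing tasks after a transient fault, so its usable utilization would be expected to fall below this baseline.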
Understanding Failures in Petascale Computers
Cited by 25 (5 self)
With petascale computers only a year or two away, there is a pressing need to anticipate and compensate for a probable increase in failure and application interruption rates. Researchers, designers and integrators have available to them far too little detailed information on the failures and interruptions that even smaller terascale computers experience. The information that is available suggests that application interruptions will become far more common in the coming decade, and the largest applications may surrender large fractions of the computer’s resources to taking checkpoints and restarting from a checkpoint after an interruption. This paper reviews sources of failure information for compute clusters and storage systems, projects failure rates and the corresponding decrease in application effectiveness, and discusses coping strategies such as application-level checkpoint compression and system level process-pairs fault-tolerance for supercomputing. The need for a public repository for detailed failure and interruption records is particularly concerning, as projections from one architectural family of machines to another are widely disputed. To this end, this paper introduces the Computer Failure Data Repository and issues a call for failure history data to publish in it.
Guaranteeing Fault Tolerance Through Scheduling In Real-Time Systems
, 1996
Cited by 21 (1 self)
Real-time systems are those which must execute all tasks within their timing constraints. Due to the catastrophic consequences of missing deadlines of some real-time tasks, fault tolerance is an essential component of such systems. This thesis introduces techniques to enhance the fault tolerance capability of real-time systems by incorporating time redundancy. Time redundancy is essential in ultrareliable real-time systems where correlated faults must be tolerated. It can also be used to detect and tolerate transient faults, which are a majority of the faults in computing systems. This thesis demonstrates how time redundancy can be used in conjunction with hardware and software redundancy to tolerate a variety of faults in real-time systems. This thesis considers several different system and task models, and for each model, presents a schedulability test (a utilization bound or a set of conditions) which guarantees that all tasks in the system will satisfy their timing constraints even ...
Elnozahy. The interplay of power management and fault recovery in real-time systems
- IEEE Trans. on Computers
, 2004
Cited by 17 (2 self)
This paper describes how to exploit the scheduling slack in a real-time system to reduce energy consumption and achieve fault tolerance at the same time. During failure-free operation, a task takes checkpoints to enable recovery from failure. Additionally, the system exploits the slack to conserve energy by reducing the processor speed. If a task fails, it will restart from a saved checkpoint and execute at maximum speed to guarantee that the deadlines are met. The paper shows that the number of checkpoints and their placements interact in subtle ways with the power management policy. We study two checkpoint placement policies for aperiodic tasks and analytically derive the optimal number of checkpoints to conserve energy under each. This optimal number allows the CPU speed to be slowed down to the level that yields minimum energy consumption, while still guaranteeing recoverability of tasks under each checkpointing policy. The results show that traditional periodic checkpointing is not the best policy for the combined purpose of conserving energy and guaranteeing recovery. Instead, better energy savings are possible through a nonuniform distribution of checkpoints that takes into account the energy consumption and reliability factors. Depending on the amount of slack and the checkpointing overhead, energy can be reduced by up to 68 percent under nonuniform checkpointing. We also demonstrate the applicability of these checkpoint placement policies to periodic tasks. Index Terms: Checkpointing, fault tolerance, frequency scaling, power management, real-time systems, reliability, voltage scaling.
Tolerance to Multiple Transient Faults for Aperiodic Tasks in Hard Real-Time Systems
- IEEE Transactions on Computers
, 1999
Cited by 16 (0 self)
Real-time systems are being increasingly used in several applications which are time-critical in nature. Fault tolerance is an essential requirement of such systems, due to the catastrophic consequences of not tolerating faults. In this paper, we study a scheme that guarantees the timely recovery from multiple faults within hard real-time constraints in uniprocessor systems. Assuming earliest-deadline-first scheduling (EDF) for aperiodic preemptive tasks, we develop a necessary and sufficient feasibility-check algorithm for fault-tolerant scheduling with complexity O(n^2 · k), where n is the number of tasks to be scheduled and k is the maximum number of faults to be tolerated. INDEX TERMS: Real-time scheduling, earliest-deadline first, fault-tolerant schedules, fault recovery. 1 Introduction The interest in embedded systems has been growing steadily in the recent past, especially those systems in which timing constraints are essential for the correct execution of the sys...

