Results 1 - 10
of
15
Disk failures in the real world: What does an MTTF of 1,000,000 hours mean to you?
, 2007
"... Component failure in large-scale IT installations is becoming an ever larger problem as the number of components in a single cluster approaches a million. In this paper, we present and analyze field-gathered disk replacement data from a number of large production systems, including high-performance ..."
Abstract
-
Cited by 108 (7 self)
- Add to MetaCart
Component failure in large-scale IT installations is becoming an ever larger problem as the number of components in a single cluster approaches a million. In this paper, we present and analyze field-gathered disk replacement data from a number of large production systems, including high-performance computing sites and internet services sites. About 100,000 disks are covered by this data, some for an entire lifetime of five years. The data include drives with SCSI and FC, as well as SATA interfaces. The mean time to failure (MTTF) of those drives, as specified in their datasheets, ranges from 1,000,000 to 1,500,000 hours, suggesting a nominal annual failure rate of at most 0.88%. We find that in the field, annual disk replacement rates typically exceed 1%, with 2-4 % common and up to 13% observed on some systems. This suggests that field replacement is a fairly different process than one might predict based on datasheet MTTF. We also find evidence, based on records of disk replacements in the field, that failure rate is not constant with age, and that, rather than a significant infant mortality effect, we see a significant early onset of wear-out degradation. That is, replacement rates in our data grew constantly with age, an effect often assumed not to set in until after a nominal lifetime of 5 years. Interestingly, we observe little difference in replacement rates between SCSI, FC and SATA drives, potentially an indication that disk-independent factors, such as operating conditions, affect replacement rates more than component specific factors. On the other hand, we see only one instance of a customer rejecting an entire population of disks as a bad batch, in this case because of media error rates, and this instance involved SATA disks. Time between replacement, a proxy for time between failure, is not well modeled by an exponential distribution and exhibits significant levels of correlation, including autocorrelation and long-range dependence.
Failure trends in a large disk drive population
- In Proceedings of the 5th USENIX Conference on File and Storage Technologies
, 2007
"... It is estimated that over 90 % of all new information produced in the world is being stored on magnetic media, most of it on hard disk drives. Despite their importance, there is relatively little published work on the failure patterns of disk drives, and the key factors that affect their lifetime. M ..."
Abstract
-
Cited by 98 (1 self)
- Add to MetaCart
It is estimated that over 90 % of all new information produced in the world is being stored on magnetic media, most of it on hard disk drives. Despite their importance, there is relatively little published work on the failure patterns of disk drives, and the key factors that affect their lifetime. Most available data are either based on extrapolation from accelerated aging experiments or from relatively modest sized field studies. Moreover, larger population studies rarely have the infrastructure in place to collect health signals from components in operation, which is critical information for detailed failure analysis. We present data collected from detailed observations of a large disk drive population in a production Internet services deployment. The population observed is many times larger than that of previous studies. In addition to presenting failure statistics, we analyze the correlation between failures and several parameters generally believed to impact longevity. Our analysis identifies several parameters from the drive’s self monitoring facility (SMART) that correlate highly with failures. Despite this high correlation, we conclude that models based on SMART parameters alone are unlikely to be useful for predicting individual drive failures. Surprisingly, we found that temperature and activity levels were much less correlated with drive failures than previously reported. 1
IRON file systems
- In Proceedings of the 20th ACM Symposium on Operating Systems Principles (SOSP ’05
, 2005
"... IRON FILE SYSTEMSVijayan Prabhakaran Disk drives are widely used as a primary medium for storing information.While commodity file systems trust disks to either work or fail completely, modern disks exhibit complex failure modes such as latent sector faults and block corrup-tions, where only portions ..."
Abstract
-
Cited by 74 (24 self)
- Add to MetaCart
IRON FILE SYSTEMSVijayan Prabhakaran Disk drives are widely used as a primary medium for storing information.While commodity file systems trust disks to either work or fail completely, modern disks exhibit complex failure modes such as latent sector faults and block corrup-tions, where only portions of a disk fail.
An analysis of latent sector errors in disk drives
- In Proceedings of the 2007 SIGMETRICS Conference on Measurement and Modeling of Computer Systems
, 2007
"... The reliability measures in today’s disk drive-based storage systems focus predominantly on protecting against complete disk failures. Previous disk reliability studies have analyzed empirical data in an attempt to better understand and predict disk failure rates. Yet, very little is known about the ..."
Abstract
-
Cited by 60 (6 self)
- Add to MetaCart
The reliability measures in today’s disk drive-based storage systems focus predominantly on protecting against complete disk failures. Previous disk reliability studies have analyzed empirical data in an attempt to better understand and predict disk failure rates. Yet, very little is known about the incidence of latent sector errors i.e., errors that go undetected until the corresponding disk sectors are accessed. Our study analyzes data collected from production storage systems over 32 months across 1.53 million disks (both nearline and enterprise class). We analyze factors that impact latent sector errors, observe trends, and explore their implications on the design of reliability mechanisms in storage systems. To the best of our knowledge, this is the first study of such large scale – our sample size is at least an order of magnitude larger than previously published studies – and the first one to focus specifically on latent sector errors and their implications on the design and reliability of storage systems.
Understanding Failures in Petascale Computers
"... With petascale computers only a year or two away there is a pressing need to anticipate and compensate for a probable increase in failure and application interruption rates. Researchers, designers and integrators have available to them far too little detailed information on the failures and interrup ..."
Abstract
-
Cited by 25 (5 self)
- Add to MetaCart
With petascale computers only a year or two away there is a pressing need to anticipate and compensate for a probable increase in failure and application interruption rates. Researchers, designers and integrators have available to them far too little detailed information on the failures and interruptions that even smaller terascale computers experience. The information that is available suggests that application interruptions will become far more common in the coming decade, and the largest applications may surrender large fractions of the computer’s resources to taking checkpoints and restarting from a checkpoint after an interruption. This paper reviews sources of failure information for compute clusters and storage systems, projects failure rates and the corresponding decrease in application effectiveness, and discusses coping strategies such as application-level checkpoint compression and system level process-pairs fault-tolerance for supercomputing. The need for a public repository for detailed failure and interruption records is particularly concerning, as projections from one architectural family of machines to another are widely disputed. To this end, this paper introduces the Computer Failure Data Repository and issues a call for failure history data to publish in it. 1.
Enhanced reliability modeling of raid storage systems
- In Proceedings of the International Conference on Dependable Systems and Networks (DSN
, 2007
"... A flexible model for estimating reliability of RAID storage systems is presented. This model corrects errors associated with the common assumption that system times to failure follow a homogeneous Poisson process. Separate generalized failure distributions are used to model catastrophic failures and ..."
Abstract
-
Cited by 19 (1 self)
- Add to MetaCart
A flexible model for estimating reliability of RAID storage systems is presented. This model corrects errors associated with the common assumption that system times to failure follow a homogeneous Poisson process. Separate generalized failure distributions are used to model catastrophic failures and usage dependent data corruptions for each hard drive. Catastrophic failure restoration is represented by a three-parameter Weibull, so the model can include a minimum time to restore as a function of data transfer rate and hard drive storage capacity. Data can be scrubbed as a background operation to eliminate corrupted data that, in the event of a simultaneous catastrophic failure, results in double disk failures. Field-based times to failure data and mathematic justification for a new model are presented. Model results have been verified and predict between 2 to 1,500 times as many double disk failures as that estimated using the current mean time to data loss method. 1.
Data-Intensive Supercomputing: The case for DISC
, 2007
"... Google and its competitors have created a new class of large-scale computer systems to support Internet search. These “Data-Intensive Super Computing ” (DISC) systems differ from conventional supercomputers in their focus on data: they acquire and maintain continually changing data sets, in addition ..."
Abstract
-
Cited by 17 (1 self)
- Add to MetaCart
Google and its competitors have created a new class of large-scale computer systems to support Internet search. These “Data-Intensive Super Computing ” (DISC) systems differ from conventional supercomputers in their focus on data: they acquire and maintain continually changing data sets, in addition to performing large-scale computations over the data. With the massive amounts of data arising from such diverse sources as telescope imagery, medical records, online transaction records, and web pages, DISC systems have the potential to achieve major advances in science, health care, business efficiencies, and information access. DISC opens up many important research topics in system design, resource management, programming models, parallel algorithms, and applications. By engaging the academic research community in these issues, we can more systematically and in a more open forum explore fundamental aspects of a societally important style of computing. Keywords: parallel computing, data storage, web searchWhen a teenage boy wants to find information about his idol by using Google with the search query “Britney Spears, ” he unleashes the power of several hundred processors operating on a data set of over 200 terabytes. Why then can’t a scientist seeking a cure for cancer invoke large amounts of computation over a terabyte-sized database of DNA microarray data at the click of a button? Recent papers on parallel programming by researchers at Google [13] and Microsoft [19] present the results of using up to 1800 processors to perform computations accessing up to 10 terabytes of data. How can university researchers demonstrate the credibility of their work without having comparable computing facilities available? 1
Optimizing Center Performance through Coordinated Data Staging, Scheduling and Recovery
- SC07
, 2007
"... Procurement and the optimized utilization of Petascale supercomputers and centers is a renewed national priority. Sustained performance and availability of such large centers is a key technical challenge significantly impacting their usability. Storage systems are known to be the primary fault sourc ..."
Abstract
-
Cited by 13 (13 self)
- Add to MetaCart
Procurement and the optimized utilization of Petascale supercomputers and centers is a renewed national priority. Sustained performance and availability of such large centers is a key technical challenge significantly impacting their usability. Storage systems are known to be the primary fault source leading to data unavailability and job resubmissions. This results in reduced center performance, partially due to the lack of coordination between I/O activities and job scheduling. In this work, we propose the coordination of job scheduling with data staging/offloading and on-demand staged data reconstruction to address the availability of job input data and to improve centerwide performance. Fundamental to both mechanisms is the efficient management of transient data: in the way it is scheduled and recovered. Collectively, from a center’s standpoint, these techniques optimize resource usage and increase its data/service availability. From a user’s standpoint, they reduce the job turnaround time and optimize the allocated time usage.
Disk failure investigations at the internet archive
- In Work-in-Progess session, NASA/IEEE Conference on Mass Storage Systems and Technologies (MSST2006
, 2006
"... As storage sites grow in size to thousands of disks, and as the need to predict availability and reliability increases, researchers and designers need a better quantitative understanding of the ..."
Abstract
-
Cited by 10 (0 self)
- Add to MetaCart
As storage sites grow in size to thousands of disks, and as the need to predict availability and reliability increases, researchers and designers need a better quantitative understanding of the
Dependability Analysis of Virtual Memory Systems
- In DSN-2006
, 2006
"... Recent research has shown that even modern hard disks have complex failure modes that do not conform to “failstop” operation. Disks exhibit partial failures like block access errors and block corruption. Commodity operating systems are required to deal with such failures as commodity hard disks are ..."
Abstract
-
Cited by 7 (6 self)
- Add to MetaCart
Recent research has shown that even modern hard disks have complex failure modes that do not conform to “failstop” operation. Disks exhibit partial failures like block access errors and block corruption. Commodity operating systems are required to deal with such failures as commodity hard disks are known to be failure-prone. An important operating system component that is exposed to disk failures is the virtual memory system. In this paper, we examine the failure handling policies of different virtual memory systems for different classes of partial disk errors. We use type and context aware fault injection to explore as many of the internal code paths as possible. From experiments, we find that failure handling policies in current virtual memory systems are at best simplistic, and often inconsistent or even absent. Our fault injection technique also identifies bugs in the failure handling code in these systems. The study identifies possible reasons for poor failure handling, which can help in the design of a failure-aware virtual memory system. 1.

