Results 1 - 10 of 245
Failure trends in a large disk drive population
In Proceedings of the 5th USENIX Conference on File and Storage Technologies, 2007
"... It is estimated that over 90 % of all new information produced in the world is being stored on magnetic media, most of it on hard disk drives. Despite their importance, there is relatively little published work on the failure patterns of disk drives, and the key factors that affect their lifetime. M ..."
Abstract
-
Cited by 240 (1 self)
Abstract: It is estimated that over 90% of all new information produced in the world is being stored on magnetic media, most of it on hard disk drives. Despite their importance, there is relatively little published work on the failure patterns of disk drives and the key factors that affect their lifetime. Most available data are based either on extrapolation from accelerated aging experiments or on relatively modest-sized field studies. Moreover, larger population studies rarely have the infrastructure in place to collect health signals from components in operation, which is critical information for detailed failure analysis. We present data collected from detailed observations of a large disk drive population in a production Internet services deployment. The population observed is many times larger than that of previous studies. In addition to presenting failure statistics, we analyze the correlation between failures and several parameters generally believed to impact longevity. Our analysis identifies several parameters from the drive’s self-monitoring facility (SMART) that correlate highly with failures. Despite this high correlation, we conclude that models based on SMART parameters alone are unlikely to be useful for predicting individual drive failures. Surprisingly, we found that temperature and activity levels were much less correlated with drive failures than previously reported.
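A minimal way to reproduce the kind of SMART-versus-failure comparison the abstract describes is to contrast failure rates for drives that have and have not reported a given SMART attribute. The sketch below assumes a hypothetical fleet record with `failed` and `reallocated_sectors` fields; it is only an illustration of the analysis style, not the authors' methodology.

```python
# Sketch: failure rate for drives with vs. without a nonzero SMART attribute.
# Field names and sample data are hypothetical, for illustration only.
from dataclasses import dataclass

@dataclass
class Drive:
    failed: bool               # did the drive fail during the observation window?
    reallocated_sectors: int   # SMART reallocated-sector raw value at last observation

def failure_rate(drives):
    return sum(d.failed for d in drives) / len(drives) if drives else 0.0

def smart_split_rates(fleet):
    """Compare failure rates with and without reallocated sectors reported."""
    with_attr = [d for d in fleet if d.reallocated_sectors > 0]
    without = [d for d in fleet if d.reallocated_sectors == 0]
    return failure_rate(with_attr), failure_rate(without)

if __name__ == "__main__":
    fleet = [Drive(True, 12), Drive(False, 0), Drive(False, 0),
             Drive(True, 0), Drive(False, 3)]
    hi, lo = smart_split_rates(fleet)
    print(f"failure rate, reallocations > 0: {hi:.2f}; none reported: {lo:.2f}")
```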
Write Off-Loading: Practical Power Management for Enterprise Storage
"... In enterprise data centers power usage is a problem impacting server density and the total cost of ownership. Storage uses a significant fraction of the power budget and there are no widely deployed power-saving solutions for enterprise storage systems. The traditional view is that enterprise worklo ..."
Abstract
-
Cited by 134 (9 self)
Abstract: In enterprise data centers, power usage is a problem impacting server density and the total cost of ownership. Storage uses a significant fraction of the power budget, and there are no widely deployed power-saving solutions for enterprise storage systems. The traditional view is that enterprise workloads make spinning down disks ineffective because idle periods are too short. We analyzed block-level traces from 36 volumes in an enterprise data center for one week and concluded that significant idle periods exist, and that they can be further increased by modifying the read/write patterns using write off-loading. Write off-loading allows write requests on spun-down disks to be temporarily redirected to persistent storage elsewhere in the data center. The key challenge is doing this transparently and efficiently at the block level, without sacrificing consistency or failure resilience. We describe our write off-loading design and implementation that achieves these goals. We evaluate it by replaying portions of our traces on a rack-based testbed. Results show that just spinning disks down when idle saves 28–36% of energy, and write off-loading further increases the savings to 45–60%.
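The core mechanism can be pictured as block-layer interposition: writes to a spun-down volume are logged elsewhere and reclaimed later, while reads of off-loaded blocks are served from the log. The sketch below is a simplified illustration under assumed interfaces (`spun_down`, `remote_log`, and friends are hypothetical stand-ins), not the authors' implementation.

```python
# Simplified sketch of the write off-loading idea at the block layer.
# `home_disk` and `remote_log` are hypothetical stand-ins for the volume and
# the persistent off-load target described in the paper.
class OffloadingVolume:
    def __init__(self, home_disk, remote_log):
        self.home = home_disk
        self.log = remote_log          # persistent storage elsewhere in the data center
        self.offloaded = {}            # block -> version currently held in the log

    def write(self, block, data):
        if self.home.spun_down():
            # Redirect the write instead of spinning the disk up.
            self.log.append(block, data)
            self.offloaded[block] = self.log.latest_version(block)
        else:
            self.home.write(block, data)

    def read(self, block):
        if block in self.offloaded:
            # Consistency: always return the newest off-loaded version.
            return self.log.read(block, self.offloaded[block])
        if self.home.spun_down():
            self.home.spin_up()        # reads of non-offloaded data force a spin-up
        return self.home.read(block)

    def reclaim(self):
        # On spin-up (or in the background), write off-loaded blocks back home.
        for block, version in list(self.offloaded.items()):
            self.home.write(block, self.log.read(block, version))
            del self.offloaded[block]
```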
DRAM Errors in the Wild: A Large-Scale Field Study
"... Errors in dynamic random access memory (DRAM) are a common form of hardware failure in modern compute clusters. Failures are costly both in terms of hardware replacement costs and service disruption. While a large body of work exists on DRAM in laboratory conditions, little has been reported on real ..."
Abstract
-
Cited by 126 (2 self)
Abstract: Errors in dynamic random access memory (DRAM) are a common form of hardware failure in modern compute clusters. Failures are costly both in terms of hardware replacement costs and service disruption. While a large body of work exists on DRAM in laboratory conditions, little has been reported on real DRAM failures in large production clusters. In this paper, we analyze measurements of memory errors in a large fleet of commodity servers over a period of 2.5 years. The collected data covers multiple vendors, DRAM capacities and technologies, and comprises many millions of DIMM days. The goal of this paper is to answer questions such as the following: How common are memory errors in practice? What are their statistical properties? How are they affected by external factors, such as temperature and utilization, and by chip-specific factors, such as chip density, memory technology and DIMM age? We find that DRAM error behavior in the field differs in many key aspects from commonly held assumptions. For example, we observe DRAM error rates that are orders of magnitude higher than previously reported, with 25,000 to 70,000 errors per billion device hours per Mbit and more than 8% of DIMMs affected by errors per year. We provide strong evidence that memory errors are dominated by hard errors, rather than the soft errors that previous work suspected to be the dominant error mode. We find that temperature, known to strongly impact DIMM error rates in lab conditions, has a surprisingly small effect on error behavior in the field when all other factors are taken into account. Finally, contrary to common fears, we do not observe any indication that newer generations of DIMMs have worse error behavior.
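To put the reported rate in concrete terms, the short calculation below converts 25,000–70,000 errors per billion device hours per Mbit into expected correctable errors per DIMM-year, assuming (purely for illustration, not from the paper) a 1 GB DIMM and errors spread uniformly in time.

```python
# Back-of-the-envelope conversion of the reported DRAM error rate.
# Assumptions (not from the paper): a 1 GB DIMM, uniform error incidence.
HOURS_PER_YEAR = 24 * 365          # 8760
DIMM_MBIT = 1 * 1024 * 8           # 1 GB = 8192 Mbit

for rate in (25_000, 70_000):      # errors per 10^9 device hours per Mbit
    errors_per_year = rate * DIMM_MBIT * HOURS_PER_YEAR / 1e9
    print(f"{rate:>6} -> ~{errors_per_year:,.0f} correctable errors per DIMM-year")

# Under these assumptions this works out to roughly 1,800 to 5,000 correctable
# errors per DIMM-year, consistent with the abstract's claim that field rates
# are orders of magnitude above what is commonly assumed.
```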
Understanding Intrinsic Characteristics and System Implications of Flash Memory based Solid State Drives
"... Flash Memory based Solid State Drive (SSD) has been called a “pivotal technology ” that could revolutionize data storage systems. Since SSD shares a common interface with the traditional hard disk drive (HDD), both physically and logically, an effective integration of SSD into the storage hierarchy ..."
Abstract
-
Cited by 111 (5 self)
Abstract: Flash memory based Solid State Drive (SSD) has been called a “pivotal technology” that could revolutionize data storage systems. Since SSD shares a common interface with the traditional hard disk drive (HDD), both physically and logically, an effective integration of SSD into the storage hierarchy is very important. However, details of SSD hardware implementations tend to be hidden behind such narrow interfaces. In fact, since sophisticated algorithms are usually, of necessity, adopted in SSD controller firmware, more complex performance dynamics are to be expected in SSD than in HDD systems. Most existing literature and product specifications on SSDs provide only high-level descriptions and standard performance data, such as bandwidth and latency. In order to gain insight into the unique performance characteristics ...
An analysis of latent sector errors in disk drives
In Proceedings of the 2007 SIGMETRICS Conference on Measurement and Modeling of Computer Systems
"... The reliability measures in today’s disk drive-based storage systems focus predominantly on protecting against complete disk failures. Previous disk reliability studies have analyzed empirical data in an attempt to better understand and predict disk failure rates. Yet, very little is known about the ..."
Abstract
-
Cited by 107 (6 self)
Abstract: The reliability measures in today’s disk drive-based storage systems focus predominantly on protecting against complete disk failures. Previous disk reliability studies have analyzed empirical data in an attempt to better understand and predict disk failure rates. Yet, very little is known about the incidence of latent sector errors, i.e., errors that go undetected until the corresponding disk sectors are accessed. Our study analyzes data collected from production storage systems over 32 months across 1.53 million disks (both nearline and enterprise class). We analyze factors that impact latent sector errors, observe trends, and explore their implications for the design of reliability mechanisms in storage systems. To the best of our knowledge, this is the first study at such a large scale (our sample size is at least an order of magnitude larger than that of previously published studies) and the first to focus specifically on latent sector errors and their implications for the design and reliability of storage systems.
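Because latent sector errors are by definition invisible until a sector is read, storage systems commonly surface them proactively with a background scrub. The loop below is a generic illustration of that mechanism (it is not drawn from the paper, and `disk` and `repair` are hypothetical interfaces): a periodic full read pass turns latent errors into detected, repairable ones.

```python
# Generic background-scrub sketch: read every sector so latent errors are
# detected, and repaired from redundancy, before a second failure occurs.
import time

def scrub(disk, repair, period_s=7 * 24 * 3600):
    """Periodically read all sectors; `disk` and `repair` are hypothetical."""
    while True:
        for lba in range(disk.num_sectors):
            try:
                disk.read_sector(lba)
            except IOError:
                # A latent sector error has just become a detected one.
                repair(disk, lba)   # e.g., rebuild from parity or a mirror
        time.sleep(period_s)
```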
Understanding Failures in Petascale Computers
"... With petascale computers only a year or two away there is a pressing need to anticipate and compensate for a probable increase in failure and application interruption rates. Researchers, designers and integrators have available to them far too little detailed information on the failures and interrup ..."
Abstract
-
Cited by 103 (6 self)
Abstract: With petascale computers only a year or two away, there is a pressing need to anticipate and compensate for a probable increase in failure and application interruption rates. Researchers, designers and integrators have far too little detailed information available on the failures and interruptions that even smaller terascale computers experience. The information that is available suggests that application interruptions will become far more common in the coming decade, and that the largest applications may surrender large fractions of the computer’s resources to taking checkpoints and restarting from a checkpoint after an interruption. This paper reviews sources of failure information for compute clusters and storage systems, projects failure rates and the corresponding decrease in application effectiveness, and discusses coping strategies such as application-level checkpoint compression and system-level process-pairs fault tolerance for supercomputing. The need for a public repository of detailed failure and interruption records is particularly pressing, as projections from one architectural family of machines to another are widely disputed. To this end, this paper introduces the Computer Failure Data Repository and issues a call for failure history data to publish in it.
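The claim that applications may surrender large fractions of the machine to checkpointing can be made concrete with the standard first-order checkpoint/restart model (Young's approximation). This formula is not taken from the paper; it is a common way to see why the projection is plausible, with M the mean time between interruptions and delta the time to write one checkpoint.

```latex
% First-order checkpoint/restart overhead (Young's approximation);
% M = mean time between interruptions, \delta = cost of one checkpoint.
\tau_{\mathrm{opt}} \approx \sqrt{2\,\delta M},
\qquad
\text{overhead} \approx \frac{\delta}{\tau_{\mathrm{opt}}} + \frac{\tau_{\mathrm{opt}}}{2M}
= \sqrt{\frac{2\delta}{M}}.
```

For example, a 30-minute checkpoint against a 10-hour system MTBF gives an overhead of roughly sqrt(2 * 0.5 / 10), about 32% of machine time, before counting restart latency.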
Auditing to keep online storage services honest
In Proc. of HotOS XI. USENIX, 2007
"... A growing number of online service providers offer to store customers ’ photos, email, file system backups, and other digital assets. Currently, customers cannot make informed decisions about the risk of losing data stored with any particular service provider, reducing their incentive to rely on the ..."
Abstract
-
Cited by 91 (1 self)
Abstract: A growing number of online service providers offer to store customers’ photos, email, file system backups, and other digital assets. Currently, customers cannot make informed decisions about the risk of losing data stored with any particular service provider, reducing their incentive to rely on these services. We argue that third-party auditing is important in creating an online service-oriented economy, because it allows customers to evaluate risks and it increases the efficiency of insurance-based risk mitigation. We describe approaches and system hooks that support both internal and external auditing of online storage services, describe motivations for service providers and auditors to adopt these approaches, and list challenges that need to be resolved for such auditing to become a reality.
Availability in Globally Distributed Storage Systems
"... Highly available cloud storage is often implemented with complex, multi-tiered distributed systems built on top of clusters of commodity servers and disk drives. Sophisticated management, load balancing and recovery techniques are needed to achieve high performance and availability amidst an abundan ..."
Abstract
-
Cited by 85 (0 self)
Abstract: Highly available cloud storage is often implemented with complex, multi-tiered distributed systems built on top of clusters of commodity servers and disk drives. Sophisticated management, load balancing and recovery techniques are needed to achieve high performance and availability amidst an abundance of failure sources that include software, hardware, network connectivity, and power issues. While there is a relative wealth of failure studies of individual components of storage systems, such as disk drives, relatively little has been reported so far on the overall availability behavior of large cloud-based storage services. We characterize the availability properties of cloud storage systems based on an extensive one-year study of Google’s main storage infrastructure and present statistical models that enable further insight into the impact of multiple design choices, such as data placement and replication strategies. With these models we compare data availability under a variety of system parameters, given the real patterns of failures observed in our fleet.
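As a baseline for the kind of model the paper builds, consider the simplest independence assumption: if each of r replicas of a chunk is unavailable independently with probability p, the chunk is unavailable with probability p^r. This is only a naive strawman, not the paper's model; statistical models fitted to real failure patterns capture effects, such as correlated failures across machines, that this bound ignores.

```latex
% Naive independence baseline (not the paper's model):
% p = probability a single replica is unavailable, r = replication factor.
\Pr[\text{chunk unavailable}] = p^{r},
\qquad \text{e.g. } p = 10^{-2},\ r = 3 \;\Rightarrow\; 10^{-6}.
```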
Pyramid codes: Flexible schemes to trade space for access efficiency in reliable data storage systems
In Proceedings of the IEEE International Symposium on Network Computing and Applications. IEEE, Los Alamitos
"... We design flexible schemes to explore the tradeoffs between storage space and access efficiency in reliable data storage systems. Aiming at this goal, two new classes of erasure-resilient codes are introduced – Basic Pyramid Codes (BPC) and Generalized Pyramid Codes (GPC). Both schemes require sligh ..."
Abstract
-
Cited by 81 (9 self)
Abstract: We design flexible schemes to explore the tradeoffs between storage space and access efficiency in reliable data storage systems. Aiming at this goal, two new classes of erasure-resilient codes are introduced: Basic Pyramid Codes (BPC) and Generalized Pyramid Codes (GPC). Both schemes require slightly more storage space than conventional schemes, but significantly improve the critical performance of reads during failures and unavailability. As a by-product, we establish a necessary matching condition that characterizes the limit of failure recovery: unless the matching condition is satisfied, a failure case cannot be recovered. In addition, we define a maximally recoverable (MR) property. For all erasure-resilient coding (ERC) schemes holding the MR property, the matching condition becomes sufficient; that is, all failure cases satisfying the matching condition are indeed recoverable. We show that GPC is the first class of non-MDS schemes holding the MR property.
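A concrete way to see the space/access-efficiency tradeoff is the basic construction: start from a conventional (n, k) MDS code, split the k data blocks into groups, and replace one global parity with per-group parities. The sketch below only counts storage overhead and the blocks read to repair a single lost data block, under an assumed (11, 8) starting code; it does not implement the actual encoding.

```python
# Count-only sketch of the Basic Pyramid Code tradeoff (no real encoding).
# Baseline: an (11, 8) MDS code -> 8 data blocks + 3 global parities.
# Pyramid variant: split the 8 data blocks into 2 groups of 4 and replace one
# global parity with one parity per group -> 8 data + 2 group + 2 global = 12.

def overhead(total_blocks, data_blocks):
    return total_blocks / data_blocks

def single_repair_reads(data_blocks, group_size=None):
    # Repairing one lost data block needs k surviving blocks with the MDS code,
    # but only the blocks of its own group when a group parity is available.
    return group_size if group_size else data_blocks

k = 8
print("MDS (11,8):     overhead %.3f, reads for one repair: %d"
      % (overhead(11, k), single_repair_reads(k)))
print("Pyramid (12,8): overhead %.3f, reads for one repair: %d"
      % (overhead(12, k), single_repair_reads(k, group_size=4)))
# Slightly more space (1.50x vs 1.375x) buys a 2x cheaper common-case repair.
```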
What Supercomputers Say: A Study of Five System Logs
In Proc. of DSN, 2007
"... If we hope to automatically detect and diagnose failures in large-scale computer systems, we must study real deployed systems and the data they generate. Progress has been hampered by the inaccessibility of empirical data. This paper addresses that dearth by examining system logs from five supercomp ..."
Abstract
-
Cited by 79 (7 self)
Abstract: If we hope to automatically detect and diagnose failures in large-scale computer systems, we must study real deployed systems and the data they generate. Progress has been hampered by the inaccessibility of empirical data. This paper addresses that dearth by examining system logs from five supercomputers, with the aim of providing useful insight and direction for future research into the use of such logs. We present details about the systems, the methods of log collection, and how alerts were identified; propose a simpler and more effective filtering algorithm; and define operational context to encompass the crucial information that we found to be currently missing from most logs. The machines we consider (and their numbers of processors) are: Blue Gene/L (131072), Red Storm (10880), Thunderbird (9024), Spirit (1028), and Liberty (512). This is the first study of raw system logs from multiple supercomputers.
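The filtering step the abstract refers to is essentially deduplication of alert storms in time and across nodes. The snippet below is a generic temporal/spatial filter shown for illustration only; it is not the algorithm proposed in the paper, whose details are in the full text.

```python
# Generic log de-duplication: collapse repeats of the same message occurring
# on the same node (temporal) or on other nodes (spatial) within a window.
# Illustrative only; not the paper's filtering algorithm.
def filter_alerts(events, window_s=5 * 60):
    """events: iterable of (timestamp_s, node, message), assumed time-sorted."""
    last_seen = {}      # (node, message) -> last kept timestamp (temporal)
    last_global = {}    # message -> last kept timestamp (spatial)
    kept = []
    for ts, node, msg in events:
        if ts - last_seen.get((node, msg), float("-inf")) < window_s:
            continue    # repeat on the same node within the window
        if ts - last_global.get(msg, float("-inf")) < window_s:
            last_seen[(node, msg)] = ts
            continue    # same message recently kept from another node
        kept.append((ts, node, msg))
        last_seen[(node, msg)] = ts
        last_global[msg] = ts
    return kept
```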