An Analysis of Data Corruption in the Storage Stack
- In Proceedings of the 6th USENIX Symposium on File and Storage Technologies (FAST ’08), 2008
Cited by 28 (6 self)
An important threat to reliable storage of data is silent data corruption. In order to develop suitable protection mechanisms against data corruption, it is essential to understand its characteristics. In this paper, we present the first large-scale study of data corruption. We analyze corruption instances recorded in production storage systems containing a total of 1.53 million disk drives, over a period of 41 months. We study three classes of corruption: checksum mismatches, identity discrepancies, and parity inconsistencies. We focus on checksum mismatches since they occur the most. We find more than 400,000 instances of checksum mismatches over the 41-month period. We find many interesting trends among these instances including: (i) nearline disks (and their adapters) develop checksum mismatches an order of magnitude more often than enterprise class disk drives, (ii) checksum mismatches within the same disk are not independent events and they show high spatial and temporal locality, and (iii) checksum mismatches across different disks in the same storage system are not independent. We use our observations to derive lessons for corruption-proof system design.
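To make the notion of a checksum mismatch concrete, here is a minimal sketch in which a checksum is stored with each block at write time and verified on read. The block size, the use of CRC32, and every name below are assumptions chosen for illustration; the abstract does not describe the checksum scheme used by the studied systems.

```python
import zlib

BLOCK_SIZE = 4096  # assumed block size, for illustration only


def write_block(store, index, data):
    """Store a block together with a checksum computed at write time."""
    store[index] = (bytes(data), zlib.crc32(data))


def read_block(store, index):
    """Return the block, raising if the stored checksum no longer matches."""
    data, stored_crc = store[index]
    if zlib.crc32(data) != stored_crc:
        raise IOError(f"checksum mismatch in block {index}")
    return data


def corrupt_block(store, index, offset=0):
    """Simulate silent corruption: flip one bit without updating the checksum."""
    data, stored_crc = store[index]
    flipped = bytearray(data)
    flipped[offset] ^= 0x01
    store[index] = (bytes(flipped), stored_crc)
```

In this toy model a read after `corrupt_block` raises instead of returning bad data, which is the kind of event the study counts as a checksum mismatch.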
Improving File System Reliability with I/O Shepherding
- In Proceedings of the 21st ACM Symposium on Operating Systems Principles (SOSP ’07), 2007
Cited by 17 (5 self)
We introduce a new reliability infrastructure for file systems called I/O shepherding. I/O shepherding allows a file system developer to craft nuanced reliability policies to detect and recover from a wide range of storage system failures. We incorporate shepherding into the Linux ext3 file system through a set of changes to the consistency management subsystem, layout engine, disk scheduler, and buffer cache. The resulting file system, CrookFS, enables a broad class of policies to be easily and correctly specified. We implement numerous policies, incorporating data protection techniques such as retry, parity, mirrors, checksums, sanity checks, and data structure repairs; even complex policies can be implemented in less than 100 lines of code, confirming the power and simplicity of the shepherding framework. We also demonstrate that shepherding is properly integrated, adding less than 5% overhead to the I/O path.
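The idea of composing retry, checksums, and mirrors into one read policy might look roughly like the sketch below. This is not the CrookFS interface, which the abstract does not show; `shepherd_read` and its parameters are hypothetical, and CRC32 stands in for whatever checksum a real policy would use.

```python
import zlib


def shepherd_read(read_primary, read_mirror, expected_crc, retries=2):
    """Hypothetical read policy: retry the primary, verify it, then use the mirror."""
    for _ in range(retries + 1):
        try:
            data = read_primary()
        except OSError:
            continue  # treat as a transient failure and retry
        if zlib.crc32(data) == expected_crc:
            return data
        break  # corrupt data will not improve on retry; fall back to the mirror
    data = read_mirror()
    if zlib.crc32(data) != expected_crc:
        raise OSError("both copies failed verification")
    return data
```

The policy fits in a handful of lines because each technique is a small, separable step, which is the property the abstract attributes to the shepherding framework.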
Parity Lost and Parity Regained
Cited by 12 (2 self)
RAID storage systems protect data from storage errors, such as data corruption, using a set of one or more integrity techniques, such as checksums. The exact protection offered by certain techniques or a combination of techniques is sometimes unclear. We introduce and apply a formal method of analyzing the design of data protection strategies. Specifically, we use model checking to evaluate whether common protection techniques used in parity-based RAID systems are sufficient in light of the increasingly complex failure modes of modern disk drives. We evaluate the approaches taken by a number of real systems under single-error conditions, and find flaws in every scheme. In particular, we identify a parity pollution problem that spreads corrupt data (the result of a single error) across multiple disks, thus leading to data loss or corruption. We further identify which protection measures must be used to avoid such problems. Finally, we show how to combine real-world failure data with the results from the model checker to estimate the actual likelihood of data loss of different protection strategies.
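Parity pollution can be illustrated with a toy single-parity stripe. The sketch assumes simple XOR parity and a write path that recomputes parity from all data blocks; the block contents and function names are made up for the example and do not come from the paper.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, position by position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def reconstruct(data, parity, lost):
    """Rebuild block `lost` from the other data blocks and the parity block."""
    others = [block for i, block in enumerate(data) if i != lost]
    return xor_blocks(others + [parity])


def demo_parity_pollution():
    data = [b"AAAA", b"BBBB", b"CCCC"]
    parity = xor_blocks(data)
    original = data[1]

    # A single silent error hits block 1. Parity is still correct,
    # so the original contents remain recoverable.
    data[1] = b"BXBB"
    recoverable_before = reconstruct(data, parity, 1) == original

    # A later write to block 0 recomputes parity from every data block,
    # including the corrupt one. Parity now encodes the corruption.
    data[0] = b"ZZZZ"
    parity = xor_blocks(data)
    recoverable_after = reconstruct(data, parity, 1) == original

    return recoverable_before, recoverable_after
```

Calling `demo_parity_pollution()` returns `(True, False)`: once parity has been recomputed over the corrupt block, the stripe no longer holds enough information to recover the original data.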
Tolerating File-System Mistakes with EnvyFS
Cited by 3 (1 self)
We introduce EnvyFS, an N-version local file system designed to improve reliability in the face of file-system bugs. EnvyFS, implemented as a thin VFS-like layer near the top of the storage stack, replicates file-system metadata and data across existing and diverse commodity file systems (e.g., ext3, ReiserFS, JFS). It uses majority-consensus to operate correctly despite the sometimes faulty behavior of an underlying commodity child file system. Through experimentation, we show EnvyFS is robust to a wide range of failure scenarios, thus delivering on its promise of increased fault tolerance; however, performance and capacity overheads can be significant. To remedy this issue, we introduce SubSIST, a novel single-instance store designed to operate in an N-version environment. In the common case where all child file systems are working properly, SubSIST coalesces most blocks and thus greatly reduces time and space overheads. In the rare case where a child makes a mistake, SubSIST does not propagate the error to other children, and thus preserves the ability of EnvyFS to detect and recover from bugs that affect data reliability. Overall, EnvyFS and SubSIST combine to significantly improve reliability with only modest space and time overheads.
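A majority-consensus read over several child file systems could be sketched as follows. This is an illustration of the voting idea only, not EnvyFS code: each child is modeled as a hypothetical callable that takes a path and returns bytes, and `majority_read` is an invented name.

```python
from collections import Counter


def majority_read(children, path):
    """Return the value that a strict majority of children agree on."""
    answers = []
    for read in children:
        try:
            answers.append(read(path))
        except OSError:
            pass  # a failing child simply casts no vote
    if answers:
        value, votes = Counter(answers).most_common(1)[0]
        if votes > len(children) // 2:
            return value
    raise OSError(f"no majority for {path}")
```

With three children, one faulty or failing child is outvoted by the other two; if no value reaches a strict majority of all children, the read fails rather than guessing.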
Analyzing the Effects of Disk-Pointer Corruption
Cited by 2 (2 self)
The long-term availability of data stored in a file system depends on how well it safeguards on-disk pointers used to access the data. Ideally, a system would correct all pointer errors. In this paper, we examine how well corruption-handling techniques work in reality. We develop a new technique called type-aware pointer corruption to systematically explore how a file system reacts to corrupt pointers. This approach reduces the exploration space for corruption experiments and works without source code. We use type-aware pointer corruption to examine Windows NTFS and Linux ext3. We find that they rely on type and sanity checks to detect corruption, and NTFS recovers using replication in some instances. However, NTFS and ext3 do not recover from most corruptions, including many scenarios for which they possess sufficient redundant information, leading to further corruption, crashes, and unmountable file systems. We use our study to identify important lessons for handling corrupt pointers.
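One way to see why a type-aware approach shrinks the exploration space: if a file system's reaction is assumed to depend mainly on the type of the corrupted pointer and the type of block it ends up pointing to, one experiment per type pair can stand in for every concrete pointer and block. The sketch below is a guess at that counting argument under this assumption; the function names and any type labels passed in are hypothetical and are not the categories used in the paper.

```python
from itertools import product


def type_aware_experiments(pointer_types, block_types):
    """One experiment per (pointer type, target block type) pair."""
    return list(product(pointer_types, block_types))


def naive_experiment_count(num_pointers, num_blocks):
    """Experiments needed if every pointer were redirected to every block."""
    return num_pointers * num_blocks
```

For example, 2 pointer types and 3 block types give 6 experiments, whereas redirecting each of 1,000 pointers to each of 1,000,000 blocks would take a billion.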
The Effects of Metadata Corruption on NFS
Cited by 1 (1 self)
Distributed file systems need to be robust in the face of failures. In this work, we study the failure handling and recovery mechanisms of a widely used distributed file system, Linux NFS. We study the behavior of NFS under corruption of important metadata through fault injection. We find that the NFS protocol behaves in unexpected ways in the presence of these corruptions. On some occasions, incorrect errors are communicated to the client application; in others, the system hangs applications or crashes outright; in a few cases, success is falsely reported when an operation has failed. We use the results of our study to draw lessons for future designs and implementations of the NFS protocol.
Impact of Disk Corruption on Open-Source DBMS
- Sriram Subramanian, Yupu Zhang, Rajiv Vaidyanathan, Haryadi S. Gunawi, Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau, Jeffrey F. Naughton
We examine the effects of corruption on database management systems. Through injecting faults into the MySQL DBMS, we find that in certain cases, corruption can greatly harm the system, leading to untimely crashes, data loss, or even incorrect results. Overall, of 145 injected faults, 110 lead to serious problems. More detailed observations point us to three deficiencies: MySQL does not have the capability to detect some corruptions due to lack of redundant information, does not isolate corrupted data from valid data, and has inconsistent reactions to similar corruption scenarios. To detect and repair corruption, a DBMS is typically equipped with an offline checker. Unfortunately, the MySQL offline checker is not comprehensive in the checks it performs, misdiagnosing many corruption scenarios and missing others. Sometimes the checker itself crashes; more ominously, its incorrect checking can lead to incorrect repairs. Overall, we find that the checker does not behave correctly in 18 of 145 injected corruptions, and thus can leave the DBMS vulnerable to the problems described above.
Coerced Cache Eviction and Discreet Mode Journaling: Dealing with Misbehaving Disks
Coerced cache eviction (CCE) is a new method to force writes to disk in the presence of a disk cache that does not properly obey write-cache configuration or flush requests. We demonstrate the utility of CCE by building a new journaling mode within the Linux ext3 file system. When mounted in this discreet mode, ext3 uses CCEs to ensure that writes are properly ordered and thus maintains file system integrity despite the presence of an improperly behaving disk. We show that discreet mode journaling operates with acceptable overheads for most workloads. Keywords: file systems; disks; journaling; reliability.

