Results 11 - 20
of
81
Safestore: A durable and practical storage system
- In USENIX Annual Technical Conference
, 2007
"... This paper presents SafeStore, a distributed storage system designed to maintain long-term data durability despite conventional hardware and software faults, environmental disruptions, and administrative failures caused by human error or malice. The architecture of SafeStore is based on fault isolat ..."
Abstract
-
Cited by 21 (4 self)
- Add to MetaCart
This paper presents SafeStore, a distributed storage system designed to maintain long-term data durability despite conventional hardware and software faults, environmental disruptions, and administrative failures caused by human error or malice. The architecture of SafeStore is based on fault isolation, which Safe-Store applies aggressively along administrative, physical, and temporal dimensions by spreading data across autonomous storage service providers (SSPs). However, current storage interfaces provided by SSPs are not designed for high end-to-end durability. In this paper, we propose a new storage system architecture that (1) spreads data efficiently across autonomous SSPs using informed hierarchical erasure coding that, for a given replication cost, provides several additional 9’s of durability over what can be achieved with existing black-box SSP interfaces, (2) performs an efficient end-to-end audit of SSPs to detect data loss that, for a 20 % cost increase, improves data durability by two 9’s by reducing MTTR, and (3) offers durable storage with cost, performance, and availability competitive with traditional storage systems. We instantiate and evaluate these ideas by building a SafeStore-based file system with an NFSlike interface. 1
RWset: Attacking path explosion in constraint-based test generation
- IN TACAS’08: INTERNATIONAL CONFERENCE ON TOOLS AND ALGORITHMS FOR THE CONSTRUCTIONS AND ANALYSIS OF SYSTEMS
, 2008
"... Abstract. Recent work has used variations of symbolic execution to automatically generate high-coverage test inputs [3, 4, 7, 8, 14]. Such tools have demonstrated their ability to find very subtle errors. However, one challenge they all face is how to effectively handle the exponential number of pat ..."
Abstract
-
Cited by 21 (3 self)
- Add to MetaCart
Abstract. Recent work has used variations of symbolic execution to automatically generate high-coverage test inputs [3, 4, 7, 8, 14]. Such tools have demonstrated their ability to find very subtle errors. However, one challenge they all face is how to effectively handle the exponential number of paths in checked code. This paper presents a new technique for reducing the number of traversed code paths by discarding those that must have side-effects identical to some previously explored path. Our results on a mix of open source applications and device drivers show that this (sound) optimization reduces the numbers of paths traversed by several orders of magnitude, often achieving program coverage far out of reach for a standard constraint-based execution system. 1
Parallel randomized state-space search
- in International Conference on Software Engineering
"... Model checkers search the space of possible program behaviors to detect errors and to demonstrate their absence. Despite major advances in reduction and optimization techniques, state-space search can still become cost-prohibitive as program size and complexity increase. In this paper, we present a ..."
Abstract
-
Cited by 20 (0 self)
- Add to MetaCart
Model checkers search the space of possible program behaviors to detect errors and to demonstrate their absence. Despite major advances in reduction and optimization techniques, state-space search can still become cost-prohibitive as program size and complexity increase. In this paper, we present a technique for dramatically improving the costeffectiveness of state-space search techniques for error detection using parallelism. Our approach can be composed with all of the reduction and optimization techniques we are aware of to amplify their benefits. It was developed based on insights gained from performing a large empirical study of the cost-effectiveness of randomization techniques in state-space analysis. We explain those insights and our technique, and then show through a focused empirical study that our technique speeds up analysis by factors ranging from 2 to over 1000 as compared to traditional modes of state-space search, and does so with relatively small numbers of parallel processors. 1.
SherLog: Error Diagnosis by Connecting Clues from Run-time Logs
"... Computer systems often fail due to many factors such as software bugs or administrator errors. Diagnosing such production run failures is an important but challenging task since it is difficult to reproduce them in house due to various reasons: (1) unavailability of users ’ inputs and file content d ..."
Abstract
-
Cited by 18 (5 self)
- Add to MetaCart
Computer systems often fail due to many factors such as software bugs or administrator errors. Diagnosing such production run failures is an important but challenging task since it is difficult to reproduce them in house due to various reasons: (1) unavailability of users ’ inputs and file content due to privacy concerns; (2) difficulty in building the exact same execution environment; and (3) non-determinism of concurrent executions on multi-processors. Therefore, programmers often have to diagnose a production run failure based on logs collected back from customers and the corresponding source code. Such diagnosis requires expert knowledge and is also too time-consuming, tedious to narrow down root causes. To address this problem, we propose a tool, called Sher-Log, that analyzes source code by leveraging information provided by run-time logs to infer what must or may have happened during the failed production run. It requires neither re-execution of the program nor knowledge on the log’s semantics. It infers both control and data value information regarding to the failed execution. We evaluate SherLog with 8 representative real world software failures (6 software bugs and 2 configuration errors) from 7 applications including 3 servers. Information inferred by SherLog are very useful for programmers to diagnose these evaluated failures. Our results also show that SherLog can analyze large server applications such as Apache with thousands of logging messages within only 40 minutes.
EIO: Error Handling is Occasionally Correct
"... The reliability of file systems depends in part on how well they propagate errors. We develop a static analysis technique, EDP, that analyzes how file systems and storage device drivers propagate error codes. Running our EDP analysis on all file systems and 3 major storage device drivers in Linux 2. ..."
Abstract
-
Cited by 17 (11 self)
- Add to MetaCart
The reliability of file systems depends in part on how well they propagate errors. We develop a static analysis technique, EDP, that analyzes how file systems and storage device drivers propagate error codes. Running our EDP analysis on all file systems and 3 major storage device drivers in Linux 2.6, we find that errors are often incorrectly propagated; 1153 calls (13%) drop an error code without handling it. We perform a set of analyses to rank the robustness of each subsystem based on the completeness of its error propagation; we find that many popular file systems are less robust than other available choices. We confirm that write errors are neglected more often than read errors. We also find that many violations are not cornercase mistakes, but perhaps intentional choices. Finally, we show that inter-module calls play a part in incorrect error propagation, but that chained propagations do not. In conclusion, error propagation appears complex and hard to perform correctly in modern systems. 1
Transactional flash
- In Proc. Symposium on Operating Systems Design and Implementation (OSDI
, 2008
"... Transactional flash (TxFlash) is a novel solid-state drive (SSD) that uses flash memory and exports a transactional interface (WriteAtomic) to the higher-level software. The copy-on-write nature of the flash translation layer and the fast random access makes flash memory the right medium to support ..."
Abstract
-
Cited by 17 (0 self)
- Add to MetaCart
Transactional flash (TxFlash) is a novel solid-state drive (SSD) that uses flash memory and exports a transactional interface (WriteAtomic) to the higher-level software. The copy-on-write nature of the flash translation layer and the fast random access makes flash memory the right medium to support such an interface. We further develop a novel commit protocol called cyclic commit for TxFlash; the protocol has been specified formally and model checked. Our evaluation, both on a simulator and an emulator on top of a real SSD, shows that TxFlash does not increase the flash firmware complexity significantly and provides transactional features with very small overheads (less than 1%), thereby making file systems easier to build. It further shows that the new cyclic commit protocol significantly outperforms traditional commit for small transactions (95 % improvement in transaction throughput) and completely eliminates the space overhead due to commit records. 1
Understanding and Validating Database System Administration
- In Proceedings of the 2006 USENIX Annual Technical Conference
, 2006
"... 1 Introduction Most enterprises rely on at least one database manage-ment system (DBMS) running on commodity computers to maintain their data. A large fraction of these enter-prises, such as Internet services and world-wide corporations, need to keep their databases operational at alltimes. Unfortun ..."
Abstract
-
Cited by 15 (10 self)
- Add to MetaCart
1 Introduction Most enterprises rely on at least one database manage-ment system (DBMS) running on commodity computers to maintain their data. A large fraction of these enter-prises, such as Internet services and world-wide corporations, need to keep their databases operational at alltimes. Unfortunately, doing so has been a difficult task. A key source of unavailability in these systems isdatabase administrator (DBA) mistakes [10, 15, 20]. Database administration is mistake-prone as it involvesmany complex tasks, such as storage space management, database structure management, and performance tun-ing. Even worse, as shall be seen, DBA mistakes are
D³S: Debugging Deployed Distributed Systems
, 2008
"... Testing large-scale distributed systems is a challenge, because some errors manifest themselves only after a distributed sequence of events that involves machine and network failures. D³S is a checker that allows developers to specify predicates on distributed properties of a deployed system, and th ..."
Abstract
-
Cited by 14 (1 self)
- Add to MetaCart
Testing large-scale distributed systems is a challenge, because some errors manifest themselves only after a distributed sequence of events that involves machine and network failures. D³S is a checker that allows developers to specify predicates on distributed properties of a deployed system, and that checks these predicates while the system is running. When D³S finds a problem it produces the sequence of state changes that led to the problem, allowing developers to quickly find the root cause. Developers write predicates in a simple and sequential programming style, while D³S checks these predicates in a distributed and parallel manner to allow checking to be scalable to large systems and fault tolerant. By using binary instrumentation, D³S works transparently with legacy systems and can change predicates to be checked at runtime. An evaluation with 5 deployed systems shows that D 3 S can detect non-trivial correctness and performance bugs at runtime and with low performance overhead (less than 8%).
Controlling factors in evaluating path-sensitive error detection techniques
- In Foundations of Software Engineering
, 2006
"... Recent advances in static program analysis have made it possible to detect errors in applications that have been thoroughly tested and are in wide-spread use. The ability to find errors that have eluded traditional validation methods is due to the development and combination of sophisticated algorit ..."
Abstract
-
Cited by 13 (2 self)
- Add to MetaCart
Recent advances in static program analysis have made it possible to detect errors in applications that have been thoroughly tested and are in wide-spread use. The ability to find errors that have eluded traditional validation methods is due to the development and combination of sophisticated algorithmic techniques that are embedded in the implementations of analysis tools. Evaluating new analysis techniques is typically performed by running an analysis tool on a collection of subject programs, perhaps enabling and disabling a given technique in different runs. While seemingly sensible, this approach runs the risk of attributing improvements in the costeffectiveness of the analysis to the technique under consideration, when those improvements may actually be due to details of analysis tool implementations that are uncontrolled during evaluation. In this paper, we focus on the specific class of path-sensitive error detection techniques and identify several factors that can significantly influence the cost of analysis. We show, through careful empirical studies, that the influence of these factors is sufficiently large that, if left uncontrolled, they may lead researchers to improperly attribute improvements in analysis cost and effectiveness. We make several recommendations as to how the influence of these factors can be mitigated when evaluating techniques. 1.
Parity Lost and Parity Regained
"... RAID storage systems protect data from storage errors, such as data corruption, using a set of one or more integrity techniques, such as checksums. The exact protection offered by certain techniques or a combination of techniques is sometimes unclear. We introduce and apply a formal method of analyz ..."
Abstract
-
Cited by 12 (2 self)
- Add to MetaCart
RAID storage systems protect data from storage errors, such as data corruption, using a set of one or more integrity techniques, such as checksums. The exact protection offered by certain techniques or a combination of techniques is sometimes unclear. We introduce and apply a formal method of analyzing the design of data protection strategies. Specifically, we use model checking to evaluate whether common protection techniques used in parity-based RAID systems are sufficient in light of the increasingly complex failure modes of modern disk drives. We evaluate the approaches taken by a number of real systems under single-error conditions, and find flaws in every scheme. In particular, we identify a parity pollution problem that spreads corrupt data (the result of a single error) across multiple disks, thus leading to data loss or corruption. We further identify which protection measures must be used to avoid such problems. Finally, we show how to combine real-world failure data with the results from the model checker to estimate the actual likelihood of data loss of different protection strategies. 1

