Results 1 - 10
of
134
Basic concepts and taxonomy of dependable and secure computing
- IEEE TDSC
, 2004
"... Abstract—This paper gives the main definitions relating to dependability, a generic concept including as special case such attributes as reliability, availability, safety, integrity, maintainability, etc. Security brings in concerns for confidentiality, in addition to availability and integrity. Bas ..."
Abstract
-
Cited by 315 (5 self)
- Add to MetaCart
Abstract—This paper gives the main definitions relating to dependability, a generic concept including as special case such attributes as reliability, availability, safety, integrity, maintainability, etc. Security brings in concerns for confidentiality, in addition to availability and integrity. Basic definitions are given first. They are then commented upon, and supplemented by additional definitions, which address the threats to dependability and security (faults, errors, failures), their attributes, and the means for their achievement (fault prevention, fault tolerance, fault removal, fault forecasting). The aim is to explicate a set of general concepts, of relevance across a wide range of situations and, therefore, helping communication and cooperation among a number of scientific and technical communities, including ones that are concentrating on particular types of system, of system failures, or of causes of system failures.
Understanding Fault-Tolerant Distributed Systems
- Communications of the ACM
, 1993
"... We propose a small number of basic concepts that can be used to explain the architecture of fault-tolerant distributed systems and we discuss a list of architectural issues that we find useful to consider when designing or examining such systems. For each issue we present known solutions and design ..."
Abstract
-
Cited by 296 (23 self)
- Add to MetaCart
We propose a small number of basic concepts that can be used to explain the architecture of fault-tolerant distributed systems and we discuss a list of architectural issues that we find useful to consider when designing or examining such systems. For each issue we present known solutions and design alternatives, we discuss their relative merits and we give examples of systems which adopt one approach or the other. The aim is to introduce some order in the complex discipline of designing and understanding fault-tolerant distributed systems. 1 Introduction Computing systems consist of a multitude of hardware and software components that are bound to fail eventually. In many systems, such component failures can lead to unanticipated, potentially disruptive failure behavior and to service unavailability. Some systems are designed to be fault-tolerant: they either exhibit a well-defined failure behavior when components fail or mask component failures to users, that is, continue to provid...
Understanding BGP Misconfiguration
- In Proc. ACM SIGCOMM
, 2002
"... It is well-known that simple, accidental BGP configuration errors can disrupt Internet connectivity. Yet little is known about the frequency of misconfiguration or its causes, except for the few spectacular incidents of widespread outages. In this paper, we present the first quantitative study of BG ..."
Abstract
-
Cited by 234 (9 self)
- Add to MetaCart
It is well-known that simple, accidental BGP configuration errors can disrupt Internet connectivity. Yet little is known about the frequency of misconfiguration or its causes, except for the few spectacular incidents of widespread outages. In this paper, we present the first quantitative study of BGP misconfiguration. Over a three week period, we analyzed routing table advertisements from 23 vantage points across the Internet backbone to detect incidents of misconfiguration. For each incident we polled the ISP operators involved to verify whether it was a misconfiguration, and to learn the cause of the incident. We also actively probed the Internet to determine the impact of misconfiguration on connectivity.
Improving the reliability of commodity operating systems
, 2003
"... drivers remain a significant cause of system failures. In Windows XP, for example, drivers account for 85 % of recently reported failures. This article describes Nooks, a reliability subsystem that seeks to greatly enhance operating system (OS) reliability by isolating the OS from driver failures. T ..."
Abstract
-
Cited by 192 (14 self)
- Add to MetaCart
drivers remain a significant cause of system failures. In Windows XP, for example, drivers account for 85 % of recently reported failures. This article describes Nooks, a reliability subsystem that seeks to greatly enhance operating system (OS) reliability by isolating the OS from driver failures. The Nooks approach is practical: rather than guaranteeing complete fault tolerance through a new (and incompatible) OS or driver architecture, our goal is to prevent the vast majority of driver-caused crashes with little or no change to the existing driver and system code. Nooks isolates drivers within lightweight protection domains inside the kernel address space, where hardware and software prevent them from corrupting the kernel. Nooks also tracks a driver’s use of kernel resources to facilitate automatic cleanup during recovery. To prove the viability of our approach, we implemented Nooks in the Linux operating system and used it to fault-isolate several device drivers. Our results show that Nooks offers a substantial increase in the reliability of operating systems, catching and quickly recovering from many faults that would otherwise crash the system. Under a wide range and number of fault conditions, we show that Nooks recovers automatically from 99 % of the faults that otherwise cause Linux to crash.
Why do Internet services fail, and what can be done about it?
, 2003
"... In 1986 Jim Gray published his landmark study of the causes of failures of Tandem systems and the techniques Tandem used to prevent such failures [6]. Seventeen years later, Internet services have replaced fault-tolerant servers as the new kid on the 24x7-availability block. Using data from three la ..."
Abstract
-
Cited by 187 (9 self)
- Add to MetaCart
In 1986 Jim Gray published his landmark study of the causes of failures of Tandem systems and the techniques Tandem used to prevent such failures [6]. Seventeen years later, Internet services have replaced fault-tolerant servers as the new kid on the 24x7-availability block. Using data from three large-scale Internet services, we analyzed the causes of their failures and the (potential) effectiveness of various techniques for preventing and mitigating service failure. We find that (1) operator error is the largest cause of failures in two of the three services, (2) operator error is the largest contributor to time to repair in two of the three services, (3) configuration errors are the largest category of operator errors, (4) failures in custom-written front-end software are significant, and (5) more extensive online testing and more thoroughly exposing and detecting component failures would reduce failure rates in at least one service. Qualitatively we find that improvement in the maintenance tools and systems used by service operations staff would decrease time to diagnose and repair problems.
The Time-Triggered Architecture
- PROCEEDINGS OF THE IEEE
, 2003
"... The time-triggered architecture (TTA) provides a computing infrastructure for the design and implementation of dependable distributed embedded systems. A large real-time application is decomposed into nearly autonomous clusters and nodes, and a fault-tolerant global time base of known precision is g ..."
Abstract
-
Cited by 157 (10 self)
- Add to MetaCart
The time-triggered architecture (TTA) provides a computing infrastructure for the design and implementation of dependable distributed embedded systems. A large real-time application is decomposed into nearly autonomous clusters and nodes, and a fault-tolerant global time base of known precision is generated at every node. In the TTA, this global time is used to precisely specify the interfaces among the nodes, to simplify the communication and agreement protocols, to perform prompt error detection, and to guarantee the timeliness of real-time applications. The TTA supports a two-phased design methodology, architecture design, and component design. During the architecture design phase, the interactions among the distributed components and the interfaces of the components are fully specified in the value domain and in the temporal domain. In the succeeding component implementation phase, the components are built, taking these interface specifications as constraints. This two-phased design methodology is a prerequisite for the composability of applications implemented in the TTA and for the reuse of prevalidated components within the TTA. This paper presents the architecture model of the TTA, explains the design rationale, discusses the time-triggered communication protocols TTP/C and TTP/A, and illustrates how transparent fault tolerance can be implemented in the TTA.
Debugging operating systems with time-traveling virtual machines
, 2005
"... Operating systems are difficult to debug with traditional cyclic debugging. They are non-deterministic; they run for long periods of time; they interact directly with hardware devices; and their state is easily perturbed by the act of debugging. This paper describes a time-traveling virtual machine ..."
Abstract
-
Cited by 114 (7 self)
- Add to MetaCart
Operating systems are difficult to debug with traditional cyclic debugging. They are non-deterministic; they run for long periods of time; they interact directly with hardware devices; and their state is easily perturbed by the act of debugging. This paper describes a time-traveling virtual machine that overcomes many of the difficulties associated with debugging operating systems. Time travel enables a programmer to navigate backward and forward arbitrarily through the execution history of a particular run and to replay arbitrary segments of the past execution. We integrate time travel into a general-purpose debugger to enable a programmer to debug an OS in reverse, implementing commands such as reverse breakpoint, reverse watchpoint, and reverse single step. The space and time overheads needed to support time travel are reasonable for debugging, and movements in time are fast enough to support interactive debugging. We demonstrate the value of our time-traveling virtual machine by using it to understand and fix several OS bugs that are difficult to find with standard debugging tools. Reverse debugging is especially helpful in finding bugs that are fragile due to non-determinism, bugs in device drivers, bugs that require long runs to trigger, bugs that corrupt the stack, and bugs that are detected after the relevant stack frame is popped. 1
Disk failures in the real world: What does an MTTF of 1,000,000 hours mean to you?
, 2007
"... Component failure in large-scale IT installations is becoming an ever larger problem as the number of components in a single cluster approaches a million. In this paper, we present and analyze field-gathered disk replacement data from a number of large production systems, including high-performance ..."
Abstract
-
Cited by 107 (7 self)
- Add to MetaCart
Component failure in large-scale IT installations is becoming an ever larger problem as the number of components in a single cluster approaches a million. In this paper, we present and analyze field-gathered disk replacement data from a number of large production systems, including high-performance computing sites and internet services sites. About 100,000 disks are covered by this data, some for an entire lifetime of five years. The data include drives with SCSI and FC, as well as SATA interfaces. The mean time to failure (MTTF) of those drives, as specified in their datasheets, ranges from 1,000,000 to 1,500,000 hours, suggesting a nominal annual failure rate of at most 0.88%. We find that in the field, annual disk replacement rates typically exceed 1%, with 2-4 % common and up to 13% observed on some systems. This suggests that field replacement is a fairly different process than one might predict based on datasheet MTTF. We also find evidence, based on records of disk replacements in the field, that failure rate is not constant with age, and that, rather than a significant infant mortality effect, we see a significant early onset of wear-out degradation. That is, replacement rates in our data grew constantly with age, an effect often assumed not to set in until after a nominal lifetime of 5 years. Interestingly, we observe little difference in replacement rates between SCSI, FC and SATA drives, potentially an indication that disk-independent factors, such as operating conditions, affect replacement rates more than component specific factors. On the other hand, we see only one instance of a customer rejecting an entire population of disks as a bad batch, in this case because of media error rates, and this instance involved SATA disks. Time between replacement, a proxy for time between failure, is not well modeled by an exponential distribution and exhibits significant levels of correlation, including autocorrelation and long-range dependence.
Microreboot - A Technique for Cheap Recovery
, 2004
"... A significant fraction of software failures in large-scale Internet systems are cured by rebooting, even when the exact failure causes are unknown. However, rebooting can be expensive, causing nontrivial service disruption or downtime even when clusters and failover are employed. In this work we sep ..."
Abstract
-
Cited by 94 (2 self)
- Add to MetaCart
A significant fraction of software failures in large-scale Internet systems are cured by rebooting, even when the exact failure causes are unknown. However, rebooting can be expensive, causing nontrivial service disruption or downtime even when clusters and failover are employed. In this work we separate process recovery from data recovery to enable microrebooting -- a fine-grain technique for surgically recovering faulty application components, without disturbing the rest of the application.
Recovering device drivers
- In OSDI
, 2004
"... This paper presents a new mechanism that enables applications to run correctly when device drivers fail. Because device drivers are the principal failing component in most systems, reducing driver-induced failures greatly improves overall reliability. Earlier work has shown that an operating system ..."
Abstract
-
Cited by 90 (8 self)
- Add to MetaCart
This paper presents a new mechanism that enables applications to run correctly when device drivers fail. Because device drivers are the principal failing component in most systems, reducing driver-induced failures greatly improves overall reliability. Earlier work has shown that an operating system can survive driver failures [33], but the applications that depend on them cannot. Thus, while operating system reliability was greatly improved, application reliability generally was not. To remedy this situation, we introduce a new operating system mechanism called a shadow driver. A shadow driver monitors device drivers and transparently recovers from driver failures. Moreover, it assumes the role of the failed driver during recovery. In this way, applications using the failed driver, as well as the kernel itself, continue to function as expected. We implemented shadow drivers for the Linux operating system and tested them on over a dozen device drivers. Our results show that applications and the OS can indeed survive the failure of a variety of device drivers. Moreover, shadow drivers impose minimal performance overhead. Lastly, they can be introduced with only modest changes to the OS kernel and with no changes at all to existing device drivers. 1

