Results 1 - 10
of
59
Improving the reliability of commodity operating systems
, 2003
"... drivers remain a significant cause of system failures. In Windows XP, for example, drivers account for 85 % of recently reported failures. This article describes Nooks, a reliability subsystem that seeks to greatly enhance operating system (OS) reliability by isolating the OS from driver failures. T ..."
Abstract
-
Cited by 192 (14 self)
- Add to MetaCart
drivers remain a significant cause of system failures. In Windows XP, for example, drivers account for 85 % of recently reported failures. This article describes Nooks, a reliability subsystem that seeks to greatly enhance operating system (OS) reliability by isolating the OS from driver failures. The Nooks approach is practical: rather than guaranteeing complete fault tolerance through a new (and incompatible) OS or driver architecture, our goal is to prevent the vast majority of driver-caused crashes with little or no change to the existing driver and system code. Nooks isolates drivers within lightweight protection domains inside the kernel address space, where hardware and software prevent them from corrupting the kernel. Nooks also tracks a driver’s use of kernel resources to facilitate automatic cleanup during recovery. To prove the viability of our approach, we implemented Nooks in the Linux operating system and used it to fault-isolate several device drivers. Our results show that Nooks offers a substantial increase in the reliability of operating systems, catching and quickly recovering from many faults that would otherwise crash the system. Under a wide range and number of fault conditions, we show that Nooks recovers automatically from 99 % of the faults that otherwise cause Linux to crash.
Enhancing Server Availability and Security Through Failure-Oblivious Computing
- In Proceedings 6 th Symposium on Operating Systems Design and Implementation (OSDI
, 2004
"... We present a new technique, failure-oblivious computing, that enables servers to execute through memory errors without memory corruption. Our safe compiler for C inserts checks that dynamically detect invalid memory accesses. Instead of terminating or throwing an exception, the generated code simply ..."
Abstract
-
Cited by 106 (13 self)
- Add to MetaCart
We present a new technique, failure-oblivious computing, that enables servers to execute through memory errors without memory corruption. Our safe compiler for C inserts checks that dynamically detect invalid memory accesses. Instead of terminating or throwing an exception, the generated code simply discards invalid writes and manufactures values to return for invalid reads, enabling the server to continue its normal execution path. We have applied failure-oblivious computing to a set of widely-used servers from the Linux-based opensource computing environment. Our results show that our techniques 1) make these servers invulnerable to known security attacks that exploit memory errors, and 2) enable the servers to continue to operate successfully to service legitimate requests and satisfy the needs of their users even after attacks trigger their memory errors. We observed several reasons for this successful continued execution. When the memory errors occur in irrelevant computations, failure-oblivious computing enables the server to execute through the memory errors to continue on to execute the relevant computation. Even when the memory errors occur in relevant computations, failure-oblivious computing converts requests that trigger unanticipated and dangerous execution paths into anticipated invalid inputs, which the error-handling logic in the server rejects. Because servers tend to have small error propagation distances (localized errors in the computation for one request tend to have little or no effect on the computations for subsequent requests), redirecting reads that would otherwise cause addressing errors and discarding writes that would otherwise corrupt critical data structures (such as the call stack) localizes the effect of the memory errors, prevents addressing exceptions from terminating the computation, and enables the server to continue on to successfully process subsequent requests. The overall result is a substantial extension of the range of requests that the server can successfully process. 1
Microreboot - A Technique for Cheap Recovery
, 2004
"... A significant fraction of software failures in large-scale Internet systems are cured by rebooting, even when the exact failure causes are unknown. However, rebooting can be expensive, causing nontrivial service disruption or downtime even when clusters and failover are employed. In this work we sep ..."
Abstract
-
Cited by 94 (2 self)
- Add to MetaCart
A significant fraction of software failures in large-scale Internet systems are cured by rebooting, even when the exact failure causes are unknown. However, rebooting can be expensive, causing nontrivial service disruption or downtime even when clusters and failover are employed. In this work we separate process recovery from data recovery to enable microrebooting -- a fine-grain technique for surgically recovering faulty application components, without disturbing the rest of the application.
Recovering device drivers
- In OSDI
, 2004
"... This paper presents a new mechanism that enables applications to run correctly when device drivers fail. Because device drivers are the principal failing component in most systems, reducing driver-induced failures greatly improves overall reliability. Earlier work has shown that an operating system ..."
Abstract
-
Cited by 90 (8 self)
- Add to MetaCart
This paper presents a new mechanism that enables applications to run correctly when device drivers fail. Because device drivers are the principal failing component in most systems, reducing driver-induced failures greatly improves overall reliability. Earlier work has shown that an operating system can survive driver failures [33], but the applications that depend on them cannot. Thus, while operating system reliability was greatly improved, application reliability generally was not. To remedy this situation, we introduce a new operating system mechanism called a shadow driver. A shadow driver monitors device drivers and transparently recovers from driver failures. Moreover, it assumes the role of the failed driver during recovery. In this way, applications using the failed driver, as well as the kernel itself, continue to function as expected. We implemented shadow drivers for the Linux operating system and tested them on over a dozen device drivers. Our results show that applications and the OS can indeed survive the failure of a variety of device drivers. Moreover, shadow drivers impose minimal performance overhead. Lastly, they can be introduced with only modest changes to the OS kernel and with no changes at all to existing device drivers. 1
Unmodified device driver reuse and improved system dependability via virtual machines
- In Proceedings of the 6th Symposium on Operating Systems Design and Implementation
, 2004
"... We propose a method to reuse unmodified device drivers and to improve system dependability using virtual machines. We run the unmodified device driver, with its original operating system, in a virtual machine. This approach enables extensive reuse of existing and unmodified drivers, independent of t ..."
Abstract
-
Cited by 82 (8 self)
- Add to MetaCart
We propose a method to reuse unmodified device drivers and to improve system dependability using virtual machines. We run the unmodified device driver, with its original operating system, in a virtual machine. This approach enables extensive reuse of existing and unmodified drivers, independent of the OS or device vendor, significantly reducing the barrier to building new OS endeavors. By allowing distinct device drivers to reside in separate virtual machines, this technique isolates faults caused by defective or malicious drivers, thus improving a system’s dependability. We show that our technique requires minimal support infrastructure and provides strong fault isolation. Our prototype’s network performance is within 3–8 % of a native Linux system. Each additional virtual machine increases the CPU utilization by about 0.12%. We have successfully reused a wide variety of unmodified Linux network, disk, and PCI device drivers. 1
Automatic Detection and Repair of Errors in Data Structures
, 2002
"... We present a system that accepts a specification of key data structure constraints, then dynamically detects and repairs violations of these constraints. Our experience using our system indicates that the specifications are relatively easy to develop once one understands the data structures. Further ..."
Abstract
-
Cited by 80 (17 self)
- Add to MetaCart
We present a system that accepts a specification of key data structure constraints, then dynamically detects and repairs violations of these constraints. Our experience using our system indicates that the specifications are relatively easy to develop once one understands the data structures. Furthermore, for our set of benchmark applications, our system can e#ectively repair errors to deliver consistent data structures that allow the program to continue to operate successfully within its designed operating envelope.
The Event Heap: A Coordination Infrastructure for Interactive Workspaces
, 2002
"... Abstract. Coordinating the interactions of applications running on the diversity of devices that will be common in ubiquitous computing environments is still a difficult and not completely solved problem. We look at one such environment, an interactive workspace, where groups come together to collab ..."
Abstract
-
Cited by 75 (11 self)
- Add to MetaCart
Abstract. Coordinating the interactions of applications running on the diversity of devices that will be common in ubiquitous computing environments is still a difficult and not completely solved problem. We look at one such environment, an interactive workspace, where groups come together to collaborate on solving problems. Such a space will contain a heterogeneous collection of both new and legacy applications and devices. We propose that a tuplespace model with several extensions is ideal for coordination in this environment. We present a prototype implementation of such a model called the Event Heap. Finally, we show that the system has performed well in actual use over the last year and a half in our prototype interactive workspace, the iRoom. 1
Crash-Only Software
, 2003
"... Crash-only programs crash safely and recover quickly. There is only one way to stop such software---by crashing it---and only one way to bring it up---by initiating recovery. Crash-only systems are built from crash-only components, and the use of transparent component-level retries hides intra-syste ..."
Abstract
-
Cited by 54 (7 self)
- Add to MetaCart
Crash-only programs crash safely and recover quickly. There is only one way to stop such software---by crashing it---and only one way to bring it up---by initiating recovery. Crash-only systems are built from crash-only components, and the use of transparent component-level retries hides intra-system component crashes from end users. In this paper we advocate a crash-only design for Internet systems, showing that it can lead to more reliable, predictable code and faster, more effective recovery. We present ideas on how to build such crash-only Internet services, taking successful techniques to their logical extreme.
Improving Cluster Availability Using Workstation Validation
, 2002
"... We demonstrate a framework for improving the availability of cluster based Internet services. Our approach models Internet services as a collection of interconnected components, each possessing well defined interfaces and failure semantics. Such a decomposition allows designers to engineer high avai ..."
Abstract
-
Cited by 35 (11 self)
- Add to MetaCart
We demonstrate a framework for improving the availability of cluster based Internet services. Our approach models Internet services as a collection of interconnected components, each possessing well defined interfaces and failure semantics. Such a decomposition allows designers to engineer high availability based on an understanding of the interconnections and isolated fault behavior of each component, as opposed to ad-hoc methods. In this work, we focus on using the entire commodity workstation as a component because it possesses natural, fault-isolated interfaces. We define a failure event as a reboot because not only is a workstation unavailable during a reboot, but also because reboots are symptomatic of a larger class of failures, such as configuration and operator errors. Our observations of 3 distinct clusters show that the time between reboots is best modeled by a Weibull distribution with shape parameters of less than 1, implying that a workstation becomes more reliable the longer it has been operating. Leveraging this observed property, we design an allocation strategy which withholds recently rebooted workstations from active service, validating their stability before allowing them to return to service. We show via simulation that this policy leads to a 70-30 rule-of-thumb: For a constant utilization, approximately 70% of the workstation failures can be masked from end clients with 30% extra capacity added to the cluster, provided reboots are not strongly correlated. We also found our technique is most sensitive to the burstiness of reboots as opposed to absolute lengths of workstation uptimes.
Data Structure Repair Using Goal-Directed Reasoning
- IN PROCEEDINGS OF THE 2005 INTERNATIONAL CONFERENCE ON SOFTWARE ENGINEERING
, 2005
"... Model-based data structure repair is a promising technique for enabling programs to continue to execute successfully in the face of otherwise fatal data structure corruption errors. Previous research in this field relied on the developer to write a specification to explicitly translate model repairs ..."
Abstract
-
Cited by 35 (14 self)
- Add to MetaCart
Model-based data structure repair is a promising technique for enabling programs to continue to execute successfully in the face of otherwise fatal data structure corruption errors. Previous research in this field relied on the developer to write a specification to explicitly translate model repairs into concrete data structure repairs, raising the possibility of 1) incorrect translations causing the supposedly repaired concrete data structures to be inconsistent, and 2) repaired models with no corresponding concrete data structure representation. We present a new repair algorithm that uses goal-directed reasoning to automatically translate model repairs into concrete data structure repairs. This new repair algorithm eliminates the possibility of incorrect translations and repaired models with no corresponding representation as concrete data structures. Unlike our old algorithm, our new algorithm can also repair linked data structures such as a list or a tree.

