Software Defects and their Impact on System Availability - A Study of Field Failures in Operating Systems
1991
Cited by 134 (5 self)
In recent years, software defects have become the dominant cause of customer outage, and improvements in software reliability and quality have not kept pace with those of hardware. Yet, software defects are not well enough understood to provide a clear methodology for avoiding or recovering from them. To gain the necessary insight, we study defects reported between 1986 and 1989 on a high-end operating system product. We compare a typical defect (regular) to one that corrupts a program's memory (overlay), given that overlays are considered by field services to be particularly hard to find and fix. This paper: (1) shows that the impact of an overlay defect is, on average, much higher than that of a regular defect; (2) defines error types to classify the programming mistakes that cause software to fail; (3) defines error triggers to classify the events that cause latent errors in programs to surface. The error trigger distribution weights events and environments that are probably inad...
Enhancing Server Availability and Security Through Failure-Oblivious Computing
- In Proceedings of the 6th Symposium on Operating Systems Design and Implementation (OSDI), 2004
Cited by 106 (13 self)
We present a new technique, failure-oblivious computing, that enables servers to execute through memory errors without memory corruption. Our safe compiler for C inserts checks that dynamically detect invalid memory accesses. Instead of terminating or throwing an exception, the generated code simply discards invalid writes and manufactures values to return for invalid reads, enabling the server to continue its normal execution path. We have applied failure-oblivious computing to a set of widely-used servers from the Linux-based open-source computing environment. Our results show that our techniques 1) make these servers invulnerable to known security attacks that exploit memory errors, and 2) enable the servers to continue to operate successfully to service legitimate requests and satisfy the needs of their users even after attacks trigger their memory errors. We observed several reasons for this successful continued execution. When the memory errors occur in irrelevant computations, failure-oblivious computing enables the server to execute through the memory errors to continue on to execute the relevant computation. Even when the memory errors occur in relevant computations, failure-oblivious computing converts requests that trigger unanticipated and dangerous execution paths into anticipated invalid inputs, which the error-handling logic in the server rejects. Because servers tend to have small error propagation distances (localized errors in the computation for one request tend to have little or no effect on the computations for subsequent requests), redirecting reads that would otherwise cause addressing errors and discarding writes that would otherwise corrupt critical data structures (such as the call stack) localizes the effect of the memory errors, prevents addressing exceptions from terminating the computation, and enables the server to continue on to successfully process subsequent requests. The overall result is a substantial extension of the range of requests that the server can successfully process.
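The discard-and-manufacture policy described in this abstract can be illustrated with a small sketch. It is not the authors' compiler, which instruments C programs; it is a hypothetical Python class, `FailureObliviousBuffer`, that applies the same policy to one fixed-size buffer. The manufactured value sequence (cycling through 0, 1, 2) is an assumption chosen for illustration.

```python
from itertools import cycle


class FailureObliviousBuffer:
    """Fixed-size buffer that tolerates out-of-bounds accesses (illustrative only)."""

    def __init__(self, size):
        self._data = [0] * size
        # Assumed policy: manufactured read values cycle through a few small integers.
        self._manufactured = cycle([0, 1, 2])
        self.discarded_writes = 0

    def _in_bounds(self, index):
        return 0 <= index < len(self._data)

    def write(self, index, value):
        if self._in_bounds(index):
            self._data[index] = value
        else:
            # Invalid write: discard it instead of corrupting memory or raising.
            self.discarded_writes += 1

    def read(self, index):
        if self._in_bounds(index):
            return self._data[index]
        # Invalid read: manufacture a value so execution can continue.
        return next(self._manufactured)


buf = FailureObliviousBuffer(4)
for i, byte in enumerate(b"overlong"):  # 8 bytes written into a 4-slot buffer
    buf.write(i, byte)

print(bytes(buf.read(i) for i in range(4)))  # the in-bounds prefix survives
print(buf.discarded_writes)                  # number of writes that were dropped
```

In this sketch the oversized input is silently truncated, which mirrors the abstract's point that a dangerous request is turned into an ordinary invalid input for the caller's error handling to reject.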
Automatic Detection and Repair of Errors in Data Structures
2002
Cited by 80 (17 self)
We present a system that accepts a specification of key data structure constraints, then dynamically detects and repairs violations of these constraints. Our experience using our system indicates that the specifications are relatively easy to develop once one understands the data structures. Furthermore, for our set of benchmark applications, our system can effectively repair errors to deliver consistent data structures that allow the program to continue to operate successfully within its designed operating envelope.
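The detect-and-repair loop described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' specification language or repair algorithm: each constraint is a hypothetical record with a `holds` predicate and a hand-written `repair` action, and the block table being repaired is invented for the example.

```python
def repair_structure(structure, constraints, max_rounds=10):
    """Re-check every constraint and apply its repair action until all hold."""
    repaired = []
    for _ in range(max_rounds):
        changed = False
        for constraint in constraints:
            if not constraint["holds"](structure):
                constraint["repair"](structure)
                repaired.append(constraint["name"])
                changed = True
        if not changed:
            return repaired
    raise RuntimeError("constraints still violated after repair attempts")


# Hypothetical specification for a tiny block table.
constraints = [
    {
        "name": "free_count matches blocks",
        "holds": lambda s: s["free_count"] == s["blocks"].count("free"),
        "repair": lambda s: s.update(free_count=s["blocks"].count("free")),
    },
    {
        "name": "next_free points at a free block",
        "holds": lambda s: 0 <= s["next_free"] < len(s["blocks"])
        and s["blocks"][s["next_free"]] == "free",
        # Assumes at least one free block exists; index() raises otherwise.
        "repair": lambda s: s.update(next_free=s["blocks"].index("free")),
    },
]

# A corrupted instance: the counter and the pointer disagree with the block list.
table = {"blocks": ["used", "free", "used", "free"], "free_count": 7, "next_free": 9}
fixed = repair_structure(table, constraints)
print(fixed)
print(table)
```

The repaired table is consistent with the constraints, although not necessarily identical to the state before corruption, which matches the abstract's goal of letting the program continue within its designed operating envelope.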
Networked Windows NT System Field Failure Data Analysis
1999
Cited by 46 (0 self)
This paper presents a measurement-based dependability study of a Networked Windows NT system based on field data collected from NT System Logs from 503 servers running in a production environment over a four-month period. The event logs at hand contain only system reboot information. We study individual server failures and domain behavior in order to characterize failure behavior and explore error propagation between servers. The key observations from this study are: (1) system software and hardware failures are the two major contributors to the total system downtime (22% and 10%), (2) recovery from application software failures is usually quick, (3) in many cases, more than one reboot is required to recover from a failure, (4) the average availability of an individual server is over 99%, (5) there is a strong indication of error dependency or error propagation across the network, (6) most (58%) reboots are unclassified, indicating the need for better logging techniques, (7) maintenance and configuration contribute to 24% of system downtime.
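An availability figure like the one in this abstract is conventionally computed as uptime divided by total observation time. The sketch below shows that arithmetic under stated assumptions: the `(down_minute, up_minute)` outage format is hypothetical, standing in for shutdown and reboot events parsed from a system log, and the numbers are made up rather than taken from the study.

```python
def availability(period_minutes, outages):
    """Fraction of the observation period during which the server was up.

    `outages` is a list of (down_minute, up_minute) pairs (hypothetical format).
    """
    downtime = sum(up - down for down, up in outages)
    return (period_minutes - downtime) / period_minutes


# Made-up numbers for illustration, not data from the study.
period = 6000                         # 100 hours of observation, in minutes
outages = [(600, 630), (2400, 2415)]  # a 30-minute and a 15-minute outage
print(f"{availability(period, outages):.4f}")
```

Here 45 minutes of downtime in 6000 minutes gives an availability of 0.9925, that is, just over 99%.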
An architecture for specification-based detection of semantic integrity violations in kernel dynamic data
- In Proceedings of the USENIX Security Symposium, 2006
Cited by 46 (3 self)
The ability of intruders to hide their presence in compromised systems has surpassed the ability of the current generation of integrity monitors to detect them. Once in control of a system, intruders modify the state of constantly-changing dynamic kernel data structures to hide their processes and elevate their privileges. Current monitoring tools are limited to detecting changes in nominally static kernel data and text and cannot distinguish a valid state change from tampering in these dynamic data structures. We introduce a novel general architecture for defining and monitoring semantic integrity constraints using a specification language-based approach. This approach will enable a new generation of integrity monitors to distinguish valid states from tampering.
Failure data analysis of a large-scale heterogeneous server environment
- In Proceedings of the 2004 International Conference on Dependable Systems and Networks, 2004
Cited by 36 (4 self)
The growing complexity of hardware and software mandates the recognition of fault occurrence in system deployment and management. While there are several techniques to prevent and/or handle faults, there continues to be a growing need for an in-depth understanding of system errors and failures and their empirical and statistical properties. This understanding can help evaluate the effectiveness of different techniques for improving system availability, in addition to developing new solutions. In this paper, we analyze the empirical and statistical properties of system errors and failures from a network of nearly 400 heterogeneous servers running a diverse workload over a year. While improvements in system robustness continue to limit the number of actual failures to a very small fraction of the recorded errors, the failure rates are significant and highly variable. Our results also show that the system error and failure patterns are comprised of time-varying behavior containing long stationary intervals. These stationary intervals exhibit various strong correlation structures and periodic patterns, which impact performance but also can be exploited to address such performance issues.
Data Structure Repair Using Goal-Directed Reasoning
- In Proceedings of the 2005 International Conference on Software Engineering, 2005
Cited by 35 (14 self)
Model-based data structure repair is a promising technique for enabling programs to continue to execute successfully in the face of otherwise fatal data structure corruption errors. Previous research in this field relied on the developer to write a specification to explicitly translate model repairs into concrete data structure repairs, raising the possibility of 1) incorrect translations causing the supposedly repaired concrete data structures to be inconsistent, and 2) repaired models with no corresponding concrete data structure representation. We present a new repair algorithm that uses goal-directed reasoning to automatically translate model repairs into concrete data structure repairs. This new repair algorithm eliminates the possibility of incorrect translations and repaired models with no corresponding representation as concrete data structures. Unlike our old algorithm, our new algorithm can also repair linked data structures such as a list or a tree.
A dynamic technique for eliminating buffer overflow vulnerabilities (and other memory errors)
- In Proceedings of the 2004 Annual Computer Security Applications Conference, 2004
Cited by 34 (9 self)
Buffer overflow vulnerabilities are caused by programming errors that allow an attacker to cause the program to write beyond the bounds of an allocated memory block to corrupt other data structures. The standard way to exploit a buffer overflow vulnerability involves a request that is too large for the buffer intended to hold it. The buffer overflow error causes the program to write part of the request beyond the bounds of the buffer, corrupting the address space of the program and causing the program to execute injected code contained in the request. We have implemented a compiler that inserts dynamic checks into the generated code to detect all out of bounds memory accesses. When it detects an out of bounds write, it stores the value away in a hash table to return as the value for corresponding out of bounds reads. The net effect is to (conceptually) give each allocated memory block unbounded size and to eliminate out of bounds accesses as a programming error. We have acquired several widely used open source servers (Apache, Sendmail, Pine, Mutt, and Midnight Commander). With standard compilers, all of these servers are vulnerable to buffer overflow attacks as documented at security tracking web sites. Our compiler eliminates these security vulnerabilities (as well as other memory errors). Our results show that our compiler enables the servers to execute successfully through buffer overflow attacks to continue to correctly service user requests without security vulnerabilities.
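The hash-table mechanism described in this abstract can be sketched in a few lines. This is an illustration only, not the authors' compiler: `BoundlessBlock` is a hypothetical class, and returning 0 for an out-of-bounds read that was never written is an assumption made for the example.

```python
class BoundlessBlock:
    """Memory block whose out-of-bounds writes are kept in a side hash table."""

    def __init__(self, size):
        self._data = bytearray(size)
        self._overflow = {}  # offset -> value for out-of-bounds writes

    def write(self, offset, value):
        if 0 <= offset < len(self._data):
            self._data[offset] = value
        else:
            # Store the value away instead of overwriting neighboring memory.
            self._overflow[offset] = value

    def read(self, offset, default=0):
        if 0 <= offset < len(self._data):
            return self._data[offset]
        # Return the value of the corresponding out-of-bounds write, if any.
        # The default for never-written offsets is an assumption.
        return self._overflow.get(offset, default)


request = b"GET /index"  # 10 bytes, larger than the 4-byte block below
block = BoundlessBlock(4)
for offset, byte in enumerate(request):
    block.write(offset, byte)

echoed = bytes(block.read(offset) for offset in range(len(request)))
print(echoed == request)  # the block behaves as if it had unbounded size
```

Because the overflowing bytes live in the dictionary rather than past the end of the block, the program reads back the full request while the storage actually allocated for the block never grows.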
Acceptability-Oriented Computing
- In 2003 ACM SIGPLAN Conference on Object-Oriented Programming Systems, Languages, and Applications Companion (OOPSLA ’03 Companion), Onwards! Session, 2003
Cited by 23 (7 self)
We discuss a new approach to the construction of software systems. Instead of attempting to build a system that is as free of errors as possible, the designer instead identifies key properties that the execution must satisfy to be acceptable to its users. Together, these properties define the acceptability envelope of the system: the region that it must stay within to remain acceptable. The developer then augments the system with a layered set of components, each of which enforces one of the acceptability properties. The potential advantages of this approach include more flexible, resilient systems that recover from errors and behave acceptably across a wide range of operating environments, an appropriately prioritized investment of engineering resources, and the ability to productively incorporate unreliable components into the final software system.
Automatic Data Structure Repair for Self-Healing Systems
- In Proceedings of the 1st Workshop on Algorithms and Architectures for Self-Managing Systems, 2003
Cited by 19 (1 self)
We have developed a system that accepts a specification of key data structure constraints, then dynamically detects and repairs violations of these constraints, enabling the program to recover from otherwise crippling errors to continue to execute productively. We present our experience using our system to repair violated constraints in a simplified version of the ext2 file system and in the CTAS air-traffic control program. Our experience indicates that the specifications are relatively straightforward to develop and that our technique enables the applications to effectively recover from data structure corruption errors.

