On the Emulation of Software Faults by Software Fault Injection
- In Proceedings of the International Conference on Dependable Systems and Networks, 2000
- Cited by 23 (4 self)
This paper presents an experimental study on the emulation of software faults by fault injection. In a first experiment, a set of real software faults has been compared with faults injected by a SWIFI tool (Xception) to evaluate the accuracy of the injected faults. Results revealed the limitations of Xception (and other SWIFI tools) in the emulation of different classes of software faults (about 44% of the software faults cannot be emulated). The use of field data about real faults was discussed and software metrics were suggested as an alternative to guide the injection process when field data is not available. In a second experiment, a set of rules for the injection of errors meant to emulate classes of software faults was evaluated. The fault triggers used seem to be the cause for the observed strong impact of the faults in the target system and in the program results. The results also show the influence in the fault emulation of aspects such as code size, complexity of data structures, and recursive versus sequential execution.
The Systematic Improvement of Fault Tolerance in the Rio File Cache
- In Proceedings of the 1999 Symposium on Fault-Tolerant Computing, 1999
- Cited by 20 (0 self)
Fault injection is typically used to characterize failures and to validate and compare fault-tolerant mechanisms. However, fault injection is rarely used for all these purposes to guide the design and implementation of a fault-tolerant system. We present a systematic and quantitative approach for using software-implemented fault injection to guide the design and implementation of a fault-tolerant system. Our system design goal is to build a write-back file cache on Intel PCs that is as reliable as a write-through file cache. We follow an iterative approach to improve robustness in the presence of operating system errors. In each iteration, we measure the reliability of the system, analyze the fault symptoms that lead to data corruption, and apply fault-tolerant mechanisms that address the fault symptoms. Our initial system is 13 times less reliable than a write-through file cache. The result of several iterations is a design that is both more reliable (1.9% vs. 3.1% corruption rate) a...
The Design and Verification of the Rio File Cache
- IEEE Transactions on Computers, 2001
- Cited by 4 (1 self)
Today's file systems are limited in speed and reliability by memory's vulnerability to operating system crashes. Because memory is viewed as unsafe, systems periodically write modified file data back to disk. These extra disk writes lower system performance, and the delay period before data is safe lowers reliability. The goal of the Rio (RAM I/O) file cache is to make ordinary main memory safe for persistent storage by enabling memory to survive operating system crashes. Reliable main memory enables the Rio file cache to be as reliable as a write-through file cache, where every write is safe instantly, and as fast as a pure write-back file cache, with no reliability-induced writes to disk. This paper describes the systematic, quantitative process we used to design and verify the Rio file cache on Intel PCs running FreeBSD and the reliability and performance of the resulting system.
Performance Evaluation of Checksum-Based ABFT
- In 16th IEEE International Symposium on Defect and Fault Tolerance in VLSI Systems (DFT’01), 2001
- Cited by 2 (0 self)
In algorithm-based fault tolerance (ABFT), fault tolerance is tailored to the algorithm performed. Most of the previous studies that compared ABFT schemes considered only error detection and correction capabilities. Some previous studies looked at the overhead but no previous work, as far as we know, compared different recovery schemes for data processing applications considering throughput as the main metric. In this work, we compare the performance of two recovery schemes, recomputing and ABFT correction, for different error rates. We consider errors that occur during computation as well as those that occur during the error detection, location and correction processes. A metric for performance evaluation of different design alternatives is defined. Results show that multiple error correction using ABFT has poorer performance than single error correction even at high error rates. We also present, implement and evaluate early detection in ABFT. In early detection, we try to detect the errors that occur in the checksum calculation before starting the actual computation. Early detection improves throughput in cases of intensive computations and cases of high error rates.
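For readers unfamiliar with checksum-based ABFT, the following is a minimal sketch of the classic checksum scheme for matrix multiplication, not the implementation evaluated in the paper. It assumes integer matrices so that checksums can be compared exactly (floating-point data would need a tolerance), and all function names are hypothetical. A column-checksum row is appended to A and a row-checksum column to B; a single corrupted data element of the product then shows up as exactly one inconsistent row and one inconsistent column, which locates it and allows correction.

```python
def matmul(a, b):
    """Plain matrix product on lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def with_column_checksum(a):
    """Append a row holding the sum of each column of a."""
    return a + [[sum(col) for col in zip(*a)]]


def with_row_checksum(b):
    """Append to each row of b the sum of that row."""
    return [row + [sum(row)] for row in b]


def locate_errors(c):
    """Return (bad_rows, bad_cols): data rows/columns whose checksum fails."""
    n, m = len(c) - 1, len(c[0]) - 1
    bad_rows = [i for i in range(n) if sum(c[i][:m]) != c[i][m]]
    bad_cols = [j for j in range(m)
                if sum(c[i][j] for i in range(n)) != c[n][j]]
    return bad_rows, bad_cols


def correct_single_error(c):
    """Repair one corrupted data element in place.

    Returns True if c is consistent afterwards, False if the error
    pattern is not a single data-element error (recompute instead).
    """
    bad_rows, bad_cols = locate_errors(c)
    if not bad_rows and not bad_cols:
        return True
    if len(bad_rows) != 1 or len(bad_cols) != 1:
        return False
    i, j = bad_rows[0], bad_cols[0]
    m = len(c[0]) - 1
    # Row checksum minus the other (trusted) elements of the row.
    c[i][j] = c[i][m] - (sum(c[i][:m]) - c[i][j])
    return locate_errors(c) == ([], [])


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
c = matmul(with_column_checksum(a), with_row_checksum(b))
c[0][1] = 99                      # simulate a single computation error
located = locate_errors(c)        # one bad row, one bad column
repaired = correct_single_error(c)
```

When correction is not possible (for example, several elements are corrupted), the fallback in this sketch is to recompute the product, which corresponds to the "recomputing" recovery scheme the abstract compares against.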
A Tool for Examining the Behaviour of Faults and Errors in Software Revision
- 2000
- Cited by 1 (1 self)
This report describes the Propagation Analysis Environment (PROPANE), a desktop environment for conducting experiments with error injection and fault injection in order to analyse the propagation and effects of errors and faults in software systems. PROPANE supports the injection of a variety of error types into variables of a software system, as well as controlled injection of faults (by mutation of the source code). PROPANE also has support for various types of probes that can be used to log the values of variables and the occurrences of events during software execution. PROPANE is mainly aimed at, and was specifically developed for, the analysis and evaluation of software for single-node embedded control systems, although due to its general nature it may be used in many other areas.
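The idea of injecting an error into a variable and using probes to observe its propagation can be illustrated with a small sketch. This is not PROPANE's interface; the toy controller, the `run` harness and every name below are assumptions made purely for illustration. A fault-free "golden" run is logged, a second run flips one bit of a variable at a chosen step, and the two probe logs are compared to find where the behaviour first diverges.

```python
def flip_bit(value, bit, width=16):
    """Return value with one bit inverted, truncated to `width` bits."""
    return (value ^ (1 << bit)) & ((1 << width) - 1)


def controller_step(setpoint, measured):
    """Toy proportional controller standing in for the target software."""
    error = setpoint - measured
    return max(0, min(255, 2 * error))


def run(steps, inject_at=None, bit=0):
    """Run the toy control loop and return the probe log.

    If inject_at is given, one bit of `measured` is flipped at that step
    (the injected error). Each log entry is (step, measured, output).
    """
    log = []
    measured = 0
    for t in range(steps):
        if t == inject_at:
            measured = flip_bit(measured, bit)
        out = controller_step(100, measured)
        log.append((t, measured, out))   # probe: record variable values
        measured += out // 8             # toy plant response
    return log


def first_divergence(golden, faulty):
    """Index of the first differing log entry, or None if identical."""
    for i, (g, f) in enumerate(zip(golden, faulty)):
        if g != f:
            return i
    return None


golden = run(10)
faulty = run(10, inject_at=3, bit=6)
diverged_at = first_divergence(golden, faulty)
```

Comparing against a golden run is one common way to analyse such logs; which variables to probe and which bits to flip are experiment design choices that this sketch fixes arbitrarily.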
Knowledge-Based Management Of Legacy Codes For Automated Design
- 1996
- Cited by 1 (0 self)
Abstract of the dissertation "Knowledge-Based Management of Legacy Codes for Automated Design" by John Eric Keane (Dissertation Director: Thomas Ellman). Systems for automated design optimization of complex real-world objects can, in principle, be constructed by combining domain-independent numerical routines with existing domain-specific analysis and simulation programs. Such "legacy" analysis codes are frequently unsuitable for use in automated design. They may crash for large classes of input, be locally non-smooth, or be highly sensitive to control parameters. To be useful, analysis programs must first be modified to reduce or eliminate only the undesired behaviors, without altering the desired computation. To do this by direct modification of the programs is labor-intensive, and necessitates costly re-validation. This dissertation describes research into how legacy analysis codes can be usefully employed in design automation systems. We show that recovery from failure is possible when the failur...
Algorithm-Based Fault Tolerance: A Performance Perspective Based on Error Rate
- Cited by 1 (1 self)
In algorithm-based fault tolerance (ABFT), the fault tolerance scheme is tailored to the algorithm performed. Most of the previous studies that compared various ABFT schemes considered only their error detection and correction capabilities. Some previous studies looked at the overhead in general but no previous work, as far as we know, compared different ABFT schemes considering performance as the main metric. In this work, we compare the performance of two ABFT error recovery schemes, recomputing and correction, for different error rates. We consider errors that happen during computation as well as those that happen during the error detection, location and correction process. The metrics we use are success ratio and completion time. Results show that multiple error correction using ABFT has worse performance than single error correction. They also show that error rate is an essential factor in making one scheme better than another in terms of performance.
Algorithm-Based Fault Tolerance: A Performance Perspective Based on Error Rate
- A. Al-Yamani, N. Oh and E. J. McCluskey, 2001
In algorithm-based fault tolerance (ABFT), the fault tolerance scheme is tailored to the algorithm performed. Most of the previous studies that compared various ABFT schemes considered only their error detection and correction capabilities. Some previous studies looked at the overhead in general but no previous work, as far as we know, compared different ABFT schemes considering performance as the main metric. In this work, we compare the performance of two ABFT error recovery schemes, recomputing and correction, for different error rates. We consider errors that happen during computation as well as those that happen during the error detection, location and correction process. The metrics we use are success ratio and completion time. Results show that multiple error correction using ABFT has worse performance than single error correction. They also show that error rate is an essential factor in making one scheme better than another in terms of performance.

