Xception: A Technique for the Experimental Evaluation of Dependability in Modern Computers
- IEEE Transactions on Software Engineering
, 1998
Cited by 71 (3 self)
An important step in the development of dependable systems is the validation of their fault tolerance properties. Fault injection has been widely used for this purpose; however, with the rapid increase in processor complexity, traditional techniques are increasingly difficult to apply. This paper presents a new software-implemented fault injection and monitoring environment, called Xception, which targets modern, complex processors. Xception uses the advanced debugging and performance monitoring features present in most modern processors to inject realistic faults by software and to monitor in detail the activation of the faults and their impact on the target system's behavior. Faults are injected with minimal interference with the target application: the application is not modified, no software traps are inserted, and it does not need to run in a special trace mode (it executes at full speed). Xception provides a comprehensive set of fault triggers, including spatial and temporal fault triggers and triggers related to the manipulation of data in memory. Faults injected by Xception can affect any process running on the target system (including the kernel), and faults can be injected into applications for which the source code is not available. Experimental results are presented to demonstrate the accuracy and potential of Xception in evaluating the dependability properties of complex computer systems.
Experimental Evaluation of the Fail-Silent Behavior
- in Programs with Consistency Checks, Proc. FTCS-26
, 1994
Cited by 44 (5 self)
Previous work has shown that, using only simple behavior-based error detection mechanisms invisible to the programmer (e.g. memory protection), the percentage of fail-silent violations can be higher than 10%. Since the study of these errors has shown that they were mostly pure data errors, in this paper we evaluate the effectiveness of software techniques that check the semantics of the data, such as ABFT and assertions, in detecting these remaining errors. The results of injecting physical pin-level faults show that these tests can prevent about 40% of the fail-silent model violations that escaped the simple hardware-based error detection techniques. Moreover, the analysis of the remaining errors has shown that most of them remained undetected due to short-range control flow breaks. When very simple software-based control flow checking was combined with the semantic tests, the target system behaved, without any dedicated error detection hardware, according to the fail-silent model for more than 98% of all the faults injected.
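As an illustration of the kind of semantic check that ABFT provides, the sketch below applies the classic checksum scheme for matrix multiplication. It is only an assumed example: the abstract does not say which ABFT checks or target programs were used, and all function names here are hypothetical.

```python
def matmul(a, b):
    """Plain matrix product of two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def with_column_checksum(a):
    """Append one extra row holding the sum of each column of a."""
    return a + [[sum(col) for col in zip(*a)]]


def with_row_checksum(b):
    """Append to each row of b one extra element holding that row's sum."""
    return [row + [sum(row)] for row in b]


def checksums_hold(c_full):
    """True when the checksum row and column agree with the data block."""
    data = [row[:-1] for row in c_full[:-1]]
    rows_ok = all(sum(row) == full[-1]
                  for row, full in zip(data, c_full[:-1]))
    cols_ok = all(sum(col) == chk
                  for col, chk in zip(zip(*data), c_full[-1][:-1]))
    return rows_ok and cols_ok


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]

# Multiplying the checksum-extended operands yields a product that
# carries its own row and column checksums.
c_full = matmul(with_column_checksum(a), with_row_checksum(b))
clean = checksums_hold(c_full)

# Simulate a pure data error in one element of the result.
c_full[0][1] += 1
corrupted = checksums_hold(c_full)
```

Here `clean` is true and `corrupted` is false, since the altered element no longer matches its row and column sums. The example uses integers so the comparison is exact; with floating-point data the check would need a tolerance.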
Checking Linked Data Structures
, 1994
Cited by 24 (0 self)
In the program checking paradigm, the original program is run on the desired input, and its output is checked by another program called a checker. Recently, the notion of program checking has been extended from its original formulation of checking functions to checking a sequence of operations which query and alter the state of an object external to the program, e.g., checking the interactions between a client and the manager (server) of a data structure. In this expanded paradigm, the checker acts as an intermediary between the client, which generates the requests, and the server, which processes them. The checker is allowed a small amount of reliable memory and may provide a probabilistic guarantee of correctness for the client. We present off-line and on-line checkers for data structures such as linked lists, trees, and graphs. Previously, the only data structures for which such checkers existed were random access memories, stacks, and queues.
Algorithm-Based Diskless Checkpointing for Fault Tolerant Matrix Operations
, 1995
Cited by 20 (6 self)
This paper is an exploration of diskless checkpointing for distributed scientific computations. With the widespread use of the "Network Of Workstation" (NOW) platform for distributed computing, long-running scientific computations need to tolerate the changing and often faulty nature of NOW environments. We present high-performance implementations of several algorithms for distributed scientific computing, including Cholesky factorization, LU factorization, QR factorization, and preconditioned conjugate gradient. These implementations are able to run on PVM networks of at least N processors, and can complete with low overhead as long as any N processors remain functional. We discuss the details of how the algorithms are tuned for fault-tolerance, and present the performance results on a PVM network of SUN workstations, and on the IBM SP2.
Fault Tolerant Matrix Operations for Networks of Workstations Using Diskless Checkpointing
, 1997
Cited by 19 (11 self)
Networks of workstations (NOWs) offer a cost effective platform for high-performance, long-running parallel computations. However, these computations must be able to tolerate the changing and often faulty nature of NOW environments. We present high-performance implementations of several fault-tolerant algorithms for distributed scientific computing. The fault-tolerance is based on diskless checkpointing, a paradigm that uses processor redundancy rather than stable storage as the fault-tolerant medium. These algorithms are able to run on clusters of workstations that change over time due to failure, load or availability. As long as there are at least n processors in the cluster, and failures occur singly, the computation will complete in an efficient manner. We discuss the details of how the algorithms are tuned for fault-tolerance and present the performance results on a PVM network of Sun workstations connected by a fast, switched ethernet.
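The idea of using processor redundancy instead of stable storage can be sketched with a parity checkpoint. This is an assumption for illustration only: the abstract does not state which encoding the authors use, and the names and data below are hypothetical. One extra processor holds the bytewise XOR of all in-memory checkpoints, which is enough to rebuild the state of any single failed processor.

```python
from functools import reduce


def xor_bytes(x, y):
    """Bytewise XOR of two equal-length byte strings."""
    return bytes(p ^ q for p, q in zip(x, y))


def parity_of(checkpoints):
    """XOR together a list of equal-length checkpoints."""
    return reduce(xor_bytes, checkpoints)


# In-memory checkpoints of three compute processors (hypothetical data).
states = [b"proc0 st", b"proc1 st", b"proc2 st"]

# The redundant processor stores only the parity, not a copy on disk.
parity = parity_of(states)

# Processor 1 fails; XOR of the survivors and the parity restores it.
failed = 1
survivors = [s for i, s in enumerate(states) if i != failed]
recovered = parity_of(survivors + [parity])
```

After this runs, `recovered` equals the lost checkpoint `states[1]`. A single parity block tolerates only one failure at a time, which matches the abstract's condition that failures occur singly.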
Software Fault Tolerance: A Tutorial
, 2000
Cited by 19 (0 self)
Since its founding, NASA has been dedicated to the advancement of aeronautics and space science. The NASA Scientific and Technical Information (STI) Program Office plays a key part in helping NASA maintain this important role. The NASA STI Program Office is operated by Langley Research Center, the lead center for NASA's scientific and technical information. The NASA STI Program Office provides access to the NASA STI Database, the largest collection of aeronautical and space science STI in the world. The Program Office is also NASA's institutional mechanism for disseminating the results of its research and development activities. These results are published by NASA in the NASA STI Report Series, which includes the following report types: TECHNICAL PUBLICATION. Reports of completed research or a major significant phase of research that present the results of NASA programs and include extensive data or theoretical analysis. Includes compilations of significant scientific and technical data and information deemed to be of continuing reference value. NASA counterpart of peer-reviewed formal professional papers, but having less stringent limitations on manuscript length and extent of graphic presentations. TECHNICAL MEMORANDUM. Scientific and technical findings that are preliminary or of specialized interest, e.g., quick release reports, working papers, and bibliographies that contain minimal annotation. Does not contain extensive analysis.
An Improved Rate-Monotonic Admission Control and Its Applications
- IEEE Transactions on Computers
, 2003
Cited by 18 (0 self)
Rate-monotonic scheduling (RMS) is a widely used real-time scheduling technique. This paper proposes RBound, a new admission control for RMS. RBound has two interesting properties. First, it achieves high processor utilization under certain conditions. We show how to obtain these conditions in a multiprocessor environment and propose a multiprocessor scheduling algorithm that achieves a near optimal processor utilization. Second, the framework developed for RBound remains close to the original RMS framework (that is, task dispatching is still done via a fixed-priority scheme based on the task periods). In particular, we show how RBound can be used to guarantee a timely recovery in the presence of faults and still achieve high processor utilization. We also show how RBound can be used to increase the processor utilization when aperiodic tasks are serviced by a priority exchange server or a deferrable server. Index Terms: Real-time, scheduling, rate monotonic, operating systems.
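For context, a utilization-based admission control for RMS can be sketched with the classic Liu and Layland bound, n(2^(1/n) - 1). This is the well-known baseline test, not RBound itself, whose formula the abstract does not give; the function names are hypothetical.

```python
def liu_layland_bound(n):
    """Classic RMS utilization bound for n periodic tasks."""
    return n * (2 ** (1.0 / n) - 1)


def admit(tasks):
    """Sufficient (not necessary) RMS admission test.

    tasks is a list of (execution_time, period) pairs. A task set that
    fails this test may still be schedulable under RMS.
    """
    if not tasks:
        return True
    utilization = sum(c / t for c, t in tasks)
    return utilization <= liu_layland_bound(len(tasks))


light = admit([(1, 4), (1, 5), (2, 10)])   # utilization 0.65
heavy = admit([(2, 4), (2, 5), (1, 10)])   # utilization 1.0
```

The first task set is admitted because its utilization of 0.65 is below the three-task bound of roughly 0.78; the second is rejected. An admission control that accepts more task sets than this baseline is what lets a scheduler reach higher processor utilization.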
A source-to-source compiler for generating dependable software
, 2001
Cited by 18 (1 self)
In recent years, an increasing number of safety-critical tasks have been assigned to computer systems. In particular, safety-critical computer-based applications are reaching market areas where cost is a major issue, and thus solutions are required that combine fault tolerance with low cost. In this paper, a source-to-source compiler supporting a Software-Implemented Hardware Fault Tolerance approach is proposed, based on a set of source code transformation rules. The proposed approach hardens a program against transient memory errors by introducing software redundancy: every computation is performed twice and results are compared, and control-flow invariants are checked explicitly. By exploiting the tool's capabilities, several benchmark applications have been hardened against transient errors. Fault injection campaigns have been performed to evaluate the fault detection capability of the hardened applications. In addition, we analyzed the proposed approach in terms of space and time overheads.
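The effect of such a transformation can be sketched by hand on a small routine. The code below is a hypothetical illustration in Python, not output of the compiler described in the abstract: each variable is kept in two copies, every update is applied to both, the copies are compared, and a simple control-flow invariant is checked at the end. The `inject_at` parameter exists only to simulate a transient error for demonstration.

```python
class FaultDetected(Exception):
    """Raised when the duplicated copies or an invariant disagree."""


def hardened_sum(values, inject_at=None):
    """Sum values with duplicated computation and consistency checks."""
    total_a = 0   # first copy of the accumulator
    total_b = 0   # second, redundant copy
    steps = 0     # iteration counter used as a control-flow invariant
    for index, v in enumerate(values):
        total_a += v
        total_b += v
        if index == inject_at:
            total_a ^= 1 << 3   # simulated transient bit flip in one copy
        if total_a != total_b:
            raise FaultDetected("accumulator copies differ")
        steps += 1
    if steps != len(values):
        raise FaultDetected("loop executed an unexpected number of times")
    return total_a


clean = hardened_sum([1, 2, 3])
try:
    hardened_sum([1, 2, 3], inject_at=1)
    detected = False
except FaultDetected:
    detected = True
```

The fault-free call returns 6, and the call with the simulated bit flip sets `detected` to true. The doubled variables and extra comparisons are where the space and time overheads mentioned in the abstract come from.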
Checking Mergeable Priority Queues
- In Digest of the 24th Symposium on Fault-Tolerant Computing
, 1994
Cited by 17 (4 self)
We present an efficient algorithm which can check the answers given by the fundamental abstract data types priority queues and mergeable priority queues. This is the first linear-time checker for mergeable priority queues. These abstract data types are widely used in routing, scheduling, simulation, computational geometry and many other algorithmic domains. We have implemented our answer checker and have performed experiments comparing the speed of our checker to recently benchmarked priority queue and mergeable priority queue implementations, and our checker is substantially faster than the best of these implementations. 1 Introduction This paper concerns the fundamental abstract data types of priority queues (PQs) and mergeable priority queues (MPQs). These abstract data types have been recognized as centrally important from the early days of computer-algorithm design. They appear in seminal algorithm texts such as Knuth's [10] and Aho, Hopcroft and Ullman's [1]. Data structure impl...
Tolerance to Multiple Transient Faults for Aperiodic Tasks in Hard Real-Time Systems
- IEEE Transactions on Computers
, 1999
Cited by 16 (0 self)
Real-time systems are being increasingly used in several applications which are time-critical in nature. Fault tolerance is an essential requirement of such systems, due to the catastrophic consequences of not tolerating faults. In this paper, we study a scheme that guarantees the timely recovery from multiple faults within hard real-time constraints in uniprocessor systems. Assuming earliest-deadline-first scheduling (EDF) for aperiodic preemptive tasks, we develop a necessary and sufficient feasibility-check algorithm for fault-tolerant scheduling with complexity O(n^2 · k), where n is the number of tasks to be scheduled and k is the maximum number of faults to be tolerated. INDEX TERMS: Real-time scheduling, earliest-deadline first, fault-tolerant schedules, fault recovery. 1 Introduction The interest in embedded systems has been growing steadily in the recent past, specially those systems in which timing constraints are essential for the correct execution of the sys...

