Results 1 - 5 of 5
Closure and Convergence: A Foundation of Fault-Tolerant Computing
- IEEE Transactions on Software Engineering, 1993
Cited by 103 (28 self)
We give a formal definition of what it means for a system to "tolerate" a class of "faults". The definition consists of two conditions: One, if a fault occurs when the system state is within a set of "legal" states, the resulting state is within some larger set and, if faults continue occurring, the system state remains within that larger set (Closure). And two, if faults stop occurring, the system eventually reaches a state within the legal set (Convergence). We demonstrate the applicability of our definition for specifying and verifying the fault-tolerance properties of a variety of digital and computer systems. Further, using the definition, we obtain a simple classification of fault-tolerant systems and discuss methods for their systematic design. ... as traditionally been studied in the context of specifi...
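The two conditions in this abstract can be checked mechanically on a small finite-state system. The following is a minimal sketch, not the paper's formalism: the counter program, the fault model, and the names `program_step`, `fault_step`, `is_closed`, and `converges` are all assumptions made for illustration.

```python
# Hypothetical toy system: a counter in 0..3 whose only legal state is 0.
LEGAL = {0}
SPAN = {0, 1, 2, 3}  # the "larger set" that faults can reach


def program_step(x):
    """Fault-free program action: count down and stay at 0 (assumed example)."""
    return {max(x - 1, 0)}


def fault_step(x):
    """Assumed fault class: the counter is corrupted to any value in SPAN."""
    return set(SPAN)


def is_closed(states, step):
    """Closure: every transition from a state in `states` stays in `states`."""
    return all(t in states for s in states for t in step(s))


def converges(span, legal, step):
    """Convergence: every fault-free execution from `span` reaches `legal`.

    Grows the set of states known to reach `legal` on all paths; a state
    qualifies once it has successors and all of them already qualify.
    """
    good = set(legal)
    pending = set(span) - good
    changed = True
    while changed:
        changed = False
        for s in list(pending):
            succ = step(s)
            if succ and all(t in good for t in succ):
                good.add(s)
                pending.remove(s)
                changed = True
    return not pending


def with_faults(x):
    """Program and fault transitions combined."""
    return program_step(x) | fault_step(x)


closure_ok = is_closed(LEGAL, program_step) and is_closed(SPAN, with_faults)
convergence_ok = converges(SPAN, LEGAL, program_step)
```

In this sketch the legal set is closed under program actions, the larger set is closed under program actions and faults together, and convergence is checked against program actions only, matching the "if faults stop occurring" condition.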
Component Based Design of Multitolerant Systems
- IEEE Transactions on Software Engineering, 1998
Cited by 49 (10 self)
The concept of multitolerance abstracts problems in system dependability and provides a basis for improved design of dependable systems. In the abstraction, each source of undependability in the system is represented as a class of faults, and the corresponding ability of the system to deal with that undependability source is represented as a type of tolerance. Multitolerance thus refers to the ability of the system to tolerate multiple fault-classes, each in a possibly different way. In this paper, we present a component based method for designing multitolerance. Two types of components are employed by the method, namely detectors and correctors. A theory of detectors, correctors, and their interference-free composition with intolerant programs is developed, that enables stepwise addition of components to provide tolerance to a new fault-class while preserving the tolerances to the previously added fault-classes. We illustrate the method by designing a fully distributed, mul...
Constraint Satisfaction as a Basis for Designing Nonmasking Fault-Tolerance
1996
Cited by 23 (9 self)
We present a method for the design of nonmasking fault-tolerant programs. In our method, a set of constraints is associated with each program. As long as faults do not occur, the constraints are continually satisfied under the execution of program actions. Whenever some of the constraints are violated, due to certain faults, all constraints are eventually reestablished by subsequent execution of the program actions. To design programs thus, two types of program actions are distinguished: "closure" actions and "convergence" actions. Closure actions are the actions that perform the intended computation of the program when all of the constraints are satisfied. Convergence actions are the actions that reestablish the constraints when they have been violated. Sufficient conditions for the validation of closure and convergence actions are formalized in terms of a "constraint graph". These conditions are illustrated by designing nonmasking fault-tolerant programs for diffusing computations, ...
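The split between closure actions and convergence actions described in this abstract can be illustrated with a toy program. This is a hedged sketch only: the single constraint `x == y`, the two actions, and the `run` helper are hypothetical, and the paper's constraint graph is not modeled here.

```python
def constraint(state):
    """Assumed single constraint: the two variables agree."""
    return state["x"] == state["y"]


def closure_action(state):
    """Intended computation, executed only while the constraint holds."""
    return {"x": state["x"] + 1, "y": state["y"] + 1}


def convergence_action(state):
    """Reestablishes the constraint after a fault has violated it."""
    return {"x": state["x"], "y": state["x"]}


def run(state, steps):
    """Execute `steps` actions, choosing the action by the constraint's truth."""
    for _ in range(steps):
        if constraint(state):
            state = closure_action(state)
        else:
            state = convergence_action(state)
    return state


# A fault has corrupted y; one convergence step repairs it, then the
# intended computation resumes.
recovered = run({"x": 3, "y": 7}, 1)
resumed = run({"x": 3, "y": 7}, 3)
```

Here the closure action preserves the constraint whenever it already holds, and the convergence action restores it in one step; a program with several interdependent constraints would need the ordering conditions the abstract attributes to the constraint graph.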
Component Based Design of Multitolerance
- IEEE Transactions on Software Engineering, 1998
Cited by 19 (9 self)
The concept of multitolerance abstracts problems in system dependability and provides a basis for improved design of dependable systems. In the abstraction, each source of undependability in the system is represented as a class of faults, and the corresponding ability of the system to deal with that undependability source is represented as a type of tolerance. Multitolerance thus refers to the ability of the system to tolerate multiple fault-classes, each in a possibly different way. In this paper, we present a component based method for designing multitolerance. Two types of components are employed by the method, namely detectors and correctors. A theory of detectors, correctors, and their interference-free composition with intolerant programs is developed, that enables stepwise addition of components to provide tolerance to a new fault-class while preserving the tolerances to the previously added fault-classes. We illustrate the method by designing a fully distributed, multitolerant ...
Fault-Tolerant Reconfiguration of Trees and Rings in Distributed Systems
Cited by 1 (0 self)
We design two programs that maintain the nodes of any distributed system in a rooted spanning tree and in a unidirectional ring, respectively, in the presence of any finite number of fail-stop failures and repairs of system nodes and communication channels. Our programs are fully distributed, have optimal time and space complexity, and illustrate two different methods for the design of nonmasking fault-tolerant programs.
Categories and Subject Descriptors
C.2.4 [Computer Communication Systems] Distributed Systems
D.1.3 [Programming Techniques] Concurrent Programming
D.2.4 [Program Verification] Reliability
D.2.10 [Program Design] Methodologies
D.4.5 [Operating Systems] Fault-tolerance
G.2.2 [Discrete Mathematics] Graph Algorithms
Research supported in part by NSF grant CCR-9308640 and OSU Grant 221506
1 Introduction
Cooperation between the nodes of distributed systems is commonly realized by organizing the nodes into a convenient logical structure such as a ring, a star, or a tre...

