Distributed Reset
IEEE Transactions on Computers, 1990
Cited by 137 (23 self)
We design a reset subsystem that can be embedded in an arbitrary distributed system in order to allow the system processes to reset the system when necessary. Our design is layered, and comprises three main components: a leader election, a spanning tree construction, and a diffusing computation. Each of these components is self-stabilizing in the following sense. If the coordination between the up processes in the system is ever lost (due to failures or repairs of processes and channels) then each component eventually reaches a state where coordination is regained. This capability makes our reset subsystem very robust: it can tolerate fail-stop failures and repairs of processes and channels even when a reset is in progress. Categories and Subject Descriptors: C.2.4 [Computer Communication Systems]: Distributed Systems--distributed applications, network operating systems ; D.1.3 [Programming Techniques]: Concurrent Programming ; D.4.5 [Operating Systems]: Reliability--verification, fa...
Closure and Convergence: A Foundation of Fault-Tolerant Computing
IEEE Transactions on Software Engineering, 1993
Cited by 103 (28 self)
We give a formal definition of what it means for a system to "tolerate" a class of "faults". The definition consists of two conditions: One, if a fault occurs when the system state is within a set of "legal" states, the resulting state is within some larger set and, if faults continue occurring, the system state remains within that larger set (Closure). And two, if faults stop occurring, the system eventually reaches a state within the legal set (Convergence). We demonstrate the applicability of our definition for specifying and verifying the fault-tolerance properties of a variety of digital and computer systems. Further, using the definition, we obtain a simple classification of fault-tolerant systems and discuss methods for their systematic design.
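The two conditions in this abstract can be checked mechanically on a small finite system. The sketch below is only an illustration under stated assumptions: the counter system, the names `LEGAL`, `FAULT_SPAN`, `program`, `fault`, `is_closed`, and `converges` are all hypothetical and do not come from the paper. It treats the "larger set" as a fault span, checks that both sets are closed under the relevant steps, and checks that every fault-free execution from the fault span reaches the legal set.

```python
# Toy system (an assumption for illustration): a counter x in 0..3.
LEGAL = {0}                 # the "legal" states
FAULT_SPAN = {0, 1, 2, 3}   # the "larger set" that faults can lead to


def program(x):
    """Program step: count down toward 0 and stay there."""
    return {max(x - 1, 0)}


def fault(x):
    """Fault step: bump the counter, but never beyond 3."""
    return {x + 1} if x < 3 else set()


def is_closed(states, steps):
    """Closure: every step taken from `states` lands back in `states`."""
    return all(t in states for s in states for t in steps(s))


def converges(span, legal, steps):
    """Convergence: every execution of `steps` from `span` reaches `legal`.

    Grows the set of states from which all paths reach `legal` until
    nothing changes, then checks that this set covers `span`.
    """
    good = set(legal)
    changed = True
    while changed:
        changed = False
        for s in span - good:
            succ = steps(s)
            if succ and all(t in good for t in succ):
                good.add(s)
                changed = True
    return good >= span


print("legal set closed under program:", is_closed(LEGAL, program))
print("fault span closed under program:", is_closed(FAULT_SPAN, program))
print("fault span closed under faults:", is_closed(FAULT_SPAN, fault))
print("fault span converges to legal set:", converges(FAULT_SPAN, LEGAL, program))
```

The legal set is deliberately not closed under `fault`, which is what makes the larger fault span necessary in this toy example.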
Fundamentals of Fault-Tolerant Distributed Computing in Asynchronous Environments
ACM Computing Surveys, 1999
Cited by 57 (9 self)
Fault tolerance in distributed computing is a wide area with a significant body of literature that is vastly diverse in methodology and terminology. This paper aims at structuring the area and thus guiding readers into this interesting field. We use a formal approach to define important terms like fault, fault tolerance, and redundancy. This leads to four distinct forms of fault tolerance and to two main phases in achieving them: detection and correction. We show that this can help to reveal inherently fundamental structures that contribute to understanding and unifying methods and terminology. By doing this, we survey many existing methodologies and discuss their relations. The underlying system model is the close-to-reality asynchronous message-passing model of distributed computing.
Component Based Design of Multitolerant Systems
IEEE Transactions on Software Engineering, 1998
Cited by 49 (10 self)
The concept of multitolerance abstracts problems in system dependability and provides a basis for improved design of dependable systems. In the abstraction, each source of undependability in the system is represented as a class of faults, and the corresponding ability of the system to deal with that undependability source is represented as a type of tolerance. Multitolerance thus refers to the ability of the system to tolerate multiple fault-classes, each in a possibly different way. In this paper, we present a component based method for designing multitolerance. Two types of components are employed by the method, namely detectors and correctors. A theory of detectors, correctors, and their interference-free composition with intolerant programs is developed, that enables stepwise addition of components to provide tolerance to a new fault-class while preserving the tolerances to the previously added fault-classes. We illustrate the method by designing a fully distributed, mul...
Constraint Satisfaction as a Basis for Designing Nonmasking Fault-Tolerance
1996
Cited by 23 (9 self)
We present a method for the design of nonmasking fault-tolerant programs. In our method, a set of constraints is associated with each program. As long as faults do not occur, the constraints are continually satisfied under the execution of program actions. Whenever some of the constraints are violated, due to certain faults, all constraints are eventually reestablished by subsequent execution of the program actions. To design programs thus, two types of program actions are distinguished: "closure" actions and "convergence" actions. Closure actions are the actions that perform the intended computation of the program when all of the constraints are satisfied. Convergence actions are the actions that reestablish the constraints when they have been violated. Sufficient conditions for the validation of closure and convergence actions are formalized in terms of a "constraint graph". These conditions are illustrated by designing nonmasking fault-tolerant programs for diffusing computations, ...
Component Based Design of Multitolerance
IEEE Transactions on Software Engineering, 1998
Cited by 19 (9 self)
The concept of multitolerance abstracts problems in system dependability and provides a basis for improved design of dependable systems. In the abstraction, each source of undependability in the system is represented as a class of faults, and the corresponding ability of the system to deal with that undependability source is represented as a type of tolerance. Multitolerance thus refers to the ability of the system to tolerate multiple fault-classes, each in a possibly different way. In this paper, we present a component based method for designing multitolerance. Two types of components are employed by the method, namely detectors and correctors. A theory of detectors, correctors, and their interference-free composition with intolerant programs is developed, that enables stepwise addition of components to provide tolerance to a new fault-class while preserving the tolerances to the previously added fault-classes. We illustrate the method by designing a fully distributed, multitolerant ...
Synthesis of Concurrent Programs for an Atomic Read/Write Model of Computation
in PODC96, 2001
Cited by 18 (1 self)
In this paper, we show how to mechanically synthesize solutions to synchronization problems in more realistic computational models. We illustrate the method by synthesizing Peterson's solution to the mutual exclusion problem.
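For readers unfamiliar with Peterson's solution, the sketch below explores every interleaving of the two-process algorithm and checks that both processes are never in the critical section together. It is not the synthesis method of the paper, only a brute-force check of the well-known algorithm. Each step is a single read or a single write of one shared variable, in the spirit of an atomic read/write model; the program-counter encoding and the names `successors` and `explore` are assumptions made for this illustration.

```python
from collections import deque

# State: (pcs, flags, turn), where pcs[i] is the program counter of process i.
#   pc 0: write flag[i] := True
#   pc 1: write turn := j         (j is the other process)
#   pc 2: read flag[j]            (enter if False)
#   pc 3: read turn               (enter if turn != j, else retry from pc 2)
#   pc 4: critical section; leaving it writes flag[i] := False
CRITICAL = 4
INITIAL = ((0, 0), (False, False), 0)


def successors(state):
    """Yield every state reachable by one atomic step of either process."""
    pcs, flags, turn = state
    for i in (0, 1):
        j = 1 - i
        new_flags, new_turn = list(flags), turn
        if pcs[i] == 0:
            new_flags[i] = True
            nxt = 1
        elif pcs[i] == 1:
            new_turn = j
            nxt = 2
        elif pcs[i] == 2:
            nxt = 3 if flags[j] else CRITICAL
        elif pcs[i] == 3:
            nxt = 2 if turn == j else CRITICAL
        else:
            new_flags[i] = False
            nxt = 0
        new_pcs = list(pcs)
        new_pcs[i] = nxt
        yield (tuple(new_pcs), tuple(new_flags), new_turn)


def explore(initial):
    """Breadth-first search over all states reachable from `initial`."""
    seen = {initial}
    queue = deque([initial])
    while queue:
        for nxt in successors(queue.popleft()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


reachable = explore(INITIAL)
violations = [s for s in reachable if s[0] == (CRITICAL, CRITICAL)]
print("reachable states:", len(reachable))
print("mutual exclusion holds:", not violations)
```

The search covers safety only; it says nothing about progress or fairness, which would need a separate check.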
A Case-Study in Component-Based Mechanical Verification of Fault-Tolerant Programs
In Proceedings of 4th Workshop on Self-Stabilization. IEEE Computer Society, 1999
Cited by 12 (4 self)
In this paper, we present a case study to demonstrate that the decomposition of a fault-tolerant program into its components is useful in its mechanical verification. More specifically, we discuss our experience in using the theorem prover PVS to verify Dijkstra's token ring program in a component-based manner. We also demonstrate the advantages of component-based mechanical verification.
Keywords: Component-based verification, Fault-tolerance, Program decomposition, Mechanical verification, Self-stabilization
1 Introduction
In this paper, we argue that the decomposition of a fault-tolerant program into its components is beneficial in its mechanical verification, and that such a decomposition admits reuse of the proofs for other fault-tolerant programs as well as the variations of the given fault-tolerant program. Arora and Kulkarni [3] have shown that a fault-tolerant program can be decomposed into a fault-intolerant program and a set of `tolerance'-components, namely detectors and...
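As background for the program being verified, here is a minimal simulation of Dijkstra's K-state token ring as it is commonly presented, not the PVS formalization or the component decomposition used in the paper. The assumptions are: N processes in a ring with counters in 0..K-1 and K >= N, one privileged process moving at a time, and a randomly chosen schedule. The function names `privileged`, `move`, and `stabilize` are hypothetical.

```python
import random


def privileged(x):
    """Return the indices of processes that currently hold a privilege."""
    n = len(x)
    result = [0] if x[0] == x[n - 1] else []
    result.extend(i for i in range(1, n) if x[i] != x[i - 1])
    return result


def move(x, i, k):
    """Let privileged process i take its step."""
    if i == 0:
        x[0] = (x[0] + 1) % k   # the distinguished process increments
    else:
        x[i] = x[i - 1]         # every other process copies its predecessor


def stabilize(x, k, rng, max_steps=10_000):
    """Run privileged processes until exactly one privilege remains.

    Returns the number of moves taken. The step cap is a safety net for
    this sketch, not part of the algorithm.
    """
    steps = 0
    while len(privileged(x)) != 1:
        if steps >= max_steps:
            raise RuntimeError("did not stabilize within max_steps")
        move(x, rng.choice(privileged(x)), k)
        steps += 1
    return steps


K = 5
ring = [3, 1, 4, 1, 2]          # an arbitrary, possibly corrupted, start state
rng = random.Random(0)
taken = stabilize(ring, K, rng)
print("stabilized after", taken, "moves; state:", ring)
print("privileged processes:", privileged(ring))
```

Once a single privilege remains, each further move passes it to the next process in the ring, so the stabilized state is preserved by the program's own steps.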
Maintaining Digital Clocks In Step
1992
Cited by 4 (1 self)
A system of simultaneously triggered clocks is designed to be stabilizing: if the clock values ever differ, the system is guaranteed to converge to a state where all clock values are identical, and are subsequently maintained to be identical. For an N-clock system, the design uses N registers of 2 log N bits each and guarantees convergence to identical values within N^2 "triggers".
Keywords: stabilization, reliability, distributed algorithms, digital clocks, convergence.
1 Introduction
Digital systems are often designed to be synchronous; that is, a system-wide clock pulse is used to ensure that system parts operate in "lock-step". One building block that is commonly used in the design of such systems is a digital clock. Operationally, the task of a digital clock is to maintain a count of the number of clock pulses as they occur. The use of digital clocks in circuit design is frequent; for example, • Synchronization: A circuit may include some parts that need to synchronize wit...
Modular Progress Proofs Of Asynchronous Programs
1993
Cited by 4 (0 self)
Table of Contents
1. Introduction
  1.1 Subject
  1.2 Method
  1.3 Organization
  1.4 Caveats
I Theory
2. Preliminaries
  2.1 Notational conventions
  2.2 Predicate transformers
  2.3 Programs
  2.4 Properties
  2.5 Closures
  2.6 Guarding
3. Cont...

