Fundamentals of Fault-Tolerant Distributed Computing in Asynchronous Environments
- ACM Computing Surveys, 1999
Cited by 57 (9 self)
Fault tolerance in distributed computing is a wide area with a significant body of literature that is vastly diverse in methodology and terminology. This paper aims at structuring the area and thus guiding readers into this interesting field. We use a formal approach to define important terms like fault, fault tolerance, and redundancy. This leads to four distinct forms of fault tolerance and to two main phases in achieving them: detection and correction. We show that this can help to reveal inherently fundamental structures that contribute to understanding and unifying methods and terminology. By doing this, we survey many existing methodologies and discuss their relations. The underlying system model is the close-to-reality asynchronous message-passing model of distributed computing.
Component Based Design of Multitolerant Systems
- IEEE Transactions on Software Engineering, 1998
Cited by 49 (10 self)
The concept of multitolerance abstracts problems in system dependability and provides a basis for improved design of dependable systems. In the abstraction, each source of undependability in the system is represented as a class of faults, and the corresponding ability of the system to deal with that undependability source is represented as a type of tolerance. Multitolerance thus refers to the ability of the system to tolerate multiple fault-classes, each in a possibly different way. In this paper, we present a component based method for designing multitolerance. Two types of components are employed by the method, namely detectors and correctors. A theory of detectors, correctors, and their interference-free composition with intolerant programs is developed, that enables stepwise addition of components to provide tolerance to a new fault-class while preserving the tolerances to the previously added fault-classes. We illustrate the method by designing a fully distributed, mul...
Self-Stabilizing Distributed Constraint Satisfaction
1991
Cited by 30 (3 self)
Distributed architectures and solutions are described for classes of constraint satisfaction problems, called network consistency problems. An inherent assumption of these architectures is that the communication network mimics the structure of the constraint problem. The solutions are required to be self-stabilizing and to treat arbitrary networks, which makes them suitable for dynamic or error-prone environments. We first show that even for relatively simple constraint networks, such as rings, there is no self-stabilizing solution that guarantees convergence from every initial state of the system using a completely uniform, asynchronous model (where all processors are identical). An almost-uniform, asynchronous, network consistency protocol with one specially designated node is shown and proven correct. We also show that some restricted topologies such as trees can accommodate the uniform, asynchronous model when neighboring nodes cannot take simultaneous steps.
Designing Masking Fault-tolerance via Nonmasking Fault-tolerance (Extended Abstract)
- IEEE Transactions on Software Engineering, 1998
Cited by 27 (11 self)
Masking fault-tolerance guarantees that programs continually satisfy their specification in the presence of faults. By way of contrast, nonmasking fault-tolerance does not guarantee as much: it merely guarantees that when faults stop occurring, program executions converge to states from where programs continually (re)satisfy their specification. In this paper, we show that a practical method to design masking fault-tolerance is to first design nonmasking fault-tolerance and to then transform the nonmasking fault-tolerant program minimally so as to achieve masking fault-tolerance. We demonstrate this method by designing novel fully distributed programs for termination detection, mutual exclusion, and leader election, that are masking tolerant of any finite number of process fail-stops and/or repairs.
SAT-Based Synthesis of Fault-Tolerance
Cited by 20 (12 self)
We present a technique where we use SAT solvers in automatic synthesis of fault-tolerant distributed programs from their fault-intolerant version. Since adding fault-tolerance to distributed programs is NP-complete, we use state-of-the-art SAT solvers to benefit from efficient heuristics integrated in SAT solvers to deal with the exponential complexity of adding fault-tolerance. Also, such a SAT-based technique has the potential to use multiple instances of SAT solvers simultaneously so that independent sub-problems can be solved in parallel during synthesis.
Component Based Design of Multitolerance
- IEEE Transactions on Software Engineering, 1998
Cited by 19 (9 self)
The concept of multitolerance abstracts problems in system dependability and provides a basis for improved design of dependable systems. In the abstraction, each source of undependability in the system is represented as a class of faults, and the corresponding ability of the system to deal with that undependability source is represented as a type of tolerance. Multitolerance thus refers to the ability of the system to tolerate multiple fault-classes, each in a possibly different way. In this paper, we present a component based method for designing multitolerance. Two types of components are employed by the method, namely detectors and correctors. A theory of detectors, correctors, and their interference-free composition with intolerant programs is developed, that enables stepwise addition of components to provide tolerance to a new fault-class while preserving the tolerances to the previously added fault-classes. We illustrate the method by designing a fully distributed, multitolerant ...
Graybox Stabilization
2001
Cited by 8 (7 self)
Research in system stabilization has traditionally relied on the availability of a complete system implementation. As such, it would appear that the scalability and reusability of stabilization is limited in practice. Towards redressing this perception, in this paper, we show for the first time that system stabilization may be designed knowing only the system specification but not the system implementation. We refer to stabilization designed thus as being "graybox" and identify "local everywhere-eventually specifications" as being amenable to design of graybox stabilization. We illustrate the design of graybox stabilization using timestamp-based distributed mutual exclusion as our example.
An Exercise in Proving Convergence through Transfer Functions
- Proc. 4th Workshop on Self-stabilizing Systems, 1999
Cited by 5 (1 self)
Self-stabilizing algorithms must fulfill two requirements generally called closure and convergence. We are interested in the convergence property and discuss a new method for proving it. Usually, proving the convergence of self-stabilizing algorithms requires a well-foundedness argument: briefly, it involves exhibiting a convergence function which is shown to decrease with every transition of the algorithm starting in an illegal state. Devising such a convergence function can be a difficult task, since it must bear in itself the essence of stabilization which lies within the algorithm. In this paper, we explore how to utilize results from control theory to prove the stability of self-stabilizing algorithms. We define a simple stabilization task and adapt stability criteria for linear control circuits to construct a self-stabilizing algorithm which solves the task. In contrast to the usual procedure in which finding a convergence function is an afterthought of algorithm design, our approach can be seen as starting with a convergence function which is implicitly given through a so-called transfer function. Then, we construct an algorithm around it. It turns out that this methodology seems to adapt well to those settings which are quite difficult to handle by the traditional methodologies of self-stabilization.
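The traditional well-foundedness argument mentioned in this abstract can be illustrated with a small sketch. It is not the paper's transfer-function method; the chain protocol, the names `enabled`, `step`, `variant`, and `run`, and the weighting scheme are all assumptions chosen for illustration. Each node on a chain copies its predecessor's value whenever the two differ, and the convergence function weights each enabled node so that every move strictly decreases it, whichever node the scheduler picks.

```python
import random

def enabled(state):
    """Indices of nodes whose guard holds: node i differs from node i-1."""
    return [i for i in range(1, len(state)) if state[i] != state[i - 1]]

def step(state, i):
    """Node i copies the value of its predecessor (hypothetical protocol)."""
    new_state = list(state)
    new_state[i] = new_state[i - 1]
    return new_state

def variant(state):
    """Convergence function: enabled node i contributes weight 2**(n - i).

    A move at node i removes 2**(n - i) and can enable at most node i+1,
    adding at most 2**(n - i - 1), so the value strictly decreases.
    """
    n = len(state)
    return sum(2 ** (n - i) for i in enabled(state))

def run(state, rng):
    """Run under a random scheduler until no guard is enabled.

    Returns the final state and the variant value after every transition.
    """
    trace = [variant(state)]
    while enabled(state):
        i = rng.choice(enabled(state))
        state = step(state, i)
        trace.append(variant(state))
    return state, trace

if __name__ == "__main__":
    final, trace = run([3, 1, 4, 1, 5, 9, 2, 6], random.Random(0))
    print(final)  # every node ends with the value of node 0
    print(all(a > b for a, b in zip(trace, trace[1:])))
```

Closure holds here because no guard is enabled once all values agree, and convergence follows because the variant is a non-negative integer that drops on every transition from an illegal state.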
Hierarchical presynthesized components for automatic addition of fault-tolerance: A case study
- In the extended abstracts of the ACM workshop on the Specification and Verification of Component-Based Systems (SAVCBS), 2004
Cited by 4 (4 self)
We present a case study of automatic addition of fault-tolerance to distributed programs using presynthesized distributed components. Specifically, we extend the scope of automatic addition of fault-tolerance using presynthesized components to the case where we automatically add hierarchical components to fault-intolerant programs, whereas in our previous work, we have shown the addition of linear presynthesized components to programs. Towards this end, we present an automatically generated diffusing computation program that provides nonmasking fault-tolerance – where, in the presence of faults, the nonmasking program guarantees recovery to states from where it satisfies its safety and liveness specifications. Since presynthesized components provide reuse in the synthesis of fault-tolerant distributed programs, we expect that our method will pave the way for automatic addition of fault-tolerance to large-scale programs.
Keywords: Fault-tolerance, Automatic addition of fault-tolerance, Formal methods, Program synthesis, Distributed programs
Efficient Reconfiguration of Trees: A Case Study in Methodical Design of Nonmasking Fault-Tolerant Programs
1994
Cited by 4 (2 self)
We illustrate a formal method for the design of nonmasking fault-tolerant programs, by demonstrating how the method enables us to effectively design a new and efficient program. Our program maintains the processes of any given distributed system in a spanning tree, tolerates any finite number of fail-stop failures and repairs of system processes and channels, and requires only O(n) time and O(n log n) space to reconfigure the tree, where n is the number of nonfaulty processes. The program is, moreover, simple and fully distributed.
Categories and Subject Descriptors: C.2.4 [Computer Communication Systems]: Distributed Systems; D.1.3 [Programming Techniques]: Concurrent Programming; D.2.4 [Program Verification]: Reliability; D.2.10 [Program Design]: Methodologies; G.2.2 [Discrete Mathematics]: Graph Algorithms
Research supported in part by NSF Grant CCR-9308640 and OSU Grant 221506. A preliminary version of this paper appears in the Proceedings of the Third International Symposium on Formal...

