Fundamentals of Fault-Tolerant Distributed Computing in Asynchronous Environments
ACM Computing Surveys, 1999
Cited by 57 (9 self)
Fault tolerance in distributed computing is a wide area with a significant body of literature that is vastly diverse in methodology and terminology. This paper aims at structuring the area and thus guiding readers into this interesting field. We use a formal approach to define important terms like fault, fault tolerance, and redundancy. This leads to four distinct forms of fault tolerance and to two main phases in achieving them: detection and correction. We show that this can help to reveal inherently fundamental structures that contribute to understanding and unifying methods and terminology. By doing this, we survey many existing methodologies and discuss their relations. The underlying system model is the close-to-reality asynchronous message-passing model of distributed computing.
Detectors and Correctors: A Theory of Fault-Tolerance Components
International Conference on Distributed Computing Systems, 1998
Cited by 55 (10 self)
In this paper, we show that two types of tolerance components, namely detectors and correctors, appear in a rich class of fault-tolerant systems. This class includes systems designed using the well-known techniques of encapsulation and refinement, as well as systems designed using extant fault-tolerance methods such as replication and the state-machine approach. Our demonstration is via a theory of detectors and correctors, which characterizes the particular role of these components in achieving various types of fault-tolerance. Based on this theory and on our experience with using these components in designs, we suggest that detectors and correctors provide a powerful basis for efficient, component-based design of fault-tolerance.
Synthesis of fault-tolerant concurrent programs
Proceedings of the 17th ACM Symposium on Principles of Distributed Computing (PODC), 1998
Cited by 34 (5 self)
Methods for mechanically synthesizing concurrent programs from temporal logic specifications obviate the need to manually construct a program and compose a proof of its correctness. A serious drawback of extant synthesis methods, however, is that they produce concurrent programs for models of computation that are often unrealistic. In particular, these methods assume completely fault-free operation, i.e., the programs they produce are fault-intolerant. In this paper, we show how to mechanically synthesize fault-tolerant concurrent programs for various fault classes. We illustrate our method by synthesizing fault-tolerant solutions to the mutual exclusion and barrier synchronization problems. Categories and Subject Descriptors: F.3.1 [Logics and Meanings of Programs]: Specifying and Verifying and Reasoning about Programs—logics of programs, mechanical verification, specification
Designing Masking Fault-tolerance via Nonmasking Fault-tolerance (Extended Abstract)
IEEE Transactions on Software Engineering, 1998
Cited by 27 (11 self)
Masking fault-tolerance guarantees that programs continually satisfy their specification in the presence of faults. By way of contrast, nonmasking fault-tolerance does not guarantee as much: it merely guarantees that when faults stop occurring, program executions converge to states from where programs continually (re)satisfy their specification. In this paper, we show that a practical method to design masking fault-tolerance is to first design nonmasking fault-tolerance and to then transform the nonmasking fault-tolerant program minimally so as to achieve masking fault-tolerance. We demonstrate this method by designing novel fully distributed programs for termination detection, mutual exclusion, and leader election, that are masking tolerant of any finite number of process fail-stops and/or repairs.
Local Tolerance to Unbounded Byzantine Faults
IEEE SRDS, 2002
Cited by 18 (5 self)
An ideal approach to deal with faults in large-scale distributed systems is to contain the effects of faults as locally as is possible and, additionally, to ensure some type of tolerance within each fault-affected locality. Existing results using this approach accommodate only limited faults (such as crashes) or assume that fault occurrence is bounded in space and/or time. In this paper, we define and explore the possibility/impossibility of local tolerance with respect to arbitrary faults (such as Byzantine faults) whose occurrence may be unbounded in space and in time. Our positive results include programs for graph coloring and dining philosophers, with proofs that the size of their tolerance locality is optimal. The type of tolerance achieved within fault-affected localities is self-stabilization. That is, starting from an arbitrary state of the distributed system, each non-faulty process eventually reaches a state from where it behaves correctly as long as the only faults that occur henceforth (regardless of their number) are outside the locality of this process.
Self-stabilization of byzantine protocols
Proc. of the 7th Symposium on Self-Stabilizing Systems (SSS’05), Barcelona, 2005
Cited by 9 (4 self)
Awareness of the need for robustness in distributed systems increases as distributed systems become integral parts of day-to-day systems. Self-stabilizing while tolerating ongoing Byzantine faults is a desirable property of a distributed system. Many distributed tasks (e.g. clock synchronization) possess efficient non-stabilizing solutions tolerating Byzantine faults or, conversely, non-Byzantine but self-stabilizing solutions. In contrast, designing algorithms that self-stabilize while at the same time tolerating an eventual fraction of permanent Byzantine failures presents a special challenge, because malicious nodes may try to hamper stabilization when the system tries to recover from a corrupted state. This difficulty might be indicated by the remarkably few algorithms that are resilient to both fault models. We present a scheme that takes a Byzantine distributed algorithm and produces its self-stabilizing Byzantine counterpart, while having a relatively low overhead of O(f′) communication rounds, where f′ is the number of actual faults. Our protocol is based on a tight Byzantine self-stabilizing pulse synchronization procedure. The synchronized pulses are used as events for initializing Byzantine agreement on every node's local state. The set of local states is used for global predicate detection. Should the global state represent an illegal system state, then the target algorithm is reset.
Specifications for Fault Tolerance: A Comedy of Failures
1998
Cited by 6 (4 self)
A substantial difficulty in rigorously reasoning about fault tolerant distributed algorithms is the necessity to formally describe faulty behavior. In this paper, we present a unified and formal approach to specify such behavior. It is based on the observation that faulty behavior can be regarded as a special form of (programmable) system behavior. Consequently, a failure model is defined to be a program transformation which can be used to evaluate the correctness properties of fault tolerant algorithms. We re-formulate several failure models which are pervasive in the literature in terms of our approach and show some interesting relations between them. In order to show the feasibility of this approach, we apply our methodology to the problem of reliable broadcast. Categories and Subject Descriptors: C.4 [Performance of Systems]: Fault tolerance; modeling techniques; F.3.1 [Specifying and Verifying and Reasoning about Programs ]: mechanical verification; specification techniques Gener...
Modular Composition of Redundancy Management Protocols in Distributed Systems: An Outlook on Simplifying . . .
Cited by 4 (1 self)
In recent years, formal methods (FMs) have been extensively used for verification and validation (V&V) of dependable distributed protocols. Over our studies in utilizing FMs for V&V, we have observed that a number of protocols providing for distributed and dependable services can often be formulated using a small set of basic functional primitives or their variations. Thus, from the formal viewpoint, the objective of this paper is to introduce techniques, utilizing concepts of category theory, that could effectively identify and reuse basic formal modules in order to simplify formal specification and verification for a spectrum of protocols.
Dijkstra's Self-Stabilizing Algorithm in Unsupportive Environments
Proc. Fifth Workshop Self-Stabilizing Systems (WSS 2001), 2001
Cited by 4 (1 self)
The first self-stabilizing algorithm, published by Dijkstra in 1973, assumed the existence of a central daemon that activates one processor at a time to change state as a function of its own state and the state of a neighbor. Subsequent research has reconsidered this algorithm without the assumption of a central daemon, and under different forms of communication, such as the model of link registers. In all of these investigations, one common feature is the atomicity of communication, whether by shared variables or read/write registers. This paper weakens the atomicity assumptions for the communication model, proposing versions of Dijkstra's algorithm that tolerate various weaker forms of atomicity, including cases of regular and safe registers. The paper also presents an implementation of Dijkstra's algorithm based on registers that have probabilistically correct behavior, which requires a notion of weak stabilization, where Markov chains are used to evaluate the probability of being in a safe configuration.
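For orientation, the baseline algorithm that this abstract starts from can be sketched as below. This is a minimal illustration of Dijkstra's K-state token ring under the original assumptions only (a central daemon and atomic reads of the neighbor's state); it does not model the weaker register semantics the paper studies. The function names (`privileged`, `step`, `run`), the random daemon, and the choice of K equal to the ring size are assumptions made for this sketch, not taken from the paper.

```python
import random


def privileged(x, i):
    """Return True if process i holds a privilege (may move) in ring state x."""
    if i == 0:
        # The distinguished process compares with its predecessor, the last process.
        return x[0] == x[-1]
    # Every other process is privileged when it differs from its predecessor.
    return x[i] != x[i - 1]


def step(x, i, k):
    """Let privileged process i make its move, updating x in place."""
    if i == 0:
        x[0] = (x[0] + 1) % k
    else:
        x[i] = x[i - 1]


def run(x, k, max_steps, rng):
    """Simulate a central daemon that activates one privileged process at a time.

    Returns the number of moves made before exactly one process is
    privileged, or None if that did not happen within max_steps.
    """
    for steps in range(max_steps):
        holders = [i for i in range(len(x)) if privileged(x, i)]
        if len(holders) == 1:
            return steps
        # Assumption: the daemon picks uniformly at random among privileged processes.
        step(x, rng.choice(holders), k)
    return None


if __name__ == "__main__":
    rng = random.Random(1)
    n = k = 5  # assumes K >= n, which is sufficient for convergence
    state = [rng.randrange(k) for _ in range(n)]  # arbitrary (possibly corrupted) start
    print("start:", state)
    print("moves until a single privilege remains:", run(state, k, 500, rng))
    print("end:", state)
```

Starting from any assignment of states, the ring reaches a configuration with exactly one privileged process, and every later move passes that single privilege along the ring.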
On Simplifying Modular Specification and Verification of Distributed Protocols
2001
Cited by 3 (0 self)
Computer systems supporting high assurance and high consequences applications typically utilize dependable distributed protocols to manage system resources and to provide sustained delivery of services in the presence of failures. The inherent complexity entailed in the design and analysis of such protocols is increasingly necessitating the use of formal techniques in establishing the correctness of the protocol level operations. Exploiting modular design aspects appearing in most dependable distributed protocols, we have introduced techniques utilizing concepts of category theory for constructing formal library routines of a set of constituent functional primitives, and their use in establishing the correctness of the protocol operation. In this paper, we build on our proposed category-theory-based approach for modular composition by formulating (a) a group membership protocol, which can also form the next hierarchical building block for other dependable protocol operations, and (b) a checkpointing protocol utilizing the group membership function as one of its building blocks. Subtleties in building-block interactions and their influence on the overall correctness of the composite protocols are also highlighted.

