Unreliable Failure Detectors for Reliable Distributed Systems
Journal of the ACM, 1996
Cited by 807 (17 self)
We introduce the concept of unreliable failure detectors and study how they can be used to solve Consensus in asynchronous systems with crash failures. We characterise unreliable failure detectors in terms of two properties — completeness and accuracy. We show that Consensus can be solved even with unreliable failure detectors that make an infinite number of mistakes, and determine which ones can be used to solve Consensus despite any number of crashes, and which ones require a majority of correct processes. We prove that Consensus and Atomic Broadcast are reducible to each other in asynchronous systems with crash failures; thus the above results also apply to Atomic Broadcast. A companion paper shows that one of the failure detectors introduced here is the weakest failure detector for solving Consensus [Chandra et al. 1992].
Building Secure and Reliable Network Applications
1996
Cited by 209 (16 self)
Abstractly, the remote procedure call problem, which an RPC protocol undertakes to solve, consists of emulating LPC using message passing. LPC has a number of "properties" -- a single procedure invocation results in exactly one execution of the procedure body, the result returned is reliably delivered to the invoker, and exceptions are raised if (and only if) an error occurs. Given a completely reliable communication environment, which never loses, duplicates, or reorders messages, and given client and server processes that never fail, RPC would be trivial to solve. The sender would merely package the invocation into one or more messages, and transmit these to the server. The server would unpack the data into local variables, perform the desired operation, and send back the result (or an indication of any exception that occurred) in a reply message. The challenge, then, is created by failures. Were it not for the possibility of process and machine crashes, an RPC protocol capable of overcomi...
How to assign votes in a distributed system
Journal of the ACM, 1985
Cited by 167 (1 self)
Abstract. In a distributed system, one strategy for achieving mutual exclusion of groups of nodes without communication is to assign to each node a number of votes. Only a group with a majority of votes can execute the critical operations, and mutual exclusion is achieved because at any given time there is at most one such group. A second strategy, which appears to be similar to votes, is to define a priori a set of groups that intersect each other. Any group of nodes that finds itself in this set can perform the restricted operations. In this paper, both of these strategies are studied in detail and it is shown that they are not equivalent in general (although they are in some cases). In doing so, a number of other interesting properties are proved. These properties will be of use to a system designer who is selecting a vote assignment or a set of groups for a specific application.
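The vote-based strategy described in this abstract can be illustrated with a minimal sketch. The node names and the vote assignment below are hypothetical and are not taken from the paper; the code only checks the stated property that any two groups holding a strict majority of the votes must share a node, so at most one of them can be active at a time.

```python
from itertools import combinations


def has_majority(group, votes):
    """Return True if the nodes in `group` hold strictly more than half of all votes."""
    total = sum(votes.values())
    return 2 * sum(votes[node] for node in group) > total


def majority_groups(votes):
    """Enumerate every subset of nodes that holds a strict majority of the votes."""
    nodes = sorted(votes)
    return [
        frozenset(subset)
        for size in range(1, len(nodes) + 1)
        for subset in combinations(nodes, size)
        if has_majority(subset, votes)
    ]


# Hypothetical assignment: node "a" gets two votes, "b" and "c" one each.
votes = {"a": 2, "b": 1, "c": 1}
groups = majority_groups(votes)

# Any two majority groups intersect, which is what yields mutual exclusion.
pairwise_intersecting = all(g & h for g in groups for h in groups)
```

Enumerating all subsets is exponential in the number of nodes, so this is only suitable for exploring small assignments by hand.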
Total order broadcast and multicast algorithms: Taxonomy and survey
ACM Computing Surveys, 2004
Dynamic Fault-Tolerant Clock Synchronization
1996
Cited by 120 (12 self)
This paper gives two simple efficient distributed algorithms: one for keeping clocks in a network synchronized and one for allowing new processors to join the network with their clocks synchronized. Assuming a fault tolerant authentication protocol, the algorithms tolerate both link and processor failures of any type. The algorithm for maintaining synchronization works for arbitrary networks (rather than just completely connected networks) and tolerates any number of processor or communication link faults as long as the correct processors remain connected by fault-free paths. It thus represents an improvement over other clock synchronization algorithms such as [LM,WL], although, unlike them, it does require an authentication protocol to handle Byzantine faults. Our algorithm for allowing new processors to join requires that more than half the processors be correct, a requirement that is provably necessary.
Identifying the minimal transversals of a hypergraph and related problems
SIAM Journal on Computing, 1995
Cited by 114 (7 self)
The paper considers two decision problems on hypergraphs, hypergraph saturation and recognition of the transversal hypergraph, and discusses their significance for several search problems in applied computer science. Hypergraph saturation, i.e., given a hypergraph H, decide if every subset of vertices is contained in or contains some edge of H, is shown to be co-NP-complete. A certain subproblem of hypergraph saturation, the saturation of simple hypergraphs, is shown to be computationally equivalent to transversal hypergraph recognition, i.e., given two hypergraphs H1 and H2, decide if the sets in H2 are all the minimal transversals of H1. The complexity of the search problem related to the recognition of the transversal hypergraph, the computation of the transversal hypergraph, is an open problem. This task needs time exponential in the input size, but it is unknown whether an output-polynomial algorithm exists for this problem. For several important subcases, for instance if an upper or lower bound is imposed on the edge size or for acyclic hypergraphs, we present output-polynomial algorithms. Computing or recognizing the minimal transversals of a hypergraph is a frequent problem in practice, which is pointed out by identifying important applications in database theory, Boolean switching theory, logic, and AI, particularly in model-based diagnosis.
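The two decision problems defined in this abstract can be made concrete with a brute-force sketch. It is not one of the paper's algorithms: it enumerates every subset of vertices and is therefore exponential, and all function names and the example hypergraphs are assumptions chosen for illustration.

```python
from itertools import combinations


def all_subsets(vertices):
    """Yield every subset of `vertices` as a frozenset, smallest first."""
    ordered = sorted(vertices)
    for size in range(len(ordered) + 1):
        for subset in combinations(ordered, size):
            yield frozenset(subset)


def minimal_transversals(vertices, edges):
    """Return the vertex sets that hit every edge and have no smaller hitting subset."""
    found = []
    for candidate in all_subsets(vertices):
        hits_all = all(candidate & edge for edge in edges)
        # Subsets are visited by increasing size, so any smaller
        # transversal inside `candidate` is already in `found`.
        if hits_all and not any(t <= candidate for t in found):
            found.append(candidate)
    return set(found)


def is_transversal_hypergraph(vertices, h1, h2):
    """Recognition: are the sets in h2 exactly the minimal transversals of h1?"""
    return minimal_transversals(vertices, h1) == set(h2)


def is_saturated(vertices, edges):
    """Saturation: is every vertex subset contained in, or a superset of, some edge?"""
    return all(
        any(subset <= edge or edge <= subset for edge in edges)
        for subset in all_subsets(vertices)
    )


# Hypothetical example hypergraphs on the vertex set {1, 2, 3}.
V = {1, 2, 3}
H1 = [frozenset({1, 2}), frozenset({2, 3})]
H2 = [frozenset({2}), frozenset({1, 3})]
```

With these inputs, `is_transversal_hypergraph(V, H1, H2)` holds, while `is_saturated(V, H1)` does not, because the subset {1, 3} neither contains nor is contained in an edge of H1.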
Replication Management Using the State Machine Approach
1993
Cited by 101 (0 self)
This paper is a tutorial on the state machine approach. It describes the approach and its implementation for two representative environments. Small examples suffice to illustrate the points. However, the approach has been successfully applied to larger examples; some of these are mentioned in Section 9. Section 2 describes how a system can be viewed in terms of a state machine, clients, and output devices. Coping with failures is the subject of Sections 3 through 6. An important class of optimizations, based on the use of time, is discussed in Section 7. Section 8 describes dynamic reconfiguration. The history of the approach and related work is discussed in Section 9.
Fault-scalable Byzantine fault-tolerant services
In Proceedings of the 20th ACM Symposium on Operating Systems Principles, 2005
Cited by 92 (6 self)
A fault-scalable service can be configured to tolerate increasing numbers of faults without significant decreases in performance. The Query/Update (Q/U) protocol is a new tool that enables construction of fault-scalable Byzantine fault-tolerant services. The optimistic quorum-based nature of the Q/U protocol allows it to provide better throughput and fault-scalability than replicated state machines using agreement-based protocols. A prototype service built using the Q/U protocol outperforms the same service built using a popular replicated state machine implementation at all system sizes in experiments that permit an optimistic execution. Moreover, the performance of the Q/U protocol decreases by only 36% as the number of Byzantine faults tolerated increases from one to five, whereas the performance of the replicated state machine decreases by 83%.
Using Time Instead of Timeout for Fault-Tolerant Distributed Systems
ACM Transactions on Programming Languages and Systems, 1984
Cited by 75 (9 self)
SRI International
A general method is described for implementing a distributed system with any desired degree of fault-tolerance. Instead of relying upon explicit timeouts, processes execute a simple clock-driven algorithm. Reliable clock synchronization and a solution to the Byzantine Generals Problem are assumed.

