Unreliable Failure Detectors for Reliable Distributed Systems
- Journal of the ACM, 1996
Cited by 807 (17 self)
We introduce the concept of unreliable failure detectors and study how they can be used to solve Consensus in asynchronous systems with crash failures. We characterise unreliable failure detectors in terms of two properties — completeness and accuracy. We show that Consensus can be solved even with unreliable failure detectors that make an infinite number of mistakes, and determine which ones can be used to solve Consensus despite any number of crashes, and which ones require a majority of correct processes. We prove that Consensus and Atomic Broadcast are reducible to each other in asynchronous systems with crash failures; thus the above results also apply to Atomic Broadcast. A companion paper shows that one of the failure detectors introduced here is the weakest failure detector for solving Consensus [Chandra et al. 1992].
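The two properties named in this abstract can be illustrated with a small timeout-based detector. This is a minimal sketch, not the construction from the paper: the class name `HeartbeatDetector`, its methods, and the timeout-doubling rule are all assumptions chosen for illustration. A crashed process stops sending heartbeats and so is eventually suspected for good (completeness). A wrongly suspected process is un-suspected when its next heartbeat arrives, and its timeout grows so that the same mistake becomes less likely (a step toward eventual accuracy, assuming message delays are eventually bounded).

```python
class HeartbeatDetector:
    """Timeout-based failure detector sketch; all names here are hypothetical."""

    def __init__(self, peers, initial_timeout=1.0):
        # Per-peer timeout, so patience can grow for slow peers only.
        self.timeout = {p: initial_timeout for p in peers}
        self.last_seen = {p: 0.0 for p in peers}
        self.suspected = set()

    def on_heartbeat(self, peer, now):
        """Record a heartbeat received from `peer` at time `now`."""
        self.last_seen[peer] = now
        if peer in self.suspected:
            # The suspicion was a mistake: retract it and wait longer next time.
            self.suspected.discard(peer)
            self.timeout[peer] *= 2

    def check(self, now):
        """Suspect every peer whose last heartbeat is older than its timeout."""
        for peer, seen in self.last_seen.items():
            if now - seen > self.timeout[peer]:
                self.suspected.add(peer)
        return set(self.suspected)


fd = HeartbeatDetector(["p", "q"])
fd.on_heartbeat("p", 0.5)
fd.on_heartbeat("q", 0.5)
early = fd.check(1.0)      # nobody has been silent for more than 1.0 yet
late = fd.check(2.0)       # both silent for 1.5, so both are suspected
fd.on_heartbeat("p", 2.1)  # p was only slow: suspicion retracted, timeout doubled
final = fd.check(3.0)      # q never came back and stays suspected
```

The detector is unreliable in the abstract's sense: it may suspect a live process (as it does with `p` above) and later change its mind.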
A Note on Distributed Computing
- IEEE Micro, 1994
Cited by 170 (0 self)
We argue that objects that interact in a distributed system need to be dealt with in ways that are intrinsically different from objects that interact in a single address space. These differences are required because distributed systems require that the programmer be aware of latency, have a different model of memory access, and take into account issues of concurrency and partial failure. We look at a number of distributed systems that have attempted to paper over the distinction between local and remote objects, and show that such systems fail to support basic requirements of robustness and reliability. These failures have been masked in the past by the small size of the distributed systems that have been built. In the enterprise-wide distributed systems foreseen in the near future, however, such a masking will be impossible. We conclude by discussing what is required of both systems-level and application-level programmers and designers if one is to take distribution seriously.
Early consensus in an asynchronous system with a Weak failure detector
- Distributed Comput, 1997
Cited by 105 (22 self)
Summary. Consensus is one of the most fundamental problems in the context of fault-tolerant distributed computing. The problem consists, given a set Ω of processes having each an initial value v, in deciding among Ω on a common value v. In 1985, Fischer, Lynch and Paterson proved that the consensus problem is not solvable in an asynchronous system subject to a single process crash. In 1991, Chandra and Toueg showed that, by augmenting the asynchronous system model with a well-defined unreliable failure detector, consensus becomes solvable. They also give an algorithm that solves consensus using the ◇S failure detector. In this paper we propose a new consensus algorithm, also using the ◇S failure detector, that is more efficient than the Chandra-Toueg consensus algorithm. We measure efficiency by introducing the notion of latency degree, which defines the minimal number of communication steps needed to solve consensus. The Chandra-Toueg algorithm has a latency degree of 3 (it requires at least three communication steps), whereas our early consensus algorithm requires only two communication steps (latency degree of 2). We believe that this is an interesting result, which adds to our current understanding of the cost of consensus algorithms based on ◇S.
Adding Group Communication and Fault-Tolerance to CORBA
- 1995
Cited by 102 (7 self)
Groupware and fault-tolerant distributed systems stimulate the need for structuring activities around object-groups and reliable multicast communication. The object-group abstraction makes it possible to treat a collection of network objects as if they were a single object; clients can invoke operations on object-groups without needing to know the exact membership of the group. Object-groups mainly serve to increase reliability through replication, performance through parallelism, or to distribute data from one sender to a large number of receivers efficiently. This paper describes how object-groups and reliable multicast communication can be added to a CORBA-compliant Object Request Broker. It also presents ELECTRA, a CORBA Object Request Broker whose architecture is pervaded by the group concept.
A new approach to developing and implementing eager database replication protocols
- ACM TODS
Cited by 101 (12 self)
Database replication is traditionally seen as a way to increase the availability and performance of distributed databases. Although a large number of protocols providing data consistency and fault-tolerance have been proposed, few of these ideas have ever been used in commercial products due to their complexity and performance implications. Instead, current products allow inconsistencies and often resort to centralized approaches, which eliminate some of the advantages of replication. As an alternative, we propose a suite of replication protocols that addresses the main problems related to database replication. On the one hand, our protocols maintain data consistency and the same transactional semantics found in centralized systems. On the other hand, they provide flexibility and reasonable performance. To do so, our protocols take advantage of the rich semantics of group communication primitives and the relaxed isolation guarantees provided by most databases. This allows us to eliminate the possibility of deadlocks, reduce the message overhead and increase performance. A detailed simulation study shows the feasibility of the approach and the flexibility with which different types of bottlenecks can be circumvented.
Route driven gossip: Probabilistic reliable multicast in ad hoc networks
- In Proc. of INFOCOM, 2003
Cited by 89 (4 self)
Traditionally, reliable multicast protocols are deterministic in nature. It is precisely this determinism that tends to become their limiting factor when aiming at reliability and scalability, particularly in highly dynamic networks, e.g., ad hoc networks. As probabilistic protocols, gossip-based multicast protocols, recently (re-)discovered in wired networks, appear to be a viable means to “fight fire with fire” by exploiting the nondeterministic nature of ad hoc networks. This paper presents a protocol that is designed to meet a more practical specification of probabilistic reliability; this gossip-based multicast protocol, called Route Driven Gossip (RDG), can be deployed on any basic on-demand routing protocol. RDG is custom-tailored to ad hoc networks, achieving a high level of reliability without relying on any inherent multicast primitive. We illustrate our RDG protocol by layering it on top of the “bare” DSR protocol. We prove the reliability and scalability of RDG through both analysis and simulation.
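For readers unfamiliar with the gossip idea mentioned in this abstract, a generic push-gossip round can be sketched as follows. This is not RDG: the abstract gives no protocol details, so the function `push_gossip`, its parameters (`fanout`, `rounds`, `seed`) and the fully connected topology are assumptions for illustration only. In each round, every node that already holds the message forwards it to a few peers chosen at random, so delivery is probabilistic rather than guaranteed.

```python
import random


def push_gossip(n, fanout, rounds, source=0, seed=None):
    """Simulate push gossip among nodes 0..n-1 (hypothetical helper, not RDG).

    Returns the set of informed nodes and the informed count after each round.
    """
    rng = random.Random(seed)
    informed = {source}
    history = [len(informed)]
    for _ in range(rounds):
        newly = set()
        for node in informed:
            # Each informed node picks `fanout` distinct random peers.
            others = [p for p in range(n) if p != node]
            newly.update(rng.sample(others, min(fanout, len(others))))
        informed |= newly
        history.append(len(informed))
    return informed, history


# With fanout n-1 the source reaches everyone in a single round.
everyone, full_history = push_gossip(n=10, fanout=9, rounds=1, seed=1)

# With a small fanout, coverage grows round by round but is not guaranteed.
some, growth = push_gossip(n=20, fanout=2, rounds=5, seed=0)
```

The `growth` list never decreases, which reflects that nodes do not forget the message in this simplified model.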
Understanding replication in databases and distributed systems
- In Proceedings of 20th International Conference on Distributed Computing Systems (ICDCS’2000), 2000
Cited by 81 (7 self)
Replication is an area of interest to both distributed systems and databases. The solutions developed from these two perspectives are conceptually similar but differ in many aspects: model, assumptions, mechanisms, guarantees provided, and implementation. In this paper, we provide an abstract and “neutral” framework to compare replication techniques from both communities. The framework has been designed to emphasize the role played by different mechanisms and to facilitate comparisons. The paper describes the replication techniques used in both communities, compares them, and points out ways in which they can be integrated to arrive at better, more robust replication protocols.
Secure and efficient asynchronous broadcast protocols (Extended Abstract)
- Advances in Cryptology: CRYPTO 2001, 2001
Cited by 59 (19 self)
Broadcast protocols are a fundamental building block for implementing replication in fault-tolerant distributed systems. This paper addresses secure service replication in an asynchronous environment with a static set of servers, where a malicious adversary may corrupt up to a threshold of servers and controls the network. We develop a formal model using concepts from modern cryptography, give modular definitions for several broadcast problems, including reliable, atomic, and secure causal broadcast, and present protocols implementing them. Reliable broadcast is a basic primitive, also known as the Byzantine generals problem, providing agreement on a delivered message. Atomic broadcast imposes additionally a total order on all delivered messages. We present a randomized atomic broadcast protocol based on a new, efficient multi-valued asynchronous Byzantine agreement primitive with an external validity condition. Apparently, no such efficient asynchronous atomic broadcast protocol maintaining liveness and safety in the Byzantine model has appeared previously in the literature. Secure causal broadcast extends atomic broadcast by encryption to guarantee a causal order among the delivered messages. Our protocols use threshold cryptography for signatures, encryption, and coin-tossing.
Middle-R: Consistent Database Replication at the Middleware Level
- ACM Trans. Comput. Syst, 2005
Cited by 59 (7 self)
The widespread use of clusters and web farms has increased the importance of data replication. In this paper, we show how to implement consistent and scalable data replication at the middleware level. We do this by combining transactional concurrency control with group communication primitives. The paper presents different replication protocols, argues their correctness, describes their implementation as part of a generic middleware tool, and proves their feasibility with an extensive performance evaluation. The solution proposed is well suited for a variety of applications including web farms and distributed object platforms.
From group communication to transactions in distributed systems
- Communications of the ACM, 1996
Cited by 57 (9 self)
Because toolkits for developing process groups do not allow applications to issue reliable multicasts to multiple groups, a new development model distinguishing between groups as logical addressing mechanisms and reliable communication primitives is needed to create reliable distributed applications. The design of structuring concepts that facilitate development of reliable and complex applications and implementation of associated mechanisms is today one of the most important research tasks in computer science. In this context, the 1970s saw the emergence of transactional

