Unreliable Failure Detectors for Reliable Distributed Systems - Journal of the ACM, 1996
Cited by 807 (17 self)
We introduce the concept of unreliable failure detectors and study how they can be used to solve Consensus in asynchronous systems with crash failures. We characterise unreliable failure detectors in terms of two properties — completeness and accuracy. We show that Consensus can be solved even with unreliable failure detectors that make an infinite number of mistakes, and determine which ones can be used to solve Consensus despite any number of crashes, and which ones require a majority of correct processes. We prove that Consensus and Atomic Broadcast are reducible to each other in asynchronous systems with crash failures; thus the above results also apply to Atomic Broadcast. A companion paper shows that one of the failure detectors introduced here is the weakest failure detector for solving Consensus [Chandra et al. 1992].
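As a rough illustration of how a detector can make mistakes and still be useful, the following Python sketch shows a timeout-based detector that may wrongly suspect a slow process, then retracts the suspicion and lengthens that process's timeout. The class name `TimeoutFailureDetector`, its parameters, and the explicit `now` argument are assumptions made for this example; it is a common heartbeat scheme, not necessarily the construction used in the paper.

```python
class TimeoutFailureDetector:
    """Hypothetical heartbeat detector: suspects a peer after a silent timeout."""

    def __init__(self, peers, initial_timeout=1.0, backoff=1.0):
        # Per-peer timeout, so one slow peer does not affect the others.
        self.timeout = {p: initial_timeout for p in peers}
        # Time of the last heartbeat received from each peer.
        self.last_heard = {p: 0.0 for p in peers}
        self.suspected = set()
        self.backoff = backoff

    def on_heartbeat(self, peer, now):
        """Record a heartbeat; undo a wrong suspicion and relax the timeout."""
        self.last_heard[peer] = now
        if peer in self.suspected:
            # The earlier suspicion was a mistake: the peer is alive.
            self.suspected.discard(peer)
            self.timeout[peer] += self.backoff

    def check(self, now):
        """Suspect every peer that has been silent longer than its timeout."""
        for peer, heard in self.last_heard.items():
            if now - heard > self.timeout[peer]:
                self.suspected.add(peer)
        return set(self.suspected)
```

In this sketch a crashed peer stops sending heartbeats and so stays suspected (a form of completeness), while each false suspicion of a live peer increases its timeout, which makes further mistakes about that peer less likely if message delays are eventually bounded.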
A Gossip-Style Failure Detection Service, 1998
Cited by 190 (22 self)
Failure Detection is valuable for system management, replication, load balancing, and other distributed services. To date, Failure Detection Services scale badly in the number of members that are being monitored. This paper describes a new protocol based on gossiping that does scale well and provides timely detection. We analyze the protocol, and then extend it to discover and leverage the underlying network topology for much improved resource utilization. We then combine it with another protocol, based on broadcast, that is used to handle partition failures.
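The abstract does not spell out the protocol, so the following Python sketch only assumes the general gossip heartbeat idea: each member keeps a table of heartbeat counters, periodically increments its own counter, sends its table to one randomly chosen member, and suspects any member whose counter has not increased for `t_fail` time units. All names here (`GossipMember`, `gossip_round`, `t_fail`) are hypothetical, and the topology-aware and broadcast extensions mentioned above are not modelled.

```python
import random


class GossipMember:
    """Hypothetical member of a gossip-based failure detection group."""

    def __init__(self, name, t_fail):
        self.name = name
        self.t_fail = t_fail
        # member -> (heartbeat counter, local time the counter last increased)
        self.table = {name: (0, 0.0)}

    def tick(self, now):
        """Increment this member's own heartbeat counter."""
        beat, _ = self.table[self.name]
        self.table[self.name] = (beat + 1, now)

    def digest(self):
        """Return the heartbeat counters to gossip to another member."""
        return {m: beat for m, (beat, _) in self.table.items()}

    def merge(self, digest, now):
        """Adopt any strictly newer counter and note when it was learned."""
        for member, beat in digest.items():
            known = self.table.get(member)
            if known is None or beat > known[0]:
                self.table[member] = (beat, now)

    def suspects(self, now):
        """Members whose counter has not increased for more than t_fail."""
        return {m for m, (_, seen) in self.table.items()
                if m != self.name and now - seen > self.t_fail}


def gossip_round(members, now, rng):
    """One round: every member ticks and gossips to one random other member."""
    for m in members:
        m.tick(now)
        others = [o for o in members if o is not m]
        if others:
            rng.choice(others).merge(m.digest(), now)
```

Each member sends only one message per round regardless of group size, which is the property that lets this style of protocol scale; the price is that news of a heartbeat may take several rounds to reach everyone, so `t_fail` has to be chosen with that delay in mind.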
On the Formal Specification of Group Membership Services, 1995
Cited by 47 (2 self)
The problem of group membership has been the focus of much theoretical and experimental work on fault-tolerant distributed systems. This has resulted in a voluminous literature and several formal specifications of this problem have been given. In this paper, we examine the two most referenced formal specifications of group membership and show that they are unsatisfactory: One has flaws in the formalism and allows undesirable executions, and the other can be satisfied by useless protocols. 1 Introduction Group membership is an important component of several experimental or commercial fault-tolerant distributed systems such as the Highly Available System [Cri87], Isis [Bir93], Horus [vRBC+93], Transis [ADKM92a], Amoeba [KT91], Newtop [EMS95], and Relacs [BDGB94]. Roughly speaking, a group membership protocol manages the formation and maintenance of a set of processes called a group. For example, a group may be a set of processes that are cooperating towards a common task (e.g., th...
Fail-Awareness in Timed Asynchronous Systems, 2003
Cited by 43 (15 self)
We address the problem of the impossibility of implementing synchronous fault-tolerant service specifications in asynchronous distributed systems. We introduce a method for weakening a synchronous service specification so that it becomes implementable in "timed" asynchronous systems, that is, asynchronous systems in which processes have access to local hardware clocks. The method (1) adds to a service interface an exception indicator so that a client knows at any time if a server is currently providing its standard "synchronous" semantics or some other specified exceptional semantics, (2) the standard behavior provided when the exception indicator does not signal an exception is "similar" to the original synchronous service behavior, and (3) a server has to provide its standard semantics whenever the underlying communication and process services exhibit "synchronous behavior". To illustrate our method, we show how the specification of a synchronous datagram service and an internal clock synchronization service can be transformed into a fail-aware service specification. Further illustrations of the usefulness of fail-aware services are provided by describing a railway crossing service and a fail-aware weak group membership service.
Filterfresh: Hot Replication of Java RMI Server Objects - In Proc. of the 4th Conf. on Object-Oriented Technologies and Systems, 1998
Cited by 9 (0 self)
This paper presents the design and implementation of a Java package called Filterfresh for building replicated fault-tolerant servers. Maintaining the correctness and integrity of replicated servers is supported by a GroupManager object instantiated with each replica to form a logical group. The Group Managers use a Group Membership algorithm to maintain a consistent group view and a Reliable Multicast mechanism to communicate with other Group Managers. We then demonstrate how Filterfresh can be integrated into the Java RMI facilities. First we use the GroupManager class to construct a fault-tolerant RMI registry called FT Registry---a group of replicated RMI registry servers. Second, we describe our implementation of the FT Unicast---a client-side mechanism that tolerates and masks server failures below the stub layer, transparent to the client. We also present initial performance results, and discuss how general purpose RMI servers can be made highly available using the Filterfresh p...
On real-time and non real-time distributed computing - In 9th Intl. Workshop on Distributed Algorithms (WDAG-9), 1995
Cited by 8 (1 self)
Abstract. In this paper, taking an algorithmic viewpoint, we explore the differences existing between the class of non real-time computing problems (R~) versus the class of real-time computing problems (~). We show how a problem in class RN can be transformed into its counterpart in class ~. Claims of real-time behavior made for solutions to problems in class ~ are examined. An example of a distributed computing problem arising in class ... is studied, along with its solution. It is shown why off-line strategies or scheduling algorithms that are not driven by real-time/timeliness requirements are incorrect for class ~. Finally, a unified approach to conceiving and measuring the efficiency of solutions to problems in classes R~ and ~ is proposed and illustrated with a few examples.
Filterfresh: Hot Replication of Java RMI Server Objects - In Proceedings of the 4th Conference on Object-Oriented Technologies and Systems (COOTS), 1998
Cited by 4 (0 self)
This paper presents the design and implementation of a Java package called Filterfresh for building replicated fault-tolerant servers. Maintaining the correctness and integrity of replicated servers is supported by a GroupManager object instantiated with each replica to form a logical group. The Group Managers use a Group Membership algorithm to maintain a consistent group view and a Reliable Multicast mechanism to communicate with other Group Managers. We then demonstrate how Filterfresh can be integrated into the Java RMI facilities. First we use the GroupManager class to construct a fault-tolerant RMI registry called FT Registry---a group of replicated RMI registry servers. Second, we describe our implementation of the FT Unicast---a client-side mechanism that tolerates and masks server failures below the stub layer, transparent to the client. We also present initial performance results, and discuss how general purpose RMI servers can be made highly available using the Filterfresh package.
A Gossip-Style Failure Detection Service - Proc. Conf. Middleware, 1998
Cited by 4 (0 self)
Failure Detection is valuable for system management, replication, load balancing, and other distributed services. To date, Failure Detection Services scale badly in the number of members that are being monitored. This paper describes a new protocol based on gossiping that does scale well and provides timely detection. We analyze the protocol, and then extend it to discover and leverage the underlying network topology for much improved resource utilization. We then combine it with another protocol, based on broadcast, that is used to handle partition failures.
On the Possibility of Consensus in Asynchronous Systems - In Proceedings of the 1995 Pacific Rim International Symposium on Fault-Tolerant Systems, 1995
We demonstrate that the leader election and consensus problems are solvable in a timed asynchronous distributed system provided a majority of processes are always eventually able to communicate in a timely manner for a sufficiently long time. Failures and recoveries affecting the other processes and the communications between them do not prevent consensus. The timed asynchronous system model describes with accuracy existing asynchronous distributed systems such as those based on networks of workstations. We describe two protocols that implement leadership and consensus services and prove their correctness. 1 Introduction Most current distributed systems are asynchronous in the sense that they do not guarantee an upper bound on communication delays and process scheduling delays. The "standard" theoretical model [12] used to describe asynchronous distributed systems is characterized by the following two properties: (1) each non-crashed process executes with a finite, positive speed but...
Client--Access Protocols for Replicated Services - IEEE Transactions on Software Engineering, 1999
The paper addresses the problem of client--service interaction in the case of replicated service provision. Existing systems that follow the State Machine approach concentrate on the synchronisation of the server replicas and do not consider the problem of client interaction with the server group. Client interaction is analysed and a number of access protocols are proposed to meet a range of client requirements. The paper demonstrates that protocols for the "open" group model---clients external to the group of servers---satisfy the requirements of the State Machine approach, even when replication is transparent to the clients. Experimental performance results indicate that the "open" model is clearly desirable when the service is used by a large, dynamic set of clients. 1. Introduction With the ever increasing introduction of computing systems in many aspects of today's life, availability of critical computing services becomes of great importance. The State Machine approach [20] is a...

