Unreliable Failure Detectors for Reliable Distributed Systems
Journal of the ACM, 1996
Cited by 807 (17 self)
We introduce the concept of unreliable failure detectors and study how they can be used to solve Consensus in asynchronous systems with crash failures. We characterise unreliable failure detectors in terms of two properties — completeness and accuracy. We show that Consensus can be solved even with unreliable failure detectors that make an infinite number of mistakes, and determine which ones can be used to solve Consensus despite any number of crashes, and which ones require a majority of correct processes. We prove that Consensus and Atomic Broadcast are reducible to each other in asynchronous systems with crash failures; thus the above results also apply to Atomic Broadcast. A companion paper shows that one of the failure detectors introduced here is the weakest failure detector for solving Consensus [Chandra et al. 1992].
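As an illustration of the two properties named in the abstract above, here is a minimal sketch of a timeout-based failure detector. It is not taken from the paper: the class name, the heartbeat mechanism, and the timeout-doubling rule are all assumptions chosen for illustration. A crashed process stops sending heartbeats and so is eventually suspected (completeness). A slow process may be suspected by mistake, and the detector then retracts the suspicion and waits longer next time, so accuracy can hold only eventually, and only if message delays are eventually bounded.

```python
class HeartbeatFailureDetector:
    """Illustrative timeout-based unreliable failure detector (hypothetical names)."""

    def __init__(self, peers, initial_timeout=1.0):
        # Per-peer timeout, so one slow peer does not relax checks on the others.
        self.timeout = {p: initial_timeout for p in peers}
        self.last_heard = {p: 0.0 for p in peers}
        self.suspected = set()

    def on_heartbeat(self, peer, now):
        """Record a heartbeat received from `peer` at local time `now`."""
        self.last_heard[peer] = now
        if peer in self.suspected:
            # The earlier suspicion was a mistake: retract it and be more patient.
            self.suspected.discard(peer)
            self.timeout[peer] *= 2

    def check(self, now):
        """Suspect every peer that has been silent for longer than its timeout."""
        for peer, heard in self.last_heard.items():
            if now - heard > self.timeout[peer]:
                self.suspected.add(peer)
        return set(self.suspected)


fd = HeartbeatFailureDetector(["p", "q"])
fd.on_heartbeat("p", 0.5)
fd.on_heartbeat("q", 0.5)
fd.on_heartbeat("p", 1.0)
print(fd.check(2.0))       # q has been silent too long, so it is suspected
fd.on_heartbeat("q", 2.1)  # q was only slow: the suspicion is retracted
print(fd.timeout["q"])     # and q's timeout has been doubled
```

The mistakes such a detector makes are what "unreliable" refers to; the sketch makes no claim about which of the paper's detector classes it would satisfy in a given system model.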
Building Secure and Reliable Network Applications
1996
Cited by 209 (16 self)
Abstractly, the remote procedure call problem, which an RPC protocol undertakes to solve, consists of emulating LPC using message passing. LPC has a number of "properties" -- a single procedure invocation results in exactly one execution of the procedure body, the result returned is reliably delivered to the invoker, and exceptions are raised if (and only if) an error occurs. Given a completely reliable communication environment, which never loses, duplicates, or reorders messages, and given client and server processes that never fail, RPC would be trivial to solve. The sender would merely package the invocation into one or more messages, and transmit these to the server. The server would unpack the data into local variables, perform the desired operation, and send back the result (or an indication of any exception that occurred) in a reply message. The challenge, then, is created by failures. Were it not for the possibility of process and machine crashes, an RPC protocol capable of overcomi...
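The failure-free case described in the excerpt above can be sketched in a few lines. This is a hypothetical illustration, not code from the book: the JSON message format, the `PROCEDURES` table, and the `call`/`serve` names are assumptions, and the "transport" is a direct function call standing in for a network that never loses, duplicates, or reorders messages.

```python
import json


def add(a, b):
    return a + b


# Procedures the server exports (hypothetical registry).
PROCEDURES = {"add": add}


class RemoteError(Exception):
    """Raised on the client when the server reports an exception."""


def marshal_call(name, args):
    # Client side: package the invocation into a single message.
    return json.dumps({"proc": name, "args": args}).encode("utf-8")


def serve(request_bytes):
    # Server side: unpack the data, perform the operation, build the reply.
    request = json.loads(request_bytes.decode("utf-8"))
    try:
        result = PROCEDURES[request["proc"]](*request["args"])
        reply = {"ok": True, "result": result}
    except Exception as exc:
        # Report the exception instead of a result.
        reply = {"ok": False, "error": f"{type(exc).__name__}: {exc}"}
    return json.dumps(reply).encode("utf-8")


def call(name, *args, transport=serve):
    # `transport` is assumed perfectly reliable and the server never crashes.
    reply = json.loads(transport(marshal_call(name, list(args))).decode("utf-8"))
    if not reply["ok"]:
        raise RemoteError(reply["error"])
    return reply["result"]


print(call("add", 2, 3))
```

Nothing here handles a lost request, a lost reply, or a server crash between executing the procedure and replying; those are the cases the excerpt identifies as the real difficulty.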
Uniform Reliable Multicast in a Virtually Synchronous Environment
In IEEE 13th Intl. Conf. Distributed Computing Systems, 1993
Cited by 65 (19 self)
This paper presents the definition and solution to the uniform reliable multicast problem in the virtually synchronous environment defined by the Isis system. A uniform reliable multicast of a message m has the property that if m has been received by any destination process (faulty or not), then m is received by all processes that reach a decision. Uniform reliable multicast provides a solution to the distributed commit problem. The paper defines two multicast primitives in the virtually synchronous model: reliable multicast (called view-atomic) and uniform reliable multicast (called uniform view-atomic). The view-atomic multicast is used to implement the uniform view-atomic primitive. As view-atomicity is based on the concept of process group membership, the paper establishes a connection between the process group membership and the distributed commit problems.
1 Introduction
A distributed application is composed of processes communicating through message passing. Point to point is t...
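The uniformity property stated in the abstract above can be written as a check over a finished execution. The sketch below is an assumption-laden illustration, not the paper's protocol: the function name and the trace representation (a map from each process to the set of messages it received, plus the set of processes that reached a decision) are hypothetical.

```python
def is_uniform(received, deciders):
    """Check the uniformity property on a recorded execution.

    received: dict mapping each process (faulty or not) to the set of
              message ids it received.
    deciders: set of processes that reach a decision.
    """
    # Every message received by anyone, including processes that later crashed.
    received_anywhere = set().union(*received.values())
    # Each deciding process must have received all of those messages.
    return all(received_anywhere <= received.get(p, set()) for p in deciders)


# c crashed without receiving anything: a and b agree, the property holds.
print(is_uniform({"a": {"m1"}, "b": {"m1"}, "c": set()}, {"a", "b"}))

# c received m2 and then crashed, but a and b never got m2: the property fails.
print(is_uniform({"a": {"m1"}, "b": {"m1"}, "c": {"m1", "m2"}}, {"a", "b"}))
```

The second trace is the case that separates the uniform property from a non-uniform one: a faulty process counts as a receiver even though it never decides.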
Time-Optimal Message-Efficient Work Performance in the Presence of Faults
1994
Cited by 35 (5 self)
Performing work in parallel by a multitude of processes in a distributed environment is currently a fast-growing area of computer applications (due to its cost effectiveness). Adaptation of such applications to changes in the system's parallelism (i.e., the availability of processes) is essential for improved performance and reliability. In this work we consider one aspect of coping with dynamic process failures in such a setting, namely the following scenario formulated by Dwork, Halpern and Waarts [DHW92]: a system of n synchronous processes that communicate only by sending messages to one another. These processes must perform m independent units of work. Processes may fail by crashing and wait-freeness is required, i.e., whenever at least one process survives, all m units of work will be performed. We consider the notion of fast algorithms in this setting, yet we are not willing to trade improved time for a high cost in communication. Thus, we require message efficiency as well. ...
Resolving Message Complexity of Byzantine Agreement and Beyond
In Proc. 36th IEEE Symposium on Foundations of Computer Science, 1995
Cited by 34 (3 self)
Byzantine Agreement among processors is a basic primitive in distributed computing. It comes in a number of basic fault models: "Crash", "Omission" and "Malicious" adversarial behaviors. The message complexity of the primitive has been known for the strong failure models of Malicious and Omission adversary since the early 80's, while the question for the more benign Crash failure model has been open. In this paper we show how to solve agreement in the presence of crash failures using O(n) messages which is optimal, thus settling a thirteen year old open problem. Our solution has almost linear time and our new algorithmic techniques have further implications:
• A family of "early stopping" agreement protocols with improved message-complexity.
• A new solution to "Checkpoint" yielding a substantial improvement of the protocol for distributed work performance under adaptive parallelism in a network of workstations.
Columbia University and Tel-Aviv University. galil@cs.columbia.edu...
Non-Blocking Atomic Commitment
In Sape Mullender, editor, Distributed Systems, 1993
Cited by 30 (1 self)
via anonymous FTP from the areaftp.cs.unibo.it:/pub/TR/UBLCS in compressed PostScript format. Abstracts are available from the same host in the directory /pub/TR/ABSTRACTS in plain text format. All local authors can be reached via e-mail at the address last-name@cs.unibo.it.
Understanding the power of the virtually-synchronous model (Extended Abstract)
1993
Cited by 8 (0 self)
this paper is to define a clear semantics of the virtually-synchronous model, and to show that distributed commit can be solved in the model. This is in a sense not surprising, as it has been shown that distributed consensus can be solved in the asynchronous model with a very weak failure detector [6]. Considering this result, the virtually-synchronous model becomes extremely powerful, and more basic than the transaction model, providing an interesting broader picture of the problem of building fault-tolerant applications. Section 2 briefly introduces the notion of failure detector and presents the virtually-synchronous model, both informally with respect to the Isis model, and more formally concerning the definition of reliable multicasts. Section 3 then considers the problem of distributed commit in the virtually-synchronous model.
2 Virtually-synchronous model
The virtually-synchronous model (vs-model) considers processes with fail-stop failure semantics and incorporates a failure detector FD. FD sends information on crashed and recovered processes in the form of views, which are sets of processes that are considered as alive by the failure detector. The failure detector is allowed to be imperfect, i.e., to make incorrect failure detections. Building a failure detector is done by a so-called GMP (Group Membership Problem) protocol. [10] describes an implementation, considering fail-stop processes and network partitions. The FD can be seen as sending views to processes, and we explicitly consider the important distinction between reception and delivery of a view by a process, which is common for protocols ordering communication-related events in a distributed system [12]. Considering the failure detector, the virtually-synchronous model (vs-model) can be defined informa...
Unreliable Failure Detectors For Asynchronous Distributed Systems
In Proceedings of the 10th Annual ACM Symposium on Principles of Distributed Computing, 1993
Cited by 7 (1 self)
equivalent in asynchronous systems. Thus all our results regarding the solvability of Consensus using failure detectors apply to Atomic Broadcast as well. The work in this thesis was funded by an IBM graduate fellowship and grants from NSF, DARPA/NASA, the IBM Endicott Programming Laboratory, Siemens Corp and the Natural Sciences and Engineering Research Council of Canada.
Biographical Sketch
Tushar Deepak Chandra was born in New Delhi, India on November 13, 1966. He spent his childhood in various cities in India: Bombay, Calcutta and finally Kanpur. After completing high school at the Doon school, he went on to do a Bachelor of Technology in Computer Science at the Indian Institute of Technology at Kanpur. He joined the graduate program in Computer Science at Cornell University in August 1988.
This thesis is dedicated to my parents who taught me how to think.
Acknowledgements
A large number of people contributed either directly or i
A Lightweight Solution to Uniform Atomic Broadcast for Asynchronous Systems: Proofs
1996
Cited by 7 (0 self)
Chandra and Toueg proposed in [CT93] a new approach to overcome the impossibility of deterministically reaching Consensus -- and by corollary Atomic Broadcast -- in asynchronous systems subject to crash failures. They augment the asynchronous system with a possibly Unreliable Failure Detector which provides some information about the operational state of processes. In this report, we present an extension of the Consensus problem that we call Uniform Prefix Agreement. This extension enables all the processes to propose a flow of messages during an execution -- instead of one as in the Consensus problem -- and uses all these proposed messages to compose its decision value. Prefix Agreement is based on an Unreliable Failure Detector. We use repeated executions of Prefix Agreement to build an efficient and lightweight Uniform Atomic Broadcast algorithm. This report describes the Uniform Prefix Agreement and Uniform Atomic Broadcast algorithms, and provides proofs of their correctnes...
An Architecture for Dynamic Scalable Self-Managed Distributed Transactions
In Proceedings of DOA 2004, 2004
Cited by 7 (5 self)
This paper presents a middleware architecture and a generic orchestrating protocol for implementing distributed atomic transactions for large scale dynamic systems in a self-managing manner. In particular, the proposed solution is fully distributed, allows dynamic changes in the environment, and nodes are neither assumed to be aware of the size of the system nor of its entire composition. The architecture includes two modules and three services. The modules are expected to be instantiated and executed among relatively small sets of nodes in the context of a single transaction and, therefore, can be implemented using known classical distributed computing approaches. On the other hand, services are long lived abstractions that may involve all nodes and should be implemented using known peer-to-peer techniques. The proposed architecture is also interesting in the sense that it brings together several seemingly distinct research areas, including distributed consensus, group membership, notification services (publish/subscribe), scalable conflict detection (or locking), and scalable persistent storage. The paper

