Unreliable Failure Detectors for Reliable Distributed Systems
Journal of the ACM, 1996
Cited by 977 (19 self)
We introduce the concept of unreliable failure detectors and study how they can be used to solve Consensus in asynchronous systems with crash failures. We characterise unreliable failure detectors in terms of two properties — completeness and accuracy. We show that Consensus can be solved even with unreliable failure detectors that make an infinite number of mistakes, and determine which ones can be used to solve Consensus despite any number of crashes, and which ones require a majority of correct processes. We prove that Consensus and Atomic Broadcast are reducible to each other in asynchronous systems with crash failures; thus the above results also apply to Atomic Broadcast. A companion paper shows that one of the failure detectors introduced here is the weakest failure detector for solving Consensus [Chandra et al. 1992].
Wait-Free Synchronization
ACM Transactions on Programming Languages and Systems, 1993
Cited by 763 (27 self)
A wait-free implementation of a concurrent data object is one that guarantees that any process can complete any operation in a finite number of steps, regardless of the execution speeds of the other processes. The problem of constructing a wait-free implementation of one data object from another lies at the heart of much recent work in concurrent algorithms, concurrent data structures, and multiprocessor architectures. In the first part of this paper, we introduce a simple and general technique, based on reduction to a consensus protocol, for proving statements of the form "there is no wait-free implementation of X by Y." We derive a hierarchy of objects such that no object at one level has a wait-free implementation in terms of objects at lower levels. In particular, we show that atomic read/write registers, which have been the focus of much recent attention, are at the bottom of the hierarchy: they cannot be used to construct wait-free implementations of many simple and familiar da...
The Weakest Failure Detector for Solving Consensus
1996
Cited by 435 (21 self)
We determine what information about failures is necessary and sufficient to solve Consensus in asynchronous distributed systems subject to crash failures. In [CT91], it is shown that ◇W, a failure detector that provides surprisingly little information about which processes have crashed, is sufficient to solve Consensus in asynchronous systems with a majority of correct processes. In this paper, we prove that to solve Consensus, any failure detector has to provide at least as much information as ◇W. Thus, ◇W is indeed the weakest failure detector for solving Consensus in asynchronous systems with a majority of correct processes.
A Methodology for Implementing Highly Concurrent Data Objects
1993
Cited by 333 (11 self)
A concurrent object is a data structure shared by concurrent processes. Conventional techniques for implementing concurrent objects typically rely on critical sections: ensuring that only one process at a time can operate on the object. Nevertheless, critical sections are poorly suited for asynchronous systems: if one process is halted or delayed in a critical section, other, nonfaulty processes will be unable to progress. By contrast, a concurrent object implementation is lock-free if it always guarantees that some process will complete an operation in a finite number of steps, and it is wait-free if it guarantees that each process will complete an operation in a finite number of steps. This paper proposes a new methodology for constructing lock-free and wait-free implementations of concurrent objects. The object’s representation and operations are written as stylized sequential programs, with no explicit synchronization. Each sequential operation is automatically transformed into a lock-free or wait-free operation using novel synchronization and memory management algorithms. These algorithms are presented for a multiple instruction/multiple data (MIMD) architecture in which n processes communicate by applying atomic read, write, load_linked, and store_conditional operations to a shared memory.
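The abstract above describes turning a sequential operation into a lock-free one with load_linked and store_conditional. A minimal sketch of that general idea follows; it is an illustration under stated assumptions, not the paper's algorithm. The names LLSCCell, apply_lock_free, increment, and run_demo are hypothetical, and because Python has no hardware load_linked/store_conditional, a small internal lock stands in for the atomicity the hardware would provide.

```python
import threading


class LLSCCell:
    """Simulated load_linked/store_conditional cell (hypothetical helper).

    Assumption: the internal lock only models the atomicity that real
    hardware gives to each single LL or SC step. It is never held while
    the caller's sequential operation runs.
    """

    def __init__(self, value):
        self._value = value
        self._version = 0
        self._guard = threading.Lock()

    def load_linked(self):
        # Return the current object together with the version observed.
        with self._guard:
            return self._value, self._version

    def store_conditional(self, new_value, seen_version):
        # Succeed only if nobody has stored since the matching load_linked.
        with self._guard:
            if self._version != seen_version:
                return False
            self._value = new_value
            self._version += 1
            return True


def apply_lock_free(cell, sequential_op):
    """Run a plain sequential operation on a private copy, then try to publish it."""
    while True:
        snapshot, version = cell.load_linked()
        private_copy = dict(snapshot)          # copy the (small) object
        result = sequential_op(private_copy)   # no synchronization inside
        if cell.store_conditional(private_copy, version):
            return result
        # Another process published first: retry against the new state.


def increment(counter):
    """Sequential operation written with no explicit synchronization."""
    counter["count"] += 1
    return counter["count"]


def run_demo(num_threads=4, ops_per_thread=200):
    cell = LLSCCell({"count": 0})

    def worker():
        for _ in range(ops_per_thread):
            apply_lock_free(cell, increment)

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    final_state, _ = cell.load_linked()
    return final_state["count"]
```

A failed store_conditional means some other process succeeded in the meantime, which is the lock-free progress condition stated in the abstract. This sketch makes no wait-free guarantee, since an individual process may retry indefinitely.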
Fast Randomized Consensus using Shared Memory
Journal of Algorithms, 1988
Cited by 133 (32 self)
We give a new randomized algorithm for achieving consensus among asynchronous processes that communicate by reading and writing shared registers. The fastest previously known algorithm has exponential expected running time. Our algorithm is polynomial, requiring an expected O(n^4) operations. Applications of this algorithm include the elimination of critical sections from concurrent data structures and the construction of asymptotically unbiased shared coins.
The Topological Structure of Asynchronous Computability
Journal of the ACM, 1996
Cited by 121 (11 self)
We give necessary and sufficient combinatorial conditions characterizing the tasks that can be solved by asynchronous processes, of which all but one can fail, that communicate by reading and writing a shared memory. We introduce a new formalism for tasks, based on notions from classical algebraic and combinatorial topology, in which a task's possible input and output values are each associated with high-dimensional geometric structures called simplicial complexes. We characterize computability in terms of the topological properties of these complexes. This characterization has a surprising geometric interpretation: a task is solvable if and only if the complex representing the task's allowable inputs can be mapped to the complex representing the task's allowable outputs by a function satisfying certain simple regularity properties. Our formalism thus replaces the "operational" notion of a wait-free decision task, expressed in terms of interleaved computations unfolding ...
The Consensus Problem in Unreliable Distributed Systems (A Brief Survey)
2000
Cited by 115 (3 self)
Agreement problems involve a system of processes, some of which may be faulty. A fundamental problem of fault-tolerant distributed computing is for the reliable processes to reach a consensus. We survey the considerable literature on this problem that has developed over the past few years and give an informal overview of the major theoretical results in the area.
Reaching approximate agreement in the presence of faults
Journal of the ACM, 1986
Cited by 107 (10 self)
This paper considers a variant of the Byzantine Generals problem, in which processes start with arbitrary real values rather than Boolean values or values from some bounded range, and in which approximate, rather than exact, agreement is the desired goal. Algorithms are presented to reach approximate agreement in asynchronous, as well as synchronous systems. The asynchronous agreement algorithm is an interesting contrast to a result of Fischer et al., who show that exact agreement with guaranteed termination is not attainable in an asynchronous system with as few as one faulty process. The algorithms work by successive approximation, with a provable convergence rate that depends on the ratio between the number of faulty processes and the total number of processes. Lower bounds on the convergence rate for algorithms of this form are proved, and the algorithms presented are shown to ...
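The successive-approximation idea in this abstract can be illustrated with a small simulation. The sketch below is a simplified stand-in, not the paper's approximation function: it assumes a synchronous round, n processes of which at most t are faulty with n > 3t, and a plain trimmed mean as the update rule. The names trimmed_mean, run_round, and spread are hypothetical.

```python
def trimmed_mean(values, t):
    """Discard the t smallest and t largest values, then average the rest."""
    ordered = sorted(values)
    kept = ordered[t:len(ordered) - t]
    return sum(kept) / len(kept)


def run_round(correct_values, t, faulty_reports):
    """Simulate one synchronous round for the correct processes.

    faulty_reports[i] is the list of (arbitrary) values that the faulty
    processes send to correct process i; a faulty process may tell each
    receiver something different.
    """
    return [
        trimmed_mean(correct_values + faulty_reports[i], t)
        for i in range(len(correct_values))
    ]


def spread(values):
    """Diameter of the set of values held by correct processes."""
    return max(values) - min(values)


# Example: n = 4 processes, t = 1 faulty. The faulty process reports a
# different value to each of the three correct processes.
initial = [0.0, 4.0, 10.0]
lies = [[-100.0], [100.0], [5.0]]
after_one = run_round(initial, 1, lies)      # [2.0, 7.0, 4.5]
after_two = run_round(after_one, 1, lies)    # [3.25, 5.75, 4.75]
```

In this example the spread of correct values falls from 10 to 5 to 2.5, and every new value stays inside the range of the initial correct values, because trimming t values from each end removes any extreme report from the single faulty process.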
Fundamentals of Fault-Tolerant Distributed Computing in Asynchronous Environments
ACM Computing Surveys, 1999
Cited by 82 (9 self)
Fault tolerance in distributed computing is a wide area with a significant body of literature that is vastly diverse in methodology and terminology. This paper aims at structuring the area and thus guiding readers into this interesting field. We use a formal approach to define important terms like fault, fault tolerance, and redundancy. This leads to four distinct forms of fault tolerance and to two main phases in achieving them: detection and correction. We show that this can help to reveal inherently fundamental structures that contribute to understanding and unifying methods and terminology. By doing this, we survey many existing methodologies and discuss their relations. The underlying system model is the close-to-reality asynchronous message-passing model of distributed computing.