Optimistic recovery in distributed systems
- ACM Transactions on Computer Systems
, 1985
Cited by 284 (5 self)
Optimistic Recovery is a new technique supporting application-independent transparent recovery from processor failures in distributed systems. In optimistic recovery, communication, computation, and checkpointing proceed asynchronously. Synchronization is replaced by causal dependency tracking, which enables a posteriori reconstruction of a consistent distributed system state following a failure using process rollback and message replay. Because there is no synchronization among computation, communication, and checkpointing, optimistic recovery can tolerate the failure of an arbitrary number of processors and yields better throughput and response time than other general recovery techniques whenever failures are infrequent.
Closure and Convergence: A Foundation of Fault-Tolerant Computing
- IEEE Transactions on Software Engineering
, 1993
Cited by 103 (28 self)
We give a formal definition of what it means for a system to "tolerate" a class of "faults". The definition consists of two conditions: One, if a fault occurs when the system state is within a set of "legal" states, the resulting state is within some larger set and, if faults continue occurring, the system state remains within that larger set (Closure). And two, if faults stop occurring, the system eventually reaches a state within the legal set (Convergence). We demonstrate the applicability of our definition for specifying and verifying the fault-tolerance properties of a variety of digital and computer systems. Further, using the definition, we obtain a simple classification of fault-tolerant systems and discuss methods for their systematic design.
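The two conditions in this abstract can be illustrated on a small finite-state system. The sketch below is not taken from the paper: the function names, the representation of steps as functions returning successor sets, and the toy counter system are all assumptions chosen for illustration.

```python
def is_closed(states, step):
    """True if every successor (under `step`) of a state in `states` stays in `states`."""
    return all(nxt in states for s in states for nxt in step(s))


def converges(span, legal, program_step):
    """True if every fault-free run starting in `span` must reach `legal`.

    Assumes a finite state space. Grows the set of states from which all
    program-only runs are forced into `legal`, until nothing changes.
    """
    good = set(legal)
    changed = True
    while changed:
        changed = False
        for s in span - good:
            succ = program_step(s)
            # A state with no successors outside `legal` never converges.
            if succ and all(n in good for n in succ):
                good.add(s)
                changed = True
    return span <= good


def tolerates(legal, span, program_step, fault_step):
    """Closure of `legal` and of the larger `span`, plus convergence back to `legal`."""
    with_faults = lambda s: program_step(s) | fault_step(s)
    return (
        legal <= span
        and is_closed(legal, program_step)   # no faults: stay legal
        and is_closed(span, with_faults)     # Closure: faults keep us inside span
        and converges(span, legal, program_step)  # Convergence: faults stop, return to legal
    )


# Hypothetical toy system: a counter that the program drains toward 0,
# and a fault that may reset it to any value in 0..3.
LEGAL = {0}
SPAN = {0, 1, 2, 3}
drain = lambda s: {max(s - 1, 0)}
corrupt = lambda s: {0, 1, 2, 3}
# A faulty design for comparison: states 2 and 3 cycle forever.
stuck = lambda s: {3} if s == 2 else ({2} if s == 3 else {0})
```

Here `tolerates(LEGAL, SPAN, drain, corrupt)` holds, while replacing `drain` with `stuck` violates Convergence because the run 2, 3, 2, 3, ... never reaches the legal set.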
The Consensus Problem in Unreliable Distributed Systems (A Brief Survey)
, 2000
Cited by 102 (2 self)
Agreement problems involve a system of processes, some of which may be faulty. A fundamental problem of fault-tolerant distributed computing is for the reliable processes to reach a consensus. We survey the considerable literature on this problem that has developed over the past few years and give an informal overview of the major theoretical results in the area.
Programming Simultaneous Actions Using Common Knowledge
- Algorithmica
, 1988
Cited by 86 (23 self)
This work applies the theory of knowledge in distributed systems to the design of efficient fault-tolerant protocols. We define a large class of problems requiring coordinated, simultaneous action in synchronous systems, and give a method of transforming specifications of such problems into protocols that are optimal in all runs: for every possible input to the system and faulty processor behavior, these protocols are guaranteed to perform the simultaneous actions as soon as any other protocol could possibly perform them. This transformation is performed in two steps. In the first step, we extract directly from the problem specification a high-level protocol programmed using explicit tests for common knowledge. In the second step, we carefully analyze when facts become common knowledge, thereby providing a method of efficiently implementing these protocols in many variants of the omissions failure model. In the generalized omissions model, however, our analysis shows that testing for common knowledge is NP-hard. Given the close correspondence between common knowledge and simultaneous actions, we are able to show that no optimal protocol for any such problem can be computationally efficient in this model. The analysis in this paper exposes many subtle differences between the failure models, including the precise point at which this gap in complexity occurs.
Transaction Management in the R* Distributed Database Management System
- ACM Transactions on Database Systems
, 1986
Cited by 73 (0 self)
This paper deals with the transaction management aspects of the R* distributed database system. It concentrates primarily on the description of the R* commit protocols, Presumed Abort (PA) and Presumed Commit (PC). PA and PC are extensions of the well-known two-phase (2P) commit protocol. PA is optimized for read-only transactions and a class of multisite update transactions, and PC is optimized for other classes of multisite update transactions. The optimizations result in reduced intersite message traffic and log writes, and, consequently, a better response time. The paper also discusses R*'s approach toward distributed deadlock detection and resolution.
Recovery management in QuickSilver
- ACM Transactions on Computer Systems
, 1988
Cited by 61 (0 self)
developed at the IBM Almaden Research Center, which uses atomic transactions as a unified failure recovery mechanism for a client-server structured distributed system. Transactions allow failure atomicity for related activities at a single server or at a number of independent servers. Rather than bundling transaction management into a dedicated language or recoverable object manager, QuickSilver exposes the basic commit protocol and log recovery primitives, allowing clients and servers to tailor their recovery techniques to their specific needs. Servers can implement their own log recovery protocols rather than being required to use a system-defined protocol. These decisions allow servers to make their own choices to balance simplicity, efficiency, and recoverability.
Categories and Subject Descriptors: D.4.3 [Operating Systems]: File System Management - distributed file systems; file organization; maintenance; D.4.5 [Operating Systems]: Reliability - fault-tolerance; checkpoint/restart; H.2.4 [Database Management]: Systems - distributed systems; transaction processing
Consensus on transaction commit
- ACM Transactions on Database Systems
Cited by 42 (0 self)
This is a preliminary release of an article accepted by ACM Transactions on Database Systems. Copyright 2005 by the Association for Computing Machinery, Inc. The distributed transaction commit problem requires reaching agreement on whether a transaction is committed or aborted. The classic Two-Phase
Efficient commit protocols for the tree of processes model of distributed transactions
- Proc. 2nd ACM SIGACT/SIGOPS Symposium on Principles of Distributed Computing
, 1983
Cited by 35 (3 self)
ABSTRACT: This paper describes two efficient distributed transaction commit protocols, the
Non-Blocking Atomic Commitment
- In Sape Mullender, editor, Distributed Systems
, 1993
Cited by 30 (1 self)
via anonymous FTP from the area ftp.cs.unibo.it:/pub/TR/UBLCS in compressed PostScript format. Abstracts are available from the same host in the directory /pub/TR/ABSTRACTS in plain text format. All local authors can be reached via e-mail at the address last-name@cs.unibo.it.
Abstractions for Constructing Dependable Distributed Systems
, 1992
Cited by 14 (3 self)
Shivakant Mishra and Richard D. Schlichting, TR 92-19
Distributed systems, in which multiple machines are connected by a communications network, are often used to build highly dependable computing systems. However, constructing the software required to realize such dependability is a difficult task since it requires the programmer to build fault-tolerant software that can continue to function despite failures. To simplify this process, canonical structuring techniques or programming paradigms have been developed, including the object/action model, the primary/backup approach, the state machine approach, and conversations. In this paper, some of the system abstractions designed to support these paradigms are described. These abstractions, which are termed fault-tolerant services, can be categorized into two types. One type provides functionality similar to standard hardware or operating system services, but with improved ...

