Results 1 - 10
of
232
Understanding Fault-Tolerant Distributed Systems
- Communications of the ACM
, 1993
"... We propose a small number of basic concepts that can be used to explain the architecture of fault-tolerant distributed systems and we discuss a list of architectural issues that we find useful to consider when designing or examining such systems. For each issue we present known solutions and design ..."
Abstract
-
Cited by 296 (23 self)
- Add to MetaCart
We propose a small number of basic concepts that can be used to explain the architecture of fault-tolerant distributed systems and we discuss a list of architectural issues that we find useful to consider when designing or examining such systems. For each issue we present known solutions and design alternatives, we discuss their relative merits and we give examples of systems which adopt one approach or the other. The aim is to introduce some order in the complex discipline of designing and understanding fault-tolerant distributed systems. 1 Introduction Computing systems consist of a multitude of hardware and software components that are bound to fail eventually. In many systems, such component failures can lead to unanticipated, potentially disruptive failure behavior and to service unavailability. Some systems are designed to be fault-tolerant: they either exhibit a well-defined failure behavior when components fail or mask component failures to users, that is, continue to provid...
Building Secure and Reliable Network Applications
, 1996
"... ly, the remote procedure call problem, which an RPC protocol undertakes to solve, consists of emulating LPC using message passing. LPC has a number of "properties" -- a single procedure invocation results in exactly one execution of the procedure body, the result returned is reliably delivered to th ..."
Abstract
-
Cited by 209 (16 self)
- Add to MetaCart
ly, the remote procedure call problem, which an RPC protocol undertakes to solve, consists of emulating LPC using message passing. LPC has a number of "properties" -- a single procedure invocation results in exactly one execution of the procedure body, the result returned is reliably delivered to the invoker, and exceptions are raised if (and only if) an error occurs. Given a completely reliable communication environment, which never loses, duplicates, or reorders messages, and given client and server processes that never fail, RPC would be trivial to solve. The sender would merely package the invocation into one or more messages, and transmit these to the server. The server would unpack the data into local variables, perform the desired operation, and send back the result (or an indication of any exception that occurred) in a reply message. The challenge, then, is created by failures. Were it not for the possibility of process and machine crashes, an RPC protocol capable of overcomi...
Recovery in Distributed Systems Using Optimistic Message Logging and Checkpointing
, 1988
"... In a distributed system using message logging and checkpointing to provide fault tolerance, there is always a unique maximum recoverable system state, regardless of the message logging protocol used. The proof of this relies on the observation that the set of system states that have occurred during ..."
Abstract
-
Cited by 199 (13 self)
- Add to MetaCart
In a distributed system using message logging and checkpointing to provide fault tolerance, there is always a unique maximum recoverable system state, regardless of the message logging protocol used. The proof of this relies on the observation that the set of system states that have occurred during any single execution of a system forms a lattice, with the sets of consistent and recoverable system states as sublattices. The maximum recoverable system state never decreases, and if all messages are eventually logged, the domino e ect cannot occur. This paper presents a general model for reasoning about recovery in such a system and, based on this model, an efficient algorithm for determining the maximum recoverable system state at any time. This work uni es existing approaches to fault tolerance based on message logging and checkpointing, and improves on existing methods for optimistic recovery in distributed systems.
Manetho: Transparent Rollback-Recovery with Low Overhead, Limited Rollback and Fast Output Commit
- IEEE TRANSACTIONS ON COMPUTERS
, 1992
"... Manetho is a new transparent rollback-recovery protocol for long-running distributed computations. It uses a novel combination of antecedence graph maintenance, uncoordinated checkpointing, and sender-based message logging. Manetho simultaneously achieves the advantages of pessimistic message loggin ..."
Abstract
-
Cited by 181 (10 self)
- Add to MetaCart
Manetho is a new transparent rollback-recovery protocol for long-running distributed computations. It uses a novel combination of antecedence graph maintenance, uncoordinated checkpointing, and sender-based message logging. Manetho simultaneously achieves the advantages of pessimistic message logging, namely limited rollback and fast output commit, and the advantage of optimistic message logging, namely low failure-free overhead. These advantages come at the expense of a complex recovery scheme.
The Performance of Consistent Checkpointing
- In Proceedings of the 11th Symposium on Reliable Distributed Systems
, 1992
"... Consistent checkpointing provides transparent fault tolerance for long-running distributed applications. In this paper we describe performance measurements of an implementation of consistent checkpointing. Our measurements show that consistent checkpointing performs remarkably well. We executed eigh ..."
Abstract
-
Cited by 181 (9 self)
- Add to MetaCart
Consistent checkpointing provides transparent fault tolerance for long-running distributed applications. In this paper we describe performance measurements of an implementation of consistent checkpointing. Our measurements show that consistent checkpointing performs remarkably well. We executed eight compute-intensive distributed applications on a network of 16 diskless Sun-3/60 workstations, comparing the performance without checkpointing to the performance with consistent checkpoints taken at 2-minute intervals. For six of the eight applications, the running time increased by less than 1% as a result of the checkpointing. The highest overhead measured for any of the applications was 5.8%. Incremental checkpointing and copy-on-write checkpointing were the most effective techniques in lowering the running time overhead. These techniques reduce the amount of data written to stable storage and allow the checkpoint to proceed concurrently with the execution of the processes. The overhead ...
Efficient algorithms for distributed snapshots and global virtual time approximation
- Journal of Parallel and Distributed Computing
, 1993
"... Abstract. This paper presents snapshot algorithms for determining a consistent global state of a distributed system without significantly affecting the underlying computation. These algorithms do not require channels to be FIFO or messages to be acknowledged. Only a small amount of storage is needed ..."
Abstract
-
Cited by 117 (1 self)
- Add to MetaCart
Abstract. This paper presents snapshot algorithms for determining a consistent global state of a distributed system without significantly affecting the underlying computation. These algorithms do not require channels to be FIFO or messages to be acknowledged. Only a small amount of storage is needed. An important application of a snapshot algorithm is Global Virtual Time determination for distributed simulations. The paper proposes new and efficient Global Virtual Time approximation schemes based on snapshot algorithms and distributed termination detection principles. 1
Overview of Multidatabase Transaction Management
- VLDB JOURNAL
, 1992
"... A multidatabase system (MDBS) is a facility that allows users access to data located in multiple autonomous database management systems (DBMSs). In such a system, global transactions are executed under the control of the MDBS. Independently, local transactions are executed under the control of the l ..."
Abstract
-
Cited by 80 (13 self)
- Add to MetaCart
A multidatabase system (MDBS) is a facility that allows users access to data located in multiple autonomous database management systems (DBMSs). In such a system, global transactions are executed under the control of the MDBS. Independently, local transactions are executed under the control of the local DBMSs. Each local DBMS integrated by the MDBS may employ a different transaction management scheme. In addition, each local DBMS has complete control over all transactions (global and local) executing at its site, including the ability to abort at any point any of the transactions executing at its site. Typically, no design or internal DBMS structure changes are allowed in order to accommodate the MDBS. Furthermore, the local DBMSs may not be aware of each other, and, as a consequence, cannot coordinate their actions. Thus, traditional techniques for ensuring transaction atomicity and consistency in homogeneous distributed database systems may not be appropriate for an MDBS environment....
The LAM/MPI checkpoint/restart framework: System-initiated checkpointing
- in Proceedings, LACSI Symposium, Sante Fe
, 2003
"... As high-performance clusters continue to grow in size and popularity, issues of fault tolerance and reliability are becoming limiting factors on application scalability. To address these issues, we present the design and implementation of a system for providing coordinated checkpointing and rollback ..."
Abstract
-
Cited by 67 (7 self)
- Add to MetaCart
As high-performance clusters continue to grow in size and popularity, issues of fault tolerance and reliability are becoming limiting factors on application scalability. To address these issues, we present the design and implementation of a system for providing coordinated checkpointing and rollback recovery for MPI-based parallel applications. Our approach integrates the Berkeley Lab BLCR kernellevel process checkpoint system with the LAM implementation of MPI through a defined checkpoint/restart interface. Checkpointing is transparent to the application, allowing the system to be used for cluster maintenance and scheduling reasons as well as for fault tolerance. Experimental results show negligible communication performance impact due to the incorporation of the checkpoint support capabilities into LAM/MPI. 1
Low-Cost Checkpointing and Failure Recovery in Mobile Computing Systems
- IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS
, 1994
"... A mobile computing system consists of mobile and stationary nodes, connected to each other by a communication network. The presence of mobile nodes in the system places constraints on the permissible energy consumption and available communication bandwidth. To minimize the lost computation during ..."
Abstract
-
Cited by 63 (8 self)
- Add to MetaCart
A mobile computing system consists of mobile and stationary nodes, connected to each other by a communication network. The presence of mobile nodes in the system places constraints on the permissible energy consumption and available communication bandwidth. To minimize the lost computation during recovery from node failures, periodic collection of a consistent snapshot of the system (checkpoint) is required. Locating the mobile nodes contributes to the checkpointing and recovery costs. Synchronous snapshot collection algorithms, designed for static networks, either force every node in the system to take a new local snapshot, or block the underlying computation during snapshot collection. Hence, they are not suitable for mobile computing systems. This paper presents a synchronous snapshot collection algorithm for mobile systems that neither forces every node to take a local snapshot, nor blocks the underlying computation during snapshot collection. If a node initiates snapsho...

