Results 1 - 10
of
21
A Survey of Rollback-Recovery Protocols in Message-Passing Systems
, 1996
"... this paper, we use the terms event logging and message logging interchangeably ..."
Abstract
-
Cited by 474 (24 self)
- Add to MetaCart
this paper, we use the terms event logging and message logging interchangeably
Starfish: Fault-Tolerant Dynamic MPI Programs on Clusters of Workstations (Extended Abstract)
"... This paper reports on the architecture and design of Starfish, an environment for executing dynamic (and static) MPI-2 programs on a cluster of workstations. Starfish is unique in being efficient, faulttolerant, highly available, and dynamic as a system internally, and in supporting fault-tolerance ..."
Abstract
-
Cited by 84 (6 self)
- Add to MetaCart
This paper reports on the architecture and design of Starfish, an environment for executing dynamic (and static) MPI-2 programs on a cluster of workstations. Starfish is unique in being efficient, faulttolerant, highly available, and dynamic as a system internally, and in supporting fault-tolerance and dynamicity for its application programs as well. Starfish achieves these goals by combining group communication technology with checkpoint/restart, and uses a novel architecture that is both flexible and portable and keeps group communication outside the critical data path, for maximum performance.
Lazy Checkpoint Coordination for Bounding Rollback Propagation
- in Proc. IEEE Symp. Reliable Distributed Syst
, 1993
"... In this paper, we propose the technique of lazy checkpoint coordination which preserves process autonomy while employing communication-induced checkpoint coordination for bounding rollback propagation. The notion of laziness is introduced to control the coordination frequency and allow a flexible tr ..."
Abstract
-
Cited by 54 (7 self)
- Add to MetaCart
In this paper, we propose the technique of lazy checkpoint coordination which preserves process autonomy while employing communication-induced checkpoint coordination for bounding rollback propagation. The notion of laziness is introduced to control the coordination frequency and allow a flexible trade-off between the cost of checkpoint coordination and the average rollback distance. Worst-case overhead analysis provides a means for estimating the extra checkpoint overhead. Communication trace-driven simulation for several parallel programs is used to evaluate the benefits of the proposed scheme. 1 Introduction Uncoordinated checkpointing [1--3] for parallel and distributed systems allows maximum process autonomy and independent design of recovery capability for each process. However, in a general nondeterministic execution, cascading rollback propagation may result in the domino effect [4] which can prevent progression of the recovery line. It has been shown that message reordering [...
On the Use and Implementation of Message Logging
- In 24th International Symposium on Fault-Tolerant Computing
, 1994
"... Message logging has long been advocated as offering better failure-free performance than coordinated checkpointing. On the contrary, we present a number of experiments showing that for than coordinated checkpointing. Message logging protocolscompute-intensive applications executing in parallel on cl ..."
Abstract
-
Cited by 51 (2 self)
- Add to MetaCart
Message logging has long been advocated as offering better failure-free performance than coordinated checkpointing. On the contrary, we present a number of experiments showing that for than coordinated checkpointing. Message logging protocolscompute-intensive applications executing in parallel on clusters of workstations, message logging has higher failure-free overhead , however, resuls in much shorter output latency than coordinated check pointing. Therefore, message logging should be used for applications involving substantial interactions with the outside world, while coordinated checkpointing should be used otherwise. We also present an unorthodox message logging design that uses coordinated checkpointing with message logging, departing from the conventional approaches that use independent checkpointing. This combination of message logging and coordinated checkpointing offers several advantages, including improved failure-free performance, bounded recovery time, simplified garbage collection and reduced complexity. Meanwhile the new protcolos, retain the advantages of the conventional message logging protocols with respect to output commit. Finally, we discuss three "lessons learned" from an implementation of various message logging protocols. First, during output commit, only the dependency information for the message in the log needs to be written to the stable storage. It is not necessary to write the message data to stable storage, leading to faster output commit. Second, the use of copy-on-write in the implementation of message logging substantially reduces the logging overhead for communication-intensive programs. Finally, we provide quantitative evidence supporting previous qualitative claims about the superiority of sender-based message logging over receiver-based logging.
Optimistic Message Logging for Independent Checkpointing in Message-Passing Systems
- In Proceedings of the 11th Symposium on Reliable Distributed Systems
, 1992
"... Message-passing systems with communication protocol transparent to the applications typically require message logging to ensure consistency between checkpoints. This paper describes a periodic independent checkpointing scheme with optimistic logging to reduce performance degradation during normal ex ..."
Abstract
-
Cited by 49 (8 self)
- Add to MetaCart
Message-passing systems with communication protocol transparent to the applications typically require message logging to ensure consistency between checkpoints. This paper describes a periodic independent checkpointing scheme with optimistic logging to reduce performance degradation during normal execution while keeping the recovery cost acceptable. Both time and space overhead for message logging can be reduced by detecting messages that need not be logged. A checkpoint space reclamation algorithm is presented to reclaim all checkpoints which are not useful for any possible future recovery. Communication trace-driven simulation for several hypercube programs is used to evaluate the techniques. 1 Introduction Numerous approaches to checkpointing and rollback recovery have been proposed in the literature for parallel systems. In terms of checkpointing techniques, they can be classified into two basic categories. Coordinated checkpointing schemes synchronize computation with checkpoint...
A Low-Overhead Recovery Technique Using Quasi-Synchronous Checkpointing
- Proc. IEEE Int. Conference on Distributed Computing Systems
, 1996
"... In this paper, we propose a quasi-synchronous checkpointing algorithm and a low-overhead recovery algorithm based on it. The checkpointing algorithm preserves process autonomy by allowing them to take checkpoints asynchronously and uses communication-induced checkpoint coordination for the progressi ..."
Abstract
-
Cited by 42 (2 self)
- Add to MetaCart
In this paper, we propose a quasi-synchronous checkpointing algorithm and a low-overhead recovery algorithm based on it. The checkpointing algorithm preserves process autonomy by allowing them to take checkpoints asynchronously and uses communication-induced checkpoint coordination for the progression of the recovery line which helps bound rollback propagation during a recovery. Thus, it has the easeness and low overhead of asynchronous checkpointing and the recovery time advantages of synchronous checkpointing. There is no extra message overhead involved during checkpointing and the additional checkpointing overhead is nominal. The algorithm ensures the existence of a recovery line consistent with the latest checkpoint of any process all the time. The recovery algorithm exploits this feature to restore the system to a state consistent with the latest checkpoint of a failed process. The recovery algorithm has no domino effect and a failed process needs only to rollback to its latest ch...
A Communication-Induced Checkpointing Protocol that Ensures Rollback-Dependency Trackability
- Proc. IEEE Int. Symposium on Fault Tolerant Computing
, 1997
"... Considering an application in which processes take local checkpoints independently (called basic checkpoints), this paper develops a protocol that forces them to take some additional local checkpoints (called forced checkpoints) in order that the resulting checkpoint and communication pattern satisf ..."
Abstract
-
Cited by 35 (8 self)
- Add to MetaCart
Considering an application in which processes take local checkpoints independently (called basic checkpoints), this paper develops a protocol that forces them to take some additional local checkpoints (called forced checkpoints) in order that the resulting checkpoint and communication pattern satisfies the Rollback Dependency Trackability (RDT) property. This property states that all dependencies between local checkpoints are on-line trackable by using a transitive dependency vector. Compared to other protocols ensuring the RDT property, the proposed protocol is less conservative in the sense that it takes less additional local checkpoints. It attains this goal by a subtle tracking of causal dependencies on already taken checkpoints; this tracking is then used to prevent the occurrence of hidden dependencies. As indicated by simulation study, the proposed protocol compares favorably with other protocols; moreover, it additionally associates on-the-fly with each local checkpoint C the m...
RENEW: A tool for fast and efficient implementation of checkpoint protocols
- In Proceedings of the 28th IEEE Fault-Tolerant Computing Symposium (FTCS
, 1998
"... This paper describes the design, implementation, and evaluation of a run-time system for clusters of workstations that allows the rapid testing of checkpoint protocols with standard benchmarks. To achieve this goal, RENEW provides a flexible set of operations that facilitates the integration of a pr ..."
Abstract
-
Cited by 17 (7 self)
- Add to MetaCart
This paper describes the design, implementation, and evaluation of a run-time system for clusters of workstations that allows the rapid testing of checkpoint protocols with standard benchmarks. To achieve this goal, RENEW provides a flexible set of operations that facilitates the integration of a protocol in the system with reduced programming effort. To support a broad range of applications, RE-NEW exports, as its external interface, the industry endorsed Message Passing Interface (MPI). Three distinct classes of protocols were evaluated using the RENEW environment with SPEC and NAS benchmarks on a network of workstations connected by ATM. It was observed that the communication-induced protocol emulated the behavior of the coordinated protocol, with comparable performance. The message logging protocol degraded the performance. Even though the message logging protocol was slower due to log replay, all three protocols required a similar amount of time to restore the applicationto the same state as before failure occurred and recovery was initiated. 1
Using Message Semantics to Reduce Rollback in Optimistic Message Logging Recovery Schemes
, 1994
"... Recovery from failures can be achieved through asynchronous checkpointing and optimistic message logging. These schemes have low overheads during failure-free operations. Central to these protocols is the determination of a maximal consistent global state, which is recoverable. Message semantics is ..."
Abstract
-
Cited by 17 (0 self)
- Add to MetaCart
Recovery from failures can be achieved through asynchronous checkpointing and optimistic message logging. These schemes have low overheads during failure-free operations. Central to these protocols is the determination of a maximal consistent global state, which is recoverable. Message semantics is not exploited in most existing recovery protocols to determine the recoverable state. We propose to identify messages that are not influential in a computation through message semantics. These messages can be logically removed from the computation without changing its meaning or result. We show that considering these messages in the recoverable state computation gives rise to recoverable states that dominate the recoverable state defined under conventional model. We then develop an algorithm for identifying these messages. This technique can also be applied to ensure a more timely commitment for output in a distributed computation. Keywords: message semantics, commutativity, recovery, optim...
PREACHES - Portable Recovery and Checkpointing in Heterogeneous Systems
- Proceedings of IEEE Fault-Tolerant Computing Symposium
, 1998
"... Checkpointing in a homogeneous environment, where both checkpointing and recovery are performed on the same type of machine and operating system, has been studied extensively. As heterogeneous distributed systems become pervasive, it is desirable to extend the capability of checkpointing to non-homo ..."
Abstract
-
Cited by 10 (2 self)
- Add to MetaCart
Checkpointing in a homogeneous environment, where both checkpointing and recovery are performed on the same type of machine and operating system, has been studied extensively. As heterogeneous distributed systems become pervasive, it is desirable to extend the capability of checkpointing to non-homogeneous environments. This paper describes a prototype, PREACHES, that achieves portable checkpointing of single process applications in heterogeneous systems using checkpoint propagation. The checkpoint propagation technique generates machine-dependent checkpoints for each different architecture in the heterogeneous environment. When failure occurs, the failed process can be restarted on a specified machine with the checkpoint that is appropriate for the architecture. An implementation of PREACHES on a heterogeneous network of workstations has been successfully developed based on TCP/IP communication. PREACHES also provides automatic and fast recovery for single process programs. 1 Introdu...

