Results 1 - 4 of 4
Starfish: Fault-Tolerant Dynamic MPI Programs on Clusters of Workstations (Extended Abstract)
Abstract - Cited by 84 (6 self)
This paper reports on the architecture and design of Starfish, an environment for executing dynamic (and static) MPI-2 programs on a cluster of workstations. Starfish is unique in being efficient, fault-tolerant, highly available, and dynamic as a system internally, and in supporting fault-tolerance and dynamicity for its application programs as well. Starfish achieves these goals by combining group communication technology with checkpoint/restart, and uses a novel architecture that is both flexible and portable and keeps group communication outside the critical data path, for maximum performance.
Preventing useless checkpoints in distributed computations
- In Proceedings of the IEEE Symposium on Reliable Distributed Systems, 1997
Abstract - Cited by 21 (0 self)
A useless checkpoint is a local checkpoint that cannot be part of a consistent global checkpoint. This paper addresses the following important problem. Given a set of processes that take (basic) local checkpoints in an independent and unknown way, the problem is to design a communication-induced checkpointing protocol that directs processes to take additional local (forced) checkpoints to ensure that no local checkpoint is useless. A general and efficient protocol answering this problem is proposed. It is shown that several existing protocols that solve the same problem are particular instances of it. The design of this general protocol is motivated by the use of communication-induced checkpointing protocols in “consistent global checkpoint”-based distributed applications. Detection of stable or unstable properties, rollback-recovery, and determination of distributed breakpoints are examples of such applications.
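The notion of a consistent global checkpoint used in this abstract can be illustrated with a small Python sketch. It is not the protocol proposed in the paper; it only checks the standard consistency condition, namely that no message is recorded as received without also being recorded as sent (no orphan message). The `Message` type, the position-based encoding of local histories, and the function names are assumptions made for illustration.

```python
from typing import Dict, List, NamedTuple


class Message(NamedTuple):
    """Hypothetical record of one message (not taken from the paper)."""
    sender: str
    send_pos: int   # index of the send event in the sender's local history
    receiver: str
    recv_pos: int   # index of the receive event in the receiver's local history


def orphan_messages(cut: Dict[str, int], messages: List[Message]) -> List[Message]:
    """Return messages whose receive is inside the cut but whose send is not.

    cut[p] = k means the local checkpoint of process p records exactly the
    events at positions 0 .. k-1 of p's local history.
    """
    return [
        m for m in messages
        if m.recv_pos < cut[m.receiver] and m.send_pos >= cut[m.sender]
    ]


def is_consistent(cut: Dict[str, int], messages: List[Message]) -> bool:
    """A global checkpoint is consistent when it has no orphan message."""
    return not orphan_messages(cut, messages)


# P sends at its position 2; Q receives at its position 1.
history = [Message("P", 2, "Q", 1)]

# Q's checkpoint records the receive, P's checkpoint does not record the send.
print(is_consistent({"P": 1, "Q": 2}, history))
# Both the send and the receive are recorded.
print(is_consistent({"P": 3, "Q": 2}, history))
```

In these terms, a local checkpoint is useless when every global checkpoint that contains it fails this check. Forced checkpoints are the mechanism the abstract describes for ruling that situation out.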
On the Effectiveness of Distributed Checkpoint Algorithms for Domino-free Recovery
- 1998
Abstract - Cited by 1 (0 self)
The paper focuses on fault-tolerant distributed computations where processes can take local checkpoints without coordinating with each other. Several distributed on-line algorithms are presented which avoid roll-back propagation by forcing additional local checkpoints in processes. The effectiveness of the algorithms is evaluated in several application examples, showing their limited capability of bounding the number of additional checkpoints.
1. Introduction In a distributed computation composed of several communicating processes, the capability of recovering from a fault can be achieved by making processes periodically checkpoint and save their computational state to a stable storage [17, 7]. Then, in the case of a fault (either due to a node crash or to a failure in the communication network), the distributed execution can be restored by restarting the execution of each process from one of its local checkpoints, to form a so-called global checkpoint [2]. The re-execution of a f...
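Roll-back propagation with uncoordinated checkpoints can be sketched in Python as follows. This is a generic illustration of the domino effect and of how one extra forced checkpoint can stop it, not one of the on-line algorithms evaluated in the paper. The position-based encoding of checkpoints and messages, the function name `recovery_line`, and the assumption that every process has an initial checkpoint at position 0 are all hypothetical.

```python
from bisect import bisect_right
from typing import Dict, List, Tuple

# (sender, send position, receiver, receive position); positions index
# events in each process's local history (an illustrative encoding).
Msg = Tuple[str, int, str, int]


def recovery_line(checkpoints: Dict[str, List[int]], messages: List[Msg]) -> Dict[str, int]:
    """Roll processes back until the chosen checkpoints have no orphan message.

    checkpoints[p] lists the positions of p's local checkpoints; a checkpoint
    at position k records events 0 .. k-1. Each list is assumed to contain 0,
    the initial state, so a rollback target always exists.
    """
    cut = {p: max(cs) for p, cs in checkpoints.items()}
    changed = True
    while changed:
        changed = False
        for sender, send_pos, receiver, recv_pos in messages:
            # Orphan: the receive is recorded but the send has been undone.
            if recv_pos < cut[receiver] and send_pos >= cut[sender]:
                cs = sorted(checkpoints[receiver])
                # Latest checkpoint taken before the receive event.
                cut[receiver] = cs[bisect_right(cs, recv_pos) - 1]
                changed = True
    return cut


# Q sends at position 0, P receives at position 1, P checkpoints at 2,
# P sends at position 2, Q receives at position 1, Q checkpoints at 2.
msgs = [("P", 2, "Q", 1), ("Q", 0, "P", 1)]

# Independent checkpoints only: rollback cascades to the initial states.
print(recovery_line({"P": [0, 2], "Q": [0, 2]}, msgs))

# Q also takes a forced checkpoint at position 1, just before its receive.
print(recovery_line({"P": [0, 2], "Q": [0, 1, 2]}, msgs))
```

In the first call both processes fall back to position 0. In the second, the forced checkpoint lets Q stop at position 1 while P keeps its checkpoint at position 2.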
S. Neogy
Abstract
The processes of the distributed system considered in this paper use loosely synchronized clocks. The paper describes a method of taking checkpoints by such processes in a truly distributed manner, that is, in the absence of a global checkpoint coordinator. The constituent processes take checkpoints according to their own clocks at predetermined checkpoint instants. A globally consistent set of such asynchronous checkpoints needs to be formed to avoid the domino effect. This is achieved by adding suitable information to the existing clock synchronization messages, which the processes then use to synchronize their checkpoints into a globally consistent checkpoint. Communication in this system is synchronous, so processes may be blocked for communication at checkpointing instants. The blocked processes save the state they were in just before being blocked. It is shown here that the set of such i-th checkpoints is consistent, and hence the rollback required by the system in case of failure is only up to the last saved state.

