The LAM/MPI checkpoint/restart framework: System-initiated checkpointing
in Proceedings, LACSI Symposium, Santa Fe, 2003
Cited by 67 (7 self)
As high-performance clusters continue to grow in size and popularity, issues of fault tolerance and reliability are becoming limiting factors on application scalability. To address these issues, we present the design and implementation of a system for providing coordinated checkpointing and rollback recovery for MPI-based parallel applications. Our approach integrates the Berkeley Lab BLCR kernel-level process checkpoint system with the LAM implementation of MPI through a defined checkpoint/restart interface. Checkpointing is transparent to the application, allowing the system to be used for cluster maintenance and scheduling reasons as well as for fault tolerance. Experimental results show negligible communication performance impact due to the incorporation of the checkpoint support capabilities into LAM/MPI.
Maximum and Minimum Consistent Global Checkpoints and Their Applications
in Proc. IEEE Symp. Reliable Distributed Syst., 1995
Cited by 14 (5 self)
This paper considers the problem of constructing the maximum and the minimum consistent global checkpoints that contain a target set of checkpoints, and identifies it as a generic issue in recovery-related applications. We formulate the problem as a reachability analysis problem on a directed rollback-dependency graph, and develop efficient algorithms to calculate the two consistent global checkpoints for both general nondeterministic executions and piecewise deterministic executions. We also demonstrate that the approach provides a generalization and unifying framework for many existing and potential applications including software error recovery, mobile computing recovery, parallel debugging and output commits.

1 Introduction

A checkpoint is a snapshot of the state of a process, saved on nonvolatile storage to survive process failures. It can be reloaded into volatile memory in case of a failure to reduce the amount of lost work. In a message-passing system consisting of N processes, a...
Efficient Message Logging for Uncoordinated Checkpointing Protocols
1996
Cited by 5 (1 self)
A message is in-transit with respect to a global state if its sending is recorded in this global state, while its receipt is not. Checkpointing algorithms have to log such in-transit messages in order to restore the state of channels when a computation has to be resumed from a consistent global state after a failure has occurred. Coordinated checkpointing algorithms log those in-transit messages exactly on stable storage. Because of their lack of synchronization, uncoordinated checkpointing algorithms conservatively log more messages. This paper presents an uncoordinated checkpointing protocol that logs all in-transit messages and the smallest possible number of non in-transit messages. As a consequence, the protocol saves stable storage space and enables quicker recoveries. An appropriate tracking of message causal dependencies constitutes the core of the protocol.
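The definition of an in-transit message given in this abstract can be illustrated with a small sketch. It is not the paper's protocol: the `Message` record, the event indices, and the representation of a global state as a per-process count of recorded events (`cut`) are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    # All fields are illustrative assumptions, not taken from the paper.
    msg_id: str
    sender: int
    send_index: int  # position of the send event in the sender's local history
    receiver: int
    recv_index: int  # position of the receive event in the receiver's local history


def in_transit(messages, cut):
    """Return ids of messages whose send is recorded in the global state
    but whose receipt is not.

    cut[p] is the number of local events of process p that the global
    state records, so event i of p is recorded iff i < cut[p].
    """
    return [
        m.msg_id
        for m in messages
        if m.send_index < cut[m.sender] and m.recv_index >= cut[m.receiver]
    ]


messages = [
    Message("m1", sender=0, send_index=0, receiver=1, recv_index=0),
    Message("m2", sender=1, send_index=1, receiver=0, recv_index=2),
    Message("m3", sender=0, send_index=3, receiver=1, recv_index=2),
]
cut = {0: 2, 1: 2}  # each process has recorded its first two events

print(in_transit(messages, cut))  # ['m2']
```

Here `m1` is sent and received inside the recorded state, `m3` is sent after it, and only `m2` crosses the boundary, so it is the one message a checkpointing algorithm would have to log to restore the channel state.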
Limited-size Logging for Fault-Tolerant Distributed Shared Memory with Independent Checkpointing
2000
Cited by 4 (2 self)
This paper presents a fault tolerance algorithm for a home-based lazy release consistency distributed shared memory (DSM) system based on volatile logging and independent checkpointing. The proposed approach targets large-scale distributed shared-memory computing on local-area clusters of computers as well as collaborative shared-memory applications on wide-area meta-clusters over the Internet. The challenge in building such systems lies in controlling the size of the logs and garbage-collecting the unnecessary checkpoints in the absence of global coordination. In this paper we define a set of rules for lazy log trimming (LLT) and checkpoint garbage collection (CGC) and prove that they do not affect the recoverability of the system. We have implemented our logging algorithm in a home-based DSM system and shown on three representative applications that our scheme effectively bounds the size of the logs and the number of checkpointed page versions kept in stable storage.
Systematic Analysis of Index-Based Checkpointing Algorithms using Simulation
In IX Brazilian Symposium on Fault-Tolerant Computing (SCTF), 2001
Cited by 4 (1 self)
Index-based checkpointing allows the use of simple and efficient algorithms for domino-effect-free construction of recovery lines. In this paper, we use a simulation toolkit to analyze the behavior of index-based algorithms. We present a performance study of the well-known algorithm proposed by Briatico, Ciuffoletti, and Simoncini and explore the impact of some optimizations of this algorithm presented in the literature. Our results indicate that an expensive and complex optimization may not reduce the number of forced checkpoints in comparison to a simpler optimization.
Keywords: distributed checkpointing, rollback recovery, logical clocks, simulation of distributed systems.
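As a rough illustration of what a forced checkpoint is in an index-based scheme, the sketch below follows the rule usually attributed to Briatico, Ciuffoletti, and Simoncini in the rollback-recovery literature: every message carries the sender's checkpoint index, and a receiver whose own index is lower takes a forced checkpoint before delivering the message. The abstract does not spell the rule out, so this is an assumption, and the `Process` class and its method names are hypothetical.

```python
class Process:
    """Toy process that keeps a checkpoint index (hypothetical model)."""

    def __init__(self, pid):
        self.pid = pid
        self.sn = 0  # current checkpoint index
        self.checkpoints = [(0, "initial")]
        self.forced = 0  # number of forced checkpoints taken so far

    def basic_checkpoint(self):
        # A locally scheduled checkpoint advances the index.
        self.sn += 1
        self.checkpoints.append((self.sn, "basic"))

    def send(self):
        # The current index is piggybacked on every outgoing message.
        return self.sn

    def receive(self, piggybacked_sn):
        # A message from a "later" index forces a checkpoint before
        # delivery, so checkpoints with equal indices stay consistent.
        if piggybacked_sn > self.sn:
            self.sn = piggybacked_sn
            self.checkpoints.append((self.sn, "forced"))
            self.forced += 1
        # ... the message would be delivered to the application here ...


p, q = Process(0), Process(1)
p.basic_checkpoint()  # p moves to index 1
q.receive(p.send())   # q is behind, so it takes a forced checkpoint
p.receive(q.send())   # indices are equal, so no checkpoint is forced

print(q.checkpoints)  # [(0, 'initial'), (1, 'forced')]
```

Counting `forced` across a simulated run is the kind of metric the abstract refers to when it compares optimizations by their number of forced checkpoints.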
User-level Checkpointing of POSIX Threads
1999
Cited by 3 (2 self)
Multiple threads running in a single, shared address space is a simple model for writing parallel programs for symmetric multiprocessor (SMP) machines and for overlapping I/O and computation in programs run on either SMP or single processor machines. Often a long-running program's user would like the program to save its state periodically in a checkpoint from which it can recover in case of a failure. Previous user-level checkpointing libraries to checkpoint Unix processes do not support multithreaded programs. This paper describes a user-level checkpointing library to checkpoint multithreaded programs that use the POSIX threads library provided by Solaris 2. The checkpointing library increases the amount of time required to make some thread library calls. The checkpointing library added between less than 1% and 10% to the execution times of tested benchmark programs. Saving the program's state to a checkpoint further increased the execution time, but the percentage of total execut...
Fault Recovery for Distributed Shared Memory Systems
Cited by 1 (0 self)
Distributed Shared Memory (DSM) offers programmers a shared memory abstraction on top of an underlying network of distributed memory machines. Advances in network technology and price/performance of workstations suggest that DSM will be the dominant paradigm for future high-performance computing. However, as long-running DSM applications scale to hundreds or even thousands of machines, the probability of a node or network link failing increases. Fault tolerance is typically achieved via "checkpointing" techniques that allow applications to "roll back" to a recent checkpoint rather than restarting. High-performance DSM systems using relaxed memory consistency are significantly more difficult to checkpoint than uniprocessor or message passing architectures. This paper describes previous approaches to checkpointing message passing parallel programs along with extensions to DSM systems.
This research was supported by the Nationa...
Efficient Checkpoint-based Failure Recovery Techniques in Mobile Computing Systems
Journal of Information Science and Engineering
Cited by 1 (0 self)
Conventional distributed and domino effect-free failure recovery techniques are inappropriate for mobile computing systems because each mobile host is forced to take a new checkpoint (based on coordinated checkpointing). Otherwise, multiple local checkpoints may need to be stored in stable storage (based on communication-induced checkpointing). Hence, this investigation presents a novel domino effect-free failure recovery technique that combines the merits of the above two checkpointing technologies for mobile computing systems. The algorithm is a three-phase protocol that ensures a consistent checkpoint. The first phase uses a coordinated checkpointing protocol among mobile support stations. In the second phase, a communication-induced checkpointing protocol is used between each mobile support station and its mobile hosts. In the last phase, each mobile support station sends a checkpoint request to each of its mobile hosts that has not received any message from the mobile support station during the second phase. Numerical results are provided which compare the proposed algorithm with both a quasi-synchronous failure recovery algorithm and a hybrid checkpoint recovery algorithm for mobile computing systems. According to the comparison, our scheme outperforms other schemes in terms of checkpoint overhead. Moreover, the proposed algorithm has several merits: domino effect-free, nonblocking, twice the checkpoint size, and scalability.
Parallel Checkpoint/Restart for MPI Applications
Cited by 1 (0 self)
As high-performance clusters continue to grow in size and popularity, issues of fault tolerance and reliability are becoming limiting factors on application scalability. To address these issues, we present the design and implementation of a system for providing coordinated checkpointing and rollback recovery for MPI-based parallel applications. Our approach integrates the Berkeley Lab BLCR kernel-level process checkpoint system with the LAM implementation of MPI through a defined checkpoint/restart interface. Checkpointing is transparent to the application, allowing the system to be used for cluster maintenance and scheduling reasons as well as for fault tolerance. Experimental results show negligible performance impact due to the incorporation of the checkpoint support capabilities into LAM/MPI.
A User-level Checkpointing Library for POSIX Threads Programs
1999
Several user-level checkpointing libraries that checkpoint Unix processes have been developed. However, they do not support multithreaded programs. This paper describes a user-level checkpointing library to checkpoint multithreaded programs that use the POSIX threads library provided by Solaris 2. Experiments with programs from the SPLASH-2 benchmark suite showed a 3% to 10% increase in execution time with checkpointing enabled, plus an additional overhead for saving the program's state. The checkpointing library described here is available at http://www.dcs.uky.edu/chkpt/.

1. Introduction

A multithreaded program's state can be divided into private state and shared state. A thread's private state includes its program counter, stack pointer, and registers. Its shared state includes everything common to all threads in the process, such as the address space and open file state. A multithreaded checkpointing library must save and recover the program's shared state and each thread's priva...

