The Performance of Consistent Checkpointing
- In Proceedings of the 11th Symposium on Reliable Distributed Systems, 1992
Cited by 181 (9 self)
Consistent checkpointing provides transparent fault tolerance for long-running distributed applications. In this paper we describe performance measurements of an implementation of consistent checkpointing. Our measurements show that consistent checkpointing performs remarkably well. We executed eight compute-intensive distributed applications on a network of 16 diskless Sun-3/60 workstations, comparing the performance without checkpointing to the performance with consistent checkpoints taken at 2-minute intervals. For six of the eight applications, the running time increased by less than 1% as a result of the checkpointing. The highest overhead measured for any of the applications was 5.8%. Incremental checkpointing and copy-on-write checkpointing were the most effective techniques in lowering the running time overhead. These techniques reduce the amount of data written to stable storage and allow the checkpoint to proceed concurrently with the execution of the processes. The overhead ...
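The idea behind incremental checkpointing mentioned in this abstract (write only the data that changed since the previous checkpoint) can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the names `PAGE_SIZE`, `split_pages`, and `incremental_checkpoint` are hypothetical, and comparing page hashes stands in for whatever dirty-page detection the measured implementation actually used.

```python
import hashlib

PAGE_SIZE = 4096  # hypothetical page size, chosen only for this sketch


def split_pages(state: bytes, page_size: int = PAGE_SIZE) -> list[bytes]:
    """Cut the process state into fixed-size pages."""
    return [state[i:i + page_size] for i in range(0, len(state), page_size)]


def incremental_checkpoint(
    state: bytes, prev_digests: dict[int, bytes]
) -> tuple[dict[int, bytes], dict[int, bytes]]:
    """Return (pages_to_write, new_digests).

    Only pages whose digest differs from the previous checkpoint are
    selected for writing to stable storage.
    """
    new_digests: dict[int, bytes] = {}
    to_write: dict[int, bytes] = {}
    for idx, page in enumerate(split_pages(state)):
        digest = hashlib.sha256(page).digest()
        new_digests[idx] = digest
        if prev_digests.get(idx) != digest:
            to_write[idx] = page  # page is new or modified
    return to_write, new_digests


# Usage: the first checkpoint writes every page, the second only the dirty one.
state = bytearray(4 * PAGE_SIZE)
first, digests = incremental_checkpoint(bytes(state), {})
state[2 * PAGE_SIZE + 5] = 1  # modify a single byte in page 2
second, digests = incremental_checkpoint(bytes(state), digests)
```

In this sketch the second checkpoint selects one page instead of four, which is the kind of reduction in data written to stable storage that the abstract attributes to the technique.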
The LAM/MPI checkpoint/restart framework: System-initiated checkpointing
- In Proceedings, LACSI Symposium, Santa Fe, 2003
Cited by 67 (7 self)
As high-performance clusters continue to grow in size and popularity, issues of fault tolerance and reliability are becoming limiting factors on application scalability. To address these issues, we present the design and implementation of a system for providing coordinated checkpointing and rollback recovery for MPI-based parallel applications. Our approach integrates the Berkeley Lab BLCR kernel-level process checkpoint system with the LAM implementation of MPI through a defined checkpoint/restart interface. Checkpointing is transparent to the application, allowing the system to be used for cluster maintenance and scheduling reasons as well as for fault tolerance. Experimental results show negligible communication performance impact due to the incorporation of the checkpoint support capabilities into LAM/MPI.
Adaptive Recovery for Mobile Environments
- Communications of the ACM, 1997
Cited by 54 (6 self)
Mobile computing allows ubiquitous and continuous access to computing resources while the users travel or work at a client’s site. The flexibility introduced by mobile computing brings new challenges to the area of fault tolerance. Failures that were rare with fixed hosts become common, and host disconnection makes fault detection and message coordination difficult. This paper describes a new checkpoint protocol that is well adapted to mobile environments. The protocol uses time to indirectly coordinate the creation of new global states, avoiding all message exchanges. The protocol uses two different types of checkpoints to adapt to the current network characteristics, and to trade off performance with recovery time.
Lazy Checkpoint Coordination for Bounding Rollback Propagation
- In Proc. IEEE Symp. Reliable Distributed Syst., 1993
Cited by 54 (7 self)
In this paper, we propose the technique of lazy checkpoint coordination which preserves process autonomy while employing communication-induced checkpoint coordination for bounding rollback propagation. The notion of laziness is introduced to control the coordination frequency and allow a flexible trade-off between the cost of checkpoint coordination and the average rollback distance. Worst-case overhead analysis provides a means for estimating the extra checkpoint overhead. Communication trace-driven simulation for several parallel programs is used to evaluate the benefits of the proposed scheme.
1 Introduction
Uncoordinated checkpointing [1–3] for parallel and distributed systems allows maximum process autonomy and independent design of recovery capability for each process. However, in a general nondeterministic execution, cascading rollback propagation may result in the domino effect [4] which can prevent progression of the recovery line. It has been shown that message reordering [...
On the Use and Implementation of Message Logging
- In 24th International Symposium on Fault-Tolerant Computing, 1994
Cited by 51 (2 self)
Message logging has long been advocated as offering better failure-free performance than coordinated checkpointing. On the contrary, we present a number of experiments showing that for compute-intensive applications executing in parallel on clusters of workstations, message logging has higher failure-free overhead than coordinated checkpointing. Message logging protocols, however, result in much shorter output latency than coordinated checkpointing. Therefore, message logging should be used for applications involving substantial interactions with the outside world, while coordinated checkpointing should be used otherwise. We also present an unorthodox message logging design that uses coordinated checkpointing with message logging, departing from the conventional approaches that use independent checkpointing. This combination of message logging and coordinated checkpointing offers several advantages, including improved failure-free performance, bounded recovery time, simplified garbage collection and reduced complexity. Meanwhile, the new protocols retain the advantages of the conventional message logging protocols with respect to output commit. Finally, we discuss three "lessons learned" from an implementation of various message logging protocols. First, during output commit, only the dependency information for the message in the log needs to be written to stable storage. It is not necessary to write the message data to stable storage, leading to faster output commit. Second, the use of copy-on-write in the implementation of message logging substantially reduces the logging overhead for communication-intensive programs. Finally, we provide quantitative evidence supporting previous qualitative claims about the superiority of sender-based message logging over receiver-based logging.
Transparent Optimistic Rollback Recovery
Cited by 44 (4 self)
Optimistic rollback recovery methods can efficiently and transparently provide fault tolerance for applications executing in a distributed system. With roll-
Completely Asynchronous Optimistic Recovery with Minimal Rollbacks
- 1995
Cited by 33 (5 self)
Consider the problem of transparently recovering an asynchronous distributed computation when one or more processes fail. Basing rollback recovery on optimistic message logging and replay is desirable for several reasons, including not requiring synchronization between processes during failure-free operation. However, previous optimistic rollback recovery protocols either have required synchronization during recovery, or have permitted a failure at one process to potentially trigger an exponential number of process rollbacks. In this paper, we present an optimistic rollback recovery protocol that provides completely asynchronous recovery, while also reducing the number of times a process must roll back in response to a failure to at most one. This protocol is based on comparing timestamp vectors across multiple levels of partial order time.
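As background for the timestamp-vector comparison this abstract refers to, the standard single-level vector-clock ordering test can be sketched as follows. This is only the textbook comparison, not the multi-level protocol of the paper; the function names `happened_before` and `concurrent` are hypothetical.

```python
def happened_before(a: list[int], b: list[int]) -> bool:
    """True if vector timestamp a precedes b in the partial order:
    every component of a is <= the matching component of b,
    and at least one is strictly smaller."""
    return all(x <= y for x, y in zip(a, b)) and any(
        x < y for x, y in zip(a, b)
    )


def concurrent(a: list[int], b: list[int]) -> bool:
    """True if neither timestamp precedes the other and they differ."""
    return a != b and not happened_before(a, b) and not happened_before(b, a)


# Usage with three processes: e1 precedes e2, while e2 and e3 are unordered.
e1 = [1, 0, 0]
e2 = [2, 1, 0]
e3 = [1, 0, 3]
ordered = happened_before(e1, e2)
unordered = concurrent(e2, e3)
```

A recovery protocol can use this kind of test to decide whether a process state depends on a state lost in a failure; how the paper extends it across multiple levels is not shown here.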
Checkpointing for peta-scale systems: A look into the future of practical rollback-recovery
- IEEE Transactions on Dependable and Secure Computing, 2004
Cited by 28 (0 self)
Over the past two decades, rollback-recovery via checkpoint-restart has been used with reasonable success for long-running applications, such as scientific workloads that take from a few hours to a few months to complete. Currently, several commercial systems and publicly available libraries exist to support various flavors of checkpointing. Programmers typically use these systems if they are satisfactory or otherwise embed checkpointing support themselves within the application. In this paper, we project the performance and functionality of checkpointing algorithms and systems as we know them today into the future. We start by surveying the current technology roadmap and particularly how Peta-Flop capable systems may be plausibly constructed in the next few years. We consider how rollback-recovery as practiced today will fare when systems may have to be constructed out of thousands of nodes. Our projections predict that, unlike current practice, the effect of rollback-recovery may play a more prominent role in how systems may be configured to reach the desired performance level. System planners may have to devote additional resources to enable rollback-recovery and the current practice of using “cheap commodity” systems to form large-scale clusters may face serious obstacles. We suggest new avenues for research to react to these trends.
An index-based checkpointing algorithm for autonomous distributed systems
- IEEE Transactions on Parallel and Distributed Systems, 1997
Cited by 25 (5 self)
This paper presents an index-based checkpointing algorithm for distributed systems with the aim of reducing the total number of checkpoints while ensuring that each checkpoint belongs to at least one consistent global checkpoint (or recovery line). The algorithm is based on an equivalence relation defined between pairs of successive checkpoints of a process which allows, in some cases, the recovery line of the computation to advance without forcing checkpoints in other processes. The algorithm is well suited for autonomous and heterogeneous environments where each process does not know any private information about other processes and private information of the same type of distinct processes is not related (e.g., clock granularity, local checkpointing strategy, etc.). We also present a simulation study which compares the checkpointing-recovery overhead of this algorithm to that of previous solutions.
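For readers unfamiliar with index-based checkpointing, the baseline rule that such algorithms build on can be sketched as below: each process piggybacks its checkpoint index on outgoing messages, and a receiver takes a forced checkpoint before delivering a message that carries a larger index. This is an assumed illustration of the classic baseline only, not the equivalence-relation algorithm of this paper (which aims to avoid some of these forced checkpoints); the class and method names are hypothetical.

```python
class Process:
    """A process following the baseline index-based checkpointing rule."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.index = 0  # index of the most recent local checkpoint
        self.checkpoints = [("initial", 0)]

    def basic_checkpoint(self) -> None:
        """Checkpoint taken on the process's own schedule."""
        self.index += 1
        self.checkpoints.append(("basic", self.index))

    def send(self) -> int:
        """Return the index to piggyback on an outgoing message."""
        return self.index

    def receive(self, piggybacked_index: int) -> None:
        """Take a forced checkpoint before delivery if the sender is ahead."""
        if piggybacked_index > self.index:
            self.index = piggybacked_index
            self.checkpoints.append(("forced", self.index))
        # the message would be delivered to the application here


# Usage: p checkpoints, then its message forces a checkpoint at q.
p, q = Process("p"), Process("q")
p.basic_checkpoint()
q.receive(p.send())
```

Under this rule, checkpoints that carry the same index form a consistent global checkpoint, at the cost of the forced checkpoints that the abstract's algorithm tries to reduce.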

