Results 1 - 10
of
149
Libckpt: Transparent Checkpointing under Unix
, 1995
"... Checkpointing is a simple technique for rollback recovery: the state of an executing program is periodically saved to a disk file from whichitcan be recovered after a failure. While recent research has developed a collection of powerful techniques for minimizing the overhead of writing checkpoint f ..."
Abstract
-
Cited by 251 (15 self)
- Add to MetaCart
Checkpointing is a simple technique for rollback recovery: the state of an executing program is periodically saved to a disk file from whichitcan be recovered after a failure. While recent research has developed a collection of powerful techniques for minimizing the overhead of writing checkpoint files, checkpointing remains unavailable to most application developers. In this paper we describe libckpt, a portable checkpointing tool for Unix that implements all applicable performance optimizations which are reported in the literature. While libckpt can be used in a mode whichis almost totally transparent to the programmer, it also supports the incorporation of user directives into the creation of checkpoints. This user-directed checkpointing is an innovation which is unique to our work.
Understanding the Limitations of Causally and Totally Ordered Communication
- In Proceedings of the 14th ACM Symposium on Operating Systems Principles
, 1993
"... Causally and totally ordered communication support (CATOCS) has been proposed as important to provide as part of the basic building blocks for constructing reliable distributed systems. In this paper, we identify four major limitations to CATOCS, investigate the applicability of CATOCS to several cl ..."
Abstract
-
Cited by 139 (1 self)
- Add to MetaCart
Causally and totally ordered communication support (CATOCS) has been proposed as important to provide as part of the basic building blocks for constructing reliable distributed systems. In this paper, we identify four major limitations to CATOCS, investigate the applicability of CATOCS to several classes of distributed applications in light of these limitations, and the potential impact of these facilities on communication scalability and robustness. From this investigation, we find limited merit and several potential problems in using CATOCS. The fundamental difficulty with the CATOCS is that it attempts to solve state problems at the communication level in violation of the
Microkernels meet recursive virtual machines
, 1996
"... This paper describes a novel approach to providing modular and extensible operating system functionality and encapsulated environments based on a synthesis of microkernel and virtual machine concepts. We have developed a software-based virtualizable architecture called Fluke that allows recursive vi ..."
Abstract
-
Cited by 113 (25 self)
- Add to MetaCart
This paper describes a novel approach to providing modular and extensible operating system functionality and encapsulated environments based on a synthesis of microkernel and virtual machine concepts. We have developed a software-based virtualizable architecture called Fluke that allows recursive virtual machines (virtual machines running on other virtual machines) to be implemented efficiently by a microkernel running on generic hardware. A complete virtual machine interface is provided at each level; efficiency derives from needing to implement only new functionality at each level. This infrastructure allows common OS functionality, such as process management, demand paging, fault tolerance, and debugging support, to be provided by cleanly modularized, independent, stackable virtual machine monitors, implemented as user processes. It can also provide uncommon or unique OS features, including the above features specialized for particular applications ’ needs, virtual machines transparently distributed cross-node, or security monitors that allow arbitrary untrusted binaries to be executed safely. Our prototype implementation of this model indicates that it is practical to modularize operating systems this way. Some types of virtual machine layers impose almost no overhead at all, while others impose some overhead (typically 0–35%), but only on certain classes of applications.
Diskless Checkpointing
, 1997
"... Diskless Checkpointing is a technique for checkpointing the state of a long-running computation on a distributed system without relying on stable storage. As such, it eliminates the performance bottleneck of traditional checkpointing on distributed systems. In this paper, we motivate diskless checkp ..."
Abstract
-
Cited by 91 (3 self)
- Add to MetaCart
Diskless Checkpointing is a technique for checkpointing the state of a long-running computation on a distributed system without relying on stable storage. As such, it eliminates the performance bottleneck of traditional checkpointing on distributed systems. In this paper, we motivate diskless checkpointing and present the basic diskless checkpointing scheme along with several variants for improved performance. The performance of the basic scheme and its variants is evaluated on a high-performance network of workstations and compared to traditional disk-based checkpointing. We conclude that diskless checkpointing is a desirable alternative to disk-based checkpointing that can improve the performance of distributed applications in the face of failures.
Dome: Parallel programming in a heterogeneous multi-user environment
, 1995
"... Writing parallel programs for distributed multi-user computing environments is a difficult task. The Distributed object migration environment (Dome) addresses three major issues of parallel computing in an architecture independent manner: ease of programming, dynamic load balancing, and fault tolera ..."
Abstract
-
Cited by 76 (4 self)
- Add to MetaCart
Writing parallel programs for distributed multi-user computing environments is a difficult task. The Distributed object migration environment (Dome) addresses three major issues of parallel computing in an architecture independent manner: ease of programming, dynamic load balancing, and fault tolerance. Dome programmers, with modest effort, can write parallel programs that are automatically distributed over a heterogeneous network, dynamically load balanced as the program runs, and able to survive compute node and network failures. This paper provides the motivation for and an overview of Dome, including a preliminary performance evaluation of dynamic load balancing for distributed vectors. Dome programs are shorter and easier to write than the equivalent programs written with message passing primitives. The performance overhead of Dome is characterized, and it is shown that this overhead can be recouped by dynamic load balancing in imbalanced systems. Finally, we show that a parallel ...
Optimistic Deltas for WWW Latency Reduction
- In Proceedings of the 1997 Usenix Technical Conference
, 1997
"... When a machine is connected to the Internet via a slow network, such as a 28.8 Kbps modem, the cumulative latency to communicate over the Internet to World Wide Web servers and then transfer documents over the slow network can be significant. We have built a system that optimistically transfers data ..."
Abstract
-
Cited by 74 (15 self)
- Add to MetaCart
When a machine is connected to the Internet via a slow network, such as a 28.8 Kbps modem, the cumulative latency to communicate over the Internet to World Wide Web servers and then transfer documents over the slow network can be significant. We have built a system that optimistically transfers data that may be out of date, then sends either a subsequent confirmation that the data is current or a delta to change the older version to the current one. In addition, if both sides of the slow link already store the same older version, just the delta need be transferred to update it. Our mechanism is optimistic because it assumes that much of the time there will be sufficient idle time to transfer most or all of the older version before the newer version is available, and because it assumes that the changes between the two versions will be small relative to the actual document. Timings of retrievals of random URLs in the Internet support the former assumption, while experiments using a versi...
The LAM/MPI checkpoint/restart framework: System-initiated checkpointing
- in Proceedings, LACSI Symposium, Sante Fe
, 2003
"... As high-performance clusters continue to grow in size and popularity, issues of fault tolerance and reliability are becoming limiting factors on application scalability. To address these issues, we present the design and implementation of a system for providing coordinated checkpointing and rollback ..."
Abstract
-
Cited by 67 (7 self)
- Add to MetaCart
As high-performance clusters continue to grow in size and popularity, issues of fault tolerance and reliability are becoming limiting factors on application scalability. To address these issues, we present the design and implementation of a system for providing coordinated checkpointing and rollback recovery for MPI-based parallel applications. Our approach integrates the Berkeley Lab BLCR kernellevel process checkpoint system with the LAM implementation of MPI through a defined checkpoint/restart interface. Checkpointing is transparent to the application, allowing the system to be used for cluster maintenance and scheduling reasons as well as for fault tolerance. Experimental results show negligible communication performance impact due to the incorporation of the checkpoint support capabilities into LAM/MPI. 1
CLIP: A Checkpointing Tool for Message-Passing Parallel Programs
, 1997
"... Checkpointing is a useful technique for rollback recovery of parallel applications. While extensive research has been performed on checkpointing in parallel environments, there are few checkpointers available to application users on commercial parallel computers. This paper presents one such checkpo ..."
Abstract
-
Cited by 60 (9 self)
- Add to MetaCart
Checkpointing is a useful technique for rollback recovery of parallel applications. While extensive research has been performed on checkpointing in parallel environments, there are few checkpointers available to application users on commercial parallel computers. This paper presents one such checkpointer: CLIP. CLIP is a user-level library that provides semitransparent checkpointing for parallel programs on the Intel Paragon multicomputer. It is publicly available to Paragon users at no cost. Conceptually, checkpointing a multicomputer is quite straightforward. However, when creating an actual tool for checkpointing a complex machine like the Paragon, many more issues arise that require careful design decisions to be made. Sometimes ease-of-use must be sacrificed for efficiency and/or correctness. This paper details what these decisions are, and how they were made in CLIP. We also present performance data when checkpointing several long-running Paragon applications with CLIP. The bottom line is that a convenient, general-purpose checkpointing tool like CLIP can provide fault-tolerance on a massively parallel multicomputer like the Paragon with very good performance.
Portable Checkpointing for Heterogeneous Architectures
- In Symposium on Fault-Tolerant Computing
, 1997
"... Current approaches for checkpointing assume system homogeneity, where checkpointing and recovery are both performed on the same processor architecture and operating system configuration. Sometimes it is desirable or necessary to recover a failed computation on a different processor architecture. For ..."
Abstract
-
Cited by 56 (4 self)
- Add to MetaCart
Current approaches for checkpointing assume system homogeneity, where checkpointing and recovery are both performed on the same processor architecture and operating system configuration. Sometimes it is desirable or necessary to recover a failed computation on a different processor architecture. For such situations checkpointing and recovery must be portable. In this paper, we argue that source-to-source compilation is an appropriate concept for this purpose. We describe the compilation techniques that we developed for the design of the c2ftc prototype. The c2ftc compiler enables machine-independent checkpoints by automatic generation of checkpointing and recovery code. Sequential C programs are compiled into fault tolerant C programs, whose checkpoints can be migrated across heterogeneous networks, and restarted on binary incompatible architectures. Experimental results on several systems provide evidence that the performance penalty of portable checkpointing is negligible for reali...
Lazy Checkpoint Coordination for Bounding Rollback Propagation
- in Proc. IEEE Symp. Reliable Distributed Syst
, 1993
"... In this paper, we propose the technique of lazy checkpoint coordination which preserves process autonomy while employing communication-induced checkpoint coordination for bounding rollback propagation. The notion of laziness is introduced to control the coordination frequency and allow a flexible tr ..."
Abstract
-
Cited by 54 (7 self)
- Add to MetaCart
In this paper, we propose the technique of lazy checkpoint coordination which preserves process autonomy while employing communication-induced checkpoint coordination for bounding rollback propagation. The notion of laziness is introduced to control the coordination frequency and allow a flexible trade-off between the cost of checkpoint coordination and the average rollback distance. Worst-case overhead analysis provides a means for estimating the extra checkpoint overhead. Communication trace-driven simulation for several parallel programs is used to evaluate the benefits of the proposed scheme. 1 Introduction Uncoordinated checkpointing [1--3] for parallel and distributed systems allows maximum process autonomy and independent design of recovery capability for each process. However, in a general nondeterministic execution, cascading rollback propagation may result in the domino effect [4] which can prevent progression of the recovery line. It has been shown that message reordering [...

