Libckpt: Transparent Checkpointing under Unix
, 1995
Cited by 251 (15 self)
Checkpointing is a simple technique for rollback recovery: the state of an executing program is periodically saved to a disk file from which it can be recovered after a failure. While recent research has developed a collection of powerful techniques for minimizing the overhead of writing checkpoint files, checkpointing remains unavailable to most application developers. In this paper we describe libckpt, a portable checkpointing tool for Unix that implements all applicable performance optimizations which are reported in the literature. While libckpt can be used in a mode which is almost totally transparent to the programmer, it also supports the incorporation of user directives into the creation of checkpoints. This user-directed checkpointing is an innovation which is unique to our work.
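The basic idea of periodic checkpointing with rollback recovery can be illustrated with a minimal application-level sketch. It is not libckpt's transparent mechanism or its API; the function names, the state dictionary, and the checkpoint interval are all assumptions made for illustration.

```python
import os
import pickle
import tempfile


def save_checkpoint(state, path):
    """Write `state` to `path` atomically (hypothetical helper)."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as tmp:
        pickle.dump(state, tmp)
        tmp.flush()
        os.fsync(tmp.fileno())
    # Rename only after the data is on disk, so a crash never leaves a
    # half-written checkpoint under the final name.
    os.replace(tmp_path, path)


def load_checkpoint(path, default):
    """Return the saved state, or `default` if no checkpoint exists."""
    try:
        with open(path, "rb") as f:
            return pickle.load(f)
    except FileNotFoundError:
        return default


def run(total_steps, path, interval=100):
    """Sum 0..total_steps-1, checkpointing every `interval` steps."""
    state = load_checkpoint(path, {"step": 0, "total": 0})
    while state["step"] < total_steps:
        state["total"] += state["step"]
        state["step"] += 1
        if state["step"] % interval == 0:
            save_checkpoint(state, path)
    return state["total"]
```

If the process is killed and `run` is called again with the same path, the loop resumes from the last saved step instead of from zero.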
Diskless Checkpointing
, 1997
Cited by 91 (3 self)
Diskless Checkpointing is a technique for checkpointing the state of a long-running computation on a distributed system without relying on stable storage. As such, it eliminates the performance bottleneck of traditional checkpointing on distributed systems. In this paper, we motivate diskless checkpointing and present the basic diskless checkpointing scheme along with several variants for improved performance. The performance of the basic scheme and its variants is evaluated on a high-performance network of workstations and compared to traditional disk-based checkpointing. We conclude that diskless checkpointing is a desirable alternative to disk-based checkpointing that can improve the performance of distributed applications in the face of failures.
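One way to checkpoint without stable storage is to keep an encoding of all processors' states in the memory of an extra processor. The sketch below assumes an XOR parity encoding and equal-length state buffers; the abstract does not spell out the encoding, so this is an illustration rather than the paper's scheme, and all names are hypothetical.

```python
from functools import reduce


def xor_bytes(a, b):
    """Bytewise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def parity_checkpoint(states):
    """Parity that a checkpoint processor would hold in memory (assumed scheme)."""
    return reduce(xor_bytes, states)


def recover(surviving_states, parity):
    """Rebuild the single lost state from the survivors and the parity."""
    return reduce(xor_bytes, surviving_states, parity)
```

With this encoding, the loss of any one processor can be tolerated, because XOR-ing the parity with the surviving states yields the missing state.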
Dome: Parallel programming in a heterogeneous multi-user environment
, 1995
Cited by 76 (4 self)
Writing parallel programs for distributed multi-user computing environments is a difficult task. The Distributed object migration environment (Dome) addresses three major issues of parallel computing in an architecture independent manner: ease of programming, dynamic load balancing, and fault tolerance. Dome programmers, with modest effort, can write parallel programs that are automatically distributed over a heterogeneous network, dynamically load balanced as the program runs, and able to survive compute node and network failures. This paper provides the motivation for and an overview of Dome, including a preliminary performance evaluation of dynamic load balancing for distributed vectors. Dome programs are shorter and easier to write than the equivalent programs written with message passing primitives. The performance overhead of Dome is characterized, and it is shown that this overhead can be recouped by dynamic load balancing in imbalanced systems. Finally, we show that a parallel ...
CLIP: A Checkpointing Tool for Message-Passing Parallel Programs
, 1997
Cited by 60 (9 self)
Checkpointing is a useful technique for rollback recovery of parallel applications. While extensive research has been performed on checkpointing in parallel environments, there are few checkpointers available to application users on commercial parallel computers. This paper presents one such checkpointer: CLIP. CLIP is a user-level library that provides semitransparent checkpointing for parallel programs on the Intel Paragon multicomputer. It is publicly available to Paragon users at no cost. Conceptually, checkpointing a multicomputer is quite straightforward. However, when creating an actual tool for checkpointing a complex machine like the Paragon, many more issues arise that require careful design decisions to be made. Sometimes ease-of-use must be sacrificed for efficiency and/or correctness. This paper details what these decisions are, and how they were made in CLIP. We also present performance data when checkpointing several long-running Paragon applications with CLIP. The bottom line is that a convenient, general-purpose checkpointing tool like CLIP can provide fault-tolerance on a massively parallel multicomputer like the Paragon with very good performance.
Application Level Fault Tolerance in Heterogeneous Networks of Workstations
- Journal of Parallel and Distributed Computing
, 1997
Cited by 48 (0 self)
We have explored methods for checkpointing and restarting processes within the Distributed object migration environment (Dome), a C++ library of data parallel objects that are automatically distributed over heterogeneous networks of workstations (NOWs). System level checkpointing methods, although transparent to the user, were rejected because they lack support for heterogeneity. We have implemented application level checkpointing which places the checkpoint and restart mechanisms within Dome's C++ objects. Application level checkpointing has been implemented with a library-based technique for the programmer and a more transparent preprocessor-based technique. Dome's implementation of checkpointing successfully checkpoints and restarts processes on different numbers of machines and different architectures. Results from executing Dome programs across a NOW with realistic failure rates have been experimentally determined and are compared with theoretical results. The overhead of checkpoi...
Experimental Assessment of Workstation Failures and Their Impact on Checkpointing Systems
- In 28th International Symposium on Fault-Tolerant Computing
, 1997
Cited by 42 (5 self)
In the past twenty years, there has been a wealth of theoretical research on minimizing the expected running time of a program in the presence of failures by employing checkpointing and rollback recovery. In the same time period, there has been little experimental research to corroborate these results. In this paper, we study the results of three separate projects that monitor failure in workstation networks. Our goals are twofold. The first is to see how these results correlate with the theoretical results, and the second is to assess their impact on strategies for checkpointing long-running computations on workstations and networks of workstations. A surprising result of our work is that although the base assumptions of the theoretical research do not hold, many of the results are still applicable.
An Overview of Checkpointing in Uniprocessor and Distributed Systems, Focusing on Implementation and Performance
, 1997
Cited by 38 (0 self)
Checkpointing is the act of saving the state of a running program so that it may be reconstructed later in time. It is an important basic functionality in computing systems that paves the way for powerful tools in many fields of computer science. This article provides a comprehensive overview of checkpointing in uniprocessor and parallel processing systems, including definitions, uses of checkpointing, and implementation details. Also included in this overview is a brief discussion of checkpoint consistency, which is a major concern in parallel processing systems, and a thorough discussion of issues related to the performance of checkpointing. It is intended that the reader of this article should receive a thorough grounding in checkpointing, with enough detail to implement an efficient checkpointer if so desired.
Adaptive incremental checkpointing for massively parallel systems
- In ICS ’04: Proceedings of the 18th annual international conference on Supercomputing
, 2004
Cited by 27 (5 self)
Given the scale of massively parallel systems, occurrence of faults is no longer an exception but a regular event. Periodic checkpointing is becoming increasingly important in these systems. However, huge memory footprints of parallel applications place severe limitations on scalability of normal checkpointing techniques. Incremental checkpointing is a well researched technique that addresses scalability concerns, but most of the implementations require paging support from hardware and the underlying operating system, which may not be always available. In this paper, we propose a software based adaptive incremental checkpoint technique which uses a secure hash function to uniquely identify changed blocks in memory. Our algorithm is the first self-optimizing algorithm that dynamically computes the optimal block boundaries, based on the history of changed blocks. This provides better opportunities for minimizing checkpoint file size. Since the hash is computed in software, we do not need any system support for this. We have implemented and tested this mechanism on the BlueGene/L system. Our results on several well-known benchmarks are encouraging, both in terms of reduction in average checkpoint file size and adaptivity towards application’s memory access patterns.
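Hash-based change detection can be sketched as follows. The sketch assumes SHA-256 as the secure hash and a fixed block size; the adaptive computation of block boundaries described in the abstract is not reproduced here, and all names are hypothetical.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed fixed block size; the paper adapts boundaries dynamically


def block_hashes(memory, block_size=BLOCK_SIZE):
    """Return one SHA-256 digest per fixed-size block of `memory`."""
    return [
        hashlib.sha256(memory[i:i + block_size]).digest()
        for i in range(0, len(memory), block_size)
    ]


def incremental_checkpoint(memory, previous_hashes, block_size=BLOCK_SIZE):
    """Return (changed_blocks, new_hashes) relative to the previous checkpoint.

    `changed_blocks` maps a block index to its contents; only these blocks
    would need to be written to the checkpoint file.
    """
    new_hashes = block_hashes(memory, block_size)
    changed = {}
    for index, digest in enumerate(new_hashes):
        if index >= len(previous_hashes) or previous_hashes[index] != digest:
            start = index * block_size
            changed[index] = memory[start:start + block_size]
    return changed, new_hashes
```

Because the comparison runs entirely in software, no page-protection support from the hardware or operating system is involved in this sketch.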
Compressed Differences: An Algorithm for Fast Incremental Checkpointing
, 1995
Cited by 26 (2 self)
The overhead of saving checkpoints to stable storage is the dominant performance cost in checkpointing systems. In this paper, we present a complete study of compressed differences, a new algorithm for fast incremental checkpointing. Compressed differences reduce the overhead of checkpointing by saving only the words that have changed in the current checkpointing interval while monitoring those changes using page protection. We describe two checkpointing algorithms based on compressed differences, called standard and online compressed differences. These algorithms are analyzed in detail to determine the conditions that are necessary for them to improve the performance of checkpointing. We then present results of implementing these algorithms in a uniprocessor checkpointing system. These results both corroborate the analysis and show that in this environment, standard compressed differences almost invariably improve the performance of both sequential and incremental checkpointing.
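A rough sketch of the idea of saving only what changed: XOR the previous and current memory images so that unchanged words become zeros, then compress the result. The use of XOR and zlib is an assumption for illustration, the function names are hypothetical, and the page-protection tracking mentioned in the abstract is omitted (the whole image is diffed instead).

```python
import zlib


def compressed_difference(previous, current):
    """Compress the bytewise XOR of two equal-length memory images."""
    if len(previous) != len(current):
        raise ValueError("this sketch assumes equal-length images")
    diff = bytes(a ^ b for a, b in zip(previous, current))
    # Unchanged regions are runs of zero bytes, which compress very well.
    return zlib.compress(diff)


def restore(previous, compressed_diff):
    """Rebuild the current image from the previous image and the stored diff."""
    diff = zlib.decompress(compressed_diff)
    return bytes(a ^ b for a, b in zip(previous, diff))
```

When few words change between checkpoints, the compressed difference is much smaller than the full image, which is the source of the overhead reduction.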
MARS - A Framework for Minimizing the Job Execution Time in a Metacomputing Environment
- Future Generation Comput. Syst
, 1995
Cited by 25 (1 self)
Utilizing a collection of workstations and supercomputers in a metacomputing environment does not only offer an enormous amount of computing power, but also raises new problems. The true potential of WAN-based distributed computing can only be exploited if the application-to-architecture mapping reflects the different processor speeds, network performances and the application's communication characteristics. In this paper, we present the Metacomputer Adaptive Runtime System MARS, a framework for minimizing the execution time of distributed applications on a WAN metacomputer. Work-load balancing and task migration are based on dynamic information on the processor load and network performance. Moreover, MARS uses accumulated statistical data on previous execution runs of the same application to derive an improved task-to-process mapping. Migration decisions are based on (1) the current system load, (2) the network load and (3) previously obtained application-specific characteri...

