A Survey of Rollback-Recovery Protocols in Message-Passing Systems
, 1996
Cited by 474 (24 self)
this paper, we use the terms event logging and message logging interchangeably
Flashback: A Lightweight Extension for Rollback and Deterministic Replay for Software Debugging
- In USENIX Annual Technical Conference, General Track
, 2004
Cited by 82 (6 self)
Unfortunately, finding software bugs is a very challenging task because many bugs are hard to reproduce. While debugging a program, it would be very useful to roll back a crashed program to a previous execution point and deterministically re-execute the "buggy" code region. However, most previous work on rollback and replay support was designed to survive hardware or operating system failures, and is therefore too heavyweight for the fine-grained rollback and replay needed for software debugging. This paper presents Flashback, a lightweight OS extension that provides fine-grained rollback and replay to help debug software. Flashback uses shadow processes to efficiently roll back the in-memory state of a process, and logs a process's interactions with the system to support deterministic replay. Both shadow processes and logging of system calls are implemented in a lightweight fashion specifically designed for the purpose of software debugging. We have implemented a prototype of Flashback in the Linux operating system. Our experimental results with micro-benchmarks and real applications show that Flashback adds little overhead and can quickly roll back a debugged program to a previous execution point and deterministically replay from that point.
Algorithm-Based Diskless Checkpointing for Fault Tolerant Matrix Operations
, 1995
Cited by 20 (6 self)
This paper is an exploration of diskless checkpointing for distributed scientific computations. With the widespread use of the "Network Of Workstation" (NOW) platform for distributed computing, long-running scientific computations need to tolerate the changing and often faulty nature of NOW environments. We present high-performance implementations of several algorithms for distributed scientific computing, including Cholesky factorization, LU factorization, QR factorization, and preconditioned conjugate gradient. These implementations are able to run on PVM networks of at least N processors, and can complete with low overhead as long as any N processors remain functional. We discuss the details of how the algorithms are tuned for fault-tolerance, and present the performance results on a PVM network of SUN workstations, and on the IBM SP2.
Fault Tolerant Matrix Operations for Networks of Workstations Using Diskless Checkpointing
, 1997
Cited by 19 (11 self)
Networks of workstations (NOWs) offer a cost-effective platform for high-performance, long-running parallel computations. However, these computations must be able to tolerate the changing and often faulty nature of NOW environments. We present high-performance implementations of several fault-tolerant algorithms for distributed scientific computing. The fault tolerance is based on diskless checkpointing, a paradigm that uses processor redundancy rather than stable storage as the fault-tolerant medium. These algorithms are able to run on clusters of workstations that change over time due to failure, load, or availability. As long as there are at least n processors in the cluster, and failures occur singly, the computation will complete in an efficient manner. We discuss the details of how the algorithms are tuned for fault tolerance and present the performance results on a PVM network of Sun workstations connected by a fast, switched Ethernet.
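As an illustration of the idea of keeping checkpoints in processor memory instead of on stable storage, here is a minimal sketch. It assumes a single extra parity processor that holds the XOR of all application checkpoints; the abstract does not state the encoding, so the parity scheme, the function names, and the equal-length byte-string states are all hypothetical choices made for this example. With one parity copy, one failure at a time can be recovered, which matches the "failures occur singly" condition.

```python
from functools import reduce


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def parity_checkpoint(states):
    """Checkpoint held in the memory of the (assumed) parity processor."""
    return reduce(xor_bytes, states)


def recover(survivors, parity):
    """Rebuild the single lost state from the parity and the surviving states."""
    return reduce(xor_bytes, survivors, parity)


# Hypothetical in-memory checkpoints of three application processors.
states = [b"proc0-st", b"proc1-st", b"proc2-st"]
parity = parity_checkpoint(states)

# Simulate the loss of processor 1 and rebuild its checkpoint.
lost = 1
survivors = [s for i, s in enumerate(states) if i != lost]
restored = recover(survivors, parity)
```

The sketch covers only the encoding and recovery arithmetic; it leaves out message passing and the algorithm-specific tuning the abstract refers to.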
Fault Tolerance in MPI Programs
- Special issue of the Journal High Performance Computing Applications (IJHPCA)
, 2002
Cited by 18 (0 self)
This paper examines the topic of writing fault-tolerant MPI applications. We discuss the meaning of fault tolerance in general and what the MPI Standard has to say about it. We survey several approaches to this problem, namely checkpointing, restructuring a class of standard MPI programs, modifying MPI semantics, and extending the MPI specification. We conclude that within certain constraints, MPI can provide a useful context for writing application programs that exhibit significant degrees of fault tolerance.
Coordinated checkpointing without direct coordination
- in Proceedings of the 3rd IEEE International Computer Performance and Dependability Symposium
, 1998
Cited by 17 (4 self)
Coordinated checkpointing is a well-known method to achieve fault tolerance in distributed systems. Long-running parallel applications and high-availability applications are two potential users of checkpointing, although with different requirements. Parallel applications need low failure-free overheads, and high-availability applications require fast and bounded recoveries. In this paper, we describe a new coordinated checkpoint protocol capable of satisfying both types of applications. The protocol uses time to avoid all types of direct coordination (e.g., message exchanges and message tagging), reducing the overheads to almost a minimum. To ensure that rapid recoveries can be attained, the protocol guarantees small checkpoint latencies. The protocol was implemented and tested on a cluster of workstations connected by a 155 Mbit/sec ATM. Experimental results show that the protocol overheads are very small.
RENEW: A tool for fast and efficient implementation of checkpoint protocols
- In Proceedings of the 28th IEEE Fault-Tolerant Computing Symposium (FTCS)
, 1998
Cited by 17 (7 self)
This paper describes the design, implementation, and evaluation of a run-time system for clusters of workstations that allows the rapid testing of checkpoint protocols with standard benchmarks. To achieve this goal, RENEW provides a flexible set of operations that facilitates the integration of a protocol in the system with reduced programming effort. To support a broad range of applications, RENEW exports, as its external interface, the industry-endorsed Message Passing Interface (MPI). Three distinct classes of protocols were evaluated using the RENEW environment with SPEC and NAS benchmarks on a network of workstations connected by ATM. It was observed that the communication-induced protocol emulated the behavior of the coordinated protocol, with comparable performance. The message logging protocol degraded the performance. Even though the message logging protocol was slower due to log replay, all three protocols required a similar amount of time to restore the application to the same state as before failure occurred and recovery was initiated.
Distributed Computing Systems and Checkpointing
- in Proc. 2nd Int. Symp. High Perf. Distr. Comp
, 1993
Cited by 13 (0 self)
This paper examines the performance of synchronous checkpointing in a distributed computing environment with and without load redistribution. Performance models are developed, and optimum checkpoint intervals are determined. The analysis extends earlier work by allowing for multiple nodes, state-dependent checkpoint intervals, and a performance metric which is coupled with failure-free performance and the speedup functions associated with implementation of parallel algorithms. Expressions for the optimum checkpoint intervals for synchronous checkpointing with and without load redistribution are derived, and the results are then used to determine when load redistribution is advantageous. 1. Introduction The dual emerging technologies associated with gigabit networks [1] and high-speed processors (supercomputers) suggest the possibility of tackling very large, computationally intensive problems by coupling these technologies into a distributed computing environment. The large application...
Performance Issues of a Distributed Frame Buffer on a Multicomputer
Cited by 6 (2 self)
A multiple-port, distributed frame buffer has been recently proposed to support parallel rendering on multicomputers. This paper describes an implementation of such a distributed frame buffer for the Intel Paragon routing network, and reports its performance results. We have conducted several experiments with the system we have developed. Our results indicate that placing a multiple-port, distributed frame buffer directly on the host internal routing network can provide high throughput to eliminate the bottleneck of merging a final image from multiple processors to a frame buffer. This architectural approach can also effectively support image composition for sort-last. The synchronization algorithm we have developed requires only one-way communication and minimizes receive overhead for message passing to the frame buffer.
Improving the Speed of A Distributed Checkpointing Algorithm
, 1993
Cited by 3 (0 self)
This paper shows how Koo and Toueg's distributed checkpointing algorithm can be modified so as to substantially reduce the average message volume. It attempts to avoid O(n^2) messages by using dependency knowledge to reduce the number of checkpoint request messages. Lemmas on consistency and termination are also included. Key Words: Checkpointing, distributed systems, fault-tolerance, performance. 1 Introduction The possibility of tackling very large, computationally intensive problems by coupling large communities of distributed processors through a high-speed network is fast becoming a reality [7]. The computing sites may consist of computational resources from several vendors, and communication between sites may require message transmission over long distances (thousands of miles) through several intermediate hops. Clearly, computing in this environment is much more precarious and we can expect higher resource failure rates than in a standard multiprocessor. Thus, a fundamental...

