A Survey of Rollback-Recovery Protocols in Message-Passing Systems, 1996
Cited by 474 (24 self)
this paper, we use the terms event logging and message logging interchangeably
Libckpt: Transparent Checkpointing under Unix, 1995
Cited by 251 (15 self)
Checkpointing is a simple technique for rollback recovery: the state of an executing program is periodically saved to a disk file from which it can be recovered after a failure. While recent research has developed a collection of powerful techniques for minimizing the overhead of writing checkpoint files, checkpointing remains unavailable to most application developers. In this paper we describe libckpt, a portable checkpointing tool for Unix that implements all applicable performance optimizations which are reported in the literature. While libckpt can be used in a mode which is almost totally transparent to the programmer, it also supports the incorporation of user directives into the creation of checkpoints. This user-directed checkpointing is an innovation which is unique to our work.
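The save-and-recover cycle described in this abstract can be illustrated with a small application-level sketch. This is not libckpt's interface (libckpt is a Unix library and its API is not shown in this listing); the names `save_checkpoint`, `load_checkpoint`, and `run`, the pickle file format, and the simulated failure are all assumptions made for illustration.

```python
import os
import pickle
import tempfile


def save_checkpoint(state, path):
    """Write state atomically: dump to a temp file, then rename over path."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    # os.replace is atomic on the same filesystem, so a crash never
    # leaves a half-written checkpoint under the final name.
    os.replace(tmp_path, path)


def load_checkpoint(path, default):
    """Return the saved state, or default when no checkpoint exists yet."""
    if not os.path.exists(path):
        return default
    with open(path, "rb") as f:
        return pickle.load(f)


def run(path, total_steps, checkpoint_every, fail_at=None):
    """Sum 0..total_steps-1, checkpointing periodically (hypothetical workload)."""
    state = load_checkpoint(path, {"step": 0, "acc": 0})
    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated failure")
        state["acc"] += state["step"]
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(state, path)
    return state["acc"]
```

After a failure, calling `run` again with the same path resumes from the last saved step instead of from step 0; work done since that checkpoint is repeated.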
The Performance of Consistent Checkpointing
In Proceedings of the 11th Symposium on Reliable Distributed Systems, 1992
Cited by 181 (9 self)
Consistent checkpointing provides transparent fault tolerance for long-running distributed applications. In this paper we describe performance measurements of an implementation of consistent checkpointing. Our measurements show that consistent checkpointing performs remarkably well. We executed eight compute-intensive distributed applications on a network of 16 diskless Sun-3/60 workstations, comparing the performance without checkpointing to the performance with consistent checkpoints taken at 2-minute intervals. For six of the eight applications, the running time increased by less than 1% as a result of the checkpointing. The highest overhead measured for any of the applications was 5.8%. Incremental checkpointing and copy-on-write checkpointing were the most effective techniques in lowering the running time overhead. These techniques reduce the amount of data written to stable storage and allow the checkpoint to proceed concurrently with the execution of the processes. The overhead ...
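To make the incremental-checkpointing idea in this abstract concrete, here is a minimal sketch that writes only the pages changed since the previous checkpoint. It is an assumption-laden illustration, not the paper's implementation: it detects changes by comparing SHA-256 digests of fixed-size pages, whereas real systems commonly track modified pages through memory protection. `PAGE_SIZE`, `incremental_checkpoint`, and `apply_delta` are hypothetical names, and the buffer is assumed to keep a fixed length.

```python
import hashlib

PAGE_SIZE = 4096  # assumed page size for this sketch


def incremental_checkpoint(buf, prev_digests, page_size=PAGE_SIZE):
    """Return (delta, digests); delta maps page index -> bytes of changed pages."""
    delta = {}
    digests = []
    for idx, start in enumerate(range(0, len(buf), page_size)):
        page = bytes(buf[start:start + page_size])
        digest = hashlib.sha256(page).digest()
        digests.append(digest)
        # A page is saved if it is new or differs from the last checkpoint.
        if idx >= len(prev_digests) or prev_digests[idx] != digest:
            delta[idx] = page
    return delta, digests


def apply_delta(base, delta, page_size=PAGE_SIZE):
    """Rebuild a state by overlaying changed pages on an earlier full state."""
    out = bytearray(base)
    for idx, page in delta.items():
        start = idx * page_size
        out[start:start + len(page)] = page
    return bytes(out)
```

The first call, with an empty digest list, yields a full checkpoint; later calls yield only the pages that changed, which is where the reduction in data written comes from.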
The Interaction of Architecture and Operating System Design
In Proceedings of the Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, 1991
Cited by 148 (15 self)
Today's high-performance RISC microprocessors have been highly tuned for integer and floating point application performance. These architectures have paid less attention to operating system requirements. At the same time, new operating system designs often have overlooked modern architectural trends which may unavoidably change the relative cost of certain primitive operations. The result is that operating system performance is well below application code performance on contemporary RISCs. This paper examines recent directions in computer architecture and operating systems, and the implications of changes in each domain for the other. The requirements of three components of operating system design are discussed in detail: interprocess communication, virtual memory, and thread management. For each component, we relate operating system functional and performance needs to the mechanisms available on commercial RISC architectures such as the MIPS R2000 and R3000, Sun SPARC, IBM RS6000, Mot...
Diskless Checkpointing, 1997
Cited by 91 (3 self)
Diskless Checkpointing is a technique for checkpointing the state of a long-running computation on a distributed system without relying on stable storage. As such, it eliminates the performance bottleneck of traditional checkpointing on distributed systems. In this paper, we motivate diskless checkpointing and present the basic diskless checkpointing scheme along with several variants for improved performance. The performance of the basic scheme and its variants is evaluated on a high-performance network of workstations and compared to traditional disk-based checkpointing. We conclude that diskless checkpointing is a desirable alternative to disk-based checkpointing that can improve the performance of distributed applications in the face of failures.
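One way to checkpoint without stable storage is to keep each process's checkpoint in memory and store a redundancy block on another node. The abstract does not spell out the encoding, so the XOR parity used below is an assumption chosen for illustration; `parity` and `recover` are hypothetical names, and all checkpoints are assumed to have equal length.

```python
from functools import reduce


def xor_bytes(a, b):
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def parity(checkpoints):
    """Parity block held by an extra node: XOR of all in-memory checkpoints."""
    return reduce(xor_bytes, checkpoints)


def recover(surviving, parity_block):
    """Rebuild the single lost checkpoint from the survivors and the parity."""
    return reduce(xor_bytes, surviving, parity_block)
```

With one parity block, this sketch tolerates the loss of exactly one checkpoint at a time; losing two leaves the XOR equation underdetermined.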
Dome: Parallel programming in a heterogeneous multi-user environment, 1995
Cited by 76 (4 self)
Writing parallel programs for distributed multi-user computing environments is a difficult task. The Distributed object migration environment (Dome) addresses three major issues of parallel computing in an architecture independent manner: ease of programming, dynamic load balancing, and fault tolerance. Dome programmers, with modest effort, can write parallel programs that are automatically distributed over a heterogeneous network, dynamically load balanced as the program runs, and able to survive compute node and network failures. This paper provides the motivation for and an overview of Dome, including a preliminary performance evaluation of dynamic load balancing for distributed vectors. Dome programs are shorter and easier to write than the equivalent programs written with message passing primitives. The performance overhead of Dome is characterized, and it is shown that this overhead can be recouped by dynamic load balancing in imbalanced systems. Finally, we show that a parallel ...
Architectural Support for Single Address Space Operating Systems, 1992
Cited by 63 (5 self)
Recent microprocessor announcements show a trend toward wide-address computers: architectures that support 64 bits of virtual address space. Such architectures facilitate fundamentally new operating system organizations that promote efficient data sharing and cooperation, both between complex applications and between parts of the operating system itself. One such organization is the single address space operating system, in which all processes run within a single global virtual address space; protection is provided not through conventional address space boundaries, but through protection domains that dictate which pages of the global address space a process can reference. This paper focuses on the architectural implications of single address space operating systems, specifically the interaction between the memory system architecture and the operating system's use of addressing and protection. Our purpose is to explore certain architectural opportunities created by single address space ...
Optimistic Message Logging for Independent Checkpointing in Message-Passing Systems
In Proceedings of the 11th Symposium on Reliable Distributed Systems, 1992
Cited by 49 (8 self)
Message-passing systems with communication protocol transparent to the applications typically require message logging to ensure consistency between checkpoints. This paper describes a periodic independent checkpointing scheme with optimistic logging to reduce performance degradation during normal execution while keeping the recovery cost acceptable. Both time and space overhead for message logging can be reduced by detecting messages that need not be logged. A checkpoint space reclamation algorithm is presented to reclaim all checkpoints which are not useful for any possible future recovery. Communication trace-driven simulation for several hypercube programs is used to evaluate the techniques.
1 Introduction
Numerous approaches to checkpointing and rollback recovery have been proposed in the literature for parallel systems. In terms of checkpointing techniques, they can be classified into two basic categories. Coordinated checkpointing schemes synchronize computation with checkpoint...
Transparent Optimistic Rollback Recovery
Cited by 44 (4 self)
Optimistic rollback recovery methods can efficiently and transparently provide fault tolerance for applications executing in a distributed system. With roll-
Completely Asynchronous Optimistic Recovery with Minimal Rollbacks, 1995
Cited by 33 (5 self)
Consider the problem of transparently recovering an asynchronous distributed computation when one or more processes fail. Basing rollback recovery on optimistic message logging and replay is desirable for several reasons, including not requiring synchronization between processes during failure-free operation. However, previous optimistic rollback recovery protocols either have required synchronization during recovery, or have permitted a failure at one process to potentially trigger an exponential number of process rollbacks. In this paper, we present an optimistic rollback recovery protocol that provides completely asynchronous recovery, while also reducing the number of times a process must roll back in response to a failure to at most one. This protocol is based on comparing timestamp vectors across multiple levels of partial order time.
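The comparison of timestamp vectors mentioned in this abstract builds on the standard vector-clock partial order, which the sketch below illustrates. It covers only a single level of vector time; the paper's multi-level scheme is not reproduced here, and `compare` and `merge` are hypothetical helper names.

```python
def compare(a, b):
    """Order two equal-length timestamp vectors under the partial order."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    le = all(x <= y for x, y in zip(a, b))
    ge = all(x >= y for x, y in zip(a, b))
    if le and ge:
        return "equal"
    if le:
        return "before"      # a happened before b
    if ge:
        return "after"       # b happened before a
    return "concurrent"      # neither dominates: causally unrelated


def merge(local, received, pid):
    """Receive rule: component-wise maximum, then advance own entry."""
    merged = [max(x, y) for x, y in zip(local, received)]
    merged[pid] += 1
    return merged
```

Two vectors that compare as "concurrent" describe states with no causal dependency between them, which is the property a recovery protocol would check before deciding whether one process's rollback affects another.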

