Results 1 - 10 of 61
A Survey of Rollback-Recovery Protocols in Message-Passing Systems
, 1996
"... this paper, we use the terms event logging and message logging interchangeably ..."
Abstract
-
Cited by 716 (22 self)
- Add to MetaCart
... this paper, we use the terms event logging and message logging interchangeably ...
Adaptive Recovery for Mobile Environments
- Communications of the ACM
, 1997
"... Mobile computing allows ubiquitous and continuous access to computing resources while the users travel or work at a client’s site. The flexibility introduced by mobile computing brings new challenges to the area of fault tolerance. Failures that were rare with fixed hosts become common, and host dis ..."
Abstract
-
Cited by 66 (6 self)
- Add to MetaCart
(Show Context)
Mobile computing allows ubiquitous and continuous access to computing resources while the users travel or work at a client’s site. The flexibility introduced by mobile computing brings new challenges to the area of fault tolerance. Failures that were rare with fixed hosts become common, and host disconnection makes fault detection and message coordination difficult. This paper describes a new checkpoint protocol that is well adapted to mobile environments. The protocol uses time to indirectly coordinate the creation of new global states, avoiding all message exchanges. The protocol uses two different types of checkpoints to adapt to the current network characteristics, and to trade off performance with recovery time.
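One way to read the two-checkpoint trade-off above, sketched here as a guess rather than the paper's actual scheme: a cheap checkpoint kept close to the mobile host when the wireless link is weak, and a more expensive checkpoint to stable storage when connectivity is good. The classification and all names are illustrative assumptions.

/* A minimal sketch: pick the checkpoint type from the current link state.
 * This is an assumed reading of the trade-off described in the abstract,
 * not the paper's exact scheme. */
#include <stdio.h>

typedef enum { LINK_DISCONNECTED, LINK_WEAK, LINK_GOOD } LinkState;

static void take_checkpoint(LinkState link) {
    switch (link) {
    case LINK_GOOD:
        /* pay the transfer cost now; recovery later is fast */
        printf("hard checkpoint: state sent to stable storage\n");
        break;
    case LINK_WEAK:
        /* keep failure-free overhead low; recovery may take longer */
        printf("soft checkpoint: state kept at the local support station\n");
        break;
    case LINK_DISCONNECTED:
        printf("defer checkpoint until the host reconnects\n");
        break;
    }
}

int main(void) {
    take_checkpoint(LINK_GOOD);
    take_checkpoint(LINK_WEAK);
    return 0;
}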
How to Recover Efficiently and Asynchronously when Optimism Fails
- In Proceedings of the 16th International Conference on Distributed Computing Systems
, 1996
"... We propose a new algorithm for recovering asynchronously from failures in a distributed computation. Our algorithm is based on two novel concepts - a fault-tolerant vector clock to maintain causality information in spite of failures, and a history mechanism to detect orphan states and obsolete messa ..."
Abstract
-
Cited by 49 (5 self)
- Add to MetaCart
We propose a new algorithm for recovering asynchronously from failures in a distributed computation. Our algorithm is based on two novel concepts - a fault-tolerant vector clock to maintain causality information in spite of failures, and a history mechanism to detect orphan states and obsolete messages. These two mechanisms together with checkpointing and message-logging are used to restore the system to a consistent state after a failure of one or more processes. Our algorithm is completely asynchronous. It handles multiple failures, does not assume any message ordering, causes the minimum amount of rollback and restores the maximum recoverable state with low overhead. Earlier optimistic protocols lack one or more of the above properties.
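As a rough illustration of the first concept, a vector clock can be extended so that each entry carries an incarnation number that is bumped whenever its process restarts after a failure, and entries are compared lexicographically. This is only a plausible sketch of the idea summarized above, not the paper's algorithm, and all names are illustrative.

/* A minimal sketch of a fault-tolerant vector clock: each entry pairs an
 * incarnation number (bumped on restart after a failure) with an ordinary
 * event counter.  Field and function names are illustrative. */
#include <stdio.h>

#define NPROC 4

typedef struct {
    int inc;   /* how many times this process has restarted */
    int cnt;   /* event counter within the current incarnation */
} Entry;

typedef struct {
    int id;
    Entry v[NPROC];
} FTClock;

/* (inc, cnt) pairs compare lexicographically: a later incarnation
 * dominates any counter value of an earlier one. */
static int entry_less(Entry a, Entry b) {
    return (a.inc < b.inc) || (a.inc == b.inc && a.cnt < b.cnt);
}

static void ft_tick(FTClock *c)    { c->v[c->id].cnt++; }

static void ft_restart(FTClock *c) { c->v[c->id].inc++; c->v[c->id].cnt = 0; }

/* On message receipt, take the component-wise maximum, then tick. */
static void ft_receive(FTClock *c, const FTClock *msg) {
    for (int i = 0; i < NPROC; i++)
        if (entry_less(c->v[i], msg->v[i]))
            c->v[i] = msg->v[i];
    ft_tick(c);
}

int main(void) {
    FTClock p0 = { .id = 0 }, p1 = { .id = 1 };
    ft_tick(&p0);             /* p0 does some work and sends a message  */
    ft_receive(&p1, &p0);     /* p1 receives it                         */
    ft_restart(&p0);          /* p0 fails and restarts: new incarnation */
    ft_tick(&p0);
    printf("p0[0]=(%d,%d) p1[0]=(%d,%d)\n",
           p0.v[0].inc, p0.v[0].cnt, p1.v[0].inc, p1.v[0].cnt);
    /* p1 still holds an entry from p0's old incarnation; comparing the
     * (inc,cnt) pairs is what lets a recovery protocol spot such
     * orphaned, pre-failure dependencies. */
    return 0;
}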
Distributed Shared Memory: Where we are and where . . .
"... It has been almost ten years since the birth of the first distributed shared memory (DSM) system, Ivy. While significant progress has been made in the area of improving the performance of DSM and DSM has been the focus of several dozen PhD theses, its overall impact on "real" users and app ..."
Abstract
-
Cited by 47 (1 self)
- Add to MetaCart
It has been almost ten years since the birth of the first distributed shared memory (DSM) system, Ivy. While significant progress has been made in the area of improving the performance of DSM and DSM has been the focus of several dozen PhD theses, its overall impact on "real" users and applications has been small. The goal of this paper is to present our position on what remains to be done before DSM will have a significant impact on real applications. More specifically, we reflect on what we believe have been the major advances in the area, what the important outstanding problems are, and what work needs to be done. Finally, we describe a modest step towards solving these problems, the Quarks DSM system.
Application-level checkpointing for shared memory programs
- In ASPLOS-XI: Proceedings of the 11th International Conference on Architectural Support for Programming Languages and Operating Systems
, 2004
"... Trends in high-performance computing are making it necessary for long-running applications to tolerate hardware faults. The most commonly used approach is checkpoint and restart (CPR)- the state of the computation is saved periodically on disk, and when a failure occurs, the computation is restarted ..."
Abstract
-
Cited by 46 (5 self)
- Add to MetaCart
(Show Context)
Trends in high-performance computing are making it necessary for long-running applications to tolerate hardware faults. The most commonly used approach is checkpoint and restart (CPR): the state of the computation is saved periodically on disk, and when a failure occurs, the computation is restarted from the last saved state. At present, it is the responsibility of the programmer to instrument applications for CPR. Our group is investigating the use of compiler technology to instrument codes to make them self-checkpointing and self-restarting, thereby providing an automatic solution to the problem of making long-running scientific applications resilient to hardware faults. Our previous work focused on message-passing programs. In this paper, we describe such a system for shared-memory programs running on symmetric multiprocessors. This system has two components: (i) a pre-compiler for source-to-source modification of applications, and (ii) a runtime system that implements a protocol for coordinating CPR among the threads of the parallel application.
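A minimal sketch of the coordination pattern described above, assuming a pthreads program on an SMP: threads meet at a barrier at an agreed safe point, one thread serializes the shared state, and a second barrier keeps anyone from mutating that state until the snapshot is written. In the actual system the pre-compiler inserts such calls automatically; the names and the file format here are hypothetical.

/* Sketch of barrier-coordinated application-level checkpointing for a
 * shared-memory (pthreads) program.  Illustrative only. */
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4

static pthread_barrier_t bar;
static double shared_state[1024];          /* stand-in for application data */

static void maybe_checkpoint(int tid) {
    pthread_barrier_wait(&bar);            /* everyone reaches a safe point  */
    if (tid == 0) {                        /* one thread writes the snapshot */
        FILE *f = fopen("ckpt.bin", "wb");
        if (f) { fwrite(shared_state, sizeof shared_state, 1, f); fclose(f); }
    }
    pthread_barrier_wait(&bar);            /* no one mutates state until done */
}

static void *worker(void *arg) {
    int tid = (int)(long)arg;
    for (int step = 0; step < 3; step++) {
        shared_state[tid] += step;         /* "compute" */
        maybe_checkpoint(tid);             /* the pre-compiler would insert this */
    }
    return NULL;
}

int main(void) {
    pthread_t t[NTHREADS];
    pthread_barrier_init(&bar, NULL, NTHREADS);
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, (void *)i);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    pthread_barrier_destroy(&bar);
    return 0;
}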
Reduced Overhead Logging for Rollback Recovery in Distributed Shared Memory
- In Proc. of the 25th Annual Int'l Symp. on Fault-Tolerant Computing
, 1995
"... Rollback techniques that use message logging and deterministic replay can be used in parallel systems to recover a failed node without involving other nodes. Distributed shared memory (DSM) systems cannot directly apply message-passing logging techniques because they use inherently nondeterministic ..."
Abstract
-
Cited by 45 (4 self)
- Add to MetaCart
(Show Context)
Rollback techniques that use message logging and deterministic replay can be used in parallel systems to recover a failed node without involving other nodes. Distributed shared memory (DSM) systems cannot directly apply message-passing logging techniques because they use inherently nondeterministic asynchronous communication. This paper presents new logging schemes that reduce the typically high overhead for logging in DSM. Our algorithm for sequentially consistent systems tracks rather than logs accesses to shared memory. In an extension of this method to lazy release consistency, the per-access overhead of tracking has been completely eliminated. Measurements with parallel applications show a significant reduction in failure-free overhead.
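A minimal sketch of "tracking rather than logging": instead of saving the data returned by every shared-memory access, the node records only which page and which page version it observed, and deterministic replay re-fetches the same versions after a failure. The record layout and names are assumptions made for illustration.

/* Sketch: record (page, version) pairs instead of the accessed data,
 * so replay can deterministically re-fetch the same values. */
#include <stdio.h>

typedef struct {
    int page;      /* shared page that was accessed        */
    int version;   /* version of the page that was visible */
} AccessRecord;

#define LOG_CAP 1024
static AccessRecord access_log[LOG_CAP];
static int log_len = 0;

/* Called by the DSM layer whenever a read misses and a page copy arrives. */
static void track_access(int page, int version) {
    if (log_len < LOG_CAP)
        access_log[log_len++] = (AccessRecord){ page, version };
}

int main(void) {
    track_access(7, 3);     /* read page 7, saw version 3 */
    track_access(2, 11);
    for (int i = 0; i < log_len; i++)
        printf("replay: fetch page %d at version %d\n",
               access_log[i].page, access_log[i].version);
    return 0;
}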
Lightweight Logging for Lazy Release Consistent Distributed Shared Memory
- In Proc. of the USENIX 2nd Symp. on Operating Systems Design and Implementation
, 1996
"... This paper presents a new logging and recovery algorithm for lazy release consistent distributed shared memory (DSM). The new algorithm tolerates single node failures by maintaining a distributed log of data dependencies in the volatile memory of processes. The algorithm adds very little overhead to ..."
Abstract
-
Cited by 35 (1 self)
- Add to MetaCart
This paper presents a new logging and recovery algorithm for lazy release consistent distributed shared memory (DSM). The new algorithm tolerates single node failures by maintaining a distributed log of data dependencies in the volatile memory of processes. The algorithm adds very little overhead to the memory consistency protocol: it sends no additional messages during failure-free periods; it adds only a minimal amount of data to one of the DSM protocol messages; it introduces no forced rollbacks of non-faulty processes; and it performs no communication-induced accesses to stable storage. Furthermore, the algorithm logs only a very small amount of data, because it uses the log of memory accesses already maintained by the memory consistency protocol. The algorithm was implemented in TreadMarks, a state-of-the-art DSM system. Experimental results show that the algorithm has near zero time overhead and very low space overhead during failure-free execution, thus refuting the common belief ...
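A minimal sketch of the piggybacking idea, under the assumption that the extra data rides on the reply to a lock acquire (a message the lazy-release-consistency protocol already sends) and that the dependency record is kept in the granting process's volatile memory. The structures and field names are illustrative, not the protocol's actual message format.

/* Sketch: a small dependency record piggybacked on an existing DSM
 * protocol message and logged in volatile memory only. */
#include <stdio.h>

typedef struct {
    int lock_id;
    int acquirer;           /* process being granted the lock           */
    int acquirer_interval;  /* the acquirer's current LRC interval index */
} DependencyRecord;

typedef struct {
    int lock_id;
    int write_notices;      /* stand-in for the usual LRC consistency payload */
    DependencyRecord dep;   /* small extra field: the piggybacked log entry   */
} LockGrantMsg;

#define LOG_CAP 256
static DependencyRecord volatile_log[LOG_CAP];
static int log_len = 0;

/* Granting side: build the normal reply and log the dependency locally. */
static LockGrantMsg grant_lock(int lock_id, int acquirer, int acquirer_interval) {
    LockGrantMsg m = { lock_id, /*write_notices=*/42,
                       { lock_id, acquirer, acquirer_interval } };
    if (log_len < LOG_CAP)
        volatile_log[log_len++] = m.dep;   /* in-memory log, no stable storage */
    return m;
}

int main(void) {
    LockGrantMsg m = grant_lock(3, /*acquirer=*/1, /*interval=*/17);
    printf("granted lock %d; logged dependency on P%d interval %d\n",
           m.lock_id, m.dep.acquirer, m.dep.acquirer_interval);
    return 0;
}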
A Survey Of Recoverable Distributed Shared Memory Systems
- IEEE Transactions on Parallel and Distributed Systems
, 1995
"... Distributed Shared Memory (dsm) systems provide a shared memory abstraction on distributed memory architectures (distributed memory multicomputers, networks of workstations). Such systems ease parallel application programming since the shared memory programming model is often more natural than the ..."
Abstract
-
Cited by 26 (13 self)
- Add to MetaCart
(Show Context)
Distributed Shared Memory (DSM) systems provide a shared memory abstraction on distributed memory architectures (distributed memory multicomputers, networks of workstations). Such systems ease parallel application programming since the shared memory programming model is often more natural than the message-passing paradigm. However, the probability of failure of a DSM system increases with the number of sites. Thus, fault tolerance mechanisms must be implemented in order to allow processes to continue their execution in the event of a failure. This paper gives an overview of recoverable DSM systems (RDSM) that provide a checkpointing mechanism to restart parallel computations after a site failure.
Integrating Coherency and Recoverability in Distributed Systems
- In Proceedings of the First Symposium on Operating Systems Design and Implementation (OSDI '94)
, 1994
"... We propose a technique for maintaining coherency of a transactional distributed shared memory, used by applications accessing a shared persistent store. Our goal is to improve support for fine-grained distributed data sharing in collaborative design applications, such as CAD systems and software dev ..."
Abstract
-
Cited by 25 (5 self)
- Add to MetaCart
We propose a technique for maintaining coherency of a transactional distributed shared memory, used by applications accessing a shared persistent store. Our goal is to improve support for fine-grained distributed data sharing in collaborative design applications, such as CAD systems and software development environments. In contrast, traditional research in distributed shared memory has focused on supporting parallel programs; in this paper, we show how distributed programs can benefit from this shared-memory abstraction as well. Our approach, called log-based coherency, integrates coherency support with a standard mechanism for ensuring recoverability of persistent data. In our system, transaction logs are the basis of both recoverability and coherency. We have prototyped log-based coherency as a set of extensions to RVM [Satyanarayanan et al. 94], a runtime package supporting recoverable virtual memory. Our prototype adds coherency support to RVM in a simple way that does not require ...
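A minimal sketch of using one log for both purposes: a redo record appended at commit time can be replayed to recover the persistent store, and the same record can be shipped to a peer and applied to its cached copy to keep it coherent. The record layout and names are hypothetical.

/* Sketch: the same redo-log records drive both recovery and coherency. */
#include <stdio.h>

#define STORE_SIZE 16

typedef struct {
    int offset;   /* where in the shared persistent region the write landed */
    int value;    /* new value (redo information)                            */
} RedoRecord;

typedef struct {
    int cache[STORE_SIZE];
} Node;

/* Recovery path and coherency path share the same application step. */
static void apply_record(Node *n, RedoRecord r) {
    n->cache[r.offset] = r.value;
}

int main(void) {
    Node writer = {{0}}, peer = {{0}};
    RedoRecord log[4];
    int log_len = 0;

    /* Writer commits a transaction: mutate locally and append redo records. */
    writer.cache[3] = 99;
    log[log_len++] = (RedoRecord){ 3, 99 };

    /* Coherency: ship the same records to the peer instead of whole pages. */
    for (int i = 0; i < log_len; i++)
        apply_record(&peer, log[i]);

    /* Recoverability: after a crash, replaying the log rebuilds the state. */
    Node recovered = {{0}};
    for (int i = 0; i < log_len; i++)
        apply_record(&recovered, log[i]);

    printf("peer[3]=%d recovered[3]=%d\n", peer.cache[3], recovered.cache[3]);
    return 0;
}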
Using time to improve the performance of coordinated checkpointing
- In Proceedings of the International Computer Performance & Dependability Symposium
, 1996
"... This paper describes and evaluates a coordinated checkpoint protocol that uses time to eliminate several performance overheads that are present in traditional protocols. The time-based protocol does not have to exchange coordination messages, does not need to add information to the processes ’ messa ..."
Abstract
-
Cited by 22 (9 self)
- Add to MetaCart
(Show Context)
This paper describes and evaluates a coordinated checkpoint protocol that uses time to eliminate several performance overheads that are present in traditional protocols. The time-based protocol does not have to exchange coordination messages, does not need to add information to the processes’ messages, and only accesses stable storage when checkpoints are saved. This protocol uses a simple initialization procedure to set checkpoint timers at the different processes. After the initialization, each process saves its state independently from the other processes. By disallowing processes from sending messages during an interval before the checkpoint time, the protocol prevents in-transit messages from occurring. Two coordinated checkpoint protocols were implemented on a CM-5, and their performance was compared using several applications. Results showed that the time-based protocol outperforms the two-phase protocol in all applications.
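A minimal sketch of the timing discipline described above: each process takes a checkpoint when its local timer reaches the agreed instant and holds outgoing application messages during a short silence window before that instant, so no message can be in transit across the checkpoint line. The constants and names are illustrative, and the initialization procedure that sets the timers is not modeled here.

/* Sketch of time-based coordinated checkpointing for one process:
 * checkpoint on timer expiry, block sends shortly before it. */
#include <stdio.h>

#define PERIOD   100   /* checkpoint every 100 time units               */
#define SILENCE   10   /* no sends during the last 10 units of a period */

static int next_ckpt = PERIOD;

/* Returns 1 if the application may send a message at local time `now`. */
static int may_send(int now) {
    return now < next_ckpt - SILENCE;
}

/* Called periodically with the local clock; takes a checkpoint when due. */
static void tick(int now) {
    if (now >= next_ckpt) {
        printf("t=%d: saving checkpoint\n", now);   /* write state to disk */
        next_ckpt += PERIOD;
    }
}

int main(void) {
    for (int now = 0; now <= 210; now += 5) {
        tick(now);
        if (!may_send(now))
            printf("t=%d: holding outgoing messages until after checkpoint\n", now);
    }
    return 0;
}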