A Survey of Rollback-Recovery Protocols in Message-Passing Systems
, 1996
Cited by 716 (22 self)
In this paper, we use the terms event logging and message logging interchangeably.
Adaptive Recovery for Mobile Environments
- Communications of the ACM
, 1997
Cited by 66 (6 self)
Mobile computing allows ubiquitous and continuous access to computing resources while the users travel or work at a client’s site. The flexibility introduced by mobile computing brings new challenges to the area of fault tolerance. Failures that were rare with fixed hosts become common, and host disconnection makes fault detection and message coordination difficult. This paper describes a new checkpoint protocol that is well adapted to mobile environments. The protocol uses time to indirectly coordinate the creation of new global states, avoiding all message exchanges. The protocol uses two different types of checkpoints to adapt to the current network characteristics, and to trade off performance with recovery time.
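As a rough illustration of how timers can stand in for coordination messages, the sketch below has each process derive its checkpoint schedule from its own clock. It is not the protocol from the paper: the function names, the fixed period, and the assumption of perfectly synchronized integer clocks are hypothetical simplifications, and the abstract does not say how clock drift or the two checkpoint types are handled.

```python
def checkpoint_index(local_time: int, period: int) -> int:
    """Index of the most recent checkpoint boundary at or before local_time."""
    if period <= 0:
        raise ValueError("period must be positive")
    return local_time // period


def next_checkpoint_time(local_time: int, period: int) -> int:
    """Local time at which the next checkpoint timer should fire."""
    return (checkpoint_index(local_time, period) + 1) * period


# Hypothetical example: three processes read their clocks at different moments,
# yet each one schedules its next checkpoint without exchanging any messages.
PERIOD = 100
readings = {"p0": 205, "p1": 250, "p2": 299}
schedule = {name: next_checkpoint_time(t, PERIOD) for name, t in readings.items()}
```

Because every process applies the same rule to its own clock, the processes agree on the checkpoint boundary as long as their clocks agree; a real protocol would also have to bound the effect of clock drift.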
Coordinated checkpointing without direct coordination
- IEEE International Computer Performance and Dependability Symposium (IPDS98), 7–9
, 1998
RENEW: A Tool for Fast and Efficient Implementation of Checkpoint Protocols
- In Proceedings of the 28th IEEE Fault-Tolerant Computing Symposium (FTCS
, 1998
Cited by 19 (7 self)
This paper describes the design, implementation, and evaluation of a run-time system for clusters of workstations that allows the rapid testing of checkpoint protocols with standard benchmarks. To achieve this goal, RENEW provides a flexible set of operations that facilitates the integration of a protocol in the system with reduced programming effort. To support a broad range of applications, RENEW exports, as its external interface, the industry endorsed Message Passing Interface (MPI). Three distinct classes of protocols were evaluated using the RENEW environment with SPEC and NAS benchmarks on a network of workstations connected by ATM. It was observed that the communication-induced protocol emulated the behavior of the coordinated protocol, with comparable performance. The message logging protocol degraded the performance. Even though the message logging protocol was slower due to log replay, all three protocols required a similar amount of time to restore the application to the same state as before failure occurred and recovery was initiated.
Adaptive Checkpointing with Storage Management for Mobile Environments
- IEEE Transactions on Reliability
, 1998
Cited by 17 (2 self)
This paper describes an adaptive protocol that manages storage for base stations. The protocol integrates leasing storage management with a time-based coordinated checkpointing mechanism. The leasing enables storage managers to effectively control disk space. Leasing prevents hung processes from indefinitely retaining storage and, in addition, garbage collection is simple. Time-based ...
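The leasing idea in this abstract can be illustrated with a small lease table. This is only a sketch under assumptions: the class name `LeaseTable`, its methods, and the integer time units are hypothetical and are not taken from the paper, which does not describe its data structures in the abstract.

```python
from dataclasses import dataclass, field


@dataclass
class LeaseTable:
    """Tracks which owner holds storage and until when (hypothetical structure)."""
    duration: int
    expiry: dict = field(default_factory=dict)  # owner -> lease expiry time

    def grant(self, owner: str, now: int) -> int:
        """Grant or renew a lease; returns the new expiry time."""
        self.expiry[owner] = now + self.duration
        return self.expiry[owner]

    def collect(self, now: int) -> list:
        """Reclaim storage whose lease has expired; returns the owners evicted."""
        expired = sorted(o for o, t in self.expiry.items() if t <= now)
        for owner in expired:
            del self.expiry[owner]
        return expired
```

A process that stops renewing its lease, for example because it has hung, loses its storage at the next `collect` call, so garbage collection reduces to a scan for expired entries.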
PREACHES - Portable Recovery and Checkpointing in Heterogeneous Systems
- Proceedings of IEEE Fault-Tolerant Computing Symposium
, 1998
Cited by 13 (2 self)
Checkpointing in a homogeneous environment, where both checkpointing and recovery are performed on the same type of machine and operating system, has been studied extensively. As heterogeneous distributed systems become pervasive, it is desirable to extend the capability of checkpointing to non-homogeneous environments. This paper describes a prototype, PREACHES, that achieves portable checkpointing of single process applications in heterogeneous systems using checkpoint propagation. The checkpoint propagation technique generates machine-dependent checkpoints for each different architecture in the heterogeneous environment. When failure occurs, the failed process can be restarted on a specified machine with the checkpoint that is appropriate for the architecture. An implementation of PREACHES on a heterogeneous network of workstations has been successfully developed based on TCP/IP communication. PREACHES also provides automatic and fast recovery for single process programs.
An Adaptive Checkpointing Protocol to Bound Recovery Time with Message Logging
- Proc. of the 18th IEEE Symposium on Reliable Distributed Systems
, 1999
Fault detection using hints from the socket layer
- In Proceedings of the 16th Symposium on Reliable Distributed Systems
, 1997
Cited by 9 (6 self)
This paper describes a fault detection mechanism that uses the error codes returned by the stream sockets to locate process failures. Since these errors are generated automatically when there is communication with a failed process, the mechanism does not incur any failure-free overhead. However, for some types of faults, detection can only be attained if the surviving processes use certain communication operations. To assess the coverage and latency of the proposed mechanism, faults were injected during the execution of parallel applications. Our results show that in most cases, faults could be found using only the errors from the socket layer. Depending on the type of fault that was injected, detection occurred in an interval ranging from a few milliseconds to less than 9 minutes.
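A minimal sketch of the general idea, treating socket error codes as hints that a peer has failed, might look like the following. The set of error codes and the helper names `suspects_peer_failure` and `send_with_detection` are assumptions made for illustration; the abstract does not list which errors the paper relies on.

```python
import errno

# Error codes that commonly indicate the peer process or its host is gone.
# This particular set is an assumption, not the list used in the paper.
PEER_FAILURE_ERRNOS = {
    errno.ECONNRESET,
    errno.EPIPE,
    errno.ECONNREFUSED,
    errno.ETIMEDOUT,
    errno.EHOSTUNREACH,
}


def suspects_peer_failure(exc: OSError) -> bool:
    """Return True if the error code hints that the remote process failed."""
    return exc.errno in PEER_FAILURE_ERRNOS


def send_with_detection(sock, data: bytes) -> bool:
    """Send data; return False if the resulting error suggests a failed peer."""
    try:
        sock.sendall(data)
        return True
    except OSError as exc:
        if suspects_peer_failure(exc):
            return False
        raise  # unrelated errors are not treated as failure hints
```

No extra traffic is generated here: the hint appears only when the application itself communicates, which matches the abstract's remark that some faults are detected only if the surviving processes use certain communication operations.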
Time-based coordinated checkpointing
, 1998
Cited by 3 (0 self)
Distributed systems are being used to support the execution of applications ranging from long-running scientific simulators to e-commerce on the Internet. In this type of environment, the failure of one of its components, either a computer or the network, may prevent other components from completing their tasks. Since the probability of failure increases with the number of computers and execution time, it is likely that these applications will be interrupted unless provision is made for failure handling. In this thesis we address the problem of fault recovery in distributed systems. The thesis describes two variations of a coordinated checkpoint protocol that uses time to remove most causes of overhead, and to avoid all types of direct coordination. The time-based protocol does not have to transmit extra messages, does not need to tag the application messages, and only accesses the stable storage when the checkpoints are saved. The thesis also describes a new coordinated checkpoint protocol that is well adapted to mobile environments. It uses time to indirectly coordinate the creation of new global states, and it saves two different types of checkpoints to adapt its behavior to the current network characteristics. Traditional techniques for fault diagnosis in distributed systems, either based on watchdogs or polling, exchange performance with detection latency. The thesis introduces a complementary ...
Synergistic Coordination between Software and Hardware Fault Tolerance Techniques
- In Proceedings of the International Conference on Dependable Systems and Networks (DSN-2001), Göteborg, Sweden
, 2001
Cited by 3 (2 self)
This paper describes an approach for enabling the synergistic coordination between two fault tolerance protocols to simultaneously tolerate software and hardware faults in a distributed computing environment. Specifically, our approach is based on a message-driven confidence-driven (MDCD) protocol that we have devised for tolerating software design faults, and a time-based (TB) checkpointing protocol that was developed by Neves and Fuchs for tolerating hardware faults. By carrying out algorithm modifications that are conducive to synergistic coordination between volatile-storage and stable-storage checkpoint establishments, we are able to circumvent the potential interference between the MDCD and TB protocols, and to allow them to effectively complement each other to extend a system's fault tolerance capability. Moreover, the protocol-coordination approach preserves and enhances the features and advantages of the individual protocols that participate in the coordination, keeping the performance cost low.