Results 1 - 10 of 48
A Survey of Rollback-Recovery Protocols in Message-Passing Systems
, 1996
Cited by 474 (24 self)
... this paper, we use the terms event logging and message logging interchangeably ...
Message Logging: Pessimistic, Optimistic, and Causal
- IEEE Transactions on Software Engineering
, 1995
Cited by 93 (17 self)
Message logging protocols are an integral part of a technique for implementing processes that can recover from crash failures. All message logging protocols require that, when recovery is complete, there be no orphan processes, which are surviving processes whose states are inconsistent with the recovered state of a crashed process. We give a precise specification of the consistency property "no orphan processes". From this specification, we describe how different existing classes of message logging protocols (namely optimistic, pessimistic, and a class that we call causal) implement this property. We then propose a set of metrics to evaluate the performance of message logging protocols, and characterize the protocols that are optimal with respect to these metrics. Finally, starting from a protocol that relies on causal delivery order, we show how to derive optimal causal protocols that tolerate f overlapping failures and recoveries for a parameter f: 1 ≤ f ≤ n.
Portable Checkpointing for Heterogeneous Architectures
- In Symposium on Fault-Tolerant Computing
, 1997
Cited by 56 (4 self)
Current approaches for checkpointing assume system homogeneity, where checkpointing and recovery are both performed on the same processor architecture and operating system configuration. Sometimes it is desirable or necessary to recover a failed computation on a different processor architecture. For such situations checkpointing and recovery must be portable. In this paper, we argue that source-to-source compilation is an appropriate concept for this purpose. We describe the compilation techniques that we developed for the design of the c2ftc prototype. The c2ftc compiler enables machine-independent checkpoints by automatic generation of checkpointing and recovery code. Sequential C programs are compiled into fault tolerant C programs, whose checkpoints can be migrated across heterogeneous networks, and restarted on binary incompatible architectures. Experimental results on several systems provide evidence that the performance penalty of portable checkpointing is negligible for reali...
Lazy Checkpoint Coordination for Bounding Rollback Propagation
- in Proc. IEEE Symp. Reliable Distributed Syst
, 1993
Cited by 54 (7 self)
In this paper, we propose the technique of lazy checkpoint coordination, which preserves process autonomy while employing communication-induced checkpoint coordination for bounding rollback propagation. The notion of laziness is introduced to control the coordination frequency and allow a flexible trade-off between the cost of checkpoint coordination and the average rollback distance. Worst-case overhead analysis provides a means for estimating the extra checkpoint overhead. Communication trace-driven simulation for several parallel programs is used to evaluate the benefits of the proposed scheme.
1 Introduction
Uncoordinated checkpointing [1-3] for parallel and distributed systems allows maximum process autonomy and independent design of recovery capability for each process. However, in a general nondeterministic execution, cascading rollback propagation may result in the domino effect [4] which can prevent progression of the recovery line. It has been shown that message reordering [...
Exploring Failure Transparency and the Limits of Generic Recovery
- In Proc. 4th USENIX Symposium on Operating Systems Design and Implementation
, 2000
Cited by 46 (7 self)
We explore the abstraction of failure transparency in which the operating system provides the illusion of failure-free operation. To provide failure transparency, an operating system must recover applications after hardware, operating system, and application failures, and must do so without help from the programmer or unduly slowing failure-free performance. We describe two invariants that must be upheld to provide failure transparency: one that ensures sufficient application state is saved to guarantee the user cannot discern failures, and another that ensures sufficient application state is lost to allow recovery from failures affecting application state. We find that several real applications get failure transparency in the presence of simple stop failures with overhead of 0-12%. Less encouragingly, we find that applications violate one invariant in the course of upholding the other for more than 90% of application faults and 3-15% of operating system faults, rendering transparent recovery impossible for these cases.
A Case for Two-Level Distributed Recovery Schemes
- In ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems
, 1995
Cited by 37 (3 self)
Most distributed and multiprocessor recovery schemes proposed in the literature are designed to tolerate an arbitrary number of failures. In this paper, we demonstrate that it is often advantageous to use "two-level" recovery schemes. A two-level recovery scheme tolerates the more probable failures with low performance overhead, while the less probable failures may be tolerated with a higher overhead. By minimizing the overhead for the more frequently occurring failure scenarios, our approach is expected to achieve lower performance overhead (on average) as compared to existing recovery schemes. To demonstrate the advantages of two-level recovery, we evaluate the performance of a recovery scheme that takes two different types of checkpoints, namely, 1-checkpoints and N-checkpoints. A single failure can be tolerated by rolling the system back to a 1-checkpoint, while multiple failure recovery is possible by rolling back to an N-checkpoint. For such a system, we demonstrate that to mini...
Egida: An extensible toolkit for low-overhead fault-tolerance
- In Symposium on Fault-Tolerant Computing
, 1999
Cited by 34 (4 self)
We discuss the design and implementation of Egida, an object-oriented toolkit designed to support transparent rollback-recovery. Egida exports a simple specification language that can be used to express arbitrary rollback recovery protocols. From this specification, Egida automatically synthesizes an implementation of the specified protocol by gluing together the appropriate objects from an available library of “building blocks”. Egida is extensible and facilitates rapid implementation of rollback recovery protocols with minimal programming effort. We have integrated Egida with the MPICH implementation of the MPI standard. Existing MPI applications can take advantage of Egida without any modifications: fault-tolerance is achieved transparently, and all that is needed is a simple re-link of the MPI application with Egida.
An architecture for distributed OASIS services
- In IFIP/ACM International Conference on Distributed Systems Platforms
, 2000
Cited by 24 (6 self)
Role-based access control promises a more flexible form of access control for distributed systems. Rather than basing access solely on the identity of a principal, the decision also takes into account the roles that the principal currently holds. We present a distributed architecture that supports the OASIS role-based access control model. The OASIS model is based on certificates held by the client and validated by credential records held by servers. We wish to replicate and distribute the credential records to support high availability and reduce latency for certificate validation. Protocols are presented for maintaining replicated credential databases and coping with both server and network failures.
The Cost of Recovery in Message Logging Protocols
, 1998
Cited by 24 (3 self)
Past research in message logging has focused on studying the relative overhead imposed by pessimistic, optimistic, and causal protocols during failure-free executions. In this paper, we give the first experimental evaluation of the performance of these protocols during recovery. We discover that, if a single failure is to be tolerated, pessimistic and causal protocols perform best, because they avoid rollbacks of correct processes. For multiple failures, however, the dominant factor in determining performance becomes where the recovery information is logged (i.e., at the sender, at the receiver, or replicated at a subset of the processes in the system) rather than when this information is logged (i.e., if logging is synchronous or asynchronous).
1 Introduction
Message-logging protocols (for example, [2, 3, 4, 6, 9, 10, 14, 15]) are popular techniques for building systems that can tolerate process crash failures. These protocols are built on the assumption that the state of a process is...
Tolerating Mobile Support Station Failures
- of the University of Texas at Dallas
, 1993
Cited by 21 (1 self)
Mobile computing is a growing trend since it provides users access to information irrespective of their location. Mobile computing systems should be fault tolerant as they are used in several applications like stock trading, courier services, etc. In this paper, we consider the problem of designing mobile computing systems that can tolerate mobile support station failures. Our main goal is to provide solutions such that mobile hosts can continue to operate in spite of support station failures. We provide two schemes to tolerate support station failures and discuss some important related issues.
Keywords: Mobile Computing, MSS Failure, Mobile Hosts, Fault Tolerance, Optimistic and Pessimistic Schemes.
1 Introduction
The advent of cellular communication, PCN, and wireless LAN has made mobile computing realizable in practice. A mobile computing environment consists of a set of static and mobile hosts. A mobile host is a host that can move while retaining its conn...

