Results 1 - 10
of
155
A Survey of Rollback-Recovery Protocols in Message-Passing Systems
, 1996
"... this paper, we use the terms event logging and message logging interchangeably ..."
Abstract
-
Cited by 474 (24 self)
- Add to MetaCart
this paper, we use the terms event logging and message logging interchangeably
Preserving and Using Context Information in Interprocess Communication
- ACM Transactions on Computer Systems
, 1989
"... ion Psync is based on a conversation abstraction that provides a shared message space through which a collection of processes exchange messages. The general form of this message space is defined by a directed acyclic graph that preserves the partial order of the exchanged messages. For the purpose ..."
Abstract
-
Cited by 210 (24 self)
- Add to MetaCart
ion Psync is based on a conversation abstraction that provides a shared message space through which a collection of processes exchange messages. The general form of this message space is defined by a directed acyclic graph that preserves the partial order of the exchanged messages. For the purpose of this section, we view a conversation as an abstract data type that is implemented in shared memory; Section 3 gives an algorithm for implementing a conversation in an unreliable network. A conversation behaves much like any connection-oriented IPC abstraction: A well-defined set of processes---called participants---explicitly open a conversation, exchange messages through it, and close the conversation. Only processes that have been identified as participants may exchange message through the conversation, and this set is fixed for the duration of the conversation. Processes begin a conversation with the operations: conv = active open(participant set) conv = passive open(pid) The first...
Detecting Causal Relationships in Distributed Computations: In Search of the Holy Grail
- In search of the holy grail. Distributed Computing
, 1994
"... : The paper shows that characterizing the causal relationship between significant events is an important but non-trivial aspect for understanding the behavior of distributed programs. An introduction to the notion of causality and its relation to logical time is given; some fundamental results conce ..."
Abstract
-
Cited by 187 (4 self)
- Add to MetaCart
: The paper shows that characterizing the causal relationship between significant events is an important but non-trivial aspect for understanding the behavior of distributed programs. An introduction to the notion of causality and its relation to logical time is given; some fundamental results concerning the characterization of causality are presented. Recent work on the detection of causal relationships in distributed computations is surveyed. The issue of observing distributed computations in a causally consistent way and the basic problems of detecting global predicates are discussed. To illustrate the major difficulties, some typical monitoring and debugging approaches are assessed, and it is demonstrated how their feasibility is severely limited by the fundamental problem to master the complexity of causal relationships. Keywords: Distributed Computation, Causality, Distributed System, Causal Ordering, Logical Time, Vector Time, Global Predicate Detection, Distributed Debugging, ...
Manetho: Transparent Rollback-Recovery with Low Overhead, Limited Rollback and Fast Output Commit
- IEEE TRANSACTIONS ON COMPUTERS
, 1992
"... Manetho is a new transparent rollback-recovery protocol for long-running distributed computations. It uses a novel combination of antecedence graph maintenance, uncoordinated checkpointing, and sender-based message logging. Manetho simultaneously achieves the advantages of pessimistic message loggin ..."
Abstract
-
Cited by 181 (10 self)
- Add to MetaCart
Manetho is a new transparent rollback-recovery protocol for long-running distributed computations. It uses a novel combination of antecedence graph maintenance, uncoordinated checkpointing, and sender-based message logging. Manetho simultaneously achieves the advantages of pessimistic message logging, namely limited rollback and fast output commit, and the advantage of optimistic message logging, namely low failure-free overhead. These advantages come at the expense of a complex recovery scheme.
The Performance of Consistent Checkpointing
- In Proceedings of the 11th Symposium on Reliable Distributed Systems
, 1992
"... Consistent checkpointing provides transparent fault tolerance for long-running distributed applications. In this paper we describe performance measurements of an implementation of consistent checkpointing. Our measurements show that consistent checkpointing performs remarkably well. We executed eigh ..."
Abstract
-
Cited by 181 (9 self)
- Add to MetaCart
Consistent checkpointing provides transparent fault tolerance for long-running distributed applications. In this paper we describe performance measurements of an implementation of consistent checkpointing. Our measurements show that consistent checkpointing performs remarkably well. We executed eight compute-intensive distributed applications on a network of 16 diskless Sun-3/60 workstations, comparing the performance without checkpointing to the performance with consistent checkpoints taken at 2-minute intervals. For six of the eight applications, the running time increased by less than 1% as a result of the checkpointing. The highest overhead measured for any of the applications was 5.8%. Incremental checkpointing and copy-on-write checkpointing were the most effective techniques in lowering the running time overhead. These techniques reduce the amount of data written to stable storage and allow the checkpoint to proceed concurrently with the execution of the processes. The overhead ...
Message Logging: Pessimistic, Optimistic, and Causal
- IEEE Transactions on Software Engineering
, 1995
"... Message logging protocols are an integral part of a technique for implementing processes that can recover from crash failures. All message logging protocols require that, when recovery is complete, there be no orphan processes, which are surviving processes whose states are inconsistent with the rec ..."
Abstract
-
Cited by 93 (17 self)
- Add to MetaCart
Message logging protocols are an integral part of a technique for implementing processes that can recover from crash failures. All message logging protocols require that, when recovery is complete, there be no orphan processes, which are surviving processes whose states are inconsistent with the recovered state of a crashed process. We give a precise specification of the consistency property "no orphan processes". From this specification, we describe how different existing classes of message logging protocols (namely optimistic, pessimistic, and a class that we call causal) implement this property. We then propose a set of metrics to evaluate the performance of message logging protocols, and characterize the protocols that are optimal with respect to these metrics. Finally, starting from a protocol that relies on causal delivery order, we show how to derive optimal causal protocols that tolerate f overlapping failures and recoveries for a parameter f : 1 f n. 1 Introduction Message ...
Flashback: A Lightweight Extension for Rollback and Deterministic Replay for Software Debugging
- In USENIX Annual Technical Conference, General Track
, 2004
"... Unfortunately, finding software bugs is a very challenging task because many bugs are hard to reproduce. While debugging a program, it would be very useful to rollback a crashed program to a previous execution point and deterministically re-execute the "buggy " code region. However, most p ..."
Abstract
-
Cited by 82 (6 self)
- Add to MetaCart
Unfortunately, finding software bugs is a very challenging task because many bugs are hard to reproduce. While debugging a program, it would be very useful to rollback a crashed program to a previous execution point and deterministically re-execute the "buggy " code region. However, most previous work on rollback and replay support was designed to survive hardware or operating system failures, and is therefore too heavyweight for the fine-grained rollback and replay needed for software debugging. This paper presents Flashback, a lightweight OS extension that provides fine-grained rollback and replay to help debug software. Flashback uses shadow processes to efficiently roll back in-memory state of a process, and logs a process ' interactions with the system to support deterministic replay. Both shadow processes and logging of system calls are implemented in a lightweight fashion specifically designed for the purpose of software debugging. We have implemented a prototype of Flashback in the Linux operating system. Our experimental results with micro-benchmarks and real applications show that Flashback adds little overhead and can quickly roll back a debugged program to a previous execution point and deterministically replay from that point.
Overview of Multidatabase Transaction Management
- VLDB JOURNAL
, 1992
"... A multidatabase system (MDBS) is a facility that allows users access to data located in multiple autonomous database management systems (DBMSs). In such a system, global transactions are executed under the control of the MDBS. Independently, local transactions are executed under the control of the l ..."
Abstract
-
Cited by 80 (13 self)
- Add to MetaCart
A multidatabase system (MDBS) is a facility that allows users access to data located in multiple autonomous database management systems (DBMSs). In such a system, global transactions are executed under the control of the MDBS. Independently, local transactions are executed under the control of the local DBMSs. Each local DBMS integrated by the MDBS may employ a different transaction management scheme. In addition, each local DBMS has complete control over all transactions (global and local) executing at its site, including the ability to abort at any point any of the transactions executing at its site. Typically, no design or internal DBMS structure changes are allowed in order to accommodate the MDBS. Furthermore, the local DBMSs may not be aware of each other, and, as a consequence, cannot coordinate their actions. Thus, traditional techniques for ensuring transaction atomicity and consistency in homogeneous distributed database systems may not be appropriate for an MDBS environment....
Treating bugs as allergies -- a safe method to survive software failures
- IN SOSP
, 2005
"... Many applications demand availability. Unfortunately, software failures greatly reduce system availability. Previous approaches for surviving software failures suffer from several limitations, including requiring application restructuring, failing to address deterministic software bugs, unsafely spe ..."
Abstract
-
Cited by 69 (6 self)
- Add to MetaCart
Many applications demand availability. Unfortunately, software failures greatly reduce system availability. Previous approaches for surviving software failures suffer from several limitations, including requiring application restructuring, failing to address deterministic software bugs, unsafely speculating on program execution, and re-quiring a long recovery time. This paper
Nonblocking and Orphan-Free Message Logging Protocols
, 1993
"... Currently existing message logging protocols demonstrate a classic pessimistic vs. optimistic tradeoff. We show that the optimistic--pessimistic tradeoff is not inherent to the problem of message logging. We construct a message--logging protocol that has the positive features of both optimistic and ..."
Abstract
-
Cited by 61 (14 self)
- Add to MetaCart
Currently existing message logging protocols demonstrate a classic pessimistic vs. optimistic tradeoff. We show that the optimistic--pessimistic tradeoff is not inherent to the problem of message logging. We construct a message--logging protocol that has the positive features of both optimistic and pessimistic protocols: our protocol prevents orphans and allows simple failure recovery; however, it requires no blocking in failure--free runs. Furthermore, this protocol does not introduce any additional message overhead as compared to one implemented for a system in which messages may be lost but processes do not crash. 1 Introduction Message logging protocols are a common method of building a system that can tolerate process crash failures. These protocols require that each process periodically record its local state and log the messages it received after that state. When a process crashes, a new process is created, given the appropriate recorded local state, and then sent the logged me...

