Understanding Fault-Tolerant Distributed Systems
- Communications of the ACM
, 1993
Cited by 296 (23 self)
We propose a small number of basic concepts that can be used to explain the architecture of fault-tolerant distributed systems and we discuss a list of architectural issues that we find useful to consider when designing or examining such systems. For each issue we present known solutions and design alternatives, we discuss their relative merits and we give examples of systems which adopt one approach or the other. The aim is to introduce some order in the complex discipline of designing and understanding fault-tolerant distributed systems.
1 Introduction
Computing systems consist of a multitude of hardware and software components that are bound to fail eventually. In many systems, such component failures can lead to unanticipated, potentially disruptive failure behavior and to service unavailability. Some systems are designed to be fault-tolerant: they either exhibit a well-defined failure behavior when components fail or mask component failures to users, that is, continue to provid...
Recovery in Distributed Systems Using Optimistic Message Logging and Checkpointing
, 1988
Cited by 199 (13 self)
In a distributed system using message logging and checkpointing to provide fault tolerance, there is always a unique maximum recoverable system state, regardless of the message logging protocol used. The proof of this relies on the observation that the set of system states that have occurred during any single execution of a system forms a lattice, with the sets of consistent and recoverable system states as sublattices. The maximum recoverable system state never decreases, and if all messages are eventually logged, the domino effect cannot occur. This paper presents a general model for reasoning about recovery in such a system and, based on this model, an efficient algorithm for determining the maximum recoverable system state at any time. This work unifies existing approaches to fault tolerance based on message logging and checkpointing, and improves on existing methods for optimistic recovery in distributed systems.
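The idea of a unique maximum recoverable system state can be illustrated with a naive fixpoint search. This is only a sketch under stated assumptions, not the efficient algorithm the abstract refers to: the names `dep`, `stable_upto`, and `max_recoverable_state` are hypothetical, state intervals are numbered from 0, interval 0 of every process is assumed to depend on nothing later than interval 0 of any other process, and dependency entries are assumed never to decrease as a process advances.

```python
from typing import List


def max_recoverable_state(dep: List[List[List[int]]],
                          stable_upto: List[int]) -> List[int]:
    """Return, per process, the highest state interval in the largest
    consistent system state that can still be recreated.

    dep[p][k][q] (assumed layout): highest state interval of process q
    that interval k of process p depends on.
    stable_upto[p] (assumed): highest interval of p that can be recreated
    from its checkpoint and logged messages.
    """
    n = len(dep)
    cur = list(stable_upto)          # start from the most optimistic candidate
    changed = True
    while changed:
        changed = False
        for j in range(n):
            for i in range(n):
                # Roll process j back while it depends on an interval of
                # process i that is not part of the candidate state.
                while i != j and dep[j][cur[j]][i] > cur[i]:
                    cur[j] -= 1
                    changed = True
    return cur


# Hypothetical data: two processes with three state intervals each.
dep = [
    [[0, 0], [1, 0], [2, 1]],  # process 0: interval 2 depends on interval 1 of process 1
    [[0, 0], [1, 1], [1, 2]],  # process 1: interval 1 depends on interval 1 of process 0
]
```

With `stable_upto = [2, 0]`, process 1 can only be recreated up to interval 0, so process 0 has to give up interval 2, which depends on interval 1 of process 1. Because each candidate index only ever moves downward and stops as soon as the state is consistent, the search ends at a single answer regardless of the order in which processes are visited.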
Sender-based message logging
Cited by 123 (10 self)
Sender-based message logging is a new low-overhead mechanism for providing transparent fault-tolerance in distributed systems. It differs from conventional message logging mechanisms in that each message is logged in volatile memory on the machine from which the message is sent. Keeping the message log in the sender's local memory allows us to recover from a single failure at a time without the expense of synchronously logging each message to stable storage. The message log is then asynchronously written to stable storage, without delaying the computation, as part of the sender's periodic checkpoint. Maintaining the sender-based message log requires at most one extra network packet over non-fault-tolerant reliable message communication and imposes little additional synchronization delay. It can be applied transparently to existing distributed applications and does not require specialized hardware. It is currently being implemented on a network of SUN workstations.
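The mechanism described above can be sketched in a few lines. This is an illustrative model only, not the authors' implementation: the class and function names are hypothetical, the use of a receive sequence number (RSN) returned to the sender is an assumption that is not stated in the abstract, network transport is replaced by a direct method call, and the failed process is restarted from its initial state rather than from a checkpoint.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class LoggedMessage:
    ssn: int                    # send sequence number chosen by the sender
    dest: str                   # name of the receiving process
    payload: str
    rsn: Optional[int] = None   # receive order, reported back by the receiver (assumed)


@dataclass
class Process:
    name: str
    send_log: List[LoggedMessage] = field(default_factory=list)  # volatile, sender-side log
    next_ssn: int = 0
    next_rsn: int = 0
    delivered: List[str] = field(default_factory=list)

    def send(self, dest: "Process", payload: str) -> None:
        entry = LoggedMessage(self.next_ssn, dest.name, payload)
        self.next_ssn += 1
        self.send_log.append(entry)          # log in the sender's own memory
        entry.rsn = dest.receive(payload)    # the acknowledgement carries the RSN back

    def receive(self, payload: str) -> int:
        rsn = self.next_rsn
        self.next_rsn += 1
        self.delivered.append(payload)
        return rsn

    def messages_for(self, dest_name: str) -> List[LoggedMessage]:
        return [m for m in self.send_log
                if m.dest == dest_name and m.rsn is not None]


def recover(failed_name: str, peers: List[Process]) -> Process:
    """Rebuild a failed process by replaying what its peers logged for it."""
    fresh = Process(failed_name)
    logged = [m for p in peers for m in p.messages_for(failed_name)]
    for m in sorted(logged, key=lambda m: m.rsn):   # replay in original receive order
        fresh.receive(m.payload)
    return fresh


a, b, c = Process("a"), Process("b"), Process("c")
a.send(c, "x1")
b.send(c, "y1")
a.send(c, "x2")
recovered_c = recover("c", [a, b])
```

The sketch handles one failed receiver at a time, matching the single-failure assumption in the abstract. It does not model the periodic flush of the log to stable storage or the suppression of messages that the recovering process would re-send during replay.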
Distributed System Fault Tolerance Using Message Logging and Checkpointing
, 1989
Cited by 50 (9 self)
Fault tolerance can allow processes executing in a computer system to survive failures within the system. This thesis addresses the theory and practice of transparent fault-tolerance methods using message logging and checkpointing in distributed systems. A general model for reasoning about the behavior and correctness of these methods is developed, and the design, implementation, and performance of two new low-overhead methods based on this model are presented. No specialized hardware is required with these new methods. The model is independent of the protocols used in the system. Each process state is represented by a dependency vector, and each system state is represented by a dependency matrix showing a collection of process states. The set of system states that have occurred during any single execution of a system forms a lattice, with the sets of consistent and recoverable system states as sublattices. There is thus always a unique maximum recoverable system state. The first method presented uses a new pessimistic message logging protocol called
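A dependency matrix of the kind mentioned above lends itself to a simple consistency test. The following is a sketch of one plausible reading of the model, not code from the thesis: it assumes that row `i` is the dependency vector of process `i`, that the diagonal entry `matrix[i][i]` is the index of the state interval process `i` is currently in, and that `matrix[j][i]` is the highest interval of process `i` on which process `j` depends. The function name `is_consistent` is hypothetical.

```python
from typing import List


def is_consistent(matrix: List[List[int]]) -> bool:
    """A system state is treated as consistent when no process depends on a
    state interval of another process later than the one that process is in."""
    n = len(matrix)
    return all(matrix[j][i] <= matrix[i][i]
               for i in range(n)
               for j in range(n))


# Process 1 depends on interval 1 of process 0, which process 0 has reached.
consistent_state = [[2, 1],
                    [1, 2]]

# Process 1 depends on interval 2 of process 0, but process 0 is only in interval 1.
inconsistent_state = [[1, 0],
                      [2, 1]]
```

Under these assumptions the check reduces to comparing each column against its diagonal entry, which is why the matrix form is convenient for reasoning about which combinations of process states could have occurred together.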
Exploring Failure Transparency and the Limits of Generic Recovery
- In Proc. 4th USENIX Symposium on Operating Systems Design and Implementation
, 2000
Cited by 46 (7 self)
Abstract: We explore the abstraction of failure transparency in which the operating system provides the illusion of failure-free operation. To provide failure transparency, an operating system must recover applications after hardware, operating system, and application failures, and must do so without help from the programmer or unduly slowing failure-free performance. We describe two invariants that must be upheld to provide failure transparency: one that ensures sufficient application state is saved to guarantee the user cannot discern failures, and another that ensures sufficient application state is lost to allow recovery from failures affecting application state. We find that several real applications get failure transparency in the presence of simple stop failures with overhead of 0-12%. Less encouragingly, we find that applications violate one invariant in the course of upholding the other for more than 90% of application faults and 3-15% of operating system faults, rendering transparent recovery impossible for these cases.
Transparent Optimistic Rollback Recovery
Cited by 44 (4 self)
Optimistic rollback recovery methods can efficiently and transparently provide fault tolerance for applications executing in a distributed system. With roll-
Fast Cluster Failover Using Virtual Memory-Mapped Communication
- In Proc. 13th International Conference on Supercomputing
, 1999
Cited by 25 (3 self)
This paper proposes a novel way to use virtual memory mapped communication (VMMC) to reduce the failover time on clusters. With the VMMC model, applications' virtual address space can be efficiently mirrored on remote memory either automatically or via explicit messages. When a machine fails, its applications can restart from the most recent checkpoints on the failover node with minimal memory copying and disk I/O overhead. This method requires little change to applications' source code. We developed two fast failover protocols: deliberate update failover protocol (DU) and automatic update failover protocol (AU). The first can run on any system that supports VMMC, whereas the other requires special network interface support. We implemented these two protocols...
Efficient Transparent Application Recovery In Client-Server Information Systems
- In Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data
, 1998
Cited by 17 (2 self)
Database systems recover persistent data, providing high database availability. However, database applications, typically residing on client or "middle-tier" application-server machines, may lose work because of a server failure. This prevents the masking of server failures from the human user and substantially degrades application availability. This paper aims to enable high application availability with an integrated method for database server recovery and transparent application recovery in a client-server system. The approach, based on application message logging, is similar to earlier work on distributed system fault tolerance. However, we exploit advanced database logging and recovery techniques and request/reply messaging properties to significantly improve efficiency. Forced log I/Os, frequently required by other methods, are usually avoided. Restart time, for both failed server and failed client, is reduced by checkpointing and log truncation. Our method ensures that a server...
Transparent fault tolerance for web services based architectures
- In Eighth International Europar Conference (EUROPAR’02), Lecture Notes in Computer Science, Paderborn
, 2002
Cited by 16 (1 self)
Abstract. Service-based architectures enable the development of new classes of Grid and distributed applications. One of the main capabilities provided by such systems is the dynamic and flexible integration of services, according to which services are allowed to be a part of more than one distributed system and simultaneously serve different applications. This increased flexibility in system composition makes it difficult to address classical distributed system issues such as fault-tolerance. While it is relatively easy to make an individual service fault-tolerant, improving fault-tolerance of services collaborating in multiple application scenarios is a challenging task. In this paper, we look at the issue of developing fault-tolerant service-based distributed systems, and propose an infrastructure to implement fault tolerance capabilities transparent to services.
An Evaluation of the Recovery-Related Properties of Software Faults
, 2000
Cited by 8 (1 self)
this document. The last section of this chapter describes in brief the software we used to perform some of the evaluation work done in this thesis. This software is the work of other students during their dissertation work at the University of Michigan.

