Abstract:
Software-implemented approaches to fault tolerance are very resilient to change since evolution in hardware technology does not require extensive re-design of specialized hardware. This paper argues the case for implementing fault tolerance in a distributed fashion and reports the approach adopted in the European Delta-4 project. Fault tolerance is achieved by replicating capsules (the run-time representation of application objects) on distributed nodes interconnected by a local area network. Capsule groups can be configured to tolerate either stopping failures or arbitrary failures. Multipoint protocols are used for coordinating capsule groups and for error processing and fault treatment. The paper concludes with a critical analysis of the project's results.

1. Introduction

Many, if not most, modern computing systems are distributed systems. Distribution is often motivated by organizational reasons (e.g., sharing of data in integrated information systems) or physical constraints (e.g...