The process group approach to reliable distributed computing
Communications of the ACM, 1993
Cited by 501 (16 self)
The difficulty of developing reliable distributed software is an impediment to applying distributed computing technology in many settings. Experience with the Isis system suggests that a structured approach based on virtually synchronous process groups yields systems that are substantially easier to develop, exploit sophisticated forms of cooperative computation, and achieve high reliability. This paper reviews six years of research on Isis, describing the model, its implementation challenges, and the types of applications to which Isis has been applied.
1 Introduction
One might expect the reliability of a distributed system to follow directly from the reliability of its constituents, but this is not always the case. The mechanisms used to structure a distributed system and to implement cooperation between components play a vital role in determining how reliable the system will be. Many contemporary distributed operating systems have placed emphasis on communication performance, overlooking the need for tools to integrate components into a reliable whole. The communication primitives supported give generally reliable behavior, but exhibit problematic semantics when transient failures or system configuration changes occur. The resulting building blocks are, therefore, unsuitable for facilitating the construction of systems where reliability is important. This paper reviews six years of research on Isis, a system that provides tools to support the construction of reliable distributed software. The thesis underlying Isis is that development of reliable distributed software can be simplified using process groups and group programming tools. This paper motivates the approach taken, surveys the system, and discusses our experience with real applications.
Transis: A Communication Sub-System for High Availability
1992
Cited by 337 (46 self)
This paper describes Transis, a communication sub-system for high availability. Transis is a transport layer package that supports a variety of reliable multicast message passing services between processors. It provides highly tuned multicast and control services for scalable systems with arbitrary topology. The communication domain comprises a set of processors that can initiate multicast messages to a chosen subset. Transis delivers them reliably and maintains the membership of connected processors automatically, in the presence of arbitrary communication delays, message losses, and processor failures and joins. The contribution of this paper is in providing an aggregate definition of communication and control services over broadcast domains. The main benefit is the efficient implementation of these services using the broadcast capability. In addition, the membership algorithm takes a novel approach to handling partitions and remerging; in allowing the regular flow of messages...
Providing High Availability Using Lazy Replication
1992
Cited by 124 (3 self)
To provide high availability for services such as mail or bulletin boards, data must be replicated. One way to guarantee consistency of replicated data is to force service operations to occur in the same order at all sites, but this approach is expensive. For some applications a weaker causal operation order can preserve consistency while providing better performance. This paper describes a new way of implementing causal operations. Our technique also supports two other kinds of operations: operations that are totally ordered with respect to one another, and operations that are totally ordered with respect to all other operations. The method performs well in terms of response time, operation processing capacity, amount of stored state, and number and size of messages; it does better than replication methods based on reliable multicast techniques. This research was supported in part by the National Science Foundation under Grant CCR-8822158 and in part by the Advanced Research Projects ...
Replication Management Using the State Machine Approach
1993
Cited by 101 (0 self)
This paper is a tutorial on the state machine approach. It describes the approach and its implementation for two representative environments. Small examples suffice to illustrate the points. However, the approach has been successfully applied to larger examples; some of these are mentioned in Section 9. Section 2 describes how a system can be viewed in terms of a state machine, clients, and output devices. Coping with failures is the subject of Sections 3 through 6. An important class of optimizations, based on the use of time, is discussed in Section 7. Section 8 describes dynamic reconfiguration. The history of the approach and related work is discussed in Section 9.
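The core idea behind the state machine approach can be illustrated with a minimal sketch: if every replica is deterministic and applies the same commands in the same order, all replicas end in the same state. The `KVReplica` class, the command format, and the `replay` helper below are hypothetical names chosen for illustration; they are not taken from the tutorial, and the sketch ignores failures and the agreement protocol that would establish the common order.

```python
from dataclasses import dataclass, field


@dataclass
class KVReplica:
    """Hypothetical deterministic state machine: a tiny key-value store."""
    state: dict = field(default_factory=dict)

    def apply(self, command):
        # A command is an (operation, key, value) tuple (assumed format).
        op, key, value = command
        if op == "set":
            self.state[key] = value
        elif op == "add":
            self.state[key] = self.state.get(key, 0) + value
        else:
            raise ValueError(f"unknown operation: {op}")


def replay(log, n_replicas=3):
    """Apply the same ordered log to every replica and return their states."""
    replicas = [KVReplica() for _ in range(n_replicas)]
    for command in log:
        for replica in replicas:
            replica.apply(command)
    return [replica.state for replica in replicas]


log = [("set", "x", 1), ("add", "x", 4), ("set", "y", 7)]
states = replay(log)

# Swapping the first two commands shows why a common order matters.
reordered = replay([log[1], log[0], log[2]], n_replicas=1)[0]
```

In this sketch all three entries of `states` are equal, while `reordered` differs in the value of `x`, which is the reason the approach requires replicas to agree on command order.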
Lazy Replication: Exploiting the Semantics of Distributed Services
In IEEE Computer Society Technical Committee on Operating Systems and Application Environments, 1990
Cited by 86 (2 self)
To provide high availability for services such as mail or bulletin boards, data must be replicated. One way to guarantee consistency of replicated data is to force service operations to occur in the same order at all sites, but this approach is expensive. In this paper, we propose lazy replication as a way to preserve consistency by exploiting the semantics of the service's operations to relax the constraints on ordering. Three kinds of operations are supported: operations for which the clients define the required order dynamically during the execution, operations for which the service defines the order, and operations that must be globally ordered with respect to both client ordered and service ordered operations. The method performs well in terms of response time, amount of stored state, number of messages, and availability. It is especially well suited to applications in which most operations require only the client-defined order.
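As a rough illustration of client-defined ordering, the sketch below defers an update at a replica until every update it depends on has been applied. It uses plain vector timestamps and is a simplification under stated assumptions, not the paper's algorithm: the `Replica` class, the `deps` and `ts` fields, and the `dominates` helper are hypothetical, and the exchange of updates between replicas is left out.

```python
def dominates(a, b):
    """True if vector timestamp a is >= b in every component."""
    return all(x >= y for x, y in zip(a, b))


class Replica:
    """Hypothetical replica that applies updates in causal order."""

    def __init__(self, n_replicas):
        self.ts = [0] * n_replicas  # updates reflected in the local state
        self.applied = []           # names of updates, in application order
        self.pending = []           # updates whose dependencies are missing

    def receive(self, update):
        self.pending.append(update)
        self._drain()

    def _drain(self):
        # Keep applying pending updates until no more become ready.
        progress = True
        while progress:
            progress = False
            for update in list(self.pending):
                if dominates(self.ts, update["deps"]):
                    self.pending.remove(update)
                    self.applied.append(update["name"])
                    # Merge the update's own timestamp into the local one.
                    self.ts = [max(a, b) for a, b in zip(self.ts, update["ts"])]
                    progress = True


# Update B was issued by a client that had already observed update A.
update_a = {"name": "A", "deps": [0, 0], "ts": [1, 0]}
update_b = {"name": "B", "deps": [1, 0], "ts": [1, 1]}

replica = Replica(n_replicas=2)
replica.receive(update_b)          # arrives early, so it is held back
held_back = list(replica.pending)
replica.receive(update_a)          # A applies, which then releases B
```

Updates with no dependency between them can be applied in whatever order they arrive, which is where the relaxed ordering saves work compared with forcing one global order.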
SSP chains: Robust, distributed references supporting Acyclic Garbage Collection
1992
Cited by 79 (18 self)
SSP chains are a novel technique for referencing objects in a distributed system. To client software, any object reference appears to be a local pointer; when the target is remote, an SSP chain adds an indeterminate number of levels of indirection. Copying a reference across the distributed system extends an SSP chain at one end; migrating the target object extends it at the other end. Invocation through an SSP chain is efficient: each stage of an SSP chain contains location information and long chains are short-cut at invocation time. These actions require (almost) no extra messages in addition to those of the client application. The rules for creating, using, modifying and deleting SSP chains are stated precisely and maintain well-defined invariants. The invariants hold even in the presence of message failures (loss, duplication, late delivery); after a crash, the existence invariants must be re-established. SSP chains support distributed garbage collection (GC); we present a robust ...
A Survey of Distributed Garbage Collection Techniques
1995
Cited by 69 (5 self)
This paper is organised as follows. Section 2 first introduces our object model. Section 3 describes the reference count-based approach. In particular, we compare those techniques according to their resilience to message failures. Such counting-based techniques are unable to collect cycles of garbage and must assume that they are rare enough to minimize memory leakage. A number of hybrid proposals, explained in Section 5, combine counting-based techniques with a global (tracing-based) technique. Section 6 surveys some enhanced techniques well suited to distributed settings. Section 7 sums up our conclusions and proposes a taxonomy of the reviewed techniques.
2 Model
Reliable Multicast between Microkernels
In Proceedings of the USENIX Workshop on Micro-Kernels and Other Architectures, 1992
Cited by 42 (13 self)
ISIS is a system for building applications consisting of cooperating, distributed processes. Here we present a new implementation of the ISIS system, geared towards modern microkernel technology. We have adopted similar strategies, such as using basic internal mechanisms for efficiency, external services to implement policies, and light-weight user space constructs for simplicity. The resulting design is less complex and more efficient than the present one. This discussion focuses on the integration of our new system into the MACH and Chorus systems, and discusses the status and performance of an initial implementation.
1. Introduction
ISIS [1, 2], developed at Cornell University, is a system for building applications consisting of cooperating, distributed processes. Group management and group communication are two basic building blocks provided by ISIS. ISIS has been very successful, and there is currently a demand for a version that will run on many different environments and transp...
Distributed Garbage Collection for Network Objects
Systems Research Center, 130 Lytton Avenue, Palo Alto, CA 94301, 1993
Cited by 37 (3 self)
In this report we present a fault-tolerant and efficient algorithm for distributed garbage collection and prove its correctness. The algorithm is a generalization of reference counting; it maintains a set of identifiers for processes with references to an object. The set is maintained with pair-wise communication between processes, so no global synchronization is required. The primary cost for maintaining the set is one remote procedure call when an object reference is transferred to a new process for the first time. The distributed collector collaborates with the local collector in detecting garbage; any local collector may be used, so long as it can be extended to provide notification when an object is collected. In fact, the distributed collector could be used without a local collector; in that case, the programmer would insert explicit dispose commands to release an object. The algorithm was designed and implemented as part of the Modula-3 network objects system, but it should be s...
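The idea of keeping a set of holder identifiers instead of a bare count can be sketched as follows. This is only an illustration of the bookkeeping described in the abstract, under simplifying assumptions: the `OwnerTable` class and its method names are hypothetical, everything runs in one process, and the message ordering and failure handling that the report addresses are omitted.

```python
class OwnerTable:
    """Hypothetical owner-side table: object id -> ids of processes holding a reference."""

    def __init__(self):
        self.holders = {}

    def export(self, obj_id):
        """Start tracking an object that may be referenced remotely."""
        self.holders.setdefault(obj_id, set())

    def register(self, obj_id, process_id):
        """Record a holder; returns True only the first time this process is added.

        In the abstract's cost model, that first transfer is the case that
        costs one remote procedure call.
        """
        holders = self.holders[obj_id]
        is_new = process_id not in holders
        holders.add(process_id)
        return is_new

    def release(self, obj_id, process_id):
        """Drop a holder; releasing twice is harmless because a set is idempotent."""
        self.holders[obj_id].discard(process_id)

    def collectable(self, obj_id):
        """No remote holder remains, so the local collector may reclaim the object."""
        return not self.holders[obj_id]


table = OwnerTable()
table.export("obj1")
first_a = table.register("obj1", "A")    # new holder
repeat_a = table.register("obj1", "A")   # same holder again, nothing new
first_b = table.register("obj1", "B")

table.release("obj1", "A")
still_held = not table.collectable("obj1")   # B still holds a reference
table.release("obj1", "B")
table.release("obj1", "B")                   # duplicate release has no effect
now_free = table.collectable("obj1")
```

One reason to prefer a set over a counter in a sketch like this is that a duplicated register or release message leaves the set unchanged, whereas it would corrupt a plain reference count.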
Introduction to the Theory of Nested Transactions
1988
Cited by 35 (8 self)
A new formal model is presented for studying concurrency and resiliency properties for nested transactions. The model is used to state and prove correctness of a well-known locking algorithm.

