Results 1 - 10 of 36
Reliable Communication in the Presence of Failures
- ACM Transactions on Computer Systems
, 1987
Abstract - Cited by 546 (18 self)
The design and correctness of a communication facility for a distributed computer system are reported on. The facility provides support for fault-tolerant process groups in the form of a family of reliable multicast protocols that can be used in both local- and wide-area networks. These protocols attain high levels of concurrency, while respecting application-specific delivery ordering constraints, and have varying cost and performance that depend on the degree of ordering desired. In particular, a protocol that enforces causal delivery orderings is introduced and shown to be a valuable alternative to conventional asynchronous communication protocols. The facility also ensures that the processes belonging to a fault-tolerant process group will observe consistent orderings of events affecting the group as a whole, including process failures, recoveries, migration, and dynamic changes to group properties like member rankings. A review of several uses for the protocols in the ISIS system, which supports fault-tolerant resilient objects and bulletin boards, illustrates the significant simplification of higher level algorithms made possible by our approach.
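The causal delivery ordering this abstract introduces can be illustrated with the standard vector-clock deliverability test: a message is delivered only after every message that causally precedes it. The sketch below is a minimal illustration of that condition, not the actual ISIS CBCAST protocol; class and method names are hypothetical.

```python
# Minimal sketch of causal delivery using vector clocks (an illustration of
# the ordering condition, not the ISIS CBCAST protocol itself).

class CausalProcess:
    def __init__(self, pid, n):
        self.pid = pid
        self.clock = [0] * n          # one entry per group member
        self.pending = []             # messages held back for causality

    def send(self):
        """Stamp an outgoing multicast with the sender's vector clock."""
        self.clock[self.pid] += 1
        return (self.pid, list(self.clock))

    def _deliverable(self, sender, stamp):
        # Deliverable iff it is the next message from `sender` and we have
        # already delivered everything the sender had seen when sending.
        if stamp[sender] != self.clock[sender] + 1:
            return False
        return all(stamp[k] <= self.clock[k]
                   for k in range(len(stamp)) if k != sender)

    def receive(self, sender, stamp):
        """Buffer the message, then deliver all that became deliverable."""
        self.pending.append((sender, stamp))
        delivered, progress = [], True
        while progress:
            progress = False
            for msg in list(self.pending):
                s, st = msg
                if self._deliverable(s, st):
                    self.clock[s] = st[s]
                    self.pending.remove(msg)
                    delivered.append(msg)
                    progress = True
        return delivered

# Usage: p0 multicasts m1 then m2; p1 receives them out of order.
p0, p1 = CausalProcess(0, 2), CausalProcess(1, 2)
m1, m2 = p0.send(), p0.send()
held = p1.receive(*m2)       # m2 arrives first: held back, nothing delivered
flushed = p1.receive(*m1)    # m1 arrives: both delivered, in causal order
```

Note how the out-of-order message is simply buffered until its causal predecessors arrive, which is what lets higher layers assume sends happen in a consistent order.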
Checkpointing and Rollback-Recovery for Distributed Systems
- IEEE Transactions on Software Engineering
, 1987
Abstract - Cited by 366 (0 self)
We consider the problem of bringing a distributed system to a consistent state after transient failures. We address the two components of this problem by describing a distributed algorithm to create consistent checkpoints, as well as a rollback-recovery algorithm to recover the system to a consistent state. In contrast to previous algorithms, they tolerate failures that occur during their executions. Furthermore, when a process takes a checkpoint, a minimal number of additional processes are forced to take checkpoints. Similarly, when a process rolls back and restarts after a failure, a minimal number of additional processes are forced to roll back with it. Our algorithms require each process to store at most two checkpoints in stable storage. This storage requirement is shown to be minimal under general assumptions.
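The consistency condition such checkpointing algorithms maintain can be stated simply: the saved global state must contain no "orphan" message, i.e. one recorded as received but not as sent. The sketch below checks that condition over a set of per-process checkpoints; it illustrates the invariant only, not the paper's coordination protocol, and the data layout is hypothetical.

```python
# A minimal sketch of the global-state consistency test behind
# checkpoint/rollback algorithms: no checkpoint may record the receipt of a
# message that no checkpoint records as sent (an "orphan" after rollback).

def consistent(checkpoints):
    """checkpoints: {pid: {'sent': set of msg ids, 'received': set of msg ids}}"""
    all_sent = set()
    for cp in checkpoints.values():
        all_sent |= cp['sent']
    # Consistent iff every received message appears in some sender's sent set.
    return all(cp['received'] <= all_sent for cp in checkpoints.values())

# m1 was sent by process 0 and received by process 1: consistent.
ok = {0: {'sent': {'m1'}, 'received': set()},
      1: {'sent': set(), 'received': {'m1'}}}
# Process 1 recorded receiving m1, but nobody recorded sending it: orphan.
orphaned = {0: {'sent': set(), 'received': set()},
            1: {'sent': set(), 'received': {'m1'}}}
```

A rollback algorithm that restores a state failing this test would replay a receipt with no matching send, which is exactly what forcing a minimal set of extra processes to checkpoint or roll back avoids.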
Exploiting virtual synchrony in distributed systems
, 1987
Abstract - Cited by 360 (30 self)
We describe applications of a virtually synchronous environment for distributed programming, which underlies a collection of distributed programming tools in the ISIS2 system. A virtually synchronous environment allows processes to be structured into process groups, and makes events like broadcasts to the group as an entity, group membership changes, and even migration of an activity from one place to another appear to occur instantaneously -- in other words, synchronously. A major advantage to this approach is that many aspects of a distributed application can be treated independently without compromising correctness. Moreover, user code that is designed as if the system were synchronous can often be executed concurrently. We argue that this approach to building distributed and fault-tolerant software is more straightforward, more flexible, and more likely to yield correct solutions than alternative approaches.
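The "events appear to occur instantaneously" idea can be made concrete with a toy total order: if broadcasts and membership changes are placed in one sequence, every member replays the same history of views and messages. The centralized sequencer below only illustrates the programming model; the real system achieves this in a distributed, fault-tolerant way, and all names here are hypothetical.

```python
# A toy sketch of virtual synchrony's programming model: one sequencer
# totally orders group events (messages and membership changes), so all
# members observe identical sequences of views and broadcasts.

class GroupSequencer:
    def __init__(self):
        self.members = []
        self.log = []                 # totally ordered event log

    def join(self, pid):
        self.members.append(pid)
        self.log.append(('view', tuple(self.members)))

    def leave(self, pid):
        self.members.remove(pid)
        self.log.append(('view', tuple(self.members)))

    def broadcast(self, sender, payload):
        self.log.append(('msg', sender, payload))

    def history(self, pid):
        """Every member replays the same log, hence sees the same order."""
        return list(self.log)

# Usage: two members join, one broadcast, one departure.
g = GroupSequencer()
g.join('a')
g.join('b')
g.broadcast('a', 'hello')
g.leave('a')
```

Because both members see the broadcast delivered in the same view, application code can reason as if the membership change and the message were serialized, which is the simplification the abstract claims.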
Orca: A language for parallel programming of distributed systems
- IEEE Transactions on Software Engineering
, 1992
Abstract - Cited by 332 (46 self)
Orca is a language for implementing parallel applications on loosely coupled distributed systems. Unlike most languages for distributed programming, it allows processes on different machines to share data. Such data are encapsulated in data-objects, which are instances of user-defined abstract data types. The implementation of Orca takes care of the physical distribution of objects among the local memories of the processors. In particular, an implementation may replicate and/or migrate objects in order to decrease access times to objects and increase parallelism. This paper gives a detailed description of the Orca language design and motivates the design choices. Orca is intended for applications programmers rather than systems programmers. This is reflected in its design goals to provide a simple, easy to use language that is type-secure and provides clean semantics. The paper discusses three example parallel applications in Orca, one of which is described in detail. It also describes one of the existing implementations, which is based on reliable broadcasting. Performance measurements of this system are given for three parallel applications. The measurements show that significant speedups can be obtained for all three applications. Finally, the paper compares Orca with several related languages and systems. 1.
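The shared data-object model the abstract describes, with reads served locally and write operations applied at every replica, can be sketched as follows. This mirrors the broadcast-based implementation only in spirit (the "broadcast" is simulated by a loop), and the class and method names are hypothetical, not Orca syntax.

```python
# A minimal sketch of a replicated shared data-object: read-only operations
# are served from a local copy; write operations are applied deterministically
# at every replica, standing in for a reliable, totally ordered broadcast.

class ReplicatedObject:
    def __init__(self, n, initial=0):
        self.replicas = [initial] * n     # one copy per processor

    def read(self, node):
        return self.replicas[node]        # local, cheap read

    def write(self, fn):
        # "Broadcast" the operation: every replica applies the same update.
        self.replicas = [fn(v) for v in self.replicas]

# Usage: a counter object shared by three processors.
counter = ReplicatedObject(n=3)
counter.write(lambda v: v + 5)
```

Replicating the object makes reads purely local, which is the access-time and parallelism benefit the paper attributes to replication; the cost is that every write touches all copies.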
Building Secure and Reliable Network Applications
, 1996
Abstract - Cited by 230 (16 self)
Abstractly, the remote procedure call problem, which an RPC protocol undertakes to solve, consists of emulating LPC using message passing. LPC has a number of "properties" -- a single procedure invocation results in exactly one execution of the procedure body, the result returned is reliably delivered to the invoker, and exceptions are raised if (and only if) an error occurs. Given a completely reliable communication environment, which never loses, duplicates, or reorders messages, and given client and server processes that never fail, RPC would be trivial to solve. The sender would merely package the invocation into one or more messages, and transmit these to the server. The server would unpack the data into local variables, perform the desired operation, and send back the result (or an indication of any exception that occurred) in a reply message. The challenge, then, is created by failures. Were it not for the possibility of process and machine crashes, an RPC protocol capable of overcomi...
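The failure handling this passage motivates is commonly addressed with retransmission plus duplicate suppression: the client resends until a reply arrives, and the server caches replies by request id so a retransmitted request does not re-execute the body (at-most-once semantics). The sketch below simulates a lost reply in-process; it is an illustration of the technique, not the book's protocol, and all names are hypothetical.

```python
# A sketch of at-most-once RPC over lossy messaging: client retransmits,
# server deduplicates via a reply cache keyed by request id.

class Server:
    def __init__(self, procedure):
        self.procedure = procedure
        self.reply_cache = {}             # request id -> cached reply
        self.executions = 0

    def handle(self, req_id, arg):
        if req_id not in self.reply_cache:    # duplicate? reuse old reply
            self.executions += 1
            self.reply_cache[req_id] = self.procedure(arg)
        return self.reply_cache[req_id]

def call(server, req_id, arg, drop_first_reply=False):
    """Client stub: retransmit until a reply gets through."""
    attempts = 0
    while True:
        attempts += 1
        reply = server.handle(req_id, arg)
        if drop_first_reply and attempts == 1:
            continue                      # simulate a lost reply message
        return reply

# Usage: the first reply is "lost", forcing a retransmission.
s = Server(lambda x: x * 2)
result = call(s, 'r1', 21, drop_first_reply=True)
```

The reply cache is what restores the "exactly one execution" property of LPC in the face of duplicated requests; handling server crashes (losing the cache) is the harder part the chapter goes on to discuss.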
Maintaining Availability in Partitioned Replicated Databases
- ACM Transactions on Database Systems
, 1989
Abstract - Cited by 103 (4 self)
In a replicated database, a data item may have copies residing on several sites. A replica control protocol is necessary to ensure that data items with several copies behave as if they consist of a single copy, as far as users can tell. We describe a new replica control protocol that allows the accessing of data in spite of site failures and network partitioning. This protocol provides the database designer with a large degree of flexibility in deciding the degree of data availability, as well as the cost of accessing data.
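The flexibility the abstract mentions, trading availability against access cost, is the hallmark of quorum-style replica control: the designer picks read and write quorum sizes r and w with r + w > n, so any read quorum intersects any write quorum and sees the latest version. The sketch below is classic weighted voting used as an illustration; it is not this paper's specific protocol, and the class name is hypothetical.

```python
# A sketch of quorum-based replica control: versioned copies, with read and
# write quorums sized so that they must intersect (r + w > n).

class QuorumStore:
    def __init__(self, n, r, w):
        assert r + w > n, "read and write quorums must intersect"
        self.n, self.r, self.w = n, r, w
        self.copies = [(0, None)] * n     # (version, value) at each site

    def write(self, up_sites, value):
        if len(up_sites) < self.w:
            raise RuntimeError("write quorum unavailable")
        version = max(self.copies[s][0] for s in up_sites) + 1
        for s in up_sites:
            self.copies[s] = (version, value)

    def read(self, up_sites):
        if len(up_sites) < self.r:
            raise RuntimeError("read quorum unavailable")
        # The highest-versioned copy in any read quorum is the latest write.
        return max((self.copies[s] for s in up_sites), key=lambda c: c[0])[1]

# Usage: n=5 sites; small read quorum, large write quorum favors cheap reads.
db = QuorumStore(n=5, r=2, w=4)
db.write([0, 1, 2, 3], 'v1')      # site 4 is down or partitioned away
```

Choosing r and w is exactly the designer's knob: r=1, w=n gives fast reads but writes block on any failure, while larger r tolerates more unavailable sites on the write path.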
Ficus: A Very Large Scale Reliable Distributed File System
- University of California, Los Angeles
, 1991
Abstract - Cited by 48 (7 self)
The dissertation presents the issues addressed in the design of Ficus, a large scale wide area distributed file system currently operational on a modest scale at UCLA. Key aspects of providing such a service include toleration of partial operation in virtually all areas; support for large scale, optimistic data replication; and a flexible, extensible modular design. Ficus incorporates a "stackable layers" modular architecture and full support for optimistic replication. Replication is provided by a pair of layers operating in concert above a traditional filing service. A "volume" abstraction and on-the-fly volume "grafting" mechanism are used to manage the large scale file name space. The replication service uses a f...
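Optimistic replication of the kind Ficus supports requires detecting when two replicas were updated concurrently. The standard mechanism is the version-vector comparison sketched below: if one vector dominates, one replica is simply newer; if neither dominates, there is a genuine update/update conflict. This shows the general technique such systems rely on, not Ficus's specific reconciliation code.

```python
# The standard version-vector comparison used for conflict detection in
# optimistically replicated file systems.

def compare(v1, v2):
    """Return 'equal', 'newer', 'older', or 'conflict' for two version vectors."""
    ge = all(a >= b for a, b in zip(v1, v2))
    le = all(a <= b for a, b in zip(v1, v2))
    if ge and le:
        return 'equal'
    if ge:
        return 'newer'    # v1 dominates: replica 1 has strictly more updates
    if le:
        return 'older'
    return 'conflict'     # incomparable: concurrent updates at both replicas
```

Because updates are accepted everywhere without locking, conflicts are possible by design; the vector test is what lets the system tell harmless staleness apart from conflicts that need reconciliation.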
A Distributed Implementation Of The Shared Data-Object Model
- in USENIX Workshop on Experiences with Building Distributed and Multiprocessor Systems
, 1989
Abstract - Cited by 32 (8 self)
The shared data-object model is designed to ease the implementation of parallel applications on loosely coupled distributed systems. Unlike most other models for distributed programming (e.g., RPC), the shared data-object model allows processes on different machines to share data. Such data are encapsulated in data-objects, which are instances of user-defined abstract data types. The shared data-object model forms the basis of a new language for distributed programming, Orca, which gives linguistic support for parallelism and data-objects. A distributed implementation of the shared data-object model should take care of the physical distribution of objects among the local memories of the processors. In particular, an implementation may replicate objects in order to decrease access times to objects and increase parallelism. The intent of this paper is to show that, for several applications, the proposed model is both easy to use and efficient. We first give a brief description of the sh...
The Architecture and Implementation of a Distributed Hypermedia Storage System
- in HT'93
Abstract - Cited by 32 (2 self)
Our project is studying the process by which groups of individuals work together to build large, complex structures of ideas and is developing a distributed hypermedia collaboration environment (called ABC) to support that process. This paper focuses on the architecture and implementation of the Distributed Graph Storage (DGS) component of ABC. The DGS supports a graph-based data model, conservatively extended to meet hypermedia requirements. Some important issues addressed in the system include scale, performance, concurrency semantics, access protection, location independence, and replication (for fault tolerance).
Partial Database Replication and Group Communication Primitives (Extended Abstract)
- in Proceedings of the 2nd European Research Seminar on Advances in Distributed Systems (ERSADS'97)
, 1997
Abstract - Cited by 25 (6 self)
Gustavo Alonso, Database Research Group, Institute for Information Systems, ETH Zentrum, Zurich CH-8092, Switzerland. E-mail: alonso@inf.ethz.ch. January 17, 1997. 1 Introduction. Existing research on replication is generally based on synchronous replication (all copies are kept consistent at all times) and update-everywhere (any replica can be updated) approaches. There is a strong belief among database designers, however, that synchronous, update-everywhere replication is simply not feasible in a database environment [GBH+96, Gol94, Sta94]. Among the arguments behind this belief are the high probability of deadlocks that replication introduces and the difficulty of scaling any replication approach beyond a few sites. As a result, most current database replication solutions are asynchronous (copies are not kept consistent at all times) and based on a primary-copy approach (only one master copy can be updated; all other replicas are read-only) [Gol94, Sta94]. An additional argument...
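The asynchronous, primary-copy scheme the passage describes as common practice can be sketched in a few lines: all updates go to a single master, secondaries are read-only and receive changes lazily, so they may serve stale data between propagation rounds. This is an illustration of the general scheme, not the paper's proposal, and the names are hypothetical.

```python
# A sketch of primary-copy, asynchronous replication: updates are accepted
# only at the master and pushed to read-only secondaries in a later round.

class PrimaryCopyDB:
    def __init__(self, n_secondaries):
        self.primary = {}
        self.secondaries = [dict() for _ in range(n_secondaries)]
        self.pending = []                 # updates not yet propagated

    def update(self, key, value):
        self.primary[key] = value         # only the master copy is updatable
        self.pending.append((key, value))

    def read(self, replica, key):
        return self.secondaries[replica].get(key)   # may be stale

    def propagate(self):
        """Asynchronous refresh: push queued updates to every secondary."""
        for key, value in self.pending:
            for secondary in self.secondaries:
                secondary[key] = value
        self.pending = []

# Usage: a read at a secondary before propagation observes stale data.
db = PrimaryCopyDB(2)
db.update('x', 1)
stale = db.read(0, 'x')
db.propagate()
fresh = db.read(0, 'x')
```

The window between `update` and `propagate` is exactly the inconsistency that synchronous, update-everywhere replication would eliminate, and the scalability concerns cited above are why practitioners accept it anyway.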