Results 1 - 10
of
267
A reliable multicast framework for light-weight sessions and application level framing
- IEEE/ACM Transactions on Networking
, 1995
"... Abstract — This paper describes Scalable Reliable Multicast (SRM), a reliable multicast framework for light-weight sessions and application level framing. The algorithms of this framework are efficient, robust, and scale well to both very large networks and very large sessions. The SRM framework has ..."
Abstract
-
Cited by 945 (46 self)
- Add to MetaCart
Abstract — This paper describes Scalable Reliable Multicast (SRM), a reliable multicast framework for light-weight sessions and application level framing. The algorithms of this framework are efficient, robust, and scale well to both very large networks and very large sessions. The SRM framework has been prototyped in wb, a distributed whiteboard application, which has been used on a global scale with sessions ranging from a few to a few hundred participants. The paper describes the principles that have guided the SRM design, including the IP multicast group delivery model, an end-to-end, receiver-based model of reliability, and the application level framing protocol model. As with unicast communications, the performance of a reliable multicast delivery algorithm depends on the underlying topology and operational environment. We investigate that dependence via analysis and simulation, and demonstrate an adaptive algorithm that uses the results of previous loss recovery events to adapt the control parameters used for future loss recovery. With the adaptive algorithm, our reliable multicast delivery algorithm provides good performance over a wide range of underlying topologies. Index Terms—Computer networks, computer network performance, Internetworking.
Unreliable Failure Detectors for Reliable Distributed Systems
- Journal of the ACM
, 1996
"... We introduce the concept of unreliable failure detectors and study how they can be used to solve Consensus in asynchronous systems with crash failures. We characterise unreliable failure detectors in terms of two properties — completeness and accuracy. We show that Consensus can be solved even with ..."
Abstract
-
Cited by 807 (17 self)
- Add to MetaCart
We introduce the concept of unreliable failure detectors and study how they can be used to solve Consensus in asynchronous systems with crash failures. We characterise unreliable failure detectors in terms of two properties — completeness and accuracy. We show that Consensus can be solved even with unreliable failure detectors that make an infinite number of mistakes, and determine which ones can be used to solve Consensus despite any number of crashes, and which ones require a majority of correct processes. We prove that Consensus and Atomic Broadcast are reducible to each other in asynchronous systems with crash failures; thus the above results also apply to Atomic Broadcast. A companion paper shows that one of the failure detectors introduced here is the weakest failure detector for solving Consensus [Chandra et al. 1992].
Reliable Multicast Transport Protocol (RMTP)
"... This paper presents the design, implementation and performance of a reliable multicast transport protocol called RMTP. RMTP is based on a hierarchical structure in which receivers are grouped into local regions or domains and in each domain there is a special receiver called a Designated Receiver (D ..."
Abstract
-
Cited by 554 (9 self)
- Add to MetaCart
This paper presents the design, implementation and performance of a reliable multicast transport protocol called RMTP. RMTP is based on a hierarchical structure in which receivers are grouped into local regions or domains and in each domain there is a special receiver called a Designated Receiver (DR) which is responsible for sending acknowledgments periodically to the sender, for processing acknowledgements from receivers in its domain and for retransmitting lost packets to the corresponding receivers. Since lost packets are recovered by local retransmissions as opposed to retransmissions from the original sender, end-to-end latency is significantly reduced, and the overall throughput is improved as well. Also, since only the DRs send their acknowledgments to the sender, instead of all receivers sending their acknowledgments to the sender, a single acknowledgement is generated per local region, and this prevents acknowledgement implosion. Receivers in RMTP send their acknowledgments to the DRs periodically, thereby simplifying error recovery. In addition, lost packets are recovered by selective repeat retransmissions, leading to improved throughput at the cost of minimal additional buffering at the receivers. This paper also describes the implementation of RMTP and its performance on the Internet.
Lightweight causal and atomic group multicast
- ACM Transactions on Computer Systems
, 1991
"... (DoD) under DARPA/NASA subcontract NAG2-593 administered by the NASA ..."
Abstract
-
Cited by 542 (44 self)
- Add to MetaCart
(DoD) under DARPA/NASA subcontract NAG2-593 administered by the NASA
Reliable Communication in the Presence of Failures
- ACM Transactions on Computer Systems
, 1987
"... The design and correctness of a communication facility for a distributed computer system are reported on. The facility provides support for fault-tolerant process groups in the form of a family of reliable multicast protocols that can be used in both local- and wide-area networks. These protocols at ..."
Abstract
-
Cited by 480 (13 self)
- Add to MetaCart
The design and correctness of a communication facility for a distributed computer system are reported on. The facility provides support for fault-tolerant process groups in the form of a family of reliable multicast protocols that can be used in both local- and wide-area networks. These protocols attain high levels of concurrency, while respecting application-specific delivery ordering constraints, and have varying cost and performance that depend on the degree of ordering desired. In particular, a protocol that enforces causal delivery orderings is introduced and shown to be a valuable alternative to conventional asynchronous communication protocols. The facility also ensures that the processes belonging to a fault-tolerant process group will observe consistent orderings of events affecting the group as a whole, including process failures, recoveries, migration, and dynamic changes to group properties like member rankings. A review of several uses for the protocols in the ISIS system, which supports fault-tolerant resilient objects and bulletin boards, illustrates the significant simplification of higher level algorithms made possible by our approach.
Transis: A Communication Sub-System for High Availability
, 1992
"... This paper describes Transis, a communication sub-system for high availability. Transis is a transport layer package that supports a variety of reliable multicast message passing services between processors. It provides highly tuned multicast and control services for scalable systems with arbitrary ..."
Abstract
-
Cited by 337 (46 self)
- Add to MetaCart
This paper describes Transis, a communication sub-system for high availability. Transis is a transport layer package that supports a variety of reliable multicast message passing services between processors. It provides highly tuned multicast and control services for scalable systems with arbitrary topology. The communication domain comprises of a set of processors that can initiate multicast messages to a chosen subset. Transis delivers them reliably and maintains the membership of connected processors automatically, in the presence of arbitrary communication delays, of message losses and of processor failures and joins. The contribution of this paper is in providing an aggregate definition of communication and control services over broadcast domains. The main benefit is the efficient implementation of these services using the broadcast capability. In addition, the membership algorithm has a novel approach in handling partitions and remerging; in allowing the regular flow of messages...
Orca: A language for parallel programming of distributed systems
- IEEE Transactions on Software Engineering
, 1992
"... Orca is a language for implementing parallel applications on loosely coupled distributed systems. Unlike most languages for distributed programming, it allows processes on different machines to share data. Such data are encapsulated in data-objects, which are instances of user-defined abstract data ..."
Abstract
-
Cited by 307 (43 self)
- Add to MetaCart
Orca is a language for implementing parallel applications on loosely coupled distributed systems. Unlike most languages for distributed programming, it allows processes on different machines to share data. Such data are encapsulated in data-objects, which are instances of user-defined abstract data types. The implementation of Orca takes care of the physical distribution of objects among the local memories of the processors. In particular, an implementation may replicate and/or migrate objects in order to decrease access times to objects and increase parallelism. This paper gives a detailed description of the Orca language design and motivates the design choices. Orca is intended for applications programmers rather than systems programmers. This is reflected in its design goals to provide a simple, easy to use language that is type-secure and provides clean semantics. The paper discusses three example parallel applications in Orca, one of which is described in detail. It also describes one of the existing implementations, which is based on reliable broadcasting. Performance measurements of this system are given for three parallel applications. The measurements show that significant speedups can be obtained for all three applications. Finally, the paper compares Orca with several related languages and systems. 1.
Exploiting virtual synchrony in distributed systems
, 1987
"... Abstract: We describe applications of a virtually synchro-nous environment for distributed programming, which underlies a collection of distributed programming tools in the 1SIS2 system. A virtually synchronous environment allows processes to be structured into process groups, and makes events like ..."
Abstract
-
Cited by 298 (26 self)
- Add to MetaCart
Abstract: We describe applications of a virtually synchro-nous environment for distributed programming, which underlies a collection of distributed programming tools in the 1SIS2 system. A virtually synchronous environment allows processes to be structured into process groups, and makes events like broadcasts to the group as an entity, group membership changes, and even migration of an activity from one place to another appear to occur instan-taneously-- in other words, synchronously. A major advantage to this approach is that many aspects of a dis-tributed application can be treated independently without compromising correctness. Moreover, user code that is designed as if the system were synchronous can often be executed concurrently. We argue that this approach to building distributed and fault-tolerant software is more straightforward, more flexible, and more likely to yield correct solutions than alternative approaches. 1. A toolkit for distributed systems Consider the design of a distributed system for factory automation, say for VLSI chip fabrication. Such a system would need to group control 'processes into services responsible for different aspects of the fabrication procedure. One service might accept batches of chips needing photographic emulsions, another oversee transport of chips from station to sta-tion, etc. Within a service, algorithms would be needed for scheduling work, replicating data, coordi-
Understanding Fault-Tolerant Distributed Systems
- Communications of the ACM
, 1993
"... We propose a small number of basic concepts that can be used to explain the architecture of fault-tolerant distributed systems and we discuss a list of architectural issues that we find useful to consider when designing or examining such systems. For each issue we present known solutions and design ..."
Abstract
-
Cited by 296 (23 self)
- Add to MetaCart
We propose a small number of basic concepts that can be used to explain the architecture of fault-tolerant distributed systems and we discuss a list of architectural issues that we find useful to consider when designing or examining such systems. For each issue we present known solutions and design alternatives, we discuss their relative merits and we give examples of systems which adopt one approach or the other. The aim is to introduce some order in the complex discipline of designing and understanding fault-tolerant distributed systems. 1 Introduction Computing systems consist of a multitude of hardware and software components that are bound to fail eventually. In many systems, such component failures can lead to unanticipated, potentially disruptive failure behavior and to service unavailability. Some systems are designed to be fault-tolerant: they either exhibit a well-defined failure behavior when components fail or mask component failures to users, that is, continue to provid...
Log-Based Receiver-Reliable Multicast Distributed Interactive Simulation
, 1995
"... Reliable multicast communication is important in large-scale distributed applications. For example, reliable multicast is used to transmit terrain and environmental updates in distributed simulations. To date, proposed protocols have not supported these applications' requirements, which include wide ..."
Abstract
-
Cited by 222 (5 self)
- Add to MetaCart
Reliable multicast communication is important in large-scale distributed applications. For example, reliable multicast is used to transmit terrain and environmental updates in distributed simulations. To date, proposed protocols have not supported these applications' requirements, which include wide-area data distribution, low-latency packet loss detection and recovery, and minimal data and management overhead within fine-grained multicast groups, each containing a single data source.

