Preserving and Using Context Information in Interprocess Communication
ACM Transactions on Computer Systems, 1989
Cited by 210 (24 self)
Psync is based on a conversation abstraction that provides a shared message space through which a collection of processes exchange messages. The general form of this message space is defined by a directed acyclic graph that preserves the partial order of the exchanged messages. For the purpose of this section, we view a conversation as an abstract data type that is implemented in shared memory; Section 3 gives an algorithm for implementing a conversation in an unreliable network. A conversation behaves much like any connection-oriented IPC abstraction: a well-defined set of processes, called participants, explicitly open a conversation, exchange messages through it, and close the conversation. Only processes that have been identified as participants may exchange messages through the conversation, and this set is fixed for the duration of the conversation. Processes begin a conversation with the operations conv = active_open(participant_set) and conv = passive_open(pid). The first...
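As an illustration of the idea in this abstract, the following minimal Python sketch models a conversation as an in-memory directed acyclic graph in which each message records the earlier messages it was sent in the context of. It is not Psync's implementation or API: the class `Conversation`, the methods `send` and `precedes`, and the participant names are all hypothetical, and the sketch ignores networking and failures entirely.

```python
from dataclasses import dataclass, field


@dataclass
class Conversation:
    """Toy shared message space: a DAG whose edges point to earlier messages.

    Hypothetical names throughout; this is not the Psync interface.
    """
    participants: frozenset
    # message id -> (sender, payload, ids of the messages forming its context)
    messages: dict = field(default_factory=dict)
    next_id: int = 0

    def send(self, sender, payload, context=()):
        # Only the fixed participant set may exchange messages.
        if sender not in self.participants:
            raise PermissionError(f"{sender} is not a participant")
        # A message may only refer to messages already in the graph,
        # which keeps the graph acyclic.
        for mid in context:
            if mid not in self.messages:
                raise KeyError(f"unknown context message {mid}")
        mid = self.next_id
        self.next_id += 1
        self.messages[mid] = (sender, payload, frozenset(context))
        return mid

    def precedes(self, a, b):
        """Return True if message a is an ancestor of message b in the DAG."""
        stack = list(self.messages[b][2])
        seen = set()
        while stack:
            cur = stack.pop()
            if cur == a:
                return True
            if cur not in seen:
                seen.add(cur)
                stack.extend(self.messages[cur][2])
        return False


conv = Conversation(participants=frozenset({"p1", "p2", "p3"}))
m0 = conv.send("p1", "hello")
m1 = conv.send("p2", "reply A", context=[m0])
m2 = conv.send("p3", "reply B", context=[m0])
```

Here `m0` precedes both replies, while `m1` and `m2` are unordered with respect to each other, which is the kind of partial order the graph is meant to preserve.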
Atomic Broadcast: From Simple Message Diffusion to Byzantine Agreement
Information and Computation, 1985
Cited by 210 (15 self)
In distributed systems subject to random communication delays and component failures, atomic broadcast can be used to implement the abstraction of synchronous replicated storage, a distributed storage that displays the same contents at every correct processor as of any clock time. This paper presents a systematic derivation of a family of atomic broadcast protocols that are tolerant of increasingly general failure classes: omission failures, timing failures, and authentication-detectable Byzantine failures. The protocols work for arbitrary point-to-point network topologies, and can tolerate any number of link and process failures up to network partitioning. After proving their correctness, we also prove two lower bounds that show that the protocols provide in many cases the best possible termination times. Keywords and phrases: Atomic Broadcast, Byzantine Agreement, Computer Network, Correctness, Distributed System, Failure Classification, Fault-Tolerance, Lower Bound, Real-Time Syste...
Reaching Agreement on Processor Group Membership in Synchronous Distributed Systems
Distributed Computing, 1991
Cited by 125 (14 self)
Reaching agreement on the identity of correctly functioning processors of a distributed system in the presence of random communication delays, failures and processor joins is a fundamental problem in fault-tolerant distributed systems. Assuming a synchronous communication network that is not subject to partition occurrences, we specify the processor-group membership problem and we propose three simple protocols for solving it. The protocols provide all correct processors with consistent views of the processor-group membership and guarantee bounded processor failure detection and join delays. Key words: Communication network -- Distributed system -- Failure detection -- Fault tolerance -- Real time system -- Replicated data 1 Introduction When designing a computing service that must remain available despite component failures, a key idea is to replicate service state information at several servers running on distinct processors. The service state typically consists of the ser...
Replicated Distributed Processes in Manetho
Cited by 24 (6 self)
This paper presents the process-replication protocol of Manetho, a system whose goal is to provide efficient, application-transparent fault tolerance to long-running distributed computations. Manetho uses a new negative-acknowledgment multicast protocol to enforce the same receipt order of application messages among all replicas of a process. The protocol depends on a combination of antecedence graph maintenance, a form of sender-based message logging, and the fact that the receivers of each multicast execute the same deterministic program. This combination allows our protocol to avoid the delay in application message delivery that is common in existing negative-acknowledgment multicast protocols, without giving up the advantage of requiring only a small number of control messages.
A Low-Level Processor Group Membership Protocol for LANs
In Proceedings of the 13th International Conference on Distributed Computing Systems, 1992
Cited by 15 (4 self)
This paper presents a processor group membership protocol designed to run on top of a local area network. The protocol maintains information about a selected group of stations that explicitly join the protocol by keeping a replica of a global membership table at every member. Additionally, the protocol guarantees that a given station always occupies the same entry in the table. As a result, table indexes do uniquely and universally identify a station and can thus be used as short identifiers. The interest of a processor group membership is twofold: it is a powerful auxiliary for process group membership management and it provides support for efficient message addressing. Keywords: Distributed Systems, Distributed Algorithms, Fault-Tolerance, Communication Protocols, Real-Time. 1 Introduction Distributed systems may take advantage of the local availability of up to date information about the nodes in the system. This information is not static: during the lifetime of the system, statio...
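The property that a station always occupies the same table entry, so that its index can serve as a short identifier, can be illustrated with a small Python sketch. This is a single-process toy under stated assumptions, not the protocol from the paper: it does not replicate the table across members or handle failures, and the class `MembershipTable`, its methods, and the station addresses are hypothetical.

```python
class MembershipTable:
    """Toy membership table in which each station keeps one fixed index.

    Hypothetical sketch: the real protocol keeps a replica of the table at
    every member; here there is only one local copy.
    """

    def __init__(self, size):
        self.slots = [None] * size   # index -> station address (or None)
        self.index_of = {}           # station address -> permanent index
        self.active = set()          # stations currently in the group

    def join(self, station):
        # A station seen for the first time gets the next free entry;
        # a returning station gets its old entry back.
        if station not in self.index_of:
            idx = len(self.index_of)
            if idx >= len(self.slots):
                raise RuntimeError("membership table is full")
            self.index_of[station] = idx
            self.slots[idx] = station
        self.active.add(station)
        return self.index_of[station]

    def leave(self, station):
        # The entry stays reserved so the index remains a stable identifier.
        self.active.discard(station)

    def members(self):
        """Return the indexes of the currently active stations, sorted."""
        return sorted(self.index_of[s] for s in self.active)


table = MembershipTable(size=4)
a = table.join("08:00:2b:01")
b = table.join("08:00:2b:02")
table.leave("08:00:2b:01")
a_again = table.join("08:00:2b:01")
```

Because a departed station's entry is reserved rather than reused, `a_again` equals `a`, so other stations could keep addressing it by the same short index.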
A Reliable Multicast Protocol for Distributed Real-Time Systems
1991
Cited by 14 (3 self)
Distributed computer architectures are well accepted in the domain of real-time applications. To realize fault-tolerance, fail-silent node computers providing the same service can be clustered into Fault-Tolerant Units (FTUs). Each FTU provides a specified service as long as at least one of its node computers is operational. The communication between these FTUs has to be deterministic, reliable and timely, i.e. there must be a tight upper bound on the time it takes to send a message from one FTU to the other FTUs. This paper presents a communication system suitable for real-time applications that meets these requirements.
Abstractions for Constructing Dependable Distributed Systems
1992
Cited by 14 (3 self)
Shivakant Mishra and Richard D. Schlichting, TR 92-19. Distributed systems, in which multiple machines are connected by a communications network, are often used to build highly dependable computing systems. However, constructing the software required to realize such dependability is a difficult task since it requires the programmer to build fault-tolerant software that can continue to function despite failures. To simplify this process, canonical structuring techniques or programming paradigms have been developed, including the object/action model, the primary/backup approach, the state machine approach, and conversations. In this paper, some of the system abstractions designed to support these paradigms are described. These abstractions, which are termed fault-tolerant services, can be categorized into two types. One type provides functionality similar to standard hardware or operating system services, but with improved ...
A Conceptual Framework for System Fault Tolerance
1992
Cited by 14 (1 self)
A major problem in transitioning fault tolerance practices to the practitioner community is a lack of a common view of what fault tolerance is, and how it can help in the design of reliable computer systems. This document takes a step towards making fault tolerance more understandable by proposing a conceptual framework. The framework provides a consistent vocabulary for fault tolerance concepts, discusses how systems fail, describes commonly used mechanisms for making systems fault tolerant, and provides some rules for developing fault tolerant systems. 1 Introduction One of the major problems in transitioning fault tolerance practices to the practitioner community is a lack of a common view of exactly what fault tolerance is, and how it can help in the design of reliable systems. One step towards making fault tolerance more understandable is to provide a conceptual framework. The purpose of this document is to propose such a framework. This document begins with a discussion of wh...
Experience with Modularity in Consul
1993
Cited by 11 (3 self)
From the application's perspective, Consul provides a collection of fault-tolerant services that collectively support the state machine model of distributed computing.
Dynamic Configuration Management in Reliable Distributed Real-Time Information Systems
IEEE Trans. on Knowledge and Data Engr, 1999
Cited by 9 (4 self)
Large-scale information systems emerging in challenging application fields must meet the high standards of reliability, maintainability, and service interruption bound requirements. Their operations are entirely, or partially, of the distributed real-time data object manipulation type. A new architecture for such systems is presented in this paper. The original aspects of the architecture are mainly in two parts: 1) the time-triggered message-triggered object (TMO) structuring of the middleware and the application software of distributed real-time information systems; and 2) the dynamic configuration management subsystem (DCMS), based on the supervisor-based network surveillance (SNS) scheme. The positive impacts of this TMO structuring on maintainability and service interruption bounds are first discussed, with distributed replicated information service systems and other systems as examples. Then, the main discussion dwells on the DCMS architecture, in particular, the formal presentation of its key component: the SNS scheme. As a component of DCMS, the network surveillance (NS) subsystem enables fast learning by each interested fault-free node in the system of the faults or repair completion events occurring in other parts of the system. Currently, concrete real-time NS schemes effective in distributed systems based on point-to-point network architectures are scarce. The SNS scheme presented in this paper is a semicentralized real-time NS scheme effective in a variety of point-to-point networks. This scheme is highly scalable. An efficient implementation model for the SNS scheme is presented that can be easily adapted to various commercial operating system kernels. This paper also presents a formal analysis of the SNS scheme, on the basis of the implementation model, to obtain its strongly competitive tight bounds on the fault detection latency. Finally, some DCMS implementation issues are discussed that remain to be addressed in future research.

