A Configurable Membership Service
IEEE Transactions on Computers, 1994
Cited by 45 (10 self)
A membership service is used to maintain information about which sites are functioning in a distributed system at any given time. Many such services have been defined, with each implementing a unique combination of properties that simplify the construction of higher levels of the system. Despite this wealth of possibilities, however, any given service only realizes one set of properties, which makes it difficult to tailor the service provided to the specific needs of the application. Here, a configurable membership service that addresses this problem is described. This service is based on decomposing membership into its constituent abstract properties, and then implementing these properties as separate software modules called micro-protocols that can be configured together to produce a customized membership service. A prototype C++ implementation of the membership service for a simulated distributed environment is also described. December 19, 1994 Revised January 9, 1996 Department of ...
An Approach to Constructing Modular Fault-Tolerant Protocols
In Proceedings of the 12th Symposium on Reliable Distributed Systems, 1993
Cited by 25 (14 self)
Modularization is a well-known technique for simplifying complex software. Here, an approach to modularizing fault-tolerant protocols such as reliable multicast and membership is described. The approach is based on implementing a protocol's individual properties as separate micro-protocols, and then combining selected micro-protocols using an event-driven software framework; a system is constructed by composing these frameworks with traditional network protocols using standard hierarchical techniques. In addition to simplifying the software, this model helps clarify the dependencies among properties of fault-tolerant protocols, and makes it possible to construct systems that are customized to the specifics of the application or underlying architecture. An example involving reliable group multicast is given, together with a description of a prototype implementation using the SR concurrent programming language. An implementation based on the x-kernel and RT-Mach is also underway.
Abstractions for Constructing Dependable Distributed Systems
1992
Cited by 14 (3 self)
Shivakant Mishra and Richard D. Schlichting. TR 92-19.
Distributed systems, in which multiple machines are connected by a communications network, are often used to build highly dependable computing systems. However, constructing the software required to realize such dependability is a difficult task since it requires the programmer to build fault-tolerant software that can continue to function despite failures. To simplify this process, canonical structuring techniques or programming paradigms have been developed, including the object/action model, the primary/backup approach, the state machine approach, and conversations. In this paper, some of the system abstractions designed to support these paradigms are described. These abstractions, which are termed fault-tolerant services, can be categorized into two types. One type provides functionality similar to standard hardware or operating system services, but with improved ...
Reliable real-time communication in cooperative mobile applications
IEEE Transactions on Computers, 2003
Cited by 13 (2 self)
Embedded systems are expected to provide increasingly complex and safety-critical services that will, sooner or later, require the cooperation of several such systems for their fulfillment. In particular, coordinating the access to shared physical and information technological resources will become a general problem. Examples are mobile robots in industrial automation or car-to-car coordination for future traffic control applications. In such applications, cooperation is subject to strong real-time and reliability requirements. In this paper, we present an architecture that allows autonomous mobile systems to schedule shared resources in real-time using their own wireless distributed infrastructure. In this architecture, there is a clear separation between the application-specific scheduling part and the application-independent communication part that constitutes the real-time and reliability hardcore of the system. The latter provides clock synchronization, real-time atomic multicast, and real-time group membership based on an IEEE 802.11 Standard wireless LAN. An application prototype shows how the architecture can be used in future mobile cooperative applications. Index Terms: Fault-tolerant real-time systems, distributed algorithms, mobile computing, wireless networks.
Membership and System Diagnosis
In Proceedings of the 14th IEEE Symposium on Reliable Distributed Systems, 1995
Cited by 11 (3 self)
A membership service is a service in a distributed system that maintains and provides information about which sites are functioning and which have failed at any given time. System diagnosis, on the other hand, is a method for detecting faulty processing elements and distributing this information to non-faulty elements. In spite of the apparent similarity of goals, these two fields have been considered separately from their beginnings. In this paper, we attempt to compare these fields and show the fundamental differences and the similarities. We demonstrate that the problems are closely related, with the major differences being the assumptions made about the failure model, the testing methods, and the type of service guarantees provided to the application. Furthermore, we demonstrate that the fields are closely enough related that some algorithms utilized in one field can easily be transformed into algorithms in the other. As examples, we derive new membership algorithms from a distribut...
A Dependable Distributed Auction System: Architecture and an Implementation Framework
2001
Cited by 11 (3 self)
The work presented here develops a distributed systems architecture and proposes an implementation framework for conducting dependable Internet-based on-line auctions, meeting the requirements of scalability and service integrity. Current auction services essentially rely on a central auction server. Given the increasing popularity and usage of electronic auctions, such a centralised approach is fundamentally restrictive with respect to scalability. Further, different national markets have different monetary regulations and may employ different procedures for payment settlements. Catering for local market autonomy means that decentralisation is an essential and practical requirement. With these design goals in mind, the paper develops an approach that permits an auction service to be mapped on to globally distributed auction servers. It then proposes a framework for a fault-tolerant implementation of the architecture. Fault tolerance is achieved through mature technologies: replication management and the group paradigm. Keywords and Phrases: Auctions, distributed servers, multicast groups, passive and active replication, reliable multicast, membership service, synchronous and asynchronous networks.
Distributed Diagnosis in Dynamic Fault Environments
IEEE Trans. on Parallel and Distributed Systems, 2004
Cited by 10 (0 self)
The problem of distributed diagnosis in the presence of dynamic failures and repairs is considered. To address this problem, the notion of bounded correctness is defined. Bounded correctness is made up of three properties: bounded diagnostic latency, which ensures that information about state changes of nodes in the system reaches working nodes with a bounded delay; bounded start-up time, which guarantees that working nodes determine valid states for every other node in the system within bounded time after their recovery; and accuracy, which ensures that no spurious events are recorded by working nodes. It is shown that, in order to achieve bounded correctness, the rate at which nodes fail and are repaired must be limited. This requirement is quantified by defining a minimum state holding time in the system. Algorithm HeartbeatComplete is presented and it is proven that this algorithm achieves bounded correctness in fully-connected systems while simultaneously minimizing diagnostic latency, start-up time, and state holding time. A diagnosis algorithm for arbitrary topologies, known as Algorithm ForwardHeartbeat, is also presented. ForwardHeartbeat is shown to produce significantly shorter latency and state holding time than prior algorithms, which focused primarily on minimizing the number of tests at the expense of latency. Index Terms: Distributed diagnosis, dynamic failures, fault tolerance, synchronous systems.
A Hierarchical Membership Protocol for Synchronous Distributed Systems
1994
Cited by 8 (0 self)
A membership service for a synchronous distributed computer system is described. The system is assumed to be composed of groups in which a relatively frequent message exchange occurs. A hierarchy of connected groups constitutes a connected network. The membership service protocol reflects this hierarchical structure. The protocol tolerates timing, omission and crash failures. Time-bounds are specified in which additions (removals) of processors to (from) the system are known to all participating processors. Keywords: membership service, distributed algorithm, fault tolerance, hierarchical system, synchronous system.
1 Introduction
The construction of a service that determines the presence of correct processors in a distributed system, commonly known as the membership service, is regarded as a fundamental problem in distributed systems. Once solved, it allows the solution of many other problems based upon its availability. Three advantages of membership are: (i) efficie...
Understanding Membership
1995
Cited by 6 (2 self)
A membership service is used in a distributed system to maintain information about which sites are functioning and which have failed at any given time. Such services have proven to be fundamental for constructing distributed applications, with many example services and algorithms defined in the literature. Despite these efforts, however, little has been done on examining the abstract properties commonly guaranteed by membership services independent of a given implementation. Here, a number of these properties are identified and defined. These properties range from agreement among sites on membership changes, consistent ordering of change notifications, and timing properties to various ways for dealing with recoveries and partitions. Message ordering graphs, which are an abstract representation of the set of messages at each site in the system and their potential delivery order, are used to define the properties. Dependency graphs, which are a graphical representation expressi...
Action-Level Fault Tolerance
Advances in Real-Time Systems, 1995
Cited by 6 (2 self)
Introduction
Real-time computing is a relatively old branch (at least 35 years old) of computer engineering but has been advancing rather slowly until recently. In the author's opinion the main reason was the small market size for real-time computing. However, in recent years, the researcher population in the field of real-time computing has shown rapid growth for various economic and technological reasons and as a result, the technology started advancing rapidly. Most large-scale real-time computing applications are of safety-critical type and thus require high reliability and fault tolerance properties from the computer systems used. The most desirable types of fault tolerance techniques for use in real-time computer systems (RTCSs) are those for realizing action-level (AL) fault tolerance, which means to accomplish every critical action (output action of a critical task as specified) successfully in spite of component failures. Such techniques are thus aimed

