On the Quality of Service of Failure Detectors
IEEE Transactions on Computers, 2000
Cited by 84 (12 self)
Abstract: We study the quality of service (QoS) of failure detectors. By QoS, we mean a specification that quantifies 1) how fast the failure detector detects actual failures and 2) how well it avoids false detections. We first propose a set of QoS metrics to specify failure detectors for systems with probabilistic behaviors, i.e., for systems where message delays and message losses follow some probability distributions. We then give a new failure detector algorithm and analyze its QoS in terms of the proposed metrics. We show that, among a large class of failure detectors, the new algorithm is optimal with respect to some of these QoS metrics. Given a set of failure detector QoS requirements, we show how to compute the parameters of our algorithm so that it satisfies these requirements, and we show how this can be done even if the probabilistic behavior of the system is not known. We then present some simulation results that show that the new failure detector algorithm provides a better QoS than an algorithm that is commonly used in practice. Finally, we suggest some ways to make our failure detector adaptive to changes in the probabilistic behavior of the network. Index Terms: Failure detectors, quality of service, fault tolerance, distributed algorithm, probabilistic analysis.
Chameleon: A Software Infrastructure For Adaptive Fault Tolerance In Distributed Systems
1998
Cited by 80 (11 self)
The project has benefited through some demonstrations and presentations given to people from the computer industry, and the discussions they generated. Of them, mention must be made of Pankaj Mehra of Tandem Labs, Roger Lee, Robert Ferraro and Jagdish Patel of the Jet Propulsion Laboratory, Larry Jack and Chet Markiewicz of Honeywell Inc., and Haim Levendel of Motorola.
Table of contents:
1 Introduction
2 Related Work
3 Chameleon Overview
3.1 Behavioral Overview
3.1.1 Initialization of the Chameleon Environment
3.1.2 Interpreting User-Specified Depen
Experiences, Strategies and Challenges in Building Fault-Tolerant CORBA Systems
IEEE Transactions on Computers, 2004
Cited by 13 (1 self)
After almost a decade since the introduction of the earliest reliable CORBA implementation, and despite the recent adoption of the Fault Tolerant CORBA (FT-CORBA) standard by the Object Management Group, CORBA is still not widely adopted as the preferred platform for building reliable distributed applications. Among the obstacles to FT-CORBA's widespread deployment are the complexity of the new standard, the lack of understanding in implementing and/or deploying reliable CORBA applications, and the fact that current FT-CORBA implementations are not readily applicable to real-world complex applications. In this paper, we candidly share our independent experiences as developers of two separate reliable CORBA infrastructures (OGS and Eternal), and as contributors to the FT-CORBA standardization process. Our intention is to reveal the intricacies, challenges and strategies in developing fault-tolerant CORBA systems, including our own. We provide an overview of the new FT-CORBA standard, and discuss its limitations and techniques for best exploiting it. We reflect on the difficulties that we encountered in building dependable CORBA systems, the solutions that we developed to address these challenges, and the lessons that we learned as a result. Finally, we highlight some of the open issues, such as non-determinism and partitioning, along with some solutions for resolving these issues.
Relying on Safe Distance to Achieve Strong Partitionable Group Membership in Ad Hoc Networks
IEEE Transactions on Mobile Computing, 2004
Cited by 12 (7 self)
The design of ad hoc mobile applications often requires the availability of a consistent view of the application state among the participating hosts. Such views are important because they simplify both the programming and verification tasks. We argue that preventing the occurrence of unannounced disconnection is essential to constructing and maintaining a consistent view in the ad hoc mobile environment. In this light, we provide the specification for a partitionable group membership service supporting ad hoc mobile applications and propose a protocol for implementing the service. A unique property of this partitionable group membership is that messages sent between group members are guaranteed to be delivered successfully, given appropriate system assumptions. This property is preserved over time despite movement and frequent disconnections. The protocol splits and merges groups and maintains a logical connectivity graph based on a notion of safe-distance. An implementation of the protocol in Java is available for testing. The implementation is used to implement Lime, a middleware for mobility that supports transparent sharing of data in both wired and ad hoc wireless environments.
Hierarchical Error Detection in a Software Implemented Fault Tolerance (SIFT) Environment
IEEE Transactions on Knowledge and Data Engineering, 2000
Cited by 6 (0 self)
In this paper, we propose a hierarchical framework for providing fault tolerance to the SIFT layer of a distributed system, and extending it to the applications executing in such an environment. The detection hierarchy is proposed in the context of Chameleon, a software environment for providing adaptive fault tolerance in a COTS environment to off-the-shelf software. A flexible mechanism for combining different levels in the hierarchy and different techniques within a level is proposed. We define intra-level and inter-level optimizations to minimize the overhead of detection and make the optimizations adaptive to runtime requirements. New approaches for software signatures and diagnosis through interactive consistency protocols are highlighted. The paper presents results from a detailed simulation of the environment, using as parameters measurements obtained from an early prototype implementation. The results indicate the increase in availability due to the detection framework and help...
A Real-Time Push-Pull Communications Model for Distributed Real-Time and Multimedia Systems
Department of Computer Science, Carnegie Mellon University, 1999
Cited by 5 (1 self)
Real-time and multimedia applications like multi-party collaboration, internet telephony and distributed command control systems require the exchange of information over distributed and heterogeneous nodes. Multiple data types including voice, video, sensor data, real-time intelligence data and text are being transported widely across today's information, control and surveillance networks. All such applications can benefit enormously from middleware, operating system and networking services that can support QoS guarantees, high availability, dynamic reconfigurability and scalability. In this paper, we propose a middleware layer called the "Real-Time Push-Pull Communications Service" to easily and quickly disseminate information across heterogeneous nodes with flexible communication patterns. Real-time push-pull communications is an extension of the real-time publisher/subscriber model, and represents both "push" (data transfer initiated by a sender) and "pull" (data transfer initiated b...
Fully Distributed Three-Tier Active Software Replication
IEEE TPDS, 2006
Cited by 3 (1 self)
Abstract: Keeping the state of the replicas of a software service strongly consistent when the service is deployed across a distributed system prone to crashes and with highly unstable message transfer delays (e.g., the Internet) is a real practical challenge. The solution to this problem is subject to the FLP impossibility result, and thus there is a need for “long enough” periods of synchrony with time bounds on process speeds and message transfer delays to ensure deterministic termination of any run of agreement protocols executed by replicas. This behavior can be abstracted by a partially synchronous computational model. In this setting, before reaching a period of synchrony, the underlying network can arbitrarily delay messages, and these delays can be perceived as false failures by some timeout-based failure detection mechanism, leading to unexpected service unavailability. This paper proposes a fully distributed solution for active software replication based on a three-tier software architecture well-suited to such a difficult setting. The formal correctness of the solution is proved by assuming the middle-tier runs in a partially synchronous distributed system. This architecture separates the ordering of the requests coming from clients, executed by the middle-tier, from their actual execution, done by replicas, i.e., the end-tier. In this way, clients can show up in any part of the distributed system and replica placement is simplified, since only the middle-tier has to be deployed on a well-behaving part of the distributed system that frequently respects synchrony bounds. This deployment permits rapid timeout tuning, thus reducing unexpected service unavailability. Index Terms: Dependable distributed systems, software replication in wide-area networks, replication protocols, architectures for dependable services.
Building the next generation groupware: A survey of groupware and its impact on the virtual enterprise
1999
Cited by 1 (0 self)
This document explores the issues in building the "groupware of the future". The approach is twofold. First, we briefly describe our vision of a "virtual enterprise" that is made up of a set of services through a composition of components that comply with a specific contract (specification). Two scenarios are presented which demonstrate the set of underlying requirements. Second, a review of the state of the art in groupware technology is presented which identifies a set of services provided by current groupware systems. An assessment of the relative importance of these services is then presented and a comparison made to the requirements specified in part one. A more detailed presentation of a subset of "important" groupware projects or products clarifies the limitations of current approaches. These limitations, in conjunction with the current trends in software technology (object technology, component architectures, web technology, etc.), determine the next steps towards our vision. The document concludes with a presentation of our definition for the "groupware of the future" and the route towards it.
Laboratory's Center for Integrated Space Microsystems, in
Abstract: The RAIN project is a research collaboration between Caltech and NASA-JPL on distributed computing and data storage systems for future spaceborne missions. The goal of the project is to identify and develop key building blocks for reliable distributed systems built with inexpensive off-the-shelf components. The RAIN platform consists of a heterogeneous cluster of computing and/or storage nodes connected via multiple interfaces to networks configured in fault-tolerant topologies. The RAIN software components run in conjunction with operating system services and standard network protocols. Through software-implemented fault tolerance, the system tolerates multiple node, link, and switch failures, with no single point of failure. The RAIN technology has been transferred to Rainfinity, a start-up company focusing on creating clustered solutions for improving the performance and availability of Internet data centers. In this paper, we describe the following contributions: 1) fault-tolerant interconnect topologies and communication protocols providing consistent error reporting of link failures, 2) fault management techniques based on group membership, and 3) data storage schemes based on computationally efficient error-control codes. We present several proof-of-concept applications: a highly-available video server, a highly-available Web server, and a distributed checkpointing system. Also, we describe a commercial product, Rainwall, built with the RAIN technology. Index Terms: Distributed computing, scalable architectures, interconnection networks, fault tolerance, data storage, cluster computing.

