Making Reliable Distributed Systems in the Presence of Software Errors
, 2003
Cited by 42 (0 self)
product, having over a million lines of Erlang code. This product (the AXD301) is thought to be one of the most reliable products ever made by Ericsson.
Low-Cost Error Containment and Recovery for Onboard Guarded Software Upgrading and Beyond
- IEEE Trans. Computers
, 2002
Cited by 8 (4 self)
Message-driven confidence-driven (MDCD) error containment and recovery, a low-cost approach to mitigating the effect of software design faults in distributed embedded systems, is developed for onboard guarded software upgrading for deep-space missions. In this paper, we first describe and verify the MDCD algorithms in which we introduce the notion of "confidence-driven" to complement the "communication-induced" approach employed by a number of existing checkpointing protocols to achieve error containment and recovery efficiency. We then conduct a model-based analysis to show that the algorithms ensure low performance overhead. Finally, we discuss the advantages of the MDCD approach and its potential utility as a general-purpose, low-cost software fault tolerance technique for distributed embedded computing.
Low-Cost Flexible Software Fault Tolerance for Distributed Computing
- in Proceedings of the 12th International Symposium on Software Reliability Engineering (ISSRE 2001), Hong Kong
, 2001
Cited by 3 (2 self)
In this paper, we revisit the problem of software fault tolerance in distributed systems. In particular, we propose an extension of a message-driven confidence-driven (MDCD) protocol we have developed for error containment and recovery in a particular type of distributed embedded system. More specifically, we augment the original MDCD protocol by introducing the method of "fine-grained confidence adjustment," which enables us to remove the architectural restrictions. The dynamic nature of the MDCD approach gives it a number of desirable characteristics. First, this approach does not impose any restrictions on interactions among application software components or require costly message-exchange-based process coordination/synchronization. Second, the algorithms allow redundancies to be applied only to low-confidence or critical interacting software components in a distributed system, permitting flexible realization of software fault tolerance. Finally, the dynamic error containment and recovery mechanisms are transparent to the application and ready to be implemented by generic middleware.
Synergistic Coordination between Software and Hardware Fault Tolerance Techniques
- in Proceedings of the International Conference on Dependable Systems and Networks (DSN-2001), Göteborg, Sweden
, 2001
Cited by 2 (2 self)
This paper describes an approach for enabling the synergistic coordination between two fault tolerance protocols to simultaneously tolerate software and hardware faults in a distributed computing environment. Specifically, our approach is based on a message-driven confidence-driven (MDCD) protocol that we have devised for tolerating software design faults, and a time-based (TB) checkpointing protocol that was developed by Neves and Fuchs for tolerating hardware faults. By carrying out algorithm modifications that are conducive to synergistic coordination between volatile-storage and stable-storage checkpoint establishments, we are able to circumvent the potential interference between the MDCD and TB protocols, and to allow them to effectively complement each other to extend a system's fault tolerance capability. Moreover, the protocol-coordination approach preserves and enhances the features and advantages of the individual protocols that participate in the coordination, keeping the performance cost low.
Performability Analysis of Guarded-Operation Duration: A Successive Model-Translation Approach
, 2002
When making an engineering design decision, it is often necessary to consider its implications on both system performance and dependability. In this paper, we present a performability study that analyzes the guarded operation duration for onboard software upgrading. In particular, we define a "performability index" Y that quantifies the extent to which the guarded operation with a duration φ reduces the expected total performance degradation. In order to solve for Y, we progressively translate its formulation until it becomes an aggregate of constituent measures conducive to efficient reward model solutions. Based on the reward-mapping-enabled intermediate model, we specify reward structures in the composite base model which is built on three stochastic activity network reward models. We describe the model-translation approach and show its feasibility for design-oriented performability modeling.
On Low-Cost Error Containment and Recovery Methods for Guarded Software Upgrading
- in Proceedings of the 20th International Conference on Distributed Computing Systems (ICDCS 2000)
, 2000
To assure dependable onboard evolution, we have developed a methodology called guarded software upgrading (GSU). In this paper, we focus on a low-cost approach to error containment and recovery for GSU. To ensure low development cost, we exploit inherent system resource redundancies as the fault tolerance means. In order to mitigate the effect of residual software faults at low performance cost, we take a crucial step in devising error containment and recovery methods by introducing the "confidence-driven" notion. This notion complements the message-driven (or "communication-induced") approach employed by a number of existing checkpointing protocols for tolerating hardware faults. In particular, we discriminate between the individual software components with respect to our confidence in their reliability, and keep track of changes of our confidence (due to knowledge about potential process state contamination) in particular processes. This, in turn, enables the individual processes in the spaceborne distributed system to make decisions locally, at run-time, on whether to establish a checkpoint upon message passing and whether to roll back or roll forward during error recovery. The resulting message-driven confidence-driven approach enables cost-effective checkpointing and cascading-rollback-free recovery.
Protecting Distributed Software Upgrades that Involve Message-Passing
We present in this paper an extension of the message-driven confidence-driven framework that we developed for onboard guarded software upgrading. The purpose of this work is to provide the framework with the capability of protecting distributed software upgrades that involve message-passing interface changes. To achieve this goal, we propose an approach to clustering the components involved in software upgrades and those involved in message-passing interface changes, such that from outside the cluster all those components can be perceived collectively as one virtual low-confidence component. Moreover, we develop a confidence-driven mechanism that enables combined use of sender- and receiver-side message logging for efficient, fine-grained error containment and recovery. The paper provides a detailed algorithm description.
Printed by Universitetsservice US-AB 2003
, 2003
"... Making reliable distributed systems in the presence of software errors ..."

