Making Reliable Distributed Systems in the Presence of Software Errors, 2003
Cited by 42 (0 self)
product, having over a million lines of Erlang code. This product (the AXD301) is thought to be one of the most reliable products ever made by Ericsson.
On the effectiveness of a message-driven confidence-driven protocol for guarded software upgrading - Performance Evaluation, 2001
Cited by 11 (7 self)
To accomplish dependable onboard evolution, we develop a methodology called guarded software upgrading (GSU). The core of the methodology is a low-cost error containment and recovery protocol that escorts an upgraded software component through onboard validation and guarded operation, safeguarding mission functions. The message-driven confidence-driven (MDCD) nature of the protocol eliminates the need for costly process coordination or atomic action, yet guarantees that the system reaches a consistent global state upon completion of the rollback or roll-forward actions carried out by individual processes during error recovery. To validate the effectiveness of the MDCD protocol in enhancing system reliability in a realistic, non-ideal execution environment when a software component undergoes onboard upgrading, we conduct a stochastic activity network model based analysis. The results confirm the effectiveness of the protocol as originally surmised. Moreover, the model-based analysis provides useful insights into the system behavior resulting from the use of the protocol under various conditions in its execution environment, facilitating effective use of the protocol.
Design of a fault-tolerant COTS-based bus architecture - IEEE Trans. Reliability, 1999
Cited by 9 (6 self)
The high-performance, scalability and miniaturization requirements together with the power, mass and cost constraints mandate the use of commercial-off-the-shelf (COTS) components and standards in the X2000 avionics system architecture for deep-space missions. In this paper, we report our experiences and findings on the design of an IEEE 1394 compliant fault-tolerant COTS-based bus architecture. While the COTS standard IEEE 1394 adequately supports power management, high performance and scalability, its topological criteria impose restrictions on fault tolerance realization. To circumvent the difficulties, we derive a “stack-tree” topology that not only complies with the IEEE 1394 standard but also facilitates fault tolerance realization in a spaceborne system with limited dedicated resource redundancies. Moreover, by exploiting pertinent standard features of the 1394 interface which are not purposely designed for fault tolerance, we devise a comprehensive set of fault detection mechanisms to support the fault-tolerant bus architecture.
A Comparative Analysis of Hardware and Software Fault Tolerance: Impact on Software Reliability Engineering, 1999
Cited by 9 (2 self)
this paper, we focus on methods of fault tolerance, and investigate the differences between hardware fault tolerance and software fault tolerance.
On-Board Preventive Maintenance: A Design-Oriented Analytic Study for Long-Life Applications, 1999
Cited by 5 (2 self)
With respect to the long-life missions associated with NASA's X2000 Advanced Deep-Space System Development Program, reliability implies a system's continuous operation for many years in an unsurveyed radiation-intense environment. Further, the stringent constraints on the mass of a spacecraft and the power on board create unprecedented challenges for the means of achieving ultra-high mission reliability. In this paper, we present an approach to on-board preventive maintenance which rejuvenates a system by letting system components rotate between on-duty and off-duty shifts, slowing down a system's aging process and thus enhancing mission reliability. By exploiting nondedicated system redundancy, hardware and software rejuvenation are realized simultaneously without significant performance penalty. Our design-oriented analysis confirms a potential for significant gains in mission reliability from on-board preventive maintenance and provides us with useful insights about the collective effe...
Onboard Guarded Software Upgrading: Motivation and Framework
Abstract — The goal of the guarded software upgrading (GSU) framework is to minimize mission performance loss due to onboard software upgrading activities and that due to system failure caused by residual faults in an upgraded version. We exploit inherent system resource redundancies as the means of fault tolerance to meet the development cost and onboard resource constraints. Furthermore, we devise a message-driven confidence-driven protocol to
On-Board Guarded Software Upgrading for Space Missions - in Proceedings of the 18th Digital Avionics Systems Conference, 1999
this paper was supported in part by Small Business Innovation Research (SBIR) Contract NAS399125 from Jet Propulsion Laboratory, National Aeronautics and Space Administration
On Low-Cost Error Containment and Recovery Methods for Guarded Software Upgrading - in Proceedings of the 20th International Conference on Distributed Computing Systems (ICDCS 2000), 2000
To assure dependable onboard evolution, we have developed a methodology called guarded software upgrading (GSU). In this paper, we focus on a low-cost approach to error containment and recovery for GSU. To ensure low development cost, we exploit inherent system resource redundancies as the fault tolerance means. To mitigate the effect of residual software faults at low performance cost, we take a crucial step in devising error containment and recovery methods by introducing the "confidence-driven" notion. This notion complements the message-driven (or "communication-induced") approach employed by a number of existing checkpointing protocols for tolerating hardware faults. In particular, we discriminate between the individual software components with respect to our confidence in their reliability, and keep track of changes in our confidence (due to knowledge about potential process state contamination) in particular processes. This, in turn, enables the individual processes in the spaceborne distributed system to make decisions locally, at run-time, on whether to establish a checkpoint upon message passing and whether to roll back or roll forward during error recovery. The resulting message-driven confidence-driven approach enables cost-effective checkpointing and cascading-rollback-free recovery.
Printed by Universitetsservice US-AB, 2003
"... Making reliable distributed systems in the presence of software errors ..."

