The Time-Triggered Architecture
- PROCEEDINGS OF THE IEEE, 2003
- Cited by 157 (10 self)
The time-triggered architecture (TTA) provides a computing infrastructure for the design and implementation of dependable distributed embedded systems. A large real-time application is decomposed into nearly autonomous clusters and nodes, and a fault-tolerant global time base of known precision is generated at every node. In the TTA, this global time is used to precisely specify the interfaces among the nodes, to simplify the communication and agreement protocols, to perform prompt error detection, and to guarantee the timeliness of real-time applications. The TTA supports a two-phased design methodology: architecture design followed by component design. During the architecture design phase, the interactions among the distributed components and the interfaces of the components are fully specified in the value domain and in the temporal domain. In the succeeding component implementation phase, the components are built, taking these interface specifications as constraints. This two-phased design methodology is a prerequisite for the composability of applications implemented in the TTA and for the reuse of prevalidated components within the TTA. This paper presents the architecture model of the TTA, explains the design rationale, discusses the time-triggered communication protocols TTP/C and TTP/A, and illustrates how transparent fault tolerance can be implemented in the TTA.
The Infeasibility of Quantifying the Reliability of Life-Critical Real-Time Software
- IEEE Transactions on Software Engineering, 1993
- Cited by 103 (1 self)
This paper affirms that the quantification of life-critical software reliability is infeasible using statistical methods whether applied to standard software or fault-tolerant software. The classical methods of estimating reliability are shown to lead to exorbitant amounts of testing when applied to life-critical software. Reliability growth models are examined and also shown to be incapable of overcoming the need for excessive amounts of testing. The key assumption of software fault tolerance---separately programmed versions fail independently---is shown to be problematic. This assumption cannot be justified by experimentation in the ultrareliability region and subjective arguments in its favor are not sufficiently strong to justify it as an axiom. Also, the implications of the recent multiversion software experiments support this affirmation. Index Terms---Life-Critical, Validation, Software Reliability, Design Error, Ultrareliability, Software Fault-Tolerance 1 Introducti...
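The scale of testing that the abstract calls excessive can be illustrated with a small calculation. This is a sketch under stated assumptions, not the paper's own derivation: it assumes a constant failure rate (exponential model), a test campaign with zero observed failures, and an illustrative target of 1e-9 failures per hour at 99% confidence. The function name and the target figures are hypothetical choices made for the example.

```python
import math


def required_test_hours(max_failure_rate, confidence):
    """Failure-free test hours needed to claim the rate is below max_failure_rate.

    Assumption: constant failure rate (exponential model) and zero failures
    observed during the test. Under that model, P(no failure in T) = exp(-rate * T),
    so T = -ln(1 - confidence) / max_failure_rate.
    """
    return -math.log(1.0 - confidence) / max_failure_rate


# Illustrative target only: 1e-9 failures per hour at 99% confidence.
hours = required_test_hours(1e-9, 0.99)
years = hours / (24 * 365)
print(f"{hours:.3g} hours, about {years:.3g} years on a single test unit")
```

Under these assumptions the required duration is on the order of billions of hours, so even running many test units in parallel does not bring the campaign into a practical range.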
Fault Tolerance in Concurrent Object-Oriented Software through Coordinated Error Recovery
- FTCS-25 SUBMISSION
- Cited by 85 (41 self)
This paper presents a scheme for coordinated error recovery between multiple interacting objects in a concurrent object-oriented system. A conceptual framework for fault tolerance is established based on a general object concurrency model that is supported by most concurrent object-oriented languages and systems. This framework integrates two complementary concepts — conversations and transactions. Conversations (associated with cooperative exception handling) are used to provide coordinated error recovery between concurrent interacting activities whilst transactions are used to maintain the consistency of shared resources in the presence of concurrent access. The serialisability property of transactions is exploited in order to help prevent unexpected information smuggling. The proposed framework is illustrated by means of a case study, and various linguistic and implementation issues are discussed.
Treating bugs as allergies -- a safe method to survive software failures
- In SOSP, 2005
- Cited by 69 (6 self)
Many applications demand availability. Unfortunately, software failures greatly reduce system availability. Previous approaches for surviving software failures suffer from several limitations, including requiring application restructuring, failing to address deterministic software bugs, unsafely speculating on program execution, and requiring a long recovery time. This paper
The Infeasibility of Experimental Quantification of Life-Critical Software Reliability
- IEEE Transactions on Software Engineering, 1991
- Cited by 56 (2 self)
This paper affirms that quantification of life-critical software reliability is infeasible using statistical methods whether applied to standard software or fault-tolerant software. The key assumption of software fault tolerance---separately programmed versions fail independently---is shown to be problematic. This assumption cannot be justified by experimentation in the ultrareliability region and subjective arguments in its favor are not sufficiently strong to justify it as an axiom. Also, the implications of the recent multiversion software experiments support this affirmation. Index Terms: LIFE-CRITICAL, VALIDATION, SOFTWARE RELIABILITY, DESIGN ERROR, ULTRARELIABILITY, SOFTWARE FAULT-TOLERANCE, 1 Introduction The potential of enhanced flexibility and functionality has led to an ever increasing use of digital computer systems in control applications. At first, the digital systems were designed to perform the same functions as their analog counterparts. However, the availability of en...
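To see why the independence assumption mentioned in the abstract matters so much, consider a three-version system with majority voting. The sketch below is an illustration only, not a model taken from the paper: it assumes each version fails on a given input with the same probability p, and the function name and the value p = 1e-3 are hypothetical.

```python
def majority_failure_probability(p):
    """Probability that at least 2 of 3 versions fail on the same input.

    Assumption: the three versions fail independently, each with probability p.
    Exactly two fail: 3 * p^2 * (1 - p); all three fail: p^3.
    """
    return 3 * p**2 * (1 - p) + p**3


p = 1e-3  # illustrative per-version failure probability
independent = majority_failure_probability(p)
# If failures were perfectly correlated instead, all versions would fail
# together and the voted system would fail with probability p itself.
fully_correlated = p
print(independent, fully_correlated)
```

The claimed gain of several orders of magnitude exists only if independence holds; any correlation between version failures moves the system's failure probability back toward that of a single version.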
Highly Reliable Upgrading of Components
- In Proceedings of the 21st International Conference on Software Engineering, 1999
- Cited by 53 (6 self)
After a system is deployed, fixes, enhancements, and modifications all occur that change the components that make up the system. Unfortunately, new versions of components can introduce new errors and break existing, depended-upon behavior. When this happens, the old component version could have provided the correct behavior, but it is no longer part of the system. We propose a framework for upgrading system components that, instead of removing the old version of the component, keeps multiple versions of a component running. Doing so allows behavior to be utilized from all versions, and maintains system integrity and correctness even in the presence of newly introduced errors. This framework ensures that the move towards dynamic, configurable software systems does not lessen, but rather provides capabilities to enhance, the reliability that software will achieve through the next century. 1 INTRODUCTION Users fear upgrades. This unfortunate but true statement reflects the current para...
Adaptive Distributed and Fault-Tolerant Systems
- International Journal of Computer Systems Science and Engineering, 1995
- Cited by 52 (5 self)
An adaptive computing system is one that modifies its behavior based on changes in the environment. Since sites connected by a local-area network inherently have to deal with network congestion and the failure of other sites, distributed systems can be viewed as an important subclass of adaptive systems. As such, use of adaptive methods in this context has the same potential advantages of improved efficiency and structural simplicity as for adaptive systems in general. This paper describes a model for adaptive systems that can be applied in many scenarios arising in distributed and fault-tolerant systems. This model divides the adaptation process into three different phases---change detection, agreement, and action---that can be used to describe existing algorithms that deal with change, as well as to develop new adaptive algorithms. In addition to clarifying the logical structure of such algorithms, this model can also serve as a unifying implementation framework. Several ad...
Analysis of Software Rejuvenation using Markov Regenerative Stochastic Petri Net
- 1995
- Cited by 37 (18 self)
In a client-server type system, the server software is required to run continuously for very long periods. Due to repeated and potentially faulty usage by many clients, such software "ages" with time and eventually fails. Huang et al. proposed a technique called "software rejuvenation" [9] in which the software is periodically stopped and then restarted in a "robust" state after proper maintenance. This "renewal" of software prevents (or at least postpones) the crash failure. As the time lost (or the cost incurred) due to the software failure is typically more than the time lost (or the cost incurred) due to rejuvenation, the technique reduces the expected unavailability of the software. In this paper, we present a quantitative analysis of software rejuvenation. The behavior of the system is represented through a Markov Regenerative Stochastic Petri Net (MRSPN) model which is solved both for steady state as well as transient conditions. We provide a closed-form analytical solution for ...
Coordinated Atomic Actions: from Concept to Implementation
- 1997
- Cited by 37 (18 self)
The Coordinated Atomic Action (or CA action) concept is a unified scheme for coordinating complex concurrent activities and supporting error recovery between multiple interacting objects in a distributed object-oriented system. It provides a conceptual framework for dealing with different kinds of concurrency and achieving fault tolerance by extending and integrating two complementary concepts --- conversations and transactions. Conversations (enhanced with concurrent exception handling) are used to control cooperative concurrency and to implement coordinated error recovery whilst transactions are used to maintain the consistency of shared resources in the presence of failures and competitive concurrency. This paper explains the CA action concept in detail and then addresses related design issues such as multi-thread coordination, exception handling and resolution, coordinated access to shared objects and provision of software fault tolerance. Finally, brief details are given of a numb...
Improving the N-Version Programming Process Through the Evolution of a Design Paradigm
- 1993
- Cited by 33 (12 self)
To encourage a practical application of the N-Version Programming (NVP) technique, a design paradigm was proposed and applied in a Six-Language Project. The design paradigm improved the development effort of the N-Version Software (NVS); however, there were some deficiencies of the design paradigm which led to the leak of a pair of coincident faults. In this paper, we report a similar experiment conducted by using a revised NVP design paradigm, identify its impact on the software development process, and demonstrate the improvement of the resulting NVS product. This project reused the revised specification of an automatic airplane landing problem, and involved 40 students at the University of Iowa and Rockwell International. Guided by the refined NVS development paradigm, the students formed 15 independent programming teams to design, program, test, and evaluate the application. The insights, experiences, and learnings in conducting this project are presented. Several q...

