Results 1 - 6 of 6
Fault Tolerance via Diversity for Off-The-Shelf Products: a Study with SQL Database Servers
Cited by 7 (2 self)
Abstract. If an off-the-shelf software product exhibits poor dependability due to design faults, software fault tolerance is often the only way available to users and system integrators to alleviate the problem. Thanks to low acquisition costs, even using multiple versions of software in a parallel architecture, a scheme formerly reserved for few and highly critical applications, may become viable for many applications. We have studied the potential dependability gains from these solutions for off-the-shelf database servers. We based the study on the bug reports available for four off-the-shelf SQL servers, plus later releases of two of them. We found that many of these faults cause systematic, noncrash failures, a category ignored by most studies and standard implementations of fault tolerance for databases. Our observations suggest that diverse redundancy would be effective for tolerating design faults in this category of products. Only in very few cases would demands that triggered a bug in one server cause failures in another one, and there were no coincident failures in more than two of the servers. Use of different releases of the same product would also tolerate a significant fraction of the faults. We report our results and discuss their implications, the architectural options available for exploiting them and the difficulties that they may present.
A structured approach to handling on-line interface upgrades
- Proceedings of the 26th Annual International Computer Software and Applications Conference (COMPSAC 2002), 2002
Cited by 6 (1 self)
The integration of complex systems out of existing systems is an active area of research and development. There are many practical situations in which the interfaces of the component systems, for example belonging to separate organisations, are changed dynamically and without notification. In this paper we propose an approach to handling such upgrades in a structured and disciplined fashion. All interface changes are viewed as abnormal events and general fault tolerance mechanisms (exception handling, in particular) are applied to dealing with them. The paper outlines general ways of detecting such interface upgrades and recovering after them. An Internet Travel Agency is used as a case study.
Low-Cost Flexible Software Fault Tolerance for Distributed Computing
- in Proceedings of the 12th International Symposium on Software Reliability Engineering (ISSRE 2001), (Hong Kong), 2001
Cited by 3 (2 self)
In this paper, we revisit the problem of software fault tolerance in distributed systems. In particular, we propose an extension of a message-driven confidence-driven (MDCD) protocol we have developed for error containment and recovery in a particular type of distributed embedded system. More specifically, we augment the original MDCD protocol by introducing the method of "fine-grained confidence adjustment," which enables us to remove the architectural restrictions. The dynamic nature of the MDCD approach gives it a number of desirable characteristics. First, this approach does not impose any restrictions on interactions among application software components or require costly message-exchange based process coordination/synchronization. Second, the algorithms allow redundancies to be applied only to low-confidence or critical interacting software components in a distributed system, permitting flexible realization of software fault tolerance. Finally, the dynamic error containment and recovery mechanisms are transparent to the application and ready to be implemented by generic middleware.
Structured Handling of On-Line Interface Upgrades in Integrating Dependable Systems of Systems
- in Proceedings of the Scientific Engineering for Distributed Java Applications International Workshop (FIDJI 2002), 2003
Cited by 2 (1 self)
Abstract. The integration of complex systems out of existing systems is an active area of research and development. There are many practical situations in which the interfaces of the component systems, for example belonging to separate organisations, are changed dynamically and without notification. Usually system of systems (SoS) developers deal with such situations off-line, causing considerable downtime and undermining the quality of the service that SoSs are delivering [Romanovsky & Smith 2002]. In this paper we propose an approach to handling such upgrades on-line in a structured and disciplined fashion. All interface changes are viewed as abnormal events and general fault tolerance mechanisms (exception handling, in particular) are applied to dealing with them. The paper outlines general ways of detecting such interface upgrades and recovering after them. An Internet Travel Agency is used as a case study throughout the paper. An implementation demonstrating how the general approach proposed can be applied for dealing with some of the possible interface upgrades within this case study is discussed.
Performability Analysis of Guarded-Operation Duration: A Successive Model-Translation Approach
- 2002
When making an engineering design decision, it is often necessary to consider its implications on both system performance and dependability. In this paper, we present a performability study that analyzes the guarded operation duration for onboard software upgrading. In particular, we define a "performability index" Y that quantifies the extent to which the guarded operation with a duration φ reduces the expected total performance degradation. In order to solve for Y, we progressively translate its formulation until it becomes an aggregate of constituent measures conducive to efficient reward model solutions. Based on the reward-mapping-enabled intermediate model, we specify reward structures in the composite base model which is built on three stochastic activity network reward models. We describe the model-translation approach and show its feasibility for design-oriented performability modeling.
Protecting Distributed Software Upgrades that Involve Message-Passing
We present in this paper an extension of the message-driven confidence-driven framework that we developed for onboard guarded software upgrading. The purpose of this work is to provide the framework with the capability of protecting distributed software upgrades that involve message-passing interface changes. To achieve this goal, we propose an approach to clustering the components involved in software upgrades and those involved in message-passing interface changes, such that from outside the cluster all those components can be perceived collectively as one virtual low-confidence component. Moreover, we develop a confidence-driven mechanism that enables combined use of sender- and receiver-side message logging for efficient, fine-grained error containment and recovery. The paper provides a detailed algorithm description.

