Commercial Fault Tolerance: A Tale of Two Systems
IEEE Transactions on Dependable and Secure Computing, 2004
Cited by 37 (0 self)
This paper compares and contrasts the design philosophies and implementations of two computer system families: the IBM S/360 and its evolution to the current zSeries line, and the Tandem (now HP) NonStop Server. Both systems have a long history; the initial IBM S/360 machines were shipped in 1964, and the Tandem NonStop System was first shipped in 1976. They were aimed at similar markets, what would today be called enterprise-class applications. The requirement for the original S/360 line was for very high availability; the requirement for the NonStop platform was for single fault tolerance against unplanned outages. Since their initial shipments, availability expectations for both platforms have continued to rise and the system designers and developers have been challenged to keep up. There were and still are many similarities in the design philosophies of the two lines, including the use of redundant components and extensive error checking. The primary difference is that the S/360-zSeries focus has been on localized retry and restore to keep processors functioning as long as possible, while the NonStop developers have based systems on a loosely coupled multiprocessor design that supports a “fail-fast” philosophy implemented through a combination of hardware and software, with workload being actively taken over by another resource when one fails. Index Terms: computer systems implementation, fault tolerance, high availability.
Automatic Configuration of Internet Services
2007
Cited by 8 (5 self)
Recent research has found that operators frequently misconfigure Internet services, causing various availability and performance problems. In this paper, we propose a software infrastructure that eliminates several types of misconfiguration by automating the generation of configuration files in Internet services, even as the services evolve. The infrastructure comprises a custom scripting language, configuration file templates, communicating runtime monitors, and heuristic algorithms to detect dependencies between configuration parameters and select ideal configurations. To demonstrate our infrastructure experimentally, we apply it to a realistic online auction service. Our results show that the infrastructure can simplify operation significantly while eliminating 58% of the misconfigurations found in a previous study of the same service. Furthermore, our results show that the infrastructure can efficiently determine the configuration parameters that lead to high performance as the service evolves through a hardware upgrade and the scheduled maintenance of a few nodes.
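As a rough illustration of the template idea mentioned in this abstract, the sketch below renders one configuration stanza per node from a single template, so that adding a node never requires hand-editing the file. It is not the paper's infrastructure or scripting language: the `worker.*` key format, the node fields, and the function name `render_workers` are all assumptions made for the example.

```python
from string import Template

# Hypothetical stanza format; the paper's actual templates are not shown in the abstract.
WORKER_TEMPLATE = Template("worker.$name.host=$host\nworker.$name.port=$port\n")


def render_workers(nodes):
    """Render one config stanza per node from the shared template."""
    return "".join(
        WORKER_TEMPLATE.substitute(name=n["name"], host=n["host"], port=n["port"])
        for n in nodes
    )


# Assumed node inventory; in a real deployment this would come from monitoring.
nodes = [
    {"name": "app1", "host": "10.0.0.1", "port": 8009},
    {"name": "app2", "host": "10.0.0.2", "port": 8009},
]
config_text = render_workers(nodes)
```

Because every stanza comes from the same template, a typo in one node's entry cannot diverge from the others, which is the kind of misconfiguration that generated files are meant to rule out.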
Probing and Monitoring of WSBPEL Processes with Web Services
Cited by 5 (0 self)
Today’s business climate requires organizations to constantly evolve IT strategies to respond to new opportunities or threats. Tracking the achievement of business goals, objectives and strategies is increasingly used to measure and adjust the outcome of business processes. In this paper, we introduce a web-service-based approach for probing WSBPEL processes. With our approach, organizations are able to automatically extend existing WSBPEL processes with auditing extensions which capture audit information during process execution. We show how to transform a WSBPEL model into an auditable model which can be used for process monitoring purposes. Based on our experience building an auditable WSBPEL model, we propose some extensions to the WSBPEL specification.
Case-based reasoning for autonomous service failure diagnosis and remediation in software systems
Proc. European Conference on Case-Based Reasoning (ECCBR) 2006, Lecture Notes in Artificial Intelligence 4106, 2006
Cited by 2 (1 self)
Self-healing, one of the four key properties characterizing Autonomic Systems, aims to enable large-scale software systems delivering complex services on a 24/7 basis to meet their goals without any human intervention. Achieving self-healing requires the elicitation and maintenance of domain knowledge in the form of 〈service failure diagnosis, remediation strategy〉 patterns, a task which can be overwhelming. Case-Based Reasoning (CBR) is a lazy learning paradigm that largely reduces this kind of knowledge acquisition bottleneck. Moreover, the application of CBR for failure diagnosis and remediation in software systems appears to be very suitable, as in this domain most errors are re-occurrences of known problems. In this paper, we describe a CBR approach for providing large-scale, distributed software systems with self-healing capabilities, and demonstrate the practical applicability of our methodology by means of some experimental results on a real world application.
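The retrieve-and-reuse step of case-based reasoning can be sketched in a few lines. This is a generic illustration, not the paper's system: the symptom names, the remediation labels, the Jaccard similarity measure, and the `retrieve` function are all assumptions chosen for the example.

```python
def similarity(a, b):
    """Jaccard similarity between two sets of observed symptoms."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical case base of (failure symptoms, remediation strategy) pairs.
CASE_BASE = [
    ({"http_500", "db_timeout"}, "restart_db_connection_pool"),
    ({"high_latency", "cpu_saturated"}, "add_app_server"),
    ({"http_500", "disk_full"}, "rotate_logs"),
]


def retrieve(symptoms, case_base=CASE_BASE):
    """Return the stored case whose symptoms best match the new failure."""
    return max(case_base, key=lambda case: similarity(symptoms, case[0]))


# A new failure shares two of its three symptoms with the first stored case.
matched_symptoms, remediation = retrieve({"http_500", "db_timeout", "high_latency"})
```

No model is trained in advance; the work happens at query time by comparing the new failure against stored cases, which is what "lazy learning" refers to. A fuller CBR cycle would also revise the reused remediation and retain the new case.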
Achieving Self-Healing in Service Delivery Software Systems by Means of Case-Based Reasoning
Cited by 2 (0 self)
Self-healing, i.e. the capability of a system to autonomously detect failures and recover from them, is a very attractive property that may enable large-scale software systems, aimed at delivering services in a 24/7 fashion, to meet their goals with little or no human intervention. Achieving self-healing requires the elicitation and maintenance of domain knowledge in the form of 〈service failure diagnosis, repair plan〉 patterns, a task which can be overwhelming. Case-Based Reasoning (CBR) is a lazy learning paradigm that largely reduces this kind of knowledge acquisition bottleneck. Moreover, the application of CBR for failure diagnosis and remediation in software systems appears to be very suitable, as in this domain most errors are re-occurrences of known problems. In this paper, we describe a CBR approach for providing large-scale, distributed software systems with self-healing capabilities, and demonstrate the practical applicability of our methodology by means of some experimental results on a real world application.
Coordinating Human Operators and Computer Agents for Recovery-Oriented Computing
In Proceedings of the International Conference on Information Reuse and Integration, 2004
Cited by 2 (0 self)
This paper examines the errors committed by human operators of large networks and systems. It proposes a formal procedure in which system defense mechanisms are used to improve the coordination between human operators and computer agents. Further, it discusses and compares the effectiveness of different types of system defense mechanisms by performing experiments with web-based GUI screens. In the process, the paper offers definitions of human errors and proposes methods to quantify such errors. Our experimental results have shown that more layers of system defense can play a pivotal role in minimizing commonly encountered human errors.
Achieving Service Dependability Through Context-Awareness
By large-scale services (LSS) we mean IT services that are deployed over unbounded large-scale infrastructures. Typical examples may be found in e-commerce scenarios, in computational Grids, or in ubiquitous computing environments. The more such services are composed on the fly from other services, the more dependability becomes a challenge. Service dependability is about error processing and fault tolerance. There are operational similarities between error processing and context provisioning in ubiquitous computing. In this paper we explore the benefits of applying context-awareness to the dependable composition of services. Based on an e-commerce scenario, we analyze the similarities between error processing and context provisioning, and we also show how the individual phases of these processes translate into each other. We address how service dependability may benefit from context-awareness, and finally we report on a prototypical implementation of a Web Services based context provisioning framework.
1 © RAMS Consultants, India www.ramsconsultants.org
This paper summarizes the state of knowledge and ongoing research on methods and techniques for resilience evaluation, taking into account the resilience-scaling challenges and properties of ubiquitous computerized systems. We mainly focus on quantitative evaluation approaches and, in particular, on model-based evaluation techniques that are commonly used to evaluate and compare, from the dependability point of view, different architecture alternatives at the design stage. We outline some of the main modeling techniques aimed at mastering the largeness of analytical dependability models at the construction level. Addressing the model largeness problem is important for investigating whether current techniques scale to meet the complexity challenges of ubiquitous systems. Finally, we present two case studies in which some of the presented techniques are applied to modeling web services and General Packet Radio Service (GPRS) mobile telephone networks, as prominent examples of large and evolving systems. Key Words: dependability, ubiquitous systems, stochastic modeling, evaluation

