Consistability: Describing usually consistent systems
Cited by 6 (0 self)
Abstract:
Current weak consistency semantics provide worst-case guarantees to clients. These guarantees fail to adequately describe systems that provide varying levels of consistency in the face of distinct failure modes, or that achieve better than worst-case guarantees during normal execution. The inability to make precise statements about consistency throughout a system’s execution represents a lost opportunity to clearly understand client application requirements and to optimize systems and services appropriately. In this position paper, we motivate the need for and introduce the concept of consistability, a unified metric of consistency and availability. Consistability offers a means of describing, specifying, and discussing how much consistency a usually consistent system provides, and how often it does so. We describe our initial results of applying consistability reasoning to a key-value store we are developing and to other recent distributed systems. We also discuss the limitations of our consistability definition.
Fault tolerance techniques for the Merrimac streaming supercomputer
In Proceedings of the 2005 ACM/IEEE Conference on Supercomputing, 2005
Cited by 3 (2 self)
Abstract:
As device scales shrink, higher transistor counts are available while soft errors, even in logic, become a major concern. A new class of architectures, such as Merrimac and the IBM Cell, takes advantage of the higher transistor count by exposing control, communication, and a large number of functional units at the architectural level, thus achieving high performance and efficiency. This paper explores soft-error fault tolerance in the context of these compute-intensive architectures, which differ significantly from their control-intensive CPU counterparts. The main goal of the proposed schemes for Merrimac is to conserve the critical and costly off-chip bandwidth and on-chip storage resources, while maintaining high peak and sustained performance. We achieve this by allowing for reconfigurability and relying on programmer input. The processor is run either at full peak performance employing software fault-tolerance methods, or at reduced performance with hardware redundancy. We present several methods, their analysis, and detailed case studies.
Measuring Availability in Optimistic Partition-tolerant Systems with Data Constraints
In Proceedings of the 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, 2007
Cited by 2 (1 self)
Abstract:
Replicated systems that run over partitionable environments can exhibit increased availability if isolated partitions are allowed to optimistically continue their execution independently. This availability gain is traded against consistency, since several replicas of the same objects could be updated separately. Once partitioning terminates, divergences in the replicated state need to be reconciled. One way to reconcile the state is to let the application resolve inconsistencies manually. However, there are several situations where automatic reconciliation of the replicated state is meaningful. We have implemented replication and automatic reconciliation protocols that can be used as building blocks in a partition-tolerant middleware. The novelty of the protocols is that they continue to serve the application even during the reconciliation process. A prototype system is experimentally evaluated to illustrate the increased availability despite network partitions.
Middleware Extensions that Trade Consistency for Availability
Abstract:
Replicated distributed object systems are deployed to provide timely and reliable services to actors at distributed locations. This paper treats applications in which data updates depend on the satisfaction of integrity constraints over multiple objects. We propose a means of achieving higher availability by providing partition-awareness in middleware. The general approach has been illustrated by implementing a number of CORBA extensions that trade consistency for availability during network partitions. This paper contains a thorough experimental evaluation that shows the gains and costs of our approach. The experiments clearly illustrate the benefit of our protocols in terms of significantly higher availability and number of performed operations.
Keywords: middleware, fault tolerance, network partitions
University advisor(s):
2009
Abstract:
Master of Science in Computer Science. The thesis is equivalent to 20 weeks of full-time studies. Contact Information: Author(s):
Exploring Scaling Limits and Computational Paradigms for Next Generation Embedded Systems
2009
Mean time to meaningless: MTTDL, Markov models, and storage system reliability
Abstract:
Mean Time To Data Loss (MTTDL) has been the standard reliability metric in storage systems for more than 20 years. MTTDL represents a simple formula that can be used to compare the reliability of small disk arrays and to perform comparative trending analyses. The MTTDL metric is often misused, with egregious examples relying on the MTTDL to generate reliability estimates that span centuries or millennia. Moving forward, the storage community needs to replace MTTDL with a metric that can be used to accurately compare the reliability of systems in a way that reflects the impact of data loss in the real world.
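To illustrate how an MTTDL figure can come to span millennia, the sketch below evaluates the widely used textbook approximation for an array that tolerates a single disk failure, MTTDL ≈ MTTF² / (N · (N − 1) · MTTR). This formula is an assumption for illustration and is not taken from the paper; the function name and the input values (8 disks, a 1,000,000-hour MTTF, a 24-hour repair time) are hypothetical.

```python
HOURS_PER_YEAR = 24 * 365


def mttdl_single_parity(n_disks: int, mttf_hours: float, mttr_hours: float) -> float:
    """Textbook MTTDL approximation (in hours) for an array that survives
    one disk failure. Assumes independent, exponentially distributed
    failures and repairs; these assumptions are illustrative only."""
    if n_disks < 2:
        raise ValueError("need at least two disks")
    if mttf_hours <= 0 or mttr_hours <= 0:
        raise ValueError("MTTF and MTTR must be positive")
    # Data is lost when a second disk fails while the first is being repaired.
    return mttf_hours ** 2 / (n_disks * (n_disks - 1) * mttr_hours)


# Hypothetical inputs, chosen only to show the order of magnitude.
mttdl_hours = mttdl_single_parity(n_disks=8, mttf_hours=1_000_000, mttr_hours=24)
mttdl_years = mttdl_hours / HOURS_PER_YEAR
print(f"MTTDL is roughly {mttdl_years:,.0f} years")
```

With these inputs the estimate lands in the tens of thousands of years, far longer than any array's service life, which is the kind of figure the abstract objects to.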
Failure Diagnosis of Complex Systems
Abstract:
Failure diagnosis is the process of identifying the causes of impairment in a system’s function based on observable symptoms, i.e., determining which fault led to an observed failure. Since multiple faults can often lead to very similar symptoms, failure diagnosis is often the first line of defense when things go wrong, and a prerequisite before any corrective actions can be undertaken. The results of diagnosis also provide data about a system’s operational fault profile for use in offline resilience evaluation. While diagnosis has historically been a largely manual process requiring significant human input, techniques to automate as much of the process as possible have grown significantly in importance in many industries, including telecommunications, internet services, automotive systems, and aerospace. This chapter presents a survey of automated failure diagnosis techniques, including both model-based and model-free approaches. Industrial applications of these techniques in the above domains are presented, and finally, future trends and open challenges in the field are discussed.

