Results 1 - 10
of
12
Path-Based Failure and Evolution Management
- IN PROCEEDINGS OF THE INTERNATIONAL SYMPOSIUM ON NETWORKED SYSTEMS DESIGN AND IMPLEMENTATION (NSDI’04
, 2004
"... We present a new approach to managing failures and evolution in large, complex distributed systems using runtime paths. We use the paths that requests follow as they move through the system as our core abstraction, and our "macro" approach focuses on component interactions rather than the details of ..."
Abstract
-
Cited by 91 (5 self)
- Add to MetaCart
We present a new approach to managing failures and evolution in large, complex distributed systems using runtime paths. We use the paths that requests follow as they move through the system as our core abstraction, and our "macro" approach focuses on component interactions rather than the details of the components themselves. Paths record component performance and interactions, are user- and request-centric, and occur in sufficient volume to enable statistical analysis, all in a way that is easily reusable across applications. Automated statistical analysis of multiple paths allows for the detection and diagnosis of complex failures and the assessment of evolution issues. In particular, our approach enables significantly stronger capabilities in failure detection, failure diagnosis, impact analysis, and understanding system evolution. We explore these capabilities with three real implementations, two of which service millions of requests per day. Our contributions include the approach; the maintainable, extensible, and reusable architecture; the various statistical analysis engines; and the discussion of our experience with a high-volume production service over several years.
The CEO Problem
- IEEE Trans. Inform. Theory
, 1996
"... automated diagnosis, selfhealing and selfmonitoring systems, statistical induction and ..."
Abstract
-
Cited by 62 (3 self)
- Add to MetaCart
automated diagnosis, selfhealing and selfmonitoring systems, statistical induction and
Detecting Application-Level Failures in Component-based Internet Services
, 2004
"... Pinpoint is an application-generic framework for using statistical learning techniques to detect and localize likely application-level failures in component-based Internet services. Assuming that most of the system is working most of the time, Pinpoint looks for anomalies in low-level behaviors that ..."
Abstract
-
Cited by 53 (10 self)
- Add to MetaCart
Pinpoint is an application-generic framework for using statistical learning techniques to detect and localize likely application-level failures in component-based Internet services. Assuming that most of the system is working most of the time, Pinpoint looks for anomalies in low-level behaviors that are likely to reflect high-level application faults, and correlates these anomalies to their potential causes within the system. In our experiments, Pinpoint correctly detected and localized over 70-88% of the faults, depending on the type of fault, we injected into our testbed system, as compared to the 50-70% detected by current techniques. By demonstrating the applicability of statistical learning and providing an application-generic platform on which additional machine learning techniques can be applied to the problem of fast failure detection, we hope to hasten the adoption of statistical approaches to dependability for complex software systems.
Ensembles of models for automated diagnosis of system performance problems
- In DSN
, 2005
"... Violations of service level objectives (SLO) in Internet services are urgent conditions requiring immediate attention. Previously we showed [1] that Tree-Augmented Bayesian Networks or TAN models are effective at identifying which low-level system properties were correlated to high-level SLO violati ..."
Abstract
-
Cited by 27 (8 self)
- Add to MetaCart
Violations of service level objectives (SLO) in Internet services are urgent conditions requiring immediate attention. Previously we showed [1] that Tree-Augmented Bayesian Networks or TAN models are effective at identifying which low-level system properties were correlated to high-level SLO violations (the metric attribution problem) under stable workloads. In this paper we extend our approach to adapt to changing workloads and external disturbances by maintaining an ensemble of probabilistic models, adding new models when existing ones do not accurately capture current system behavior. Using realistic workloads on an implemented prototype system, we show that the ensemble of TAN models captures the performance behavior of the system accurately under changing workloads and conditions. We fuse diagnoses from the ensemble of models to identify likely causes of the performance problem, with results comparable to those produced by an oracle that continuously changes the model based on advance knowledge of the workload. The cost of inducing new models and managing the ensembles is negligible, making our approach both immediately practical and theoretically appealing.
Deconstructing Commodity Storage Clusters
- In Proceedings of the 32nd Annual International Symposium on Computer Architecture (ISCA ’05
, 2005
"... The traditional approach for characterizing complex systems is to run standard workloads and measure the resulting performance as seen by the end user. However, unique opportunities exist when characterizing a system that is itself constructed from standardized components: one can also look inside t ..."
Abstract
-
Cited by 25 (8 self)
- Add to MetaCart
The traditional approach for characterizing complex systems is to run standard workloads and measure the resulting performance as seen by the end user. However, unique opportunities exist when characterizing a system that is itself constructed from standardized components: one can also look inside the system itself by instrumenting each of the components. In this paper, we show how intra-box instrumentation can help one understand the behavior of a large-scale storage cluster, the EMC Centera. In our analysis, we leverage standard tools for tracing both the disk and network traffic emanating from each node of the cluster. By correlating this traffic with the running workload, we are able to infer the structure of the software system (e.g., its write update protocol) as well as its policies (e.g., how it performs caching, replication, and load-balancing). Further, by imposing variable intra-box delays on network and disk traffic, we can confirm the causal relationships between network and disk events. Thus, we are able to infer the semantics of the messages between nodes without examining a single line of source code. 1
Problem diagnosis in large-scale computing environments
- In ACM/IEEE Conf. on Supercomputing (SC
, 2006
"... ..."
Towards Realistic File-System Benchmarks with CodeMRI
"... Benchmarks are crucial to understanding software systems and assessing their performance. In file-system research, synthetic benchmarks are accepted and widely used as substitutes for more realistic and complex workloads. However, synthetic benchmarks are largely based on the benchmark writer’s inte ..."
Abstract
-
Cited by 3 (2 self)
- Add to MetaCart
Benchmarks are crucial to understanding software systems and assessing their performance. In file-system research, synthetic benchmarks are accepted and widely used as substitutes for more realistic and complex workloads. However, synthetic benchmarks are largely based on the benchmark writer’s interpretation of the real workload, and how it exercises the system API. This is insufficient since even a simple operation through the API may end up exercising the file system in very different ways due to effects of features such as caching and prefetching. In this paper, we describe our first steps in creating “realistic synthetic ” benchmarks by building a tool, CodeMRI. CodeMRI leverages file-system domain knowledge and a small amount of system profiling in order to better understand how the benchmark is stressing the system and to deconstruct its workload. 1
Model-Based Validation for Internet Services
"... Abstract—Operator mistakes are a significant source of unavailability in Internet services. In our previous work, we proposed operator action validation as an approach for detecting mistakes while hiding them from the service and its users. Previous validation strategies have limitations, however, i ..."
Abstract
-
Cited by 2 (1 self)
- Add to MetaCart
Abstract—Operator mistakes are a significant source of unavailability in Internet services. In our previous work, we proposed operator action validation as an approach for detecting mistakes while hiding them from the service and its users. Previous validation strategies have limitations, however, including the need for instances of correct behavior for comparison. In this paper, we propose a novel model-based validation strategy that addresses these limitations and complements our previous techniques. Model-based validation calls for service engineers to define models of Internet services that can be used to differentiate between correct and incorrect configurations and behaviors. These models are then used to guide the specification of validation assertions that check the correctness of operator actions before they are exposed. We have implemented a prototype modelbased validation system for two services, the Web crawler of a commercial search engine (Ask.com) and an academic yet realistic online auction service. Experimentation with modelbased validation demonstrates that it is highly effective at detecting and hiding both activated and latent mistakes. Keywords-validation; model; operator mistake; internet service; I.
Statistical Monitoring + Predictable Recovery = Self-*
"... It is by now motherhood-and-apple-pie that complex distributed Internet services form the basis not only of ecommerce but increasingly of mission-critical networkbased applications. What is new is that the workload and internal architecture of three-tier enterprise applications presents the opportun ..."
Abstract
- Add to MetaCart
It is by now motherhood-and-apple-pie that complex distributed Internet services form the basis not only of ecommerce but increasingly of mission-critical networkbased applications. What is new is that the workload and internal architecture of three-tier enterprise applications presents the opportunity for a new approach to keeping them running in the face of both "natural" failures and adversarial attacks. The core of the approach is anomaly detection and localization based on statistical machine learning techniques. Unlike previous approaches, we propose anomaly detection and pattern mining not only for operational statistics such as mean response time, but also for structural behaviors of the system---what parts of the system, in what combinations, are being exercised in response to different kinds of external stimuli. In addition, rather than building baseline models a priori, we extract them by observing the behavior of the system over a short period of time during normal operation. We explain the necessary underlying assumptions and why they can be realized by systems research, report on some early successes using the approach, describe benefits of the approach that make it competitive as a path toward selfmanaging systems, and outline some research challenges. Our hope is that this approach will enable "new science" in the design of self-managing systems by allowing the rapid and widespread application of statistical learning theory techniques (SLT) to problems of system dependability. 1 Recovery as Rapid Adaptation A "five nines" availability service (99.999% uptime) can be down only five minutes a year. Putting a human in the critical path to recovery would expend that entire budget on a single incident, hence the increasing interest in self-managing or so-cal...

