Results 1 - 10 of 39
Sympathy for the sensor network debugger
- In SenSys, 2005
Cited by 83 (3 self)
Being embedded in the physical world, sensor networks present a wide range of bugs and misbehavior qualitatively different from those in most distributed systems. Unfortunately, due to resource constraints, programmers must investigate these bugs with only limited visibility into the application. This paper presents the design and evaluation of Sympathy, a tool for detecting and debugging failures in sensor networks. Sympathy has selected metrics that enable efficient failure detection, and includes an algorithm that root-causes failures and localizes their sources in order to reduce overall failure notifications and point the user to a small number of probable causes. We describe Sympathy and evaluate its performance through fault injection and by debugging an active application, ESS, in simulation and deployment. We show that for a broad class of data gathering applications, it is possible to detect and diagnose failures by collecting and analyzing a minimal set of metrics at a centralized sink. We have found that there is a tradeoff between notification latency and detection accuracy; that additional metrics traffic does not always improve notification latency; and that Sympathy’s process of failure localization reduces primary failure notifications by at least 50% in most cases.
Capturing, indexing, clustering, and retrieving system history
- In SOSP, 2005
Cited by 65 (5 self)
Keywords: system performance, Bayesian networks, information retrieval, problem signatures. We present a method for automatically extracting from a running system an indexable signature that distills the essential characteristic from a system state and that can be subjected to automated clustering and similarity-based retrieval to identify when an observed system state is similar to a previously-observed state. This allows operators to identify and quantify the frequency of recurrent problems, to leverage previous diagnostic efforts, and to establish whether problems seen at different installations of the same site are similar or distinct. We show that the naive approach to constructing these signatures based on simply recording the actual "raw" values of collected measurements is
The CEO Problem
- IEEE Trans. Inform. Theory, 1996
Cited by 62 (3 self)
automated diagnosis, self-healing and self-monitoring systems, statistical induction and
Combining visualization and statistical analysis to improve operator confidence and efficiency for failure detection and localization
- In Proceedings of the 2nd IEEE International Conference on Autonomic Computing (ICAC ’05), 2005
Cited by 28 (5 self)
Web applications suffer from software and configuration faults that lower their availability. Recovering from failure is dominated by the time interval between when these faults appear and when they are detected by site operators. We introduce a set of tools that augment the ability of operators to perceive the presence of failure: an automatic anomaly detector scours HTTP access logs to find changes in user behavior that are indicative of site failures, and a visualizer helps operators rapidly detect and diagnose problems. Visualization addresses a key question of autonomic computing: how to win operators’ confidence so that new tools will be embraced. Evaluation performed using HTTP logs from Ebates.com demonstrates that these tools can enhance the detection of failure as well as shorten detection time. Our approach is application-generic and can be applied to any Web application without the need for instrumentation.
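To make the idea of detecting "changes in user behavior" in access logs concrete, here is a minimal Python sketch. It is an assumption-laden illustration, not the detector from the paper: it compares per-page hit counts in a current window against a baseline window using a chi-square statistic. The function name `chi_square_shift`, the page paths, and the counts are all hypothetical.

```python
def chi_square_shift(baseline_hits, current_hits):
    """Chi-square statistic for how far the current page-hit mix
    departs from the baseline mix (larger means a bigger shift).

    Both arguments map page path -> hit count. Pages that never
    appear in the baseline have an expected count of zero and are
    skipped, which is a simplification of this sketch.
    """
    base_total = sum(baseline_hits.values())
    cur_total = sum(current_hits.values())
    stat = 0.0
    for page in set(baseline_hits) | set(current_hits):
        # Hits we would expect now if users behaved as in the baseline.
        expected = cur_total * baseline_hits.get(page, 0) / base_total
        observed = current_hits.get(page, 0)
        if expected > 0:
            stat += (observed - expected) ** 2 / expected
    return stat


# Hypothetical hit counts parsed from HTTP access logs.
baseline = {"/home": 500, "/cart": 300, "/checkout": 200}
normal_window = {"/home": 50, "/cart": 30, "/checkout": 20}
broken_window = {"/home": 70, "/cart": 28, "/checkout": 2}

stat_normal = chi_square_shift(baseline, normal_window)
stat_broken = chi_square_shift(baseline, broken_window)
```

In this made-up data, the window in which checkout traffic collapses scores far higher than the window that matches the baseline mix. An operator-facing tool would still need a threshold, which this sketch leaves open.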
Ensembles of models for automated diagnosis of system performance problems
- In DSN, 2005
Cited by 27 (8 self)
Violations of service level objectives (SLO) in Internet services are urgent conditions requiring immediate attention. Previously we showed [1] that Tree-Augmented Bayesian Networks or TAN models are effective at identifying which low-level system properties were correlated to high-level SLO violations (the metric attribution problem) under stable workloads. In this paper we extend our approach to adapt to changing workloads and external disturbances by maintaining an ensemble of probabilistic models, adding new models when existing ones do not accurately capture current system behavior. Using realistic workloads on an implemented prototype system, we show that the ensemble of TAN models captures the performance behavior of the system accurately under changing workloads and conditions. We fuse diagnoses from the ensemble of models to identify likely causes of the performance problem, with results comparable to those produced by an oracle that continuously changes the model based on advance knowledge of the workload. The cost of inducing new models and managing the ensembles is negligible, making our approach both immediately practical and theoretically appealing.
Mochi: Visual Log-Analysis Based Tools for Debugging Hadoop
Cited by 12 (7 self)
Mochi, a new visual, log-analysis-based debugging tool, correlates Hadoop’s behavior in space, time and volume, and extracts a causal, unified control- and dataflow model of Hadoop across the nodes of a cluster. Mochi’s analysis produces visualizations of Hadoop’s behavior that users can use to reason about and debug performance issues. We provide examples of Mochi’s value in revealing a Hadoop job’s structure, in optimizing real-world workloads, and in identifying anomalous Hadoop behavior, on the Yahoo! M45 Hadoop cluster.
Three research challenges at the intersection of machine learning, statistical induction, and systems
- In HotOS 2005, 2005
Cited by 8 (1 self)
results for performance debugging and failure diagnosis and detection in systems by using approaches based on automatically inducing models and deriving correlations from observed data. We believe that maximizing the potential of this line of research will require surmounting some fundamental challenges arising not from the modeling techniques themselves, but specifically from the application of those techniques to real-world systems. We specifically formulate three challenges. First, as new data is collected from a system, previously-induced models must be continuously assessed and validated, with the ultimate aim of achieving online adaptation to system changes. Second, human operators must be able to effectively interact with the models, including interpreting model findings to generate explanations, enabling human feedback to improve the models, and identifying false positives and missed detections. Third, it should be possible to formally manipulate “signatures” of system state as represented by these models, allowing us to query the system’s past to identify recurring problems and manually annotate them with additional information. We contend that the specifics of this problem domain not only raise these challenges, but also provide the knowledge base from which to derive well-engineered solutions to them. We suggest some possible strategies for addressing each challenge and show how they arise in the context of a real example.
A Case for Machine Learning to Optimize Multicore Performance
Cited by 8 (1 self)
Multicore architectures have become so complex and diverse that there is no obvious path to achieving good performance. Hundreds of code transformations, compiler flags, architectural features and optimization parameters result in a search space that can take many machine-months to explore exhaustively. Inspired by successes in the systems community, we apply state-of-the-art machine learning techniques to explore this space more intelligently. On 7-point and 27-point stencil code, our technique takes about two hours to discover a configuration whose performance is within 1% of and up to 18% better than that achieved by a human expert. This factor of 2000 speedup over manual exploration of the auto-tuning parameter space enables us to explore optimizations that were previously off-limits. We believe the opportunity for using machine learning in multicore autotuning is even more promising than the successes to date in the systems literature.
Ganesha: Black-box diagnosis of MapReduce systems
- In Proceedings of the 2nd Workshop on Hot Topics in Measurement & Modeling of Computer Systems, 2009
Cited by 8 (7 self)
Ganesha aims to diagnose faults transparently (in a black-box manner) in MapReduce systems, by analyzing OS-level metrics. Ganesha’s approach is based on peer-symmetry under fault-free conditions, and can diagnose faults that manifest asymmetrically at nodes within a MapReduce system. We evaluate Ganesha by diagnosing Hadoop problems for the Gridmix Hadoop benchmark on 10-node and 50-node MapReduce clusters on Amazon’s EC2. We also candidly highlight faults that escape Ganesha’s diagnosis.
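As a rough illustration of the peer-symmetry idea in the abstract above (and not Ganesha's actual algorithm, which the abstract does not spell out), the following Python sketch flags nodes whose OS-level metric deviates strongly from the median of their peers. The function name `flag_asymmetric_nodes`, the threshold of 3.0, and the sample CPU values are all hypothetical.

```python
from statistics import median


def flag_asymmetric_nodes(metric_by_node, threshold=3.0):
    """Return the sorted names of nodes that look unlike their peers.

    Deviation from the peer median is measured in units of the median
    absolute deviation (MAD), a robust estimate of spread. The threshold
    is an illustrative choice, not a value taken from the paper.
    """
    values = list(metric_by_node.values())
    center = median(values)
    mad = median(abs(v - center) for v in values)
    if mad == 0:
        # Most peers report the same value; flag any node that differs.
        return sorted(n for n, v in metric_by_node.items() if v != center)
    return sorted(
        n for n, v in metric_by_node.items()
        if abs(v - center) / mad > threshold
    )


# Hypothetical CPU utilization (%) sampled on five worker nodes.
cpu = {"node1": 41.0, "node2": 39.5, "node3": 40.2, "node4": 92.0, "node5": 40.8}
suspects = flag_asymmetric_nodes(cpu)
```

Using the median rather than the mean keeps a single faulty node from dragging the baseline toward itself. The sketch shares the limitation implied by the abstract: a fault that affects all nodes symmetrically would go unnoticed.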
Why Did My PC Suddenly Slow Down?
Cited by 5 (0 self)
Users are often frustrated when they encounter a sudden decrease in the responsiveness of their personal computers. However, it is often difficult to pinpoint a particular offending process and the resource it is overconsuming, even when such a simple explanation does exist. We present preliminary results from several weeks of PC usage showing that user-perceived unresponsiveness often has such a simple explanation and that simple statistical models often suffice to pinpoint the problem. The statistical models we build use all the performance counters for all running processes. When the user expresses frustration at a given time point, we can use these models to determine which processes are acting most anomalously, and in turn which features of those processes are most anomalous. We present an investigative tool that ranks processes and features according to their degree of anomaly, and allows the user to interactively examine the relevant time series.
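The ranking step described in this abstract can be sketched in Python as follows. The abstract does not say which statistical models the authors use, so this sketch substitutes a plain z-score against each counter's own history; that choice, the function name `rank_processes`, and the sample data are assumptions made for illustration.

```python
from statistics import mean, pstdev


def rank_processes(history, current):
    """Rank processes by their most anomalous counter right now.

    history: {process: {counter: [past values]}}
    current: {process: {counter: value at the moment of the complaint}}
    Returns (score, process, counter) tuples, most anomalous first.
    """
    ranked = []
    for proc, counters in current.items():
        best_score, best_counter = 0.0, None
        for name, value in counters.items():
            past = history.get(proc, {}).get(name, [])
            if len(past) < 2:
                continue  # not enough history to estimate a baseline
            spread = pstdev(past)
            if spread == 0:
                continue  # constant baseline, so a z-score is undefined
            z = abs(value - mean(past)) / spread
            if z > best_score:
                best_score, best_counter = z, name
        ranked.append((best_score, proc, best_counter))
    return sorted(ranked, key=lambda item: item[0], reverse=True)


# Hypothetical performance-counter samples for two processes.
history = {
    "browser": {"cpu": [10, 12, 11, 9], "mem": [300, 310, 305, 295]},
    "indexer": {"cpu": [2, 3, 2, 3], "mem": [50, 50, 51, 49]},
}
current = {
    "browser": {"cpu": 12, "mem": 320},
    "indexer": {"cpu": 60, "mem": 52},
}
ranking = rank_processes(history, current)
```

With this made-up data, the indexer's CPU counter ranks first, followed by the browser's memory counter. This mirrors the two-level output the abstract describes: first the process, then the feature of that process that is most anomalous.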

