## Correlating Instrumentation Data to System States: A Building Block for Automated Diagnosis and Control (2004)

Venue: OSDI

Citations: 206 (17 self)

### BibTeX

@INPROCEEDINGS{Cohen04correlatinginstrumentation,
  author = {Ira Cohen and Moises Goldszmidt and Terence Kelly and Julie Symons and Jeffrey S. Chase},
  title = {Correlating Instrumentation Data to System States: A Building Block for Automated Diagnosis and Control},
  booktitle = {OSDI '04: 6th Symposium on Operating Systems Design and Implementation},
  year = {2004},
  pages = {231--244}
}

### Abstract

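This paper studies the effectiveness and practicality of Tree-Augmented Naive Bayesian networks (TANs) as a basis for performance diagnosis and forecasting from system-level instrumentation in a three-tier network service, as a building block for automated diagnosis and control.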

### Citations

7442 | Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference
- Pearl
- 1988
Citation Context ...g, and require a large number of data samples [16, 21]. We can simplify the problem by making some assumptions about the structure of the distribution P. TANs comprise a subclass of Bayesian networks [29], which offer a well-developed mathematical language to represent structure in probability distributions. 3.1 Bayesian networks and TANs A Bayesian network is an annotated directed acyclic graph encod...

5375 | C4.5: Programs for Machine Learning
- Quinlan
- 1993
Citation Context ...of a classifier by considering the false positive rate and false negative rate separately. 2.2 Inducing Classifier Models There are many techniques for pattern classification in the literature (e.g., [7, 30]). Our approach first induces [page break; table fragment follows] Metric: mean_AS_CPU_1_USERTIME, Description: CPU time spent in user mode on the application server. Metric: var_AS_CPU_1_USERTIME, Description: Variance of user CPU time on the application server...

5299 | Neural Networks for Pattern Recognition
- Bishop
- 1995
Citation Context ...subindex t when the context is clear). The pattern classification problem is to induce or learn a classifier function F mapping the universe of possible values for $\vec{M}_t$ to the range of system states S [16, 7]. The input to this analysis is a training data set. In this case, the training set is a log of observations of the form $\langle \vec{M}_t, S_t \rangle$ from the system in operation. The learning is supervised in that the...

4144 | Pattern Classification and Scene Analysis
- Duda, Hart
- 1973
Citation Context ...r $\vec{M}$ of metric values when the system is in a given state S. Multidimensional problems of this form are subject to challenges of robustness and overfitting, and require a large number of data samples [16, 21]. We can simplify the problem by making some assumptions about the structure of the distribution P. TANs comprise a subclass of Bayesian networks [29], which offer a well-developed mathematical langua...

3339 | Data Mining: Practical Machine Learning Tools and Techniques, 2nd Edition
- Witten, Frank
- 2005
Citation Context ...tems. They are based on sound and well-developed theory, they are computationally efficient and robust, they require no expertise to use, and they are readily available in open-source implementations [24, 34, 5]. While other approaches may prove to yield comparable accuracy and/or efficiency, Bayesian networks and TANs in particular have important practical advantages: they are interpretable and they can inc...

2292 | The Elements of Statistical Learning
- Hastie
- 2001
Citation Context ...r $\vec{M}$ of metric values when the system is in a given state S. Multidimensional problems of this form are subject to challenges of robustness and overfitting, and require a large number of data samples [16, 21]. We can simplify the problem by making some assumptions about the structure of the distribution P. TANs comprise a subclass of Bayesian networks [29], which offer a well-developed mathematical langua...

1699 | An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods
- Cristianini, Shawe-Taylor
- 2000
Citation Context ...e for any given $\vec{M}$. This interpretability property makes Bayesian networks attractive for diagnosis and control, relative to competing alternatives such as neural networks and support vector machines [13]. One other alternative, decision trees [30], can be interpreted as a set of if-then rules on the metrics and their values. Bayesian networks have an additional advantage of modifiability: they can in...

1132 | R: A Language for Data Analysis and Graphics
- Ihaka
- 1996
Citation Context ...tems. They are based on sound and well-developed theory, they are computationally efficient and robust, they require no expertise to use, and they are readily available in open-source implementations [24, 34, 5]. While other approaches may prove to yield comparable accuracy and/or efficiency, Bayesian networks and TANs in particular have important practical advantages: they are interpretable and they can inc...

1109 | The Vision of Autonomic Computing
- Kephart, Chess
Citation Context ...graphically to operators. But it is widely recognized that the complexity of deployed systems surpasses the ability of humans to diagnose and respond to problems rapidly and correctly [17, 26]. Research on automated diagnosis and control, beginning with tools to analyze and interpret instrumentation data, has not kept pace with the demand for practical solutions in the field. Broadly there a...

949 | Learning Bayesian networks: The combination of knowledge and statistical data
- Heckerman, Geiger, et al.
- 1995
Citation Context ... is independent of its non-descendants, given that the state of its parents is known. There is a set of well-understood algorithms and methods to induce Bayesian network models statistically from data [22], and these are available in open-source software [24, 34, 5]. In a naive Bayesian network, the state variable S is the only parent of all other vertices. Thus a naive Bayesian network assumes that al...

939 | The Art of Computer Systems Performance Analysis
- Jain
- 1991
Citation Context ...at the analysis and TAN models suggest the causes of performance problems, either directly or indirectly, depending on the metrics recorded. 7 Related Work Jain’s classic text on performance analysis [25] surveys a wide range of analytical approaches for performance modeling, bottleneck analysis, and performance diagnosis. Classical analytical models are based on a priori knowledge from human experts;...

635 | Bayesian network classifiers
- Friedman, Geiger, et al.
- 1997
Citation Context ...nstrumentation data from network services in order to forecast, diagnose, and repair failure conditions. This paper studies the effectiveness and practicality of Tree-Augmented Naive Bayesian networks [18], or TANs, as a basis for performance diagnosis and forecasting from system-level instrumentation in a three-tier network service. TANs comprise a subclass of Bayesian networks, USENIX Association OSD...

604 | Grid Information Services for Distributed Resource Sharing
- Czajkowski, Fitzgerald, et al.
- 2001
Citation Context ...n of workload, software structure, hardware, traffic conditions, and system goals. Pervasive instrumentation and query capabilities are necessary elements of the solution for managing complex systems [32, 23, 33, 14]. There are now many commercial frameworks on the market for coordinated monitoring and control of large-scale systems: tools such as HP’s OpenView and IBM’s Tivoli aggregate information from a variet...

375 | Astrolabe: A robust and scalable technology for distributed system monitoring, management, and data mining
- van Renesse, Birman, et al.
Citation Context ...n of workload, software structure, hardware, traffic conditions, and system goals. Pervasive instrumentation and query capabilities are necessary elements of the solution for managing complex systems [32, 23, 33, 14]. There are now many commercial frameworks on the market for coordinated monitoring and control of large-scale systems: tools such as HP’s OpenView and IBM’s Tivoli aggregate information from a variet...

309 | Querying the Internet with PIER
- Huebsch, Hellerstein, et al.
- 2003
Citation Context ...n of workload, software structure, hardware, traffic conditions, and system goals. Pervasive instrumentation and query capabilities are necessary elements of the solution for managing complex systems [32, 23, 33, 14]. There are now many commercial frameworks on the market for coordinated monitoring and control of large-scale systems: tools such as HP’s OpenView and IBM’s Tivoli aggregate information from a variet...

307 | httperf: A Tool for Measuring Web Server Performance
- Mosberger, Jin
- 1998
Citation Context ..., application middleware server (BEA WebLogic), and database server (Oracle) run on three different servers instrumented with HP OpenView to collect a set of system metrics. A load generator (httperf [28]) offers load to the service over a sequence of execution intervals. An SLO indicator processes the Apache logs to determine SLO compliance over each interval, based on the average server response tim...

259 | Performance debugging for distributed systems of black boxes
- Aguilera, Mogul, et al.
- 2003
Citation Context ...vironment. For example, there has been much recent progress on the use of statistical analysis tools to infer component relationships from histories of interaction patterns (e.g., from packet traces) [9, 2, 4, 10]. But it is still an open problem to identify techniques that are powerful enough to induce effective models, and that are sufficiently efficient, accurate, and robust to deploy in practice. The goal ...

240 | Pinpoint: Problem determination in large, dynamic, internet services
- Chen, Kiciman, et al.
- 2002
Citation Context ...vironment. For example, there has been much recent progress on the use of statistical analysis tools to infer component relationships from histories of interaction patterns (e.g., from packet traces) [9, 2, 4, 10]. But it is still an open problem to identify techniques that are powerful enough to induce effective models, and that are sufficiently efficient, accurate, and robust to deploy in practice. The goal ...

227 | Performance guarantees for web server end-systems: A control-theoretical approach
- Abdelzaher, Shin, et al.
Citation Context ...em structure and behavior, which may be represented quantitatively or as sets of event-condition-action rules. Recent work has explored the uses of such models in automated performance control (e.g., [3, 1, 15]). This approach has several limitations: the models and rule bases are themselves difficult and costly to build, may be incomplete or inaccurate in significant ways, and inevitably become brittle whe...

162 | Adaptive Probabilistic Networks with Hidden Variables
- Binder, Koller, et al.
- 1997
Citation Context ...oaches. One focus of our continuing work is online adaptation of the models to respond to changing conditions. Research on adapting Bayesian networks to incoming data has yielded practical approaches [22, 6, 19]. For example, known statistical techniques for sequential update are sufficient to adapt the model parameters. However, adapting the model structure requires a search over a space of candidate models...

145 | A knowledge plane for the internet
- Clark, Partridge, et al.
- 2003
Citation Context ... [page footer: USENIX Association, OSDI ’04: 6th Symposium on Operating Systems Design and Implementation, p. 231; page 232 begins] recently of interest to the systems community as potential elements of an Internet “Knowledge Plane” [11]. TANs are less powerful than generalized Bayesian networks (see Section 3), but they are simple, compact and efficient. TANs have been shown to be promising in diverse contexts including financial mo...

127 | Path-Based Failure and Evolution Management
- Chen, Accardi, et al.
- 2004
Citation Context ...vironment. For example, there has been much recent progress on the use of statistical analysis tools to infer component relationships from histories of interaction patterns (e.g., from packet traces) [9, 2, 4, 10]. But it is still an open problem to identify techniques that are powerful enough to induce effective models, and that are sufficiently efficient, accurate, and robust to deploy in practice. The goal ...

122 | Model-Based Resource Provisioning in a Web Service Utility
- Doyle, Chase
- 2003
Citation Context ...em structure and behavior, which may be represented quantitatively or as sets of event-condition-action rules. Recent work has explored the uses of such models in automated performance control (e.g., [3, 1, 15]). This approach has several limitations: the models and rule bases are themselves difficult and costly to build, may be incomplete or inaccurate in significant ways, and inevitably become brittle whe...

116 | Minerva: an automated resource provisioning tool for large-scale storage systems
- Alvarez
- 2001
Citation Context ...em structure and behavior, which may be represented quantitatively or as sets of event-condition-action rules. Recent work has explored the uses of such models in automated performance control (e.g., [3, 1, 15]). This approach has several limitations: the models and rule bases are themselves difficult and costly to build, may be incomplete or inaccurate in significant ways, and inevitably become brittle whe...

86 | Sophia: an information plane for networked systems
- Wawrzoniak, Peterson, et al.

56 | Magpie: online modelling and performance-aware systems
- Barham, Isaacs, et al.
- 2003

46 | Sequential update of Bayesian network structure
- Friedman, Goldszmidt
- 1997
Citation Context ...oaches. One focus of our continuing work is online adaptation of the models to respond to changing conditions. Research on adapting Bayesian networks to incoming data has yielded practical approaches [22, 6, 19]. For example, known statistical techniques for sequential update are sufficient to adapt the model parameters. However, adapting the model structure requires a search over a space of candidate models...

34 | File classification in self-* storage systems
- Mesnier, Thereska, et al.
- 2004
Citation Context ...tterns among components; in this respect our approach is complementary. Others are beginning to apply model-induction techniques from machine learning to a variety of systems problems. Mesnier et al. [27], for instance, apply decision-tree classifiers to predict properties of files (e.g., access patterns) based on creation-time attributes (e.g., names and permissions). They report that accurate models...

17 | Self-repairing computers
- Fox, Patterson
- 2003
Citation Context ...graphically to operators. But it is widely recognized that the complexity of deployed systems surpasses the ability of humans to diagnose and respond to problems rapidly and correctly [17, 26]. Research on automated diagnosis and control, beginning with tools to analyze and interpret instrumentation data, has not kept pace with the demand for practical solutions in the field. Broadly there a...

15 | Using probabilistic reasoning to automate software tuning
- Sullivan
- 2003
Citation Context ... among the metrics, or prior probability distributions. Blake & Breese [8] give examples, including an early use of Bayesian networks to discover bottlenecks in the Windows operating system. Sullivan [31] applies this approach to tune database parameters. 4 Methodology We considered a variety of approaches to empirical evaluation before eventually settling on the testbed environment and workloads desc...

11 | Automating computer bottleneck detection with belief nets
- Breese, Blake
- 1995
Citation Context ...ledge can take the form of explicit lists of metrics to be included in the model, information about correlations and dependencies among the metrics, or prior probability distributions. Blake & Breese [8] give examples, including an early use of Bayesian networks to discover bottlenecks in the Windows operating system. Sullivan [31] applies this approach to tune database parameters. 4 Methodology We c...

10 | Sun Performance and Tuning
- Cockcroft, Pettit
- 1998
Citation Context ...ot limited to workload measures and device measures. More recent books aimed at practitioners consider goals closer to ours but pursue them using different approaches. For example, Cockcroft & Pettit [12] cover a range of facilities for system performance measurement and techniques for performance diagnosis. They also describe Virtual Adrian, a performance diagnosis package that encodes human expert k...

5 | Web transaction analysis and optimization
- Garg, Hao, et al.
- 2002
Citation Context ...stems based on passive observations of communication among “black box” components, e.g., processes or Java J2EE beans implementing different tiers of a multi-tier Web service. Examples include WebMon [20], Magpie [4], and Pinpoint [10]. Aguilera et al. [2] provides an excellent review of these and related research efforts. It also proposes several algorithms to infer causal paths of messages related t...

1 | Minerva: An automated provisioning tool for large-scale storage systems
- Alvarez, Borowsky, et al.
- 2001