Results 1 - 10
of
106
Capturing, indexing, clustering, and retrieving system history
- In SOSP
, 2005
"... system performance, Bayesian networks, information retrieval, problem signatures We present a method for automatically extracting from a running system an indexable signature that distills the essential characteristic from a system state and that can be subjected to automated clustering and similari ..."
Abstract
-
Cited by 65 (5 self)
- Add to MetaCart
system performance, Bayesian networks, information retrieval, problem signatures We present a method for automatically extracting from a running system an indexable signature that distills the essential characteristic from a system state and that can be subjected to automated clustering and similarity-based retrieval to identify when an observed system state is similar to a previously-observed state. This allows operators to identify and quantify the frequency of recurrent problems, to leverage previous diagnostic efforts, and to establish whether problems seen at different installations of the same site are similar or distinct. We show that the naive approach to constructing these signatures based on simply recording the actual "raw " values of collected measurements is
The CEO Problem
- IEEE Trans. Inform. Theory
, 1996
"... automated diagnosis, selfhealing and selfmonitoring systems, statistical induction and ..."
Abstract
-
Cited by 62 (3 self)
- Add to MetaCart
automated diagnosis, selfhealing and selfmonitoring systems, statistical induction and
Stardust: Tracking Activity in a Distributed Storage System
- ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems (Saint-Malo
, 2006
"... Performance monitoring in most distributed systems provides minimal guidance for tuning, problem diagnosis, and decision making. Stardust is a monitoring infrastructure that replaces traditional performance counters with end-to-end traces of requests and allows for efficient querying of performance ..."
Abstract
-
Cited by 40 (10 self)
- Add to MetaCart
Performance monitoring in most distributed systems provides minimal guidance for tuning, problem diagnosis, and decision making. Stardust is a monitoring infrastructure that replaces traditional performance counters with end-to-end traces of requests and allows for efficient querying of performance metrics. Such traces better inform key administrative performance challenges by enabling, for example, extraction of per-workload, per-resource demand information and per-workload latency graphs. This paper reports on our experience building and using end-to-end tracing as an on-line monitoring tool in a distributed storage system. Using diverse system workloads and scenarios, we show that such fine-grained tracing can be made efficient (less than 6% overhead) and is useful for on- and off-line analysis of system behavior. These experiences make a case for having other systems incorporate such an instrumentation framework.
Automated Known Problem Diagnosis with Event Traces
- In EuroSys
, 2006
"... Computer problem diagnosis remains a serious challenge to users and support professionals. Traditional troubleshooting methods relying heavily on human intervention make the process inefficient and the results inaccurate even for solved problems, which contribute significantly to user’s dissatisfact ..."
Abstract
-
Cited by 36 (5 self)
- Add to MetaCart
Computer problem diagnosis remains a serious challenge to users and support professionals. Traditional troubleshooting methods relying heavily on human intervention make the process inefficient and the results inaccurate even for solved problems, which contribute significantly to user’s dissatisfaction. We propose to use system behavior information such as system event traces to build correlations with solved problems, instead of using only vague text descriptions as in existing practices. The goal is to enable automatic identification of the root cause of a problem if it is a known one, which would further lead to its resolution. By applying statistical learning techniques to classifying system call sequences, we show our approach can achieve considerable accuracy of root cause recognition by studying four case examples.
Exploiting nonstationarity for performance prediction
- In Proc. of EuroSys’2007,Lisbon
, 2007
"... Real production applications ranging from enterprise applications to large e-commerce sites share a crucial but seldom-noted characteristic: The relative frequencies of transaction types in their workloads are nonstationary, i.e., the transaction mix changes over time. Accurately predicting applicat ..."
Abstract
-
Cited by 36 (10 self)
- Add to MetaCart
Real production applications ranging from enterprise applications to large e-commerce sites share a crucial but seldom-noted characteristic: The relative frequencies of transaction types in their workloads are nonstationary, i.e., the transaction mix changes over time. Accurately predicting application-level performance in businesscritical production applications is an increasingly important problem. However, transaction mix nonstationarity casts doubt on the practical usefulness of prediction methods that ignore this phenomenon. This paper demonstrates that transaction mix nonstationarity enables a new approach to predicting application-level performance as a function of transaction mix. We exploit nonstationarity to circumvent the need for invasive instrumentation and controlled benchmarking during model calibration; our approach relies solely on lightweight passive measurements that are routinely collected in today’s production environments. We evaluate predictive accuracy on two real business-critical production applications. The accuracy of our response time predictions ranges from 10 % to 16 % on these applications, and our models generalize well to workloads very different from those used for calibration. We apply our technique to the challenging problem of predicting the impact of application consolidation on transaction response times. We calibrate models of two testbed applications running on dedicated machines, then use the models to predict their performance when they run together on a shared machine and serve very different workloads. Our predictions are accurate to within 4 % to 14%. Existing approaches to consolidation decision support predict post-consolidation resource utilizations. Our method allows application-level performance to guide consolidation decisions. 1.
Combining visualization and statistical analysis to improve operator confidence and efficiency for failure detection and localization
- In Proceedings of the 2nd IEEE International Conference on Autonomic Computing (ICAC ’05
, 2005
"... Web applications suffer from software and configuration faults that lower their availability. Recovering from failure is dominated by the time interval between when these faults appear and when they are detected by site operators. We introduce a set of tools that augment the ability of operators to ..."
Abstract
-
Cited by 28 (5 self)
- Add to MetaCart
Web applications suffer from software and configuration faults that lower their availability. Recovering from failure is dominated by the time interval between when these faults appear and when they are detected by site operators. We introduce a set of tools that augment the ability of operators to perceive the presence of failure: an automatic anomaly detector scours HTTP access logs to find changes in user behavior that are indicative of site failures, and a visualizer helps operators rapidly detect and diagnose problems. Visualization addresses a key question of autonomic computing of how to win operators ’ confidence so that new tools will be embraced. Evaluation performed using HTTP logs from Ebates.com demonstrates that these tools can enhance the detection of failure as well as shorten detection time. Our approach is application-generic and can be applied to any Web application without the need for instrumentation. 1.
Ensembles of models for automated diagnosis of system performance problems
- In DSN
, 2005
"... Violations of service level objectives (SLO) in Internet services are urgent conditions requiring immediate attention. Previously we showed [1] that Tree-Augmented Bayesian Networks or TAN models are effective at identifying which low-level system properties were correlated to high-level SLO violati ..."
Abstract
-
Cited by 27 (8 self)
- Add to MetaCart
Violations of service level objectives (SLO) in Internet services are urgent conditions requiring immediate attention. Previously we showed [1] that Tree-Augmented Bayesian Networks or TAN models are effective at identifying which low-level system properties were correlated to high-level SLO violations (the metric attribution problem) under stable workloads. In this paper we extend our approach to adapt to changing workloads and external disturbances by maintaining an ensemble of probabilistic models, adding new models when existing ones do not accurately capture current system behavior. Using realistic workloads on an implemented prototype system, we show that the ensemble of TAN models captures the performance behavior of the system accurately under changing workloads and conditions. We fuse diagnoses from the ensemble of models to identify likely causes of the performance problem, with results comparable to those produced by an oracle that continuously changes the model based on advance knowledge of the workload. The cost of inducing new models and managing the ensembles is negligible, making our approach both immediately practical and theoretically appealing.
Agile dynamic provisioning of multi-tier internet applications
- ACM Transactions on Autonomous and Adaptive Systems (TAAS
, 2008
"... Abstract — Dynamic capacity provisioning is a useful technique for handling the multi-time-scale variations seen in Internet workloads. In this paper, we propose a novel dynamic provisioning technique for multi-tier Internet applications that employs (i) a flexible queuing model to determine how muc ..."
Abstract
-
Cited by 25 (2 self)
- Add to MetaCart
Abstract — Dynamic capacity provisioning is a useful technique for handling the multi-time-scale variations seen in Internet workloads. In this paper, we propose a novel dynamic provisioning technique for multi-tier Internet applications that employs (i) a flexible queuing model to determine how much resources to allocate to each tier of the application, and (ii) a combination of predictive and reactive methods that determine when to provision these resources, both at large and small time scales. We propose a novel data center architecture based on virtual machine monitors to reduce provisioning overheads. Our experiments on a forty-machine Xen/Linux-based hosting platform demonstrate the responsiveness of our technique in handling dynamic workloads. In one scenario where a flash crowd caused the workload of a three-tier application to double, our technique was able to double the application capacity within five minutes, thus maintaining response time targets. Our technique also reduced the overhead of switching servers across applications from several minutes to less than a second, while meeting the performance targets of residual sessions.
Detailed Diagnosis in Enterprise Networks
, 2009
"... By studying trouble tickets from small enterprise networks, we conclude that their operators need detailed fault diagnosis. That is, the diagnostic system should be able to diagnose not only generic faults (e.g., performance-related) but also application specific faults (e.g., error codes). It sho ..."
Abstract
-
Cited by 22 (2 self)
- Add to MetaCart
By studying trouble tickets from small enterprise networks, we conclude that their operators need detailed fault diagnosis. That is, the diagnostic system should be able to diagnose not only generic faults (e.g., performance-related) but also application specific faults (e.g., error codes). It should also identify culprits at a fine granularity such as a process or firewall configuration. We build a system, called NetMedic, that enables detailed diagnosis by harnessing the rich information exposed by modern operating systems and applications. It formulates detailed diagnosis as an inference problem that more faithfully captures the behaviors and interactions of fine-grained network components such as processes. The primary challenge in solving this problem is inferring when a component might be impacting another. Our solution is based on an intuitive technique that uses the joint behavior of two components in the past to estimate the likelihood of them impacting one another in the present. We find that our deployed prototype is effective at diagnosing faults that we inject in a live environment. The faulty component is correctly identified as the most likely culprit in 80% of the cases and is almost always in the list of top five culprits.

