Triage: Diagnosing production run failures at the user's site
- In Proc. of 21st SOSP, 2007
Cited by 21 (3 self)
Diagnosing production run failures is a challenging yet important task. Most previous work focuses on offsite diagnosis, i.e. development site diagnosis with the programmers present. This is insufficient for production-run failures as: (1) it is difficult to reproduce failures offsite for diagnosis; (2) offsite diagnosis cannot provide timely guidance for recovery or security purposes; (3) it is infeasible to provide a programmer to diagnose every production run failure; and (4) privacy concerns limit the release of information (e.g. coredumps) to programmers. To address production-run failures, we propose a system, called Triage, that automatically performs onsite software failure diagnosis at the very moment of failure. It provides a detailed diagnosis report, including the failure nature, triggering conditions, related code and variables, the fault propagation chain, and potential fixes. Triage achieves this by leveraging lightweight reexecution support
D³S: Debugging Deployed Distributed Systems
- 2008
Cited by 14 (1 self)
Testing large-scale distributed systems is a challenge, because some errors manifest themselves only after a distributed sequence of events that involves machine and network failures. D³S is a checker that allows developers to specify predicates on distributed properties of a deployed system, and that checks these predicates while the system is running. When D³S finds a problem it produces the sequence of state changes that led to the problem, allowing developers to quickly find the root cause. Developers write predicates in a simple and sequential programming style, while D³S checks these predicates in a distributed and parallel manner to allow checking to be scalable to large systems and fault tolerant. By using binary instrumentation, D³S works transparently with legacy systems and can change predicates to be checked at runtime. An evaluation with 5 deployed systems shows that D³S can detect non-trivial correctness and performance bugs at runtime and with low performance overhead (less than 8%).
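To make the idea of a predicate over distributed state concrete, here is a minimal single-process sketch in Python. It is not the D³S API: the snapshot format, the function names, and the lock-exclusivity property are all assumptions chosen for illustration. It checks the property over a time-ordered list of global snapshots and, on the first violation, returns the sequence of states that led up to it.

```python
from typing import Dict, List, Optional, Set, Tuple

# Hypothetical snapshot format: node name -> set of lock IDs that node holds.
Snapshot = Dict[str, Set[str]]


def violates_exclusivity(snapshot: Snapshot) -> Optional[str]:
    """Return a lock ID held by more than one node, or None if there is none."""
    holders: Dict[str, str] = {}
    for node in sorted(snapshot):
        for lock in sorted(snapshot[node]):
            if lock in holders:
                return lock
            holders[lock] = node
    return None


def check_exclusive_locks(
    history: List[Snapshot],
) -> Optional[Tuple[str, List[Snapshot]]]:
    """Scan snapshots in order. On the first violation, return the offending
    lock and the snapshots up to and including the violating one."""
    for i, snapshot in enumerate(history):
        lock = violates_exclusivity(snapshot)
        if lock is not None:
            return lock, history[: i + 1]
    return None


# Illustrative history (made-up data): the third snapshot breaks the property.
history = [
    {"n1": {"L1"}, "n2": set()},
    {"n1": {"L1"}, "n2": {"L2"}},
    {"n1": {"L1"}, "n2": {"L1", "L2"}},  # both nodes now hold L1
]
result = check_exclusive_locks(history)
```

The predicate itself is written sequentially, as the abstract describes; a real checker would additionally have to gather the snapshots from running nodes and distribute the checking, which this sketch does not attempt.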
Snitch: Interactive Decision Trees for Troubleshooting Misconfigurations
Cited by 3 (0 self)
Troubleshooting misconfigurations of modern applications is difficult due to their large and complex state. Snitch is a prototype tool that assists human troubleshooters by finding relationships between application state and subsequent faults. It correlates configuration state and application errors across many machines and users, and across long periods of time. Snitch aids the human expert in extracting patterns from this rich but enormous data set by building decision trees pinpointing potential configuration problems. We applied Snitch to 114 GB of configuration traces from 151 machines over 567 days. We illustrate how Snitch can suggest misconfigurations in case studies of two Windows applications: Messenger and Outlook.
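As a rough illustration of how a decision tree can point at a suspicious configuration setting, the Python sketch below picks the root split of a tree by information gain. Information gain is a common splitting criterion and is an assumption here, not necessarily the one Snitch uses; the configuration keys, values, and fault labels are made-up data.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def entropy(labels: List[bool]) -> float:
    """Shannon entropy (in bits) of a list of fault / no-fault labels."""
    total = len(labels)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )


def best_split(rows: List[Dict[str, str]], labels: List[bool]) -> Tuple[str, float]:
    """Return the configuration key whose values best separate faulty from
    healthy observations, together with its information gain."""
    base = entropy(labels)
    best_key, best_gain = "", 0.0
    for key in sorted(rows[0]):
        groups: Dict[str, List[bool]] = {}
        for row, label in zip(rows, labels):
            groups.setdefault(row[key], []).append(label)
        remainder = sum(
            len(group) / len(labels) * entropy(group) for group in groups.values()
        )
        gain = base - remainder
        if gain > best_gain:
            best_key, best_gain = key, gain
    return best_key, best_gain


# Hypothetical observations: one dict of configuration state per machine,
# and whether the application faulted afterwards.
rows = [
    {"proxy": "on", "theme": "dark"},
    {"proxy": "on", "theme": "light"},
    {"proxy": "off", "theme": "dark"},
    {"proxy": "off", "theme": "light"},
]
faulted = [True, True, False, False]
key, gain = best_split(rows, faulted)
```

In this toy data set the faults line up with the `proxy` setting and not with `theme`, so `proxy` is selected as the root of the tree. A full tree would recurse on each resulting group.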
PDA: A Tool for Automated Problem Determination
Cited by 3 (0 self)
Problem determination remains one of the most expensive and time-consuming functions in system management due to the difficulty in automating what is essentially a highly experience-dependent task. In this paper we study the characteristics of problem tickets in an enterprise IT infrastructure and observe that most of the tickets come from very few products and modules, and that OS problems take longer to resolve. We propose PDA, a problem management tool that provides automated problem diagnosis capabilities to assist system administrators in solving real-world problems more efficiently. PDA uses a two-level approach of proactive, high-level system health checks, coupled with rule-based “drill-down” probing to automatically collect detailed information related to the problem. Our tool allows system administrators to author and customize probes and rules and share them across the organization. We illustrate the usage and benefits of PDA with a number of UNIX problem scenarios that show PDA is able to quickly collect key information through its rules to aid in problem determination.
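The two-level approach can be sketched as follows. This is a hypothetical Python illustration, not PDA's actual rule language or probe interface: the `Rule` class, the metric names, the thresholds, and the probe outputs are all assumptions. A cheap health check produces a few metrics, and only the rules whose condition matches trigger their more detailed probe.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Metrics = Dict[str, float]


@dataclass
class Rule:
    """Hypothetical rule: if `condition` holds on the health metrics,
    run `probe` to collect detailed information."""
    name: str
    condition: Callable[[Metrics], bool]
    probe: Callable[[], str]


def drill_down(metrics: Metrics, rules: List[Rule]) -> Dict[str, str]:
    """Run only the probes whose rule condition matches the health check."""
    return {rule.name: rule.probe() for rule in rules if rule.condition(metrics)}


# Illustrative rules; thresholds and probe results are made up.
rules = [
    Rule(
        "disk_full",
        lambda m: m["disk_used_pct"] > 90.0,
        lambda: "collect per-directory disk usage",
    ),
    Rule(
        "high_load",
        lambda m: m["load_avg"] > 8.0,
        lambda: "collect top CPU-consuming processes",
    ),
]

# Stand-in for the output of a real high-level health check.
health = {"disk_used_pct": 97.0, "load_avg": 1.2}
report = drill_down(health, rules)
```

Keeping rules as plain data (a name, a condition, a probe) is one way to let administrators author and share them, which is the customization the abstract mentions.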
Failure Classification and Inference in Large-Scale Systems: A Systematic Study of Failures in PlanetLab
Cited by 2 (1 self)
Large-scale distributed systems are prone to frequent failures, which could be caused by a variety of factors related to network, hardware, and software problems. Any downtime due to failures, whatever the cause, can lead to large disruptions and huge losses. Identifying the location and cause of a failure is critical for the reliability and availability of such systems. However, identifying the actual cause of failures in such systems is a challenging task due to their large scale and variety of failure causes. In this work, we try to understand failures in a large-scale system through a two-step methodology: (i) classifying failures based on their statistical properties, and (ii) using additional monitoring data to explain these failures. We illustrate our methodology through a systematic study of failures in PlanetLab over a 3-month period. Our results show that most of the failures that required restarting a node were of small size and lasted for long durations. We also found that incorporating geographic information into our analysis enabled us to find site-wise correlated failures. We were also able to explain some failures by using error-message information collected by the monitoring nodes, and some short-lived failures by transient CPU overloads on machines.
Model-Based Validation for Internet Services
Cited by 2 (1 self)
Operator mistakes are a significant source of unavailability in Internet services. In our previous work, we proposed operator action validation as an approach for detecting mistakes while hiding them from the service and its users. Previous validation strategies have limitations, however, including the need for instances of correct behavior for comparison. In this paper, we propose a novel model-based validation strategy that addresses these limitations and complements our previous techniques. Model-based validation calls for service engineers to define models of Internet services that can be used to differentiate between correct and incorrect configurations and behaviors. These models are then used to guide the specification of validation assertions that check the correctness of operator actions before they are exposed. We have implemented a prototype model-based validation system for two services, the Web crawler of a commercial search engine (Ask.com) and an academic yet realistic online auction service. Experimentation with model-based validation demonstrates that it is highly effective at detecting and hiding both activated and latent mistakes. Keywords: validation; model; operator mistake; internet service.
Why Software Hangs and What Can Be Done With It
Cited by 2 (0 self)
Software hang is an annoying behavior and forms a major threat to the dependability of many software systems. To avoid software hang at the design phase or fix it in production runs, it is desirable to understand its characteristics. Unfortunately, to our knowledge, there is currently no comprehensive study on why software hangs and how to deal with it. In this paper, we study the reported hang-related bugs of four typical open-source software applications, aiming to gain insight into the characteristics of software hang and provide some guidelines to fix hangs in the first place or remedy them in production runs.
Lightweight, High-Resolution Monitoring for Troubleshooting Production Systems
- Abhishek Kumar
Cited by 2 (0 self)
Production systems are commonly plagued by intermittent problems that are difficult to diagnose. This paper describes a new diagnostic tool, called Chopstix, that continuously collects profiles of low-level OS events (e.g., scheduling, L2 cache misses, CPU utilization, I/O operations, page allocation, locking) at the granularity of executables, procedures and instructions. Chopstix then reconstructs these events offline for analysis. We have used Chopstix to diagnose several elusive problems in a large-scale production system, thereby reducing these intermittent problems to reproducible bugs that can be debugged using standard techniques. The key to Chopstix is an approximate data collection strategy that incurs very low overhead. An evaluation shows Chopstix requires under 1% of the CPU, under 256 KB of RAM, and under 16 MB of disk space per day to collect a rich set of system-wide data.
Practical Performance Models for Complex, Popular Applications
Cited by 1 (0 self)
Perhaps surprisingly, no practical performance models exist for popular (and complex) client applications such as Adobe’s Creative Suite, Microsoft’s Office and Visual Studio, Mozilla, Halo 3, etc. There is currently no tool that automatically answers program developers’, IT administrators’ and end-users’ simple what-if questions like “what happens to the performance of my favorite application X if I upgrade from Windows Vista to Windows 7?”. This paper describes our approach towards constructing practical, versatile performance models to address this problem. The goal is to have these models be useful for application developers to help expand application testing coverage and for IT administrators to assist with understanding the performance consequences of a software, hardware or configuration change. This paper’s main contributions are in system building and performance modeling. We believe we have built applications that are easier to model because we have proactively instrumented them to export their state and associated metrics. This application-specific monitoring is always on and interesting data is collected from real, "in-the-wild" deployments. The models we are experimenting with are based on statistical techniques. They require no modifications to the OS or applications beyond the above instrumentation, and no explicit a priori model on how an OS or application should behave. We are in the process of learning from models we have constructed for several Microsoft products, including the Office suite, Visual Studio and Media Player. This paper presents preliminary findings from a large user deployment (several hundred thousand user sessions) of these applications that show the coverage and limitations of such models. These findings pushed us to move beyond averages/means and go into some depth into why client application performance has an inherently large variance.
Pattern Insight, Inc.
Customer problem troubleshooting has been a critically important issue for both customers and system providers. This paper makes two major contributions to better understand this topic. First, it provides one of the first characteristic studies of customer problem troubleshooting using a large set (636,108) of real-world customer cases reported from 100,000 commercially deployed storage systems in the last two years. We study the characteristics of customer problem troubleshooting from various dimensions as well as the correlation among them. Our results show that while some failures are either benign or resolved automatically, many others can take hours or days of manual diagnosis to fix. For modern storage systems, hardware failures and misconfigurations dominate customer cases, but software failures take longer to resolve. Interestingly, a relatively significant percentage of cases arise because customers lack sufficient knowledge about the system. We observe that customer problems with attached system logs are invariably resolved much faster than those without logs. Second, we evaluate the potential of using storage system logs to resolve these problems. Our analysis shows that a failure message alone is a poor indicator of root cause, and that combining failure messages with multiple log events can improve low-level root cause prediction by a factor of three. We then discuss the challenges in log analysis and possible solutions.

