Towards highly reliable enterprise network services via inference of multi-level dependencies
- In SIGCOMM, 2007
Cited by 82 (7 self)
Localizing the sources of performance problems in large enterprise networks is extremely challenging. Dependencies are numerous, complex and inherently multi-level, spanning hardware and software components across the network and the computing infrastructure. To exploit these dependencies for fast, accurate problem localization, we introduce an Inference Graph model, which is well-adapted to user-perceptible problems rooted in conditions giving rise to both partial service degradation and hard faults. Further, we introduce the Sherlock system to discover Inference Graphs in the operational enterprise, infer critical attributes, and then leverage the result to automatically detect and localize problems. To illuminate strengths and limitations of the approach, we provide results from a prototype deployment in a large enterprise network, as well as from testbed emulations and simulations. In particular, we find that taking into account multi-level structure leads to a 30% improvement in fault localization, as compared to two-level approaches.
X-Trace: A pervasive network tracing framework
- In NSDI, 2007
Cited by 76 (14 self)
Modern Internet systems often combine different applications (e.g., DNS, web, and database), span different administrative domains, and function in the context of network mechanisms like tunnels, VPNs, NATs, and overlays. Diagnosing these complex systems is a daunting challenge. Although many diagnostic tools exist, they are typically designed for a specific layer (e.g., traceroute) or application, and there is currently no tool for reconstructing a comprehensive view of service behavior. In this paper we propose X-Trace, a tracing framework that provides such a comprehensive view for systems that adopt it. We have implemented X-Trace in several protocols and software systems, and we discuss how it works in three deployed scenarios: DNS resolution, a three-tiered photo-hosting website, and a service accessed through an overlay network.
Pip: Detecting the unexpected in distributed systems
- In NSDI ’06: Proceedings of the 3rd Symposium on Networked Systems Design & Implementation
Cited by 75 (6 self)
Bugs in distributed systems are often hard to find. Many bugs reflect discrepancies between a system’s behavior and the programmer’s assumptions about that behavior. We present Pip, an infrastructure for comparing actual behavior and expected behavior to expose structural errors and performance problems in distributed systems. Pip allows programmers to express, in a declarative language, expectations about the system’s communications structure, timing, and resource consumption. Pip includes system instrumentation and annotation tools to log actual system behavior, and visualization and query tools for exploring expected and unexpected behavior. Pip allows a developer to quickly understand and debug both familiar and unfamiliar systems. We applied Pip to several applications, including FAB, SplitStream, Bullet, and RanSub. We generated most of the instrumentation for all four applications automatically. We found the needed expectations easy to write, starting in each case with automatically generated expectations. Pip found unexpected behavior in each application, and helped to isolate the causes of poor performance and incorrect behavior.
Capturing, indexing, clustering, and retrieving system history
- In SOSP, 2005
Cited by 65 (5 self)
Keywords: system performance, Bayesian networks, information retrieval, problem signatures
We present a method for automatically extracting from a running system an indexable signature that distills the essential characteristic from a system state and that can be subjected to automated clustering and similarity-based retrieval to identify when an observed system state is similar to a previously-observed state. This allows operators to identify and quantify the frequency of recurrent problems, to leverage previous diagnostic efforts, and to establish whether problems seen at different installations of the same site are similar or distinct. We show that the naive approach to constructing these signatures based on simply recording the actual "raw" values of collected measurements is
Stardust: Tracking Activity in a Distributed Storage System
- ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems (Saint-Malo, 2006
Cited by 40 (10 self)
Performance monitoring in most distributed systems provides minimal guidance for tuning, problem diagnosis, and decision making. Stardust is a monitoring infrastructure that replaces traditional performance counters with end-to-end traces of requests and allows for efficient querying of performance metrics. Such traces better inform key administrative performance challenges by enabling, for example, extraction of per-workload, per-resource demand information and per-workload latency graphs. This paper reports on our experience building and using end-to-end tracing as an on-line monitoring tool in a distributed storage system. Using diverse system workloads and scenarios, we show that such fine-grained tracing can be made efficient (less than 6% overhead) and is useful for on- and off-line analysis of system behavior. These experiences make a case for having other systems incorporate such an instrumentation framework.
Automated Known Problem Diagnosis with Event Traces
- In EuroSys, 2006
Cited by 36 (5 self)
Computer problem diagnosis remains a serious challenge to users and support professionals. Traditional troubleshooting methods relying heavily on human intervention make the process inefficient and the results inaccurate even for solved problems, which contribute significantly to user’s dissatisfaction. We propose to use system behavior information such as system event traces to build correlations with solved problems, instead of using only vague text descriptions as in existing practices. The goal is to enable automatic identification of the root cause of a problem if it is a known one, which would further lead to its resolution. By applying statistical learning techniques to classifying system call sequences, we show our approach can achieve considerable accuracy of root cause recognition by studying four case examples.
Exploiting nonstationarity for performance prediction
- In Proc. of EuroSys 2007, Lisbon, 2007
Cited by 36 (10 self)
Real production applications ranging from enterprise applications to large e-commerce sites share a crucial but seldom-noted characteristic: The relative frequencies of transaction types in their workloads are nonstationary, i.e., the transaction mix changes over time. Accurately predicting application-level performance in business-critical production applications is an increasingly important problem. However, transaction mix nonstationarity casts doubt on the practical usefulness of prediction methods that ignore this phenomenon. This paper demonstrates that transaction mix nonstationarity enables a new approach to predicting application-level performance as a function of transaction mix. We exploit nonstationarity to circumvent the need for invasive instrumentation and controlled benchmarking during model calibration; our approach relies solely on lightweight passive measurements that are routinely collected in today’s production environments. We evaluate predictive accuracy on two real business-critical production applications. The accuracy of our response time predictions ranges from 10% to 16% on these applications, and our models generalize well to workloads very different from those used for calibration. We apply our technique to the challenging problem of predicting the impact of application consolidation on transaction response times. We calibrate models of two testbed applications running on dedicated machines, then use the models to predict their performance when they run together on a shared machine and serve very different workloads. Our predictions are accurate to within 4% to 14%. Existing approaches to consolidation decision support predict post-consolidation resource utilizations. Our method allows application-level performance to guide consolidation decisions.
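The idea of predicting performance as a function of transaction mix can be illustrated with a small sketch. It is not the authors' model: it assumes, purely for illustration, that the summed response time in a measurement interval is a linear function of how many transactions of each of two types ran in that interval, and it fits the per-type costs by ordinary least squares. The names fit_costs and predict and all numbers are hypothetical.

```python
# Illustrative sketch only (assumed linear model, two transaction types).
# Fits per-type costs from passive per-interval measurements by solving
# the 2x2 least-squares normal equations directly.

def fit_costs(counts, total_times):
    """counts: list of (n_a, n_b) transaction counts, one pair per interval.
    total_times: summed response time observed in each interval.
    Returns (cost_a, cost_b) minimizing the squared prediction error."""
    saa = sum(a * a for a, _ in counts)
    sab = sum(a * b for a, b in counts)
    sbb = sum(b * b for _, b in counts)
    sat = sum(a * t for (a, _), t in zip(counts, total_times))
    sbt = sum(b * t for (_, b), t in zip(counts, total_times))
    det = saa * sbb - sab * sab
    if det == 0:
        # A stationary mix makes the count columns proportional, so the
        # individual per-type costs cannot be separated.
        raise ValueError("transaction mix never changes; costs not identifiable")
    cost_a = (sat * sbb - sbt * sab) / det
    cost_b = (saa * sbt - sab * sat) / det
    return cost_a, cost_b


def predict(costs, mix):
    """Predicted summed response time for a new (n_a, n_b) mix."""
    return sum(c * n for c, n in zip(costs, mix))


# Hypothetical measurements: three intervals with different mixes.
observed_counts = [(10, 2), (4, 8), (6, 6)]
observed_times = [9.0, 18.0, 15.0]
costs = fit_costs(observed_counts, observed_times)
```

In this toy setting the fit only works because the mix differs between intervals; if every interval had the same ratio of the two types, the determinant would be zero and the costs could not be told apart, which is one way to read the abstract's claim that nonstationarity removes the need for controlled benchmarking.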
Friday: Global comprehension for distributed replay
- In Proceedings of the Fourth Symposium on Networked Systems Design and Implementation (NSDI ’07, 2007
Cited by 26 (0 self)
Debugging and profiling large-scale distributed applications is a daunting task. We present Friday, a system for debugging distributed applications that combines deterministic replay of components with the power of symbolic, low-level debugging and a simple language for expressing higher-level distributed conditions and actions. Friday allows the programmer to understand the collective state and dynamics of a distributed collection of coordinated application components. To evaluate Friday, we consider several distributed problems, including routing consistency in overlay networks, and temporal state abnormalities caused by route flaps. We show via micro-benchmarks and larger-scale application measurement that Friday can be used interactively to debug large distributed applications under replay on common hardware.
CrystalBall: Predicting and Preventing Inconsistencies in Deployed Distributed Systems
Cited by 22 (3 self)
We propose a new approach for developing and deploying distributed systems, in which nodes predict distributed consequences of their actions, and use this information to detect and avoid errors. Each node continuously runs a state exploration algorithm on a recent consistent snapshot of its neighborhood and predicts possible future violations of specified safety properties. We describe a new state exploration algorithm, consequence prediction, which explores causally related chains of events that lead to property violation. This paper describes the design and implementation of this approach, termed CrystalBall. We evaluate CrystalBall on RandTree, BulletPrime, Paxos, and Chord distributed system implementations. We identified new bugs in mature Mace implementations of three systems. Furthermore, we show that if the bug is not corrected during system development, CrystalBall is effective in steering the execution away from inconsistent states at runtime.
Using Queries for Distributed Monitoring and Forensics
- 2006
Cited by 18 (1 self)
Distributed systems are hard to build, profile, debug, and test. Monitoring a distributed system – to detect and analyze bugs, test for regressions, identify fault-tolerance problems or security compromises – can be difficult and error-prone. In this paper we argue that declarative development of distributed systems is well suited to tackle these tasks. We present an application logging, monitoring, and debugging facility that we have built on top of the P2 system, comprising an introspection model, an execution tracing component, and a distributed query processor. We use this facility to demonstrate a range of on-line distributed diagnosis tools that range from simple, local state assertions to sophisticated global property detectors on consistent snapshots. These tools are small, simple, and can be deployed piecemeal on-line at any point during a system’s life cycle. Our evaluation suggests that the overhead of our approach to improving and monitoring running distributed systems continuously is well in tune with its benefits.

