Results 1 - 10
of
47
Towards highly reliable enterprise network services via inference of multi-level dependencies
- In SIGCOMM
, 2007
"... Localizing the sources of performance problems in large enterprise networks is extremely challenging. Dependencies are numerous, complex and inherently multi-level, spanning hardware and software components across the network and the computing infrastructure. To exploit these dependencies for fast, ..."
Abstract
-
Cited by 82 (7 self)
- Add to MetaCart
Localizing the sources of performance problems in large enterprise networks is extremely challenging. Dependencies are numerous, complex and inherently multi-level, spanning hardware and software components across the network and the computing infrastructure. To exploit these dependencies for fast, accurate problem localization, we introduce an Inference Graph model, which is welladapted to user-perceptible problems rooted in conditions giving rise to both partial service degradation and hard faults. Further, we introduce the Sherlock system to discover Inference Graphs in the operational enterprise, infer critical attributes, and then leverage the result to automatically detect and localize problems. To illuminate strengths and limitations of the approach, we provide results from a prototype deployment in a large enterprise network, as well as from testbed emulations and simulations. In particular, we find that taking into account multi-level structure leads to a 30 % improvement in fault localization, as compared to two-level approaches.
Shrink: A tool for failure diagnosis in IP networks
- In MineNet
, 2005
"... Faults in an IP network have various causes such as the failure of one or more routers at the IP layer, fiber-cuts, failure of physical elements at the optical layer, or extraneous causes like power outages. These faults are usually detected as failures of a set of dependent logical entities–the IP ..."
Abstract
-
Cited by 36 (0 self)
- Add to MetaCart
Faults in an IP network have various causes such as the failure of one or more routers at the IP layer, fiber-cuts, failure of physical elements at the optical layer, or extraneous causes like power outages. These faults are usually detected as failures of a set of dependent logical entities–the IP links affected by the failed components. We present Shrink, a tool for root cause analysis of network faults which, given a set of failed IP links, identifies the underlying cause of the faulty state. Shrink models the diagnosis problem as a Bayesian network. It has two main contributions. First, it effectively accounts for noisy measurement and inaccurate mapping between the IP and optical layers. Second, it has an efficient inference algorithm that finds the most likely failure causes in polynomial time and with bounded errors. We compare Shrink with two prior approaches and show that it substantially improves the performance.
Detailed Diagnosis in Enterprise Networks
, 2009
"... By studying trouble tickets from small enterprise networks, we conclude that their operators need detailed fault diagnosis. That is, the diagnostic system should be able to diagnose not only generic faults (e.g., performance-related) but also application specific faults (e.g., error codes). It sho ..."
Abstract
-
Cited by 22 (2 self)
- Add to MetaCart
By studying trouble tickets from small enterprise networks, we conclude that their operators need detailed fault diagnosis. That is, the diagnostic system should be able to diagnose not only generic faults (e.g., performance-related) but also application specific faults (e.g., error codes). It should also identify culprits at a fine granularity such as a process or firewall configuration. We build a system, called NetMedic, that enables detailed diagnosis by harnessing the rich information exposed by modern operating systems and applications. It formulates detailed diagnosis as an inference problem that more faithfully captures the behaviors and interactions of fine-grained network components such as processes. The primary challenge in solving this problem is inferring when a component might be impacting another. Our solution is based on an intuitive technique that uses the joint behavior of two components in the past to estimate the likelihood of them impacting one another in the present. We find that our deployed prototype is effective at diagnosing faults that we inject in a live environment. The faulty component is correctly identified as the most likely culprit in 80% of the cases and is almost always in the list of top five culprits.
Detection and Localization of Network Black Holes
- In Proceedings of IEEE Infocom
, 2007
"... Abstract — Internet backbone networks are under constant flux, struggling to keep up with increasing demand. The pace of technology change often outstrips the deployment of associated fault monitoring capabilities that are built into today’s IP protocols and routers. Moreover, some of these new tech ..."
Abstract
-
Cited by 15 (4 self)
- Add to MetaCart
Abstract — Internet backbone networks are under constant flux, struggling to keep up with increasing demand. The pace of technology change often outstrips the deployment of associated fault monitoring capabilities that are built into today’s IP protocols and routers. Moreover, some of these new technologies cross networking layers, raising the potential for unanticipated interactions and service disruptions that the built-in monitoring systems cannot detect. In such instances, failures may cause data packets to be silently dropped inside the network without triggering any alarms or responses (e.g., the failure is not routed around). So-called “silent failures ” or “black holes” represent a critical threat to today’s rapidly evolving networks. In this paper, we present a simple and effective method to detect and diagnose such silent failures. Our method uses active measurement between edge routers to raise alarms whenever endto-end connectivity is disrupted, regardless of the cause. These alarms feed localization agents that employ spatial correlation techniques to isolate the root-cause of failure. Using data from two real systems deployed on sections of a tier-I ISP network, we successfully detect and localize three known black holes. Further, we present simulation results demonstrating that our system accurately and precisely (both greater than 80 % according to our metrics) localizes a variety of failures classes. I.
NetDiagnoser: Troubleshooting network unreachabilities using end-to-end probes and routing data
- IN CONEXT
, 2007
"... The distributed nature of the Internet makes it difficult for a single service provider to troubleshoot the disruptions experienced by its customers. We propose NetDiagnoser, a troubleshooting algorithm to identify the location of failures in an internetwork environment. First, we adapt the wellknow ..."
Abstract
-
Cited by 14 (1 self)
- Add to MetaCart
The distributed nature of the Internet makes it difficult for a single service provider to troubleshoot the disruptions experienced by its customers. We propose NetDiagnoser, a troubleshooting algorithm to identify the location of failures in an internetwork environment. First, we adapt the wellknown Boolean tomography technique to work in this environment. Then, we significantly extend this technique to improve the diagnosis accuracy in the presence of multiple link failures, logical failures (for instance, misconfigurations of route export filters), and incomplete topology inference. In particular, NetDiagnoser takes advantage of rerouted paths, routing messages collected at one provider’s network and Looking Glass servers. We evaluate each feature of Net-Diagnoser separately using C-BGP simulations on realistic topologies. Our results show that NetDiagnoser can successfully identify a small set of links, which almost always includes the actually failed/misconfigured links.
Automating Network Application Dependency Discovery: Experiences, Limitations, and New Solutions
"... Abstract – Large enterprise networks consist of thousands of services and applications. The performance and reliability of any particular application may depend on multiple services, spanning many hosts and network components. While the knowledge of such dependencies is invaluable for ensuring the s ..."
Abstract
-
Cited by 13 (1 self)
- Add to MetaCart
Abstract – Large enterprise networks consist of thousands of services and applications. The performance and reliability of any particular application may depend on multiple services, spanning many hosts and network components. While the knowledge of such dependencies is invaluable for ensuring the stability and efficiency of these applications, thus far the only proven way to discover these complex dependencies is by exploiting human expert knowledge, which does not scale with the number of applications in large enterprises. Recently, researchers have proposed automated discovery of dependencies from network traffic [8, 18]. In this paper, we present a comprehensive study of the performance and limitations of this class of dependency discovery techniques (including our own prior work), by comparing with the ground truth of five dominant Microsoft applications. We introduce a new system, Orion, that discovers dependencies using packet headers and timing information in network traffic based on a novel insight of delay spike based analysis. Orion improves the state of the art significantly, but some shortcomings still remain. To take the next step forward, Orion incorporates external tests to reduce errors to a manageable level. Our results show Orion provides a solid foundation for combining automated discovery with simple testing to obtain accurate and validated dependencies. 1
Cross-layer visibility as a service
- In Proc. IV HotNets Workshop
, 2005
"... Accurate cross-layer associations play an essential role in today’s network management tasks such as backbone planning, maintenance, and failure diagnosis. Current techniques for manually maintaining these associations are complex, tedious, and error-prone. One possible approach is to widen the inte ..."
Abstract
-
Cited by 12 (0 self)
- Add to MetaCart
Accurate cross-layer associations play an essential role in today’s network management tasks such as backbone planning, maintenance, and failure diagnosis. Current techniques for manually maintaining these associations are complex, tedious, and error-prone. One possible approach is to widen the interfaces between layers to support auto discovery. We argue instead that it is less useful to export additional data between layers than to import information into a separate, logically centralized management database. The specification of an interface to this database enables independent evolution of individual layers, side-stepping the challenges inherent in wide layer interfaces. Furthermore, management tools can leverage the network-wide cross-layer visibility provided by such a database to deliver enhanced services that depend on physical- or link-layer diversity. 1
Studying Black Holes in the Internet with Hubble
"... We present Hubble, a system that operates continuously to find Internet reachability problems in which routes exist to a destination but packets are unable to reach the destination. Hubble monitors at a 15 minute granularity the data-path to prefixes that cover 89 % of the Internet’s edge address sp ..."
Abstract
-
Cited by 12 (6 self)
- Add to MetaCart
We present Hubble, a system that operates continuously to find Internet reachability problems in which routes exist to a destination but packets are unable to reach the destination. Hubble monitors at a 15 minute granularity the data-path to prefixes that cover 89 % of the Internet’s edge address space. Key enabling techniques include a hybrid passive/active monitoring approach and the synthesis of multiple information sources that include historical data. With these techniques, we estimate that Hubble discovers 85 % of the reachability problems that would be found with a pervasive probing approach, while issuing only 5.5 % as many probes. We also present the results of a three week study conducted with Hubble. We find that the extent of reachability problems, both in number and duration, is much greater than we expected, with problems persisting for hours and even days, and many of the problems do not correlate with BGP updates. In many cases, a multi-homed AS is reachable through one provider, but probes through another terminate; using spoofed packets, we isolated the direction of failure in 84 % of cases we analyzed and found all problems to be exclusively on the forward path from the provider to the destination. A snapshot of the problems Hubble is currently monitoring can be found at
Shadow configuration as a network management primitive
- In SIGCOMM
, 2008
"... Configurations for today’s IP networks are becoming increasingly complex. As a result, configuration management is becoming a major cost factor for network providers and configuration errors are becoming a major cause of network disruptions. In this paper, we present and evaluate the novel idea of s ..."
Abstract
-
Cited by 12 (0 self)
- Add to MetaCart
Configurations for today’s IP networks are becoming increasingly complex. As a result, configuration management is becoming a major cost factor for network providers and configuration errors are becoming a major cause of network disruptions. In this paper, we present and evaluate the novel idea of shadow configurations. Shadow configurations allow configuration evaluation before deployment and thus can reduce potential network disruptions. We demonstrate using real implementation that shadow configurations can be implemented with low overhead.

