Towards Automated Performance Diagnosis in a Large IPTV Network
Cited by 17 (4 self)
IPTV is increasingly being deployed and offered as a commercial service to residential broadband customers. Compared with traditional ISP networks, an IPTV distribution network (i) typically adopts a hierarchical instead of mesh-like structure, (ii) imposes more stringent requirements on both reliability and performance, (iii) has different distribution protocols (which make heavy use of IP multicast) and traffic patterns, and (iv) faces more serious scalability challenges in managing millions of network elements. These unique characteristics impose tremendous challenges in the effective management of IPTV network and service. In this paper, we focus on characterizing and troubleshooting performance issues in one of the largest IPTV networks in North America. We collect a large amount of measurement data from a wide range of sources, including device usage and error logs, user activity logs, video quality alarms, and customer trouble tickets. We develop a novel diagnosis tool called Giza that is specifically tailored to the enormous scale and hierarchical structure of the IPTV network. Giza applies multi-resolution data analysis to quickly detect and localize regions in the IPTV distribution hierarchy that are experiencing serious performance problems. Giza then uses several statistical data mining techniques to troubleshoot the identified problems and diagnose their root causes. Validation against operational experiences demonstrates the effectiveness of Giza in detecting important performance issues and identifying interesting dependencies. The methodology and algorithms in Giza promise to be of great use in IPTV network operations.
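The abstract's idea of multi-resolution detection over a hierarchy can be illustrated with a small sketch. This is not Giza's actual algorithm: the `Node` type, the region and office names, and the fixed event-rate threshold are all assumptions made for illustration. The sketch aggregates error events up a tree and descends only into subtrees whose aggregate rate looks anomalous, so healthy regions are never examined in detail.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """One element of a hypothetical distribution hierarchy."""
    name: str
    events: int = 0      # error events recorded directly at this node
    customers: int = 0   # customers served directly by this node
    children: List["Node"] = field(default_factory=list)


def totals(node):
    """Aggregate (events, customers) over the subtree rooted at node."""
    events, customers = node.events, node.customers
    for child in node.children:
        child_events, child_customers = totals(child)
        events += child_events
        customers += child_customers
    return events, customers


def localize(node, threshold):
    """Return the deepest nodes whose aggregate event rate exceeds threshold.

    Assumption: a fixed events-per-customer threshold stands in for
    whatever statistical test a real tool would apply.
    """
    events, customers = totals(node)
    if customers == 0 or events / customers <= threshold:
        return []
    hot = [hit for child in node.children for hit in localize(child, threshold)]
    # If no child accounts for the anomaly on its own, report this node.
    return hot or [node.name]


tree = Node("root", children=[
    Node("region-A", children=[
        Node("office-1", events=2, customers=1000),
        Node("office-2", events=90, customers=1000),
    ]),
    Node("region-B", children=[
        Node("office-3", events=3, customers=1000),
    ]),
])

hot_spots = localize(tree, threshold=0.02)
```

Here the root and region-A both exceed the threshold, but the search narrows the problem down to office-2; region-B is dismissed after a single aggregate check.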
Moving Beyond End-to-End Path Information to Optimize CDN Performance
In IMC, 2009
Cited by 16 (4 self)
Replicating content across a geographically distributed set of servers and redirecting clients to the closest server in terms of latency has emerged as a common paradigm for improving client performance. In this paper, we analyze latencies measured from servers in Google’s content distribution network (CDN) to clients all across the Internet to study the effectiveness of latency-based server selection. Our main result is that redirecting every client to the server with least latency does not suffice to optimize client latencies. First, even though most clients are served by a geographically nearby CDN node, a sizeable fraction of clients experience latencies several tens of milliseconds higher than other clients in the same region. Second, we find that queueing delays often override the benefits of a client interacting with a nearby server. To help the administrators of Google’s CDN cope with these
WebProphet: Automating Performance Prediction for Web Services
In NSDI, 2010
Cited by 4 (0 self)
Today, large-scale web services run on complex systems, spanning multiple data centers and content distribution networks, with performance depending on diverse factors in end systems, networks, and infrastructure servers. Web service providers have many options for improving service performance, varying greatly in feasibility, cost and benefit, but have few tools to predict the impact of these options. A key challenge is to precisely capture web object dependencies, as these are essential for predicting performance in an accurate and scalable manner. In this paper, we introduce WebProphet, a system that automates performance prediction for web services. WebProphet employs a novel technique based on timing perturbation to extract web object dependencies, and then uses these dependencies to predict the performance impact of changes to the handling of the objects. We have built, deployed, and evaluated the accuracy and efficiency of WebProphet. Applying WebProphet to the Search and Maps services of Google and Yahoo, we find WebProphet predicts the median and 95th percentiles of the page load time distribution with an error rate smaller than 16% in most cases. Using Yahoo Maps as an example, we find that WebProphet reduces the problem of performance optimization to a small number of web objects whose optimization would reduce the page load time by nearly 40%.
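To make the role of object dependencies concrete, the following minimal sketch predicts page load time from a dependency graph that is already known. It does not implement the timing-perturbation extraction described in the abstract. The object names, the durations, and the model in which an object starts loading once all of its parents have finished are assumptions for illustration only.

```python
def page_load_time(durations, deps):
    """Finish time of the last object on the page.

    durations: load duration of each object in milliseconds.
    deps: deps[obj] lists the objects that must finish before obj starts.
    Assumes the dependency graph is acyclic.
    """
    finish = {}

    def finish_time(obj):
        if obj not in finish:
            start = max((finish_time(d) for d in deps.get(obj, ())), default=0.0)
            finish[obj] = start + durations[obj]
        return finish[obj]

    return max(finish_time(obj) for obj in durations)


# Hypothetical page: names and numbers are made up.
durations = {"index.html": 100, "app.js": 200, "style.css": 50, "tiles.png": 300}
deps = {
    "app.js": ["index.html"],
    "style.css": ["index.html"],
    "tiles.png": ["app.js"],
}

baseline = page_load_time(durations, deps)

# What-if: halve the load time of app.js, which lies on the critical path.
faster_js = page_load_time({**durations, "app.js": 100}, deps)

# What-if: make style.css free; it is off the critical path, so nothing changes.
free_css = page_load_time({**durations, "style.css": 0}, deps)
```

The two what-if runs show why dependencies matter: only objects on the critical path are worth optimizing.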
Predicting Execution Time of Computer Programs Using Sparse Polynomial Regression
Cited by 2 (0 self)
Predicting the execution time of computer programs is an important but challenging problem in the computer systems community. Existing methods require experts to perform detailed analysis of program code in order to construct predictors or select important features. We recently developed a new system to automatically extract a large number of features from program execution on sample inputs, on which prediction models can be constructed without expert knowledge. In this paper we study the construction of predictive models for this problem. We propose the SPORE (Sparse POlynomial REgression) methodology to build accurate prediction models of program performance using feature data collected from program execution on sample inputs. Our two SPORE algorithms are able to build relationships between responses (e.g., the execution time of a computer program) and features, and select a few from hundreds of the retrieved features to construct an explicitly sparse and non-linear model to predict the response variable. The compact and explicitly polynomial form of the estimated model could reveal important insights into the computer program (e.g., features and their non-linear combinations that dominate the execution time), enabling a better understanding of the program’s behavior. Our evaluation on three widely used computer programs shows that SPORE methods can give accurate prediction with relative error less than 7% by using a moderate number of training data samples. In addition, we compare SPORE algorithms to state-of-the-art sparse regression algorithms, and show that SPORE methods, motivated by real applications, outperform the other methods in terms of both interpretability and prediction accuracy.
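A rough sketch of what a sparse polynomial model looks like in practice follows. It uses plain greedy forward selection over monomials with a least-squares refit at each step, which is a stand-in and not either of the paper's two SPORE algorithms. The feature data, the degree limit, and all function names are assumptions for illustration.

```python
from itertools import combinations_with_replacement


def poly_terms(n_features, degree):
    """All monomials up to `degree` as tuples of feature indices; () is the constant."""
    return [t for d in range(degree + 1)
            for t in combinations_with_replacement(range(n_features), d)]


def evaluate(term, row):
    value = 1.0
    for i in term:
        value *= row[i]
    return value


def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_sparse_poly(rows, y, degree=2, max_terms=2):
    """Greedily pick the monomial best aligned with the residual, then refit."""
    terms = poly_terms(len(rows[0]), degree)
    columns = {t: [evaluate(t, r) for r in rows] for t in terms}
    chosen, coefs, residual = [], [], list(y)
    for _ in range(max_terms):
        def score(t):
            col = columns[t]
            norm = sum(c * c for c in col) ** 0.5
            return abs(sum(c * r for c, r in zip(col, residual))) / norm if norm else 0.0
        chosen.append(max((t for t in terms if t not in chosen), key=score))
        # Least-squares refit on the chosen terms via the normal equations.
        gram = [[sum(p * q for p, q in zip(columns[s], columns[t])) for t in chosen]
                for s in chosen]
        rhs = [sum(p * v for p, v in zip(columns[s], y)) for s in chosen]
        coefs = solve(gram, rhs)
        residual = [v - sum(c * columns[t][k] for c, t in zip(coefs, chosen))
                    for k, v in enumerate(y)]
    return dict(zip(chosen, coefs))


# Made-up data in which the response is 3 + 2 * x0 * x1.
rows = [(1, 2), (2, 1), (3, 5), (4, 3), (5, 8), (6, 2)]
y = [3 + 2 * x0 * x1 for x0, x1 in rows]
model = fit_sparse_poly(rows, y)
```

The returned dictionary maps each selected monomial to its coefficient, so the model stays readable: on this toy data it keeps only the interaction term `(0, 1)` and the constant `()`.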
Predico: A System for What-if Analysis in Complex Data Center Applications
Cited by 1 (0 self)
Modern data center applications are complex distributed systems with tens or hundreds of interacting software components. An important management task in data centers is to predict the impact of a certain workload or reconfiguration change on the performance of the application. Such predictions require the design of “what-if” models of the application that take as input hypothetical changes in the application’s workload or environment and estimate their impact on performance. We present Predico, a workload-based what-if analysis system that uses commonly available monitoring information in large-scale systems to enable the administrators to ask a variety of workload-based “what-if” queries about the system. Predico uses a network of queues to analytically model the behavior of large distributed applications. It automatically generates node-level queueing models and then uses model composition to build system-wide models. Predico employs a simple what-if query language and an intelligent query execution algorithm that employs on-the-fly model construction and a change propagation algorithm to efficiently answer queries on large-scale systems. We have built a prototype of Predico and have used traces from two large production applications from a financial institution as well as real-world synthetic applications to evaluate its what-if modeling framework. Our experimental evaluation validates the accuracy of Predico’s node-level resource usage, latency and workload models and then shows how Predico enables what-if analysis in two different applications.
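The composition of node-level queueing models into a system-wide what-if model can be sketched as follows. The abstract does not say which queueing discipline Predico uses, so treating every node as an M/M/1 queue is an assumption, as are the node parameters and function names. Under that assumption, mean latency per visit is 1 / (service rate - arrival rate), and the end-to-end mean latency is the visit-weighted sum over nodes.

```python
def mm1_latency(arrival_rate, service_rate):
    """Mean time in an M/M/1 queue (waiting plus service), in seconds."""
    if arrival_rate >= service_rate:
        raise ValueError("node is saturated: arrival rate >= service rate")
    return 1.0 / (service_rate - arrival_rate)


def end_to_end_latency(request_rate, nodes):
    """Compose node-level models into a system-wide mean latency.

    nodes: list of (visits_per_request, service_rate) pairs, one per node.
    Each node sees request_rate * visits arrivals per second.
    """
    total = 0.0
    for visits, service_rate in nodes:
        arrival_rate = request_rate * visits
        total += visits * mm1_latency(arrival_rate, service_rate)
    return total


# Hypothetical two-tier application: a front end visited once per request
# and a database visited twice per request.
nodes = [(1, 100.0), (2, 300.0)]

baseline = end_to_end_latency(50.0, nodes)   # current workload
what_if = end_to_end_latency(80.0, nodes)    # "what if the workload grows to 80 req/s?"
```

A change in the external request rate propagates to every node through its visit count, which is the simplest form of the change propagation the abstract mentions.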
Q-score: Proactive Service Quality Assessment in a Large IPTV System
Cited by 1 (0 self)
In large-scale IPTV systems, it is essential to maintain high service quality while providing a wider variety of service features than typical traditional TV. Thus service quality assessment systems are of paramount importance as they monitor the user-perceived service quality and alert when issues occur. For IPTV systems, however, there is no simple metric to represent user-perceived service quality and Quality of Experience (QoE). Moreover, there is only limited user feedback, often in the form of noisy and delayed customer calls. Therefore, we aim to approximate the QoE through a selected set of performance indicators in a proactive (i.e., detecting issues before customers report them to call centers) and scalable fashion. In this paper, we present a service quality assessment framework, Q-score, which accurately learns a small set of performance indicators most relevant to user-perceived service quality, and proactively infers service quality in a single score. We evaluate Q-score using network data collected from a commercial IPTV service provider and show that Q-score is able to predict 60% of the service problems that are reported by customers with 0.1% false positives. Through Q-score, we have (i) gained insight into various types of service problems causing user dissatisfaction, including why users tend to react promptly to sound issues but late to video issues; (ii) identified and quantified the opportunity to proactively detect the service quality degradation of individual customers before severe performance impact occurs; and (iii) observed the possibility of allocating customer care workforce to potentially troubling service areas before issues break out.
SYNERGY: Detecting and Diagnosing Correlated Network Anomalies
Network anomalies occur in operational networks and may be logged by a number of network measurement tools such as SNMP and NetFlow. However, accurate and efficient detection of these anomalies in the logged data is very challenging due to the huge data volume and complex characteristics of anomalies. The existing approaches are limited by the nature of underlying mathematical models and might be incapable of capturing some abnormal patterns. More importantly, existing approaches do not provide insights on the root causes or impact of the detected anomalies, which makes it hard for a network operator to troubleshoot network performance issues. In this paper, we design and prototype a novel system, SYNERGY, that can detect network anomalies with high confidence by correlating across multiple data sources. It can report the root causes/impact associated with the detected anomalies, which significantly facilitates the work of network operators. In addition, SYNERGY provides a great facility for the area of anomaly detection research: it can serve as a general framework to evaluate the performance of different anomaly detection methods. We evaluate SYNERGY using data collected at a tier-1 ISP network and show that it performs very well compared to the manually identified anomalies found in the operational practice. The methodology and algorithms in SYNERGY promise to be of immense use to network operations.
management
Networks continue to change to support new applications, improve reliability and performance, and reduce operational cost. The changes are made to the network in the form of upgrades such as software or hardware upgrades, new network or service features, and network configuration changes. It is crucial to monitor the network when upgrades are made because they can have a significant impact on network performance and, if not monitored, may lead to unexpected consequences in operational networks. This can be achieved manually for a small number of devices, but does not scale to large networks with hundreds or thousands of routers and an extremely large number of different upgrades made on a regular basis. In this paper, we design and implement a novel infrastructure, MERCURY, for detecting the impact of network upgrades (or triggers) on performance. MERCURY extracts interesting triggers from a large number of network maintenance activities. It then identifies behavior changes in network performance caused by the triggers. It uses statistical rule mining and network configuration to identify commonality across the behavior changes. We systematically evaluate MERCURY using data collected at a large tier-1 ISP network. By comparing to operational practice, we show that MERCURY is able to capture the interesting triggers and behavior changes induced by the triggers. In some cases, MERCURY also discovers previously unknown network behaviors, demonstrating its effectiveness in identifying network conditions flying under the radar.
Analyzing IPTV Set-Top Box Crashes
Recent advances in residential broadband access technology have led to a wave of commercial IPTV deployment. As IPTV services are rolled out at scale, it is essential for IPTV systems to maintain ultra-high reliability and performance. A major issue that disrupts IPTV service is the crash of the set-top box (STB) software, which directly resides inside the consumer’s home network and provides the essential interface to both the user and the network to deliver rich content that goes well beyond traditional TV. To understand the potential causes of STB crashes, we perform in-depth statistical analysis on the relationship among STB crashes, video stream contents, and user activities using logs collected from a large commercial IPTV system. Our initial results suggest that (i) impaired video streams may cause the STB to crash, and (ii) continuous usage of the STB may gradually degrade the STB health over time.

