Results 1  10
of
60
Less is more: Compact matrix decomposition for large sparse graphs
, 2007
"... Given a large sparse graph, how can we find patterns and anomalies? Several important applications can be modeled as large sparse graphs, e.g., network traffic monitoring, research citation network analysis, social network analysis, and regulatory networks in genes. Low rank decompositions, such as ..."
Abstract

Cited by 21 (2 self)
 Add to MetaCart
Given a large sparse graph, how can we find patterns and anomalies? Several important applications can be modeled as large sparse graphs, e.g., network traffic monitoring, research citation network analysis, social network analysis, and regulatory networks in genes. Low rank decompositions, such as SVD and CUR, are powerful techniques for revealing latent/hidden variables and associated patterns from high dimensional data. However, those methods often ignore the sparsity property of the graph, and hence usually incur too high memory and computational cost to be practical. We propose a novel method, the Compact Matrix Decomposition (CMD), to compute sparse low rank approximations. CMD dramatically reduces both the computation cost and the space requirements over existing decomposition methods (SVD, CUR). Using CMD as the key building block, we further propose procedures to efficiently construct and analyze dynamic graphs from realtime application data. We provide theoretical guarantee for our methods, and present results on two real, large datasets, one on network flow data (100GB trace of 22K hosts over one month) and one on DBLP (200MB over 25 years). We show that CMD is often an order of magnitude more efficient than the state of the art (SVD and CUR): it is over 10X faster, but requires less than 1/10 of the space, for the same reconstruction accuracy. Finally, we demonstrate how CMD is used for detecting anomalies and monitoring timeevolving graphs, in which it successfully detects wormlike hierarchical scanning patterns in real network data.
Stream Monitoring under the Time Warping Distance
"... Data stream processing has recently attracted an increasing amount of interest. The goal of this paper is to monitor numerical streams, and to find subsequences that are similar to a given query sequence, under the DTW (Dynamic Time Warping) distance. Applications include word spotting, sensor patte ..."
Abstract

Cited by 17 (2 self)
 Add to MetaCart
Data stream processing has recently attracted an increasing amount of interest. The goal of this paper is to monitor numerical streams, and to find subsequences that are similar to a given query sequence, under the DTW (Dynamic Time Warping) distance. Applications include word spotting, sensor pattern matching, and monitoring of biomedical signals (e.g., EKG, ECG), and monitoring of environmental (seismic and volcanic) signals. DTW is a very popular distance measure, permitting accelerations and decelerations, and it has been studied for finite, stored sequence sets. However, in many applications such as network analysis and sensor monitoring, massive amounts of data arrive continuously and it is infeasible to save all the historical data. We propose SPRING, a novel algorithm that can solve the problem. We provide a theoretical analysis and prove that SPRING does not sacrifice accuracy, while it requires constant space and time per timetick. These are dramatic improvements over the naive method. Our experiments on real and realistic data illustrate that SPRING does indeed detect the qualifying subsequences correctly and that it can offer dramatic improvements in speed (up to 650,000 times) over the naive implementation. 1
Windowbased Tensor Analysis on Highdimensional and Multiaspect Streams
"... Data stream values are often associated with multiple aspects. For example, each value from environmental sensors may have an associated type (e.g., temperature, humidity, etc) as well as location. Aside from timestamp, type and location are the two additional aspects. How to model such streams? How ..."
Abstract

Cited by 16 (3 self)
 Add to MetaCart
Data stream values are often associated with multiple aspects. For example, each value from environmental sensors may have an associated type (e.g., temperature, humidity, etc) as well as location. Aside from timestamp, type and location are the two additional aspects. How to model such streams? How to simultaneously find patterns within and across the multiple aspects? How to do it incrementally in a streaming fashion? In this paper, all these problems are addressed through a general data model, tensor streams, and an effective algorithmic framework, windowbased tensor analysis (WTA). Two variations of WTA, independentwindow tensor analysis (IW) and movingwindow tensor analysis (MW), are presented and evaluated extensively on real datasets. Finally, we illustrate one important application, MultiAspect Correlation Analysis (MACA), which uses WTA and we demonstrate its effectiveness on an environmental monitoring application. 1
Gamps: Compressing multi sensor data by grouping and amplitude scaling
 In: ACM SIGMOD. (2009
"... We consider the problem of collectively approximating a set of sensor signals using the least amount of space so that any individual signal can be efficiently reconstructed within a given maximum (L∞) error ε. The problem arises naturally in applications that need to collect large amounts of data fr ..."
Abstract

Cited by 15 (0 self)
 Add to MetaCart
We consider the problem of collectively approximating a set of sensor signals using the least amount of space so that any individual signal can be efficiently reconstructed within a given maximum (L∞) error ε. The problem arises naturally in applications that need to collect large amounts of data from multiple concurrent sources, such as sensors, servers and network routers, and archive them over a long period of time for offline data mining. We present GAMPS, a general framework that addresses this problem by combining several novel techniques. First, it dynamically groups multiple signals together so that signals within each group are correlated and can be maximally compressed jointly. Second, it appropriately scales the amplitudes of different signals within a group and compresses them within the maximum allowed reconstruction error bound. Our schemes are polynomial time O(α, β) approximation schemes, meaning that the maximum (L∞) error is at most αε and it uses at most β times the optimal memory. Finally, GAMPS maintains an index so that various queries can be issued directly on compressed data. Our experiments on several realworld sensor datasets show that GAMPS significantly reduces space without compromising the quality of search and query. Categories and Subject Descriptors
Sustainable Operation and Management of Data Center Chillers using Temporal Data Mining
"... Motivation: Data centers are a critical component of modern IT infrastructure but are also among the worst environmental offenders through their increasing energy usage and the resulting large carbon footprints. Efficient management of data centers, including power management, networking, and coolin ..."
Abstract

Cited by 14 (3 self)
 Add to MetaCart
Motivation: Data centers are a critical component of modern IT infrastructure but are also among the worst environmental offenders through their increasing energy usage and the resulting large carbon footprints. Efficient management of data centers, including power management, networking, and cooling infrastructure, is hence crucial to sustainability. In the absence of a “firstprinciples ” approach to manage these complex components and their interactions, datadriven approaches have become attractive and tenable. Results: We present a temporal data mining solution to model and optimize performance of data center chillers, a key component of the cooling infrastructure. It helps bridge raw, numeric, timeseries information from sensor streams toward higher level characterizations of chiller behavior, suitable for a data center engineer. To aid in this transduction, temporal data streams are first encoded into a symbolic representation, next runlength encoded segments are mined to form frequent motifs in time series, and finally these metrics are evaluated by their contributions to sustainability. A key innovation in our application is the ability to intersperse “don’t care ” transitions (e.g., transients) in continuousvalued time series data, an advantage we inherit by the application of frequent episode mining to symbolized representations of numeric time series. Our approach provides both qualitative and quantitative characterizations of the sensor streams to the data center engineer, to aid him in tuning chiller operating characteristics. This system is currently being prototyped for a data center managed by HP and experimental results from this application reveal the promise of our approach.
Fast Approximate Correlation for Massive Timeseries Data
"... We consider the problem of computing allpair correlations in a warehouse containing a large number (e.g., tens of thousands) of timeseries (or, signals). The problem arises in automatic discovery of patterns and anomalies in data intensive applications such as data center management, environmental ..."
Abstract

Cited by 14 (5 self)
 Add to MetaCart
We consider the problem of computing allpair correlations in a warehouse containing a large number (e.g., tens of thousands) of timeseries (or, signals). The problem arises in automatic discovery of patterns and anomalies in data intensive applications such as data center management, environmental monitoring, and scientific experiments. However, with existing techniques, solving the problem for a large stream warehouse is extremely expensive, due to the problem’s inherent quadratic I/O and CPU complexities. We propose novel algorithms, based on Discrete Fourier Transformation (DFT) and graph partitioning, to reduce the endtoend response time of an allpair correlation query. To minimize I/O cost, we partition a massive set of input signals into smaller batches such that caching the signals one batch at a time maximizes data reuse and minimizes disk I/O. To reduce CPU cost, we propose two approximation algorithms. Our first algorithm efficiently computes approximate correlation coefficients of similar signal pairs within a given error bound. The second algorithm efficiently identifies, without any false positives or negatives, all signal pairs with correlations above a given threshold. For many real applications, our approximate solutions are as useful as corresponding exact solutions, due to our strict error guarantees. However, compared to the stateoftheart exact algorithms, our algorithms are up to 17× faster for several real datasets.
Troubleshooting Chronic Conditions in Large IP Networks
, 2008
"... Chronic network conditions are caused by performance impairing events that occur intermittently over an extended period of time. Such conditions can cause repeated performance degradation to customers, and sometimes can even turn into serious hard failures. It is therefore critical to troubleshoot a ..."
Abstract

Cited by 14 (10 self)
 Add to MetaCart
Chronic network conditions are caused by performance impairing events that occur intermittently over an extended period of time. Such conditions can cause repeated performance degradation to customers, and sometimes can even turn into serious hard failures. It is therefore critical to troubleshoot and repair chronic network conditions in a timely fashion in order to ensure high reliability and performance in large IP networks. Today, troubleshooting chronic conditions is often performed manually, making it a tedious, timeconsuming and errorprone process. In this paper, we present NICE (Networkwide Information Correlation and Exploration), a novel infrastructure that enables the troubleshooting of chronic network conditions by detecting and analyzing statistical correlations across multiple data sources. NICE uses a novel circular permutation test to determine the statistical significance of correlation. It also allows flexible analysis at various spatial granularity (e.g., link, router, network level, etc.). We validate NICE using real measurement data collected at a tier1 ISP network. The results are quite positive. We then apply NICE to troubleshoot real network issues in the tier1 ISP network. In all three case studies conducted so far, NICE successfully uncovers previously unknown chronic network conditions, resulting in improved network operations.
Colibri: Fast Mining of Large Static and Dynamic Graphs
"... Lowrank approximations of the adjacency matrix of a graph are essential in finding patterns (such as communities) and detecting anomalies. Additionally, it is desirable to track the lowrank structure as the graph evolves over time, efficiently and within limited storage. Real graphs typically have ..."
Abstract

Cited by 13 (4 self)
 Add to MetaCart
Lowrank approximations of the adjacency matrix of a graph are essential in finding patterns (such as communities) and detecting anomalies. Additionally, it is desirable to track the lowrank structure as the graph evolves over time, efficiently and within limited storage. Real graphs typically have thousands or millions of nodes, but are usually very sparse. However, standard decompositions such as SVD do not preserve sparsity. This has led to the development of methods such as CUR and CMD, which seek a nonorthogonal basis by sampling the columns and/or rows of the sparse matrix. However, these approaches will typically produce overcomplete bases, which wastes both space and time. In this paper we propose the family of Colibri methods to deal with these challenges. Our version for static graphs, ColibriS, iteratively finds a nonredundant basis and we prove that it has no loss of accuracy compared to the best competitors (CUR and CMD), while achieving significant savings in space and time: on real data, ColibriS requires much less space and is orders of magnitude faster (in proportion to the square of the number of nonredundant columns). Additionally, we propose an efficient update algorithm for dynamic, timeevolving graphs, ColibriD. Our evaluation on a large, real network traffic dataset shows that ColibriD is over 100 times faster than the best published competitor (CMD).
Optimal multiscale patterns in time series streams
 In SIGMOD
, 2006
"... We introduce a method to discover optimal local patterns, which concisely describe the main trends in a time series. Our approach examines the time series at multiple time scales (i.e., window sizes) and efficiently discovers the key patterns in each. We also introduce a criterion to select the best ..."
Abstract

Cited by 11 (2 self)
 Add to MetaCart
We introduce a method to discover optimal local patterns, which concisely describe the main trends in a time series. Our approach examines the time series at multiple time scales (i.e., window sizes) and efficiently discovers the key patterns in each. We also introduce a criterion to select the best window sizes, which most concisely capture the key oscillatory as well as aperiodic trends. Our key insight lies in learning an optimal orthonormal transform from the data itself, as opposed to using a predetermined basis or approximating function (such as piecewise constant, shortwindow Fourier or wavelets), which essentially restricts us to a particular family of trends. Our method lifts that limitation, while lending itself to fast, incremental estimation in a streaming setting. Experimental evaluation shows that our method can capture meaningful patterns in a variety of settings. Our streaming approach requires order of magnitude less time and space, while still producing concise and informative patterns. 1.