Results 1 - 10
of
42
Less is more: Compact matrix decomposition for large sparse graphs
, 2007
"... Given a large sparse graph, how can we find patterns and anomalies? Several important applications can be modeled as large sparse graphs, e.g., network traffic monitoring, research citation network analysis, social network analysis, and regulatory networks in genes. Low rank decompositions, such as ..."
Abstract
-
Cited by 16 (2 self)
- Add to MetaCart
Given a large sparse graph, how can we find patterns and anomalies? Several important applications can be modeled as large sparse graphs, e.g., network traffic monitoring, research citation network analysis, social network analysis, and regulatory networks in genes. Low rank decompositions, such as SVD and CUR, are powerful techniques for revealing latent/hidden variables and associated patterns from high dimensional data. However, those methods often ignore the sparsity property of the graph, and hence usually incur too high memory and computational cost to be practical. We propose a novel method, the Compact Matrix Decomposition (CMD), to compute sparse low rank approximations. CMD dramatically reduces both the computation cost and the space requirements over existing decomposition methods (SVD, CUR). Using CMD as the key building block, we further propose procedures to efficiently construct and analyze dynamic graphs from real-time application data. We provide theoretical guarantee for our methods, and present results on two real, large datasets, one on network flow data (100GB trace of 22K hosts over one month) and one on DBLP (200MB over 25 years). We show that CMD is often an order of magnitude more efficient than the state of the art (SVD and CUR): it is over 10X faster, but requires less than 1/10 of the space, for the same reconstruction accuracy. Finally, we demonstrate how CMD is used for detecting anomalies and monitoring timeevolving graphs, in which it successfully detects worm-like hierarchical scanning patterns in real network data.
Window-based Tensor Analysis on High-dimensional and Multi-aspect Streams
"... Data stream values are often associated with multiple aspects. For example, each value from environmental sensors may have an associated type (e.g., temperature, humidity, etc) as well as location. Aside from timestamp, type and location are the two additional aspects. How to model such streams? How ..."
Abstract
-
Cited by 12 (3 self)
- Add to MetaCart
Data stream values are often associated with multiple aspects. For example, each value from environmental sensors may have an associated type (e.g., temperature, humidity, etc) as well as location. Aside from timestamp, type and location are the two additional aspects. How to model such streams? How to simultaneously find patterns within and across the multiple aspects? How to do it incrementally in a streaming fashion? In this paper, all these problems are addressed through a general data model, tensor streams, and an effective algorithmic framework, window-based tensor analysis (WTA). Two variations of WTA, independentwindow tensor analysis (IW) and moving-window tensor analysis (MW), are presented and evaluated extensively on real datasets. Finally, we illustrate one important application, Multi-Aspect Correlation Analysis (MACA), which uses WTA and we demonstrate its effectiveness on an environmental monitoring application. 1
Gamps: Compressing multi sensor data by grouping and amplitude scaling
- In: ACM SIGMOD. (2009
"... We consider the problem of collectively approximating a set of sensor signals using the least amount of space so that any individual signal can be efficiently reconstructed within a given maximum (L∞) error ε. The problem arises naturally in applications that need to collect large amounts of data fr ..."
Abstract
-
Cited by 12 (0 self)
- Add to MetaCart
We consider the problem of collectively approximating a set of sensor signals using the least amount of space so that any individual signal can be efficiently reconstructed within a given maximum (L∞) error ε. The problem arises naturally in applications that need to collect large amounts of data from multiple concurrent sources, such as sensors, servers and network routers, and archive them over a long period of time for offline data mining. We present GAMPS, a general framework that addresses this problem by combining several novel techniques. First, it dynamically groups multiple signals together so that signals within each group are correlated and can be maximally compressed jointly. Second, it appropriately scales the amplitudes of different signals within a group and compresses them within the maximum allowed reconstruction error bound. Our schemes are polynomial time O(α, β) approximation schemes, meaning that the maximum (L∞) error is at most αε and it uses at most β times the optimal memory. Finally, GAMPS maintains an index so that various queries can be issued directly on compressed data. Our experiments on several real-world sensor datasets show that GAMPS significantly reduces space without compromising the quality of search and query. Categories and Subject Descriptors
Colibri: Fast Mining of Large Static and Dynamic Graphs
"... Low-rank approximations of the adjacency matrix of a graph are essential in finding patterns (such as communities) and detecting anomalies. Additionally, it is desirable to track the low-rank structure as the graph evolves over time, efficiently and within limited storage. Real graphs typically have ..."
Abstract
-
Cited by 10 (4 self)
- Add to MetaCart
Low-rank approximations of the adjacency matrix of a graph are essential in finding patterns (such as communities) and detecting anomalies. Additionally, it is desirable to track the low-rank structure as the graph evolves over time, efficiently and within limited storage. Real graphs typically have thousands or millions of nodes, but are usually very sparse. However, standard decompositions such as SVD do not preserve sparsity. This has led to the development of methods such as CUR and CMD, which seek a nonorthogonal basis by sampling the columns and/or rows of the sparse matrix. However, these approaches will typically produce overcomplete bases, which wastes both space and time. In this paper we propose the family of Colibri methods to deal with these challenges. Our version for static graphs, Colibri-S, iteratively finds a nonredundant basis and we prove that it has no loss of accuracy compared to the best competitors (CUR and CMD), while achieving significant savings in space and time: on real data, Colibri-S requires much less space and is orders of magnitude faster (in proportion to the square of the number of non-redundant columns). Additionally, we propose an efficient update algorithm for dynamic, time-evolving graphs, Colibri-D. Our evaluation on a large, real network traffic dataset shows that Colibri-D is over 100 times faster than the best published competitor (CMD).
Troubleshooting Chronic Conditions in Large IP Networks
, 2008
"... Chronic network conditions are caused by performance impairing events that occur intermittently over an extended period of time. Such conditions can cause repeated performance degradation to customers, and sometimes can even turn into serious hard failures. It is therefore critical to troubleshoot a ..."
Abstract
-
Cited by 9 (7 self)
- Add to MetaCart
Chronic network conditions are caused by performance impairing events that occur intermittently over an extended period of time. Such conditions can cause repeated performance degradation to customers, and sometimes can even turn into serious hard failures. It is therefore critical to troubleshoot and repair chronic network conditions in a timely fashion in order to ensure high reliability and performance in large IP networks. Today, troubleshooting chronic conditions is often performed manually, making it a tedious, timeconsuming and error-prone process. In this paper, we present NICE (Network-wide Information Correlation and Exploration), a novel infrastructure that enables the troubleshooting of chronic network conditions by detecting and analyzing statistical correlations across multiple data sources. NICE uses a novel circular permutation test to determine the statistical significance of correlation. It also allows flexible analysis at various spatial granularity (e.g., link, router, network level, etc.). We validate NICE using real measurement data collected at a tier-1 ISP network. The results are quite positive. We then apply NICE to troubleshoot real network issues in the tier-1 ISP network. In all three case studies conducted so far, NICE successfully uncovers previously unknown chronic network conditions, resulting in improved network operations.
Optimal multi-scale patterns in time series streams
- In SIGMOD
, 2006
"... We introduce a method to discover optimal local patterns, which concisely describe the main trends in a time series. Our approach examines the time series at multiple time scales (i.e., window sizes) and efficiently discovers the key patterns in each. We also introduce a criterion to select the best ..."
Abstract
-
Cited by 8 (2 self)
- Add to MetaCart
We introduce a method to discover optimal local patterns, which concisely describe the main trends in a time series. Our approach examines the time series at multiple time scales (i.e., window sizes) and efficiently discovers the key patterns in each. We also introduce a criterion to select the best window sizes, which most concisely capture the key oscillatory as well as aperiodic trends. Our key insight lies in learning an optimal orthonormal transform from the data itself, as opposed to using a predetermined basis or approximating function (such as piecewise constant, shortwindow Fourier or wavelets), which essentially restricts us to a particular family of trends. Our method lifts that limitation, while lending itself to fast, incremental estimation in a streaming setting. Experimental evaluation shows that our method can capture meaningful patterns in a variety of settings. Our streaming approach requires order of magnitude less time and space, while still producing concise and informative patterns. 1.
Stream Monitoring under the Time Warping Distance
"... Data stream processing has recently attracted an increasing amount of interest. The goal of this paper is to monitor numerical streams, and to find subsequences that are similar to a given query sequence, under the DTW (Dynamic Time Warping) distance. Applications include word spotting, sensor patte ..."
Abstract
-
Cited by 8 (0 self)
- Add to MetaCart
Data stream processing has recently attracted an increasing amount of interest. The goal of this paper is to monitor numerical streams, and to find subsequences that are similar to a given query sequence, under the DTW (Dynamic Time Warping) distance. Applications include word spotting, sensor pattern matching, and monitoring of biomedical signals (e.g., EKG, ECG), and monitoring of environmental (seismic and volcanic) signals. DTW is a very popular distance measure, permitting accelerations and decelerations, and it has been studied for finite, stored sequence sets. However, in many applications such as network analysis and sensor monitoring, massive amounts of data arrive continuously and it is infeasible to save all the historical data. We propose SPRING, a novel algorithm that can solve the problem. We provide a theoretical analysis and prove that SPRING does not sacrifice accuracy, while it requires constant space and time per time-tick. These are dramatic improvements over the naive method. Our experiments on real and realistic data illustrate that SPRING does indeed detect the qualifying subsequences correctly and that it can offer dramatic improvements in speed (up to 650,000 times) over the naive implementation. 1
Managing Massive Time Series Streams with Multi-Scale Compressed Trickles ABSTRACT
"... We present Cypress, a novel framework to archive and query massive time series streams such as those generated by sensor networks, data centers, and scientific computing. Cypress applies multi-scale analysis to decompose time series and to obtain sparse representations in various domains (e.g. frequ ..."
Abstract
-
Cited by 8 (4 self)
- Add to MetaCart
We present Cypress, a novel framework to archive and query massive time series streams such as those generated by sensor networks, data centers, and scientific computing. Cypress applies multi-scale analysis to decompose time series and to obtain sparse representations in various domains (e.g. frequency domain and time domain). Relying on the sparsity, the time series streams can be archived with reduced storage space. We then show that many statistical queries such as trend, histogram and correlations can be answered directly from compressed data rather than from reconstructed raw data. Our evaluation with server utilization data collected from real data centers shows significant benefit of our framework. 1.
Spade: On shape-based pattern detection in streaming time series
- in ICDE, 2007
"... Monitoring predefined patterns in streaming time series is useful to applications such as trend-related analysis, sensor networks and video surveillance. Most current studies on such monitoring employ Euclidean distance to calculate the similarities between given query patterns and subsequences of s ..."
Abstract
-
Cited by 7 (2 self)
- Add to MetaCart
Monitoring predefined patterns in streaming time series is useful to applications such as trend-related analysis, sensor networks and video surveillance. Most current studies on such monitoring employ Euclidean distance to calculate the similarities between given query patterns and subsequences of streaming time series. Euclidean distance has been shown to be ineffective in measuring distances of time series in which shifting and scaling usually exist. Consequently, warping distances such as dynamic time warping (DTW), longest common subsequence (LCSS), have been proposed to handle warps in temporal dimension. However, they are inadequate in handling shifting and scaling in amplitude dimension. Moreover, they have been designed mainly for full sequence matching, whereas in online monitoring applications, we typically have no knowledge on the positions and lengths of possible matching subsequences. In this paper, we first discuss the weaknesses of existing warping distances on detecting patterns from streaming time series. We then propose a novel warping distance, which we name Spatial Assembling Distance (SpADe), that is able to handle shifting and scaling in both temporal and amplitude dimensions. We further propose an efficient approach for continuous pattern detection using SpADe, that is fundamental for subsequence matching on streaming data. Finally, our experimental results show that SpADe is effective and efficient for continuous pattern detection in streaming time series. 1
Distributed Pattern Discovery in Multiple Streams
- In Proc. of PAKDD
, 2006
"... and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation, or other funding parties. ..."
Abstract
-
Cited by 7 (3 self)
- Add to MetaCart
and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation, or other funding parties.

