Results 1–10 of 18
One-Pass Wavelet Decompositions of Data Streams
 IEEE TKDE, 2003
Abstract

Cited by 44 (2 self)
We present techniques for computing small-space representations of massive data streams. These are inspired by traditional wavelet-based approximations that consist of specific linear projections of the underlying data. We present general "sketch"-based methods for capturing various linear projections and use them to provide point-wise and range-sum estimation of data streams.
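As a hedged illustration of the sketch idea in this abstract (not the authors' actual wavelet construction), the toy below maintains random ±1 linear projections of a streaming vector and answers point-wise and range-sum queries from them; the class name and parameters are hypothetical:

```python
import random
import statistics

class SignSketch:
    """AMS-style sketch: each row is a random +/-1 projection of the implicit
    data vector. Toy version: the paper uses small-seed, 4-wise independent
    sign generators so the rows need not be stored explicitly."""

    def __init__(self, n, rows=64, seed=1):
        rng = random.Random(seed)
        self.signs = [[rng.choice((-1, 1)) for _ in range(n)]
                      for _ in range(rows)]
        self.s = [0.0] * rows  # one running projection per row

    def update(self, i, delta):
        # stream update a_i += delta; sketches are linear projections
        for j, row in enumerate(self.signs):
            self.s[j] += delta * row[i]

    def point_estimate(self, i):
        # E[s_j * sign_j(i)] = a_i; averaging rows reduces the variance
        return statistics.mean(self.s[j] * row[i]
                               for j, row in enumerate(self.signs))

    def range_sum(self, lo, hi):
        # same trick against the projected indicator vector of [lo, hi)
        return statistics.mean(self.s[j] * sum(row[lo:hi])
                               for j, row in enumerate(self.signs))
```

Because the sketch is a linear function of the data vector, arbitrary increments and decrements to any coordinate can be folded in as they stream by.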
Local and Global Methods in Data Mining: Basic Techniques and Open Problems
 In ICALP 2002, 29th International Colloquium on Automata, Languages, and Programming, Malaga, 2002
Abstract

Cited by 23 (2 self)
Data mining has in recent years emerged as an interesting area on the boundary between algorithms, probabilistic modeling, statistics, and databases. Data mining research can be divided into global approaches, which try to model the whole data set, and local methods, which try to find useful patterns occurring in the data. We briefly discuss some simple local and global techniques, review two attempts at combining the approaches, and list open problems with an algorithmic flavor.
A data streaming algorithm for estimating entropies of OD flows
 In IMC ’07: Proceedings of the 7th ACM SIGCOMM Conference on Internet Measurement, 2007
Abstract

Cited by 21 (6 self)
Entropy has recently gained considerable significance as an important metric for network measurement. Previous research has shown its utility in clustering traffic and detecting traffic anomalies. While measuring the entropy of the traffic observed at a single point has already been studied, an interesting open problem is to measure the entropy of the traffic between every origin-destination pair. In this paper, we propose the first solution to this challenging problem. Our sketch builds upon and extends the Lp sketch of Indyk with significant additional innovations. We present calculations showing that our data streaming algorithm is feasible for high link speeds using commodity CPU/memory at a reasonable cost. Our algorithm is shown to be very accurate in practice via simulations, using traffic traces collected at a tier-1 ISP backbone link.
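For context, the quantity being approximated is the empirical entropy of the per-flow packet counts; computing it exactly, as below, needs space linear in the number of distinct flows, which is what the paper's streaming sketch avoids (the function name is ours):

```python
import math
from collections import Counter

def empirical_entropy(flow_labels):
    """Entropy (in bits) of the empirical distribution of packet counts
    per flow -- the quantity the streaming algorithm approximates in
    small space. flow_labels: one label (e.g. an OD pair) per packet."""
    counts = Counter(flow_labels)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

A uniform spread over k flows gives log2(k) bits, while traffic concentrated on one flow gives zero, which is why entropy is a useful anomaly signal.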
Fast moment estimation in data streams in optimal space
 In Proceedings of the 43rd ACM Symposium on Theory of Computing (STOC), 2011
Abstract

Cited by 20 (6 self)
We give a space-optimal algorithm with update time O(log^2(1/ε) log log(1/ε)) for (1 ± ε)-approximating the p-th frequency moment, 0 < p < 2, of a length-n vector updated in a data stream. This provides a nearly exponential improvement in the update time complexity over the previous space-optimal algorithm of [Kane-Nelson-Woodruff, SODA 2010], which had update time Ω(1/ε^2).
On the exact space complexity of sketching and streaming small norms
 In SODA, 2010
Abstract

Cited by 18 (10 self)
We settle the one-pass space complexity of (1 ± ε)-approximating the Lp norm, for real p with 1 ≤ p ≤ 2, of a length-n vector updated in a length-m stream with updates to its coordinates. We assume the updates are integers in the range [−M, M]. In particular, we show the space required is Θ(ε^-2 log(mM) + log log(n)) bits. Our result also holds for 0 < p < 1; although Lp is not a norm in this case, it remains a well-defined function. Our upper bound improves upon previous algorithms of [Indyk, JACM ’06] and [Li, SODA ’08]. This improvement comes from showing an improved derandomization of the Lp sketch of Indyk by using k-wise independence for small k, as opposed to using the heavy hammer of a generic pseudorandom generator against space-bounded computation such as Nisan’s PRG. Our lower bound improves upon previous work of [Alon-Matias-Szegedy, JCSS ’99] and [Woodruff, SODA ’04], and is based on showing a direct sum property for the one-way communication of the gap-Hamming problem.
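A minimal sketch of the starting point, Indyk's 1-stable (Cauchy) sketch for p = 1, assuming fully random projections; the paper's actual contribution, derandomizing this with k-wise independence for small k, is not shown:

```python
import math
import random
import statistics

def l1_norm_sketch(updates, n, rows=201, seed=7):
    """Toy Indyk-style 1-stable sketch: project the implicit length-n
    vector onto rows of independent standard Cauchy variables; the median
    of the absolute projections estimates the L1 norm, since the median
    of |Cauchy| is 1. Fully random projections for illustration only."""
    rng = random.Random(seed)
    # a standard Cauchy variate is tan(pi * (U - 1/2)) for uniform U
    proj = [[math.tan(math.pi * (rng.random() - 0.5)) for _ in range(n)]
            for _ in range(rows)]
    s = [0.0] * rows
    for i, delta in updates:          # turnstile stream: v_i += delta
        for j in range(rows):
            s[j] += delta * proj[j][i]
    # each s_j is distributed as ||v||_1 times a standard Cauchy
    return statistics.median(abs(x) for x in s)
```

Storing the projection matrix explicitly defeats the purpose of a small-space sketch; in the real algorithm its entries are generated on the fly from a short random seed, which is exactly where the derandomization question arises.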
Fast window correlations over uncooperative time series
 In Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, 2005
Abstract

Cited by 15 (4 self)
Data arriving in time order (a data stream) arises in fields including physics, finance, medicine, and music, to name a few. Often the data comes from sensors (in physics and medicine, for example) whose data rates continue to improve dramatically as sensor technology improves. Further, the number of sensors is increasing, so correlating data between sensors becomes ever more critical in order to distill knowledge from the data. In many applications such as finance, recent correlations are of far more interest than long-term correlations, so correlation over sliding windows (windowed correlation) is the desired operation. Fast response is desirable in many applications (e.g., to aim a telescope at an activity of interest or to perform a stock trade). These three factors – data size, windowed correlation, and fast response – motivate this work. Previous work [10, 14] showed how to compute Pearson correlation using Fast Fourier Transforms and wavelet transforms, but such techniques don’t work for time series in which the energy is spread over many frequency components, thus resembling white noise. For such “uncooperative” time series, this paper shows how to combine several simple techniques – sketches (random projections), convolution, structured random vectors, grid structures, and combinatorial design – to achieve high-performance windowed Pearson correlation over a variety of data sets.
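The sketch (random projection) ingredient can be illustrated as follows, assuming Gaussian projections applied to mean-centered, unit-normalized windows; the structured random vectors, convolution, and grid structures that give the paper its speed are omitted from this toy:

```python
import math
import random

def _normalize(w):
    # zero mean, unit norm: Pearson correlation reduces to a dot product
    m = sum(w) / len(w)
    v = [x - m for x in w]
    nrm = math.sqrt(sum(x * x for x in v))
    return [x / nrm for x in v]

def sketched_correlation(w1, w2, rows=400, seed=3):
    """Estimate the windowed Pearson correlation of two equal-length
    windows from Gaussian random projections of their normalized forms.
    Hypothetical toy: only the sketching step of the paper's pipeline."""
    rng = random.Random(seed)
    n = len(w1)
    proj = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(rows)]
    a, b = _normalize(w1), _normalize(w2)
    sa = [sum(r * x for r, x in zip(row, a)) for row in proj]
    sb = [sum(r * x for r, x in zip(row, b)) for row in proj]
    # E[<sa, sb> / rows] = <a, b> = corr(w1, w2)
    return sum(x * y for x, y in zip(sa, sb)) / rows
```

The point of sketching here is that the short vectors sa and sb can be compared in time independent of the window length, which is what makes all-pairs windowed correlation tractable at high data rates.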
Managing Massive Time Series Streams with Multi-Scale Compressed Trickles
Abstract

Cited by 11 (5 self)
We present Cypress, a novel framework to archive and query massive time series streams such as those generated by sensor networks, data centers, and scientific computing. Cypress applies multi-scale analysis to decompose time series and to obtain sparse representations in various domains (e.g., the frequency domain and the time domain). Relying on this sparsity, the time series streams can be archived with reduced storage space. We then show that many statistical queries such as trends, histograms, and correlations can be answered directly from the compressed data rather than from reconstructed raw data. Our evaluation with server utilization data collected from real data centers shows the significant benefits of our framework.
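The multi-scale decomposition step can be sketched with a plain Haar wavelet transform plus top-k coefficient thresholding (a hypothetical stand-in for Cypress's actual pipeline):

```python
import math

def haar(signal):
    """Full Haar decomposition of a length-2^k signal: repeated pairwise
    averages and differences. Output ordering: [overall average
    coefficient, coarsest detail, ..., finest details]."""
    coeffs, cur = [], list(signal)
    while len(cur) > 1:
        avgs = [(cur[i] + cur[i + 1]) / math.sqrt(2)
                for i in range(0, len(cur), 2)]
        dets = [(cur[i] - cur[i + 1]) / math.sqrt(2)
                for i in range(0, len(cur), 2)]
        coeffs = dets + coeffs   # prepend, so finer details end up last
        cur = avgs
    return cur + coeffs

def ihaar(coeffs):
    """Inverse of haar(): reconstruct the signal from all coefficients."""
    cur, pos = coeffs[:1], 1
    while pos < len(coeffs):
        dets = coeffs[pos:pos + len(cur)]
        nxt = []
        for a, d in zip(cur, dets):
            nxt.extend(((a + d) / math.sqrt(2), (a - d) / math.sqrt(2)))
        cur, pos = nxt, pos + len(dets)
    return cur

def top_k(coeffs, k):
    """Keep only the k largest-magnitude coefficients -- the sparse
    representation that would be archived."""
    keep = sorted(range(len(coeffs)),
                  key=lambda i: abs(coeffs[i]), reverse=True)[:k]
    return {i: coeffs[i] for i in keep}
```

Simple statistical queries can then read the compressed form directly; for example, the first coefficient alone determines the series mean (a trend query), with no reconstruction needed.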
XML stream processing using tree-edit distance embeddings
 ACM Trans. on Database Systems, 2005
Abstract

Cited by 8 (0 self)
We propose the first known solution to the problem of correlating, in small space, continuous streams of XML data through approximate (structure and content) matching, as defined by a general tree-edit distance metric. The key element of our solution is a novel algorithm for obliviously embedding tree-edit distance metrics into an L1 vector space while guaranteeing a (worst-case) upper bound of O(log^2 n log* n) on the distance distortion between any data trees with at most n nodes. We demonstrate how our embedding algorithm can be applied in conjunction with known random sketching techniques to (1) build a compact synopsis of a massive, streaming XML data tree that can be used as a concise surrogate for the full tree in approximate tree-edit distance computations; and (2) approximate the result of tree-edit-distance similarity joins over continuous XML document streams. Experimental results from an empirical study with both synthetic and real-life XML data trees validate our approach, demonstrating that the average-case behavior of our embedding techniques is much better than what would be predicted from our theoretical worst-case distortion bounds. To the best of our knowledge, these are the first algorithmic results on low-distortion embeddings for tree-edit distance metrics, and on correlating (e.g., through similarity joins) XML data in the streaming model. Categories and Subject Descriptors: H.2.4 [Database Management]: Systems—Query processing; G.2.1 [Discrete Mathematics]: Combinatorics—Combinatorial algorithms
Spectral clustering in telephone call graphs
 In WebKDD/SNA-KDD Workshop 2007, in conjunction with KDD, 2007
Abstract

Cited by 8 (1 self)
We evaluate various heuristics for hierarchical spectral clustering in large telephone call graphs. Spectral clustering without additional heuristics often produces very uneven cluster sizes or low-quality clusters that may consist of several disconnected components, a fact that appears to be common for several data sources but, to our knowledge, is not described in the literature. Divide-and-Merge, a recently described post-filtering procedure, may be used to eliminate bad-quality branches in a binary tree hierarchy. We propose an alternative solution that enables k-way cuts in each step by immediately filtering unbalanced or low-quality clusters before splitting them further. Our experiments are performed on graphs with various weightings and normalizations built from call detail records. We investigate a period of eight months covering more than two million Hungarian landline telephone users. We measure clustering quality both by cluster ratio and by the geographic homogeneity of the clusters obtained from telephone location data. Although Divide-and-Merge optimizes its clusters for cluster ratio, our method produces clusters of similar ratio much faster; furthermore, it gives geographically much more homogeneous clusters, with the size distribution of our clusters resembling that of the settlement structure.
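One spectral split step plus a balance filter might look like the following toy (our own construction, not the authors' code), using deflated power iteration on the normalized adjacency of a small weighted graph:

```python
import math

def fiedler_split(adj, iters=500):
    """Spectral bipartition of a small weighted graph: sign pattern of the
    second eigenvector of the normalized adjacency N = D^{-1/2} A D^{-1/2},
    found by power iteration with the known top eigenvector deflated.
    Toy version of a single split in a hierarchical clustering."""
    n = len(adj)
    deg = [sum(row) for row in adj]
    N = [[adj[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
         for i in range(n)]
    # known top eigenvector of N (eigenvalue 1): D^{1/2} times all-ones
    top = [math.sqrt(d) for d in deg]
    nrm = math.sqrt(sum(x * x for x in top))
    top = [x / nrm for x in top]
    # deterministic pseudo-random starting vector
    v = [((i * 2654435761) % 97) / 97.0 - 0.5 for i in range(n)]
    for _ in range(iters):
        dot = sum(a * b for a, b in zip(v, top))
        v = [a - dot * b for a, b in zip(v, top)]   # deflate top direction
        # multiply by N + I (shift keeps the remaining spectrum nonnegative)
        v = [v[i] + sum(N[i][j] * v[j] for j in range(n)) for i in range(n)]
        nv = math.sqrt(sum(x * x for x in v))
        v = [x / nv for x in v]
    return [1 if x >= 0 else 0 for x in v]

def balanced(labels, min_frac=0.2):
    # the filtering heuristic: reject splits that are too lopsided
    frac = sum(labels) / len(labels)
    return min_frac <= frac <= 1 - min_frac
```

In the paper's k-way scheme, a split that fails a balance or quality check like `balanced` would be discarded before any further recursion, rather than cleaned up afterwards.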
Fast Manhattan sketches in data streams
 In Proceedings of the 29th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS), 2010
Abstract

Cited by 7 (3 self)
The ℓ1-distance, also known as the Manhattan or taxicab distance, between two vectors x, y in R^n is Σ_{i=1}^{n} |x_i − y_i|. Approximating this distance is a fundamental primitive on massive databases, with applications to clustering, nearest neighbor search, network monitoring, regression, sampling, and support vector machines. We give the first one-pass streaming algorithm for this problem in the turnstile model with O*(ε^-2) space and O*(1) update time. The O* notation hides polylogarithmic factors in ε, n, and the precision required to store vector entries. All previous algorithms either required Ω(ε^-3) space or Ω(ε^-2) update time and/or could not work in the turnstile model (i.e., support an arbitrary number of updates to each coordinate). Our bounds are optimal up to O*(1) factors.
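Because such sketches are linear, two vectors sketched with the same projection can simply be subtracted to estimate their ℓ1-distance. A toy Cauchy-sketch version of that idea (ours; it has none of the paper's O*(1)-update-time machinery):

```python
import math
import random
import statistics

def make_proj(n, rows, seed):
    rng = random.Random(seed)
    # standard Cauchy (1-stable) entries: tan(pi * (U - 1/2))
    return [[math.tan(math.pi * (rng.random() - 0.5)) for _ in range(n)]
            for _ in range(rows)]

class L1Sketch:
    """Linear Cauchy sketch of an implicit vector under turnstile updates.
    Linearity means sketch(x) - sketch(y) is a valid sketch of x - y, so
    ||x - y||_1 can be estimated from the two summaries alone."""

    def __init__(self, proj):
        self.proj = proj
        self.s = [0.0] * len(proj)

    def update(self, i, delta):            # turnstile update: v_i += delta
        for j, row in enumerate(self.proj):
            self.s[j] += delta * row[i]

    def l1_distance(self, other):
        # median(|Cauchy|) = 1, so median|s_x - s_y| estimates ||x - y||_1
        return statistics.median(abs(a - b)
                                 for a, b in zip(self.s, other.s))
```

The two sketches must be built from the same projection (same seed) for the subtraction to be meaningful; that shared randomness is standard in sketch-based distance estimation.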