Results 1–10 of 20
Stable Distributions, Pseudorandom Generators, Embeddings and Data Stream Computation
, 2000
Abstract

Cited by 266 (14 self)
In this paper we show several results obtained by combining the use of stable distributions with pseudorandom generators for bounded space. In particular: (i) we show how to maintain (using only O(log n / ε²) words of storage) a sketch C(p) of a point p ∈ l_1^n under dynamic updates of its coordinates, such that given sketches C(p) and C(q) one can estimate |p − q|_1 up to a factor of (1 + ε) with large probability; this solves the main open problem of [10]. (ii) we obtain another sketch function C′ which maps l_1^n into a normed space l_1^m (as opposed to C), such that m = m(n) is much smaller than n; to our knowledge this is the first dimensionality-reduction lemma for the l_1 norm. (iii) we give an explicit embedding of l_2^n into l_1^(n^O(log n)) with distortion (1 + 1/n^Θ(1)) and a nonconstructive embedding of l_2^n into l_1^O(n) with distortion (1 + ε) such that the embedding can be represented using only O(n log² n) bits (as opposed to at least...
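The first result above — estimating |p − q|_1 from small linear sketches — can be illustrated with Cauchy (1-stable) random projections. This is a toy sketch under assumed sizes and a fixed seed; the paper additionally derandomizes the projection matrix with a bounded-space PRG so it need not be stored explicitly:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 1000, 200                    # dimension and sketch size (m ~ 1/eps^2); made up
A = rng.standard_cauchy((m, n))     # entries are 1-stable (Cauchy) random variables

def sketch(p):
    # The sketch is linear, so coordinate updates translate to sketch updates.
    return A @ p

p = rng.integers(-5, 6, n).astype(float)
q = rng.integers(-5, 6, n).astype(float)

# For Cauchy a_i, sum_i a_i x_i is distributed as |x|_1 times a standard
# Cauchy, and the median of |Cauchy| is 1, so the coordinate-wise median
# of |C(p) - C(q)| estimates |p - q|_1.
est = np.median(np.abs(sketch(p) - sketch(q)))
true = np.abs(p - q).sum()
```

Linearity is what makes dynamic coordinate updates cheap: an update (i, δ) to p adds δ·A[:, i] to the sketch.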
One-Pass Wavelet Decompositions of Data Streams
 IEEE TKDE
, 2003
Abstract

Cited by 46 (2 self)
We present techniques for computing small-space representations of massive data streams. These are inspired by traditional wavelet-based approximations that consist of specific linear projections of the underlying data. We present general "sketch"-based methods for capturing various linear projections and use them to provide pointwise and range-sum estimation of data streams.
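The linear sketches referred to here can be maintained under streaming updates; a minimal AMS-style illustration (random ±1 projections, hypothetical sizes, dense matrix for clarity) shows a sketch updated one coordinate at a time recovering the squared L2 norm:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 512, 400
S = rng.choice([-1.0, 1.0], size=(m, n))   # random sign projections

sk = np.zeros(m)      # the sketch
x = np.zeros(n)       # the implicit data vector, kept here only for checking
for i, delta in [(3, 5.0), (10, -2.0), (3, 1.0), (77, 4.0)]:
    sk += delta * S[:, i]    # linearity: an update touches only column i
    x[i] += delta

# E[(Sx)_j^2] = |x|_2^2, so averaging squared sketch entries estimates it.
est = np.mean(sk ** 2)
true = (x ** 2).sum()
```

A practical implementation would derive the signs from hash functions rather than storing S.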
Local and global methods in data mining: basic techniques and open problems
 In Automata, Languages, and Programming
, 2002
Fast moment estimation in data streams in optimal space
 In Proceedings of the 43rd ACM Symposium on Theory of Computing (STOC)
, 2011
Abstract

Cited by 21 (6 self)
We give a space-optimal algorithm with update time O(log²(1/ε) log log(1/ε)) for (1 ± ε)-approximating the p-th frequency moment, 0 &lt; p &lt; 2, of a length-n vector updated in a data stream. This provides a nearly exponential improvement in the update time complexity over the previous space-optimal algorithm of [Kane-Nelson-Woodruff, SODA 2010], which had update time Ω(1/ε²).
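For reference, the quantity being approximated — the p-th frequency moment F_p = Σ_i f_i^p over item frequencies f_i — is straightforward to compute exactly when the whole stream fits in memory; the paper's contribution is approximating it in small space with fast updates. A toy exact computation on a made-up stream:

```python
from collections import Counter

stream = ["a", "b", "a", "c", "a", "b"]
freqs = Counter(stream)                 # a: 3, b: 2, c: 1
p = 1.5                                 # any 0 < p < 2
Fp = sum(f ** p for f in freqs.values())   # exact F_p, Theta(n) space
```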
A data streaming algorithm for estimating entropies of OD flows
 In IMC ’07: Proceedings of the 7th ACM SIGCOMM conference on Internet measurement
, 2007
Abstract

Cited by 21 (6 self)
Entropy has recently gained considerable significance as an important metric for network measurement. Previous research has shown its utility in clustering traffic and detecting traffic anomalies. While measuring the entropy of the traffic observed at a single point has already been studied, an interesting open problem is to measure the entropy of the traffic between every origin-destination pair. In this paper, we propose the first solution to this challenging problem. Our sketch builds upon and extends the L_p sketch of Indyk with significant additional innovations. We present calculations showing that our data streaming algorithm is feasible for high link speeds using commodity CPU/memory at a reasonable cost. Our algorithm is shown to be very accurate in practice via simulations, using traffic traces collected at a tier-1 ISP backbone link.
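The quantity being estimated is the empirical (Shannon) entropy of the flow-size distribution. Computing it exactly from per-flow counts, as below, requires storing every flow, which is exactly what the streaming algorithm avoids; the addresses here are made up for illustration:

```python
import math
from collections import Counter

# Made-up packet trace of (origin, destination) pairs.
packets = [("10.0.0.1", "10.0.0.9")] * 5 + [("10.0.0.2", "10.0.0.9")] * 3
counts = Counter(packets)
total = sum(counts.values())

# Empirical entropy H = -sum_i (f_i / s) * log2(f_i / s).
H = -sum((c / total) * math.log2(c / total) for c in counts.values())
```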
On the exact space complexity of sketching and streaming small norms
 In SODA
, 2010
Abstract

Cited by 18 (10 self)
We settle the one-pass space complexity of (1 ± ε)-approximating the L_p norm, for real p with 1 ≤ p ≤ 2, of a length-n vector updated in a length-m stream with updates to its coordinates. We assume the updates are integers in the range [−M, M]. In particular, we show the space required is Θ(ε⁻² log(mM) + log log n) bits. Our result also holds for 0 &lt; p &lt; 1; although L_p is not a norm in this case, it remains a well-defined function. Our upper bound improves upon previous algorithms of [Indyk, JACM '06] and [Li, SODA '08]. This improvement comes from showing an improved derandomization of the L_p sketch of Indyk by using k-wise independence for small k, as opposed to using the heavy hammer of a generic pseudorandom generator against space-bounded computation such as Nisan's PRG. Our lower bound improves upon previous work of [Alon-Matias-Szegedy, JCSS '99] and [Woodruff, SODA '04], and is based on showing a direct-sum property for the one-way communication of the gap-Hamming problem.
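The k-wise independent families mentioned here are commonly implemented as random polynomials of degree k−1 over a prime field: evaluating the polynomial at distinct points gives k-wise independent values. A minimal sketch of such a family (the Mersenne prime modulus and parameters are a conventional choice, not taken from the paper):

```python
import random

P = 2 ** 31 - 1   # Mersenne prime modulus; makes reduction mod P cheap

def kwise_hash(k, seed=0):
    """Draw one member of a k-wise independent hash family:
    a uniformly random polynomial of degree k-1 over GF(P)."""
    rng = random.Random(seed)
    coeffs = [rng.randrange(P) for _ in range(k)]

    def h(x):
        v = 0
        for c in coeffs:          # Horner evaluation of the polynomial mod P
            v = (v * x + c) % P
        return v

    return h

h4 = kwise_hash(4)   # a 4-wise independent hash function
```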
Fast Window Correlations Over Uncooperative Time Series
, 2005
Abstract

Cited by 15 (3 self)
Data arriving in time order (a data stream) arises in fields including physics, finance, medicine, and music, to name a few. Often the data comes from sensors (in physics and medicine, for example) whose data rates continue to improve dramatically as sensor technology improves. Further, the number of sensors is increasing, so correlating data between sensors becomes ever more critical in order to distill knowledge from the data. In many applications such as finance, recent correlations are of far more interest than long-term correlation, so correlation over sliding windows (windowed correlation) is the desired operation. Fast response is desirable in many applications (e.g., to aim a telescope at an activity of interest or to perform a stock trade). These three factors – data size, windowed correlation, and fast response – motivate this work. Previous work [10, 14] showed how to compute Pearson correlation using Fast Fourier Transforms and Wavelet transforms, but such techniques don't work for time series in which the energy is spread over many frequency components, thus resembling white noise. For such "uncooperative" time series, this paper shows how to combine several simple techniques – sketches (random projections), convolution, structured random vectors, grid structures, and combinatorial design – to achieve high-performance windowed Pearson correlation over a variety of data sets.
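As a baseline for what the paper accelerates, the naive windowed Pearson correlation recomputes the statistic from scratch at every window position (O(w) work per position, per pair of series); the function name and sizes below are illustrative:

```python
import numpy as np

def windowed_corr(x, y, w):
    """Naive sliding-window Pearson correlation of two series.
    The sketch-based methods in the paper aim to beat this when
    correlating many series pairs at once."""
    return np.array([np.corrcoef(x[i:i + w], y[i:i + w])[0, 1]
                     for i in range(len(x) - w + 1)])
```

For two perfectly linearly related series every window yields a correlation of 1.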
Spectral clustering in telephone call graphs
 In WebKDD/SNAKDD Workshop 2007 in conjunction with KDD
, 2007
Abstract

Cited by 13 (2 self)
We evaluate various heuristics for hierarchical spectral clustering in large telephone call graphs. Spectral clustering without additional heuristics often produces very uneven cluster sizes or low-quality clusters that may consist of several disconnected components, a fact that appears to be common for several data sources but, to our knowledge, not described in the literature. Divide-and-Merge, a recently described post-filtering procedure, may be used to eliminate bad-quality branches in a binary tree hierarchy. We propose an alternate solution that enables k-way cuts in each step by immediately filtering unbalanced or low-quality clusters before splitting them further. Our experiments are performed on graphs with various weightings and normalizations built from call detail records. We investigate a period of eight months covering more than two million Hungarian landline telephone users. We measure clustering quality both by cluster ratio and by the geographic homogeneity of the clusters obtained from telephone location data. Although Divide-and-Merge optimizes its clusters for cluster ratio, our method produces clusters of similar ratio much faster; furthermore, we obtain geographically much more homogeneous clusters, with the size distribution of our clusters resembling that of the settlement structure.
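The spectral step underlying all of these heuristics is the classical 2-way cut from the sign pattern of the Fiedler vector (the eigenvector of the second-smallest eigenvalue of the graph Laplacian). A minimal sketch on a toy graph — two triangles joined by a bridge — with the paper's k-way cuts and quality filtering left out:

```python
import numpy as np

# Toy graph: triangles {0, 1, 2} and {3, 4, 5} joined by the edge (2, 3).
A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1.0

L = np.diag(A.sum(axis=1)) - A      # unnormalized graph Laplacian
vals, vecs = np.linalg.eigh(L)      # eigenvalues in ascending order
fiedler = vecs[:, 1]                # eigenvector of the 2nd-smallest eigenvalue
labels = (fiedler > 0).astype(int)  # sign pattern gives the 2-way cut
```

On this graph the sign split recovers the two triangles, cutting only the bridge edge.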
XML stream processing using treeedit distance embeddings
 ACM Trans. on Database Systems
, 2005
Abstract

Cited by 11 (0 self)
We propose the first known solution to the problem of correlating, in small space, continuous streams of XML data through approximate (structure and content) matching, as defined by a general tree-edit distance metric. The key element of our solution is a novel algorithm for obliviously embedding tree-edit distance metrics into an L1 vector space while guaranteeing a (worst-case) upper bound of O(log² n log* n) on the distance distortion between any data trees with at most n nodes. We demonstrate how our embedding algorithm can be applied in conjunction with known random sketching techniques to (1) build a compact synopsis of a massive, streaming XML data tree that can be used as a concise surrogate for the full tree in approximate tree-edit distance computations; and (2) approximate the result of tree-edit-distance similarity joins over continuous XML document streams. Experimental results from an empirical study with both synthetic and real-life XML data trees validate our approach, demonstrating that the average-case behavior of our embedding techniques is much better than what would be predicted from our theoretical worst-case distortion bounds. To the best of our knowledge, these are the first algorithmic results on low-distortion embeddings for tree-edit distance metrics, and on correlating (e.g., through similarity joins) XML data in the streaming model.
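This is not the paper's embedding, but a toy illustration of the general shape of the approach — mapping trees into an L1 space so vector distances say something about edit distance. Here the map is just the node-label histogram: each relabel, insert, or delete changes the histogram's L1 norm by at most 2, so half the L1 gap lower-bounds tree-edit distance (a far cruder guarantee than the paper's distortion bound):

```python
from collections import Counter

def l1_histogram_gap(labels1, labels2):
    """L1 distance between node-label histograms of two trees.
    Each single edit operation changes this quantity by at most 2,
    so tree-edit distance >= l1_histogram_gap(...) / 2."""
    h1, h2 = Counter(labels1), Counter(labels2)
    return sum(abs(h1[k] - h2[k]) for k in set(h1) | set(h2))
```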
Managing Massive Time Series Streams with Multi-Scale Compressed Trickles
Abstract

Cited by 11 (5 self)
We present Cypress, a novel framework to archive and query massive time series streams such as those generated by sensor networks, data centers, and scientific computing. Cypress applies multi-scale analysis to decompose time series and to obtain sparse representations in various domains (e.g., the frequency domain and the time domain). Relying on this sparsity, the time series streams can be archived with reduced storage space. We then show that many statistical queries, such as trend, histogram, and correlation queries, can be answered directly from the compressed data rather than from reconstructed raw data. Our evaluation with server utilization data collected from real data centers shows significant benefits of our framework.
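The sparse-representation idea can be illustrated with a top-k Fourier approximation: keep only the k largest-magnitude frequency coefficients and reconstruct from them. For a signal dominated by a few frequencies the relative error is small; the signal and sizes below are made up and this is only a sketch of the principle, not Cypress itself:

```python
import numpy as np

# Made-up signal: one dominant frequency plus small noise.
rng = np.random.default_rng(2)
t = np.arange(256)
series = 10 * np.sin(2 * np.pi * t / 64) + 0.1 * rng.standard_normal(256)

F = np.fft.rfft(series)
k = 4
keep = np.argsort(np.abs(F))[-k:]   # indices of the k largest coefficients
F_sparse = np.zeros_like(F)
F_sparse[keep] = F[keep]            # archive only these k coefficients
recon = np.fft.irfft(F_sparse, n=256)

# Relative reconstruction error from the compressed representation.
rel_err = np.linalg.norm(series - recon) / np.linalg.norm(series)
```

Statistical queries such as trends can then be answered from the k stored coefficients instead of the raw samples.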