Results 1-10 of 14
Fast moment estimation in data streams in optimal space
In Proceedings of the 43rd ACM Symposium on Theory of Computing (STOC), 2011
Cited by 20 (6 self)
We give a space-optimal algorithm with update time $O(\log^2(1/\varepsilon) \log\log(1/\varepsilon))$ for $(1 \pm \varepsilon)$-approximating the $p$th frequency moment, $0 < p < 2$, of a length-$n$ vector updated in a data stream. This provides a nearly exponential improvement in the update time complexity over the previous space-optimal algorithm of [Kane-Nelson-Woodruff, SODA 2010], which had update time $\Omega(1/\varepsilon^2)$.
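For reference, the quantity being approximated can be computed exactly when space is not a concern. The sketch below is a brute-force baseline that stores the whole vector, not the algorithm from the paper. The function name `frequency_moment` and the `(index, delta)` update format are assumptions made for illustration.

```python
from collections import defaultdict

def frequency_moment(updates, p):
    """Exact F_p = sum_i |x_i|^p for the vector x built from (index, delta) updates."""
    x = defaultdict(float)
    for index, delta in updates:
        x[index] += delta  # apply the update to a single coordinate
    # Skip zero coordinates so they contribute nothing for every p in (0, 2).
    return sum(abs(value) ** p for value in x.values() if value != 0)

# Hypothetical stream; the final vector is x = (2, 4, 1).
stream = [(0, 3), (1, 4), (0, -1), (2, 1)]
f1 = frequency_moment(stream, 1)
f_half = frequency_moment(stream, 0.5)
```

This baseline uses space linear in the number of distinct coordinates, which is exactly the cost a streaming algorithm is meant to avoid.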
The data stream space complexity of cascaded norms
In FOCS, 2009
Cited by 11 (7 self)
We consider the problem of estimating cascaded aggregates over a matrix presented as a sequence of updates in a data stream. A cascaded aggregate $P \circ Q$ is defined by evaluating aggregate $Q$ repeatedly over each row of the matrix, and then evaluating aggregate $P$ over the resulting vector of values. This problem was introduced by Cormode and Muthukrishnan, PODS, 2005 [CM]. We analyze the space complexity of estimating cascaded norms on an $n \times d$ matrix to within a small relative error. Let $L_p$ denote the $p$th norm, where $p$ is a non-negative integer. We abbreviate the cascaded norm $L_k \circ L_p$ by $L_{k,p}$. (1) For any constant $k \ge p \ge 2$, we obtain a 1-pass $\tilde{O}(n^{1-2/k} d^{1-2/p})$-space algorithm for estimating $L_{k,p}$. This is optimal up to polylogarithmic factors and resolves an open question of [CM] regarding the space complexity of $L_{4,2}$. We also obtain 1-pass space-optimal algorithms for estimating $L_{\infty,k}$ and $L_{k,\infty}$. (2) We prove a space lower bound of $\Omega(n^{1-1/k})$ on estimating $L_{k,0}$ and $L_{k,1}$, resolving an open question due to Indyk, IITK Data Streams Workshop (Problem 8), 2006. We also resolve two more questions of [CM] concerning $L_{k,2}$ estimation and block heavy hitter problems. Ganguly, Bansal and Dube (FAW, 2008) claimed an $\tilde{O}(1)$-space algorithm for estimating $L_{k,p}$ for any $k, p \in [0,2]$. Our lower bounds show this claim is incorrect.
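The definition of a cascaded aggregate can be made concrete with an exact, offline computation. The sketch below assumes that $L_p$ is the usual norm $(\sum_i |a_i|^p)^{1/p}$ with $p \ge 1$ and that the whole matrix fits in memory. The names `lp_norm` and `cascaded_norm` are hypothetical, and this is not the streaming algorithm from the paper.

```python
def lp_norm(values, p):
    """Usual p-norm of a list of numbers, assuming p >= 1."""
    return sum(abs(v) ** p for v in values) ** (1.0 / p)

def cascaded_norm(matrix, k, p):
    """L_k applied to the vector of per-row L_p norms."""
    row_norms = [lp_norm(row, p) for row in matrix]  # inner aggregate, one value per row
    return lp_norm(row_norms, k)                     # outer aggregate over those values

# Hypothetical 3 x 2 matrix; its row L_2 norms are 5, 0 and 10.
A = [[3, 4], [0, 0], [6, 8]]
l42 = cascaded_norm(A, 4, 2)
l12 = cascaded_norm(A, 1, 2)
```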
Fast Approximation of Matrix Coherence and Statistical Leverage
Cited by 8 (1 self)
The statistical leverage scores of a matrix $A$ are the squared row-norms of the matrix containing its (top) left singular vectors, and the coherence is the largest leverage score. These quantities are of interest in recently popular problems such as matrix completion and Nyström-based low-rank matrix approximation, as well as in large-scale statistical data analysis applications more generally; moreover, they are of interest since they define the key structural nonuniformity that must be dealt with in developing fast randomized matrix algorithms. Our main result is a randomized algorithm that takes as input an arbitrary $n \times d$ matrix $A$, with $n \gg d$, and that returns as output relative-error approximations to all $n$ of the statistical leverage scores. The proposed algorithm runs (under assumptions on the precise values of $n$ and $d$) in $O(nd \log n)$ time, as opposed to the $O(nd^2)$ time required by the naïve algorithm that involves computing an orthogonal basis for the range of $A$. Our analysis may be viewed in terms of computing a relative-error approximation to an underconstrained least-squares approximation problem, or, relatedly, it may be viewed as an application of Johnson-Lindenstrauss type ideas. Several practically important extensions of our basic result are also described, including the approximation of so-called cross-leverage scores, the extension of these ideas to matrices with $n \approx d$, and the extension to streaming environments.
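The naïve approach mentioned in the abstract can be sketched as follows: build an orthonormal basis for the range of $A$ and take squared row norms. This is a minimal illustration of the exact $O(nd^2)$ baseline, not the paper's fast randomized algorithm. It assumes $A$ has full column rank and uses modified Gram-Schmidt as one possible way to obtain the basis; the function name `leverage_scores` is hypothetical.

```python
import math

def leverage_scores(A):
    """Exact leverage scores of an n x d matrix with full column rank."""
    n, d = len(A), len(A[0])
    columns = [[A[i][j] for i in range(n)] for j in range(d)]
    basis = []  # orthonormal vectors spanning the range of A
    for column in columns:
        w = list(column)
        for q in basis:
            coeff = sum(qi * wi for qi, wi in zip(q, w))
            w = [wi - coeff * qi for qi, wi in zip(q, w)]  # remove the component along q
        norm = math.sqrt(sum(wi * wi for wi in w))
        basis.append([wi / norm for wi in w])
    # Row i of the basis matrix has entries basis[0][i], ..., basis[d-1][i].
    return [sum(q[i] ** 2 for q in basis) for i in range(n)]

# Hypothetical 3 x 2 input.
scores = leverage_scores([[1, 1], [1, 2], [1, 3]])
coherence = max(scores)
```

The scores always sum to $d$, the number of columns, which gives a quick sanity check on any implementation.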
Fast Manhattan sketches in data streams
In Proceedings of the 29th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS), 2010
Cited by 7 (3 self)
The $\ell_1$-distance, also known as the Manhattan or taxicab distance, between two vectors $x, y$ in $\mathbb{R}^n$ is $\sum_{i=1}^n |x_i - y_i|$. Approximating this distance is a fundamental primitive on massive databases, with applications to clustering, nearest neighbor search, network monitoring, regression, sampling, and support vector machines. We give the first 1-pass streaming algorithm for this problem in the turnstile model with $O^*(\varepsilon^{-2})$ space and $O^*(1)$ update time. The $O^*$ notation hides polylogarithmic factors in $\varepsilon$, $n$, and the precision required to store vector entries. All previous algorithms either required $\Omega(\varepsilon^{-3})$ space or $\Omega(\varepsilon^{-2})$ update time and/or could not work in the turnstile model (i.e., support an arbitrary number of updates to each coordinate). Our bounds are optimal up to $O^*(1)$ factors.
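To illustrate what an $\ell_1$ sketch looks like, here is a minimal version of the classical Cauchy (1-stable) projection with a median estimator. This is a well-known earlier technique, not the algorithm of this paper: every update touches all rows of the sketch, which is the slow update time the paper improves on. The row count, seed, and function names are assumptions chosen for illustration, and the matrix is stored explicitly only for simplicity.

```python
import math
import random
import statistics

def cauchy_matrix(rows, n, seed=0):
    rng = random.Random(seed)
    # tan(pi * (u - 1/2)) with u uniform is a standard Cauchy sample.
    return [[math.tan(math.pi * (rng.random() - 0.5)) for _ in range(n)]
            for _ in range(rows)]

def sketch(S, x):
    """Linear sketch S x."""
    return [sum(s * v for s, v in zip(row, x)) for row in S]

def estimate_l1(sketch_x, sketch_y):
    # Each entry of S(x - y) is a Cauchy variable scaled by ||x - y||_1, and the
    # median of the absolute value of a standard Cauchy variable is 1.
    return statistics.median(abs(a - b) for a, b in zip(sketch_x, sketch_y))

# Hypothetical inputs.
x = [5, 0, -2, 7, 1]
y = [1, 3, -2, 0, 1]
S = cauchy_matrix(2001, len(x))
exact = sum(abs(a - b) for a, b in zip(x, y))
approx = estimate_l1(sketch(S, x), sketch(S, y))
```

Because the sketch is linear, the two vectors can be sketched separately and the sketches subtracted afterwards.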
Streaming Algorithms via Precision Sampling
Cited by 3 (0 self)
(STOC 2005) has inspired several recent advances in data-stream algorithms. We show that a number of these results follow easily from the application of a single probabilistic method called Precision Sampling. Using this method, we obtain simple data-stream algorithms that maintain a randomized sketch of an input vector $x = (x_1, x_2, \ldots, x_n)$, which is useful for the following applications:
• Estimating the $F_k$-moment of $x$, for $k > 2$.
• Estimating the $\ell_p$-norm of $x$, for $p \in [1, 2]$, with small update time.
• Estimating cascaded norms $\ell_p(\ell_q)$ for all $p, q > 0$.
• $\ell_1$ sampling, where the goal is to produce an element $i$ with probability (approximately) $|x_i| / \|x\|_1$. It extends to similarly defined $\ell_p$-sampling, for $p \in [1, 2]$.
For all these applications the algorithm is essentially the same: scale the vector $x$ entry-wise by a well-chosen random vector, and run a heavy-hitter estimation algorithm on the resulting vector. Our sketch is a linear function of $x$, thereby allowing general updates to the vector $x$.
Precision Sampling itself addresses the problem of estimating a sum $\sum_{i=1}^n a_i$ from weak estimates of each real $a_i \in [0, 1]$. More precisely, the estimator first chooses a desired precision $u_i \in (0, 1]$ for each $i \in [n]$, and then it receives an estimate of every $a_i$ within additive $u_i$. Its goal is to provide a good approximation to $\sum a_i$ while keeping a tab on the “approximation cost” $\sum_i (1/u_i)$. Here we refine previous work (Andoni, Krauthgamer, and Onak, FOCS 2010) which shows that as long as $\sum a_i = \Omega(1)$, a good multiplicative approximation can be achieved using total precision of only $O(n \log n)$.
Keywords: streaming, sampling, moments, cascaded norms
Periodicity and cyclic shifts via linear sketches
In APPROX-RANDOM, 2011
Cited by 3 (1 self)
We consider the problem of identifying periodic trends in data streams. We say a signal $a \in \mathbb{R}^n$ is $p$-periodic if $a_i = a_{i+p}$ for all $i \in [n - p]$. Recently, Ergün et al. [4] presented a one-pass, $O(\mathrm{polylog}\, n)$-space algorithm for identifying the smallest period of a signal. Their algorithm required $a$ to be presented in the time-series model, i.e., $a_i$ is the $i$th element in the stream. We present a more general linear sketch algorithm that has the advantages of being applicable to a) the turnstile stream model, where coordinates can be incremented/decremented in an arbitrary fashion and b) the parallel or distributed setting where the signal is distributed over multiple locations/machines. We also present sketches for $(1+\varepsilon)$-approximating the $\ell_2$ distance between $a$ and the nearest $p$-periodic signal for a given $p$. Our algorithm uses $O(\varepsilon^{-2}\, \mathrm{polylog}\, n)$ space, comparing favorably to an earlier time-series result that used $O(\varepsilon^{-5.5} \sqrt{p}\, \mathrm{polylog}\, n)$ space for estimating the Hamming distance to the nearest $p$-periodic signal. Our last periodicity result is an algorithm for estimating the periodicity of a sequence in the presence of noise. We conclude with a small-space algorithm for identifying when two signals are exact (or nearly) cyclic shifts of one another. Our algorithms are based on bilinear sketches [10] and combining Fourier transforms with stream processing techniques such as $\ell_p$ sampling and sketching [11, 13].
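The two quantities defined above can be computed exactly offline, which helps clarify what the sketches approximate. The code below is a brute-force baseline that keeps the whole signal in memory, not the linear sketch from the paper. It relies on the standard fact that the nearest $p$-periodic signal in $\ell_2$ replaces each entry by the mean of its residue class modulo $p$; the function names are hypothetical and $p$ is assumed to be at most the signal length.

```python
import math

def is_p_periodic(a, p):
    """True if a[i] == a[i + p] for every valid index i."""
    return all(a[i] == a[i + p] for i in range(len(a) - p))

def l2_distance_to_p_periodic(a, p):
    """Exact l2 distance from a to the nearest p-periodic signal."""
    total = 0.0
    for r in range(p):
        residue_class = a[r::p]  # positions r, r + p, r + 2p, ...
        mean = sum(residue_class) / len(residue_class)
        total += sum((v - mean) ** 2 for v in residue_class)
    return math.sqrt(total)

# Hypothetical signals.
periodic = [1, 2, 1, 2, 1, 2]
noisy = [1, 2, 3, 2]
d_periodic = l2_distance_to_p_periodic(periodic, 2)
d_noisy = l2_distance_to_p_periodic(noisy, 2)
```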
Periodicity in Streams
In RANDOM, 2010
Cited by 1 (0 self)
In this work we study sublinear space algorithms for detecting periodicity over data streams. A sequence of length $n$ is said to be periodic if it consists of repetitions of a block of length $p$ for some $p \le n/2$. In the first part of this paper, we give a 1-pass randomized streaming algorithm that uses $O(\log^2 n)$ space and reports the shortest period if the given stream is periodic. At the heart of this result is a 1-pass $O(\log n \log m)$ space streaming pattern matching algorithm. This algorithm uses similar ideas to Porat and Porat’s algorithm in FOCS 2009 but it does not need an offline preprocessing stage and is considerably simpler.
In the second part, we study distance to $p$-periodicity under the Hamming metric, where we estimate the minimum number of character substitutions needed to make a given sequence $p$-periodic. In streaming terminology, this problem can be described as computing the cascaded aggregate $L_1 \circ F_1^{\mathrm{res}(1)}$ over a matrix $A_{p \times d}$ given in column ordering. For this problem, we present a randomized streaming algorithm with approximation factor $2 + \varepsilon$ that takes $\tilde{O}(\varepsilon^{-2})$ space. We also show a $(1+\varepsilon)$ randomized streaming algorithm which uses $O(\varepsilon^{-5.5} p^{1/2})$ space.
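The Hamming distance to $p$-periodicity has a simple exact characterization that matches the cascaded-aggregate view: for each residue class modulo $p$ (one row of the matrix), every symbol except the most frequent one must be substituted. The code below is an offline baseline under that reading, not the streaming algorithm from the paper, and the function name is hypothetical.

```python
from collections import Counter

def hamming_distance_to_p_periodic(seq, p):
    """Minimum number of substitutions that make seq p-periodic."""
    distance = 0
    for r in range(p):
        row = seq[r::p]  # symbols at positions r, r + p, r + 2p, ...
        most_common_count = Counter(row).most_common(1)[0][1]
        distance += len(row) - most_common_count  # keep the majority symbol, change the rest
    return distance

# Hypothetical sequences.
d_one = hamming_distance_to_p_periodic("abcabdabc", 3)
d_zero = hamming_distance_to_p_periodic("abababab", 2)
```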
Sketching and Streaming High-Dimensional Vectors
2011
Cited by 1 (0 self)
A sketch of a dataset is a small-space data structure supporting some prespecified set of queries (and possibly updates) while consuming space substantially sublinear in the space required to actually store all the data. Furthermore, it is often desirable, or required by the application, that the sketch itself be computable by a small-space algorithm given just one pass over the data, a so-called streaming algorithm. Sketching and streaming have found numerous applications in network traffic monitoring, data mining, trend detection, sensor networks, and databases. In this thesis, I describe several new contributions in the area of sketching and streaming algorithms.
• The first space-optimal streaming algorithm for the distinct elements problem. Our algorithm also achieves $O(1)$ update and reporting times.
• A streaming algorithm for Hamming norm estimation in the turnstile model which achieves the best known space complexity.
Near-Optimal Private Approximation Protocols via a Black Box Transformation
Cited by 1 (0 self)
We show the following transformation: any two-party protocol for outputting a $(1 + \varepsilon)$-approximation to $f(x, y) = \sum_{j=1}^n g(x_j, y_j)$ with probability at least 2/3, for any nonnegative efficiently computable function $g$, can be transformed into a two-party private approximation protocol with only a polylogarithmic factor loss in communication, computation, and round complexity. In general it is insufficient to use secure function evaluation or fully homomorphic encryption on a standard, non-private protocol for approximating $f$. This is because the approximation may reveal information about $x$ and $y$ that does not follow from $f(x, y)$. Applying our transformation and variations of it, we obtain near-optimal private approximation protocols for a wide range of problems in the data stream literature for which previously nothing was known. We give near-optimal private approximation protocols for the $\ell_p$-distance for every $p \ge 0$, for the heavy hitters and importance sampling problems with respect to any $\ell_p$-norm, for the max-dominance and other dominant $\ell_p$-norms, for the distinct summation problem, for entropy, for cascaded frequency moments, for subspace approximation and block sampling, and for measuring independence of datasets. Using a result for data streams, we obtain private approximation protocols with polylogarithmic communication for every nondecreasing and symmetric function $g(x_j, y_j) = h(x_j - y_j)$ with at most quadratic growth. If the original (non-private) protocol is a simultaneous protocol, e.g., a sketching algorithm, then our only cryptographic assumption is efficient symmetric computationally-private information retrieval; otherwise it is fully homomorphic encryption. For all but one of these problems, the original protocol is a sketching algorithm. Our protocols generalize straightforwardly to more than two parties.
Tight Lower Bound for Linear Sketches of Moments
Cited by 1 (0 self)
The problem of estimating frequency moments of a data stream has attracted a lot of attention since the onset of streaming algorithms [AMS99]. While the space complexity for approximately computing the $p$th moment, for $p \in (0, 2]$, has been settled [KNW10], for $p > 2$ the exact complexity remains open. For $p > 2$ the current best algorithm uses $O(n^{1-2/p} \log n)$ words of space [AKO11, BO10], whereas the lower bound is of $\Omega(n^{1-2/p})$ [BJKS04]. In this paper, we show a tight lower bound of $\Omega(n^{1-2/p} \log n)$ words for the class of algorithms based on linear sketches, which store only a sketch $Ax$ of input vector $x$ and some (possibly randomized) matrix $A$. We note that all known algorithms for this problem are linear sketches.
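The defining property of a linear sketch, namely that the stored state always equals $Ax$ no matter how updates arrive, can be shown in a few lines. The sketch below is only an illustration of that property: the random sign matrix, its explicit storage, and the class name `LinearSketch` are assumptions, not details taken from the paper.

```python
import random

class LinearSketch:
    """Maintains y = A x under coordinate updates, for a fixed random matrix A."""

    def __init__(self, rows, n, seed=0):
        rng = random.Random(seed)
        # Random sign entries are an illustrative choice, not taken from the paper.
        self.A = [[rng.choice((-1, 1)) for _ in range(n)] for _ in range(rows)]
        self.y = [0] * rows

    def update(self, i, delta):
        # Adding delta to x_i adds delta times column i of A to the sketch.
        for r, row in enumerate(self.A):
            self.y[r] += row[i] * delta

def apply_matrix(A, x):
    return [sum(a * v for a, v in zip(row, x)) for row in A]

n = 6
sk = LinearSketch(rows=3, n=n)
x = [0] * n
for i, delta in [(0, 5), (3, -2), (0, -1), (5, 7)]:  # hypothetical update stream
    sk.update(i, delta)
    x[i] += delta  # reference copy of the vector, kept only to check the sketch
```

After any sequence of updates, `sk.y` equals `apply_matrix(sk.A, x)`, so increments and decrements are handled uniformly and sketches of different streams can be added coordinate-wise.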