On the exact space complexity of sketching and streaming small norms
- In SODA, 2010
- Cited by 10 (5 self)
We settle the 1-pass space complexity of (1 ± ε)-approximating the Lp norm, for real p with 1 ≤ p ≤ 2, of a length-n vector updated in a length-m stream with updates to its coordinates. We assume the updates are integers in the range [−M, M]. In particular, we show the space required is $\Theta(\varepsilon^{-2} \log(mM) + \log\log n)$ bits. Our result also holds for 0 < p < 1; although Lp is not a norm in this case, it remains a well-defined function. Our upper bound improves upon previous algorithms of [Indyk, JACM ’06] and [Li, SODA ’08]. This improvement comes from showing an improved derandomization of the Lp sketch of Indyk by using k-wise independence for small k, as opposed to using the heavy hammer of a generic pseudorandom generator against space-bounded computation such as Nisan’s PRG. Our lower bound improves upon previous work of [Alon-Matias-Szegedy, JCSS ’99] and [Woodruff, SODA ’04], and is based on showing a direct sum property for the 1-way communication of the gap-Hamming problem.
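For orientation, the kind of linear sketch this abstract refers to can be illustrated for the p = 1 case. The following is a minimal sketch under stated assumptions, not the paper's algorithm: it stores a fully random Cauchy matrix explicitly, whereas the abstract's point is that limited (k-wise) independence is enough. The class name `CauchyL1Sketch` and its parameters are hypothetical.

```python
import math
import random
import statistics


class CauchyL1Sketch:
    """Linear sketch y = A x with i.i.d. standard Cauchy entries (p = 1 case)."""

    def __init__(self, n, rows, seed=0):
        rng = random.Random(seed)
        # Assumption: entries are fully independent and stored explicitly.
        # This is NOT space-efficient; it only illustrates the estimator.
        self.matrix = [
            [math.tan(math.pi * (rng.random() - 0.5)) for _ in range(n)]
            for _ in range(rows)
        ]
        self.y = [0.0] * rows

    def update(self, i, delta):
        """Process the stream update x[i] += delta (delta may be negative)."""
        for j, row in enumerate(self.matrix):
            self.y[j] += delta * row[i]

    def estimate(self):
        """Median of |y_j| estimates ||x||_1, since the median of |Cauchy| is 1."""
        return statistics.median(abs(v) for v in self.y)
```

Because the sketch is linear, insertions and deletions are handled the same way; the number of rows plays the role of the ε^{-2} factor in the space bound.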
The data stream space complexity of cascaded norms
- In FOCS, 2009
- Cited by 9 (6 self)
We consider the problem of estimating cascaded aggregates over a matrix presented as a sequence of updates in a data stream. A cascaded aggregate P ◦ Q is defined by evaluating aggregate Q repeatedly over each row of the matrix, and then evaluating aggregate P over the resulting vector of values. This problem was introduced by Cormode and Muthukrishnan, PODS, 2005 [CM]. We analyze the space complexity of estimating cascaded norms on an n × d matrix to within a small relative error. Let Lp denote the p-th norm, where p is a non-negative integer. We abbreviate the cascaded norm Lk ◦ Lp by Lk,p. (1) For any constant k ≥ p ≥ 2, we obtain a 1-pass $\tilde{O}(n^{1-2/k} d^{1-2/p})$-space algorithm for estimating Lk,p. This is optimal up to polylogarithmic factors and resolves an open question of [CM] regarding the space complexity of L4,2. We also obtain 1-pass space-optimal algorithms for estimating L∞,k and Lk,∞. (2) We prove a space lower bound of $\Omega(n^{1-1/k})$ on estimating Lk,0 and Lk,1, resolving an open question due to Indyk, IITK Data Streams Workshop (Problem 8), 2006. We also resolve two more questions of [CM] concerning Lk,2 estimation and block heavy hitter problems. Ganguly, Bansal and Dube (FAW, 2008) claimed an Õ(1)-space algorithm for estimating Lk,p for any k, p ∈ [0,2]. Our lower bounds show this claim is incorrect.
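The definition of a cascaded norm given above can be made concrete with an exact, offline computation. This is only an illustration of the quantity being estimated, not a streaming algorithm; the function names are hypothetical, and the code assumes p, k > 0 (with `math.inf` for the max norm), so the L0 case is not covered.

```python
import math


def lp_norm(values, p):
    """L_p norm for p > 0; p = math.inf gives the maximum absolute value."""
    if p == math.inf:
        return max((abs(v) for v in values), default=0.0)
    return sum(abs(v) ** p for v in values) ** (1.0 / p)


def cascaded_norm(matrix, k, p):
    """Exact L_{k,p}: apply L_p to each row, then L_k to the row norms."""
    row_norms = [lp_norm(row, p) for row in matrix]
    return lp_norm(row_norms, k)
```

For example, on the matrix with rows (3, 4) and (6, 8), the row L2 norms are 5 and 10, so L1,2 is 15 and L∞,2 is 10.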
1-Pass Relative-Error Lp-Sampling with Applications
- Cited by 5 (1 self)
For any p ∈ [0, 2], we give a 1-pass $\mathrm{poly}(\varepsilon^{-1} \log n)$-space algorithm which, given a data stream of length m with insertions and deletions of an n-dimensional vector a, with updates in the range {−M, −M + 1, …, M − 1, M}, outputs a sample of [n] = {1, 2, …, n} for which for all i the probability that i is returned is $(1 \pm \varepsilon)\frac{|a_i|^p}{F_p(a)} \pm n^{-C}$, where $a_i$ denotes the (possibly negative) value of coordinate i, $F_p(a) = \sum_{i=1}^{n} |a_i|^p = \|a\|_p^p$ denotes the p-th frequency moment (i.e., the p-th power of the Lp norm), and C > 0 is an arbitrarily large constant. Here we assume that n, m, and M are polynomially related. Our generic sampling framework improves and unifies algorithms for several communication and streaming problems, including cascaded norms, heavy hitters, and moment estimation. It also gives the first relative-error forward sampling algorithm in a data stream with deletions, answering an open question of Cormode et al.
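The target distribution in this guarantee can be computed exactly when the whole vector is available. The snippet below is a hedged illustration of that ideal distribution only, i.e. the probabilities |a_i|^p / F_p(a) that the streaming sampler approximates; the function name is hypothetical, and treating p = 0 as "count the nonzero coordinates" is an assumption consistent with the usual definition of F_0.

```python
def lp_sampling_distribution(a, p):
    """Return the ideal probabilities |a_i|^p / F_p(a) for each coordinate i."""
    if p == 0:
        # Assumption: F_0 counts nonzero coordinates, so zeros get weight 0.
        weights = [1.0 if v != 0 else 0.0 for v in a]
    else:
        weights = [abs(v) ** p for v in a]
    total = sum(weights)  # this is F_p(a)
    if total == 0:
        raise ValueError("F_p(a) is zero, so no coordinate can be sampled")
    return [w / total for w in weights]
```

For a = (1, −2, 0, 1) and p = 1, F_1(a) = 4, so coordinate 2 is returned with probability 1/2 and the zero coordinate is never returned.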
Bypassing UGC from some Optimal Geometric Inapproximability Results
- Electronic Colloquium on Computational Complexity, Report No. 177, 2010
- Cited by 1 (0 self)
The Unique Games conjecture (UGC) has emerged in recent years as the starting point for several optimal inapproximability results. While for none of these results a reverse reduction to Unique Games is known, the assumption of bijective projections in the Label Cover instance seems critical in these proofs. In this work we bypass the UGC assumption in inapproximability results for two geometric problems, obtaining a tight NP-hardness result in each case. The first problem, known as the Lp Subspace Approximation, is a generalization of the classic least squares regression problem. Here, the input consists of a set of points $S = \{a_1, \ldots, a_m\} \subseteq \mathbb{R}^n$ and a parameter k (possibly depending on n). The goal is to find a subspace H of $\mathbb{R}^n$ of dimension k that minimizes the sum of the p-th powers of the distances to the points. For p = 2, k = n − 1, this reduces to the least squares regression problem, while for p = ∞, k = 0 it reduces to the problem of finding a ball of minimum radius enclosing all the points. We show that for any fixed p (2 < p < ∞) it is NP-hard to approximate this problem to within a factor of γp − ε for constant ε > 0, where γp is the pth moment of a standard Gaussian variable. This matches the factor γp approximation algorithm obtained by Deshpande, Tulsiani and Vishnoi
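The objective of the Lp Subspace Approximation problem described above can be evaluated directly for a candidate subspace. The following is a minimal sketch under assumptions that are not from the abstract: the subspace is linear (passes through the origin), it is given by an orthonormal basis that the code does not verify, and p is finite. The function names are hypothetical.

```python
import math


def distance_to_subspace(point, basis):
    """Distance from point to span(basis); basis is assumed orthonormal."""
    squared_norm = sum(x * x for x in point)
    # Squared length of the projection onto the subspace.
    projected = sum(
        sum(x * u for x, u in zip(point, vec)) ** 2 for vec in basis
    )
    # Clamp at zero to guard against tiny negative rounding errors.
    return math.sqrt(max(squared_norm - projected, 0.0))


def lp_subspace_cost(points, basis, p):
    """Sum of the p-th powers of the distances from the points to the subspace."""
    return sum(distance_to_subspace(a, basis) ** p for a in points)
```

With points (1, 2) and (3, −1) and the subspace spanned by (1, 0), the distances are 2 and 1, so the cost for p = 3 is 9.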
A Near-Linear Algorithm for Projective Clustering Integer Points
We consider the problem of projective clustering in Euclidean spaces of non-fixed dimension. Here, we are given a set P of n points in $\mathbb{R}^m$ and integers j ≥ 1, k ≥ 0, and the goal is to find j k-subspaces so that the sum of the distances of each point in P to the nearest subspace is minimized. Observe that this is a shape fitting problem where we wish to find the best fit in the L1 sense. Here we will treat the number j of subspaces we want to fit and the dimension k of each of them as constants. We consider instances of projective clustering where the point coordinates are integers of magnitude polynomial in m and n. Our main result is a randomized algorithm that for any ε > 0 runs in time O(mn polylog(mn)) and outputs a solution that with high probability is within (1 + ε) of the optimal solution. To obtain this result, we show that the fixed dimensional version of the above projective clustering problem has a small coreset. We do that by observing that in a fairly general sense, shape fitting problems that have small coresets in the L∞ setting also have small coresets in the L1 setting, and then exploiting an existing construction for the L∞ setting. This observation seems to be quite useful for other shape fitting problems as well, as we demonstrate by constructing the first “regular” coreset for the circle fitting problem in the plane.
Sketching and Streaming High-Dimensional Vectors
- 2011
A sketch of a dataset is a small-space data structure supporting some prespecified set of queries (and possibly updates) while consuming space substantially sublinear in the space required to actually store all the data. Furthermore, it is often desirable, or required by the application, that the sketch itself be computable by a small-space algorithm given just one pass over the data, a so-called streaming algorithm. Sketching and streaming have found numerous applications in network traffic monitoring, data mining, trend detection, sensor networks, and databases. In this thesis, I describe several new contributions in the area of sketching and streaming algorithms.
• The first space-optimal streaming algorithm for the distinct elements problem. Our algorithm also achieves O(1) update and reporting times.
• A streaming algorithm for Hamming norm estimation in the turnstile model which achieves the best known space complexity.

