Results 1–10 of 12
On the exact space complexity of sketching and streaming small norms
In SODA, 2010
Abstract

Cited by 18 (10 self)
We settle the 1-pass space complexity of (1 ± ε)-approximating the Lp norm, for real p with 1 ≤ p ≤ 2, of a length-n vector updated in a length-m stream with updates to its coordinates. We assume the updates are integers in the range [−M, M]. In particular, we show the space required is Θ(ε^−2 log(mM) + log log(n)) bits. Our result also holds for 0 < p < 1; although Lp is not a norm in this case, it remains a well-defined function. Our upper bound improves upon previous algorithms of [Indyk, JACM ’06] and [Li, SODA ’08]. This improvement comes from showing an improved derandomization of the Lp sketch of Indyk by using k-wise independence for small k, as opposed to using the heavy hammer of a generic pseudorandom generator against space-bounded computation such as Nisan’s PRG. Our lower bound improves upon previous work of [Alon–Matias–Szegedy, JCSS ’99] and [Woodruff, SODA ’04], and is based on showing a direct sum property for the 1-way communication of the gap-Hamming problem.
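The stable-distribution sketch being derandomized here can be illustrated concretely for p = 2, where the 2-stable distribution is Gaussian. The toy version below uses fully independent Gaussians (the paper's point is that k-wise independence for small k suffices), and the function name and parameters are ours:

```python
import random
import statistics

def l2_estimate(x, k=801, seed=1):
    """Estimate ||x||_2 via an Indyk-style stable sketch (p = 2 case).

    Each counter stores <g, x> for an i.i.d. standard Gaussian vector g.
    By 2-stability, <g, x> ~ N(0, ||x||_2^2), so the median of |counter|
    is about 0.6745 * ||x||_2 (0.6745 is the median of |N(0,1)|).
    """
    rng = random.Random(seed)
    counters = [sum(rng.gauss(0.0, 1.0) * xi for xi in x) for _ in range(k)]
    return statistics.median(abs(c) for c in counters) / 0.6745
```

For x = (3, 4) the estimate concentrates near ‖x‖₂ = 5 as k grows; note the sketch is linear in x, which is what makes it usable over streams of coordinate updates.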
1-Pass Relative-Error Lp-Sampling with Applications
Abstract

Cited by 14 (6 self)
For any p ∈ [0, 2], we give a 1-pass poly(ε^−1 log n)-space algorithm which, given a data stream of length m with insertions and deletions of an n-dimensional vector a, with updates in the range {−M, −M + 1, …, M − 1, M}, outputs a sample of [n] = {1, 2, …, n} for which, for all i, the probability that i is returned is (1 ± ε)|a_i|^p / Fp(a) ± n^−C, where a_i denotes the (possibly negative) value of coordinate i, Fp(a) = ∑_{i=1}^n |a_i|^p = ‖a‖_p^p denotes the p-th frequency moment (i.e., the p-th power of the Lp norm), and C > 0 is an arbitrarily large constant. Here we assume that n, m, and M are polynomially related. Our generic sampling framework improves and unifies algorithms for several communication and streaming problems, including cascaded norms, heavy hitters, and moment estimation. It also gives the first relative-error forward sampling algorithm in a data stream with deletions, answering an open question of Cormode et al.
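The target distribution of such a sampler can be written down directly once the streaming constraint is dropped. The following offline reference implementation (names ours) just draws index i with probability |a_i|^p / Fp(a); the paper's contribution is approximating this distribution in one pass and small space:

```python
import random

def lp_sample(a, p, rng):
    """Offline reference L_p sampler: return index i with probability
    |a_i|^p / F_p(a).  This stores the whole vector; the streaming
    algorithm achieves roughly this distribution in poly(1/eps, log n)
    space under insertions and deletions.
    """
    weights = [abs(v) ** p for v in a]
    total = sum(weights)
    r = rng.random() * total
    running = 0.0
    for i, w in enumerate(weights):
        running += w
        if r < running:
            return i
    return len(a) - 1  # guard against floating-point round-off
```

For a = (1, −2, 3) and p = 2, the weights are (1, 4, 9) with F₂(a) = 14, so index 2 should be drawn about 9/14 of the time.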
A unified framework for approximating and clustering data. Manuscript available at arXiv.org
2011
Abstract

Cited by 14 (7 self)
Given a set F of n positive functions over a ground set X, we consider the problem of computing x∗ that minimizes the expression ∑_{f∈F} f(x) over x ∈ X. A typical application is shape fitting, where we wish to approximate a set P of n elements (say, points) by a shape x from a (possibly infinite) family X of shapes. Here, each point p ∈ P corresponds to a function f such that f(x) is the distance from p to x, and we seek a shape x that minimizes the sum of distances from each point in P. In the k-clustering variant, each x ∈ X is a tuple of k shapes, and f(x) is the distance from p to its closest shape in x. Our main result is a unified framework for constructing coresets and approximate clustering for such general sets of functions. To achieve our results, we forge a link between the classic and well-defined notion of ε-approximations from the theory of PAC learning and VC dimension, and the relatively new (and not so consistent) paradigm of coresets, which are a kind of “compressed representation” of the input set F. Using traditional techniques, a coreset usually implies an LTAS (linear time approximation scheme) for the corresponding optimization problem, which can be computed in parallel, via one pass over the data, and using only polylogarithmic space (i.e., in the streaming model). For several function families F for which coresets are known not to exist, or for which the corresponding (approximate) optimization problems are hard, our framework yields bicriteria approximations, or coresets that are large but contained in a low-dimensional space. We demonstrate our unified framework by applying it to projective clustering problems. We obtain new coreset constructions and significantly smaller coresets over the ones previously known.
The data stream space complexity of cascaded norms
In FOCS, 2009
Abstract

Cited by 11 (7 self)
We consider the problem of estimating cascaded aggregates over a matrix presented as a sequence of updates in a data stream. A cascaded aggregate P ◦ Q is defined by evaluating aggregate Q repeatedly over each row of the matrix, and then evaluating aggregate P over the resulting vector of values. This problem was introduced by Cormode and Muthukrishnan, PODS, 2005 [CM]. We analyze the space complexity of estimating cascaded norms on an n × d matrix to within a small relative error. Let Lp denote the p-th norm, where p is a non-negative integer. We abbreviate the cascaded norm Lk ◦ Lp by Lk,p. (1) For any constant k ≥ p ≥ 2, we obtain a 1-pass Õ(n^{1−2/k} d^{1−2/p})-space algorithm for estimating Lk,p. This is optimal up to polylogarithmic factors and resolves an open question of [CM] regarding the space complexity of L4,2. We also obtain 1-pass space-optimal algorithms for estimating L∞,k and Lk,∞. (2) We prove a space lower bound of Ω(n^{1−1/k}) on estimating Lk,0 and Lk,1, resolving an open question due to Indyk, IITK Data Streams Workshop (Problem 8), 2006. We also resolve two more questions of [CM] concerning Lk,2 estimation and block heavy hitter problems. Ganguly, Bansal and Dube (FAW, 2008) claimed an Õ(1)-space algorithm for estimating Lk,p for any k, p ∈ [0, 2]. Our lower bounds show this claim is incorrect.
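For concreteness, the quantity being estimated is simple to compute offline; the difficulty addressed by the paper is doing so in small space over a stream of updates. A direct evaluation (naming ours) is:

```python
def cascaded_norm(matrix, k, p):
    """Evaluate the cascaded norm L_k o L_p: take the L_p norm of each
    row, then the L_k norm of the resulting vector of row norms."""
    row_norms = [sum(abs(v) ** p for v in row) ** (1.0 / p) for row in matrix]
    return sum(r ** k for r in row_norms) ** (1.0 / k)
```

For the matrix [[3, 4], [0, 5]] the row L2 norms are (5, 5), so L_{2,2} equals √50.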
Near-optimal Column-based Matrix Reconstruction. arXiv report: arxiv.org/abs/1103.0995
2011
Abstract

Cited by 8 (1 self)
We consider low-rank reconstruction of a matrix using a subset of its columns, and we present asymptotically optimal algorithms for both spectral norm and Frobenius norm reconstruction. The main tools we introduce to obtain our results are: (i) the use of fast approximate SVD-like decompositions for column-based matrix reconstruction, and (ii) two deterministic algorithms for selecting rows from matrices with orthonormal columns, building upon the sparse representation theorem for decompositions of the identity that appeared in [1]. Keywords: low-rank matrix approximation; subset selection; SVD; approximate SVD; spectral sparsification
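The objective such algorithms compete on can be checked directly: fix a column subset C of A, project every column of A onto span(C), and measure the residual Frobenius norm. A small pure-Python checker (Gram–Schmidt based; the function names are ours, and this only evaluates a given selection rather than implementing the paper's selection algorithms):

```python
import math

def column_reconstruction_error(A, cols):
    """Frobenius-norm error of reconstructing A (a list of rows) from the
    columns indexed by `cols`: orthonormalize those columns via
    Gram-Schmidt, project every column of A onto their span, and return
    ||A - P_C A||_F."""
    m = len(A)
    def col(j):
        return [A[i][j] for i in range(m)]
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))
    # Build an orthonormal basis Q for the span of the chosen columns.
    Q = []
    for j in cols:
        v = col(j)
        for q in Q:
            c = dot(q, v)
            v = [vi - c * qi for vi, qi in zip(v, q)]
        norm = math.sqrt(dot(v, v))
        if norm > 1e-12:            # skip linearly dependent columns
            Q.append([vi / norm for vi in v])
    # Accumulate the squared residual of every column of A.
    err2 = 0.0
    for j in range(len(A[0])):
        v = col(j)
        for q in Q:
            c = dot(q, v)
            v = [vi - c * qi for vi, qi in zip(v, q)]
        err2 += dot(v, v)
    return math.sqrt(err2)
```

If the selected columns span the column space of A, the error is zero; otherwise it is the Frobenius distance from A to its best approximation within that column span.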
Fast Manhattan sketches in data streams
In Proceedings of the 29th ACM SIGACT–SIGMOD–SIGART Symposium on Principles of Database Systems (PODS), 2010
Abstract

Cited by 7 (3 self)
The ℓ1-distance, also known as the Manhattan or taxicab distance, between two vectors x, y ∈ R^n is ∑_{i=1}^n |x_i − y_i|. Approximating this distance is a fundamental primitive on massive databases, with applications to clustering, nearest neighbor search, network monitoring, regression, sampling, and support vector machines. We give the first 1-pass streaming algorithm for this problem in the turnstile model with O∗(ε^−2) space and O∗(1) update time. The O∗ notation hides polylogarithmic factors in ε, n, and the precision required to store vector entries. All previous algorithms either required Ω(ε^−3) space or Ω(ε^−2) update time, and/or could not work in the turnstile model (i.e., support an arbitrary number of updates to each coordinate). Our bounds are optimal up to O∗(1) factors.
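The turnstile model mentioned here is what makes linear sketches natural: a sketch S(x) = Cx absorbs any update (i, Δ) by adding Δ times column i of C. A toy Cauchy-based version (class name ours; it stores the full Cauchy matrix explicitly, so it illustrates the model and the median estimator, not the paper's O∗(ε^−2)-space, O∗(1)-update construction):

```python
import math
import random
import statistics

class ToyL1Sketch:
    """Linear sketch for the l1 norm in the turnstile model.

    Counter j holds <C[j], x>, where C has i.i.d. standard Cauchy
    entries.  By 1-stability, <C[j], x> is distributed as ||x||_1 times
    a Cauchy variate, and the median of |Cauchy| is 1, so the median of
    the |counters| estimates ||x||_1.
    """
    def __init__(self, n, k=601, seed=7):
        rng = random.Random(seed)
        # Standard Cauchy via the inverse CDF: tan(pi * (U - 1/2)).
        self.C = [[math.tan(math.pi * (rng.random() - 0.5)) for _ in range(n)]
                  for _ in range(k)]
        self.counters = [0.0] * k

    def update(self, i, delta):
        """Apply x[i] += delta; deltas may be negative (turnstile model)."""
        for j, row in enumerate(self.C):
            self.counters[j] += delta * row[i]

    def estimate(self):
        return statistics.median(abs(c) for c in self.counters)
```

Feeding the coordinates of x as positive updates and those of y as negated updates leaves the sketch holding x − y, so estimate() approximates the Manhattan distance ‖x − y‖₁.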
A Near-Linear Algorithm for Projective Clustering Integer Points
Abstract

Cited by 3 (0 self)
We consider the problem of projective clustering in Euclidean spaces of non-fixed dimension. Here, we are given a set P of n points in R^m and integers j ≥ 1, k ≥ 0, and the goal is to find j k-subspaces so that the sum of the distances of each point in P to the nearest subspace is minimized. Observe that this is a shape fitting problem where we wish to find the best fit in the L1 sense. Here we will treat the number j of subspaces we want to fit and the dimension k of each of them as constants. We consider instances of projective clustering where the point coordinates are integers of magnitude polynomial in m and n. Our main result is a randomized algorithm that for any ε > 0 runs in time O(mn polylog(mn)) and outputs a solution that with high probability is within (1 + ε) of the optimal solution. To obtain this result, we show that the fixed-dimensional version of the above projective clustering problem has a small coreset. We do that by observing that, in a fairly general sense, shape fitting problems that have small coresets in the L∞ setting also have small coresets in the L1 setting, and then exploiting an existing construction for the L∞ setting. This observation seems to be quite useful for other shape fitting problems as well, as we demonstrate by constructing the first “regular” coreset for the circle fitting problem in the plane.
Subspace Embeddings for the L1-norm with Applications
Abstract

Cited by 2 (1 self)
We show there is a distribution over linear mappings R: ℓ1^n → ℓ1^{O(d log d)}, such that with arbitrarily large constant probability, for any fixed d-dimensional subspace L, for all x ∈ L we have ‖x‖1 ≤ ‖Rx‖1 = O(d log d)‖x‖1. This provides the first analogue of the ubiquitous subspace Johnson–Lindenstrauss embedding for the ℓ1-norm. Importantly, the target dimension and distortion are independent of the ambient dimension n. We give several applications of this result. First, we give a faster algorithm for computing well-conditioned bases. Our algorithm is simple, avoiding the linear programming machinery required by previous algorithms. We also give faster algorithms for least absolute deviation regression and ℓ1-norm best-fit hyperplane problems, as well as the first low-space single-pass streaming algorithms for these problems. These results are motivated by practical problems in image analysis, spam detection, and statistics, where the ℓ1-norm is used in studies where outliers may be safely and effectively ignored, because the ℓ1-norm is more robust to outliers than the ℓ2-norm.
Bypassing UGC from some Optimal Geometric Inapproximability Results
In Electronic Colloquium on Computational Complexity, Report No. 177, 2010
Abstract

Cited by 2 (0 self)
The Unique Games Conjecture (UGC) has emerged in recent years as the starting point for several optimal inapproximability results. While for none of these results is a reverse reduction to Unique Games known, the assumption of bijective projections in the Label Cover instance seems critical in these proofs. In this work we bypass the UGC assumption in inapproximability results for two geometric problems, obtaining a tight NP-hardness result in each case. The first problem, known as Lp Subspace Approximation, is a generalization of the classic least squares regression problem. Here, the input consists of a set of points S = {a1, …, am} ⊆ R^n and a parameter k (possibly depending on n). The goal is to find a subspace H of R^n of dimension k that minimizes the sum of the p-th powers of the distances to the points. For p = 2, k = n − 1, this reduces to the least squares regression problem, while for p = ∞, k = 0 it reduces to the problem of finding a ball of minimum radius enclosing all the points. We show that for any fixed p (2 < p < ∞) it is NP-hard to approximate this problem to within a factor of γp − ε for constant ε > 0, where γp is the p-th moment of a standard Gaussian variable. This matches the factor-γp approximation algorithm obtained by Deshpande, Tulsiani and Vishnoi.
Sketching and Streaming High-Dimensional Vectors
2011
Abstract

Cited by 1 (0 self)
A sketch of a dataset is a small-space data structure supporting some pre-specified set of queries (and possibly updates) while consuming space substantially sublinear in the space required to actually store all the data. Furthermore, it is often desirable, or required by the application, that the sketch itself be computable by a small-space algorithm given just one pass over the data, a so-called streaming algorithm. Sketching and streaming have found numerous applications in network traffic monitoring, data mining, trend detection, sensor networks, and databases. In this thesis, I describe several new contributions in the area of sketching and streaming algorithms:
• The first space-optimal streaming algorithm for the distinct elements problem. Our algorithm also achieves O(1) update and reporting times.
• A streaming algorithm for Hamming norm estimation in the turnstile model which achieves the best known space complexity.
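The distinct elements problem admits a classic small-space estimator that conveys the flavor, though not the optimality, of the result above: hash each item to a pseudo-uniform value in [0, 1) and keep only the k smallest distinct hash values (the k-minimum-values sketch; the code and its parameters are illustrative, not the thesis's algorithm):

```python
import hashlib

def kmv_distinct(stream, k=256):
    """K-minimum-values estimate of the number of distinct items.

    Each item is hashed to a pseudo-uniform value in [0, 1); only the k
    smallest distinct hashes are retained.  If d items are distinct, the
    k-th smallest hash lands around k / d, so (k - 1) / h_k estimates d.
    """
    smallest = set()
    for item in stream:
        # SHA-1 digest is 40 hex chars, i.e. an integer below 16**40.
        h = int(hashlib.sha1(str(item).encode()).hexdigest(), 16) / 16.0 ** 40
        smallest.add(h)
        if len(smallest) > k:
            smallest.remove(max(smallest))
    if len(smallest) < k:
        return len(smallest)      # fewer than k distinct items: exact count
    return (k - 1) / max(smallest)
```

Duplicates hash to the same value and are absorbed by the set, so only the number of distinct items matters; the relative error shrinks roughly as 1/√k.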