Results 1–10 of 53
Stable Distributions, Pseudorandom Generators, Embeddings and Data Stream Computation
, 2000
Abstract

Cited by 324 (13 self)
In this paper we show several results obtained by combining the use of stable distributions with pseudorandom generators for bounded space. In particular: • we show how to maintain (using only O(log n/ε²) words of storage) a sketch C(p) of a point p ∈ ℓ1^n under dynamic updates of its coordinates, such that given sketches C(p) and C(q) one can estimate |p − q|1 up to a factor of (1 + ε) with large probability. This solves the main open problem of [10]. • we obtain another sketch function C′ which maps ℓ1^n into a normed space ℓ1^m (as opposed to C), such that m = m(n) is much smaller than n; to our knowledge this is the first dimensionality reduction lemma for the ℓ1 norm. • we give an explicit embedding of ℓ2^n into ℓ1^(n^O(log n)) with distortion (1 + 1/n^Θ(1)) and a nonconstructive embedding of ℓ2^n into ℓ1^(O(n)) with distortion (1 + ε) such that the embedding can be represented using only O(n log² n) bits (as opposed to at least...
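As an illustration (not taken from the paper), the first result's sketching idea can be sketched in a few lines: project onto i.i.d. standard Cauchy vectors, exploit linearity for dynamic updates, and use the median of absolute coordinates as the distance estimator. All names below are hypothetical, and the paper's pseudorandom generator for bounded space (which derandomizes the matrix storage) is omitted.

```python
import math
import random
import statistics

def cauchy_matrix(k, n, seed=0):
    # i.i.d. standard Cauchy entries via the inverse CDF: tan(pi * (U - 1/2))
    rng = random.Random(seed)
    return [[math.tan(math.pi * (rng.random() - 0.5)) for _ in range(n)]
            for _ in range(k)]

class L1Sketch:
    """Linear sketch C(p) = Rp of a point p, maintained under dynamic updates."""
    def __init__(self, R):
        self.R = R
        self.s = [0.0] * len(R)

    def update(self, j, delta):
        # coordinate update p[j] += delta; the sketch is linear, so it can
        # be refreshed in O(k) time without ever storing p itself
        for i, row in enumerate(self.R):
            self.s[i] += row[j] * delta

def estimate_l1(sketch_p, sketch_q):
    # by 1-stability, each coordinate of C(p) - C(q) is Cauchy with scale
    # |p - q|_1, and the median of |Cauchy(0, d)| is exactly d
    return statistics.median(abs(a - b) for a, b in zip(sketch_p.s, sketch_q.s))
```

The estimator concentrates as k grows; the (1 + ε) guarantee in the abstract corresponds to taking k = O(1/ε²) rows.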
On the optimality of the dimensionality reduction method
 in Proc. 47th IEEE Symposium on Foundations of Computer Science (FOCS)
Abstract

Cited by 32 (4 self)
We investigate the optimality of (1 + ε)-approximation algorithms obtained via the dimensionality reduction method. We show that: • Any data structure for the (1 + ε)-approximate nearest neighbor problem in Hamming space which uses a constant number of probes to answer each query must use n^Ω(1/ε²) space. • Any algorithm for the (1 + ε)-approximate closest substring problem must run in time exponential in 1/ε^(2−γ) for any γ > 0 (unless 3-SAT can be solved in subexponential time). Both lower bounds are (essentially) tight.
Compressing Large Boolean Matrices Using Reordering Techniques
, 2004
Abstract

Cited by 31 (1 self)
Large boolean matrices are a basic representational unit in a variety of applications, with some notable examples being interactive visualization systems, mining large graph structures, and association rule mining. Designing space and time efficient scalable storage and query mechanisms for such large matrices is a challenging problem.
Nonlinear Estimators and Tail Bounds for Dimension Reduction in ℓ1 Using Cauchy Random Projections
, 2007
Abstract

Cited by 29 (0 self)
For dimension reduction in the ℓ1 norm, the method of Cauchy random projections multiplies the original data matrix A ∈ R^(n×D) with a random matrix R ∈ R^(D×k) (k ≪ D) whose entries are i.i.d. samples of the standard Cauchy C(0, 1). Because of the impossibility result, one cannot hope to recover the pairwise ℓ1 distances in A from B = A × R ∈ R^(n×k) using linear estimators without incurring large errors. However, nonlinear estimators are still useful for certain applications in data stream computations, information retrieval, learning, and data mining. We study three types of nonlinear estimators: the sample median estimators, the geometric mean estimators, and the maximum likelihood estimators (MLE). We derive tail bounds for the geometric mean estimators and establish that k = O(log n / ε²) suffices, with the constants explicitly given. Asymptotically (as k → ∞), both the sample median and the geometric mean estimators are about 80% efficient compared to the MLE. We analyze the moments of the MLE and propose approximating its distribution by an inverse Gaussian. Keywords: dimension reduction, ℓ1 norm, Johnson-Lindenstrauss (JL) lemma, Cauchy random projections
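To illustrate two of the estimators named above (a minimal sketch under the stated model, with hypothetical function names): if a projected coordinate vector x is i.i.d. Cauchy(0, d), where d is the ℓ1 distance, then E|C(0, d)|^(1/k) = d^(1/k) / cos(π/(2k)), which yields the bias correction used in the geometric mean estimator below, while the sample median estimator uses the fact that the median of |C(0, d)| is exactly d.

```python
import math
import random

def geometric_mean_estimate(x):
    # bias-corrected geometric mean: E of the product of |x_j|^(1/k) is
    # d / cos(pi/(2k))^k, so multiplying by cos(pi/(2k))^k gives mean d
    k = len(x)
    log_gm = sum(math.log(abs(v)) for v in x) / k
    return math.exp(log_gm) * math.cos(math.pi / (2 * k)) ** k

def sample_median_estimate(x):
    # the median of |Cauchy(0, d)| is exactly d
    a = sorted(abs(v) for v in x)
    m = len(a) // 2
    return a[m] if len(a) % 2 else 0.5 * (a[m - 1] + a[m])
```

Both estimates concentrate around d as k grows, consistent with the k = O(log n / ε²) bound quoted in the abstract.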
Metric embeddings with relaxed guarantees
 in Proceedings of the 46th IEEE Symposium on Foundations of Computer Science
, 2005
Abstract

Cited by 25 (6 self)
We consider the problem of embedding finite metrics with slack: we seek to produce embeddings with small dimension and distortion while allowing a (small) constant fraction of all distances to be arbitrarily distorted. This definition is motivated by recent research in the networking community, which achieved striking empirical success at embedding Internet latencies with low distortion into low-dimensional Euclidean space, provided that some small slack is allowed. Answering an open question of Kleinberg, Slivkins, and Wexler [29], we show that provable guarantees of this type can in fact be achieved in general: any finite metric can be embedded, with constant slack and constant distortion, into constant-dimensional Euclidean space. We then show that there exist stronger embeddings into ℓ1 which exhibit...
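A hypothetical illustration of the beacon (landmark) idea underlying such embeddings, not the paper's actual construction: map each point to its vector of distances to a few random beacons. By the triangle inequality, |d(x, b) − d(y, b)| ≤ d(x, y), so the embedded ℓ∞ distance can only contract; a slack guarantee bounds how many pairs it contracts badly.

```python
import math
import random

def beacon_embedding(points, dist, k, seed=0):
    # each point is mapped to its vector of distances to k random beacons
    rng = random.Random(seed)
    beacons = rng.sample(points, k)
    return {p: [dist(p, b) for b in beacons] for p in points}

def linf(u, v):
    # l_infinity distance between two embedded vectors
    return max(abs(a - b) for a, b in zip(u, v))
```

The non-expansiveness property (embedded distance never exceeds the true distance) holds for every pair, with no randomness needed.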
Conditional random sampling: A sketch-based sampling technique for sparse data
 In NIPS
, 2006
Abstract

Cited by 23 (14 self)
We develop Conditional Random Sampling (CRS), a technique particularly suitable for sparse data. In large-scale applications, the data are often highly sparse. CRS combines sketching and sampling in that it converts sketches of the data into conditional random samples online in the estimation stage, with the sample size determined retrospectively. This paper focuses on approximating pairwise ℓ2 and ℓ1 distances and comparing CRS with random projections. For boolean (0/1) data, CRS is provably better than random projections. We show using real-world data that CRS often outperforms random projections. This technique can be applied in learning, data mining, information retrieval, and database query optimizations.
Earth Mover Distance over High-Dimensional Spaces
, 2007
Abstract

Cited by 22 (8 self)
The Earth Mover Distance (EMD) between two equal-size sets of points in R^d is defined to be the minimum cost of a bipartite matching between the two point sets. It is a natural metric for comparing sets of features, and as such, it has received significant interest in computer vision. Motivated by recent developments in that area, we address computational problems involving EMD over high-dimensional point sets. A natural approach is to embed the EMD metric into ℓ1, and use the algorithms designed for the latter space. However, Khot and Naor [KN06] show that any embedding of EMD over the d-dimensional Hamming cube into ℓ1 must incur a distortion Ω(d), thus practically losing all distance information. We circumvent this roadblock by focusing on sets with cardinalities upper-bounded by a parameter s, and achieve a distortion of only O(log s · log d). Since in applications the feature sets have bounded size, the resulting distortion is much smaller than the Ω(d) lower bound. Our approach is quite general and easily extends to EMD over R^d. We then provide a strong lower bound on the multi-round communication complexity of estimating EMD, which in particular strengthens the known non-embeddability result of [KN06]. Our bound exhibits a smooth tradeoff between approximation and communication, and for example implies that every algorithm that estimates EMD using constant-size sketches can only achieve Ω(log s) approximation.
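For concreteness, the definition in the first sentence can be computed directly for tiny sets (a brute-force illustration of the definition only, nothing like the paper's embedding):

```python
import itertools
import math

def emd(A, B):
    # exact EMD between two equal-size point sets: the minimum-cost
    # bipartite matching, found by brute force over all |B|! assignments
    # (only viable for tiny sets)
    assert len(A) == len(B)
    dist = lambda p, q: math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
    return min(sum(dist(p, q) for p, q in zip(A, matched))
               for matched in itertools.permutations(B))
```

Real instances use polynomial-time matching algorithms; the factorial search here is only to make the min-cost-matching definition concrete.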