Results 1  10
of
355
Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality
, 1998
"... The nearest neighbor problem is the following: Given a set of n points P = fp 1 ; : : : ; png in some metric space X, preprocess P so as to efficiently answer queries which require finding the point in P closest to a query point q 2 X. We focus on the particularly interesting case of the ddimens ..."
Abstract

Cited by 715 (33 self)
 Add to MetaCart
The nearest neighbor problem is the following: Given a set of n points P = fp 1 ; : : : ; png in some metric space X, preprocess P so as to efficiently answer queries which require finding the point in P closest to a query point q 2 X. We focus on the particularly interesting case of the ddimensional Euclidean space where X = ! d under some l p norm. Despite decades of effort, the current solutions are far from satisfactory; in fact, for large d, in theory or in practice, they provide little improvement over the bruteforce algorithm which compares the query point to each data point. Of late, there has been some interest in the approximate nearest neighbors problem, which is: Find a point p 2 P that is an fflapproximate nearest neighbor of the query q in that for all p 0 2 P , d(p; q) (1 + ffl)d(p 0 ; q). We present two algorithmic results for the approximate version that significantly improve the known bounds: (a) preprocessing cost polynomial in n and d, and a trul...
Similarity search in high dimensions via hashing
, 1999
"... The nearest or nearneighbor query problems arise in a large variety of database applications, usually in the context of similarity searching. Of late, there has been increasing interest in building search/index structures for performing similarity search over highdimensional data, e.g., image dat ..."
Abstract

Cited by 415 (12 self)
 Add to MetaCart
The nearest or nearneighbor query problems arise in a large variety of database applications, usually in the context of similarity searching. Of late, there has been increasing interest in building search/index structures for performing similarity search over highdimensional data, e.g., image databases, document collections, timeseries databases, and genome databases. Unfortunately, all known techniques for solving this problem fall prey to the \curse of dimensionality. " That is, the data structures scale poorly with data dimensionality; in fact, if the number of dimensions exceeds 10 to 20, searching in kd trees and related structures involves the inspection of a large fraction of the database, thereby doing no better than bruteforce linear search. It has been suggested that since the selection of features and the choice of a distance metric in typical applications is rather heuristic, determining an approximate nearest neighbor should su ce for most practical purposes. In this paper, we examine a novel scheme for approximate similarity search based on hashing. The basic idea is to hash the points
Convolution Kernels on Discrete Structures
, 1999
"... We introduce a new method of constructing kernels on sets whose elements are discrete structures like strings, trees and graphs. The method can be applied iteratively to build a kernel on an infinite set from kernels involving generators of the set. The family of kernels generated generalizes the fa ..."
Abstract

Cited by 368 (0 self)
 Add to MetaCart
We introduce a new method of constructing kernels on sets whose elements are discrete structures like strings, trees and graphs. The method can be applied iteratively to build a kernel on an infinite set from kernels involving generators of the set. The family of kernels generated generalizes the family of radial basis kernels. It can also be used to define kernels in the form of joint Gibbs probability distributions. Kernels can be built from hidden Markov random elds, generalized regular expressions, pairHMMs, or ANOVA decompositions. Uses of the method lead to open problems involving the theory of infinitely divisible positive definite functions. Fundamentals of this theory and the theory of reproducing kernel Hilbert spaces are reviewed and applied in establishing the validity of the method.
Probabilistic Approximation of Metric Spaces and its Algorithmic Applications
 In 37th Annual Symposium on Foundations of Computer Science
, 1996
"... The goal of approximating metric spaces by more simple metric spaces has led to the notion of graph spanners [PU89, PS89] and to lowdistortion embeddings in lowdimensional spaces [LLR94], having many algorithmic applications. This paper provides a novel technique for the analysis of randomized ..."
Abstract

Cited by 323 (28 self)
 Add to MetaCart
The goal of approximating metric spaces by more simple metric spaces has led to the notion of graph spanners [PU89, PS89] and to lowdistortion embeddings in lowdimensional spaces [LLR94], having many algorithmic applications. This paper provides a novel technique for the analysis of randomized algorithms for optimization problems on metric spaces, by relating the randomized performance ratio for any metric space to the randomized performance ratio for a set of "simple" metric spaces. We define a notion of a set of metric spaces that probabilisticallyapproximates another metric space. We prove that any metric space can be probabilisticallyapproximated by hierarchically wellseparated trees (HST) with a polylogarithmic distortion. These metric spaces are "simple" as being: (1) tree metrics. (2) natural for applying a divideandconquer algorithmic approach. The technique presented is of particular interest in the context of online computation. A large number of online al...
Exact Matrix Completion via Convex Optimization
, 2008
"... We consider a problem of considerable practical interest: the recovery of a data matrix from a sampling of its entries. Suppose that we observe m entries selected uniformly at random from a matrix M. Can we complete the matrix and recover the entries that we have not seen? We show that one can perfe ..."
Abstract

Cited by 320 (19 self)
 Add to MetaCart
We consider a problem of considerable practical interest: the recovery of a data matrix from a sampling of its entries. Suppose that we observe m entries selected uniformly at random from a matrix M. Can we complete the matrix and recover the entries that we have not seen? We show that one can perfectly recover most lowrank matrices from what appears to be an incomplete set of entries. We prove that if the number m of sampled entries obeys m ≥ C n 1.2 r log n for some positive numerical constant C, then with very high probability, most n × n matrices of rank r can be perfectly recovered by solving a simple convex optimization program. This program finds the matrix with minimum nuclear norm that fits the data. The condition above assumes that the rank is not too large. However, if one replaces the 1.2 exponent with 1.25, then the result holds for all values of the rank. Similar results hold for arbitrary rectangular matrices as well. Our results are connected with the recent literature on compressed sensing, and show that objects other than signals and images can be perfectly reconstructed from very limited information.
A Tight Bound on Approximating Arbitrary Metrics by Tree Metrics
 In Proceedings of the 35th Annual ACM Symposium on Theory of Computing
, 2003
"... In this paper, we show that any n point metric space can be embedded into a distribution over dominating tree metrics such that the expected stretch of any edge is O(log n). This improves upon the result of Bartal who gave a bound of O(log n log log n). Moreover, our result is existentially tight; t ..."
Abstract

Cited by 269 (7 self)
 Add to MetaCart
In this paper, we show that any n point metric space can be embedded into a distribution over dominating tree metrics such that the expected stretch of any edge is O(log n). This improves upon the result of Bartal who gave a bound of O(log n log log n). Moreover, our result is existentially tight; there exist metric spaces where any tree embedding must have distortion#sto n)distortion. This problem lies at the heart of numerous approximation and online algorithms including ones for group Steiner tree, metric labeling, buyatbulk network design and metrical task system. Our result improves the performance guarantees for all of these problems.
Stable Distributions, Pseudorandom Generators, Embeddings and Data Stream Computation
, 2000
"... In this paper we show several results obtained by combining the use of stable distributions with pseudorandom generators for bounded space. In particular: ffl we show how to maintain (using only O(log n=ffl 2 ) words of storage) a sketch C(p) of a point p 2 l n 1 under dynamic updates of its coo ..."
Abstract

Cited by 263 (15 self)
 Add to MetaCart
In this paper we show several results obtained by combining the use of stable distributions with pseudorandom generators for bounded space. In particular: ffl we show how to maintain (using only O(log n=ffl 2 ) words of storage) a sketch C(p) of a point p 2 l n 1 under dynamic updates of its coordinates, such that given sketches C(p) and C(q) one can estimate jp \Gamma qj 1 up to a factor of (1 + ffl) with large probability. This solves the main open problem of [10]. ffl we obtain another sketch function C 0 which maps l n 1 into a normed space l m 1 (as opposed to C), such that m = m(n) is much smaller than n; to our knowledge this is the first dimensionality reduction lemma for l 1 norm ffl we give an explicit embedding of l n 2 into l n O(log n) 1 with distortion (1 + 1=n \Theta(1) ) and a nonconstructive embedding of l n 2 into l O(n) 1 with distortion (1 + ffl) such that the embedding can be represented using only O(n log 2 n) bits (as opposed to at least...
On Approximating Arbitrary Metrics by Tree Metrics
 In Proceedings of the 30th Annual ACM Symposium on Theory of Computing
, 1998
"... This paper is concerned with probabilistic approximation of metric spaces. In previous work we introduced the method of ecient approximation of metrics by more simple families of metrics in a probabilistic fashion. In particular we study probabilistic approximations of arbitrary metric spaces by \hi ..."
Abstract

Cited by 260 (13 self)
 Add to MetaCart
This paper is concerned with probabilistic approximation of metric spaces. In previous work we introduced the method of ecient approximation of metrics by more simple families of metrics in a probabilistic fashion. In particular we study probabilistic approximations of arbitrary metric spaces by \hierarchically wellseparated tree" metric spaces. This has proved as a useful technique for simplifying the solutions to various problems.
Expander Flows, Geometric Embeddings and Graph Partitioning
 IN 36TH ANNUAL SYMPOSIUM ON THE THEORY OF COMPUTING
, 2004
"... We give a O( log n)approximation algorithm for sparsest cut, balanced separator, and graph conductance problems. This improves the O(log n)approximation of Leighton and Rao (1988). We use a wellknown semidefinite relaxation with triangle inequality constraints. Central to our analysis is a ..."
Abstract

Cited by 238 (18 self)
 Add to MetaCart
We give a O( log n)approximation algorithm for sparsest cut, balanced separator, and graph conductance problems. This improves the O(log n)approximation of Leighton and Rao (1988). We use a wellknown semidefinite relaxation with triangle inequality constraints. Central to our analysis is a geometric theorem about projections of point sets in , whose proof makes essential use of a phenomenon called measure concentration.
Nearoptimal hashing algorithms for approximate nearest neighbor in high dimensions
, 2008
"... In this article, we give an overview of efficient algorithms for the approximate and exact nearest neighbor problem. The goal is to preprocess a dataset of objects (e.g., images) so that later, given a new query object, one can quickly return the dataset object that is most similar to the query. The ..."
Abstract

Cited by 237 (4 self)
 Add to MetaCart
In this article, we give an overview of efficient algorithms for the approximate and exact nearest neighbor problem. The goal is to preprocess a dataset of objects (e.g., images) so that later, given a new query object, one can quickly return the dataset object that is most similar to the query. The problem is of significant interest in a wide variety of areas.