Results 1 - 10
of
24
An Optimal Algorithm for Approximate Nearest Neighbor Searching in Fixed Dimensions
- ACM-SIAM SYMPOSIUM ON DISCRETE ALGORITHMS
, 1994
"... Consider a set S of n data points in real d-dimensional space, R d , where distances are measured using any Minkowski metric. In nearest neighbor searching we preprocess S into a data structure, so that given any query point q 2 R d , the closest point of S to q can be reported quickly. Given any po ..."
Abstract
-
Cited by 634 (29 self)
- Add to MetaCart
Consider a set S of n data points in real d-dimensional space, R d , where distances are measured using any Minkowski metric. In nearest neighbor searching we preprocess S into a data structure, so that given any query point q 2 R d , the closest point of S to q can be reported quickly. Given any positive real ffl, a data point p is a (1 + ffl)-approximate nearest neighbor of q if its distance from q is within a factor of (1 + ffl) of the distance to the true nearest neighbor. We show that it is possible to preprocess a set of n points in R d in O(dn log n) time and O(dn) space, so that given a query point q 2 R d , and ffl ? 0, a (1 + ffl)-approximate nearest neighbor of q can be computed in O(c d;ffl log n) time, where c d;ffl d d1 + 6d=ffle d is a factor depending only on dimension and ffl. In general, we show that given an integer k 1, (1 + ffl)-approximations to the k nearest neighbors of q can be computed in additional O(kd log n) time.
Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality
, 1998
"... The nearest neighbor problem is the following: Given a set of n points P = fp 1 ; : : : ; png in some metric space X, preprocess P so as to efficiently answer queries which require finding the point in P closest to a query point q 2 X. We focus on the particularly interesting case of the d-dimens ..."
Abstract
-
Cited by 533 (28 self)
- Add to MetaCart
The nearest neighbor problem is the following: Given a set of n points P = fp 1 ; : : : ; png in some metric space X, preprocess P so as to efficiently answer queries which require finding the point in P closest to a query point q 2 X. We focus on the particularly interesting case of the d-dimensional Euclidean space where X = ! d under some l p norm. Despite decades of effort, the current solutions are far from satisfactory; in fact, for large d, in theory or in practice, they provide little improvement over the brute-force algorithm which compares the query point to each data point. Of late, there has been some interest in the approximate nearest neighbors problem, which is: Find a point p 2 P that is an ffl-approximate nearest neighbor of the query q in that for all p 0 2 P , d(p; q) (1 + ffl)d(p 0 ; q). We present two algorithmic results for the approximate version that significantly improve the known bounds: (a) preprocessing cost polynomial in n and d, and a trul...
Similarity search in high dimensions via hashing
, 1999
"... The nearest- or near-neighbor query problems arise in a large variety of database applications, usually in the context of similarity searching. Of late, there has been increasing interest in building search/index structures for performing similarity search over high-dimensional data, e.g., image dat ..."
Abstract
-
Cited by 275 (11 self)
- Add to MetaCart
The nearest- or near-neighbor query problems arise in a large variety of database applications, usually in the context of similarity searching. Of late, there has been increasing interest in building search/index structures for performing similarity search over high-dimensional data, e.g., image databases, document collections, time-series databases, and genome databases. Unfortunately, all known techniques for solving this problem fall prey to the \curse of dimensionality. " That is, the data structures scale poorly with data dimensionality; in fact, if the number of dimensions exceeds 10 to 20, searching in k-d trees and related structures involves the inspection of a large fraction of the database, thereby doing no better than brute-force linear search. It has been suggested that since the selection of features and the choice of a distance metric in typical applications is rather heuristic, determining an approximate nearest neighbor should su ce for most practical purposes. In this paper, we examine a novel scheme for approximate similarity search based on hashing. The basic idea is to hash the points
Geometric Range Searching and Its Relatives
- CONTEMPORARY MATHEMATICS
"... ... process a set S of points in so that the points of S lying inside a query R region can be reported or counted quickly. Wesurvey the known techniques and data structures for range searching and describe their application to other related searching problems. ..."
Abstract
-
Cited by 223 (35 self)
- Add to MetaCart
... process a set S of points in so that the points of S lying inside a query R region can be reported or counted quickly. Wesurvey the known techniques and data structures for range searching and describe their application to other related searching problems.
When Is "Nearest Neighbor" Meaningful?
- In Int. Conf. on Database Theory
, 1999
"... . We explore the effect of dimensionality on the "nearest neighbor " problem. We show that under a broad set of conditions (much broader than independent and identically distributed dimensions), as dimensionality increases, the distance to the nearest data point approaches the distance to the fa ..."
Abstract
-
Cited by 222 (1 self)
- Add to MetaCart
. We explore the effect of dimensionality on the "nearest neighbor " problem. We show that under a broad set of conditions (much broader than independent and identically distributed dimensions), as dimensionality increases, the distance to the nearest data point approaches the distance to the farthest data point. To provide a practical perspective, we present empirical results on both real and synthetic data sets that demonstrate that this effect can occur for as few as 10-15 dimensions. These results should not be interpreted to mean that high-dimensional indexing is never meaningful; we illustrate this point by identifying some high-dimensional workloads for which this effect does not occur. However, our results do emphasize that the methodology used almost universally in the database literature to evaluate high-dimensional indexing techniques is flawed, and should be modified. In particular, most such techniques proposed in the literature are not evaluated versus simple...
Similarity Indexing: Algorithms and Performance
- In Storage and Retrieval for Image and Video Databases (SPIE
, 1996
"... Efficient indexing support is essential to allow content-based image and video databases using similaritybased retrieval to scale to large databases (tens of thousands up to millions of images). In this paper, we take an in depth look at this problem. One of the major difficulties in solving this pr ..."
Abstract
-
Cited by 105 (1 self)
- Add to MetaCart
Efficient indexing support is essential to allow content-based image and video databases using similaritybased retrieval to scale to large databases (tens of thousands up to millions of images). In this paper, we take an in depth look at this problem. One of the major difficulties in solving this problem is the high dimension (6-100) of the feature vectors that are used to represent objects. We provide an overview of the work in computational geometry on this problem and highlight the results we found are most useful in practice, including the use of approximate nearest neighbor algorithms. We also present a variant of the optimized k-d tree we call the VAM k-d tree, and provide algorithms to create an optimized R-tree we call the VAMSplit R-tree. We found that the VAMSplit R-tree provided better overall performance than all competing structures we tested for main memory and secondary memory applications. We observed large improvements in performance relative to the R*-tree and SS-tree...
Approximate Nearest Neighbor Queries Revisited
, 1998
"... This paper proposes new methods to answer approximate nearest neighbor queries on a set of n points in d-dimensional Euclidean space. For any fixed constant d, a data structure with O(" (1\Gammad)=2 n log n) preprocessing time and O(" (1\Gammad)=2 log n) query time achieves approximation factor ..."
Abstract
-
Cited by 51 (3 self)
- Add to MetaCart
This paper proposes new methods to answer approximate nearest neighbor queries on a set of n points in d-dimensional Euclidean space. For any fixed constant d, a data structure with O(" (1\Gammad)=2 n log n) preprocessing time and O(" (1\Gammad)=2 log n) query time achieves approximation factor 1 + " for any given 0 ! " ! 1; a variant reduces the "-dependence by a factor of " \Gamma1=2 . For any arbitrary d, a data structure with O(d 2 n log n) preprocessing time and O(d 2 log n) query time achieves approximation factor O(d 3=2 ). Applications to various proximity problems are discussed. 1 Introduction Let P be a set of n point sites in d-dimensional space IR d . In the well-known post office problem, we want to preprocess P into a data structure so that a site closest to a given query point q (called the nearest neighbor of q) can be found efficiently. Distances are measured under the Euclidean metric. The post office problem has many applications within computational...
A local search approximation algorithm for k-means clustering
, 2004
"... In k-means clustering we are given a set of n data points in d-dimensional space ℜd and an integer k, and the problem is to determine a set of k points in ℜd, called centers, to minimize the mean squared distance from each data point to its nearest center. No exact polynomial-time algorithms are kno ..."
Abstract
-
Cited by 47 (1 self)
- Add to MetaCart
In k-means clustering we are given a set of n data points in d-dimensional space ℜd and an integer k, and the problem is to determine a set of k points in ℜd, called centers, to minimize the mean squared distance from each data point to its nearest center. No exact polynomial-time algorithms are known for this problem. Although asymptotically efficient approximation algorithms exist, these algorithms are not practical due to the very high constant factors involved. There are many heuristics that are used in practice, but we know of no bounds on their performance. We consider the question of whether there exists a simple and practical approximation algorithm for k-means clustering. We present a local improvement heuristic based on swapping centers in and out. We prove that this yields a (9 + ε)-approximation algorithm. We present an example showing that any approach based on performing a fixed number of swaps achieves an approximation factor of at least (9 − ε) in all sufficiently high dimensions. Thus, our approximation factor is almost tight for algorithms based on performing a fixed number of swaps. To establish the practical value of the heuristic, we present an empirical study that shows that, when combined with
A Cost Model for Query Processing in High-Dimensional Data Spaces
, 2000
"... During the last decade, multimedia databases have become increasingly important in many application areas such as medicine, CAD, geography or molecular biology. An important research issue in the field of multimedia databases is similarity search in large data sets. Most current approaches addressin ..."
Abstract
-
Cited by 43 (0 self)
- Add to MetaCart
During the last decade, multimedia databases have become increasingly important in many application areas such as medicine, CAD, geography or molecular biology. An important research issue in the field of multimedia databases is similarity search in large data sets. Most current approaches addressing similarity search use the so-called feature approach which transforms important properties of the stored objects into points of a high-dimensional space (feature vectors). Thus, the similarity search is transformed into a neighborhood search in the feature space. For the management of the feature vectors, multidimensional index structures are usually applied. The performance of query processing can be substantially improved by opti...
A Lower Bound on the Complexity of Approximate Nearest-Neighbor Searching on the Hamming Cube
- In Proc. 31th Annual ACM Symposium on Theory of Computing (STOC’99
, 1999
"... We consider the nearest-neighbor problem over the d-cube: given a collection of points in {0, 1} d , find the one nearest to a query point (in the L 1 sense). We establish a lower bound of###90 log d/ log log log d) on the worst-case query time. This result holds in the cell probe model with ( ..."
Abstract
-
Cited by 18 (3 self)
- Add to MetaCart
We consider the nearest-neighbor problem over the d-cube: given a collection of points in {0, 1} d , find the one nearest to a query point (in the L 1 sense). We establish a lower bound of###90 log d/ log log log d) on the worst-case query time. This result holds in the cell probe model with (any amount of) polynomial storage and word-size d O(1) . The same lower bound holds for the approximate version of the problem, where the answer may be any point further than the nearest neighbor by a factor as large as 2 #(log d) 1-# # , for any fixed # > 0. 1 Introduction For a variety of practical reasons ranging from molecular biology to web searching, nearest-neighbor searching has been a focus of attention lately [2]--[9], [11]--[21], [26]. In the applications considered, the dimension of the ambient space is usually high, and predictably, classical lines of attack based on space partitioning fail. To overcome the wellknown "curse of dimensionality," it is typical to relax the s...

