Results 1 - 10
of
46
Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions
, 2008
"... In this article, we give an overview of efficient algorithms for the approximate and exact nearest neighbor problem. The goal is to preprocess a dataset of objects (e.g., images) so that later, given a new query object, one can quickly return the dataset object that is most similar to the query. The ..."
Abstract
-
Cited by 131 (1 self)
- Add to MetaCart
In this article, we give an overview of efficient algorithms for the approximate and exact nearest neighbor problem. The goal is to preprocess a dataset of objects (e.g., images) so that later, given a new query object, one can quickly return the dataset object that is most similar to the query. The problem is of significant interest in a wide variety of areas.
Nearest-neighbor searching and metric space dimensions
- In Nearest-Neighbor Methods for Learning and Vision: Theory and Practice
, 2006
"... Given a set S of n sites (points), and a distance measure d, the nearest neighbor searching problem is to build a data structure so that given a query point q, the site nearest to q can be found quickly. This paper gives a data structure for this problem; the data structure is built using the distan ..."
Abstract
-
Cited by 63 (0 self)
- Add to MetaCart
Given a set S of n sites (points), and a distance measure d, the nearest neighbor searching problem is to build a data structure so that given a query point q, the site nearest to q can be found quickly. This paper gives a data structure for this problem; the data structure is built using the distance function as a “black box”. The structure is able to speed up nearest neighbor searching in a variety of settings, for example: points in low-dimensional or structured Euclidean space, strings under Hamming and edit distance, and bit vector data from an OCR application. The data structures are observed to need linear space, with a modest constant factor. The preprocessing time needed per site is observed to match the query time. The data structure can be viewed as an application of a “kd-tree ” approach in the metric space setting, using Voronoi regions of a subset in place of axis-aligned boxes. 1
Entropy based Nearest Neighbor Search in High Dimensions
, 2005
"... In this paper we study the problem of finding the approximate nearest neighbor of a query point in the high dimensional space, focusing on the Euclidean space. The earlier approaches use locality-preserving hash functions (that tend to map nearby points to the same value) to construct several hash t ..."
Abstract
-
Cited by 19 (5 self)
- Add to MetaCart
In this paper we study the problem of finding the approximate nearest neighbor of a query point in the high dimensional space, focusing on the Euclidean space. The earlier approaches use locality-preserving hash functions (that tend to map nearby points to the same value) to construct several hash tables to ensure that the query point hashes to the same bucket as its nearest neighbor in at least one table. Our approach is different – we use one (or a few) hash table and hash several randomly chosen points in the neighborhood of the query point showing that at least one of them will hash to the bucket containing its nearest neighbor. We show that the number of randomly chosen points in the neighborhood of the query point q required depends on the entropy of the hash value h(p) of a random point p at the same distance from q at its nearest neighbor, given q and the locality preserving hash function h chosen randomly from the hash family. Precisely, we show that if the entropy I(h(p)|q, h) = M and g is a bound on the probability that two far-off points will hash to the same bucket, then we can find the approximate nearest neighbor in O(nρ) time and near linear Õ(n) space where ρ = M/log(1/g). Alternatively we can build a data structure of size Õ(n1/(1−ρ) ) to answer queries in Õ(d) time. By applying this analysis to the locality preserving hash functions in [17, 21, 6] and adjusting the parameters we show that the c nearest neighbor can be computed in time Õ(nρ) and near linear space where ρ ≈ 2.06/c as c becomes large.
Efficient mean-shift tracking via a new similarity measure
- IEEE Computer Society
, 2005
"... The mean shift algorithm has achieved considerable success in object tracking due to its simplicity and robustness. It finds local minima of a similarity measure between the color histograms or kernel density estimates of the model and target image. The most typically used similarity measures are th ..."
Abstract
-
Cited by 17 (2 self)
- Add to MetaCart
The mean shift algorithm has achieved considerable success in object tracking due to its simplicity and robustness. It finds local minima of a similarity measure between the color histograms or kernel density estimates of the model and target image. The most typically used similarity measures are the Bhattacharyya coefficient or the Kullback-Leibler divergence. In practice, these approaches face three difficulties. First, the spatial information of the target is lost when the color histogram is employed, which precludes the application of more elaborate motion models. Second, the classical similarity measures are not very discriminative. Third, the sample-based classical similarity measures require a calculation that is quadratic in the number of samples, making real-time performance difficult. To deal with these difficulties we propose a new, simple-tocompute and more discriminative similarity measure in spatial-feature spaces. The new similarity measure allows the mean shift algorithm to track more general motion models in an integrated way. To reduce the complexity of the computation to linear order we employ the recently proposed improved fast Gauss transform. This leads to a very efficient and robust nonparametric spatial-feature tracking algorithm. The algorithm is tested on several image sequences and shown to achieve robust and reliable frame-rate tracking. Keywords: Mean-shift algorithm, Improved fast Gauss transform, Similarity measure, Spatial-feature tracking.
On the optimality of the dimensionality reduction method
- in Proc. 47th IEEE Symposium on Foundations of Computer Science (FOCS
"... We investigate the optimality of (1+ɛ)-approximation algorithms obtained via the dimensionality reduction method. We show that: • Any data structure for the (1 + ɛ)-approximate nearest neighbor problem in Hamming space, which uses constant number of probes to answer each query, must use n Ω(1/ɛ2) sp ..."
Abstract
-
Cited by 13 (4 self)
- Add to MetaCart
We investigate the optimality of (1+ɛ)-approximation algorithms obtained via the dimensionality reduction method. We show that: • Any data structure for the (1 + ɛ)-approximate nearest neighbor problem in Hamming space, which uses constant number of probes to answer each query, must use n Ω(1/ɛ2) space. • Any algorithm for the (1+ɛ)-approximate closest substring problem must run in time exponential in 1/ɛ 2−γ for any γ> 0 (unless 3SAT can be solved in subexponential time) Both lower bounds are (essentially) tight. 1.
Spotsigs: robust and efficient near duplicate detection in large web collections
- In SIGIR ’08: Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval
, 2008
"... Motivated by our work with political scientists who need to manually analyze large Web archives of news sites, we present SpotSigs, a new algorithm for extracting and matching signatures for near duplicate detection in large Web crawls. Our spot signatures are designed to favor naturallanguage porti ..."
Abstract
-
Cited by 11 (0 self)
- Add to MetaCart
Motivated by our work with political scientists who need to manually analyze large Web archives of news sites, we present SpotSigs, a new algorithm for extracting and matching signatures for near duplicate detection in large Web crawls. Our spot signatures are designed to favor naturallanguage portions of Web pages over advertisements and navigational bars. The contributions of SpotSigs are twofold: 1) by combining stopword antecedents with short chains of adjacent content terms, we create robust document signatures with a natural ability to filter out noisy components of Web pages that would otherwise distract pure n-gram-based approaches such as Shingling; 2) we provide an exact and efficient, selftuning matching algorithm that exploits a novel combination of collection partitioning and inverted index pruning for high-dimensional similarity search. Experiments confirm an increase in combined precision and recall of more than 24 percent over state-of-the-art approaches such as Shingling or I-Match and up to a factor of 3 faster execution times than Locality Sensitive Hashing (LSH), over a demonstrative “Gold Set ” of manually assessed near-duplicate news articles as well as the TREC WT10g Web collection.
Ramsey partitions and proximity data structures
- J. Eur. Math. Soc. (JEMS
, 2006
"... This paper addresses the non-linear isomorphic Dvoretzky theorem and the design of good approximate distance oracles for large distortion. We introduce and construct optimal Ramsey partitions, and use them to show that for every ε ∈ (0, 1), any n-point metric space has a subset of size n 1−ε which e ..."
Abstract
-
Cited by 11 (2 self)
- Add to MetaCart
This paper addresses the non-linear isomorphic Dvoretzky theorem and the design of good approximate distance oracles for large distortion. We introduce and construct optimal Ramsey partitions, and use them to show that for every ε ∈ (0, 1), any n-point metric space has a subset of size n 1−ε which embeds into Hilbert space with distortion O(1/ε). This result is best possible and improves part of the metric Ramsey theorem of Bartal, Linial, Mendel and Naor [5], in addition to considerably simplifying its proof. We use our new Ramsey partitions to design approximate distance oracles with a universal constant query time, closing a gap left open by Thorup and Zwick in [26]. Namely, we show that for any n point metric space X, and k ≥ 1, there exists an O(k)-approximate distance oracle whose storage requirement is O � n 1+1/k � , and whose query time is a universal constant. We also discuss applications to various other geometric data structures, and the relation to well separated pair decompositions. 1.
Disorder inequality: A combinatorial approach to nearest neighbor search
- In WSDM’08
"... We say that an algorithm for nearest neighbor search is combinatorial if only direct comparisons between two pairwise similarity values are allowed. Combinatorial algorithms for nearest neighbor search have two important advantages: (1) they do not map similarity values to artificial distance values ..."
Abstract
-
Cited by 11 (4 self)
- Add to MetaCart
We say that an algorithm for nearest neighbor search is combinatorial if only direct comparisons between two pairwise similarity values are allowed. Combinatorial algorithms for nearest neighbor search have two important advantages: (1) they do not map similarity values to artificial distance values and do not use the triangle inequality for the latter, and (2) they work for arbitrarily complicated data representations and similarity functions. In this paper we introduce a special property of the similarity function on a set S that leads to efficient combinatorial algorithms for S. The disorder constant D(S) of a set S is defined to ensure the following inequality: if x is the a’th most similar object to z and y is the b’th most similar object to z, then x is among the D(S) · (a + b) most similar objects to y. Assuming that disorder is small we present the first two known combinatorial algorithms for nearest neighbors whose query time has logarithmic dependence on the size of S. The first one, called Ranwalk, is a randomized zero-error algorithm that always returns the exact nearest neighbor. It uses space quadratic in the input size in preprocessing, but is very efficient in query processing. The second algorithm, called Arwalk, uses near-linear space. It uses random choices in preprocessing, but the query processing is essentially deterministic. For an arbitrary query q, there is only a small probability that the chosen data structure does not support q. Finally, we show that for the Reuters corpus average disorder is indeed quite small and that Ranwalk efficiently computes the nearest neighbor in most cases.
Quantitative analysis of nearest-neighbors search in high-dimensional sampling-based motion planning
- in Workshop on Algo. Found. of Robot
, 2006
"... Abstract: We quantitatively analyze the performance of exact and approximate nearest-neighbors algorithms on increasingly high-dimensional problems in the context of sampling-based motion planning. We study the impact of the dimension, number of samples, distance metrics, and sampling schemes on the ..."
Abstract
-
Cited by 8 (2 self)
- Add to MetaCart
Abstract: We quantitatively analyze the performance of exact and approximate nearest-neighbors algorithms on increasingly high-dimensional problems in the context of sampling-based motion planning. We study the impact of the dimension, number of samples, distance metrics, and sampling schemes on the efficiency and accuracy of nearest-neighbors algorithms. Efficiency measures computation time and accuracy indicates similarity between exact and approximate nearest neighbors. Our analysis indicates that after a critical dimension, which varies between 15 and 30, exact nearest-neighbors algorithms examine almost all the samples. As a result, exact nearest-neighbors algorithms become impractical for sampling-based motion planners when a considerably large number of samples needs to be generated. The impracticality of exact nearest-neighbors algorithms motivates the use of approximate algorithms, which trade off accuracy for efficiency. We propose a simple algorithm, termed Distance-based Projection onto Euclidean Space (DPES), which computes approximate nearest neighbors by using a distance-based projection of high-dimensional metric spaces onto low-dimensional Euclidean spaces. Our results indicate DPES achieves high efficiency and only a negligible loss in accuracy. 1
A.: High-Dimensional Feature Matching: Employing the Concept of Meaningful Nearest Neighbors
- In: ICCV (2007
"... Matching of high-dimensional features using nearest neighbors search is an important part of image matching methods which are based on local invariant features. In this work we highlight effects pertinent to high-dimensional spaces that are significant for matching, yet have not been explicitly acco ..."
Abstract
-
Cited by 8 (1 self)
- Add to MetaCart
Matching of high-dimensional features using nearest neighbors search is an important part of image matching methods which are based on local invariant features. In this work we highlight effects pertinent to high-dimensional spaces that are significant for matching, yet have not been explicitly accounted for in previous work. In our approach, we require every nearest neighbor to be meaningful, that is, sufficiently close to a query feature such that it is an outlier to a background feature distribution. We estimate the background feature distribution from the extended neighborhood of a query feature given by its k nearest neighbors. Based on the concept of meaningful nearest neighbors, we develop a novel high-dimensional feature matching method and evaluate its performance by conducting image matching on two challenging image data sets. A superior performance in terms of accuracy is shown in comparison to several state-of-the-art approaches. Additionally, to make search for k nearest neighbors more efficient, we develop a novel approximate nearest neighbors search method based on sparse coding with an overcomplete basis set that provides a ten-fold speed-up over an exhaustive search even for high dimensional spaces and retains excellent approximation to an exact nearest neighbors search. 1.

