Results 1  10
of
57
Nearoptimal hashing algorithms for approximate nearest neighbor in high dimensions
, 2008
"... In this article, we give an overview of efficient algorithms for the approximate and exact nearest neighbor problem. The goal is to preprocess a dataset of objects (e.g., images) so that later, given a new query object, one can quickly return the dataset object that is most similar to the query. The ..."
Abstract

Cited by 242 (4 self)
 Add to MetaCart
In this article, we give an overview of efficient algorithms for the approximate and exact nearest neighbor problem. The goal is to preprocess a dataset of objects (e.g., images) so that later, given a new query object, one can quickly return the dataset object that is most similar to the query. The problem is of significant interest in a wide variety of areas.
Nearestneighbor searching and metric space dimensions
 In NearestNeighbor Methods for Learning and Vision: Theory and Practice
, 2006
"... Given a set S of n sites (points), and a distance measure d, the nearest neighbor searching problem is to build a data structure so that given a query point q, the site nearest to q can be found quickly. This paper gives a data structure for this problem; the data structure is built using the distan ..."
Abstract

Cited by 87 (0 self)
 Add to MetaCart
Given a set S of n sites (points), and a distance measure d, the nearest neighbor searching problem is to build a data structure so that given a query point q, the site nearest to q can be found quickly. This paper gives a data structure for this problem; the data structure is built using the distance function as a “black box”. The structure is able to speed up nearest neighbor searching in a variety of settings, for example: points in lowdimensional or structured Euclidean space, strings under Hamming and edit distance, and bit vector data from an OCR application. The data structures are observed to need linear space, with a modest constant factor. The preprocessing time needed per site is observed to match the query time. The data structure can be viewed as an application of a “kdtree ” approach in the metric space setting, using Voronoi regions of a subset in place of axisaligned boxes. 1
Efficient meanshift tracking via a new similarity measure
 in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR ’05
, 2005
"... The mean shift algorithm has achieved considerable success in object tracking due to its simplicity and robustness. It finds local minima of a similarity measure between the color histograms or kernel density estimates of the model and target image. The most typically used similarity measures are th ..."
Abstract

Cited by 38 (4 self)
 Add to MetaCart
The mean shift algorithm has achieved considerable success in object tracking due to its simplicity and robustness. It finds local minima of a similarity measure between the color histograms or kernel density estimates of the model and target image. The most typically used similarity measures are the Bhattacharyya coefficient or the KullbackLeibler divergence. In practice, these approaches face three difficulties. First, the spatial information of the target is lost when the color histogram is employed, which precludes the application of more elaborate motion models. Second, the classical similarity measures are not very discriminative. Third, the samplebased classical similarity measures require a calculation that is quadratic in the number of samples, making realtime performance difficult. To deal with these difficulties we propose a new, simpletocompute and more discriminative similarity measure in spatialfeature spaces. The new similarity measure allows the mean shift algorithm to track more general motion models in an integrated way. To reduce the complexity of the computation to linear order we employ the recently proposed improved fast Gauss transform. This leads to a very efficient and robust nonparametric spatialfeature tracking algorithm. The algorithm is tested on several image sequences and shown to achieve robust and reliable framerate tracking.
Fast k nearest neighbor search using GPU
 2008 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops
, 2008
"... Statistical measures coming from information theory represent interesting bases for image and video processing tasks such as image retrieval and video object tracking. For example, let us mention the entropy and the KullbackLeibler divergence. Accurate estimation of these measures requires to adapt ..."
Abstract

Cited by 33 (3 self)
 Add to MetaCart
Statistical measures coming from information theory represent interesting bases for image and video processing tasks such as image retrieval and video object tracking. For example, let us mention the entropy and the KullbackLeibler divergence. Accurate estimation of these measures requires to adapt to the local sample density, especially if the data are highdimensional. The k nearest neighbor (kNN) framework has been used to define efficient variablebandwidth kernelbased estimators with such a locally adaptive property. Unfortunately, these estimators are computationally intensive since they rely on searching neighbors among large sets of ddimensional vectors. This computational burden can be reduced by prestructuring the data, e.g. using binary trees as proposed by the Approximated Nearest Neighbor (ANN) library. Yet, the recent opening of Graphics Processing Units (GPU) to generalpurpose computation by means of the NVIDIA CUDA API offers the image and video processing community a powerful platform with parallel calculation capabilities. In this paper, we propose a CUDA implementation of the “brute force ” kNN search and we compare its performances to several CPUbased implementations including an equivalent brute force algorithm and ANN. We show a speed increase on synthetic and real data by up to one or two orders of magnitude depending on the data, with a quasilinear behavior with respect to the data size in a given, practical range. 1.
Entropy based nearest neighbor search in high dimensions
 In SODA ’06: Proceedings of the seventeenth annual ACMSIAM Symposium on Discrete Algorithms
"... In this paper we study the problem of finding the approximate nearest neighbor of a query point in the high dimensional space, focusing on the Euclidean space. The earlier approaches use localitypreserving hash functions (that tend to map nearby points to the same value) to construct several hash ta ..."
Abstract

Cited by 27 (5 self)
 Add to MetaCart
In this paper we study the problem of finding the approximate nearest neighbor of a query point in the high dimensional space, focusing on the Euclidean space. The earlier approaches use localitypreserving hash functions (that tend to map nearby points to the same value) to construct several hash tables to ensure that the query point hashes to the same bucket as its nearest neighbor in at least one table. Our approach is different – we use one (or a few) hash table and hash several randomly chosen points in the neighborhood of the query point showing that at least one of them will hash to the bucket containing its nearest neighbor. We show that the number of randomly chosen points in the neighborhood of the query point q required depends on the entropy of the hash value h(p) of a random point p at the same distance from q at its nearest neighbor, given q and the locality preserving hash function h chosen randomly from the hash family. Precisely, we show that if the entropy I(h(p)q, h) = M and g is a bound on the probability that two faroff points will hash to the same bucket, then we can find the approximate nearest neighbor in O(nρ) time and near linear Õ(n) space where ρ = M / log(1/g). Alternatively we can build a data structure of size Õ(n1/(1−ρ)) to answer queries in Õ(d) time. By applying this analysis to the locality preserving hash functions in [15, 19, 6] and adjusting the parameters we show that the c nearest neighbor can be computed in time Õ(nρ) and near linear space where ρ ≈ 2.06/c as c becomes large. 1
Spotsigs: robust and efficient near duplicate detection in large web collections
 In SIGIR ’08: Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval
, 2008
"... Motivated by our work with political scientists who need to manually analyze large Web archives of news sites, we present SpotSigs, a new algorithm for extracting and matching signatures for near duplicate detection in large Web crawls. Our spot signatures are designed to favor naturallanguage porti ..."
Abstract

Cited by 24 (0 self)
 Add to MetaCart
Motivated by our work with political scientists who need to manually analyze large Web archives of news sites, we present SpotSigs, a new algorithm for extracting and matching signatures for near duplicate detection in large Web crawls. Our spot signatures are designed to favor naturallanguage portions of Web pages over advertisements and navigational bars. The contributions of SpotSigs are twofold: 1) by combining stopword antecedents with short chains of adjacent content terms, we create robust document signatures with a natural ability to filter out noisy components of Web pages that would otherwise distract pure ngrambased approaches such as Shingling; 2) we provide an exact and efficient, selftuning matching algorithm that exploits a novel combination of collection partitioning and inverted index pruning for highdimensional similarity search. Experiments confirm an increase in combined precision and recall of more than 24 percent over stateoftheart approaches such as Shingling or IMatch and up to a factor of 3 faster execution times than Locality Sensitive Hashing (LSH), over a demonstrative “Gold Set ” of manually assessed nearduplicate news articles as well as the TREC WT10g Web collection.
The fast JohnsonLindenstrauss transform and approximate nearest neighbors
 SIAM J. Comput
, 2009
"... Abstract. We introduce a new lowdistortion embedding of ℓd n) 2 into ℓO(log p (p =1, 2) called the fast Johnson–Lindenstrauss transform (FJLT). The FJLT is faster than standard random projections and just as easy to implement. It is based upon the preconditioning of a sparse projection matrix with ..."
Abstract

Cited by 22 (0 self)
 Add to MetaCart
Abstract. We introduce a new lowdistortion embedding of ℓd n) 2 into ℓO(log p (p =1, 2) called the fast Johnson–Lindenstrauss transform (FJLT). The FJLT is faster than standard random projections and just as easy to implement. It is based upon the preconditioning of a sparse projection matrix with a randomized Fourier transform. Sparse random projections are unsuitable for lowdistortion embeddings. We overcome this handicap by exploiting the “Heisenberg principle ” of the Fourier transform, i.e., its localglobal duality. The FJLT can be used to speed up search algorithms based on lowdistortion embeddings in ℓ1 and ℓ2. We consider the case of approximate nearest neighbors in ℓd 2. We provide a faster algorithm using classical projections, which we then speed up further by plugging in the FJLT. We also give a faster algorithm for searching over the hypercube.
On the optimality of the dimensionality reduction method
 in Proc. 47th IEEE Symposium on Foundations of Computer Science (FOCS
"... We investigate the optimality of (1+ɛ)approximation algorithms obtained via the dimensionality reduction method. We show that: • Any data structure for the (1 + ɛ)approximate nearest neighbor problem in Hamming space, which uses constant number of probes to answer each query, must use n Ω(1/ɛ2) sp ..."
Abstract

Cited by 20 (4 self)
 Add to MetaCart
We investigate the optimality of (1+ɛ)approximation algorithms obtained via the dimensionality reduction method. We show that: • Any data structure for the (1 + ɛ)approximate nearest neighbor problem in Hamming space, which uses constant number of probes to answer each query, must use n Ω(1/ɛ2) space. • Any algorithm for the (1+ɛ)approximate closest substring problem must run in time exponential in 1/ɛ 2−γ for any γ> 0 (unless 3SAT can be solved in subexponential time) Both lower bounds are (essentially) tight. 1.
Disorder inequality: A combinatorial approach to nearest neighbor search
 In WSDM’08
"... We say that an algorithm for nearest neighbor search is combinatorial if only direct comparisons between two pairwise similarity values are allowed. Combinatorial algorithms for nearest neighbor search have two important advantages: (1) they do not map similarity values to artificial distance values ..."
Abstract

Cited by 17 (4 self)
 Add to MetaCart
We say that an algorithm for nearest neighbor search is combinatorial if only direct comparisons between two pairwise similarity values are allowed. Combinatorial algorithms for nearest neighbor search have two important advantages: (1) they do not map similarity values to artificial distance values and do not use the triangle inequality for the latter, and (2) they work for arbitrarily complicated data representations and similarity functions. In this paper we introduce a special property of the similarity function on a set S that leads to efficient combinatorial algorithms for S. The disorder constant D(S) of a set S is defined to ensure the following inequality: if x is the a’th most similar object to z and y is the b’th most similar object to z, then x is among the D(S) · (a + b) most similar objects to y. Assuming that disorder is small we present the first two known combinatorial algorithms for nearest neighbors whose query time has logarithmic dependence on the size of S. The first one, called Ranwalk, is a randomized zeroerror algorithm that always returns the exact nearest neighbor. It uses space quadratic in the input size in preprocessing, but is very efficient in query processing. The second algorithm, called Arwalk, uses nearlinear space. It uses random choices in preprocessing, but the query processing is essentially deterministic. For an arbitrary query q, there is only a small probability that the chosen data structure does not support q. Finally, we show that for the Reuters corpus average disorder is indeed quite small and that Ranwalk efficiently computes the nearest neighbor in most cases.
Ramsey partitions and proximity data structures
 J. European Math. Soc 9
"... This paper addresses two problems lying at the intersection of geometric analysis and theoretical computer science: The nonlinear isomorphic Dvoretzky theorem and the design of good approximate distance oracles for large distortion. We introduce the notion of Ramsey partitions of a finite metric sp ..."
Abstract

Cited by 13 (2 self)
 Add to MetaCart
This paper addresses two problems lying at the intersection of geometric analysis and theoretical computer science: The nonlinear isomorphic Dvoretzky theorem and the design of good approximate distance oracles for large distortion. We introduce the notion of Ramsey partitions of a finite metric space, and show that the existence of good Ramsey partitions implies a solution to the metric Ramsey problem for large distortion (a.k.a. the nonlinear version of the isomorphic Dvoretzky theorem, as introduced by Bourgain, Figiel, and Milman in [8]). We then proceed to construct optimal Ramsey partitions, and use them to show that for everyε∈(0, 1), any npoint metric space has a subset of size n 1−ε which embeds into Hilbert space with distortion O(1/ε). This result is best possible and improves part of the metric Ramsey theorem of Bartal, Linial, Mendel and Naor [5], in addition to considerably simplifying its proof. We use our new Ramsey partitions to design the best known approximate distance oracles when the distortion is large, closing a gap left open by Thorup and Zwick in [31]. Namely, we show that for any n point metric space X, and k≥1, there exists an O(k)approximate distance oracle whose storage requirement is O ( n 1+1/k) , and whose query time is a universal constant. We also discuss applications of Ramsey partitions to various other geometric data structure problems, such as the design of efficient data structures for approximate ranking.