Results 1–10 of 51
Multi-Probe LSH: Efficient indexing for high-dimensional similarity search
 in Proc. 33rd Int. Conf. Very Large Data Bases
Abstract

Cited by 117 (3 self)
Similarity indices for high-dimensional data are very desirable for building content-based search systems for feature-rich data such as audio, images, videos, and other sensor data. Recently, locality sensitive hashing (LSH) and its variations have been proposed as indexing techniques for approximate similarity search. A significant drawback of these approaches is the requirement for a large number of hash tables in order to achieve good search quality. This paper proposes a new indexing scheme called multi-probe LSH that overcomes this drawback. Multi-probe LSH is built on the well-known LSH technique, but it intelligently probes multiple buckets that are likely to contain query results in a hash table. Our method is inspired by and improves upon recent theoretical work on entropy-based LSH designed to reduce the space requirement of the basic LSH method. We have implemented the multi-probe LSH method and evaluated the implementation with two different high-dimensional datasets. Our evaluation shows that the multi-probe LSH method substantially improves upon previously proposed methods in both space and time efficiency. To achieve the same search quality, multi-probe LSH has a time efficiency similar to that of the basic LSH method while reducing the number of hash tables by an order of magnitude. In comparison with the entropy-based LSH method, to achieve the same search quality, multi-probe LSH uses less query time and 5 to 8 times fewer hash tables.
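The bucket-probing idea from this abstract can be sketched in a few lines. The following is an illustrative Python sketch, not the paper's implementation: it uses the standard p-stable (Gaussian projection) hash for L2 distance and simply enumerates small ±1 perturbations of the query's hash key, whereas the paper derives a probing sequence ordered by estimated success probability. All names here are this sketch's own.

```python
import itertools
import random

random.seed(0)

def make_hash(dim, k, w=4.0):
    """Sample one LSH function g = (h_1, ..., h_k) for L2 distance:
    h_i(v) = floor((a_i . v + b_i) / w), the standard p-stable scheme."""
    a = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(k)]
    b = [random.uniform(0, w) for _ in range(k)]
    def g(v):
        return tuple(
            int((sum(ai * vi for ai, vi in zip(a[i], v)) + b[i]) // w)
            for i in range(k)
        )
    return g

class MultiProbeLSH:
    """One hash table; at query time, probe the query's own bucket plus
    nearby buckets obtained by perturbing hash coordinates by +/-1."""
    def __init__(self, dim, k=4):
        self.g = make_hash(dim, k)
        self.k = k
        self.table = {}

    def insert(self, v):
        self.table.setdefault(self.g(v), []).append(v)

    def query(self, q, max_flips=1):
        key = self.g(q)
        hits = []
        # enumerate perturbations touching at most max_flips coordinates
        for deltas in itertools.product((-1, 0, 1), repeat=self.k):
            if sum(d != 0 for d in deltas) > max_flips:
                continue
            probe = tuple(c + d for c, d in zip(key, deltas))
            hits.extend(self.table.get(probe, []))
        return hits
```

Raising `max_flips` trades query time for recall within a single table, which is exactly how multi-probing substitutes for extra hash tables.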
Modeling LSH for Performance Tuning
Abstract

Cited by 42 (1 self)
Although Locality-Sensitive Hashing (LSH) is a promising approach to similarity search in high-dimensional spaces, it has not been considered practical, partly because its search quality is sensitive to several parameters that are quite data dependent. Previous research on LSH, though it obtained interesting asymptotic results, provides little guidance on how these parameters should be chosen, and tuning parameters for a given dataset remains a tedious process. To address this problem, we present a statistical performance model of multi-probe LSH, a state-of-the-art variant of LSH. Our model can accurately predict the average search quality and latency given a small sample dataset. Apart from automatic parameter tuning with the performance model, we also use the model to devise an adaptive LSH search algorithm that determines the probing parameter dynamically for each query. The adaptive probing method addresses the problem that even when the average performance is tuned to be optimal, the variance of the performance is extremely high. We experimented with three different datasets, including audio, images, and 3D shapes, to evaluate our methods. The results show the accuracy of the proposed model: the predicted recall errors are within 5% of the real values in most cases, and the adaptive search method reduces the standard deviation of recall by about 50% over the existing method.
A posteriori multi-probe locality sensitive hashing
 ACM Multimedia
, 2008
Abstract

Cited by 40 (11 self)
Efficient high-dimensional similarity search structures are essential for building scalable content-based search systems on feature-rich multimedia data. In the last decade, Locality Sensitive Hashing (LSH) has been proposed as an indexing technique for approximate similarity search. Among the most recent variations of LSH, multi-probe LSH techniques have been shown to overcome the over-linear space cost of common LSH. Multi-probe LSH is built on the well-known LSH technique, but it intelligently probes multiple buckets that are likely to contain query results in a hash table. Our method is inspired by previous work on probabilistic similarity search structures and improves upon recent theoretical work on multi-probe and query-adaptive LSH. Whereas these methods are based on likelihood criteria that a given bucket contains query results, we define a more reliable a posteriori model that takes into account prior knowledge about the queries and the searched objects. This prior knowledge allows better quality control of the search and a more accurate selection of the most probable buckets. We implemented a nearest neighbor search based on this paradigm and performed experiments on several real visual feature datasets. We show that our a posteriori scheme outperforms other multi-probe LSH methods while offering better quality control. Comparisons with the basic LSH technique show that our method yields consistent improvements in both space and time efficiency.
Quality and Efficiency in High Dimensional Nearest Neighbor Search
Abstract

Cited by 32 (1 self)
Nearest neighbor (NN) search in high-dimensional space is an important problem in many applications. Ideally, a practical solution (i) should be implementable in a relational database, and (ii) its query cost should grow sublinearly with the dataset size, regardless of the data and query distributions. Despite the bulk of NN literature, no solution fulfills both requirements, except locality sensitive hashing (LSH). The existing LSH implementations are either rigorous or ad hoc. Rigorous-LSH ensures good quality of query results, but requires expensive space and query cost. Although ad-hoc-LSH is more efficient, it abandons quality control, i.e., the neighbor it outputs can be arbitrarily bad. As a result, currently no method is able to ensure both quality and efficiency simultaneously in practice. Motivated by this, we propose a new access method called the locality sensitive B-tree (LSB-tree) that enables fast high-dimensional NN search with excellent quality. The combination of several LSB-trees leads to a structure called the LSB-forest that ensures the same result quality as rigorous-LSH, but reduces its space and query cost dramatically. The LSB-forest also outperforms ad-hoc-LSH, even though the latter has no quality guarantee. Besides its appealing theoretical properties, the LSB-tree itself also serves as an effective index that consumes linear space and supports efficient updates. Our extensive experiments confirm that the LSB-tree is faster than (i) the state of the art of exact NN search by two orders of magnitude, and (ii) the best (linear-space) method of approximate retrieval by an order of magnitude, while at the same time returning neighbors of much better quality.
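The core indexing trick behind this line of work — turning each point's vector of LSH hash values into a single one-dimensional key that a B-tree can store — can be sketched as follows. This is an illustrative Python sketch only: a sorted list stands in for the relational B-tree, and `g`, a caller-supplied hash function mapping a point to a small tuple of non-negative integers, is an assumption of the sketch rather than the paper's exact construction.

```python
from bisect import bisect_left, insort

def zorder(coords, bits=8):
    """Interleave the bits of non-negative integer coordinates into one
    Morton (Z-order) key, so nearby hash vectors tend to get nearby keys."""
    key = 0
    for bit in range(bits):
        for i, c in enumerate(coords):
            key |= ((c >> bit) & 1) << (bit * len(coords) + i)
    return key

class LSBSketch:
    """Toy stand-in for one tree: store (zorder(g(p)), p) pairs sorted by
    key and answer a query by scanning outward from the query's position."""
    def __init__(self, g):
        self.g = g          # hash function: point -> tuple of small ints
        self.entries = []   # kept sorted by Morton key

    def insert(self, p):
        # id(p) breaks key ties so the points themselves are never compared
        insort(self.entries, (zorder(self.g(p)), id(p), p))

    def candidates(self, q, budget=8):
        pos = bisect_left(self.entries, (zorder(self.g(q)),))
        lo, hi = pos - 1, pos
        out = []
        # alternate between the two scan directions until the budget is met
        while len(out) < budget and (lo >= 0 or hi < len(self.entries)):
            if hi < len(self.entries):
                out.append(self.entries[hi][2]); hi += 1
            if lo >= 0 and len(out) < budget:
                out.append(self.entries[lo][2]); lo -= 1
        return out
```

Because the key is one-dimensional and ordered, inserts, deletes, and range scans all inherit the B-tree's usual guarantees, which is what makes the approach implementable inside a relational database.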
Complementary hashing for approximate nearest neighbor search
 In Proc. ICCV
, 2011
Abstract

Cited by 31 (11 self)
Recently, hashing-based Approximate Nearest Neighbor (ANN) techniques have been attracting much attention in computer vision. Data-dependent hashing methods, e.g., Spectral Hashing, are expected to perform better than data-blind counterparts, e.g., Locality Sensitive Hashing (LSH). However, most data-dependent hashing methods employ only a single hash table. When higher recall is desired, they have to retrieve an exponentially growing number of hash buckets around the bucket containing the query, which can drag down precision rapidly. In this paper, we propose a so-called complementary hashing approach, which is able to balance precision and recall in a more effective way. The key idea is to employ multiple complementary hash tables, learned sequentially in a boosting manner, so that, given a query, its true nearest neighbors missed from the active bucket of one hash table are more likely to be found in the active bucket of the next hash table. Compared with LSH, which can also exploit multiple hash tables, our approach is more effective at finding true NNs, thanks to the complementarity of its hash tables. Experimental results on large-scale ANN search show that the proposed method significantly improves performance and outperforms the state of the art.
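The boosting-style sequential selection can be illustrated with a deliberately simplified toy: instead of learning each table as the paper does, the sketch below greedily picks, from a pool of random-hyperplane tables, the one that collides the most (query, true-neighbor) pairs still missed by the tables chosen so far. The pool-of-random-candidates device and all names are assumptions of this sketch, not the paper's method.

```python
import random

random.seed(1)

def random_table(dim, k=6):
    """One random-hyperplane hash: k sign bits (a data-blind LSH table)."""
    planes = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(k)]
    def h(v):
        return tuple(sum(p[i] * v[i] for i in range(dim)) > 0 for p in planes)
    return h

def pick_complementary_tables(pairs, dim, n_tables=4, pool=20):
    """Greedy stand-in for complementary table construction: each new
    table is chosen to cover the (query, neighbor) pairs that every
    previously chosen table still separates into different buckets."""
    missed = set(range(len(pairs)))
    chosen = []
    for _ in range(n_tables):
        pool_tables = [random_table(dim) for _ in range(pool)]
        best = max(pool_tables,
                   key=lambda h: sum(h(pairs[i][0]) == h(pairs[i][1])
                                     for i in missed))
        chosen.append(best)
        # a pair stays "missed" only if the new table also separates it
        missed = {i for i in missed
                  if best(pairs[i][0]) != best(pairs[i][1])}
    return chosen, missed
```

The point of the exercise is the shrinking `missed` set: later tables are scored only on what earlier tables failed to cover, which is the complementarity property the abstract describes.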
Lower bounds on Locality Sensitive Hashing
, 2006
Abstract

Cited by 28 (3 self)
Given a metric space (X, d_X), c ≥ 1, r > 0, and p, q ∈ [0, 1], a distribution over mappings H: X → N is called an (r, cr, p, q)-sensitive hash family if any two points in X at distance at most r are mapped by H to the same value with probability at least p, and any two points at distance greater than cr are mapped by H to the same value with probability at most q. This notion was introduced by Indyk and Motwani in 1998 as the basis for an efficient approximate nearest neighbor search algorithm, and has since been used extensively for this purpose. The performance of these algorithms is governed by the parameter ρ = log(1/p)/log(1/q), and constructing hash families with small ρ automatically yields improved nearest neighbor algorithms. Here we show that for X = ℓ1 it is impossible to achieve ρ ≤ 1/(2c). This almost matches the construction of Indyk and Motwani, which achieves ρ ≤ 1/c.
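The definition and the role of ρ are easy to check empirically for the classic bit-sampling family on the Hamming cube (which embeds isometrically into ℓ1). The sketch below estimates p, q, and ρ = log(1/p)/log(1/q) by sampling; for this family the collision probability is exactly 1 − dist(x, y)/d, and the resulting ρ sits between the 1/(2c) impossibility threshold and the 1/c achievable bound from the abstract. The parameter values are this sketch's own choices.

```python
import math
import random

random.seed(2)

def hamming_lsh(d):
    """The classic bit-sampling family for Hamming space:
    H(x) = x[i] for a uniformly random coordinate i."""
    i = random.randrange(d)
    return lambda x: x[i]

def collision_rate(x, y, d, trials=100000):
    """Estimate Pr[H(x) = H(y)] by sampling fresh hash functions."""
    hits = 0
    for _ in range(trials):
        h = hamming_lsh(d)          # same function applied to both points
        hits += h(x) == h(y)
    return hits / trials

d, r, c = 32, 4, 2
x = [0] * d
near = [1] * r + [0] * (d - r)            # at distance r from x
far = [1] * (c * r) + [0] * (d - c * r)   # at distance cr from x
p = collision_rate(x, near, d)            # close to 1 - r/d  = 0.875
q = collision_rate(x, far, d)             # close to 1 - cr/d = 0.75
rho = math.log(1 / p) / math.log(1 / q)
```

For these values the exact ρ is log(1/0.875)/log(1/0.75) ≈ 0.46, strictly between 1/(2c) = 0.25 and 1/c = 0.5, consistent with both the lower bound and the Indyk–Motwani construction.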
Nearest Neighbor Retrieval Using Distance-Based Hashing
Abstract

Cited by 27 (1 self)
Abstract — A method is proposed for indexing spaces with arbitrary distance measures, so as to achieve efficient approximate nearest neighbor retrieval. Hashing methods, such as Locality Sensitive Hashing (LSH), have been successfully applied for similarity indexing in vector spaces and string spaces under the Hamming distance. The key novelty of the hashing technique proposed here is that it can be applied to spaces with arbitrary distance measures, including non-metric distance measures. First, we describe a domain-independent method for constructing a family of binary hash functions. Then, we use these functions to construct multiple multi-bit hash tables. We show that the LSH formalism is not applicable for analyzing the behavior of these tables as index structures. We present a novel formulation that uses statistical observations from sample data to analyze retrieval accuracy and efficiency for the proposed indexing method. Experiments on several real-world data sets demonstrate that our method produces good trade-offs between accuracy and efficiency, and significantly outperforms VP-trees, which are a well-known method for distance-based indexing.
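A distance-only construction of binary hash functions can be sketched as follows. The projection is the pseudo-line (FastMap-style) projection through two pivot objects, which needs nothing but pairwise distances; the thresholding here is simplified to a sample median, whereas the paper selects thresholds more carefully, so treat this as an illustrative sketch, with `euclid` as one example distance.

```python
import random

random.seed(3)

def line_projection(d, x1, x2):
    """Project x onto the pseudo-line through pivots x1, x2 using only
    distances: F(x) = (d(x,x1)^2 + d(x1,x2)^2 - d(x,x2)^2) / (2 d(x1,x2))."""
    d12 = d(x1, x2)
    return lambda x: (d(x, x1)**2 + d12**2 - d(x, x2)**2) / (2 * d12)

def make_binary_hash(d, sample):
    """Build one binary hash function from two random pivots, thresholding
    at the sample median so the bit is roughly balanced on the data."""
    x1, x2 = random.sample(sample, 2)
    f = line_projection(d, x1, x2)
    vals = sorted(f(x) for x in sample)
    t = vals[len(vals) // 2]
    return lambda x: f(x) <= t

def euclid(a, b):
    """Example distance; any (even non-metric) measure could be plugged in."""
    return sum((ai - bi)**2 for ai, bi in zip(a, b)) ** 0.5
```

Concatenating several such bits yields one multi-bit hash table, and repeating the construction yields the multiple tables the abstract describes; since only `d(·,·)` is ever called, the same code works for arbitrary distance measures.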
Fast and accurate k-means for large datasets.
 In NIPS*24,
, 2011
Abstract

Cited by 23 (0 self)
Clustering is a popular problem with many applications. We consider the k-means problem in the situation where the data is too large to be stored in main memory and must be accessed sequentially, such as from a disk, and where we must use as little memory as possible. Our algorithm is based on recent theoretical results, with significant improvements to make it practical. Our approach greatly simplifies a recently developed algorithm, both in design and in analysis, and eliminates large constant factors in the approximation guarantee, the memory requirements, and the running time. We then incorporate approximate nearest neighbor search to compute k-means in o(nk) time (where n is the number of data points; note that computing the cost of a given solution takes Θ(nk) time). We show that our algorithm compares favorably to existing algorithms both theoretically and experimentally, thus providing state-of-the-art performance in both theory and practice.
A geometric approach to lower bounds for approximate nearneighbor search and partial match
 In Proc. 49th IEEE Symposium on Foundations of Computer Science (FOCS)
, 2008
Abstract

Cited by 16 (2 self)
This work investigates a geometric approach to proving cell probe lower bounds for data structure problems. We consider the approximate nearest neighbor search problem on the Boolean hypercube ({0, 1}^d, ‖·‖_1) with d = Θ(log n). We show that any (randomized) data structure for the problem that answers c-approximate nearest neighbor search queries using t probes must use space at least n^(1+Ω(1/ct)). In particular, our bound implies that any data structure that uses space Õ(n) with polylogarithmic word size, and that with constant probability gives a constant approximation to nearest neighbor search queries, must be probed Ω(log n / log log n) times. This improves on the lower bound of Ω(log log d / log log log d) probes shown by Chakrabarti and Regev [8] for any polynomial-space data structure, and the Ω(log log d) lower bound of Pătraşcu and Thorup [26] for linear-space data structures. Our lower bound holds for the near neighbor problem, where the algorithm knows in advance a good approximation to the distance to the nearest neighbor. Additionally, it is an average-case lower bound for the natural distribution for the problem. Our approach also gives the same bound for (2 − 1/c)-approximation to the farthest neighbor problem. For the case of non-adaptive algorithms we can improve the bound slightly and show an Ω(log n) lower bound on the time complexity of data structures with O(n) space and logarithmic word size. We also show similar lower bounds for the partial match problem: any randomized t-probe data structure that solves the partial match problem on {0, 1, ⋆}^d for d = Θ(log n) must use space n^(1+Ω(1/t)). This implies an Ω(log n / log log n) lower bound for the time complexity of near-linear-space data structures, slightly improving the Ω(log n/(log log n)^2) lower bound from [25], [16] for this range of d. Recently and independently, Pătraşcu achieved similar bounds [24]. Our results also generalize to approximate partial match, improving on the bounds of [4, 25].
Approximate Nearest Neighbor: Towards Removing the Curse of Dimensionality
, 2012
Abstract

Cited by 16 (5 self)
We present two algorithms for the approximate nearest neighbor problem in high-dimensional spaces. For data sets of size n living in R^d, the algorithms require space that is only polynomial in n and d, while achieving query times that are sublinear in n and polynomial in d. We also show applications to other high-dimensional geometric problems, such as the approximate minimum spanning tree. The article is based on the material from the authors' STOC'98 and FOCS'01 papers. It unifies, generalizes and simplifies the results from those papers.