Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions
2008
Cited by 237 (4 self)
In this article, we give an overview of efficient algorithms for the approximate and exact nearest neighbor problem. The goal is to preprocess a dataset of objects (e.g., images) so that later, given a new query object, one can quickly return the dataset object that is most similar to the query. The problem is of significant interest in a wide variety of areas.
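The preprocess-then-query setting described in this abstract can be illustrated with a small sketch. It is not taken from the article: it assumes points are fixed-length lists of floats under Euclidean distance, and the names `LSHIndex`, `linear_scan`, and the parameters `num_tables`, `hashes_per_table`, and `width` are hypothetical choices made for illustration. The hash functions are random projections cut into buckets of a fixed width, one common way to build a locality-sensitive hash for Euclidean distance.

```python
import math
import random
from collections import defaultdict


def euclidean(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def linear_scan(points, query):
    """Exact baseline: compare the query against every dataset point."""
    return min(range(len(points)), key=lambda i: euclidean(points[i], query))


class LSHIndex:
    """Illustrative hash-based index; all parameter defaults are assumptions."""

    def __init__(self, dim, num_tables=8, hashes_per_table=4, width=4.0, seed=0):
        rng = random.Random(seed)
        self.width = width
        self.points = []
        self.tables = []
        for _ in range(num_tables):
            # Each hash function is a random Gaussian direction plus an offset.
            funcs = [
                ([rng.gauss(0.0, 1.0) for _ in range(dim)], rng.uniform(0.0, width))
                for _ in range(hashes_per_table)
            ]
            self.tables.append((funcs, defaultdict(list)))

    def _key(self, funcs, p):
        # Project onto each direction, shift, and cut into buckets of size width.
        return tuple(
            math.floor((sum(a * x for a, x in zip(direction, p)) + offset) / self.width)
            for direction, offset in funcs
        )

    def add(self, p):
        """Preprocessing step: store the point in one bucket per table."""
        idx = len(self.points)
        self.points.append(p)
        for funcs, buckets in self.tables:
            buckets[self._key(funcs, p)].append(idx)

    def query(self, q):
        """Return the index of the closest colliding point, or None."""
        candidates = set()
        for funcs, buckets in self.tables:
            candidates.update(buckets.get(self._key(funcs, q), []))
        if not candidates:
            return None
        return min(candidates, key=lambda i: euclidean(self.points[i], q))
```

Unlike `linear_scan`, `query` only measures distances to points that share a bucket with the query in at least one table, so it may miss the true nearest neighbor or return `None`; that is the approximation being traded for speed.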
Overcoming the ℓ1 non-embeddability barrier: Algorithms for product metrics
2008
Cited by 17 (8 self)
A common approach for solving computational problems over a difficult metric space is to embed the “hard” metric into L1, which admits efficient algorithms and is thus considered an “easy” metric. This approach has proved successful or partially successful for important spaces such as the edit distance, but it also has inherent limitations: it is provably impossible to go below a certain approximation factor for some metrics. We propose a new approach: embedding the difficult space into richer host spaces, namely iterated products of standard spaces like ℓ1 and ℓ∞. We show that this class is rich, since it contains useful metric spaces with only a constant distortion, and, at the same time, it is tractable and admits efficient algorithms. Using this approach, we obtain, for example, the first nearest neighbor data structure with O(log log d) approximation for edit distance in non-repetitive strings (the Ulam metric). This approximation is exponentially better than the lower bound for embedding into L1. Furthermore, we give constant-factor approximations for two other computational problems. Along the way, we answer positively a question posed in [Ajtai, Jayram, Kumar, and Sivakumar, STOC 2002]. One of our algorithms has already found applications for smoothed edit distance over 0-1 strings [Andoni and Krauthgamer, ICALP 2008].
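As a small illustration of what a product of standard spaces looks like, the standard ℓ∞-product of k copies of ℓ1 can be written as below. This is only the textbook definition of one such product; the symbols k, m, x_i, and y_i are notation assumed here, and the abstract does not say which particular iterated products the paper uses.

```latex
% Points are k-tuples x = (x_1, ..., x_k) with each block x_i in R^m.
% Inner metric: l_1 on each block. Outer combination: l_infinity over blocks.
\[
  d_{\ell_\infty(\ell_1)}(x, y)
    \;=\; \max_{1 \le i \le k} \, \lVert x_i - y_i \rVert_1
    \;=\; \max_{1 \le i \le k} \, \sum_{j=1}^{m} \lvert x_{i,j} - y_{i,j} \rvert .
\]
```

Iterating the construction means using such a product space itself as the inner space of another product.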
Nearest Neighbor Retrieval Using Distance-Based Hashing
Cited by 11 (1 self)
A method is proposed for indexing spaces with arbitrary distance measures, so as to achieve efficient approximate nearest neighbor retrieval. Hashing methods, such as Locality Sensitive Hashing (LSH), have been successfully applied for similarity indexing in vector spaces and string spaces under the Hamming distance. The key novelty of the hashing technique proposed here is that it can be applied to spaces with arbitrary distance measures, including non-metric distance measures. First, we describe a domain-independent method for constructing a family of binary hash functions. Then, we use these functions to construct multiple multi-bit hash tables. We show that the LSH formalism is not applicable for analyzing the behavior of these tables as index structures. We present a novel formulation that uses statistical observations from sample data to analyze retrieval accuracy and efficiency for the proposed indexing method. Experiments on several real-world data sets demonstrate that our method produces good trade-offs between accuracy and efficiency, and significantly outperforms VP-trees, which are a well-known method for distance-based indexing.
Nearest Neighbor Search Methods for Handshape Recognition
Cited by 4 (1 self)
Gestures are an important modality for human-machine communication, and robust gesture recognition can be an important component of intelligent homes and assistive environments in general. An important aspect of gestures is handshape. Handshapes can hold important information about the meaning of a gesture, for example in sign languages, or about the intent of an action, for example in manipulative gestures or in virtual reality interfaces. At the same time, recognizing handshape can be a very challenging task, because the same handshape can look very different in different images, depending on the 3D orientation of the hand and the viewpoint of the camera. In this paper we examine a database approach for handshape classification, whereby a large database of tens of thousands of images is used to represent the wide variability of handshape appearance. Efficient and accurate indexing methods are important in such a database approach, to ensure that the system can match every incoming image to the large number of database images at interactive times. In this paper we examine the use of embedding-based and hash table-based indexing methods for handshape recognition, and we experimentally compare these two approaches on the task of recognizing 20 handshapes commonly used in American Sign Language (ASL).
Fast algorithms for nearest neighbour search
2007
Cited by 2 (1 self)
The nearest neighbour problem is of practical significance in a number of fields. Often we are interested in finding an object near to a given query object. The problem is old, and a large number of solutions have been proposed for it in the literature. However, it remains the case that even the most popular of the techniques proposed for its solution have not been compared against each other. Also, many techniques, including the old and popular ones, can be implemented in a number of ways, and often the different implementations of a technique have not been thoroughly compared either. This research presents a detailed investigation of different implementations of two popular nearest neighbour search data structures, KD-Trees and Metric Trees, and compares the different implementations of each of the two structures against each other. The best implementations of these structures are then compared against each other and against two other techniques, Annulus Method and Cover Trees. Annulus Method is an old technique that was rediscovered during the research for this thesis. Cover Trees are one of the most novel and promising data structures for nearest neighbour search that have been proposed in the literature.
Acknowledgments: The continued support of the Department of Computer Science’s Machine Learning group, and particularly my supervisor Dr. Eibe Frank, is greatly appreciated; without it this thesis would not have been possible.
unknown title
Optimal lower bounds for locality sensitive hashing (except when q is tiny)
HIGH PERFORMANCE RECORD LINKAGE
In today's world, the immense size of data sets makes it difficult to find similar or identical data. In addition, the dirtiness of data (typos, missing/tilting information, and additional noise, usually caused by careless editing or entry mistakes) makes it even harder to identify which entity a record belongs to. Therefore, we focus on faster detection of data referring to the same real-world entity in a large data set in error-prone environments, while maintaining high detection accuracy. In this thesis, we study high-performance linkage algorithms using four different applications. First, we introduce the image linkage algorithm to find near-duplicate images with similar characteristics by bridging two seemingly unrelated fields, Multimedia Information Retrieval and Biology. Under this idea, we study how various image features and gene sequence generation methods affect the accuracy and performance of detecting near-duplicate images. Second, we develop the video linkage algorithm using record linkage methods to detect copied videos from a large multimedia database or sites such as YouTube and Yahoo Videos. The utilization of video characteristics is reflected in the hierarchical structure of
Reporting Neighbors in High-Dimensional Euclidean Space ∗
We consider the following problem, which arises in many database and web-based applications: Given a set $P$ of $n$ points in a high-dimensional space $\mathbb{R}^d$ and a distance $r$, we want to report all pairs of points of $P$ at Euclidean distance at most $r$. We present two randomized algorithms, one based on randomly shifted grids, and the other on randomly shifted and rotated grids. The running time of both algorithms is of the form $C(d)(n + k)\log n$, where $k$ is the output size and $C(d)$ is a constant that depends on the dimension $d$. The $\log n$ factor is needed to guarantee, with high probability, that all neighbor pairs are reported, and can be dropped if it suffices to report, in expectation, an arbitrarily large fraction of the pairs. When only translations are used, $C(d)$ is of the form $(a\sqrt{d})^d$, for some (small) absolute constant $a \approx 0.484$; this bound is worst-case tight, up to an exponential factor of about $2^d$. When both rotations and translations are used, $C(d)$ can be improved to roughly $6.74^d$, getting rid of the super-exponential factor $\sqrt{d}^{\,d}$. When the input set (lies in a subset of $d$-space that) has low doubling dimension $\delta$, the performance of the first algorithm improves to $C(d, \delta)(n + k)\log n$ (or to $C(d, \delta)(n + k)$), where $C(d, \delta) = O((ed/\delta)^{\delta})$, for $\delta \le \sqrt{d}$. Otherwise, $C(d, \delta) = O(e^{\sqrt{d}} \sqrt{d}^{\,\delta})$. We also present experimental results on several large datasets, demonstrating that our algorithms run significantly faster than all the leading existing algorithms for reporting neighbors.
∗Work by Haim Kaplan and Micha Sharir has been supported
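The randomly shifted grid idea mentioned in this abstract can be sketched roughly as follows. This is a simplified illustration, not the authors' algorithm: the cell side (`cell_factor * r`), the number of rounds, and the function names are assumptions made here. Each round shifts a grid by a random offset and compares only points that land in the same cell; a pair at distance at most r is missed in a round only if a cell boundary separates it, so repeating with fresh shifts recovers more of the pairs.

```python
import itertools
import math
import random
from collections import defaultdict


def brute_force_pairs(points, r):
    """Exact reference: check all pairs (quadratic time)."""
    return {
        (i, j)
        for i, j in itertools.combinations(range(len(points)), 2)
        if math.dist(points[i], points[j]) <= r
    }


def shifted_grid_pairs(points, r, cell_factor=4.0, rounds=10, seed=0):
    """Report pairs at distance <= r that share a cell in some shifted grid.

    Assumes a non-empty list of equal-length points and r > 0.
    cell_factor and rounds are illustrative values, not tuned constants.
    """
    rng = random.Random(seed)
    side = cell_factor * r
    dim = len(points[0])
    found = set()
    for _ in range(rounds):
        # One random translation of the grid per round.
        shift = [rng.uniform(0.0, side) for _ in range(dim)]
        cells = defaultdict(list)
        for idx, p in enumerate(points):
            key = tuple(math.floor((x + s) / side) for x, s in zip(p, shift))
            cells[key].append(idx)
        # Only points in the same cell are compared against each other.
        for members in cells.values():
            for i, j in itertools.combinations(members, 2):
                if math.dist(points[i], points[j]) <= r:
                    found.add((i, j))
    return found
```

Every reported pair is verified with an exact distance check, so the output never contains false positives; it can only miss pairs that were split by a cell boundary in every round.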