Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality
, 1998
Cited by 715 (33 self)
The nearest neighbor problem is the following: Given a set of n points $P = \{p_1, \ldots, p_n\}$ in some metric space X, preprocess P so as to efficiently answer queries which require finding the point in P closest to a query point $q \in X$. We focus on the particularly interesting case of the d-dimensional Euclidean space where $X = \mathbb{R}^d$ under some $l_p$ norm. Despite decades of effort, the current solutions are far from satisfactory; in fact, for large d, in theory or in practice, they provide little improvement over the brute-force algorithm which compares the query point to each data point. Of late, there has been some interest in the approximate nearest neighbors problem, which is: Find a point $p \in P$ that is an $\epsilon$-approximate nearest neighbor of the query q in that for all $p' \in P$, $d(p, q) \le (1 + \epsilon)\, d(p', q)$. We present two algorithmic results for the approximate version that significantly improve the known bounds: (a) preprocessing cost polynomial in n and d, and a trul...
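To make the definitions in this abstract concrete, here is a minimal Python sketch of the brute-force baseline and of the $\epsilon$-approximate condition. The function names and the tuple representation of points are assumptions made for illustration; this is not one of the paper's algorithms.

```python
def lp_distance(a, b, p=2.0):
    """l_p distance between two equal-length coordinate sequences."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


def brute_force_nn(points, q, p=2.0):
    """Exact nearest neighbor: compare q to every data point, O(n d) time."""
    return min(points, key=lambda pt: lp_distance(pt, q, p))


def is_eps_approx_nn(points, q, candidate, eps, p=2.0):
    """True if d(candidate, q) <= (1 + eps) * d(p', q) for every p' in points."""
    best = min(lp_distance(pt, q, p) for pt in points)
    return lp_distance(candidate, q, p) <= (1.0 + eps) * best
```

Any index that returns a point passing `is_eps_approx_nn` would satisfy the approximate version of the problem, even when that point is not the true nearest neighbor.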
Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions
, 2008
Cited by 237 (4 self)
In this article, we give an overview of efficient algorithms for the approximate and exact nearest neighbor problem. The goal is to preprocess a dataset of objects (e.g., images) so that later, given a new query object, one can quickly return the dataset object that is most similar to the query. The problem is of significant interest in a wide variety of areas.
Nearest Neighbors In High-Dimensional Spaces
, 2004
Cited by 76 (2 self)
In this chapter we consider the following problem: given a set P of points in a high-dimensional space, construct a data structure which, given any query point q, finds the point in P closest to q. This problem, called nearest neighbor search, is of significant importance to several areas of computer science, including pattern recognition, searching in multimedia data, vector compression [GG91], computational statistics [DW82], and data mining. Many of these applications involve data sets which are very large (e.g., a database containing Web documents could contain over one billion documents). Moreover, the dimensionality of the points is usually large as well (e.g., on the order of a few hundred). Therefore, it is crucial to design algorithms which scale well with the database size as well as with the dimension. The nearest-neighbor problem is an example of a large class of proximity problems, which, roughly speaking, are problems whose definitions involve the notion of...
Dictionary matching and indexing with errors and don’t cares
 In STOC ’04
, 2004
Cited by 50 (1 self)
This paper considers various flavors of the following online problem: preprocess a text or collection of strings, so that given a query string p, all matches of p with the text can be reported quickly. In this paper we consider matches in which a bounded number of mismatches are allowed, or in which a bounded number of “don’t care” characters are allowed. The specific problems we look at are: indexing, in which there is a single text t, and we seek locations where p matches a substring of t; dictionary queries, in which a collection of strings is given upfront, and we seek those strings which match p in their entirety; and dictionary matching, in which a collection of strings is given upfront, and we seek those substrings of a (long) p which match an original string in its entirety. These are all instances of an all-to-all matching problem, for which we provide a single solution. The performance bounds all have a similar character. For example, for the indexing problem with $n = |t|$ and $m = |p|$, the query time for k substitutions is $O(m + (c_1 \log n)^k / k! + \#\text{matches})$, with a data structure of size $O(n (c_2 \log n)^k / k!)$ and a preprocessing time of $O(n (c_2 \log n)^k / k!)$, where $c_1, c_2 > 1$ are constants. The deterministic preprocessing assumes a weakly nonuniform RAM model; this assumption is not needed if randomization is used in the preprocessing.
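As a point of comparison for the indexing problem described above, the following sketch scans the text directly, without any preprocessing, and reports where the pattern matches with at most k substitutions, optionally treating one pattern symbol as a don't-care. The function name and the `wildcard` parameter are hypothetical; this is a naive O(nm) baseline, not the paper's data structure.

```python
def mismatch_occurrences(text, pattern, k, wildcard=None):
    """Start positions where pattern matches text with at most k substitutions.

    Pattern positions equal to `wildcard` match any text character.
    Naive O(n m) scan with an early exit once k mismatches are exceeded.
    """
    n, m = len(text), len(pattern)
    hits = []
    for start in range(n - m + 1):
        errors = 0
        for j in range(m):
            c = pattern[j]
            if c != wildcard and c != text[start + j]:
                errors += 1
                if errors > k:
                    break
        else:
            # The inner loop finished without exceeding k mismatches.
            hits.append(start)
    return hits
```

An index of the kind the abstract describes aims to answer the same query in time that depends on m, k, and the number of matches rather than on the full text length n.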
A Practical q-Gram Index for Text Retrieval Allowing Errors
 CLEI Electronic Journal
, 1998
Cited by 33 (9 self)
We propose an indexing technique for approximate text searching, which is practical and powerful, and especially optimized for natural language text. Unlike other indices of this kind, it is able to retrieve any string that approximately matches the search pattern, not only words. Every text substring of a fixed length q is stored in the index, together with pointers to all the text positions where it appears. The search pattern is partitioned into pieces which are searched in the index, and all their occurrences in the text are verified for a complete match. To reduce space requirements, pointers to blocks instead of exact positions can be used, which increases querying costs. We design an algorithm to optimize the pattern partition into pieces so that the total number of verifications is minimized. This is especially well suited for natural language texts, and makes it possible to know in advance the expected cost of the search and the expected relevance of the query to the user. We show experi...
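The index-then-verify idea in this abstract can be sketched in a few lines of Python. The sketch makes several simplifying assumptions that are not from the paper: errors are substitutions only, the pattern pieces are the first k + 1 disjoint q-grams rather than an optimized partition, and exact positions (not blocks) are stored. It relies on the pigeonhole argument that k errors cannot touch all of k + 1 disjoint pieces, so at least one piece must occur unchanged.

```python
from collections import defaultdict


def build_qgram_index(text, q):
    """Map every length-q substring of text to the list of its start positions."""
    index = defaultdict(list)
    for i in range(len(text) - q + 1):
        index[text[i:i + q]].append(i)
    return index


def search_with_mismatches(text, index, q, pattern, k):
    """Start positions where pattern occurs in text with at most k substitutions."""
    m = len(pattern)
    if m < (k + 1) * q:
        raise ValueError("pattern too short for k + 1 disjoint q-grams")
    # Filtering: every exact occurrence of a piece proposes one candidate start.
    candidates = set()
    for piece in range(k + 1):
        offset = piece * q
        for pos in index.get(pattern[offset:offset + q], ()):
            start = pos - offset
            if 0 <= start <= len(text) - m:
                candidates.add(start)
    # Verification: check each candidate window against the whole pattern.
    hits = []
    for start in sorted(candidates):
        window = text[start:start + m]
        if sum(a != b for a, b in zip(window, pattern)) <= k:
            hits.append(start)
    return hits
```

The cost of a query is dominated by the number of candidates that must be verified, which is the quantity the abstract says its partition algorithm minimizes.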
Non-Expansive Hashing
 In Proc. 28th STOC
, 1996
Cited by 30 (0 self)
In a nonexpansive hashing scheme, similar inputs are stored in memory locations which are close. We develop a nonexpansive hashing scheme wherein any set of size $O(R^{1-\epsilon})$ from a large universe may be stored in a memory of size R (for any $\epsilon > 0$ and $R > R_0(\epsilon)$), and where retrieval takes O(1) operations. We explain how to use nonexpansive hashing schemes for efficient storage and retrieval of noisy data. A dynamic version of this hashing scheme is presented as well.
Indexing and Dictionary Matching with One Error (Extended Abstract)
, 1999
Cited by 25 (2 self)
The indexing problem is the one where a text is preprocessed and subsequent queries of the form: "Find all occurrences of pattern P in the text" are answered in time proportional to the length of the query and the number of occurrences. In the dictionary matching problem a set of patterns is preprocessed and subsequent queries of the form: "Find all occurrences of dictionary patterns in text T" are answered in time proportional to the length of the text and the number of occurrences. There exist efficient worst-case solutions for the indexing problem and the dictionary matching problem, but none that find approximate occurrences of the patterns, i.e. where the pattern is within a bound edit (or Hamming...
Multiple Approximate String Matching
 In Proc. of WADS'97, LNCS 1272
, 1997
Cited by 19 (9 self)
We present two new algorithms for online multiple approximate string matching. These are extensions of previous algorithms that search for a single pattern. The single-pattern version of the first one is based on the simulation with bits of a nondeterministic finite automaton built from the pattern and using the text as input. To search for multiple patterns, we superimpose their automata, using the result as a filter. The second algorithm partitions the pattern in subpatterns that are searched with no errors, with a fast exact multipattern search algorithm. To handle multiple patterns, we search the subpatterns of all of them together. The average running time achieved is in both cases O(n) for moderate error level, pattern length and number of patterns. They adapt (with higher costs) to the other cases. However, the algorithms differ in speed and thresholds of usefulness. We analyze theoretically when each algorithm should be used, and show experimentally that they are faster ...
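The phrase "simulation with bits of a nondeterministic finite automaton" can be illustrated for a single pattern as follows. This sketch uses a row-by-row update in the style of Wu and Manber's bit-parallel approximate matching, which is an assumption made here for brevity and may differ from the simulation the paper itself uses; it also does not attempt the superimposition of several patterns. The function name is hypothetical.

```python
def approx_match_ends(text, pattern, k):
    """End positions i such that some substring of text ending at i is within
    edit distance k of pattern (bit-parallel simulation of the k-error NFA)."""
    m = len(pattern)
    full = (1 << (m + 1)) - 1
    # masks[c] has bit j + 1 set when pattern[j] == c (arrow into state j + 1).
    masks = {}
    for j, c in enumerate(pattern):
        masks[c] = masks.get(c, 0) | (1 << (j + 1))
    # rows[d] holds the states reachable with at most d errors; bit 0 is the
    # always-active initial state. Before any text is read, states 0..d are
    # active because up to d pattern characters may be deleted.
    rows = [((1 << (d + 1)) - 1) & full for d in range(k + 1)]
    ends = []
    for i, c in enumerate(text):
        b = masks.get(c, 0)
        prev_old = rows[0]
        rows[0] = ((rows[0] << 1) & b) | 1
        for d in range(1, k + 1):
            old = rows[d]
            rows[d] = ((((old << 1) & b) | 1)   # match
                       | prev_old               # insertion of a text character
                       | (prev_old << 1)        # substitution
                       | (rows[d - 1] << 1)     # deletion of a pattern character
                       ) & full
            prev_old = old
        if rows[k] & (1 << m):
            ends.append(i)
    return ends
```

Each text character costs O(k) word operations as long as the pattern fits in a machine word; Python's unbounded integers hide that limit here.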
New and Faster Filters for Multiple Approximate String Matching
 RANDOM STRUCTURES AND ALGORITHMS (RSA)
, 1998
Cited by 17 (9 self)
We present three new algorithms for online multiple string matching allowing errors. These are extensions of previous algorithms that search for a single pattern. The average running time achieved is in all cases linear in the text size for moderate error level, pattern length and number of patterns. They adapt (with higher costs) to the other cases. However, the algorithms differ in speed and thresholds of usefulness. We analyze theoretically when each algorithm should be used, and show experimentally their performance. The only previous solution for this problem allows only one error. Our algorithms are the first to allow more errors, and are faster than previous work for a moderate number of patterns (e.g. less than 50-100 on English text, depending on the pattern length).
Approximate Dictionary Queries
, 1996
Cited by 13 (2 self)
Given a set of n binary strings of length m each. We consider the problem of answering d-queries. Given a binary query string $\alpha$ of length m, a d-query is to report if there exists a string in the set within Hamming distance d of $\alpha$. We present a data structure of size O(nm) supporting 1-queries in time O(m) and the reporting of all strings within Hamming distance 1 of $\alpha$ in time O(m). The data structure can be constructed in time O(nm). A slightly modified version of the data structure supports the insertion of new strings in amortized time O(m).

1 Introduction

Let $W = \{w_1, \ldots, w_n\}$ be a set of n binary strings of length m each, i.e. $w_i \in \{0, 1\}^m$. The set W is called the dictionary. We are interested in answering d-queries, i.e. for any query string $\alpha \in \{0, 1\}^m$ to decide if there is a string $w_i$ in W with at most Hamming distance d of $\alpha$. Minsky and Papert originally raised this problem in [12]. Recently a sequence of papers have considered how to solve thi...
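For reference, a 1-query can be answered with an ordinary hash set by probing the query and each of its m single-bit flips. The sketch below shows that simple baseline; the function names are hypothetical and strings are assumed to be Python `str` values over "0" and "1". It is not the paper's data structure: each probe hashes an m-character string, so a query costs roughly O(m^2) rather than the O(m) bound stated in the abstract.

```python
def build_dictionary(words):
    """Store the binary strings in a hash set (all assumed to share one length m)."""
    return set(words)


def report_within_one(dictionary, alpha):
    """All dictionary strings within Hamming distance 1 of alpha."""
    found = []
    if alpha in dictionary:
        found.append(alpha)
    for i, bit in enumerate(alpha):
        # Flip bit i and probe the set for the resulting neighbor.
        flipped = alpha[:i] + ("1" if bit == "0" else "0") + alpha[i + 1:]
        if flipped in dictionary:
            found.append(flipped)
    return found


def one_query(dictionary, alpha):
    """True if some dictionary string is within Hamming distance 1 of alpha."""
    return bool(report_within_one(dictionary, alpha))
```

The gap between this baseline and the stated bounds is what the paper's structure is designed to close.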