Abstract:
A successful technique to search large textual databases allowing errors relies on an online search in the vocabulary of the text. To reduce the time of that online search, we index the vocabulary as a metric space. We show that with reasonable space overhead we can improve by a factor of two over the fastest online algorithms, when the tolerated error level is low (which is reasonable in text searching).

1 Introduction

Approximate string matching is a recurrent problem in many branches of computer science, with applications to text searching, computational biology, pattern recognition, signal processing, etc. The problem can be stated as follows: given a long text of length n, and a (comparatively short) pattern of length m, retrieve all the segments (or "occurrences") of the text whose edit distance to the pattern is at most k. The edit distance ed() between two strings is defined as the minimum number of character insertions, deletions and replacements needed to make them equal. I...
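
The definition of ed() and the idea of searching a vocabulary within k errors can be illustrated with a minimal Python sketch. It is not taken from the paper: the function and class names (`ed`, `scan_vocabulary`, `BKTree`) are hypothetical, the sample vocabulary is invented, and the Burkhard-Keller tree is used only as one well-known example of a metric-space index, since the excerpt does not say which structure the authors actually use.

```python
def ed(a: str, b: str) -> int:
    """Edit distance: minimum number of character insertions,
    deletions and replacements needed to make a and b equal."""
    if len(a) < len(b):
        a, b = b, a  # keep the DP row as short as possible
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # replace or match
        prev = curr
    return prev[-1]


def scan_vocabulary(vocabulary, pattern, k):
    """Baseline online search: compare the pattern with every word."""
    return [w for w in vocabulary if ed(w, pattern) <= k]


class BKTree:
    """Illustrative metric index (assumption: not necessarily the
    structure used in the paper). Each node is (word, children), where
    children maps a distance d to the subtree of words at distance d."""

    def __init__(self, words):
        self.root = None
        for w in words:
            self.add(w)

    def add(self, word):
        if self.root is None:
            self.root = (word, {})
            return
        node = self.root
        while True:
            d = ed(word, node[0])
            if d == 0:
                return  # word already stored
            child = node[1].get(d)
            if child is None:
                node[1][d] = (word, {})
                return
            node = child

    def search(self, pattern, k):
        """Return all stored words within edit distance k of pattern."""
        results = []
        stack = [self.root] if self.root else []
        while stack:
            word, children = stack.pop()
            d = ed(pattern, word)
            if d <= k:
                results.append(word)
            # Triangle inequality: a match can only lie in a subtree
            # whose edge label is within [d - k, d + k].
            for dist, child in children.items():
                if d - k <= dist <= d + k:
                    stack.append(child)
        return results


# Hypothetical sample vocabulary, for illustration only.
vocabulary = ["text", "test", "tent", "search", "searches", "index"]
tree = BKTree(vocabulary)
print(ed("survey", "surgery"))         # 2
print(sorted(tree.search("text", 1)))  # ['tent', 'test', 'text']
```

Because ed() satisfies the triangle inequality, the tree can skip whole subtrees without computing their distances. Pruning of this kind is most effective when k is small, which matches the low error levels the abstract refers to.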