## Fast Approximate Search in Large Dictionaries (2004)

### Other Repositories/Bibliography

Venue: Computational Linguistics

Citations: 13 (4 self)

### BibTeX

    @MISC{Mihov04fastapproximate,
        author = {Stoyan Mihov and Klaus U. Schulz},
        title = {Fast Approximate Search in Large Dictionaries},
        year = {2004}
    }

### Abstract

The need to correct garbled strings arises in many areas of natural language processing. If a dictionary is available that covers all possible input tokens, a natural set of candidates for correcting an erroneous input P is the set of all words in the dictionary for which the Levenshtein distance to P does not exceed a given (small) bound k. In this article we describe methods for efficiently selecting such candidate sets. After introducing as a starting point a basic correction method based on the concept of a "universal Levenshtein automaton," we show how two filtering methods known from the field of approximate text search can be used to improve the basic procedure in a significant way. The first method, which uses standard dictionaries plus dictionaries with reversed words, leads to very short correction times for most classes of input strings. Our evaluation results demonstrate that correction times for fixed distance bounds depend on the expected number of correction candidates, which decreases for longer input words. Similarly, the choice of an optimal filtering method depends on the length of the input words.
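To make the candidate set from the abstract concrete, the following is a minimal sketch of a naive baseline: it computes the Levenshtein distance by dynamic programming and scans the whole dictionary. It is not the universal Levenshtein automaton or either filtering method of the article, which exist precisely to avoid this full scan. The names `levenshtein`, `candidates`, and `TOY_DICTIONARY` are illustrative assumptions, as is the toy word list.

```python
def levenshtein(a: str, b: str) -> int:
    """Levenshtein distance via row-by-row dynamic programming."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # deleting the first i characters of a
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def candidates(pattern: str, dictionary, k: int) -> list:
    """All dictionary words W with Levenshtein distance to pattern <= k.

    Naive full scan; the length check is a cheap pre-filter, since words
    whose lengths differ by more than k cannot be within distance k.
    """
    return sorted(
        w for w in dictionary
        if abs(len(w) - len(pattern)) <= k and levenshtein(pattern, w) <= k
    )


# Hypothetical toy dictionary, for illustration only.
TOY_DICTIONARY = {"child", "cold", "hold", "chord", "world", "held"}

print(candidates("chold", TOY_DICTIONARY, 1))
print(candidates("chold", TOY_DICTIONARY, 2))
```

With the bound k = 1 the garbled input `chold` yields `child`, `chord`, `cold`, and `hold`; raising the bound to k = 2 additionally admits `held`. The scan costs one distance computation per surviving dictionary word, which is the cost the methods described in the article aim to reduce.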

### Citations

3836 | Introduction to automata theory, languages, and computation - Hopcroft, Motwani, et al. |

1201 | Binary codes capable of correcting deletions, insertions, and reversals - Levenshtein - 1966 |

Citation context: ...based on two steps. First, all entries W of the dictionary are selected for which the distance between P and W does not exceed a given bound k. Popular distance measures are the Levenshtein distance (Levenshtein 1966; Wagner and Fischer 1974; Owolabi and McGregor 1988; Weigel, Baumann, and Rohrschneider 1995; Seni, Kripasundar, and Srihari 1996; Oommen and Loke 1997) or n-gram distances (Angell, Freund, and Wille...

409 | A Guided Tour to Approximate String Matching - Navarro |

352 | Automatically correcting words in text - Kukich - 1992 |

Citation context: ...comment on the difficulties that we encountered when trying to combine dictionary automata and similarity keys (Davidson 1962; Angell, Freund, and Willett 1983; Owolabi and McGregor 1988; Sinha 1990; Kukich 1992; Anigbogu and Belaid 1995; Zobel and Dart 1995; de Bertrand de Beuvron and Trigano 1995). Theoretical bounds for correction times are discussed in Section 9. The problem considered in this article is...

318 | Fast text search allowing errors - Manber, Wu - 1992 |

Citation context: ...[Figure 2: Approximate search of pattern chold in a text using dynamic programming.] ...distance ≤ k to some word in Σ*·P (Ukkonen 1985b; Wu and Manber 1992; Baeza-Yates and Navarro 1999). The automaton for pattern chold and distance bound k = 2 is shown in Figure 3. States are numbered in the form b^e. The “base number” b determines the position of the...

186 | On approximate string matching - Ukkonen - 1983 |

Citation context: ...[Figure 2: Approximate search of pattern chold in a text using dynamic programming.] ...distance ≤ k to some word in Σ*·P (Ukkonen 1985b; Wu and Manber 1992; Baeza-Yates and Navarro 1999). The automaton for pattern chold and distance bound k = 2 is shown in Figure 3. States are numbered in the form b^e. The “base number” b determine...

153 | Approximate string matching with q-grams and maximal matches - Ukkonen - 1992 |

Citation context: ...nd McGregor 1988; Weigel, Baumann, and Rohrschneider 1995; Seni, Kripasundar, and Srihari 1996; Oommen and Loke 1997) or n-gram distances (Angell, Freund, and Willett 1983; Owolabi and McGregor 1988; Ukkonen 1992; Kim and Shawe-Taylor 1992, 1994). Second, statistical data, such as frequency information, may be used to compute a ranking of the correction candidates. In this article, we ignore the ranking proble...

149 | Finding approximate patterns in strings - Ukkonen - 1985 |

Citation context: ...[Figure 2: Approximate search of pattern chold in a text using dynamic programming.] ...distance ≤ k to some word in Σ*·P (Ukkonen 1985b; Wu and Manber 1992; Baeza-Yates and Navarro 1999). The automaton for pattern chold and distance bound k = 2 is shown in Figure 3. States are numbered in the form b^e. The “base number” b determine...

126 | Automata and Computability - Kozen - 1997 |

94 | String Searching Algorithms - Stephen - 1994 |

93 | A sublinear algorithm for approximate keyword searching - Myers - 1994 |

Citation context: ...oximate match of a given pattern P is not possible. (See Navarro [2001] and Navarro and Raffinot [2002] for surveys). In this section, we show how one general method of this form (Wu and Manber 1992; Myers 1994; Baeza-Yates and Navarro 1999; Navarro and Baeza-Yates 1999) can be adapted to approximate search in a dictionary, improving the basic correction algorithm. For approximate text search, the crucial ob...

72 | Faster approximate string matching - Baeza-Yates, Navarro - 1999 |

Citation context: ...[Figure 2: Approximate search of pattern chold in a text using dynamic programming.] ...distance ≤ k to some word in Σ*·P (Ukkonen 1985b; Wu and Manber 1992; Baeza-Yates and Navarro 1999). The automaton for pattern chold and distance bound k = 2 is shown in Figure 3. States are numbered in the form b^e. The “base number” b determines the position of the state in the pattern. The “ex...

55 | Automatic spelling correction using a trigram similarity measure - Willett, Angell - 1983 |

42 | Incremental construction of minimal acyclic finite-state automata - Daciuk, Mihov, et al. - 2000 |

39 | Error-tolerant Finite-state Recognition with Applications to Morphological Analysis and Spelling Correction - Oflazer - 1996 |

32 | Very fast and simple approximate string matching - Navarro, Baeza-Yates - 1999 |

28 | Fast string correction with Levenshtein automata - Schulz, Mihov |

24 | Fast approximate string matching in a dictionary - Baeza-Yates, Navarro - 1998 |

23 | Computer Text Recognition and Error Correction - Srihari - 1984 |

19 | Integrating Diverse Knowledge Sources in Text Recognition - Srihari, Hull, et al. - 1983 |

11 | A binary n-gram technique for automatic correction of substitution, deletion, insertion and reversal errors in words - Ullman - 1977 |

10 | A program for correcting spelling errors - Blair - 1960 |

10 | Techniques for improving OCR results - Dengel, Hoch, et al. - 1997 |

10 | An approximate string-matching algorithm - Kim, Shawe-Taylor - 1992 |

10 | Fast string matching using an n-gram algorithm - Kim, Shawe-Taylor - 1994 |

9 | Generalizing edit distance to incorporate domain information: Handwritten text recognition as a case study - Seni, Kripasundar, et al. - 1996 |

8 | Retrieval of misspelled names in an airline passenger record system - Davidson - 1962 |

Citation context: ...n results are given for the three dictionaries mentioned above. In Section 8 we briefly comment on the difficulties that we encountered when trying to combine dictionary automata and similarity keys (Davidson 1962; Angell, Freund, and Willett 1983; Owolabi and McGregor 1988; Sinha 1990; Kukich 1992; Anigbogu and Belaid 1995; Zobel and Dart 1995; de Bertrand de Beuvron and Trigano 1995). Theoretical bounds for ...

8 | Partitioning a Dictionary for Visual Text Recognition - Sinha - 1990 |

Citation context: ... we briefly comment on the difficulties that we encountered when trying to combine dictionary automata and similarity keys (Davidson 1962; Angell, Freund, and Willett 1983; Owolabi and McGregor 1988; Sinha 1990; Kukich 1992; Anigbogu and Belaid 1995; Zobel and Dart 1995; de Bertrand de Beuvron and Trigano 1995). Theoretical bounds for correction times are discussed in Section 9. The problem considered in th...

7 | Fast dictionary look-up for contextual word recognition - Wells, Evett, et al. - 1990 |

6 | A fast algorithm for finding the nearest neighbor of a word in a dictionary - Bunke - 1993 |

6 | Lexical Postprocessing by Heuristic Search and Automatic Determination of the Edit Costs - Weigel, Baumann, et al. - 1995 |

5 | Hidden Markov models in text recognition - Anigbogu, Belaid - 1995 |

Citation context: ...e difficulties that we encountered when trying to combine dictionary automata and similarity keys (Davidson 1962; Angell, Freund, and Willett 1983; Owolabi and McGregor 1988; Sinha 1990; Kukich 1992; Anigbogu and Belaid 1995; Zobel and Dart 1995; de Bertrand de Beuvron and Trigano 1995). Theoretical bounds for correction times are discussed in Section 9. The problem considered in this article is well-studied. Since the n...

5 | A hash code method for detecting and correcting spelling errors - Mor, Fraenkel - 1982 |

Citation context: ... of this article. The second refinement, which is only interesting for bound k = 1 and short input words, also uses a filtering method from the field of approximate text search (Muth and Manber 1996; Mor and Fraenkel 1981). In this approach, “dictionaries with single deletions” are used to reduce approximate search in a dictionary D with bound k = 1 to a conventional lookup technique for finite-state transducers. Dict...

3 | Direct construction of minimal acyclic subsequential transducers - Mihov, Maurel - 2001 |

2 | patent numbers 1,435,663 - Odell, Russell - 1922 |

1 | Introduction to Automata Theory, Languages, and Computation - Hopcroft, Ullman - 1979 |

1 | Flexible Pattern Matching in Strings - Navarro, Raffinot - 2002 |

1 | Pattern recognition of strings with substitutions, insertions, deletions, and generalized transpositions - Oommen, Loke - 1997 |

1 | Fast approximate string matching. Software—Practice and Experience - Owolabi, McGregor - 1988 |

1 | Hierarchically coded lexicon with variants - Beuvron, Francois, et al. - 1995 |