## Fast String Correction with Levenshtein-Automata (2002)

Venue: International Journal of Document Analysis and Recognition

Citations: 28 (5 self)

### BibTeX

@ARTICLE{Schulz02faststring,

author = {Klaus Schulz and Stoyan Mihov},

title = {Fast String Correction with Levenshtein-Automata},

journal = {INTERNATIONAL JOURNAL OF DOCUMENT ANALYSIS AND RECOGNITION},

year = {2002},

volume = {5},

pages = {67--85}

}

### Abstract

The Levenshtein-distance between two words is the minimal number of insertions, deletions or substitutions needed to transform one word into the other. Levenshtein-automata of degree n for a word W are defined as finite state automata that recognize the set of all words V where the Levenshtein-distance between V and W does not exceed n. We show how to compute, for any fixed bound n and any input word W, a deterministic Levenshtein-automaton of degree n for W in time linear in the length of W. Given an electronic dictionary that is implemented in the form of a trie or a finite state automaton, the Levenshtein-automaton for W can be used to control search in the lexicon in such a way that exactly those lexical words V are generated for which the Levenshtein-distance between V and W does not exceed the given bound. This leads to a very fast method for correcting corrupted input words of unrestricted text using large electronic dictionaries. We then introduce a second method that avoids the explicit computation of Levenshtein-automata and leads to even better efficiency. We also describe how to extend both methods to variants of the Levenshtein-distance where further primitive edit operations (transpositions, merges and splits) may be used.

### Citations

4091 | Introduction to Automata Theory, Languages, and Computation - HOPCROFT, ULLMAN - 1979 |

1330 | Binary codes capable of correcting deletions, insertions, and reversals - Levenshtein - 1966 |

695 | The string-to-string correction problem - Wagner, Fischer - 1974 |

Citation Context: ...istances [AFW83, Ukk92, KST92, KST94]. In this paper, we take the Levenshtein-distance as a basis. The standard algorithm for computing the Levenshtein-distance between two words by Wagner and Fisher [WF74] uses a dynamic programming scheme that leads to quadratic time complexity. Even with more sophisticated algorithms (cf. [Ukk85]) it is not realistic to compute the Levenshtein-distance between the in...

381 | Techniques for automatically correcting words in text - Kukich - 1992 |

198 | Algorithms for approximate string matching - Ukkonen - 1985 |

Citation Context: ...omputing the Levenshtein-distance between two words by Wagner and Fisher [WF74] uses a dynamic programming scheme that leads to quadratic time complexity. Even with more sophisticated algorithms (cf. [Ukk85]) it is not realistic to compute the Levenshtein-distance between the input word W and each of the words in the dictionary, already for dictionaries of a modest size. The problem becomes even more ser...

168 | Approximate string-matching with q-grams and maximal matches. Theoretical Computer Science 92 - Ukkonen - 1992 |

146 | A fast bit-vector algorithm for approximate string matching based on dynamic programming - Myers - 1999 |

135 | Automata and Computability - Kozen - 1997 |

60 | Automatic spelling correction using trigram similarity measure - Angell, Freund, et al. - 1983 |

54 | Minimisation of acyclic deterministic automata in linear time, Theoret - Revuz - 1992 |

Citation Context: ...) in time and space O(|W|). Corollary 5.1.3 For any input W, the minimal deterministic Levenshtein-automaton of degree 1 for W can be computed in time and space O(|W|). Proof. A result by D. Revuz [Rev92] shows that acyclic deterministic finite state automata can be minimalized in linear time. Since LEV_1(W) is deterministic and acyclic the result follows. Example 5.1.4 Figure 5.1 describes the automa...

25 | Computer Text Recognition and Error Correction - Srihari - 1985 |

23 | Degraded Text Recognition Using Visual and Linguistic Context - Hong - 1995 |

21 | Incorporation of a markov model of language syntax in a text recognition algorithm - Hull |

20 | Integrating diverse knowledge sources in text recognition - Srihari, Hull, et al. - 1983 |

19 | A binary n-gram technique for automatic correction of substitution, deletion, insertion and reversal errors - Ullmann - 1977 |

15 | Fast approximate string matching - Owolabi, McGregor - 1988 |

13 | Pattern recognition of strings with substitutions, insertions, deletions, and generalized transpositions - Oommen, Loke - 1997 |

13 | A program for correcting spelling errors - Blair - 1960 |

12 | Fast string matching using an n-gram algorithm - Kim, Shawe-Taylor - 1994 |

12 | Generalizing edit distance to incorporate domain information: Handwritten text recognition as a case study - Seni, Kripasundar, et al. - 1996 |

11 | An approximate string-matching algorithm - Kim, Shawe-Taylor - 1992 |

11 | Direct building of minimal automaton for given list - Mihov - 1999 |

10 | A spelling correction method and its application to an OCR system - Takahashi, Itoh, et al. - 1990 |

9 | Contextual word recognition using binary digrams - Riseman, Ehrich - 1971 |

9 | On partitioning a dictionary for visual text recognition - Sinha - 1990 |

8 | Finding approximate matches in large lexicons. Software Pract Exp - Zobel, Dart - 1995 |

7 | Lexical Postprocessing by Heuristic Search and Automatic Determination of the Edit Costs - Weigel, Baumann, et al. - 1995 |

6 | Incremental construction of minimal acyclic state automata - Daciuk, Mihov, et al. - 2000 |

6 | Selektionsklassen und Hyponymie im Lexikon - Langer - 1996 |

6 | Lexikon und automatische Lemmatisierung - Maier-Meyer - 1995 |

5 | Electronic lexica and corpora research at CIS - Guenthner - 1996 |

3 | A large vocabulary stochastic analyser for handwriting recognition - Keenan, Evett, et al. - 1991 |

2 | A fast algorithm for the nearest neighbor of a word in a dictionary - Bunke - 1992 |

Citation Context: ...number of common n-grams with W exceed a certain threshold are selected in the first step [OM88, KST92, ZD95]. More recently, techniques from automata theory have been used to tackle the problem. Bunke [Bun93] has shown that for any given word W the columns of the table computed in the Fisher-Wagner algorithm can be compiled into a deterministic finite state automaton. For any word V the automaton may be use...

2 | de Beuvron and Philippe Trigano. Hierarchically coded lexicon with variants - Bertrand - 1995 |

1 | Error-tolerant recognition with applications to morphological analysis and spelling correction - Oflazer - 1996 |

1 | Einfache deutsche Verben. Eine syntaktische und semantische Beschreibung der verbalen Simplizia für das elektronische Lexikonsystem CISLEX - Schnorbusch - 1998 |