The Levenshtein-distance between two words is the minimal number of insertions, deletions or substitutions that are needed to transform one word into the other. Levenshtein-automata of degree n for a word W are defined as finite state automata that recognize the set of all words V where the Levenshtein-distance between V and W does not exceed n. We show how to compute, for any fixed bound n and any input word W, a deterministic Levenshtein-automaton of degree n for W in time linear in the length of W. Given an electronic dictionary that is implemented in the form of a trie or a finite state automaton, the Levenshtein-automaton for W can be used to control search in the lexicon in such a way that exactly the lexical words V are generated where the Levenshtein-distance between V and W does not exceed the given bound. This leads to a very fast method for correcting corrupted input words of unrestricted text using large electronic dictionaries. We then introduce a second method that avoids the explicit computation of Levenshtein-automata and leads to even better efficiency. We also describe how to extend both methods to variants of the Levenshtein-distance where further primitive edit operations (transpositions, merges and splits) may be used.

Keywords: Spelling correction, Levenshtein-distance, optical character recognition, electronic dictionaries.
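The distance defined in the first sentence can be illustrated with the standard dynamic-programming recurrence (not part of the paper's own algorithms, which avoid this quadratic computation); a minimal sketch in Python, with the function name chosen for illustration:

```python
def levenshtein(v, w):
    """Minimal number of insertions, deletions, or substitutions
    transforming v into w (textbook dynamic programming)."""
    m, n = len(v), len(w)
    # d[i][j] = distance between the prefixes v[:i] and w[:j]
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i          # delete all i characters of v[:i]
    for j in range(n + 1):
        d[0][j] = j          # insert all j characters of w[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if v[i - 1] == w[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution or match
    return d[m][n]
```

A Levenshtein-automaton of degree n for W then accepts exactly those V with `levenshtein(V, W) <= n`; for example, "sitting" lies within distance 3 of "kitten".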
International Journal of Document Analysis and Recognition