Results 1 - 6 of 6
Fast String Correction with Levenshtein-Automata
International Journal of Document Analysis and Recognition, 2002
Cited by 28 (5 self)
The Levenshtein-distance between two words is the minimal number of insertions, deletions or substitutions that are needed to transform one word into the other. Levenshtein-automata of degree n for a word W are defined as finite state automata that recognize the set of all words V where the Levenshtein-distance between V and W does not exceed n. We show how to compute, for any fixed bound n and any input word W, a deterministic Levenshtein-automaton of degree n for W in time linear in the length of W. Given an electronic dictionary that is implemented in the form of a trie or a finite state automaton, the Levenshtein-automaton for W can be used to control search in the lexicon in such a way that exactly the lexical words V are generated where the Levenshtein-distance between V and W does not exceed the given bound. This leads to a very fast method for correcting corrupted input words of unrestricted text using large electronic dictionaries. We then introduce a second method that avoids the explicit computation of Levenshtein-automata and leads to even better efficiency. We also describe how to extend both methods to variants of the Levenshtein-distance where further primitive edit operations (transpositions, merges and splits) may be used.
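The idea of letting an automaton for W steer a dictionary traversal can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it simulates a nondeterministic automaton as a set of (position, errors) pairs rather than building the deterministic automaton in linear time as the abstract describes, the lexicon is a plain nested-dict trie, and all names (`build_trie`, `step`, `search`) are hypothetical.

```python
def closure(states, word, n):
    """Add states reached by skipping characters of `word` (one error each)."""
    seen = set(states)
    stack = list(states)
    while stack:
        i, e = stack.pop()
        if i < len(word) and e < n:
            nxt = (i + 1, e + 1)
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return frozenset(seen)


def start(word, n):
    """Initial state set: nothing read, no errors (plus skipped characters)."""
    return closure({(0, 0)}, word, n)


def step(states, ch, word, n):
    """Read one input character `ch`; an empty result means a dead end."""
    out = set()
    for i, e in states:
        if i < len(word) and word[i] == ch:
            out.add((i + 1, e))              # ch matches word[i]
        if e < n:
            out.add((i, e + 1))              # ch is an extra character
            if i < len(word):
                out.add((i + 1, e + 1))      # ch replaces word[i]
    return closure(out, word, n)


def accepts(states, word):
    """Accept when all of `word` is accounted for within the error bound."""
    return any(i == len(word) for i, _ in states)


def build_trie(words):
    """Nested-dict trie; the key None marks the end of a lexical word."""
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node[None] = True
    return root


def search(trie, word, n):
    """Return all lexicon words within distance n of `word`."""
    results = []
    stack = [(trie, "", start(word, n))]
    while stack:
        node, prefix, states = stack.pop()
        if None in node and accepts(states, word):
            results.append(prefix)
        for ch, child in node.items():
            if ch is None:
                continue
            nxt = step(states, ch, word, n)
            if nxt:                          # prune branches the automaton rejects
                stack.append((child, prefix + ch, nxt))
    return sorted(results)


lexicon = build_trie(["child", "chill", "chold", "cold", "hold", "world"])
print(search(lexicon, "chold", 1))  # ['child', 'chold', 'cold', 'hold']
```

The pruning step is the point of the sketch: a trie branch is abandoned as soon as the state set becomes empty, so only prefixes that can still lead to a word within the bound are explored.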
Fast Approximate Search in Large Dictionaries
Computational Linguistics, 2004
Cited by 14 (4 self)
The need to correct garbled strings arises in many areas of natural language processing. If a dictionary is available that covers all possible input tokens, a natural set of candidates for correcting an erroneous input P is the set of all words in the dictionary for which the Levenshtein distance to P does not exceed a given (small) bound k. In this article we describe methods for efficiently selecting such candidate sets. After introducing as a starting point a basic correction method based on the concept of a "universal Levenshtein automaton," we show how two filtering methods known from the field of approximate text search can be used to improve the basic procedure in a significant way. The first method, which uses standard dictionaries plus dictionaries with reversed words, leads to very short correction times for most classes of input strings. Our evaluation results demonstrate that correction times for fixed-distance bounds depend on the expected number of correction candidates, which decreases for longer input words. Similarly, the choice of an optimal filtering method depends on the length of the input words.
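A rough Python sketch of why a dictionary of reversed words helps as a filter, restricted to k = 1. It is a simplification, not the article's procedure: it relies only on the observation that if P is split into two halves and at most one edit is allowed, one half must occur unchanged, either as a prefix of the candidate (found in the forward dictionary) or as a suffix (found as a prefix in the reversed dictionary). Candidates are verified with a plain dynamic-programming distance instead of a universal Levenshtein automaton, sorted lists stand in for the dictionaries, and all function names are hypothetical.

```python
from bisect import bisect_left


def levenshtein(a, b):
    """Standard dynamic-programming edit distance, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # match or substitute
        prev = cur
    return prev[-1]


def with_prefix(sorted_words, prefix):
    """All entries of a sorted list that start with `prefix`."""
    out = []
    for w in sorted_words[bisect_left(sorted_words, prefix):]:
        if not w.startswith(prefix):
            break
        out.append(w)
    return out


def candidates_k1(pattern, forward, backward):
    """Dictionary words within distance 1 of `pattern`.

    `forward` is the sorted word list, `backward` the sorted list of
    reversed words.
    """
    mid = len(pattern) // 2
    left, right = pattern[:mid], pattern[mid:]
    # Case 1: the left half is intact, so the candidate starts with it.
    cands = set(with_prefix(forward, left))
    # Case 2: the right half is intact, so the reversed candidate
    # starts with the reversed right half.
    cands.update(w[::-1] for w in with_prefix(backward, right[::-1]))
    # Verify the (usually small) candidate set exactly.
    return sorted(w for w in cands if levenshtein(pattern, w) <= 1)


words = ["child", "chill", "chold", "cold", "hold", "world"]
forward = sorted(words)
backward = sorted(w[::-1] for w in words)
print(candidates_k1("chold", forward, backward))
# ['child', 'chold', 'cold', 'hold']
```

Only words sharing an intact half with the input are ever compared in full, which is the sense in which the two dictionaries act as a filter.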
Message Extraction from Printed Documents - A Complete Solution
In 4th Int. Conf. on Document Analysis and Recognition (ICDAR 97), 1997
Cited by 13 (6 self)
The task to be solved within our core research was the design and development of a document analysis toolbox covering typical document analysis tasks such as document understanding, information extraction and text recognition. In order to prove the feasibility of our concepts, we have developed the prototypical analysis system OfficeMAID. The system analyzes documents, as used in the daily work of a purchasing department, by a priori knowledge about workflows and document features. In this way the system provides goal-directed information extraction, shallow understanding and process identification for given documents (paper, fax, email). This work has been supported by a grant from the BMBF (ITW 9702).
1 Introduction
Generally, printed documents are neither generated for scanning and automatic processing nor for easy integration into electronic workflows. Therefore, it is hard to transform them adequately for further processing by electronic means. This is the reason why DMS  in...
A visual and interactive tool for optimizing lexical postcorrection of OCR results
In Proceedings of the IEEE Workshop on Document Image Analysis and Recognition, DIAR'03, 2003
Cited by 8 (6 self)
Systems for postcorrection of OCR results can be fine-tuned and adapted to new recognition tasks in many respects. One issue is the selection and adaption of a suitable background dictionary. Another issue is the choice of a correction model, which includes, among other decisions, the selection of an appropriate distance measure for strings and the choice of a scoring function for ranking distinct correction alternatives. When combining the results obtained from distinct OCR engines, further parameters have to be fixed. Due to all these degrees of freedom, adaption and fine-tuning of systems for lexical postcorrection is a difficult process. Here we describe a visual and interactive tool that semiautomates the generation of ground truth data, partially automates adjustment of parameters, yields active support for error analysis and thus helps to find correction strategies that lead to high accuracy with realistic effort.
Advances in Document Classification by Voting of Competitive Approaches
1997
Cited by 7 (2 self)
This paper presents a complex approach for the content-based text categorization of printed German business letters into predefined message types such as order, invoice, offer, etc. The categorization results of two competing classifiers are combined by means of a voting component embodying knowledge about the strengths and weaknesses of the classifiers. The individual classifiers differ strongly in their basic assumptions: while the first one considers layout and typographic information with respect to certain keywords, the second one is a more conventional text categorization approach which merely incorporates textual features. Since this whole categorization tool is embedded into a document analysis system, a highly precise classification is essential for a subsequent goal-directed extraction of structured information aimed at the integration of the document into the current business workflow of a company.
From Paper to a Corporate Memory - A First Step
In KI-97 Workshop on Knowledge-Based Systems for Knowledge Management in Enterprises, Document D-97-03. Deutsches Forschungszentrum für Künstliche Intelligenz, 1997
Cited by 5 (2 self)
Computer-based corporate memories aim to enable an efficient use of corporate knowledge.