Results 1 - 6 of 6
Fast String Correction with Levenshtein-Automata
- INTERNATIONAL JOURNAL OF DOCUMENT ANALYSIS AND RECOGNITION
, 2002
Cited by 19 (3 self)
The Levenshtein-distance between two words is the minimal number of insertions, deletions or substitutions that are needed to transform one word into the other. Levenshtein-automata of degree n for a word W are defined as finite state automata that recognize the set of all words V where the Levenshtein-distance between V and W does not exceed n. We show how to compute, for any fixed bound n and any input word W, a deterministic Levenshtein-automaton of degree n for W in time linear in the length of W. Given an electronic dictionary that is implemented in the form of a trie or a finite state automaton, the Levenshtein-automaton for W can be used to control search in the lexicon in such a way that exactly the lexical words V are generated where the Levenshtein-distance between V and W does not exceed the given bound. This leads to a very fast method for correcting corrupted input words of unrestricted text using large electronic dictionaries. We then introduce a second method that avoids the explicit computation of Levenshtein-automata and leads to even better efficiency. We also describe how to extend both methods to variants of the Levenshtein-distance where further primitive edit operations (transpositions, merges and splits) may be used.
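The definitions in this abstract can be illustrated with a small sketch. The code below is not the authors' Levenshtein-automaton construction; it assumes a plain dictionary-based trie and carries one row of the standard dynamic-programming table along each trie path, pruning a branch as soon as every entry in the row exceeds the bound n. The names `levenshtein`, `build_trie` and `search` are hypothetical and chosen only for this illustration.

```python
def levenshtein(v: str, w: str) -> int:
    """Minimal number of insertions, deletions or substitutions turning v into w."""
    prev = list(range(len(w) + 1))
    for i, cv in enumerate(v, start=1):
        cur = [i]
        for j, cw in enumerate(w, start=1):
            cur.append(min(prev[j] + 1,                # delete cv
                           cur[j - 1] + 1,             # insert cw
                           prev[j - 1] + (cv != cw)))  # substitute or match
        prev = cur
    return prev[-1]


def build_trie(words):
    """Nested-dict trie; the key "$" marks the end of a word (assumed not to be a letter)."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node["$"] = True
    return root


def search(trie, w, n):
    """Return, sorted, all lexicon words V with Levenshtein distance to w at most n."""
    results = []

    def walk(node, prefix, row):
        # row[j] is the distance between prefix and w[:j].
        if "$" in node and row[-1] <= n:
            results.append(prefix)
        for ch, child in node.items():
            if ch == "$":
                continue
            new_row = [row[0] + 1]
            for j in range(1, len(w) + 1):
                new_row.append(min(row[j] + 1,
                                   new_row[j - 1] + 1,
                                   row[j - 1] + (w[j - 1] != ch)))
            # Row minima never decrease along a path, so this branch can be cut.
            if min(new_row) <= n:
                walk(child, prefix + ch, new_row)

    walk(trie, "", list(range(len(w) + 1)))
    return sorted(results)
```

For example, `search(build_trie(["hello", "help", "world"]), "helo", 1)` visits only the part of the trie that can still lead to a word within distance 1 of "helo". Each step of this sketch costs time proportional to the length of w, which is the overhead a precomputed deterministic automaton is meant to avoid.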
Message Extraction from Printed Documents - A Complete Solution -
- In 4th Int. Conf. on Document Analysis and Recognition (ICDAR 97)
, 1997
Cited by 13 (6 self)
The task to be solved within our core research was the design and development of a document analysis toolbox covering typical document analysis tasks such as document understanding, information extraction and text recognition. In order to prove the feasibility of our concepts, we have developed the prototypical analysis system OfficeMAID. The system analyzes documents, as used in the daily work of a purchasing department, using a priori knowledge about workflows and document features. In this way the system provides goal-directed information extraction, shallow understanding and process identification for given documents (paper, fax, e-mail). This work has been supported by a grant from the BMBF (ITW 9702).
1 Introduction
Generally, printed documents are generated neither for scanning and automatic processing nor for easy integration into electronic workflows. Therefore, it is hard to transform them adequately for further processing by electronic means. This is the reason why DMS --- in...
Fast Approximate Search in Large Dictionaries
- COMPUTATIONAL LINGUISTICS
, 2004
Cited by 8 (2 self)
The need to correct garbled strings arises in many areas of natural language processing. If a dictionary is available that covers all possible input tokens, a natural set of candidates for correcting an erroneous input P is the set of all words in the dictionary for which the Levenshtein distance to P does not exceed a given (small) bound k. In this article we describe methods for efficiently selecting such candidate sets. After introducing as a starting point a basic correction method based on the concept of a "universal Levenshtein automaton," we show how two filtering methods known from the field of approximate text search can be used to improve the basic procedure significantly. The first method, which uses standard dictionaries plus dictionaries with reversed words, leads to very short correction times for most classes of input strings. Our evaluation results demonstrate that correction times for fixed distance bounds depend on the expected number of correction candidates, which decreases for longer input words. Similarly, the choice of an optimal filtering method depends on the length of the input words.
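The candidate set described in this abstract, and the general idea of filtering before verification, can be sketched as follows. This is not the article's universal Levenshtein automaton or its forward/backward dictionary method; it assumes a plain list as the dictionary and uses a classic pigeonhole filter from approximate text search: if P is split into k + 1 pieces, any word within distance k of P must contain at least one piece unchanged. The names `bounded_distance`, `pieces` and `candidates` are hypothetical.

```python
def bounded_distance(p, v, k):
    """Levenshtein distance between p and v, or some value > k once k is exceeded."""
    prev = list(range(len(p) + 1))
    for i, cv in enumerate(v, start=1):
        cur = [i]
        for j, cp in enumerate(p, start=1):
            cur.append(min(prev[j] + 1,
                           cur[j - 1] + 1,
                           prev[j - 1] + (cp != cv)))
        if min(cur) > k:  # no later row can drop back to k or below
            return k + 1
        prev = cur
    return prev[-1]


def pieces(p, k):
    """Split p into k + 1 contiguous pieces of nearly equal length."""
    n = len(p)
    bounds = [i * n // (k + 1) for i in range(k + 2)]
    return [p[bounds[i]:bounds[i + 1]] for i in range(k + 1)]


def candidates(dictionary, p, k):
    """All dictionary words whose Levenshtein distance to p does not exceed k."""
    parts = pieces(p, k)
    result = []
    for v in dictionary:
        if abs(len(v) - len(p)) > k:
            continue  # length filter: at least this many edits are needed
        if not any(part in v for part in parts):
            continue  # pigeonhole filter: k edits cannot touch all k + 1 pieces
        if bounded_distance(p, v, k) <= k:
            result.append(v)
    return result
```

Both filters only discard words that cannot be within the bound, so the result equals what exhaustive verification would return. With longer input words the pieces are longer and match fewer dictionary entries, which is in line with the abstract's remark that the number of candidates decreases for longer inputs.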
Advances in Document Classification by Voting of Competitive Approaches
, 1997
Cited by 7 (2 self)
This paper presents a complex approach for the content-based text categorization of printed German business letters into pre-defined message types such as order, invoice, offer, etc. The categorization results of two competing classifiers are combined by means of a voting component embodying knowledge about the strengths and weaknesses of the classifiers. The individual classifiers differ strongly in their basic assumptions: while the first one considers layout and typographic information with respect to certain keywords, the second one is a more conventional text categorization approach which merely incorporates textual features. Since this whole categorization tool is embedded into a document analysis system, a highly precise classification is essential for a subsequent goal-directed extraction of structured information aimed at the integration of the document into the current business workflow of a company.
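The abstract does not say how the voting component is built, so the following is only one plausible reading, sketched under stated assumptions: each classifier returns a (label, confidence) pair, and a hand-filled table records how reliable each classifier is for each message type. The function `vote`, the classifier names "layout" and "text", the default weight 0.5 and all numbers are hypothetical.

```python
def vote(pred_layout, pred_text, reliability, default_weight=0.5):
    """Combine two (label, confidence) predictions with per-classifier, per-label weights.

    reliability maps classifier name -> {label: weight}; it stands in for the
    "knowledge about the strengths and weaknesses of the classifiers".
    """
    scores = {}
    for name, (label, confidence) in (("layout", pred_layout), ("text", pred_text)):
        weight = reliability[name].get(label, default_weight)
        scores[label] = scores.get(label, 0.0) + weight * confidence
    # The label with the highest weighted score wins the vote.
    return max(scores, key=scores.get)


# Hypothetical reliability table: the layout classifier is assumed to be
# trustworthy on invoices, the text classifier on orders.
RELIABILITY = {
    "layout": {"invoice": 0.9, "order": 0.4},
    "text": {"invoice": 0.6, "order": 0.8},
}
```

With this table, `vote(("invoice", 0.7), ("order", 0.6), RELIABILITY)` scores "invoice" at 0.63 and "order" at 0.48, so the layout classifier's answer is kept even though the two classifiers disagree.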
From Paper to a Corporate Memory - A First Step
- In KI-97 Workshop on Knowledge-Based Systems for Knowledge Management in Enterprises, Document D-97-03. Deutsches Forschungszentrum für Künstliche Intelligenz
, 1997
Cited by 5 (2 self)
Computer-based corporate memories aim to enable an efficient use of corporate knowledge.
A visual and interactive tool for optimizing lexical postcorrection of OCR results
- In Proceedings of the IEEE Workshop on Document Image Analysis and Recognition, DIAR’03
, 2003
Cited by 5 (4 self)
Systems for postcorrection of OCR results can be fine-tuned and adapted to new recognition tasks in many respects. One issue is the selection and adaptation of a suitable background dictionary. Another issue is the choice of a correction model, which includes, among other decisions, the selection of an appropriate distance measure for strings and the choice of a scoring function for ranking distinct correction alternatives. When combining the results obtained from distinct OCR engines, further parameters have to be fixed. Due to all these degrees of freedom, adaptation and fine-tuning of systems for lexical postcorrection is a difficult process. Here we describe a visual and interactive tool that semi-automates the generation of ground truth data, partially automates adjustment of parameters, yields active support for error analysis and thus helps to find correction strategies that lead to high accuracy with realistic effort.

