Results 1 – 5 of 5
Fast String Correction with Levenshtein Automata
 International Journal of Document Analysis and Recognition
, 2002
Abstract

Cited by 27 (4 self)
The Levenshtein distance between two words is the minimal number of insertions, deletions or substitutions that are needed to transform one word into the other. Levenshtein automata of degree n for a word W are defined as finite state automata that recognize the set of all words V where the Levenshtein distance between V and W does not exceed n. We show how to compute, for any fixed bound n and any input word W, a deterministic Levenshtein automaton of degree n for W in time linear in the length of W. Given an electronic dictionary that is implemented in the form of a trie or a finite state automaton, the Levenshtein automaton for W can be used to control search in the lexicon in such a way that exactly the lexical words V are generated where the Levenshtein distance between V and W does not exceed the given bound. This leads to a very fast method for correcting corrupted input words of unrestricted text using large electronic dictionaries. We then introduce a second method that avoids the explicit computation of Levenshtein automata and leads to even greater efficiency. We also describe how to extend both methods to variants of the Levenshtein distance where further primitive edit operations (transpositions, merges and splits) may be used.
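The automaton described above can be approximated with a simple NFA simulation: a state (i, e) records that the first i characters of W have been accounted for using e edits. The sketch below is illustrative only, not the paper's linear-time deterministic construction; the function and variable names are our own.

```python
def edit_distance_at_most(w, v, n):
    """Simulate the degree-n Levenshtein automaton of w on input v.

    A state (i, e) means: the first i characters of w have been accounted
    for using e edit operations. Returns True iff lev(w, v) <= n.
    """
    def closure(states):
        # Epsilon moves: deleting a character of w costs one edit.
        stack = list(states)
        while stack:
            i, e = stack.pop()
            if i < len(w) and e < n and (i + 1, e + 1) not in states:
                states.add((i + 1, e + 1))
                stack.append((i + 1, e + 1))
        return states

    states = closure({(0, 0)})
    for c in v:
        nxt = set()
        for i, e in states:
            if i < len(w) and w[i] == c:
                nxt.add((i + 1, e))          # match: consume c for free
            if e < n:
                nxt.add((i, e + 1))          # insertion: c is extra in v
                if i < len(w):
                    nxt.add((i + 1, e + 1))  # substitution of w[i] by c
        states = closure(nxt)
        if not states:
            return False                     # no way to stay within n edits
    return any(i == len(w) for i, _ in states)
```

Running this state set while descending a dictionary trie, and pruning any branch whose state set becomes empty, is essentially how such an automaton can control lexicon search as the abstract describes.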
Orthographic errors in web pages: Towards cleaner web corpora
 Computational Linguistics
, 2006
Abstract

Cited by 9 (3 self)
Since the Web by far represents the largest public repository of natural language texts, recent experiments, methods, and tools in the area of corpus linguistics often use the Web as a corpus. For applications where high accuracy is crucial, one has to face the problem that a non-negligible number of orthographic and grammatical errors occur in Web documents. In this article we investigate the distribution of orthographic errors of various types in Web pages. As a byproduct, methods are developed for efficiently detecting erroneous pages and for marking orthographic errors in acceptable Web documents, thus reducing the number of errors in corpora and linguistic knowledge bases automatically retrieved from the Web.
Tools for Automatic Lexicon Maintenance: Acquisition, Error Correction, and the Generation of Missing Values
 In: Proc. First International Conference on Language Resources & Evaluation (LREC)
, 1998
Abstract

Cited by 8 (3 self)
The paper describes the algorithmic methods used in a German monolingual lexicon project dealing with a multimillion-entry lexicon. We describe the usability of different kinds of information that can be extracted from the lexicon: for German nouns and adjectives, candidates for their inflection classes are automatically detected. Forms which do not fit into these classes are good error candidates. An n-gram model is used to find unusual combinations of letters, which also indicate an error or foreign-language entries. Regularity is exploited, especially for compounds, to get inflection information. In all algorithms, frequency information is used to select terms for correction. Quality information is attached to all entries. Generation and use of this quality information gives automatic control over both the data and the correctness of the algorithms. The algorithms are designed to be language independent. Language-specific data (such as inflection classes and n-grams) should be available or relatively e...
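The n-gram check mentioned above can be sketched as follows. The character-trigram window, boundary markers, and zero-frequency threshold are illustrative assumptions of this sketch, not the project's actual parameters, and the function names are our own.

```python
from collections import Counter


def trigram_counts(lexicon):
    """Character-trigram frequencies over a lexicon, with boundary markers
    so word-initial and word-final letter combinations are modelled too."""
    counts = Counter()
    for word in lexicon:
        padded = f"##{word.lower()}#"
        for i in range(len(padded) - 2):
            counts[padded[i:i + 3]] += 1
    return counts


def suspicious(word, counts, threshold=0):
    """Flag a word as an error candidate if it contains a trigram whose
    lexicon frequency does not exceed `threshold` (0 = never seen)."""
    padded = f"##{word.lower()}#"
    return any(counts[padded[i:i + 3]] <= threshold
               for i in range(len(padded) - 2))
```

In practice the threshold would be tuned against the lexicon's size, and flagged entries would be handed to the correction step together with the frequency information the abstract mentions.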
UNITEX-PB, a set of flexible language resources for Brazilian Portuguese
Abstract

Cited by 4 (2 self)
This work documents the design and development of various computational linguistic resources that support the Brazilian Portuguese language according to the formal methodology used by the corpus processing system called UNITEX. The delivered resources include computational lexicons, libraries to access compressed lexicons, and additional tools to validate those resources.
Efficient Dictionary-Based Text Rewriting using
Abstract
Problems in the area of text and document processing can often be described as text rewriting tasks: given an input text, produce a new text by applying some fixed set of rewriting rules. In its simplest form, a rewriting rule is given by a pair of strings, representing a source string (the "original") and its substitute. By a rewriting dictionary, we mean a finite list of such pairs; dictionary-based text rewriting means to replace in an input text occurrences of originals by their substitutes. We present an efficient method for constructing, given a rewriting dictionary D, a subsequential transducer T that accepts any text t as input and outputs the intended rewriting result t′ under the so-called "leftmost-longest match" replacement with skips. The time needed to compute the transducer is linear in the size of the input dictionary. Given the transducer, any text t of length |t| is rewritten in a deterministic manner in time O(|t| + |t′|), where t′ denotes the resulting output text. Hence the resulting rewriting mechanism is very efficient. As a second advantage, using standard tools, the transducer can be directly composed with other transducers to efficiently solve more complex rewriting tasks in a single processing step.
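The "leftmost-longest match with skips" semantics can be demonstrated with a naive scan. This quadratic sketch only illustrates the replacement behaviour; the paper's contribution is a subsequential transducer computing the same result in O(|t| + |t′|). The function name is our own.

```python
def rewrite(text, dictionary):
    """Leftmost-longest-match rewriting with skips: at each position, apply
    the longest dictionary original that matches there; if none matches,
    copy one character (a "skip") and move on."""
    keys = sorted(dictionary, key=len, reverse=True)  # longest match wins
    out = []
    i = 0
    while i < len(text):
        for k in keys:
            if k and text.startswith(k, i):
                out.append(dictionary[k])
                i += len(k)
                break
        else:
            out.append(text[i])  # skip: no original starts here
            i += 1
    return "".join(out)
```

For example, with the dictionary {"a": "x", "ab": "y"}, the input "ab" rewrites to "y" rather than "xb", because the longer original wins at the leftmost match position.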