Fast String Correction with Levenshtein-Automata
International Journal of Document Analysis and Recognition, 2002
Cited by 28 (5 self)
The Levenshtein distance between two words is the minimal number of insertions, deletions, or substitutions needed to transform one word into the other. Levenshtein automata of degree n for a word W are defined as finite state automata that recognize the set of all words V where the Levenshtein distance between V and W does not exceed n. We show how to compute, for any fixed bound n and any input word W, a deterministic Levenshtein automaton of degree n for W in time linear in the length of W. Given an electronic dictionary that is implemented in the form of a trie or a finite state automaton, the Levenshtein automaton for W can be used to control search in the lexicon in such a way that exactly those lexical words V are generated for which the Levenshtein distance between V and W does not exceed the given bound. This leads to a very fast method for correcting corrupted input words of unrestricted text using large electronic dictionaries. We then introduce a second method that avoids the explicit computation of Levenshtein automata and further improves efficiency. We also describe how to extend both methods to variants of the Levenshtein distance where further primitive edit operations (transpositions, merges, and splits) may be used.
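As a rough illustration of the definition in this abstract, the following sketch computes the Levenshtein distance with the textbook dynamic-programming recurrence and filters a word list by the bound n. It is a naive baseline that scans every lexicon entry, not the automaton construction or the trie-guided search the paper describes. The function names `levenshtein` and `candidates` and the sample lexicon are assumptions made for this example.

```python
from typing import Iterable, List


def levenshtein(v: str, w: str) -> int:
    """Minimal number of insertions, deletions, or substitutions turning v into w."""
    # prev[j] holds the distance between the processed prefix of v and w[:j].
    prev = list(range(len(w) + 1))
    for i, cv in enumerate(v, start=1):
        curr = [i]  # distance from v[:i] to the empty prefix of w
        for j, cw in enumerate(w, start=1):
            cost = 0 if cv == cw else 1
            curr.append(min(
                prev[j] + 1,         # delete cv
                curr[j - 1] + 1,     # insert cw
                prev[j - 1] + cost,  # substitute cv by cw (or keep on a match)
            ))
        prev = curr
    return prev[-1]


def candidates(word: str, lexicon: Iterable[str], n: int) -> List[str]:
    """Hypothetical helper: lexicon entries within distance n of word (brute force)."""
    return [v for v in lexicon if levenshtein(v, word) <= n]


if __name__ == "__main__":
    print(levenshtein("kitten", "sitting"))
    print(candidates("hous", ["house", "mouse", "horse", "hose"], 1))
```

This baseline costs time proportional to the lexicon size times the product of the word lengths for every query, which is the cost the automaton-based search in the paper is designed to avoid.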
Zur Morphologie und Semantik von Nominalkomposita
Cited by 21 (0 self)
Compounding suffixes (Fugenmorpheme / Fugenelemente) in German compound nouns have to be encoded as morphological features of the potential first constituents. This paper contains a description of the methods and the results of the encoding of compounding suffixes for a complete electronic dictionary of German nouns.
Orthographic errors in web pages: Towards cleaner web corpora
Computational Linguistics, 2006
Cited by 9 (3 self)
Since the Web by far represents the largest public repository of natural language texts, recent experiments, methods, and tools in the area of corpus linguistics often use the Web as a corpus. For applications where high accuracy is crucial, the problem has to be faced that a non-negligible number of orthographic and grammatical errors occur in Web documents. In this article we investigate the distribution of orthographic errors of various types in Web pages. As a byproduct, methods are developed for efficiently detecting erroneous pages and for marking orthographic errors in acceptable Web documents, thus reducing the number of errors in corpora and linguistic knowledge bases automatically retrieved from the Web.
Aiding Web Searches by Statistical Classification Tools
Informationskompetenz - Basiskompetenz in der Informationsgesellschaft. Proc. 7, 2000
Cited by 4 (4 self)
We describe an infrastructure for the collection and management of large amounts of text, and discuss the possibility of information extraction and visualisation from text corpora with statistical methods. The paper gives an overview of processing steps, the contents of our text databases as well as different query facilities. Our focus is on the extraction and visualisation of collocations and their usage for aiding web searches.