Results 1 - 8 of 8
Fast String Correction with Levenshtein-Automata
- INTERNATIONAL JOURNAL OF DOCUMENT ANALYSIS AND RECOGNITION, 2002
Cited by 19 (3 self)
The Levenshtein-distance between two words is the minimal number of insertions, deletions or substitutions that are needed to transform one word into the other. Levenshtein-automata of degree n for a word W are defined as finite state automata that recognize the set of all words V where the Levenshtein-distance between V and W does not exceed n. We show how to compute, for any fixed bound n and any input word W, a deterministic Levenshtein-automaton of degree n for W in time linear in the length of W. Given an electronic dictionary that is implemented in the form of a trie or a finite state automaton, the Levenshtein-automaton for W can be used to control search in the lexicon in such a way that exactly the lexical words V are generated where the Levenshtein-distance between V and W does not exceed the given bound. This leads to a very fast method for correcting corrupted input words of unrestricted text using large electronic dictionaries. We then introduce a second method that avoids the explicit computation of Levenshtein-automata and further improves efficiency. We also describe how to extend both methods to variants of the Levenshtein-distance where further primitive edit operations (transpositions, merges and splits) may be used.
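The distance definition and the idea of steering a dictionary search by a distance bound can be illustrated with a minimal Python sketch. This is not the paper's Levenshtein-automaton construction: it swaps in the simpler technique of carrying one dynamic-programming row down a trie and pruning branches whose row minimum exceeds n. The names `levenshtein`, `build_trie`, `search`, and the `"$"` end-of-word marker are assumptions made for illustration.

```python
def levenshtein(v: str, w: str) -> int:
    """Minimal number of insertions, deletions or substitutions turning v into w."""
    prev = list(range(len(w) + 1))
    for i, cv in enumerate(v, start=1):
        cur = [i]
        for j, cw in enumerate(w, start=1):
            cur.append(min(prev[j] + 1,                 # delete cv
                           cur[j - 1] + 1,              # insert cw
                           prev[j - 1] + (cv != cw)))   # substitute or match
        prev = cur
    return prev[-1]


def build_trie(words):
    """Nested-dict trie; "$" marks end of word (assumes "$" is not a letter)."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node["$"] = True
    return root


def search(trie, w, n):
    """Return, sorted, exactly the trie words V with distance(V, w) <= n."""
    results = []

    def walk(node, prefix, row):
        if "$" in node and row[-1] <= n:
            results.append(prefix)
        for ch, child in node.items():
            if ch == "$":
                continue
            # Extend the DP row by one dictionary character.
            nxt = [row[0] + 1]
            for j, cw in enumerate(w, start=1):
                nxt.append(min(row[j] + 1, nxt[j - 1] + 1, row[j - 1] + (ch != cw)))
            # Row minima never decrease, so this branch can be pruned safely.
            if min(nxt) <= n:
                walk(child, prefix + ch, nxt)

    walk(trie, "", list(range(len(w) + 1)))
    return sorted(results)
```

For example, `search(build_trie(["food", "good", "mood"]), "fod", 1)` visits only the branches that can still lead to a word within one edit of "fod". A precomputed deterministic automaton, as described in the abstract, would replace the per-node row computation with a table lookup.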
Fast Approximate Search in Large Dictionaries
- COMPUTATIONAL LINGUISTICS, 2004
Cited by 8 (2 self)
The need to correct garbled strings arises in many areas of natural language processing. If a dictionary is available that covers all possible input tokens, a natural set of candidates for correcting an erroneous input P is the set of all words in the dictionary for which the Levenshtein distance to P does not exceed a given (small) bound k. In this article we describe methods for efficiently selecting such candidate sets. After introducing as a starting point a basic correction method based on the concept of a "universal Levenshtein automaton," we show how two filtering methods known from the field of approximate text search can be used to improve the basic procedure significantly. The first method, which uses standard dictionaries plus dictionaries with reversed words, leads to very short correction times for most classes of input strings. Our evaluation results demonstrate that correction times for fixed distance bounds depend on the expected number of correction candidates, which decreases for longer input words. Similarly, the choice of an optimal filtering method depends on the length of the input words.
Multilingual String-to-String Correction in Grif, a structured editor
1992
Cited by 5 (2 self)
This paper describes the integration of a spelling corrector into the structured editor Grif. This corrector is based on the Levenshtein metric concept which is particularly efficient for string correction. This method can be implemented efficiently and can produce good results with short response time on a new RISC workstation even with large dictionaries. The integration within Grif enables checking of textual content of structured documents where large vocabularies are required. Thanks to an attribute language the editor can automatically adapt the correction to the language and can apply a specific word recognition algorithm and dictionaries, thus allowing checking and correcting of multilingual documents. Keywords: spelling correction, integration, multilingualism, structured documents. Introduction Grif is an interactive system for the production of complex documents. It is essentially intended for handling structured documents [Furuta88] [André89] [Quint90]. Thus it is well...
Approximate personal name-matching through finite-state graphs
- Journal of the American Society for Information Science and Technology, 2006
Cited by 5 (1 self)
This article shows how finite-state methods can be employed in a new and different task: the conflation of personal name variants in standard forms. In bibliographic databases and citation index systems, variant forms create problems of inaccuracy that affect information retrieval, the quality of information from databases, and the citation statistics used for the evaluation of scientists’ work. A number of approximate string matching techniques have been developed to validate variant forms, based on similarity and equivalence relations. We classify the personal name variants as nonvalid and valid forms. In establishing an equivalence relation between valid variants and the standard form of its equivalence class, we defend the application of finite-state transducers. The process of variant identification requires the elaboration of: (a) binary matrices and (b) finite-state graphs. This procedure was tested on samples of author names from bibliographic records, selected from the Library and Information Science Abstracts and Science Citation Index Expanded databases. The evaluation involved calculating the measures of precision and recall, based on completeness and accuracy. The results demonstrate the usefulness of this approach, although it should be complemented with methods based on similarity relations for the recognition of spelling variants and misspellings.
Fast retrieval of electronic messages that contain mistyped words or spelling errors
- IEEE Transactions on Systems, Man, and Cybernetics, 1996
Cited by 1 (1 self)
This paper presents an index structure for retrieving electronic messages that contain mistyped words or spelling errors. Given a query string (e.g., a search key), we want to find those messages that approximately contain the query, i.e., certain inserts, deletes and mismatches are allowed when matching the query with a word (or phrase) in the messages. Our approach is to store the messages sequentially in a database and hash their “fingerprints” into a number of “fingerprint files.” When the query is given, its fingerprints are also hashed into the files and a histogram of votes is constructed on the messages. We derive a lower bound, based on which one can prune a large number of nonqualifying messages (i.e., those whose votes are below the lower bound) during searching. The paper presents some experimental results, which demonstrate the effectiveness of the index structure and the lower bound.
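The vote-and-prune idea can be sketched in a few lines of Python. The abstract does not say what a fingerprint is or how the lower bound is derived, so this sketch makes two labeled assumptions: fingerprints are taken to be character q-grams, and the bound is the standard q-gram counting bound (a query of length m matched with at most k errors keeps at least m - q + 1 - k*q of its q-grams). The names `qgrams`, `build_index`, and `candidates` are hypothetical.

```python
from collections import Counter, defaultdict


def qgrams(s, q=2):
    """All overlapping substrings of length q (the assumed "fingerprints")."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def build_index(messages, q=2):
    """Map each q-gram to the set of message ids containing it."""
    index = defaultdict(set)
    for mid, text in enumerate(messages):
        for g in qgrams(text, q):
            index[g].add(mid)
    return index


def candidates(index, num_messages, query, k, q=2):
    """Ids of messages whose vote count reaches the lower bound.

    Survivors still need verification with a real approximate matcher;
    the filter only guarantees that no qualifying message is discarded.
    """
    threshold = len(query) - q + 1 - k * q
    if threshold <= 0:
        # The bound is vacuous, so nothing can be pruned.
        return list(range(num_messages))
    votes = Counter()
    for g in qgrams(query, q):
        for mid in index.get(g, ()):
            votes[mid] += 1
    return sorted(mid for mid, v in votes.items() if v >= threshold)
```

With the messages "please recieve the package", "meeting at noon", and "receive your invoice", the query "receive" with k = 2 yields a threshold of 2, and only the first and third messages collect enough votes to survive.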
A Spelling Checker
One of my concerns when conducting the assessment was that the carer was a single female parent and the child who was to be fostered was a teenage male. My concern was that the carer would be able, firstly to be able to protect the child and fulfill the needs of the young person
Approximate Text Searching
1998
This thesis focuses on the problem of text retrieval allowing errors, also called "approximate" string matching. The problem is to find a pattern in a text, where the pattern and the text may have "errors". This problem has received a lot of attention in recent years because of its applications in many areas, such as information retrieval, computational biology and signal processing, to name a few. The aim of this work is the development and analysis of novel algorithms to deal with the problem under various conditions, as well as a better understanding of the problem itself and its statistical behavior. Although our results are valid in many different areas, we focus our attention on typical text searching for information retrieval applications. This makes some ranges of values for the parameters of the problem more interesting than others. We have divided this presentation into two parts. The first one deals with on-line approximate string matching, i.e. when there is no time or space ...
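As a point of reference for the on-line problem stated above, here is a minimal Python sketch of the classical dynamic-programming baseline (often attributed to Sellers), in which a match may start at any text position. It is not one of the thesis's novel algorithms, and the function name `approx_find` is an assumption made for illustration.

```python
def approx_find(pattern, text, k):
    """Return 1-based end positions in text where pattern matches with <= k errors.

    col[i] holds the minimum number of edits needed to match pattern[:i]
    against some substring of the text ending at the current position.
    """
    m = len(pattern)
    col = list(range(m + 1))
    hits = []
    for pos, ch in enumerate(text, start=1):
        # col[0] stays 0: an occurrence may begin anywhere in the text.
        prev_diag, col[0] = col[0], 0
        for i in range(1, m + 1):
            cur = min(col[i] + 1,                          # extra text character
                      col[i - 1] + 1,                      # missing pattern character
                      prev_diag + (pattern[i - 1] != ch))  # substitution or match
            prev_diag, col[i] = col[i], cur
        if col[m] <= k:
            hits.append(pos)
    return hits
```

The scan takes time proportional to the pattern length times the text length and keeps only one column of the table in memory, which is what makes it usable without any index on the text.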
Computerized Correction of Phonographic Errors
© 1988 by Kluwer Academic Publishers
"... Abstract: When computers are confronted with text (C.A.I., ..."

