Finding Approximate Matches in Large Lexicons
Software: Practice and Experience, 1995
Cited by 34 (5 self)
Approximate string matching is used for spelling correction and personal name matching. In this paper we show how to use string matching techniques in conjunction with lexicon indexes to find approximate matches in a large lexicon. We test several lexicon indexing techniques, including n-grams and permuted lexicons, and several string matching techniques, including string similarity measures and phonetic coding. We propose methods for combining these techniques, and show experimentally that these combinations yield good retrieval effectiveness while keeping index size and retrieval time low. Our experiments also suggest that, in contrast to previous claims, phonetic codings are markedly inferior to string distance measures, which are demonstrated to be suitable for both spelling correction and personal name matching.
KEY WORDS: pattern matching; string indexing; approximate matching; compressed inverted files; Soundex
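The combination described in this abstract, a lexicon index used for coarse candidate selection followed by a finer string distance measure, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and not the paper's implementation: the `#` padding character, the bigram size, the `min_shared` threshold, and the choice of plain Levenshtein distance as the similarity measure.

```python
from collections import defaultdict


def ngrams(word, n=2):
    """Set of character n-grams of word, padded with '#' (an assumed marker)."""
    padded = "#" * (n - 1) + word + "#" * (n - 1)
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def build_index(lexicon, n=2):
    """Inverted index mapping each n-gram to the lexicon words containing it."""
    index = defaultdict(set)
    for word in lexicon:
        for gram in ngrams(word, n):
            index[gram].add(word)
    return index


def edit_distance(a, b):
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


def approximate_matches(query, index, n=2, min_shared=2, max_dist=2):
    """Coarse step: count shared n-grams. Fine step: rank by edit distance."""
    counts = defaultdict(int)
    for gram in ngrams(query, n):
        for word in index.get(gram, ()):
            counts[word] += 1
    candidates = [w for w, c in counts.items() if c >= min_shared]
    scored = ((edit_distance(query, w), w) for w in candidates)
    return sorted((d, w) for d, w in scored if d <= max_dist)


lexicon = ["matching", "marching", "matting", "lexicon", "hatching"]
index = build_index(lexicon)
matches = approximate_matches("matchng", index)
```

Only words sharing at least `min_shared` n-grams with the query reach the more expensive distance computation, which is the point of using the index as a filter.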
Tries for Approximate String Matching
IEEE Transactions on Knowledge and Data Engineering, 1996
Cited by 33 (1 self)
H. Shang and T.H. Merrett are at the School of Computer Science, McGill University, Montréal, Québec, Canada H3A 2A7. Email: {shang, tim}@cs.mcgill.ca
Tries offer text searches with costs which are independent of the size of the document being searched, and so are important for large documents requiring secondary storage. Approximate searches, in which the search pattern differs from the document by k substitutions, transpositions, insertions or deletions, have hitherto been carried out only at costs linear in the size of the document. We present a trie-based method whose cost is independent of document size. Our experiments show that this new method significantly outperforms the nearest competitor for k=0 and k=1, which are arguably the most important cases. The linear cost (in k) of the other methods begins to catch up, for our small files, only at k=2. For larger files, complexity arguments i...
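A rough idea of how a trie can drive a search with at most k errors is sketched below. This is a simplified illustration and not the authors' algorithm: it indexes whole words rather than text suffixes, it counts only substitutions, insertions and deletions (no transpositions), and the `$` end marker and function names are assumptions. Each trie edge extends one row of the edit-distance table, and a subtree is abandoned as soon as every entry in the row exceeds k.

```python
def build_trie(words):
    """Nested-dict trie; the key '$' (assumed absent from words) marks a word end."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node["$"] = word
    return root


def search(trie, pattern, k):
    """Return (word, distance) pairs for trie words within k edits of pattern."""
    results = []
    m = len(pattern)

    def walk(node, row):
        # row[j] = edit distance between the path so far and pattern[:j]
        if "$" in node and row[m] <= k:
            results.append((node["$"], row[m]))
        for ch, child in node.items():
            if ch == "$":
                continue
            new_row = [row[0] + 1]
            for j in range(1, m + 1):
                new_row.append(min(new_row[j - 1] + 1,   # skip a pattern character
                                   row[j] + 1,           # extra character on the path
                                   row[j - 1] + (pattern[j - 1] != ch)))
            if min(new_row) <= k:  # otherwise no extension can get back under k
                walk(child, new_row)

    walk(trie, list(range(m + 1)))
    return sorted(results)


trie = build_trie(["trie", "tree", "tries", "tire", "text"])
hits = search(trie, "trie", 1)
```

Because words sharing a prefix share the rows computed for that prefix, the work depends on how much of the trie survives pruning rather than on the number of stored words.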
Fast String Correction with Levenshtein-Automata
International Journal of Document Analysis and Recognition, 2002
Cited by 32 (5 self)
The Levenshtein distance between two words is the minimal number of insertions, deletions or substitutions that are needed to transform one word into the other. Levenshtein automata of degree n for a word W are defined as finite state automata that recognize the set of all words V where the Levenshtein distance between V and W does not exceed n. We show how to compute, for any fixed bound n and any input word W, a deterministic Levenshtein automaton of degree n for W in time linear in the length of W. Given an electronic dictionary that is implemented in the form of a trie or a finite state automaton, the Levenshtein automaton for W can be used to control search in the lexicon in such a way that exactly the lexical words V are generated where the Levenshtein distance between V and W does not exceed the given bound. This leads to a very fast method for correcting corrupted input words of unrestricted text using large electronic dictionaries. We then introduce a second method that avoids the explicit computation of Levenshtein automata and leads to even improved efficiency. We also describe how to extend both methods to variants of the Levenshtein distance where further primitive edit operations (transpositions, merges and splits) may be used.
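The language accepted by a Levenshtein automaton of degree n for W can be illustrated by simulating a nondeterministic version of it. The sketch below is only that: it is not the paper's linear-time deterministic construction, it does not include the dictionary traversal, and the state encoding `(i, e)` (i characters of W consumed, e errors spent) and the function names are assumptions.

```python
def epsilon_closure(states, word_len, n):
    """Add states reachable by deleting characters of W (no input consumed)."""
    closure = set(states)
    stack = list(states)
    while stack:
        i, e = stack.pop()
        if i < word_len and e < n:
            nxt = (i + 1, e + 1)
            if nxt not in closure:
                closure.add(nxt)
                stack.append(nxt)
    return closure


def step(states, ch, word, n):
    """Consume one input character ch from every active state."""
    nxt = set()
    for i, e in states:
        if i < len(word) and word[i] == ch:
            nxt.add((i + 1, e))          # match
        if e < n:
            nxt.add((i, e + 1))          # ch is an extra (inserted) character
            if i < len(word):
                nxt.add((i + 1, e + 1))  # ch substitutes word[i]
    return epsilon_closure(nxt, len(word), n)


def accepts(word, candidate, n):
    """True iff the Levenshtein distance between word and candidate is at most n."""
    states = epsilon_closure({(0, 0)}, len(word), n)
    for ch in candidate:
        states = step(states, ch, word, n)
        if not states:
            return False  # no active state left: the bound is already exceeded
    return any(i == len(word) for i, _ in states)


within_two = accepts("automaton", "automata", 2)
```

The early exit when the state set becomes empty is what makes this kind of automaton useful for pruning a dictionary traversal: a prefix that kills all states cannot lead to any word within the bound.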
Fast Approximate Search in Large Dictionaries
Computational Linguistics, 2004
Cited by 17 (4 self)
The need to correct garbled strings arises in many areas of natural language processing. If a dictionary is available that covers all possible input tokens, a natural set of candidates for correcting an erroneous input P is the set of all words in the dictionary for which the Levenshtein distance to P does not exceed a given (small) bound k. In this article we describe methods for efficiently selecting such candidate sets. After introducing as a starting point a basic correction method based on the concept of a "universal Levenshtein automaton," we show how two filtering methods known from the field of approximate text search can be used to improve the basic procedure in a significant way. The first method, which uses standard dictionaries plus dictionaries with reversed words, leads to very short correction times for most classes of input strings. Our evaluation results demonstrate that correction times for fixed-distance bounds depend on the expected number of correction candidates, which decreases for longer input words. Similarly, the choice of an optimal filtering method depends on the length of the input words.
Indexing Methods for Approximate Text Retrieval (Extended Abstract)
IEEE Data Engineering Bulletin, 1997
Cited by 7 (2 self)
While the problem of online approximate string matching is well studied, only recently the first offline indexing techniques have emerged. We study the different indexing mechanisms for this problem, proposing a taxonomy to classify them. We also propose and analyze two new techniques which are adaptations of recent online algorithms. For the final version we plan to experimentally compare all the algorithms in terms of index construction time, space overhead, query efficiency and tolerance to errors, determining the best compromises for each case.
1 Introduction
Approximate string matching is a recurrent problem in many branches of computer science, with applications to text searching, computational biology, pattern recognition, signal processing, etc. The problem can be stated as follows: given a long text of length n, and a (comparatively short) pattern of length m, retrieve all the segments ...
Multiple Filtration and Approximate Pattern Matching
© 1995 Springer-Verlag New York Inc.
Abstract. Given a text of length n and a query of length q, we present an algorithm for finding all locations of m-tuples in the text and in the query that differ by at most k mismatches. This problem is motivated by the dot-matrix constructions for sequence comparison and optimal oligonucleotide probe selection routinely used in molecular biology. In the case q = m the problem coincides with the classical approximate string matching with k mismatches problem. We present a new approach to this problem based on multiple hashing, which may have advantages over some sophisticated and theoretically efficient methods that have been proposed. This paper describes a two-stage process. The first stage (multiple filtration) uses a new technique to preselect roughly similar m-tuples. The second stage compares these m-tuples using an accurate method. We demonstrate the advantages of multiple filtration in comparison with other techniques for approximate pattern matching.
Key Words. String matching, Computational molecular biology.
1. Introduction. Suppose we are given a string of length n, T[1..n], called the text, a shorter string of length q, Q[1..q], called the query, and integers k and m. The substring matching problem with k mismatches [CL] is to find all "starting"
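The two-stage idea (a cheap filter that preselects candidate positions, followed by exact verification) can be illustrated for the special case q = m. The sketch below is not the paper's multiple filtration technique. It swaps in the simpler pigeonhole filter: if a window differs from the pattern in at most k positions, then at least one of k+1 contiguous pattern blocks must occur in it exactly. The function names are assumptions.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def k_mismatch_positions(text, pattern, k):
    """Start positions where pattern matches text with at most k mismatches."""
    m = len(pattern)
    if m == 0 or m > len(text):
        return []
    last_start = len(text) - m
    # Stage 1 (filter): cut the pattern into k+1 blocks and look for exact
    # occurrences of each block; every true match contains at least one.
    bounds = [m * i // (k + 1) for i in range(k + 2)]
    candidates = set()
    for s, e in zip(bounds, bounds[1:]):
        if s == e:
            # An empty block filters nothing, so every position is a candidate.
            candidates = set(range(last_start + 1))
            break
        block = pattern[s:e]
        pos = text.find(block)
        while pos != -1:
            start = pos - s  # align the block with its offset in the pattern
            if 0 <= start <= last_start:
                candidates.add(start)
            pos = text.find(block, pos + 1)
    # Stage 2 (verify): compare each surviving window exactly.
    return sorted(p for p in candidates
                  if hamming(text[p:p + m], pattern) <= k)


positions = k_mismatch_positions("ACGTACGTTCGA", "ACGA", 1)
```

In this example the filter leaves only three windows to verify, where a direct scan would check nine.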
Multiple Filtration and Approximate Pattern Matching
P. A. Pevzner and M. S. Waterman, 1995
Abstract. Given a text of length n and a query of length q, we present an algorithm for finding all locations of m-tuples in the text and in the query that differ by at most k mismatches. This problem is motivated by the dot-matrix constructions for sequence comparison and optimal oligonucleotide probe selection routinely used in molecular biology. In the case q = m the problem coincides with the classical approximate string matching with k mismatches problem. We present a new approach to this problem based on multiple hashing, which may have advantages over some sophisticated and theoretically efficient methods that have been proposed. This paper describes a two-stage process. The first stage (multiple filtration) uses a new technique to preselect roughly similar m-tuples. The second stage compares these m-tuples using an accurate method. We demonstrate the advantages of multiple filtration in comparison with other techniques for approximate pattern matching.
Key Words. String matching, Computational molecular biology.
1. Introduction. Suppose we are given a string of length n, T[1..n], called the text, a shorter string of length q, Q[1..q], called the query, and integers k and m. The substring matching problem with k mismatches [CL] is to find all "starting"
Bulgarian Academy of Sciences
Abstract. This paper presents a brief description of the semantic relations included in the Bulgarian wordnet. A complete and decidable formal logic for the wordnet structure is also proposed. This logic provides sufficient expressive power for all important verifications, queries, and consistency and completeness proofs required for wordnet applications. Some parameters concerning Bulgarian synsets and language-internal relations, as well as the distinctive features characterizing the completeness and consistency of the Bulgarian wordnet are