Results 1-10 of 14
Applications of Approximate Word Matching in Information Retrieval
1997
Cited by 22 (3 self)
As more online databases are integrated into digital libraries, the issue of quality control of the data becomes increasingly important, especially as it relates to the effective retrieval of information. The need to discover and reconcile variant forms of strings in bibliographic entries, i.e., authority work, will become more critical in the future. Spelling variants, misspellings, and transliteration differences will all increase the difficulty of retrieving information. Approximate string matching has traditionally been used to help with this problem. In this paper we introduce the notion of approximate word matching and show how it can be used to improve detection and categorization of variant forms.
Identification of Confusable Drug Names: A New Approach and Evaluation Methodology
In Proceedings of COLING 2004, 2004
Cited by 20 (2 self)
This paper addresses the mitigation of medical errors due to the confusion of sound-alike and look-alike drug names. Our approach involves application of two new methods, one based on orthographic similarity ("look-alike") and the other based on phonetic similarity ("sound-alike"). We present a new recall-based evaluation methodology for determining the effectiveness of different similarity measures on drug names. We show that the new orthographic measure (BISIM) outperforms other commonly used measures of similarity on a set containing both look-alike and sound-alike pairs, and that the feature-based phonetic approach (ALINE) outperforms orthographic approaches on a test set containing solely sound-alike confusion pairs. However, an approach that combines several different measures achieves the best results on both test sets.
Pattern Recognition of Strings With Substitutions, Insertions, Deletions and Generalized Transpositions
Pattern Recognition
Cited by 12 (2 self)
We study the problem of recognizing a string Y which is the noisy version of some unknown string X* chosen from a finite dictionary, H. The traditional case, which has been extensively studied in the literature, is the one in which Y contains substitution, insertion and deletion (SID) errors. Although some work has been done to extend the traditional set of edit operations to include the straightforward transposition of adjacent characters [14], the problem is unsolved when the transposed characters are themselves subsequently substituted, as is typical in cursive and typewritten script, in molecular biology and in noisy chain-coded boundaries. In this paper we present the first reported solution to the analytic problem of editing one string, X, to another, Y, using these four edit operations. A scheme for obtaining the optimal edit operations has also been given. Both these solutions are optimal for the infinite alphabet case. Using these algorithms we present a syntactic pattern rec...
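The "straightforward transposition of adjacent characters" mentioned in this abstract can be illustrated with a small sketch. The code below is not the paper's algorithm: it does not handle generalized transpositions (where the swapped characters are then substituted). It is the well-known restricted variant often called optimal string alignment, with unit costs assumed for all four operations. The function names `osa_distance` and `recognize` are hypothetical.

```python
def osa_distance(x: str, y: str) -> int:
    """Unit-cost edit distance allowing substitution, insertion, deletion,
    and transposition of two adjacent characters (restricted variant)."""
    n, m = len(x), len(y)
    # d[i][j] = distance between the first i chars of x and the first j of y
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i  # delete all i characters
    for j in range(m + 1):
        d[0][j] = j  # insert all j characters
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if x[i - 1] == y[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # adjacent transposition: x[i-2:i] is y[j-2:j] reversed
            if i > 1 and j > 1 and x[i - 1] == y[j - 2] and x[i - 2] == y[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[n][m]


def recognize(noisy: str, dictionary: list[str]) -> str:
    """Return the dictionary word closest to the noisy string (first on ties)."""
    return min(dictionary, key=lambda word: osa_distance(word, noisy))
```

With a dictionary H given as a plain list, `recognize(Y, H)` returns the entry with the smallest distance to the noisy string Y; a real system would use operation costs tuned to the noise source rather than unit costs.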
The Normalized String Editing Problem Revisited
IEEE Transactions on Pattern Analysis and Machine Intelligence, 1996
Cited by 11 (1 self)
Marzal and Vidal [8] recently considered the problem of computing the normalized edit distance between two strings, and reported experimental results which demonstrated the use of the measure to recognize handwritten characters. Their paper formulated the theoretical properties of the measure and developed two algorithms to compute it. In this short communication we shall demonstrate how this measure is related to an auxiliary measure already defined in the literature: the inter-string constrained edit distance [10,11,15]. Since the normalized edit distance can be computed efficiently using the latter, the analytic and experimental results reported in [8] can be obtained just as accurately, but more efficiently, using the strategies presented here. I. PROBLEM STATEMENT In the comparison of text patterns, phonemes and biological macromolecules, a question that has attracted much interest is that of quantifying the dissimilarity between strings. A review of such distance measures and ...
Automatic Identification of Cognates, False Friends, and Partial Cognates
2006
Cited by 8 (0 self)
Cognates are words in different languages that have similar spelling and meaning. They can help second-language learners with vocabulary expansion and reading comprehension tasks. Special attention needs to be paid to pairs of words that appear similar but are in fact false friends: they have different meanings in all contexts. Partial cognates are pairs of words in two languages that have the same meaning in some, but not all, contexts. Detecting the actual meaning of a partial cognate in context can be useful for Machine Translation and Computer-Assisted Language Learning tools. Our research on cognate and false-friend words between a pair of languages (French and English in our case) consists in automatically classifying a pair of words from two languages as cognates or false friends. We use Machine Learning techniques with several measures of orthographic similarity as features for classification. We study the impact of selecting different features, averaging them, and combining them through Machine Learning techniques. The methods work on different pairs of languages as long as a small number of annotated pairs of words is provided as training data. In addition to the work done on cognate and false-friend identification we propose a ...
Symbolic Channel Modelling For Noisy Channels Which Permit Arbitrary Noise Distributions
Proc. of the 1993 Int. Symp. on Comp. and Inform. Sci., 1993
Cited by 5 (4 self)
In this paper we present a new model for noisy channels which permit arbitrarily distributed substitution, deletion and insertion errors. Apart from its straightforward applications in string generation and recognition, the model also has potential applications in speech and unidimensional signal processing. The model is specified in terms of a noisy string generation technique. Let A be any finite alphabet and A* be the set of words over A. Given any arbitrary string U ∈ A*, we specify a stochastically consistent scheme by which this word can be transformed into any Y ∈ A*. This is achieved by specifying the process by which U is transformed by performing substitution, deletion and insertion operations. The scheme is shown to be Functionally Complete and stochastically consistent. The probability distributions for these respective operations can be completely arbitrary. Apart from presenting the channel in which all the possible strings in A* can be potentially generated, we also specify ...
A Formal Theory for Optimal and Information Theoretic Syntactic Pattern Recognition
Cited by 4 (2 self)
In this paper we present a foundational basis for optimal and information theoretic syntactic pattern recognition. We do this by developing a rigorous model, M*, for channels which permit arbitrarily distributed substitution, deletion and insertion syntactic errors. More explicitly, if A is any finite alphabet and A* the set of words over A, we specify a stochastically consistent scheme by which a string U ∈ A* can be transformed into any Y ∈ A* by means of arbitrarily distributed substitution, deletion and insertion operations. The scheme is shown to be Functionally Complete and stochastically consistent. Apart from the synthesis aspects, we also deal with the analysis of such a model and derive a technique by which Pr[Y|U], the probability of receiving Y given that U was transmitted, can be computed in cubic time using dynamic programming. One of the salient features of this scheme is that it demonstrates how dynamic programming can be applied to evaluate quantities involv...
String Taxonomy Using Learning Automata
IEEE Transactions on Systems, Man and Cybernetics, 1997
Cited by 3 (0 self)
A typical syntactic pattern recognition (PR) problem involves comparing a noisy string with every element of a dictionary, H. The problem of classification can be greatly simplified if the dictionary is partitioned into a set of subdictionaries. In this case, the classification can be hierarchical: the noisy string is first compared to a representative element of each subdictionary and the closest match within the subdictionary is subsequently located. Indeed, the entire problem of subdividing a set of strings into subsets where each subset contains "similar" strings has been referred to as the "String Taxonomy Problem". To our knowledge there is no reported solution to this problem (see footnote on Page 2). In this paper we shall present a learning-automaton-based solution to string taxonomy. The solution utilizes the Object Migrating Automaton (OMA) whose power in clustering objects and images [33,35] has been reported. The power of the scheme for string taxonomy has been demons...
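The two-stage lookup described in this abstract (compare with one representative per subdictionary, then search only inside the chosen subdictionary) might be sketched as follows. This shows only the hierarchical classification step, not the OMA-based partitioning itself. The partition is assumed to be given as a mapping from representative to members, plain Levenshtein distance is assumed as the dissimilarity measure, and the names `levenshtein` and `classify` are hypothetical.

```python
def levenshtein(x: str, y: str) -> int:
    """Unit-cost substitution/insertion/deletion distance (two-row DP)."""
    prev = list(range(len(y) + 1))
    for i, cx in enumerate(x, start=1):
        curr = [i]
        for j, cy in enumerate(y, start=1):
            curr.append(min(
                prev[j] + 1,               # delete cx
                curr[j - 1] + 1,           # insert cy
                prev[j - 1] + (cx != cy),  # substitute or match
            ))
        prev = curr
    return prev[-1]


def classify(noisy: str, subdictionaries: dict[str, list[str]]) -> str:
    """Stage 1: pick the subdictionary whose representative is closest.
    Stage 2: return the closest word inside that subdictionary only."""
    representative = min(subdictionaries, key=lambda r: levenshtein(r, noisy))
    members = subdictionaries[representative]
    return min(members, key=lambda w: levenshtein(w, noisy))
```

The saving is that stage 2 scans one subdictionary instead of all of H; the price is that a poor choice in stage 1 cannot be corrected later, which is why the quality of the partition matters.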
An Evidential Approach to Query Interface Matching on the Deep Web
Cited by 3 (0 self)
Matching query interfaces is a critical step in data integration across multiple Web databases. The problem is closely related to schema matching that typically exploits different features of schemas. Relying on a particular feature of schemas is not sufficient. We propose an evidential approach to combining multiple matchers using the Dempster-Shafer theory of evidence. First, our approach views the match results of an individual matcher as a source of evidence that provides a level of confidence on the validity of each candidate attribute correspondence. Second, it combines multiple sources of evidence to calculate the overall level of confidence, reflecting the match results of different matchers. Third, it selects the top k attribute correspondences of each source attribute from the target schema. Finally, it uses some heuristics to resolve any conflicts between the attribute correspondences of different source attributes. Our experimental results show that our approach is highly accurate and effective.
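The second step above, combining several sources of evidence, is typically done in Dempster-Shafer theory with Dempster's rule of combination. The sketch below shows that rule for two mass functions; how the paper derives masses from matcher scores is not stated in the abstract, so the example masses and the two-hypothesis frame {"match", "nomatch"} are assumptions for illustration, and `combine` is a hypothetical name.

```python
from itertools import product


def combine(m1: dict, m2: dict) -> dict:
    """Dempster's rule: m(C) is the sum of m1(A) * m2(B) over all A, B with
    A & B == C (C non-empty), renormalized by 1 - K, where K is the total
    mass falling on empty intersections (the conflict)."""
    combined: dict = {}
    conflict = 0.0
    for (a, mass_a), (b, mass_b) in product(m1.items(), m2.items()):
        intersection = a & b
        if intersection:
            combined[intersection] = combined.get(intersection, 0.0) + mass_a * mass_b
        else:
            conflict += mass_a * mass_b
    if 1.0 - conflict <= 1e-12:
        raise ValueError("sources are in total conflict; rule is undefined")
    return {focal: mass / (1.0 - conflict) for focal, mass in combined.items()}


# Assumed example: two matchers each give partial support to "match"
# and leave the remaining mass on the whole frame (ignorance).
MATCH = frozenset({"match"})
FRAME = frozenset({"match", "nomatch"})
matcher_a = {MATCH: 0.6, FRAME: 0.4}
matcher_b = {MATCH: 0.7, FRAME: 0.3}
fused = combine(matcher_a, matcher_b)
```

Focal elements are represented as `frozenset` values so they can serve as dictionary keys and be intersected with `&`. More than two matchers can be fused by applying `combine` repeatedly, since the rule is associative.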
Noisy Subsequence Recognition Using Constrained String Editing Involving Substitutions, Insertions, Deletions and Generalized Transpositions
In ICSC, 1994
Cited by 1 (0 self)
We consider a problem which can greatly enhance the areas of cursive script recognition and the recognition of printed character sequences. This problem involves recognizing words/strings by processing their noisy subsequences. Let X* be any unknown word from a finite dictionary H. Let U be any arbitrary subsequence of X*. We study the problem of estimating X* by processing Y, a noisy version of U. Y contains substitution, insertion, deletion and generalized transposition errors, the latter occurring when transposed characters are themselves subsequently substituted. We solve the noisy subsequence recognition problem by defining and using the constrained edit distance between X ∈ H and Y subject to any arbitrary edit constraint involving the number and type of edit operations to be performed. An algorithm to compute this constrained edit distance has been presented. Using these algorithms we present a syntactic Pattern Recognition (PR) scheme which corrects noisy tex...