## Pattern Recognition of Strings with Substitutions, Insertions, Deletions and Generalized Transpositions (1995)

Venue: Pattern Recognition

Citations: 12 (2 self)

### BibTeX

@ARTICLE{Oommen95patternrecognition,

author = {B. J. Oommen and R. K. S. Loke},

title = {Pattern Recognition of Strings with Substitutions, Insertions, Deletions and Generalized Transpositions},

journal = {Pattern Recognition},

year = {1995},

volume = {30},

pages = {30--5}

}

### Abstract

We study the problem of recognizing a string Y which is the noisy version of some unknown string X* chosen from a finite dictionary, H. The traditional case, which has been extensively studied in the literature, is the one in which Y contains substitution, insertion and deletion (SID) errors. Although some work has been done to extend the traditional set of edit operations to include the straightforward transposition of adjacent characters [LW75], the problem is unsolved when the transposed characters are themselves subsequently substituted, as is typical in cursive and typewritten script, in molecular biology and in noisy chain-coded boundaries. In this paper we present the first reported solution to the analytic problem of editing one string, X, to another, Y, using these four edit operations. A scheme for obtaining the optimal edit operations has also been given. Both these solutions are optimal for the infinite alphabet case. Using these algorithms we present a syntactic pattern r...
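The four edit operations described in the abstract can be illustrated with a small dynamic program. The following is a minimal sketch under stated assumptions, not the authors' published algorithm: the function names (`distance_sid_gt`, `recognize`), the unit costs, and the particular cost chosen for a generalized transposition (1 for a pure swap plus 0.5 for each swapped character that is then substituted) are all hypothetical choices made for illustration. The recurrence simply adds a fourth, two-characters-at-a-time term to the usual substitution/insertion/deletion table.

```python
from typing import Callable, Sequence

# Illustrative cost functions (assumptions, not taken from the paper).
def unit_sub(a: str, b: str) -> float:
    return 0.0 if a == b else 1.0

def unit_ins(b: str) -> float:
    return 1.0

def unit_del(a: str) -> float:
    return 1.0

def gen_transposition(pair_x: str, pair_y: str) -> float:
    # Swap the two characters of pair_x, then charge 0.5 for each swapped
    # character that still differs from pair_y (a "subsequent substitution").
    mismatches = sum(c != d for c, d in zip(reversed(pair_x), pair_y))
    return 1.0 + 0.5 * mismatches

def distance_sid_gt(
    x: str,
    y: str,
    sub: Callable[[str, str], float] = unit_sub,
    ins: Callable[[str], float] = unit_ins,
    dele: Callable[[str], float] = unit_del,
    gt: Callable[[str, str], float] = gen_transposition,
) -> float:
    """Cost of editing x into y with SID operations plus generalized transpositions."""
    n, m = len(x), len(y)
    z = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):          # delete every character of the prefix of x
        z[i][0] = z[i - 1][0] + dele(x[i - 1])
    for j in range(1, m + 1):          # insert every character of the prefix of y
        z[0][j] = z[0][j - 1] + ins(y[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best = min(
                z[i - 1][j - 1] + sub(x[i - 1], y[j - 1]),  # substitution
                z[i - 1][j] + dele(x[i - 1]),               # deletion
                z[i][j - 1] + ins(y[j - 1]),                # insertion
            )
            if i >= 2 and j >= 2:                           # generalized transposition
                best = min(best, z[i - 2][j - 2] + gt(x[i - 2:i], y[j - 2:j]))
            z[i][j] = best
    return z[n][m]

def recognize(y: str, dictionary: Sequence[str]) -> str:
    """Return the dictionary word that is cheapest to edit into the noisy string y."""
    return min(dictionary, key=lambda word: distance_sid_gt(word, y))

print(distance_sid_gt("ab", "ba"))   # pure transposition: 1.0
print(distance_sid_gt("ab", "bc"))   # transposition, then one substitution: 1.5
print(recognize("recieve", ["receive", "deceive", "believe"]))  # receive
```

With these assumed costs, a swap followed by a substitution is cheaper than two independent substitutions, which is what lets the dictionary lookup prefer words that differ from the noisy input by a garbled transposition. In practice the cost functions would be chosen to reflect the error mechanism being modelled.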

### Citations

1201 |
Binary codes capable of correcting deletions, insertions, and reversals
- Levenshtein
- 1966
Citation Context ...sing the SID edit operations. The first major breakthrough in comparing strings using these three elementary edit transformations was the concept of the Levenshtein metric introduced in coding theory [13], and its computation. The Levenshtein distance between two strings is defined as the minimum number of edit operations required to transform one string into another. Okuda et al. [19] extended this ... |

270 |
A linear space algorithm for computing maximal common subsequences
- Hirschberg
- 1975
Citation Context ...k and Paterson [15] improved the algorithm for the finite alphabet case. Related to these algorithms are the ones used to compute the Longest Common Subsequence (LCS) of two strings due to Hirschberg [3,4], Hunt and Szymanski [5] and others [21]. The complexity of the LCS problem has been studied by Aho et al. [1]. String correction using the GLD as a criterion has been done for strings [2,20,21], sub... |

186 |
On approximate string matching
- Ukkonen
- 1983
Citation Context ... that we have not required the distances associated with the transposition operations to obey any "generalized triangular inequality", make our proof more interesting and different from the proofs of [14,22,23]. Rather, just as in [17] the concept is reminiscent of a control system in which various outputs are computed in terms of th... |

176 | Algorithms for the longest common subsequence problem
- Hirschberg
- 1977
Citation Context ...k and Paterson [15] improved the algorithm for the finite alphabet case. Related to these algorithms are the ones used to compute the Longest Common Subsequence (LCS) of two strings due to Hirschberg [3,4], Hunt and Szymanski [5] and others [21]. The complexity of the LCS problem has been studied by Aho et al. [1]. String correction using the GLD as a criterion has been done for strings [2,20,21], sub... |

167 |
A fast algorithm for computing longest common subsequences
- Hunt, Szymanski
- 1977
Citation Context ...ed the algorithm for the finite alphabet case. Related to these algorithms are the ones used to compute the Longest Common Subsequence (LCS) of two strings due to Hirschberg [3,4], Hunt and Szymanski [5] and others [21]. The complexity of the LCS problem has been studied by Aho et al. [1]. String correction using the GLD as a criterion has been done for strings [2,20,21], substrings [9], dictionarie... |

167 |
A faster algorithm computing string edit distances
- Masek, Paterson
- 1980
Citation Context ...e computation of various string similarity and dissimilarity measures, the underlying graph that has to be traversed is commonly called a trellis. This trellis is 2-dimensional in the case of the GLD [2, 8, 15, 19, 21, 23], the Length of the LCS [3,4,21] and the Length of the Shortest Common Supersequence [8]. Indeed, the same trellis can be traversed using various set operators to yield the Set of the LCSs and the Set... |

131 |
Approximate string matching
- Hall, Dowling
- 1980
Citation Context ...ules where "noisy" versions of protein strings are compared with their exact forms typically stored in large protein databases. The literature on spelling correction is extensive; indeed, the reviews [2,20] list hundreds of publications that tackle the problem from various perspectives. Damerau [See references in 2,20] was probably the first researcher to observe that most of the errors that were found... |

119 |
The string to string correction problem
- Wagner, Fischer
- 1974
Citation Context ...stitution, insertion and deletion (SID) errors. Although some work has been done to extend the traditional set of edit operations to include the straightforward transposition of adjacent characters [14] the problem is unsolved when the transposed characters are themselves subsequently substituted, as is typical in cursive and typewritten script, in molecular biology and in noisy chain-coded boundari... |

65 |
Computation of normalized edit distance and applications
- Marzal, Vidal
- 1993
Citation Context ...ment retrieval using noisy keywords, word processing, designing effective spelling checkers and in image processing where the boundary of the image to be recognized is syntactically coded as a string [11]. Inexact sequence comparison is also used extensively in the comparison of biological macro-molecules where "noisy" versions of protein strings are compared with their exact forms typically stored in... |

65 |
Computer programs for detecting and correcting spelling errors
- Peterson
- 1980
Citation Context ...ules where "noisy" versions of protein strings are compared with their exact forms typically stored in large protein databases. The literature on spelling correction is extensive; indeed, the reviews [2,20] list hundreds of publications that tackle the problem from various perspectives. Damerau [See references in 2,20] was probably the first researcher to observe that most of the errors that were found... |

64 | Bounds on the complexity of the longest common subsequence problem
- Aho, Hirschberg, et al.
- 1976
Citation Context ... used to compute the Longest Common Subsequence (LCS) of two strings due to Hirschberg [3,4], Hunt and Szymanski [5] and others [21]. The complexity of the LCS problem has been studied by Aho et al. [1]. String correction using the GLD as a criterion has been done for strings [2,20,21], substrings [9], dictionaries treated as generalized tries [7] and for grammars [21]. Besides these deterministic t... |

40 |
A method for the correction of garbled words based on the Levenshtein metric
- Okuda, Tanaka, et al.
- 1976
Citation Context ... coding theory [13], and its computation. The Levenshtein distance between two strings is defined as the minimum number of edit operations required to transform one string into another. Okuda et al. [19] extended this concept by weighting the edit operations, and many other researchers among whom are Wagner and Fischer [23] generalized it by using edit distances which are symbol dependent. The latter ... |

18 | An effective algorithm for string correction using generalized edit distances
- Kashyap, Oommen
- 1981
Citation Context ...y of the LCS problem has been studied by Aho et al. [1]. String correction using the GLD as a criterion has been done for strings [2,20,21], substrings [9], dictionaries treated as generalized tries [7] and for grammars [21]. Besides these deterministic techniques, various probabilistic methods have been studied in the literature [2,20,10]... |

18 | Approximate string matching, Comput. Surveys - V, Dowling - 1980 |

12 |
Recognition of noisy subsequences using constrained edit distances
- Oommen
- 1987
Citation Context ...rward transposition. Secondly, it permits the distances to be fairly arbitrary -- they can be chosen to reflect the confusion matrix of the garbling mechanism -- as is done in typical PR applications [7,17,21]. Last of all, (but if not the most important), is the relative simplicity of the present scheme -- it is but a straightforward generalization of the Wagner-Fischer algorithm. 3. A note about the modu... |

10 |
The noisy substring matching problem
- Kashyap, Oommen
- 1983
Citation Context ...nd Szymanski [5] and others [21]. The complexity of the LCS problem has been studied by Aho et al. [1]. String correction using the GLD as a criterion has been done for strings [2,20,21], substrings [9], dictionaries treated as generalized tries [7] and for grammars [21]. Besides these deterministic techniques, various probabilistic methods have been studied in the literature [2,20,10]... |

8 |
Constrained string editing
- Oommen
- 1986
Citation Context ...ors to yield the Set of the LCSs and the Set of the Shortest Common Supersequences [8]. The trellis becomes 3-dimensional when one has to compute string probabilities [10], constrained edit distances [16] and correct noisy subsequences [17]. Although the trellis itself is 2-dimensional in the former examples, because the graphs are cycle-free they can be represented and traversed by merely maintaining ... |

8 |
Time Warps, String Edits and Macromolecules: The Theory and Practice of Sequence Comparison, Addison-Wesley
- Sankoff, Kruskal
- 1983
Citation Context ...t. The latter distance is termed as the Generalized Levenshtein Distance (GLD). One of the advantages of the GLD is that it can be made a metric if the individual edit distances obey some constraints [19,21]. Wagner and Fischer [23] also proposed an efficient algorithm for computing this distance using dynamic programming. This algorithm has been proved to be the optimal algorithm for the infinite alphab... |

7 |
A common basis for similarity and dissimilarity measures involving two strings
- Kashyap, Oommen
- 1983
Citation Context ... (X',Y') ∈ Γ_{X,Y}. Also note that the same set of edit operations can be represented by multiple elements in Γ_{X,Y}. This duplication serves as a powerful tool in the proofs of various analytic results [7,8,10,11,18]. Since the Edit Distance between X and Y is the minimum of the sum of the edit distances associated with operations required to change X to Y, this distance, D(X,Y), has the expression : D(X,Y) = min... |

6 |
String correction using probabilistic methods, Pattern Recognition Letters
- Kashyap, Oommen
- 1984
Citation Context ...,20,21], substrings [9], dictionaries treated as generalized tries [7] and for grammars [21]. Besides these deterministic techniques, various probabilistic methods have been studied in the literature [2,20,10]. Although some work has been done to extend the traditional set of SID operations to include the transposition of adjacent c... |

6 |
String Alignment With Substitution, Insertion, Deletion, Squashing, and Expansion Operations
- Oommen
- 1995
Citation Context ... (X',Y') ∈ Γ_{X,Y}. Also note that the same set of edit operations can be represented by multiple elements in Γ_{X,Y}. This duplication serves as a powerful tool in the proofs of various analytic results [7,8,10,11,18]. Since the Edit Distance between X and Y is the minimum of the sum of the edit distances associated with operations required to change X to Y, this distance, D(X,Y), has the expression : D(X,Y) = min... |

5 | Symbolic Channel Modelling for Noisy Channels which Permit Arbitrary Noise Distributions
- Oommen, Kashyap
- 1993
Citation Context ...raditional transposition errors were assumed as in the case of the Lowrance and Wagner [14] algorithm. The dictionary consisted of 342 words obtained as a subset of the 1023 most common English words [7,9,12] augmented with words used in computer literature. The length of all the words in the dictionary was greater than or equal to 7 and the average length of a word was approximately 8.3 characters. From ... |

3 |
A lower bound for the edit distance problem under an arbitrary cost function
- Huang
- 1988
Citation Context ...,Y) ← Z(N, M) END ALGORITHM Distance_SID_GT Remarks 1. The computational complexity of string comparison algorithms is conveniently given by the number of symbol comparisons required by the algorithm [1,6,21]. In this case, the number of symbol comparisons is quadratic. In the body of the main loop, we will need at most four additions and at most four minimizations. The lower bound result claimed in [6] n... |