## Tries for Approximate String Matching (1996)

Venue: | IEEE Transactions on Knowledge and Data Engineering |

Citations: | 31 - 1 self |

### BibTeX

@ARTICLE{Shang96triesfor,
  author = {H. Shang and T.H. Merrett},
  title = {Tries for Approximate String Matching},
  journal = {IEEE Transactions on Knowledge and Data Engineering},
  year = {1996},
  volume = {8},
  pages = {540--547}
}

### Abstract

Tries offer text searches with costs which are independent of the size of the document being searched, and so are important for large documents requiring secondary storage. Approximate searches, in which the search pattern differs from the document by k substitutions, transpositions, insertions or deletions, have hitherto been carried out only at costs linear in the size of the document. We present a trie-based method whose cost is independent of document size. Our experiments show that this new method significantly outperforms the nearest competitor for k=0 and k=1, which are arguably the most important cases. The linear cost (in k) of the other methods begins to catch up, for our small files, only at k=2. For larger files, complexity arguments i... spelling checkers), case insensitivity, and limited approximate regular...

H. Shang and T.H. Merrett are at the School of Computer Science, McGill University, Montréal, Québec, Canada H3A 2A7. Email: {shang, tim}@cs.mcgill.ca

### Citations

1358 |
Binary codes capable of correcting deletions insertions and reversals
- Levenshtein
- 1966
Citation Context ...stitution), an insertion, or a deletion. It can also be a transposition, as illustrated above. Such operations were formulated by Damerau [8] and the notion of edit distances was given by Levenshtein [15]. A dynamic programming (DP) algorithm was shown by Wagner and Fischer [26] with O(mn) worst case. Ukkonen [24] improved this to O(kn) (and clearly k ≤ m) by finding a cutoff in the DP. Chang and Lawler... |
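The four edit operations named in this context (substitution, insertion, deletion, transposition) can be illustrated with a short distance function. This is a minimal sketch, assuming the "optimal string alignment" variant in which only adjacent characters may be swapped and each operation costs 1; the function name `osa_distance` is chosen for this example and does not come from the paper.

```python
def osa_distance(a: str, b: str) -> int:
    """Edit distance with substitution, insertion, deletion and
    adjacent transposition, each at cost 1 (optimal string alignment)."""
    m, n = len(a), len(b)
    # d[i][j] holds the distance between a[:i] and b[:j].
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i          # delete all i characters of a[:i]
    for j in range(n + 1):
        d[0][j] = j          # insert all j characters of b[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # match or substitution
            )
            # Adjacent transposition: "ne" <-> "en".
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[m][n]
```

With this definition `osa_distance("snet", "sent")` is 1, because the swapped letters count as a single transposition rather than two substitutions.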

704 |
The string-to-string correction problem
- Wagner, Fischer
- 1974
Citation Context ... illustrated above. Such operations were formulated by Damerau [8] and the notion of edit distances was given by Levenshtein [15]. A dynamic programming (DP) algorithm was shown by Wagner and Fischer [26] with O(mn) worst case. Ukkonen [24] improved this to O(kn) (and clearly k ≤ m) by finding a cutoff in the DP. Chang and Lawler [7] have the same worst case, but get sublinear expected time, O((n/m) k lo... |
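The O(mn) dynamic programme mentioned here fills one table cell per pair of prefixes. The sketch below is an illustrative version for plain Levenshtein distance (no transpositions) that keeps only two rows of the table at a time; the name `edit_distance` is an assumption made for this example.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the classic O(len(a) * len(b)) DP,
    keeping only the previous and current rows in memory."""
    prev = list(range(len(b) + 1))   # distances from "" to each prefix of b
    for i, ca in enumerate(a, 1):
        cur = [i]                    # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # match or substitute
            ))
        prev = cur
    return prev[-1]
```

Under this measure `edit_distance("sent", "snet")` is 2, since a swap of adjacent letters has to be paid for as two single-character edits.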

665 |
Fast pattern matching in strings
- Knuth, Morris, et al.
- 1977
Citation Context ...find that "edit" matches it up to one substitution. Or a writer may transpose two letters at the keyboard, and the intended word, say "sent", should be detected instead of the...
Figure 1: Exact Match Algorithms (worst-case run; preproc. time; extra space; ref.). naive: mn. KMP: 2n; m; m; [13]. BM: 2n − m + 1; O(m + |Σ|); [6, 1]. Shift-or: O(n); O(m + |Σ|); O(|Σ|); [4]. Patricia: O(m); O(n log n); O(n); [10]. |

612 |
A fast string searching algorithm
- Boyer, Moore
- 1977
Citation Context ...ubstitution. Or a writer may transpose two letters at the keyboard, and the intended word, say "sent", should be detected instead of the error, "snet". Applications occur with s...
Figure 1: Exact Match Algorithms (worst-case run; preproc. time; extra space; ref.). naive: mn. KMP: 2n; m; m; [13]. BM: 2n − m + 1; O(m + |Σ|); [6, 1]. Shift-or: O(n); O(m + |Σ|); O(|Σ|); [4]. Patricia: O(m); O(n log n); O(n); [10]. |

568 |
A space economical suffix tree construction algorithm
- McCreight
- 1976
Citation Context ...d Lawler [7] have the same worst case, but get sublinear expected time, O((n/m) k log m), and only O(m) space, as opposed to O(m^2) or O(n) for earlier methods. This they do by building a suffix tree [27, 16], which is just a "Patricia" trie (after Morrison [19]), on the pattern as a method of detecting common substrings. Kim and Shawe-Taylor [12] propose an O(m log n) algorithm with O(n) preprocessing. T... |

445 | Linear pattern matching algorithms
- Weiner
- 1973
Citation Context ...d Lawler [7] have the same worst case, but get sublinear expected time, O((n/m) k log m), and only O(m) space, as opposed to O(m^2) or O(n) for earlier methods. This they do by building a suffix tree [27, 16], which is just a "Patricia" trie (after Morrison [19]), on the pattern as a method of detecting common substrings. Kim and Shawe-Taylor [12] propose an O(m log n) algorithm with O(n) preprocessing. T... |

386 |
Techniques for automatically correcting words in text
- Kukich
- 1992
Citation Context ... to evaluate O(k |Σ|^k) DP table entries. We present this method in terms of full-text retrieval, for which both the index and the text must be stored. In applications such as spelling checkers [14], the text is a dictionary, a set of words, and need not be stored separately from the index. These are special cases of what we describe. In such cases, our method offers negative storage overhead, b... |

237 | A new approach to text searching
- Baeza-Yates, Gonnet
- 1992
Citation Context ...rs at the keyboard, and the intended word, say "sent", should be detected instead of the error, "snet". Applications occur with strings other than text: strings of DNA base p...
Figure 1: Exact Match Algorithms (worst-case run; preproc. time; extra space; ref.). naive: mn. KMP: 2n; m; m; [13]. BM: 2n − m + 1; O(m + |Σ|); [6, 1]. Shift-or: O(n); O(m + |Σ|); O(|Σ|); [4]. Patricia: O(m); O(n log n); O(n); [10]. |

229 | A technique for computer detection and correction of spelling errors - Damerau - 1964 |

162 |
Finding Approximate Patterns in Strings
- Ukkonen
- 1985
Citation Context ...were formulated by Damerau [8] and the notion of edit distances was given by Levenshtein [15]. A dynamic programming (DP) algorithm was shown by Wagner and Fischer [26] with O(mn) worst case. Ukkonen [24] improved this to O(kn) (and clearly k ≤ m) by finding a cutoff in the DP. Chang and Lawler [7] have the same worst case, but get sublinear expected time, O((n/m) k log m), and only O(m) space, as oppose... |
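The cutoff idea mentioned in this context can be sketched as follows: while scanning the text, only the rows of the current DP column up to the last entry that is still at most k are recomputed, since deeper rows cannot yield a match yet. This is a hedged illustration in the form the heuristic is usually presented in surveys, not code from the cited papers; it uses plain Levenshtein distance (no transpositions), and the name `find_within_k` is made up for this example.

```python
def find_within_k(pattern: str, text: str, k: int) -> list[int]:
    """Return 0-based end positions in text where some substring ending
    there is within Levenshtein distance k of pattern."""
    m = len(pattern)
    if m <= k:
        # Deleting the whole pattern costs m <= k, so every position matches.
        return list(range(len(text)))
    col = list(range(m + 1))  # DP column before any text has been read
    last = k + 1              # rows below `last` hold values greater than k
    hits = []
    for pos, ch in enumerate(text):
        diag = 0   # value of row i-1 in the previous column
        above = 0  # value of row i-1 in the current column (row 0 is always 0)
        for i in range(1, last + 1):
            if pattern[i - 1] == ch:
                cur = diag
            else:
                cur = 1 + min(diag, above, col[i])
            diag = col[i]
            col[i] = cur
            above = cur
        # Shrink the active region back to the last row with value <= k.
        while col[last] > k:
            last -= 1
        if last == m:
            hits.append(pos)   # the whole pattern matched within k edits
        else:
            last += 1          # one more row may become active next column
    return hits
```

For exact matching (k = 0) the function reduces to reporting the end position of every occurrence of the pattern.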

141 | Approximate string matching
- Hall, Dowling
- 1980
Citation Context ...wildcard, can be written as a regular expression, but is also a 3-approximate match --- but they do not coincide.) A recent review of these techniques is in the book by Stephen [23]. Hall and Dowling [11] give an early survey of approximate match techniques. The work is all directed to searches in relatively small texts, i.e., those not too large to fit into RAM. For texts that require secondary stora... |

104 |
String Searching Algorithms
- Stephen
- 1994
Citation Context ...where # is a one-place wildcard, can be written as a regular expression, but is also a 3-approximate match --- but they do not coincide.) A recent review of these techniques is in the book by Stephen [23]. Hall and Dowling [11] give an early survey of approximate match techniques. The work is all directed to searches in relatively small texts, i.e., those not too large to fit into RAM. For texts that ... |

60 |
Multidimensional tries used for associative searching
- Orenstein
- 1982
Citation Context ...huge trie on secondary storage. Tries could be represented as trees, with pointers to subtrees, as proposed by Morrison [19], who first came up with the Patricia trie for text searches. Orenstein [21] has a very compact, pointerless representation, which uses two bits per node and which he adapted for secondary storage. Merrett and Shang [18, 22] refined this method and made it workable for Patric... |

55 | Fast and practical approximate string matching
- Baeza-Yates, Perleberg
- 1996
Citation Context ...substrings. Kim and Shawe-Taylor [12] propose an O(m log n) algorithm with O(n) preprocessing. They generate n-grams for the text and represent them as a trie for compactness. Baeza-Yates and Perleberg [5] propose a counting algorithm which runs in time independent of k, O(n + R), where R is bounded O(n) and is zero if all characters in P_m are distinct. Figure 2 summarizes this discussion. Agrep [28] is... |

49 |
Approximate string matching in sublinear expected time
- Chang, Lawler
- 1990
Citation Context ... A dynamic programming (DP) algorithm was shown by Wagner and Fischer [26] with O(mn) worst case. Ukkonen [24] improved this to O(kn) (and clearly k ≤ m) by finding a cutoff in the DP. Chang and Lawler [7] have the same worst case, but get sublinear expected time, O((n/m) k log m), and only O(m) space, as opposed to O(m^2) or O(n) for earlier methods. This they do by building a suffix tree [27, 16], wh... |

40 |
New Indices for Text
- Gonnet, Baeza-Yates, et al.
- 1992
Citation Context ...ed word, say "sent", should be detected instead of the error, "snet". Applications occur with strings other than text: strings of DNA base pairs, strings of musical pitch and ...
Figure 1: Exact Match Algorithms (worst-case run; preproc. time; extra space; ref.). naive: mn. KMP: 2n; m; m; [13]. BM: 2n − m + 1; O(m + |Σ|); [6, 1]. Shift-or: O(n); O(m + |Σ|); O(|Σ|); [4]. Patricia: O(m); O(n log n); O(n); [10]. |

30 |
The myriad virtues of suffix trees
- Apostolico
- 1985
Citation Context ...ubstitution. Or a writer may transpose two letters at the keyboard, and the intended word, say "sent", should be detected instead of the error, "snet". Applications occur with s...
Figure 1: Exact Match Algorithms (worst-case run; preproc. time; extra space; ref.). naive: mn. KMP: 2n; m; m; [13]. BM: 2n − m + 1; O(m + |Σ|); [6, 1]. Shift-or: O(n); O(m + |Σ|); O(|Σ|); [4]. Patricia: O(m); O(n log n); O(n); [10]. |

21 |
Efficient text searching of regular expressions
- Baeza-Yates, Gonnet
- 1989
Citation Context ...vant subtries are cut off, and this gives the exact string search in time proportional only to the length of the string being sought. The algorithm can also be used to search full regular expressions [3]. We have proposed a trie structure which uses two bits per node and has no pointers. Our trie structure is designed for storing very large sets of word strings on secondary storage. The trie is parti... |

18 |
PATRICIA: Practical Algorithm To Retrieve Information Coded In Alphanumeric
- Morrison
- 1968
Citation Context ...xpected time, O((n/m) k log m), and only O(m) space, as opposed to O(m^2) or O(n) for earlier methods. This they do by building a suffix tree [27, 16], which is just a "Patricia" trie (after Morrison [19]), on the pattern as a method of detecting common substrings. Kim and Shawe-Taylor [12] propose an O(m log n) algorithm with O(n) preprocessing. They generate n-grams for the text and represent them as... |

11 |
An approximate string-matching algorithm
- Kim, Shawe-Taylor
- 1992
Citation Context ...arlier methods. This they do by building a suffix tree [27, 16], which is just a "Patricia" trie (after Morrison [19]), on the pattern as a method of detecting common substrings. Kim and Shawe-Taylor [12] propose an O(m log n) algorithm with O(n) preprocessing. They generate n-grams for the text and represent them as a trie for compactness. Baeza-Yates and Perleberg [5] propose a counting algorithm whic... |

11 |
Relational Information Systems
- Merrett
- 1984
Citation Context ...e., those not too large to fit into RAM. For texts that require secondary storage, O(n) is far too slow, and we need O(log n) or faster methods, as with conventional files containing separate records [17]. The price we must pay is to store an index, which must be built once for the whole text (unless the text changes). If we are interested in the text as an ordered sequence of characters, we must stor... |

9 |
Efficient searching of text and pictures
- Gonnet
- 1988
Citation Context ...ex only every word, the index is smaller and compression results. [18] If we do only dictionary searches, as in Section 6, there is great compression. ... simulates the regular expression on the trie [9], and is also fast, O(log m (n) n^α) where α < 1. This paper proposes a k-approximate match algorithm using Damerau-Levenshtein DP on a text represented as a trie. The insight is that the trie represent... |
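The idea of running the edit-distance DP along trie paths, so that words sharing a prefix also share the DP rows for that prefix, can be sketched as below. This is a simplified illustration rather than the authors' algorithm: it uses plain Levenshtein distance without transpositions, an in-memory pointer-based trie over a word list instead of a pointerless trie on secondary storage, and the names `TrieNode`, `build_trie` and `search_trie` are invented for this example.

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # character -> TrieNode
        self.word = None     # set when a stored word ends at this node


def build_trie(words):
    root = TrieNode()
    for w in words:
        node = root
        for ch in w:
            node = node.children.setdefault(ch, TrieNode())
        node.word = w
    return root


def search_trie(root, pattern, k):
    """Return sorted (word, distance) pairs with Levenshtein distance <= k."""
    results = []

    def walk(node, row):
        # row[j] = distance between the path spelled so far and pattern[:j]
        if node.word is not None and row[-1] <= k:
            results.append((node.word, row[-1]))
        for ch, child in node.children.items():
            new_row = [row[0] + 1]
            for j in range(1, len(pattern) + 1):
                new_row.append(min(
                    new_row[j - 1] + 1,                   # skip a pattern character
                    row[j] + 1,                           # skip the trie character
                    row[j - 1] + (pattern[j - 1] != ch),  # match or substitute
                ))
            # Row minima never decrease with depth, so a subtrie whose row
            # minimum already exceeds k cannot contain a match.
            if min(new_row) <= k:
                walk(child, new_row)

    walk(root, list(range(len(pattern) + 1)))
    return sorted(results)
```

The pruning test is what keeps the work tied to the pattern and k rather than to the size of the word list: any branch whose row minimum exceeds k is abandoned without visiting the words below it.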

9 |
Computerized correction of phonographic errors
- Veronis
- 1988
Citation Context ...errors and for some phonetic errors. For example, exsample to example has one difference, but sinary to scenery has three differences. To deal with phonetic misspellings, we may follow Veronis's work [25] by giving weights to edit operations based on phonetic similarity, or using non-integer distances to obtain finer grained scores on both typographic and phonetic similarities. Another solution is to ... |

7 |
Trie methods for representing text
- Merrett, Shang
- 1993
Citation Context ... it contains, as in a dictionary for spelling check, then we need only store the index, and we can often achieve compression as well as retrieval speed. Tries have been used to index very large texts [10, 18] and are the only known truly sublinear way to do so. Tries are trees in which nodes are empty but have a potential subtree for each letter of the alphabet, Σ, encoding the data (e.g., 0 and 1 fo... |

7 | Trie methods for text and spatial data on secondary storage
- Shang
- 1995
Citation Context ... The common prefixes of all sistrings are stored only once in the trie. This gives substantial data compression, and is important when indexing very large texts. Trie methods for text can be found in [10, 18, 22]. Here we describe them only briefly. When constructing a trie over a large number of and extremely long sistrings, we have to consider the representation of a huge trie on secondary storage. Tries co... |
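A toy version of a trie over sistrings (the suffixes of the text, each starting at a different offset) shows how shared prefixes are stored once and how an exact search only follows the pattern's characters. The sketch assumes a small in-memory text and stores every suffix in full, which takes quadratic space, so it illustrates the lookup only and not the compact representations discussed in the source; `Node`, `build_sistring_trie` and `find_occurrences` are names chosen for this example.

```python
class Node:
    __slots__ = ("children", "starts")

    def __init__(self):
        self.children = {}   # character -> Node
        self.starts = []     # start offsets of sistrings passing through here


def build_sistring_trie(text):
    root = Node()
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            if ch not in node.children:
                node.children[ch] = Node()
            node = node.children[ch]
            node.starts.append(start)
    return root


def find_occurrences(root, pattern):
    """Return start offsets of pattern in the indexed text.
    The walk takes len(pattern) steps regardless of the text length.
    An empty pattern returns an empty list in this sketch."""
    node = root
    for ch in pattern:
        node = node.children.get(ch)
        if node is None:
            return []
    return node.starts
```

For the text "banana", the sistrings "anana" and "ana" share the path a-n-a, so looking up "ana" reaches one node that records both start offsets.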

4 |
Fast text searching
- Wu, Manber
- 1992
Citation Context ...erg [5] propose a counting algorithm which runs in time independent of k, O(n + R), where R is bounded O(n) and is zero if all characters in P_m are distinct. Figure 2 summarizes this discussion. Agrep [28] is a package based on related ideas, which also does limited regular expression matching, i.e., P_m is a regular expression. (Regular expression matching and k-approximate string matching solve wo... |

3 |
String Searching Algorithms
- Baeza-Yates
- 1992
Citation Context ...s a development of the simpler problem of exact match: given a text, W_n, of n characters from an alphabet Σ, and a string, P_m, of m characters, m < n, find occurrences of P in W. Baeza-Yates [2] reviews exact match algorithms, and we summarize in Figure 1. Here, all algorithms except the naive approach require some preprocessing. The Knuth-Morris-Pratt (KMP), Boyer-Moore (BM), and Shift-... |