Results 1 - 10
of
63
A Guided Tour to Approximate String Matching
- ACM Computing Surveys
, 1999
"... We survey the current techniques to cope with the problem of string matching allowing errors. This is becoming a more and more relevant issue for many fast growing areas such as information retrieval and computational biology. We focus on online searching and mostly on edit distance, explaining t ..."
Abstract
-
Cited by 306 (38 self)
- Add to MetaCart
We survey the current techniques to cope with the problem of string matching allowing errors. This is becoming a more and more relevant issue for many fast growing areas such as information retrieval and computational biology. We focus on online searching and mostly on edit distance, explaining the problem and its relevance, its statistical behavior, its history and current developments, and the central ideas of the algorithms and their complexities. We present a number of experiments to compare the performance of the different algorithms and show which are the best choices according to each case. We conclude with some future work directions and open problems. 1
A Fast Bit-Vector Algorithm for Approximate String Matching Based on Dynamic Programming
- Journal of the ACM
, 1999
"... The approximate string matching problem is to find all locations at which a query of length m matches a substring of a text of length n with k-or-fewer differences. Simple and practical bitvector algorithms have been designed for this problem, most notably the one used in agrep. These algorithms co ..."
Abstract
-
Cited by 121 (1 self)
- Add to MetaCart
The approximate string matching problem is to find all locations at which a query of length m matches a substring of a text of length n with k-or-fewer differences. Simple and practical bitvector algorithms have been designed for this problem, most notably the one used in agrep. These algorithms compute a bit representation of the current state-set of the k-difference automaton for the query, and asymptotically run in O(nmk=w) time where w is the word size of the machine (e.g. 32 or 64 in practice). Here we present an algorithm of comparable simplicity that requires only O(nm=w) time by virtue of computing a bit representation of the relocatable dynamic programming matrix for the problem. Thus the algorithm's performance is independent of k, and it is found to be more efficient than the previous results for many useful choices of k and small m. Moreover, because the algorithm is not dependent on k, it can be used to rapidly compute blocks of the dynamic programming matrix as in the ...
Faster Approximate String Matching
- Algorithmica
, 1999
"... We present a new algorithm for on-line approximate string matching. The algorithm is based on the simulation of a non-deterministic finite automaton built from the pattern and using the text as input. This simulation uses bit operations on a RAM machine with word length w = \Omega\Gamma137 n) bits, ..."
Abstract
-
Cited by 69 (23 self)
- Add to MetaCart
We present a new algorithm for on-line approximate string matching. The algorithm is based on the simulation of a non-deterministic finite automaton built from the pattern and using the text as input. This simulation uses bit operations on a RAM machine with word length w = \Omega\Gamma137 n) bits, where n is the text size. This is essentially similar to the model used in Wu and Manber's work, although we improve the search time by packing the automaton states differently. The running time achieved is O(n) for small patterns (i.e. whenever mk = O(log n)), where m is the pattern length and k ! m the number of allowed errors. This is in contrast with the result of Wu and Manber, which is O(kn) for m = O(log n). Longer patterns can be processed by partitioning the automaton into many machine words, at O(mk=w n) search cost. We allow generalizations in the pattern, such as classes of characters, gaps and others, at essentially the same search cost. We then explore other novel techniques t...
An Efficient Index Structure for String Databases
- In VLDB
, 2001
"... We consider the problem of substring searching in large databases. Typical applications of this problem are genetic data, web data, and event sequences. Since the size of such databases grows exponentially, it becomes impractical to use inmemory algorithms for these problems. In this paper, we ..."
Abstract
-
Cited by 59 (8 self)
- Add to MetaCart
We consider the problem of substring searching in large databases. Typical applications of this problem are genetic data, web data, and event sequences. Since the size of such databases grows exponentially, it becomes impractical to use inmemory algorithms for these problems. In this paper, we propose to map the substrings of the data into an integer space with the help of wavelet coefficients. Later, we index these coefficients using MBRs (Minimum Bounding Rectangles). We define a distance function which is a lower bound to the actual edit distance between strings. We experiment with both nearest neighbor queries and range queries. The results show that our technique prunes significant amount of the database (typically 50-95%), thus reducing both the disk I/O cost and the CPU cost significantly. 1
Proximity Matching Using Fixed-Queries Trees
"... . We present a new data structure, called the fixed-queries tree, for the problem of finding all elements of a fixed set that are close, under some distance function, to a query element. Fixedqueries trees can be used for any distance function, not necessarily even a metric, as long as it satisfies ..."
Abstract
-
Cited by 58 (5 self)
- Add to MetaCart
. We present a new data structure, called the fixed-queries tree, for the problem of finding all elements of a fixed set that are close, under some distance function, to a query element. Fixedqueries trees can be used for any distance function, not necessarily even a metric, as long as it satisfies the triangle inequality. We give an analysis of several performance parameters of fixed-queries trees and experimental results that support the analysis. Fixed-queries trees are particularly efficient for applications in which comparing two elements is expensive. 1 Introduction Search structures such as hashing and trees are at the basis of many efficient computer science applications. But they usually support only exact queries. Finding things approximately, that is, allowing some errors in the query specifications, is much harder. The first question that a prominent biologist once asked one of the authors when finding that he is a computer scientist is whether it is possible to adapt bina...
Approximate string matching over suffix trees
- PROCEEDINGS OF THE 4TH ANNUAL SYMPOSIUM ON COMBINATORIAL PATTERN MATCHING, NUMBER 684 IN LECTURE NOTES IN COMPUTER SCIENCE
, 1993
"... The classical approximate string-matching problem of finding the locations of approximate occurrences P 0 of pattern string P in text string T such that the edit distance between P and P 0 is k is considered. We concentrate on the special case in which T is available for preprocessing before the se ..."
Abstract
-
Cited by 53 (1 self)
- Add to MetaCart
The classical approximate string-matching problem of finding the locations of approximate occurrences P 0 of pattern string P in text string T such that the edit distance between P and P 0 is k is considered. We concentrate on the special case in which T is available for preprocessing before the searches with varying P and k. It is shown how the searches can be done fast using the suffix tree of T augmented with the suffix links as the preprocessed form of T and applying dynamic programming over the tree. Three variations of the search algorithm are developed with running times O(mq + n), O(mq log q + size of the output), and O(m
Indexing Methods for Approximate String Matching
- IEEE Data Engineering Bulletin
, 2000
"... Indexing for approximate text searching is a novel problem receiving much attention because of its applications in signal processing, computational biology and text retrieval, to name a few. We classify most indexing methods in a taxonomy that helps understand their essential features. We show that ..."
Abstract
-
Cited by 46 (9 self)
- Add to MetaCart
Indexing for approximate text searching is a novel problem receiving much attention because of its applications in signal processing, computational biology and text retrieval, to name a few. We classify most indexing methods in a taxonomy that helps understand their essential features. We show that the existing methods, rather than completely different as they are regarded, form a range of solutions whose optimum is usually somewhere in between.
A Hybrid Indexing Method for Approximate String Matching
"... We present a new indexing method for the approximate string matching problem. The method is based on a suffix array combined with a partitioning of the pattern. We analyze the resulting algorithm and show that the average retrieval time is Ç Ò � ÐÓ � Ò,forsome�� that depends on the error fraction t ..."
Abstract
-
Cited by 46 (10 self)
- Add to MetaCart
We present a new indexing method for the approximate string matching problem. The method is based on a suffix array combined with a partitioning of the pattern. We analyze the resulting algorithm and show that the average retrieval time is Ç Ò � ÐÓ � Ò,forsome�� that depends on the error fraction tolerated « and the alphabet size �. Itisshownthat �� for approximately « � � � Ô �,where � � � � ����. Thespace required is four times the text size, which is quite moderate for this problem. We experimentally show that this index can outperform by far all the existing alternatives for indexed approximate searching. These are also the first experiments that compare the different existing schemes.
Suffix Cactus: A Cross between Suffix Tree and Suffix Array
, 1995
"... The suffix cactus is a new alternative to the suffix tree and the suffix array as an index of large static texts. Its size and its performance in searches lies between those of the suffix tree and the suffix array. Structurally, the suffix cactus can be seen either as a compact variation of the suff ..."
Abstract
-
Cited by 38 (2 self)
- Add to MetaCart
The suffix cactus is a new alternative to the suffix tree and the suffix array as an index of large static texts. Its size and its performance in searches lies between those of the suffix tree and the suffix array. Structurally, the suffix cactus can be seen either as a compact variation of the suffix tree or as an augmented suffix array.
Indexing Text with Approximate q-grams
, 2000
"... . We present a new index for approximate string matching. The index collects text q-samples, i.e. disjoint text substrings of length q, at fixed intervals and stores their positions. At search time, part of the text is filtered out by noticing that any occurrence of the pattern must be reflected ..."
Abstract
-
Cited by 36 (11 self)
- Add to MetaCart
. We present a new index for approximate string matching. The index collects text q-samples, i.e. disjoint text substrings of length q, at fixed intervals and stores their positions. At search time, part of the text is filtered out by noticing that any occurrence of the pattern must be reflected in the presence of some text q-samples that match approximately inside the pattern. We show experimentally that the parameterization mechanism of the related filtration scheme provides a compromise between the space requirement of the index and the error level for which the filtration is still efficient. 1

