Results 1 - 10 of 27
A Guided Tour to Approximate String Matching
ACM Computing Surveys, 1999
Cited by 598 (36 self)
We survey the current techniques to cope with the problem of string matching allowing errors. This is becoming a more and more relevant issue for many fast growing areas such as information retrieval and computational biology. We focus on online searching and mostly on edit distance, explaining the problem and its relevance, its statistical behavior, its history and current developments, and the central ideas of the algorithms and their complexities. We present a number of experiments to compare the performance of the different algorithms and show which are the best choices according to each case. We conclude with some future work directions and open problems.
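As an illustration of online searching under edit distance, the following minimal sketch uses the classic column-wise dynamic programming formulation. The function name `approx_search` and the convention of reporting 0-based end positions are assumptions made for this example, not taken from the survey.

```python
def approx_search(pattern: str, text: str, k: int) -> list[int]:
    """Return 0-based text positions where an occurrence of `pattern`
    with at most k edit errors ends (illustrative sketch)."""
    m = len(pattern)
    # col[i] = minimum edit distance between pattern[:i] and any
    # suffix of the text read so far.
    col = list(range(m + 1))
    ends = []
    for j, c in enumerate(text):
        prev_diag = col[0]  # col[i-1] as it was in the previous column
        # col[0] stays 0, so a match may start at any text position.
        for i in range(1, m + 1):
            old = col[i]
            cost = 0 if pattern[i - 1] == c else 1
            col[i] = min(prev_diag + cost,  # match or substitution
                         old + 1,           # extra character in the text
                         col[i - 1] + 1)    # pattern character skipped
            prev_diag = old
        if col[m] <= k:
            ends.append(j)
    return ends
```

For example, `approx_search("abc", "abd", 1)` reports end positions 1 and 2, since both `ab` and `abd` are within one edit of `abc`. The sketch takes O(mn) time, which is the baseline that faster algorithms aim to beat.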
Better Filtering with Gapped q-Grams
2001
Cited by 90 (2 self)
A popular and well-studied class of filters for approximate string matching compares substrings of length q, the q-grams, in the pattern and the text to identify text areas that contain potential matches. A generalization of the method that uses gapped q-grams instead of contiguous substrings is mentioned a few times in literature but has never been analyzed in any depth. In this paper, we report the first results of a study on gapped q-grams. We show that gapped q-grams can provide orders of magnitude faster and/or more efficient filtering than contiguous q-grams. To achieve these results the arrangement of the gaps in the q-gram and a filter parameter called threshold have to be optimized. Both of these tasks are nontrivial combinatorial optimization problems for which we present efficient solutions. We concentrate on the k mismatches problem, i.e., approximate string matching with the Hamming distance.
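To make the notions of gapped q-gram and threshold concrete, here is a small sketch. It assumes a shape is written as a string such as `#.#` (`#` for a sampled position, `.` for a gap). It also assumes the threshold is the minimum number of q-gram positions that survive any placement of k mismatches in a pattern of length m. The helper names are hypothetical, and the brute-force enumeration is exponential in k; it is not the efficient solution the abstract refers to.

```python
from itertools import combinations


def shape_offsets(shape: str) -> list[int]:
    """Offsets of sampled positions in a shape such as '##.#'."""
    return [i for i, ch in enumerate(shape) if ch == "#"]


def gapped_qgrams(s: str, shape: str) -> list[str]:
    """All gapped q-grams of s for the given shape, in text order."""
    offs = shape_offsets(shape)
    span = len(shape)
    return ["".join(s[i + o] for o in offs)
            for i in range(len(s) - span + 1)]


def brute_force_threshold(shape: str, m: int, k: int) -> int:
    """Minimum number of shape placements left untouched by any
    choice of k mismatch positions (assumed definition of threshold)."""
    offs = shape_offsets(shape)
    starts = range(m - len(shape) + 1)
    best = len(starts)
    for mismatches in combinations(range(m), k):
        bad = set(mismatches)
        surviving = sum(1 for i in starts
                        if all(i + o not in bad for o in offs))
        best = min(best, surviving)
    return best
```

With m = 5 and k = 1, the contiguous shape `###` has threshold 0 under this definition, while the gapped shape `#.#` of the same span keeps threshold 1, which hints at why the arrangement of gaps matters.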
Faster Approximate String Matching
Algorithmica, 1999
Cited by 79 (23 self)
We present a new algorithm for online approximate string matching. The algorithm is based on the simulation of a nondeterministic finite automaton built from the pattern and using the text as input. This simulation uses bit operations on a RAM machine with word length w = Ω(log n) bits, where n is the text size. This is essentially similar to the model used in Wu and Manber's work, although we improve the search time by packing the automaton states differently. The running time achieved is O(n) for small patterns (i.e. whenever mk = O(log n)), where m is the pattern length and k < m the number of allowed errors. This is in contrast with the result of Wu and Manber, which is O(kn) for m = O(log n). Longer patterns can be processed by partitioning the automaton into many machine words, at O((mk/w) n) search cost. We allow generalizations in the pattern, such as classes of characters, gaps and others, at essentially the same search cost. We then explore other novel techniques t...
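For orientation, the sketch below shows a row-wise bit-parallel simulation in the style of the Wu and Manber approach that the abstract compares against, with one bit vector per error level and O(kn) word operations. It is not the state packing proposed in this paper. Python integers are unbounded, so the restriction of the pattern to one machine word is not modelled; the function name is an assumption.

```python
def shift_and_with_errors(pattern: str, text: str, k: int) -> list[int]:
    """0-based end positions of occurrences of `pattern` in `text`
    with at most k edit errors, using one bit vector per error level."""
    m = len(pattern)
    full = (1 << m) - 1
    accept = 1 << (m - 1)
    # masks[c] has bit i set when pattern[i] == c.
    masks: dict[str, int] = {}
    for i, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, 0) | (1 << i)
    # Row i starts with its i lowest bits set: a prefix of length i
    # matches the empty string with i errors.
    rows = [(1 << i) - 1 for i in range(k + 1)]
    ends = []
    for j, c in enumerate(text):
        b = masks.get(c, 0)
        old_prev = rows[0]               # row i-1 before this step
        rows[0] = ((rows[0] << 1) | 1) & b
        for i in range(1, k + 1):
            old_cur = rows[i]
            rows[i] = ((((old_cur << 1) | 1) & b)  # matching character
                       | old_prev                  # extra text character
                       | (old_prev << 1)           # substitution
                       | (rows[i - 1] << 1)        # skipped pattern character
                       | 1) & full
            old_prev = old_cur
        if rows[k] & accept:
            ends.append(j)
    return ends
```

Each text character costs k + 1 vector updates here, which is the O(kn) behaviour quoted for Wu and Manber's result when the pattern fits in a word.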
Indexing Methods for Approximate String Matching
IEEE Data Engineering Bulletin, 2000
Cited by 66 (10 self)
Indexing for approximate text searching is a novel problem receiving much attention because of its applications in signal processing, computational biology and text retrieval, to name a few. We classify most indexing methods in a taxonomy that helps understand their essential features. We show that the existing methods, rather than completely different as they are regarded, form a range of solutions whose optimum is usually somewhere in between.
A Faster Algorithm for Approximate String Matching
Algorithmica, 1996
Cited by 59 (27 self)
We present a new algorithm for online approximate string matching. The algorithm is based on the simulation of a nondeterministic finite automaton built from the pattern and using the text as input. This simulation uses bit operations on a RAM machine with word length O(log n), where n is the maximum size of the text. The running time achieved is O(n) for small patterns (i.e. of length m = O(√(log n))), independently of the maximum number of errors allowed, k. This algorithm is then used to design two general algorithms. One of them partitions the problem into subproblems, while the other partitions the automaton into subautomata. These algorithms are combined to obtain a hybrid algorithm which on average is O(n) for moderate k/m ratios, O(√(mk/log n) n) for medium ratios, and O((m − k)kn/log n) for large ratios. We show experimentally that this hybrid algorithm is faster than previous ones for moderate size of patterns and error ratios, which is the case in text search...
Multiple Approximate String Matching
In Proc. of WADS'97, LNCS 1272, 1997
Cited by 25 (13 self)
We present two new algorithms for online multiple approximate string matching. These are extensions of previous algorithms that search for a single pattern. The single-pattern version of the first one is based on the simulation with bits of a nondeterministic finite automaton built from the pattern and using the text as input. To search for multiple patterns, we superimpose their automata, using the result as a filter. The second algorithm partitions the pattern into subpatterns that are searched with no errors, with a fast exact multipattern search algorithm. To handle multiple patterns, we search the subpatterns of all of them together. The average running time achieved is in both cases O(n) for moderate error level, pattern length and number of patterns. They adapt (with higher costs) to the other cases. However, the algorithms differ in speed and thresholds of usefulness. We analyze theoretically when each algorithm should be used, and show experimentally that they are faster ...
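The partitioning idea can be sketched for a single pattern as follows: if a pattern is cut into k + 1 pieces, any occurrence with at most k errors must contain at least one piece unaltered, so exact hits of the pieces mark the only text areas that need verification. This sketch assumes a single pattern, uses `str.find` per piece as a stand-in for a real exact multipattern search algorithm, and verifies candidates with plain dynamic programming. All function names are hypothetical.

```python
def dp_end_positions(pattern: str, text: str, k: int) -> list[int]:
    """End positions of matches with at most k errors (verification step)."""
    col = list(range(len(pattern) + 1))
    ends = []
    for j, c in enumerate(text):
        prev_diag = col[0]
        for i in range(1, len(pattern) + 1):
            old = col[i]
            cost = 0 if pattern[i - 1] == c else 1
            col[i] = min(prev_diag + cost, old + 1, col[i - 1] + 1)
            prev_diag = old
        if col[-1] <= k:
            ends.append(j)
    return ends


def partition_search(pattern: str, text: str, k: int) -> list[int]:
    """Filter with exact search of k+1 pieces, then verify candidates."""
    m = len(pattern)
    if not 0 <= k < m:
        raise ValueError("need 0 <= k < len(pattern)")
    ends = set()
    for i in range(k + 1):
        off = i * m // (k + 1)
        piece = pattern[off:(i + 1) * m // (k + 1)]
        pos = text.find(piece)
        while pos != -1:
            # An occurrence containing this hit lies inside this window.
            lo = max(0, pos - off - k)
            hi = min(len(text), pos - off + m + k)
            for e in dp_end_positions(pattern, text[lo:hi], k):
                ends.add(lo + e)
            pos = text.find(piece, pos + 1)
    return sorted(ends)
```

When pieces occur rarely in the text, most of the text is never touched by the verification step, which is where the filter gains its speed. As the error level grows, pieces get shorter, hits become frequent, and the filter stops paying off.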
Approximate Text Searching
1998
Cited by 25 (7 self)
This thesis focuses on the problem of text retrieval allowing errors, also called "approximate" string matching. The problem is to find a pattern in a text, where the pattern and the text may have "errors". This problem has received a lot of attention in recent years because of its applications in many areas, such as information retrieval, computational biology and signal processing, to name a few. The aim of this work is the development and analysis of novel algorithms to deal with the problem under various conditions, as well as a better understanding of the problem itself and its statistical behavior. Although our results are valid in many different areas, we focus our attention on typical text searching for information retrieval applications. This makes some ranges of values for the parameters of the problem more interesting than others. We have divided this presentation into two parts. The first one deals with online approximate string matching, i.e. when there is no time or space to preprocess the text. These algorithms are the core of offline algorithms as well. Online searching is the area of the problem where better algorithms existed. We have obtained new bounds for the probability of an approximate match of a pattern in
Multiple Approximate String Matching by Counting
In Proc. WSP'97, 1997
Cited by 24 (10 self)
We present a very simple and efficient algorithm for online multiple approximate string matching. It uses a previously known counting-based filter [9] that searches for a single pattern by quickly discarding uninteresting parts of the text. Our multipattern algorithm is based on the simulation of many parallel filters using bits of the computer word. Our average complexity to search r patterns of length m is O(rn log m / log n), where n is the text size. We can search patterns of different length, each one with a different number of errors. We show experimentally that our algorithm is competitive with the fastest known algorithms, being the fastest for a wide range of intermediate error ratios. We give the first average-case analysis of the filtering efficiency of the counting method, applicable also to [9]. 1 Introduction A number of important problems related to string processing lead to algorithms for approximate string matching: text searching, pattern recognition, computationa...
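A single-pattern counting filter of the kind described can be sketched as below. It assumes one common formulation: slide a window of length m over the text, count how many window characters can be paired with pattern characters, and flag the window when the count reaches m - k. Whether this matches the exact filter cited as [9] is an assumption, and the bit-parallel simulation of many filters is not shown.

```python
from collections import Counter


def counting_filter(pattern: str, text: str, k: int) -> list[int]:
    """0-based end positions of windows that may contain a match with
    at most k errors. Candidates still need verification."""
    m = len(pattern)
    # need[c] = occurrences of c in the pattern minus occurrences
    # of c in the current window (may go negative).
    need = Counter(pattern)
    count = 0  # window characters that can be paired with the pattern
    candidates = []
    for j, c in enumerate(text):
        if need[c] > 0:
            count += 1
        need[c] -= 1
        if j >= m:
            out = text[j - m]  # character leaving the window
            need[out] += 1
            if need[out] > 0:
                count -= 1
        if count >= m - k:
            candidates.append(j)
    return candidates
```

The filter ignores character order, so it can report false candidates: `counting_filter("abc", "cba", 0)` flags position 2 even though `cba` is not an occurrence. Every true occurrence end is flagged, so only the flagged areas need to be checked by a verification algorithm.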
Lempel-Ziv Index for q-Grams
1998
Cited by 19 (2 self)
We present a new sublinear-size index structure for finding all occurrences of a given q-gram in a text. Such a q-gram index is needed in many approximate pattern matching algorithms. All earlier q-gram indexes require at least O(n) space, where n is the length of the text. The new Lempel-Ziv index needs only O(n/log n) space while being as fast as previous methods. The new method takes advantage of repetitions in the text found by Lempel-Ziv parsing. Key Words. q-Gram index, Approximate pattern matching, Text indexing, Lempel-Ziv parsing, String algorithms, Data compression. 1. Introduction. The approximate pattern matching problem is as follows. Given a text T = T[1, n] and a pattern P = P[1, m] in an alphabet Σ and an integer k, find all the text positions i such that an approximate occurrence of P with at most k differences ends at i. The difference between two strings α and β is measured as the edit distance d: d(α, β) is the minimum number of edit operations (in...
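For contrast with the sublinear-size structure, a plain q-gram index of the kind that needs at least linear space can be sketched as a dictionary from each q-gram to its list of text positions. This is not the Lempel-Ziv index from the paper; the function name and the 0-based positions are assumptions.

```python
from collections import defaultdict


def build_qgram_index(text: str, q: int) -> dict[str, list[int]]:
    """Map every q-gram of `text` to the sorted list of its 0-based
    starting positions. Stores one entry per text position."""
    index: defaultdict[str, list[int]] = defaultdict(list)
    for i in range(len(text) - q + 1):
        index[text[i:i + q]].append(i)
    return dict(index)
```

For instance, `build_qgram_index("abracadabra", 3)["abr"]` is `[0, 7]`. The index holds n - q + 1 positions in total, which is the linear space cost that the Lempel-Ziv index is designed to reduce.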
Improving an Algorithm for Approximate Pattern Matching
Algorithmica, 1998
Cited by 19 (8 self)
We study a recent algorithm for fast online approximate string matching. This is the problem of searching a pattern in a text allowing errors in the pattern or in the text. The algorithm is based on a very fast kernel which is able to search short patterns using a nondeterministic finite automaton, which is simulated using bit-parallelism. A number of techniques to extend this kernel for longer patterns are presented in that work. However, the techniques can be integrated in many ways and the optimal interplay among them is by no means obvious. The solution to this problem starts at a very low level, by obtaining basic probabilistic information about the problem which was not previously known, and ends by integrating analytical results with empirical data to obtain the optimal heuristic. The conclusions obtained via analysis are experimentally confirmed. We also improve many of the techniques and obtain a combined heuristic which is faster than the original work. This work sho...