Results 1  10
of
23
A Guided Tour to Approximate String Matching
 ACM Computing Surveys
, 1999
"... We survey the current techniques to cope with the problem of string matching allowing errors. This is becoming a more and more relevant issue for many fast growing areas such as information retrieval and computational biology. We focus on online searching and mostly on edit distance, explaining t ..."
Abstract

Cited by 401 (38 self)
 Add to MetaCart
We survey the current techniques to cope with the problem of string matching allowing errors. This is becoming a more and more relevant issue for many fast growing areas such as information retrieval and computational biology. We focus on online searching and mostly on edit distance, explaining the problem and its relevance, its statistical behavior, its history and current developments, and the central ideas of the algorithms and their complexities. We present a number of experiments to compare the performance of the different algorithms and show which are the best choices according to each case. We conclude with some future work directions and open problems. 1
Approximate string matching over suffix trees
 PROCEEDINGS OF THE 4TH ANNUAL SYMPOSIUM ON COMBINATORIAL PATTERN MATCHING, NUMBER 684 IN LECTURE NOTES IN COMPUTER SCIENCE
, 1993
"... The classical approximate stringmatching problem of finding the locations of approximate occurrences P 0 of pattern string P in text string T such that the edit distance between P and P 0 is k is considered. We concentrate on the special case in which T is available for preprocessing before the se ..."
Abstract

Cited by 55 (1 self)
 Add to MetaCart
The classical approximate stringmatching problem of finding the locations of approximate occurrences P 0 of pattern string P in text string T such that the edit distance between P and P 0 is k is considered. We concentrate on the special case in which T is available for preprocessing before the searches with varying P and k. It is shown how the searches can be done fast using the suffix tree of T augmented with the suffix links as the preprocessed form of T and applying dynamic programming over the tree. Three variations of the search algorithm are developed with running times O(mq + n), O(mq log q + size of the output), and O(m
Indexing Methods for Approximate String Matching
 IEEE Data Engineering Bulletin
, 2000
"... Indexing for approximate text searching is a novel problem receiving much attention because of its applications in signal processing, computational biology and text retrieval, to name a few. We classify most indexing methods in a taxonomy that helps understand their essential features. We show that ..."
Abstract

Cited by 54 (10 self)
 Add to MetaCart
Indexing for approximate text searching is a novel problem receiving much attention because of its applications in signal processing, computational biology and text retrieval, to name a few. We classify most indexing methods in a taxonomy that helps understand their essential features. We show that the existing methods, rather than completely different as they are regarded, form a range of solutions whose optimum is usually somewhere in between.
A Hybrid Indexing Method for Approximate String Matching
"... We present a new indexing method for the approximate string matching problem. The method is based on a suffix array combined with a partitioning of the pattern. We analyze the resulting algorithm and show that the average retrieval time is Ç Ò � ÐÓ � Ò,forsome�� that depends on the error fraction t ..."
Abstract

Cited by 54 (10 self)
 Add to MetaCart
We present a new indexing method for the approximate string matching problem. The method is based on a suffix array combined with a partitioning of the pattern. We analyze the resulting algorithm and show that the average retrieval time is Ç Ò � ÐÓ � Ò,forsome�� that depends on the error fraction tolerated « and the alphabet size �. Itisshownthat �� for approximately « � � � Ô �,where � � � � ����. Thespace required is four times the text size, which is quite moderate for this problem. We experimentally show that this index can outperform by far all the existing alternatives for indexed approximate searching. These are also the first experiments that compare the different existing schemes.
Indexing Text with Approximate qgrams
, 2000
"... . We present a new index for approximate string matching. The index collects text qsamples, i.e. disjoint text substrings of length q, at fixed intervals and stores their positions. At search time, part of the text is filtered out by noticing that any occurrence of the pattern must be reflected ..."
Abstract

Cited by 38 (11 self)
 Add to MetaCart
. We present a new index for approximate string matching. The index collects text qsamples, i.e. disjoint text substrings of length q, at fixed intervals and stores their positions. At search time, part of the text is filtered out by noticing that any occurrence of the pattern must be reflected in the presence of some text qsamples that match approximately inside the pattern. We show experimentally that the parameterization mechanism of the related filtration scheme provides a compromise between the space requirement of the index and the error level for which the filtration is still efficient. 1
A Practical qGram Index for Text Retrieval Allowing Errors
 CLEI Electronic Journal
, 1998
"... We propose an indexing technique for approximate text searching, which is practical and powerful, and especially optimized for natural language text. Unlike other indices of this kind, it is able to retrieve any string that approximately matches the search pattern, not only words. Every text substri ..."
Abstract

Cited by 33 (9 self)
 Add to MetaCart
We propose an indexing technique for approximate text searching, which is practical and powerful, and especially optimized for natural language text. Unlike other indices of this kind, it is able to retrieve any string that approximately matches the search pattern, not only words. Every text substring of a fixed length q is stored in the index, together with pointers to all the text positions where it appears. The search pattern is partitioned into pieces which are searched in the index, and all their occurrences in the text are verified for a complete match. To reduce space requirements, pointers to blocks instead of exact positions can be used, which increases querying costs. We design an algorithm to optimize the pattern partition into pieces so that the total number of verifications is minimized. This is especially well suited for natural language texts, and allows to know in advance the expected cost of the search and the expected relevance of the query to the user. We show experi...
A Metric Index for Approximate String Matching
 In LATIN
, 2002
"... We present a radically new indexing approach for approximate string matching. The scheme uses the metric properties of the edit distance and can be applied to any other metric between strings. We build a metric space where the sites are the nodes of the suffix tree of the text, and the approxima ..."
Abstract

Cited by 19 (1 self)
 Add to MetaCart
We present a radically new indexing approach for approximate string matching. The scheme uses the metric properties of the edit distance and can be applied to any other metric between strings. We build a metric space where the sites are the nodes of the suffix tree of the text, and the approximate query is seen as a proximity query on that metric space. This permits us finding the R occurrences of a pattern of length m in a text of length n in average time O(m log n+m +R), using O(n log n) space and O(n log n) index construction time. This complexity improves by far over all other previous methods. We also show a simpler scheme needing O(n) space.
A New Indexing Method for Approximate String Matching
 Combinatorial Pattern Matching, 10th Annual Symposium
, 1999
"... . We present a new indexing method for the approximate string matching problem. The method is based on a suffix tree combined with a partitioning of the pattern. We analyze the resulting algorithm and show that the retrieval time is O(n ), for 0 ! ! 1, whenever ff ! 1 \Gamma e= p oe, where ff ..."
Abstract

Cited by 16 (4 self)
 Add to MetaCart
. We present a new indexing method for the approximate string matching problem. The method is based on a suffix tree combined with a partitioning of the pattern. We analyze the resulting algorithm and show that the retrieval time is O(n ), for 0 ! ! 1, whenever ff ! 1 \Gamma e= p oe, where ff is the error level tolerated and oe is the alphabet size. We experimentally show that this index outperforms by far all other algorithms for indexed approximate searching, also being the first experiments that compare the different existing schemes. We finally show how this index can be implemented using much less space. 1 Introduction Approximate string matching is a recurrent problem in many branches of computer science, with applications to text searching, computational biology, pattern recognition, signal processing, etc. The problem is: given a long text of length n, and a (comparatively short) pattern of length m, retrieve all the text segments (or "occurrences") whose edit distance t...
A Practical Index for Text Retrieval Allowing Errors
 In CLEI
, 1997
"... We propose a text indexing technique for approximate pattern matching, which is practical and especially aimed at Information Retrieval (IR). Unlike other indices of this kind, it is able to retrieve any string that approximately matches a given search pattern. Every sequence of a fixed length appea ..."
Abstract

Cited by 14 (1 self)
 Add to MetaCart
We propose a text indexing technique for approximate pattern matching, which is practical and especially aimed at Information Retrieval (IR). Unlike other indices of this kind, it is able to retrieve any string that approximately matches a given search pattern. Every sequence of a fixed length appearing in the text is stored in the index, together with pointers to all the positions where it appears. The search pattern is cut into pieces so that at least one must match exactly. All the pieces are searched in the index and the union of candidate positions is verified. To reduce space requirements, pointers to blocks instead of exact positions can be used, which increases querying costs. We design an algorithm to optimize the pattern partition into pieces so that the total number of verifications is minimized. This also allows to know in advance the expected cost of the search and the expected relevance of the query to the user. We show experimentally the build time, space requirements an...
Evaluation Measures of Multiple Sequence Alignments
 J Comput Biol
, 1996
"... Multiple sequence alignments (MSAs) are frequently used in the study of families of protein sequences or DNA/RNA sequences. They are a fundamental tool for the understanding of structure, functionality and ultimately evolution of proteins. A new algorithm, the Circular Sum (CS) method, is present ..."
Abstract

Cited by 13 (2 self)
 Add to MetaCart
Multiple sequence alignments (MSAs) are frequently used in the study of families of protein sequences or DNA/RNA sequences. They are a fundamental tool for the understanding of structure, functionality and ultimately evolution of proteins. A new algorithm, the Circular Sum (CS) method, is presented for formally evaluating the quality of an MSA. It is based on the use of a solution to the Traveling Salesman Problem, which identifies a circular tour through an evolutionary tree connecting the sequences in a protein family. With this approach the calculation of an evolutionary tree and the errors that it would introduce can be avoided altogether. The algorithm gives an upper bound, the best score that can possibly be achieved by any MSA for a given set of protein sequences. Alternatively, if presented with a specific MSA, the algorithm provides a formal score for the MSA, which serves as an absolute measure of the quality of the MSA. The CS measure yields a direct connection be...