A Guided Tour to Approximate String Matching
 ACM COMPUTING SURVEYS
, 1999
Cited by 598 (36 self)
We survey the current techniques to cope with the problem of string matching allowing errors. This is becoming a more and more relevant issue for many fast growing areas such as information retrieval and computational biology. We focus on online searching and mostly on edit distance, explaining the problem and its relevance, its statistical behavior, its history and current developments, and the central ideas of the algorithms and their complexities. We present a number of experiments to compare the performance of the different algorithms and show which are the best choices according to each case. We conclude with some future work directions and open problems.
The String Edit Distance Matching Problems with Moves
, 2006
Cited by 69 (3 self)
The edit distance between two strings S and R is defined to be the minimum number of character inserts, deletes and changes needed to convert R to S. Given a text string t of length n, and a pattern string p of length m, informally, the string edit distance matching problem is to compute the smallest edit distance between p and substrings of t. We relax the problem so that (a) we allow an additional operation, namely, substring moves, and (b) we allow approximation of this string edit distance. Our result is a near-linear-time deterministic algorithm to produce a factor of O(log n log* n) approximation to the string edit distance with moves. This is the first known significantly subquadratic algorithm for a string edit distance problem in which the distance involves nontrivial alignments. Our results are obtained by embedding strings into L1 vector space using a simplified parsing technique we call Edit
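The two definitions in this abstract can be illustrated with a minimal sketch of the classical quadratic-time dynamic programs. This is not the paper's near-linear-time algorithm and it does not model substring moves or approximation; the function names `edit_distance` and `best_substring_distance` are hypothetical, chosen only for this example.

```python
def edit_distance(r: str, s: str) -> int:
    """Minimum number of character inserts, deletes and changes turning r into s."""
    prev = list(range(len(s) + 1))  # distance from the empty prefix of r
    for i, cr in enumerate(r, start=1):
        curr = [i]
        for j, cs in enumerate(s, start=1):
            curr.append(min(
                prev[j] + 1,               # delete cr
                curr[j - 1] + 1,           # insert cs
                prev[j - 1] + (cr != cs),  # change cr into cs, or match
            ))
        prev = curr
    return prev[-1]


def best_substring_distance(p: str, t: str) -> int:
    """Smallest edit distance between p and any substring of t."""
    # Column for the empty text: aligning p[:i] with "" costs i.
    col = list(range(len(p) + 1))
    best = col[-1]
    for ct in t:
        new = [0]  # a match may start at any text position at no cost
        for i, cp in enumerate(p, start=1):
            new.append(min(col[i] + 1, new[i - 1] + 1, col[i - 1] + (cp != ct)))
        col = new
        best = min(best, col[-1])
    return best
```

Both functions take O(nm) time; the second differs from the first only in letting an alignment begin and end anywhere in the text.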
Dynamic LCA queries on trees
 SIAM Journal on Computing
, 1999
Cited by 57 (0 self)
We show how to maintain a data structure on trees which allows for the following operations, all in worst-case constant time: 1. insertion of leaves and internal nodes, 2. deletion of leaves, 3. deletion of internal nodes with only one child, 4. determining the least common ancestor of any two nodes. We also generalize the Dietz–Sleator “cup-filling” scheduling methodology, which may be of independent interest.
A sublinear algorithm for weakly approximating edit distance
 In Proc. STOC 2003
, 2003
Cited by 38 (4 self)
We show how to determine whether the edit distance between two given strings is small in sublinear time. Specifically, we present a test which, given two n-character strings A and B, runs in time o(n) and with high probability returns “CLOSE” if their edit distance is O(n^α), and “FAR” if their edit distance is Ω(n), where α is a fixed parameter less than 1. Our algorithm for testing the edit distance works by recursively subdividing the strings A and B into smaller substrings and looking for pairs of substrings in A, B with small edit distance. To do this, we query both strings at random places using a special technique for economizing on the samples which does not pick the samples independently and provides better query and overall complexity. As a result, our test runs in time Õ(n^{max{α/2, 2α−1}}) for any fixed α < 1. Our algorithm thus provides a tradeoff between accuracy and efficiency that is particularly useful when the input data is very large. We also show a lower bound of Ω(n^{α/2}) on the query complexity of every algorithm that distinguishes pairs of strings with edit distance at most n^α from those with edit distance at least n/6.
Approximating edit distance efficiently
 In Proc. FOCS 2004
, 2004
Cited by 36 (5 self)
Edit distance has been extensively studied for the past several years. Nevertheless, no linear-time algorithm is known to compute the edit distance between two strings, or even to approximate it to within a modest factor. Furthermore, for various natural algorithmic problems such as low-distortion embeddings into normed spaces, approximate nearest-neighbor schemes, and sketching algorithms, known results for the edit distance are rather weak. We develop algorithms that solve gap versions of the edit distance problem: given two strings of length n with the promise that their edit distance is either at most k or greater than ℓ, decide which of the two holds. We present two sketching algorithms for gap versions of edit distance. Our first algorithm solves the k vs. (kn)^{2/3} gap problem, using a constant size sketch. A more involved algorithm solves the stronger k vs. ℓ gap problem, where ℓ can be as small as O(k^2), still with a constant sketch, but works only for strings that are mildly “non-repetitive”. Finally, we develop an n^{3/7}-approximation quasi-linear time algorithm for edit distance, improving the previous best factor of n^{3/4} [5]; if the input strings are assumed to be non-repetitive, then the approximation factor can be strengthened to n^{1/3}.
Finding approximate repetitions under Hamming distance
 THEORETICAL COMPUTER SCIENCE
, 2001
Cited by 35 (1 self)
The problem of computing tandem repetitions with K possible mismatches is studied. Two main definitions are considered, and for both of them an O(nK log K + S) algorithm is proposed (where S is the size of the output). This improves, in particular, the bound obtained in [LS93]. Finally, other possible definitions are briefly analyzed.
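To make the problem concrete, here is a naive quadratic-time sketch under one assumed definition: a substring uv with |u| = |v| counts as a tandem repeat with K mismatches when u and v differ in at most K positions. The abstract does not spell out its two definitions, so this choice, like the function name `approximate_tandem_repeats`, is an assumption for illustration, and the code is not the O(nK log K + S) algorithm.

```python
def approximate_tandem_repeats(s: str, k: int) -> list[tuple[int, int]]:
    """Return (start, half) for each s[start:start + 2*half] whose two halves
    differ in at most k positions (Hamming distance)."""
    n = len(s)
    found = []
    for half in range(1, n // 2 + 1):
        # Mismatches between s[0:half] and s[half:2*half].
        mismatches = sum(s[i] != s[i + half] for i in range(half))
        start = 0
        while True:
            if mismatches <= k:
                found.append((start, half))
            if start + 2 * half >= n:
                break  # the window already touches the end of s
            # Slide the window one position to the right.
            mismatches -= s[start] != s[start + half]
            mismatches += s[start + half] != s[start + 2 * half]
            start += 1
    return found
```

For each half-length the mismatch count is updated in constant time per shift, so the total running time is O(n^2) regardless of K.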
Random Access to GrammarCompressed Strings
, 2011
Cited by 31 (3 self)
Let S be a string of length N compressed into a context-free grammar S of size n. We present two representations of S achieving O(log N) random access time, and either O(n · α_k(n)) construction time and space on the pointer machine model, or O(n) construction time and space on the RAM. Here, α_k(n) is the inverse of the k-th row of Ackermann’s function. Our representations also efficiently support decompression of any substring in S: we can decompress any substring of length m in the same complexity as a single random access query and additional O(m) time. Combining these results with fast algorithms for uncompressed approximate string matching leads to several efficient algorithms for approximate string matching on grammar-compressed strings without decompression. For instance, we can find all approximate occurrences of a pattern P with at most k errors in time O(n(min{|P|k, k^4 + |P|} + log N) + occ), where occ is the number of occurrences of P in S. Finally, we are able to generalize our results to navigation and other operations on grammar-compressed trees. All of the above bounds significantly improve the currently best known results. To achieve these bounds, we introduce several new techniques and data structures of independent interest, including a predecessor data structure, two “biased” weighted ancestor data structures, and a compact representation of heavy paths in grammars.
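As a baseline for what random access to a grammar-compressed string means, the sketch below stores the expansion length of every rule and walks down from the start rule. It takes time proportional to the grammar's height, not the O(log N) achieved in the paper, and uses none of the paper's data structures. The rule encoding (a single character, or a pair of earlier rule indices) and the names `expansion_lengths` and `access` are assumptions made for this example.

```python
def expansion_lengths(rules):
    """rules[i] is either a single character or a pair (left, right) of
    indices of earlier rules; return the expanded length of each rule."""
    lengths = []
    for rule in rules:
        if isinstance(rule, str):
            lengths.append(1)
        else:
            left, right = rule
            lengths.append(lengths[left] + lengths[right])
    return lengths


def access(rules, lengths, root, i):
    """Return character i of the string generated by rule `root`
    without expanding the whole string."""
    if not 0 <= i < lengths[root]:
        raise IndexError(i)
    node = root
    while not isinstance(rules[node], str):
        left, right = rules[node]
        if i < lengths[left]:
            node = left            # position lies in the left part
        else:
            i -= lengths[left]     # skip the left part
            node = right
    return rules[node]


# Rule 2 expands to "ab", rule 3 to "abab", rule 4 to "ababa".
rules = ["a", "b", (0, 1), (2, 2), (3, 0)]
lengths = expansion_lengths(rules)
text = "".join(access(rules, lengths, 4, i) for i in range(lengths[4]))
```

Decompressing a substring of length m with this baseline costs m separate descents; the paper's representations instead add only O(m) time to a single access.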
Approximate Text Searching
, 1998
Cited by 25 (7 self)
This thesis focuses on the problem of text retrieval allowing errors, also called "approximate" string matching. The problem is to find a pattern in a text, where the pattern and the text may have "errors". This problem has received a lot of attention in recent years because of its applications in many areas, such as information retrieval, computational biology and signal processing, to name a few. The aim of this work is the development and analysis of novel algorithms to deal with the problem under various conditions, as well as a better understanding of the problem itself and its statistical behavior. Although our results are valid in many different areas, we focus our attention on typical text searching for information retrieval applications. This makes some ranges of values for the parameters of the problem more interesting than others. We have divided this presentation in two parts. The first one deals with online approximate string matching, i.e. when there is no time or space to preprocess the text. These algorithms are the core of offline algorithms as well. Online searching is the area of the problem where better algorithms existed. We have obtained new bounds for the probability of an approximate match of a pattern in
Faster approximate pattern matching in compressed repetitive texts
 IN PROCEEDINGS OF THE 22ND INTERNATIONAL SYMPOSIUM ON ALGORITHMS AND COMPUTATION (ISAAC
, 2011
Cited by 13 (5 self)
Motivated by the imminent growth of massive, highly redundant genomic databases we study the problem of compressing a string database while simultaneously supporting fast random access, substring extraction and pattern matching to the underlying string(s). Bille et al. (2011) recently showed how, given a straight-line program with r rules for a string s of length n, we can build an O(r)-word data structure that allows us to extract any substring s[i..j] in O(log n + j − i) time. They also showed how, given a pattern p of length m and an edit distance k ≤ m, their data structure supports finding all occ approximate matches to p in s in O(r(min(mk, k^4 + m) + log n) + occ) time. Rytter (2003) and Charikar et al. (2005) showed that r is always at least the number z of phrases in the LZ77 parse of s, and gave algorithms for building straight-line programs with O(z log n) rules. In this paper we give a simple O(z log n)-word data structure that takes the same time for substring extraction but only O(z(min(mk, k^4 + m)) + occ) time for approximate pattern matching.
Semilocal string comparison: Algorithmic techniques and applications
 Mathematics in Computer Science 1(4) (2008) 571–603 See also arXiv: 0707.3619
Cited by 8 (2 self)
The longest common subsequence (LCS) problem is a classical problem in computer science. The semi-local LCS problem is a generalisation of the LCS problem, arising naturally in the context of string comparison. In this work, we present a number of algorithmic techniques related to the semi-local LCS problem, and give a number of algorithmic applications of these techniques. Summarising the presented results, we conclude that semi-local string comparison turns out to be a useful algorithmic plug-in, which unifies, and often improves on, a number of previous approaches to various substring and subsequence-related problems.
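For reference, the classical (global) LCS problem that the semi-local version generalises can be solved with the textbook dynamic program sketched below. It covers only the global problem and none of the semi-local techniques of the paper; the function name `lcs_length` is hypothetical.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of a longest common subsequence of a and b, in O(|a|*|b|) time."""
    prev = [0] * (len(b) + 1)  # LCS lengths of the empty prefix of a vs. prefixes of b
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                curr.append(prev[j - 1] + 1)          # extend a common subsequence
            else:
                curr.append(max(prev[j], curr[j - 1]))  # drop a character from a or b
        prev = curr
    return prev[-1]
```

Only two rows of the table are kept, so the space used is linear in the length of the second string.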