The String Edit Distance Matching Problems with Moves
, 2006
Cited by 52 (2 self)
The edit distance between two strings S and R is defined to be the minimum number of character inserts, deletes and changes needed to convert R to S. Given a text string t of length n, and a pattern string p of length m, informally, the string edit distance matching problem is to compute the smallest edit distance between p and substrings of t. We relax the problem so that (a) we allow an additional operation, namely, substring moves, and (b) we allow approximation of this string edit distance. Our result is a near linear time deterministic algorithm to produce a factor of O(log n log* n) approximation to the string edit distance with moves. This is the first known significantly subquadratic algorithm for a string edit distance problem in which the distance involves nontrivial alignments. Our results are obtained by embedding strings into L1 vector space using a simplified parsing technique we call Edit
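For orientation, the plain string edit distance matching problem defined above (without moves and without approximation) can be solved by a classic quadratic-time dynamic program. The sketch below is that textbook baseline, not the near-linear algorithm of the paper; the function name is a hypothetical choice made for illustration.

```python
def min_substring_edit_distance(p: str, t: str) -> int:
    """Smallest edit distance between p and any substring of t.

    Textbook O(n*m) dynamic program (assumed baseline, not the paper's method).
    """
    m = len(p)
    # prev[i]: cheapest way to align p[:i] with some substring of t that
    # ends at the current text position. Before reading any text the
    # substring is empty, so aligning p[:i] costs i edits.
    prev = list(range(m + 1))
    best = prev[m]
    for c in t:
        cur = [0] * (m + 1)  # cur[0] = 0: a match may start at any text position
        for i in range(1, m + 1):
            change = prev[i - 1] + (0 if p[i - 1] == c else 1)
            skip_text_char = prev[i] + 1
            skip_pattern_char = cur[i - 1] + 1
            cur[i] = min(change, skip_text_char, skip_pattern_char)
        best = min(best, cur[m])
        prev = cur
    return best
```

The only difference from the ordinary two-string edit distance table is that row 0 is all zeros, which lets an alignment begin anywhere in t.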
Dictionary matching and indexing with errors and don’t cares
- In STOC ’04
, 2004
Cited by 42 (1 self)
This paper considers various flavors of the following online problem: preprocess a text or collection of strings, so that given a query string p, all matches of p with the text can be reported quickly. In this paper we consider matches in which a bounded number of mismatches are allowed, or in which a bounded number of “don’t care” characters are allowed. The specific problems we look at are: indexing, in which there is a single text t, and we seek locations where p matches a substring of t; dictionary queries, in which a collection of strings is given upfront, and we seek those strings which match p in their entirety; and dictionary matching, in which a collection of strings is given upfront, and we seek those substrings of a (long) p which match an original string in its entirety. These are all instances of an all-to-all matching problem, for which we provide a single solution. The performance bounds all have a similar character. For example, for the indexing problem with n = |t| and m = |p|, the query time for k substitutions is O(m + (c1 log n)^k / k! + # matches), with a data structure of size O(n (c2 log n)^k / k!) and a preprocessing time of O(n (c2 log n)^k / k!), where c1, c2 > 1 are constants. The deterministic preprocessing assumes a weakly nonuniform RAM model; this assumption is not needed if randomization is used in the preprocessing.
Overlap Matching
- Information and Computation
, 2001
Cited by 13 (4 self)
We propose a new paradigm for string matching, namely structural matching. In structural matching, the text and pattern contents are not important. Rather, some areas in the text and patterns are singled out, say intervals. A "match" is a text location where a specified relation between the text and pattern areas is satisfied. In particular we define the structural matching problem of Overlap (Parity) Matching. We seek the text locations where all overlaps of the given pattern and text intervals have even length. We show that this problem can be solved in time O(n log m), where the text length is n and the pattern length is m. As an application of overlap matching, we show how to reduce the String Matching with Swaps problem to the overlap matching problem. The String Matching with Swaps problem is the problem of string matching in the presence of local swaps. The best known deterministic upper bound for this problem was O(nm^(1/3) log m log σ) for a general alphabet Σ, wher...
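To make String Matching with Swaps concrete, here is a naive O(nm) checker. It assumes the usual reading of "local swaps": the pattern matches at a text location if some set of disjoint swaps of adjacent pattern characters makes it equal to the text window. That definition and the function names are assumptions for illustration; this is not the overlap-matching reduction described in the abstract.

```python
def swap_match_at(t: str, p: str, i: int) -> bool:
    """Does p match t[i:i+len(p)] after disjoint adjacent swaps in p?"""
    j, m = 0, len(p)
    while j < m:
        if p[j] == t[i + j]:
            j += 1  # characters already agree, no swap needed here
        elif j + 1 < m and p[j] == t[i + j + 1] and p[j + 1] == t[i + j]:
            j += 2  # swap p[j] and p[j+1]; both positions are now used up
        else:
            return False
    return True


def swap_matches(t: str, p: str) -> list[int]:
    """All text locations where p swap-matches t (naive scan)."""
    return [i for i in range(len(t) - len(p) + 1) if swap_match_at(t, p, i)]
```

A left-to-right greedy scan suffices under this definition, because swapping two positions whose first character already agrees with the text can only help when both pattern characters are equal, in which case the swap changes nothing.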
Approximate Subset Matching with Don't Cares
Cited by 12 (1 self)
The Subset Matching problem was recently introduced by Cole and Hariharan. The input of the problem is a text array of n sets totaling s elements and a pattern array of m sets totaling s′ elements. There is a match of the pattern in a text location if every pattern set is a subset of the corresponding text set. Subset matching has proven to be a powerful technique and enabled finding an efficient solution to the Tree Matching problem. The subset matching model may prove useful in solving other hard problems, e.g. Swap Matching. In this paper we investigate the complexity of approximate subset matching with "don't care"s. We provide two algorithms for the problem: a randomized algorithm whose complexity is O((s + n + (n/m) s′) √m log^2 m) and a deterministic algorithm whose complexity is O((s + n) √s′ log m).
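The exact Subset Matching definition above can be written down directly as a brute-force scan. This sketch covers only the exact problem (no approximation, no don't cares) and is far slower than either algorithm in the paper; the function name is hypothetical.

```python
def subset_matches(text: list[set], pattern: list[set]) -> list[int]:
    """Text locations where every pattern set is a subset of the aligned text set."""
    m = len(pattern)
    return [
        i
        for i in range(len(text) - m + 1)
        # `<=` on Python sets is the subset test
        if all(pattern[j] <= text[i + j] for j in range(m))
    ]
```

For example, with text `[{1, 2}, {2, 3}, {1, 2, 3}, {3}]` and pattern `[{1}, {3}]`, the pattern matches at locations 0 and 2 but not at location 1, where `{1}` is not a subset of `{2, 3}`.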
k-mismatch with don’t cares
- In ESA
, 2007
Cited by 8 (5 self)
We give the first non-trivial algorithms for the k-mismatch pattern matching problem with don’t cares. Given a text t of length n and a pattern p of length m with don’t care symbols and a bound k, our algorithms find all the places that the pattern matches the text with at most k mismatches. We first give an O(n(k + log n log log n) log m) time randomised solution which finds the correct answer with high probability. We then present a new deterministic O(nk^2 log^3 m) time solution that uses tools developed for group testing and finally an approach based on k-selectors that runs in O(nk polylog m) time but requires O(poly m) time preprocessing. In each case, the location of the mismatches at each alignment is also given at no extra cost.
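As a reference point for the problem statement above, the trivial O(nm) solution can be sketched as follows. It reports, for each matching alignment, the mismatch locations as well. The wildcard character `?` and the function name are assumptions made for this illustration, and don't cares are allowed only in the pattern here; none of the paper's three algorithms is implemented.

```python
WILDCARD = "?"  # assumed don't care symbol, chosen for this sketch only


def k_mismatch_with_dont_cares(t: str, p: str, k: int) -> dict[int, list[int]]:
    """Map each alignment with at most k mismatches to its mismatch offsets."""
    n, m = len(t), len(p)
    result = {}
    for i in range(n - m + 1):
        mismatches = []
        for j in range(m):
            # A don't care in the pattern matches any text character.
            if p[j] != WILDCARD and p[j] != t[i + j]:
                mismatches.append(j)
                if len(mismatches) > k:
                    break  # this alignment already exceeds the bound
        if len(mismatches) <= k:
            result[i] = mismatches
    return result
```

Calling `k_mismatch_with_dont_cares("abcabd", "a?c", 1)` reports alignment 0 with no mismatches and alignment 3 with a single mismatch at pattern offset 2.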
Pattern matching with address errors: rearrangement distances
- In SODA
, 2006
Cited by 6 (0 self)
Historically, approximate pattern matching has mainly focused on coping with errors in the data, while the order of the text/pattern was assumed to be more or less correct. In this paper we consider a class of pattern matching problems where the content is assumed to be correct, while the locations may have shifted/changed. We formally define a broad class of problems of this type, capturing situations in which the pattern is obtained from the text by a sequence of rearrangements. We consider several natural rearrangement schemes, including the analogues of the ℓ1 and ℓ2 distances, as well as two distances based on interchanges. For these, we present efficient algorithms to solve the resulting string matching problems.
Optimal spaced seeds for faster approximate string matching
Cited by 6 (0 self)
Filtering is a standard technique for fast approximate string matching in practice. In filtering, a quick first step is used to rule out almost all positions of a text as possible starting positions for a pattern. Typically this step consists of finding the exact matches of small parts of the pattern. In the followup step, a slow method is used to verify or eliminate each remaining position. The running time of such a method depends largely on the quality of the filtering step, as measured by its false positives rate. The quality of such a method depends on the number of true matches that it misses, that is, on its false negative rate.
A spaced seed is a recently introduced type of filter pattern that allows gaps (i.e. don't cares) in the small sub-pattern to be searched for. Spaced seeds promise to yield a much lower false positives rate, and thus have been extensively studied, though heretofore only heuristically or statistically.
In this paper, we show how to design almost optimal spaced seeds that yield no false negatives.
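The "no false negatives" property can be checked by brute force for small parameters. The sketch below assumes a Hamming-distance setting (pattern length m, at most k mismatches) and writes a seed as a string of `1` (position must match) and `0` (don't care); both the notation and the function name are assumptions for illustration, and this exhaustive check is not the seed design method of the paper.

```python
from itertools import combinations


def is_lossless(seed: str, m: int, k: int) -> bool:
    """True if the seed hits every length-m alignment with at most k mismatches.

    A seed "hits" at offset o when no mismatch falls on one of its '1'
    positions. Checking exactly k mismatches is enough, since removing
    mismatches can only make a hit easier.
    """
    care = [i for i, c in enumerate(seed) if c == "1"]
    offsets = range(m - len(seed) + 1)
    for mismatches in combinations(range(m), k):
        bad = set(mismatches)
        hit = any(all(o + i not in bad for i in care) for o in offsets)
        if not hit:
            return False  # this mismatch placement would be a false negative
    return True
```

With m = 4 and k = 1, the contiguous seed `11` and the spaced seed `101` are both lossless, while `111` is not: a mismatch at position 1 blocks both of its possible offsets.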
Random Access to Grammar-Compressed Strings
, 2011
Cited by 6 (0 self)
Let S be a string of length N compressed into a context-free grammar S of size n. We present two representations of S achieving O(log N) random access time, and either O(n · αk(n)) construction time and space on the pointer machine model, or O(n) construction time and space on the RAM. Here, αk(n) is the inverse of the kth row of Ackermann’s function. Our representations also efficiently support decompression of any substring in S: we can decompress any substring of length m in the same complexity as a single random access query and additional O(m) time. Combining these results with fast algorithms for uncompressed approximate string matching leads to several efficient algorithms for approximate string matching on grammar-compressed strings without decompression. For instance, we can find all approximate occurrences of a pattern P with at most k errors in time O(n(min{|P|k, k^4 + |P|} + log N) + occ), where occ is the number of occurrences of P in S. Finally, we are able to generalize our results to navigation and other operations on grammar-compressed trees. All of the above bounds significantly improve the currently best known results. To achieve these bounds, we introduce several new techniques and data structures of independent interest, including a predecessor data structure, two “biased” weighted ancestor data structures, and a compact representation of heavy-paths in grammars.
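The basic idea behind random access without decompression can be illustrated with a much simpler scheme than the one in the paper: store the expansion length of every rule and walk down from the start rule. The sketch assumes a grammar in which each rule is either a single character or a pair of earlier rule indices; this encoding and the function names are hypothetical. A query here costs time proportional to the grammar height, which can be far worse than the O(log N) bound stated above.

```python
def expansion_lengths(rules: list) -> list[int]:
    """Length of the string each rule expands to (rules refer only to earlier rules)."""
    lengths = []
    for rule in rules:
        if isinstance(rule, str):
            lengths.append(1)  # terminal rule: a single character
        else:
            left, right = rule
            lengths.append(lengths[left] + lengths[right])
    return lengths


def access(rules: list, lengths: list[int], start: int, i: int) -> str:
    """Character at position i of the expansion of rule `start`."""
    if not 0 <= i < lengths[start]:
        raise IndexError("position outside the expanded string")
    node = start
    while not isinstance(rules[node], str):
        left, right = rules[node]
        if i < lengths[left]:
            node = left  # target lies in the left part
        else:
            i -= lengths[left]  # skip over the left part
            node = right
    return rules[node]
```

For instance, the rules `["a", "b", (0, 1), (2, 2), (3, 0)]` describe the string "ababa" at rule 4, and each character can be read with `access` without ever expanding the whole string.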
On Exact and Approximation Algorithms for Distinguishing Substring Selection
- In Proc. 14th FCT, volume 2751 of LNCS
, 2003
Cited by 5 (2 self)
The NP-complete Distinguishing Substring Selection problem (DSSS for short) asks, given a set of "good" strings and a set of "bad" strings, for a solution string which is, with respect to Hamming metric, "away" from the good strings and "close" to the bad strings.

