## Improved Approximate String Matching and Regular Expression Matching on Ziv-Lempel Compressed Texts (2007)

### BibTeX

@MISC{Bille07improvedapproximate,

author = {Philip Bille and Rolf Fagerberg and Inge Li Gørtz},

title = {Improved Approximate String Matching and Regular Expression Matching on Ziv-Lempel Compressed Texts},

year = {2007}

}

### Abstract

We study the approximate string matching and regular expression matching problems for the case when the text to be searched is compressed with the Ziv-Lempel adaptive dictionary compression schemes. We present a time-space trade-off that leads to algorithms improving the previously known complexities for both problems. In particular, we significantly improve the space bounds, which in practical applications are likely to be a bottleneck.

### Citations

1221 | A universal algorithm for sequential data compression
- Ziv, Lempel
- 1977
Citation Context ...ZL78 and ZLW become "equivalent". This is the reason why Theorem 1 holds only for ZL78 when space is o(n) but for both when the space is Ω(n). Another well-known variant is the ZL77 compression scheme [23]. Unlike ZL78 and ZLW, phrases in the ZL77 scheme can be any substring of text that has already been processed. This makes searching much more difficult and none of the known techniques for ZL78 and ZLW s...

776 | Compression of individual sequences via variable rate coding
- Lempel, Ziv
- 1978
Citation Context ...tring matching and regular expression matching problems on compressed texts. As in previous work on these problems [10,17] we focus on the popular ZL78 and ZLW adaptive dictionary compression schemes [22, 24]. We present a new technique that gives a general time-space trade-off. The resulting algorithms improve all previously known complexities for both problems. In particular, we significantly improve th...

458 | A technique for high performance data compression
- Welch
- 1984
Citation Context ...algorithms have been proposed depending on the type of pattern and compression method, see e.g., [2,9–11,13,17]. For instance, given a string Q of length u compressed with the Ziv-Lempel-Welch scheme [22] into a string of length n, Amir et al. [3] gave an algorithm for finding all exact occurrences of a pattern string of length m in O(n + m²) time and space. In this paper we study the classical appr...

447 | A Guided Tour of Approximate String Matching
- Navarro
- 2001
Citation Context ... to Sellers [20] solves the problem in O(um) time and O(m) space, where u and m are the length of Q and P, respectively. Several improvements of this result are known, see e.g., the survey by Navarro [16]. For this paper we are particularly interested in the fast solution for small values of k, namely, the O(uk) time algorithm by Landau and Vishkin [12] and the more recent O(uk⁴/m + u) time algorith...

301 | Compilers: principles, techniques, and tools
- Aho, Sethi, et al.
- 1986
Citation Context ... compression scheme. 4 Regular Expression Matching. 4.1 Regular Expressions and Finite Automata. First we briefly review the classical concepts used in the paper. For more details see, e.g., Aho et al. [1]. The set of regular expressions over Σ are defined recursively as follows: A character α ∈ Σ is a regular expression, and if S and T are regular expressions then so is the concatenation, (S) · (T), ...

253 | On the theory and computation of evolutionary distances
- Sellers
- 1974
Citation Context ...dit distance between two strings is the minimum number of insertions, deletions, and substitutions needed to convert one string to the other. The classical dynamic programming solution due to Sellers [20] solves the problem in O(um) time and O(m) space, where u and m are the length of Q and P, respectively. Several improvements of this result are known, see e.g., the survey by Navarro [16]. For this p...

133 | Dynamic perfect hashing: upper and lower bounds
- Dietzfelbinger, Karlin, et al.
- 1994
Citation Context ...ement in C is at most 2τ. Proof. Let 1 ≤ τ ≤ n be a given parameter. We build C incrementally in a left-to-right scan of Z. The set is maintained as a dynamic dictionary using dynamic perfect hashing [8], i.e., constant time worst-case access and constant time amortized expected update. Initially, we set C = {z_0}. Suppose that we have read z_0, ..., z_i. To process z_{i+1} we follow the path p of referen...

114 | Fast parallel and serial approximate string matching
- Landau, Vishkin
- 1989
Citation Context ...esult are known, see e.g., the survey by Navarro [16]. For this paper we are particularly interested in the fast solution for small values of k, namely, the O(uk) time algorithm by Landau and Vishkin [12] and the more recent O(uk⁴/m + u) time algorithm due to Cole and Hariharan [7] (we assume w.l.o.g. that k < m). Both of these can be implemented in O(m) space. Recently, Kärkkäinen et al. [10] studi...

101 | Regular Expressions and State Graphs for Automata
- McNaughton, Yamada
- 1960
Citation Context ...1, j] such that A accepts B[i, j]. The set of all matches is denoted ∆(A, B). Given a regular expression R, an NFA A accepting precisely the strings in L(R) can be obtained by several classic methods [10, 16, 23]. In particular, Thompson [23] gave a simple well-known construction which we will refer to as a Thompson NFA (TNFA). A TNFA A for R has at most 2m states, at most 4m transitions, and can be computed ...

98 | Let sleeping files lie: Pattern matching in Z-compressed files
- Amir, Benson, et al.
- 1996
Citation Context ...he type of pattern and compression method, see e.g., [2,9–11,13,17]. For instance, given a string Q of length u compressed with the Ziv-Lempel-Welch scheme [22] into a string of length n, Amir et al. [3] gave an algorithm for finding all exact occurrences of a pattern string of length m in O(n + m²) time and space. In this paper we study the classical approximate string matching and regular express...

90 | String matching in Lempel-Ziv compressed strings. In 27th Annual ACM Symposium on the Theory of Computing
- Farach, Thorup
- 1995
Citation Context ...arching much more difficult and none of the known techniques for ZL78 and ZLW seems to be applicable. The only known algorithm for pattern matching on ZL77 compressed text is due to Farach and Thorup [9] who gave an algorithm for the exact string matching problem. 3 Approximate String Matching. In this section we consider the compressed approximate string matching problem. Before presenting our algori...

81 | Optimal two-dimensional compressed matching
- Amir, Benson, et al.
- 1994
Citation Context ...ext databases, e.g., for biological and World Wide Web data, are huge. To save time and space the data must be kept in compressed form and allow efficient searching. Motivated by this Amir and Benson [1,2] initiated the study of compressed pattern matching problems, that is, given a text string Q in compressed form Z and a specified (uncompressed) pattern P, find all occurrences of P in Q without decom...

76 | The abstract theory of automata
- Glushkov
- 1961
Citation Context ...1, j] such that A accepts B[i, j]. The set of all matches is denoted ∆(A, B). Given a regular expression R, an NFA A accepting precisely the strings in L(R) can be obtained by several classic methods [10, 16, 23]. In particular, Thompson [23] gave a simple well-known construction which we will refer to as a Thompson NFA (TNFA). A TNFA A for R has at most 2m states, at most 4m transitions, and can be computed ...

46 | A general practical approach to pattern matching over Ziv-Lempel compressed text
- Navarro, Raffinot
- 1999
Citation Context ...r of occurrences of the pattern. Currently, this is the only non-trivial worst-case bound for the problem. For special cases and restricted versions of the problem other algorithms have been proposed [14,19]. An experimental study of the problem and an optimized practical implementation can be found in [18]. In this paper, we show that the problem is closely connected to the uncompressed problem and we a...

41 | A four Russians algorithm for regular expression pattern matching. J Assoc Comput Mach 1992;39(4):430–48
- Myers
Citation Context ...on [21] solves the problem in O(um) time and O(m) space, where u and m are the length of Q and R, respectively. Improvements based on the Four Russian Technique or word-level parallelism are given in [4,6,15]. The only solution to the compressed problem is due to Navarro [17]. His solution depends on word RAM techniques to encode small sets into memory words, thereby allowing constant time set operations....

39 | Two-dimensional periodicity and its application
- Amir, Benson
- 1992
Citation Context ...ext databases, e.g., for biological and World Wide Web data, are huge. To save time and space the data must be kept in compressed form and allow efficient searching. Motivated by this Amir and Benson [1,2] initiated the study of compressed pattern matching problems, that is, given a text string Q in compressed form Z and a specified (uncompressed) pattern P, find all occurrences of P in Q without decom...

35 | Programming Techniques: Regular expression search algorithm
- Thompson
- 1968
Citation Context ...regular expression matching problem is to find all ending positions of substrings in Q that match a string in the language denoted by R. The classic textbook solution to this problem due to Thompson [21] solves the problem in O(um) time and O(m) space, where u and m are the length of Q and R, respectively. Improvements based on the Four Russian Technique or word-level parallelism are given in [4,6,15...

32 | Approximate string matching: A simpler faster algorithm
- Cole, Hariharan
- 1998
Citation Context ...icularly interested in the fast solution for small values of k, namely, the O(uk) time algorithm by Landau and Vishkin [12] and the more recent O(uk⁴/m + u) time algorithm due to Cole and Hariharan [7] (we assume w.l.o.g. that k < m). Both of these can be implemented in O(m) space. Recently, Kärkkäinen et al. [10] studied the problem for text compressed with the ZL78/ZLW compression schemes. If n i...

25 | Faster approximate string matching over compressed text
- Navarro, Kida, et al.
- 2001
Citation Context ...m. For special cases and restricted versions of the problem other algorithms have been proposed [14,19]. An experimental study of the problem and an optimized practical implementation can be found in [18]. In this paper, we show that the problem is closely connected to the uncompressed problem and we achieve a simple time-space trade-off. More precisely, let t(m, u, k) and s(m, u, k) denote the time a...

23 | Multiple pattern matching in LZW compressed text
- Kida, Takeda, et al.
- 1998
Citation Context ... approach of decompressing Z into Q and then searching for P in Q. Various compressed pattern matching algorithms have been proposed depending on the type of pattern and compression method, see e.g., [3, 9, 11, 12, 14, 19]. For instance, given a string Q of length u compressed with the Ziv-Lempel-Welch scheme [24] into a string of length n, Amir et al. [4] gave an algorithm for finding all exact occurrences of a patter...

20 | Approximate matching of run-length compressed strings
- Mäkinen, Navarro, et al.
- 2001
Citation Context ... approach of decompressing Z into Q and then searching for P in Q. Various compressed pattern matching algorithms have been proposed depending on the type of pattern and compression method, see e.g., [3, 9, 11, 12, 14, 19]. For instance, given a string Q of length u compressed with the Ziv-Lempel-Welch scheme [24] into a string of length n, Amir et al. [4] gave an algorithm for finding all exact occurrences of a patter...

19 | Fast and compact regular expression matching
- Bille, Farach-Colton
Citation Context ...on [21] solves the problem in O(um) time and O(m) space, where u and m are the length of Q and R, respectively. Improvements based on the Four Russian Technique or word-level parallelism are given in [4,6,15]. The only solution to the compressed problem is due to Navarro [17]. His solution depends on word RAM techniques to encode small sets into memory words, thereby allowing constant time set operations....

13 | Regular expression searching on compressed text
- Navarro
- 2003
Citation Context ...h m in O(n + m²) time and space. In this paper we study the classical approximate string matching and regular expression matching problems on compressed texts. As in previous work on these problems [10,17] we focus on the popular ZL78 and ZLW adaptive dictionary compression schemes [22, 24]. We present a new technique that gives a general time-space trade-off. The resulting algorithms improve all previ...

12 | Bit-parallel approach to approximate string matching in compressed texts
- Matsumoto, Kida, et al.
- 2000
Citation Context ...r of occurrences of the pattern. Currently, this is the only non-trivial worst-case bound for the problem. For special cases and restricted versions of the problem other algorithms have been proposed [14,19]. An experimental study of the problem and an optimized practical implementation can be found in [18]. In this paper, we show that the problem is closely connected to the uncompressed problem and we a...

8 | New algorithms for regular expression matching
- Bille
- 2006
Citation Context ...on [21] solves the problem in O(um) time and O(m) space, where u and m are the length of Q and R, respectively. Improvements based on the Four Russian Technique or word-level parallelism are given in [4,6,15]. The only solution to the compressed problem is due to Navarro [17]. His solution depends on word RAM techniques to encode small sets into memory words, thereby allowing constant time set operations....

2 | Approximate string matching on Ziv-Lempel compressed text
- Kärkkäinen, Navarro, et al.
- 2003
Citation Context ...h m in O(n + m²) time and space. In this paper we study the classical approximate string matching and regular expression matching problems on compressed texts. As in previous work on these problems [10,17] we focus on the popular ZL78 and ZLW adaptive dictionary compression schemes [22, 24]. We present a new technique that gives a general time-space trade-off. The resulting algorithms improve all previ...
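
### Illustration

Several of the citation contexts above refer to the classical dynamic programming solution due to Sellers, which solves approximate string matching on uncompressed text in O(um) time and O(m) space. The following is a minimal Python sketch of that column-by-column recurrence, given only to make the baseline concrete. The function name `sellers_search` and the choice to report 1-based end positions are assumptions made for this illustration; the sketch is not the compressed-text algorithm of the paper.

```python
def sellers_search(pattern: str, text: str, k: int) -> list[int]:
    """Report 1-based end positions j such that some substring of `text`
    ending at j is within edit distance k of `pattern`.

    Illustrative sketch of the classical dynamic program on uncompressed
    text: O(u*m) time and O(m) space for u = len(text), m = len(pattern).
    """
    m = len(pattern)
    # col[i] holds the minimum edit distance between pattern[:i] and any
    # substring of the text ending at the current text position.
    col = list(range(m + 1))
    ends = []
    for j, c in enumerate(text, start=1):
        prev_diag = col[0]  # entry (i-1, j-1) of the table
        # col[0] stays 0 because a match may start at any text position.
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == c else 1
            new = min(prev_diag + cost,  # match or substitution
                      col[i] + 1,        # extra text character
                      col[i - 1] + 1)    # missing pattern character
            prev_diag = col[i]
            col[i] = new
        if col[m] <= k:
            ends.append(j)
    return ends


if __name__ == "__main__":
    print(sellers_search("abc", "xabcx", 0))  # [4]
    print(sellers_search("abc", "xabcx", 1))  # [3, 4, 5]
```

Only one column of the table is kept at a time, which is where the O(m) space bound comes from.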