## Approximate String Matching with Lempel-Ziv Compressed Indexes (2007)

### BibTeX

@MISC{Russo07approximatestring,
    author = {Luís M. S. Russo and Gonzalo Navarro and Arlindo L. Oliveira},
    title = {Approximate String Matching with Lempel-Ziv Compressed Indexes},
    year = {2007}
}

### Abstract

A compressed full-text self-index for a text T is a data structure that requires reduced space and is capable of searching for patterns P in T. Furthermore, the structure can reproduce any substring of T, so it actually replaces T. Despite the explosion of interest in self-indexes in recent years, there has not been much progress on search functionalities beyond basic exact search. In this paper we focus on indexed approximate string matching (ASM), which is of great interest, for example, in computational biology applications. We present an ASM algorithm that works on top of a Lempel-Ziv self-index. We consider the so-called hybrid indexes, which are the best in practice for this problem. We show that a Lempel-Ziv index can be seen as an extension of the classical q-samples index. We give new insights on this type of index, which can be of independent interest, and then apply them to the Lempel-Ziv index. We show experimentally that our algorithm has competitive performance and provides a useful space-time tradeoff compared to classical indexes.
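
For orientation, the ASM problem that the abstract refers to can be stated sequentially: report every text position where the pattern occurs with at most k edits (character insertions, deletions, or substitutions). The sketch below is the classical dynamic-programming scan that runs in O(um) time for a text of length u and a pattern of length m. It is not the paper's indexed algorithm, only the unindexed baseline that an index tries to avoid. The function name `approx_search` and the choice of reporting 1-based end positions are assumptions made for illustration.

```python
def approx_search(pattern: str, text: str, k: int) -> list[int]:
    """Return the 1-based end positions in `text` where `pattern`
    matches some substring with edit distance at most k.

    Illustrative baseline only (hypothetical helper, not from the paper).
    """
    m = len(pattern)
    # col[i] = smallest edit distance between pattern[:i] and any
    # substring of the text that ends at the current text position.
    col = list(range(m + 1))
    ends = []
    for j, c in enumerate(text, start=1):
        prev_diag = col[0]  # value of col[i-1] from the previous text position
        # col[0] stays 0 because a match may start at any text position.
        for i in range(1, m + 1):
            old = col[i]
            cost = 0 if pattern[i - 1] == c else 1
            col[i] = min(
                prev_diag + cost,  # match or substitution
                old + 1,           # extra text character
                col[i - 1] + 1,    # missing pattern character
            )
            prev_diag = old
        if col[m] <= k:
            ends.append(j)
    return ends


if __name__ == "__main__":
    print(approx_search("abc", "xxabcxx", 0))  # exact occurrences only
    print(approx_search("abc", "xxabcxx", 1))  # allow one edit
```

Because the scan touches every text character, its cost grows with the text length regardless of how rare the matches are, which is the motivation for indexed approaches such as the one the paper proposes.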

### Citations

730 | Compression of individual sequences via variable-rate coding
- Ziv, Lempel
- 1978
Citation Context ... on the ILZI [28], yet the results can be carried over to similar indexes. The ILZI partitions the text into phrases such that every suffix of a phrase is also a phrase (similarly to LZ78 compressors [32], where every prefix of a phrase is also a phrase). It uses two tries, one storing the phrases and another storing the reverse phrases. In addition, it stores a mapping that permits moving from one tr... |

644 | Suffix arrays: a new method for on-line string searches
- Manber, Myers
- 1990
Citation Context ...x space (e.g. O(u log^k u)). Another trend is to reuse an index designed for exact searching, all of which are linear-space, and try to do ASM over it. Indexes such as suffix trees [9], suffix arrays [10], or based on so-called q-grams or q-samples, have been used. There exist several algorithms, based on suffix trees or arrays, which focus on worst-case performance [11–13]. Given the mentioned time-s... |

426 | Linear pattern matching algorithms
- Weiner
- 1973
Citation Context ...ns exponential index space (e.g. O(u log^k u)). Another trend is to reuse an index designed for exact searching, all of which are linear-space, and try to do ASM over it. Indexes such as suffix trees [9], suffix arrays [10], or based on so-called q-grams or q-samples, have been used. There exist several algorithms, based on suffix trees or arrays, which focus on worst-case performance [11–13]. Given ... |

409 | A Guided Tour to Approximate String Matching
- Navarro
Citation Context ...mum number of character insertions, deletions, and substitutions of single characters to convert one string into the other. The classical sequential search solution runs in O(um) worst-case time (see [1]). An optimal average-case algorithm requires time O(u(k + log_σ m)/m) [2,3], where σ is the size of the alphabet Σ. Those good average-case algorithms are called filtration algorithms: they traverse ... |

188 | Compressed Suffix Arrays and Suffix Trees with Applications to Text Indexing and String Matching - Grossi, Vitter |

172 | Compressed full-text indexes - NAVARRO, MÄKINEN - 2007 |

139 | A fast bit-vector algorithm for approximate string matching based on dynamic programming
- Myers
- 1999
Citation Context ... different approaches when k = 6.

| | ILZI | Hybrid | LZI | DLZI | FMIndex |
|---|---|---|---|---|---|
| English | 55 | 257 | 145 | 178 | 131 |
| DNA | 45 | 252 | 125 | 158 | 127 |
| Proteins | 105 | 366 | 217 | 228 | 165 |

BPM, the bit-parallel dynamic programming matrix of Myers [33], and EXP, the exact pattern partitioning by Navarro and Baeza-Yates [34]). For the real prototype we used a stricter backtracking than as explained in previous sections. For each pattern substring P[... |

131 | An analysis of the Burrows-Wheeler transform
- MANZINI
Citation Context ... searching, and replaces the text as it can reproduce any text substring (in which case they are called self-indexes). The size of those indexes is measured in terms of the empirical text entropy, H_k [25], which gives a lower bound on the number of bits per symbol achievable by a k-th order compressor. In this work we are interested in indexes based on Lempel-Ziv compression [21,22,26–28]. Despite the... |

119 | Indexing compressed text
- Ferragina, Manzini
Citation Context ...t a filtration ASM method that relies on looking for exact occurrences of pattern substrings, as this is what all self-indexes provide. Indeed, this has been already attempted [31] using the FM-index [21] and a Lempel-Ziv index [22]. The Lempel-Ziv index worked better because it is faster to extract the text to verify (recall that in self-indexes the text is not directly available). The specific struct... |

117 | Reducing the space requirement of suffix trees
- Kurtz
- 1999
Citation Context ... and q-sample indexes [18]. Yet, many of those linear-space indexes are very large anyway. For example, suffix arrays require 4 times the text size and suffix trees require at the very least 10 times [19]. In recent years a new and extremely successful class of indexes has emerged. Compressed full-text indexes use data compression techniques to produce less space-demanding data structures [20–24]. It ... |

93 | A sublinear algorithm for approximate keyword searching
- Myers
- 1994
Citation Context ...l cost is controlled. Indexes of this kind offer average-case guarantees of the form O(mn^λ) for some 0 < λ < 1, and work well for higher error levels. They have been implemented over q-gram indexes [16], suffix arrays [17], and q-sample indexes [18]. Yet, many of those linear-space indexes are very large anyway. For example, suffix arrays require 4 times the text size and suffix trees require at the... |

87 | New text indexing functionalities of the compressed suffix arrays - SADAKANE |

64 | Indexing text using the Ziv-Lempel trie
- Navarro
- 2004
Citation Context ...at relies on looking for exact occurrences of pattern substrings, as this is what all self-indexes provide. Indeed, this has been already attempted [31] using the FM-index [21] and a Lempel-Ziv index [22]. The Lempel-Ziv index worked better because it is faster to extract the text to verify (recall that in self-indexes the text is not directly available). The specific structure of the Lempel-Ziv index u... |

56 | Approximate string-matching over suffix trees - Ukkonen - 1993 |

56 | A hybrid indexing method for approximate string matching
- Navarro, Baeza-Yates
- 2000
Citation Context .... Indexes of this kind offer average-case guarantees of the form O(mn^λ) for some 0 < λ < 1, and work well for higher error levels. They have been implemented over q-gram indexes [16], suffix arrays [17], and q-sample indexes [18]. Yet, many of those linear-space indexes are very large anyway. For example, suffix arrays require 4 times the text size and suffix trees require at the very least 10 times... |

54 | Indexing methods for approximate string matching
- Navarro, Baeza-Yates, et al.
Citation Context ...n part by Fondecyt Grant 1-050493 (Chile). classical ASM algorithm. For long texts, however, sequential searching might be impractical because it must scan all the text. To avoid this we use an index [4]. There exist indexes specifically devoted to ASM, e.g. [5–8], but these are oriented to worst-case performance. There seems to exist an unbreakable space-time barrier with indexed ASM: Either one obta... |

51 | Dictionary matching and indexing with errors and don’t cares - Cole, Gottlieb, et al. - 2004 |

48 | Lempel-Ziv parsing and sublinear-size index structures for string matching - Kärkkäinen, Ukkonen - 1996 |

42 | Filtration with q-samples in approximate string matching
- Sutinen, Tarhio
- 1996
Citation Context ...tration algorithm, such that the “necessary condition” checked involves exact matching of pattern substrings, and as such can be verified with any exact-searching index. Such filtration indexes, e.g. [14,15], cease to be useful for moderate k values, which are still of interest in applications. The most successful approach, in practice, is in between the two techniques described above, and is called “hyb... |

41 | Approximate String Matching and Local Similarity
- Chang, Marr
- 1994
Citation Context ... characters to convert one string into the other. The classical sequential search solution runs in O(um) worst-case time (see [1]). An optimal average-case algorithm requires time O(u(k + log_σ m)/m) [2,3], where σ is the size of the alphabet Σ. Those good average-case algorithms are called filtration algorithms: they traverse the text fast while checking for a simple necessary condition, and only when... |

38 | Fast approximate matching using suffix trees - Cobbs - 1995 |

38 | Indexing text with approximate q-grams
- Navarro, Sutinen, et al.
Citation Context ...r average-case guarantees of the form O(mn^λ) for some 0 < λ < 1, and work well for higher error levels. They have been implemented over q-gram indexes [16], suffix arrays [17], and q-sample indexes [18]. Yet, many of those linear-space indexes are very large anyway. For example, suffix arrays require 4 times the text size and suffix trees require at the very least 10 times [19]. In recent years a ne... |

33 | A practical q-gram index for text retrieval allowing errors
- Navarro, Baeza-Yates
- 1998
Citation Context ...tration algorithm, such that the “necessary condition” checked involves exact matching of pattern substrings, and as such can be verified with any exact-searching index. Such filtration indexes, e.g. [14,15], cease to be useful for moderate k values, which are still of interest in applications. The most successful approach, in practice, is in between the two techniques described above, and is called “hyb... |

32 | Very fast and simple approximate string matching
- Navarro, Baeza-Yates
- 1999
Citation Context ... 257, 145, 178, 131; DNA: 45, 252, 125, 158, 127; Proteins: 105, 366, 217, 228, 165. BPM, the bit-parallel dynamic programming matrix of Myers [33], and EXP, the exact pattern partitioning by Navarro and Baeza-Yates [34]). For the real prototype we used a stricter backtracking than as explained in previous sections. For each pattern substring P[y1, y2] to be matched, we computed the maximum number of errors that coul... |

26 | Approximate string matching using compressed suffix arrays
- Huynh, Hon, et al.
- 2004
Citation Context ...iv compression [21,22,26–28]. Despite the great success of self-indexes, they have been mainly used for exact searching. Only very recently some indexes taking O(u) or O(u √(log u)) bits have appeared [29,30,7]. Yet, those are again of the worst-case type, and thus all their times are exponential on k. In this paper we present a practical algorithm that runs on a compressed self-index and belongs to the mos... |

24 | A Tutorial Introduction to Computational Biochemistry Using Darwin - Gonnet - 1994 |

23 | Reducing the space requirement of LZ-index - Arroyuelo, Navarro, et al. - 2006 |

19 | Average-optimal single and multiple approximate string matching
- Fredriksson, Navarro
Citation Context ... characters to convert one string into the other. The classical sequential search solution runs in O(um) worst-case time (see [1]). An optimal average-case algorithm requires time O(u(k + log_σ m)/m) [2,3], where σ is the size of the alphabet Σ. Those good average-case algorithms are called filtration algorithms: they traverse the text fast while checking for a simple necessary condition, and only when... |

18 | A compressed self-index using a Ziv-Lempel dictionary
- Russo, Oliveira
- 2006
Citation Context ...e want a Lempel-Ziv-based index, so that the extraction of text to verify is fast; (3) we wish to avoid the problems derived from pieces spanning several Lempel-Ziv phrases. We will focus on an index [28] whose suffix-tree-like structure is useful for this approximate searching. Mimicking q-sample indexes is particularly useful for our goals. Consider that the text is partitioned into contiguous q-sam... |

12 | A linear size index for approximate pattern matching
- Chan, Lam, et al.
- 2006
Citation Context ...iv compression [21,22,26–28]. Despite the great success of self-indexes, they have been mainly used for exact searching. Only very recently some indexes taking O(u) or O(u √(log u)) bits have appeared [29,30,7]. Yet, those are again of the worst-case type, and thus all their times are exponential on k. In this paper we present a practical algorithm that runs on a compressed self-index and belongs to the mos... |

11 | Improved approximate string matching using compressed suffix data structures
- Lam, Sung, et al.
- 2005
Citation Context ...iv compression [21,22,26–28]. Despite the great success of self-indexes, they have been mainly used for exact searching. Only very recently some indexes taking O(u) or O(u √(log u)) bits have appeared [29,30,7]. Yet, those are again of the worst-case type, and thus all their times are exponential on k. In this paper we present a practical algorithm that runs on a compressed self-index and belongs to the mos... |

8 | Text indexing with errors - Maass, Nowak - 2005 |

6 | Text indexing with errors - Maaß, Nowak - 2005 |

3 | Dotted suffix trees: a structure for approximate text indexing - Coelho, Oliveira |

2 | Solución de consultas complejas sobre un indice de texto comprimido (solving complex queries over a compressed text index). Undergraduate thesis
- Morales
- 2005
Citation Context ...d self-index to implement a filtration ASM method that relies on looking for exact occurrences of pattern substrings, as this is what all self-indexes provide. Indeed, this has been already attempted [31] using the FM-index [21] and a Lempel-Ziv index [22]. The Lempel-Ziv index worked better because it is faster to extract the text to verify (recall that in self-indexes the text is not directly availab... |