Direct Pattern Matching on Compressed Text (1998) [14 citations — 5 self]

by Edleno Silva de Moura, Gonzalo Navarro, Nivio Ziviani, Ricardo Baeza-Yates (Belo Horizonte, Brasil)

Abstract:

We present a fast compression and decompression technique for natural language texts. The novelty is that exact search can be done directly on the compressed text, using any known sequential pattern matching algorithm. Approximate search can also be done efficiently without any decoding. The compression scheme uses semi-static word-based modeling and Huffman coding in which the coding alphabet is byte-oriented rather than bit-oriented. We use the first bit of each byte to mark the beginning of a word, which allows the compressed pattern to be searched for directly in the compressed text. We achieve a compression ratio of about 33% for typical English texts. When searching for simple patterns, our experiments show that running our algorithm on a compressed text is almost twice as fast as running agrep on the uncompressed version of the same text. When searching for complex or approximate patterns, our algorithm is up to 8 times faster than agrep.
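
The tagging idea in the abstract can be illustrated with a small Python sketch. It is not the authors' implementation. As a simplifying assumption it replaces Huffman coding with a plain frequency-rank code (more frequent words get shorter byte sequences), treats whitespace-separated tokens as words, and ignores separators. What it keeps is the byte-oriented codeword with the first bit of the first byte set as a word-start marker, so a compressed pattern can be located with an ordinary byte search. The function names `build_codebook`, `compress`, and `search_word` are hypothetical.

```python
from collections import Counter


def build_codebook(words):
    """Assign each distinct word a tagged, byte-oriented codeword.

    Assumption: a frequency-rank code stands in for Huffman coding.
    Each byte carries 7 payload bits; the high bit is set only on the
    first byte of a codeword to mark the beginning of a word.
    """
    freq = Counter(words)
    # Most frequent words first; ties broken alphabetically for determinism.
    ranked = sorted(freq, key=lambda w: (-freq[w], w))
    codebook = {}
    for rank, word in enumerate(ranked):
        digits = []
        n = rank
        while True:
            digits.append(n & 0x7F)  # low 7 bits
            n >>= 7
            if n == 0:
                break
        digits.reverse()             # most significant digit first
        digits[0] |= 0x80            # tag the first byte as a word start
        codebook[word] = bytes(digits)
    return codebook


def compress(words, codebook):
    """Concatenate the codewords of all words (second pass of a semi-static scheme)."""
    return b"".join(codebook[w] for w in words)


def search_word(compressed, codebook, word):
    """Return byte offsets where `word` occurs, without decompressing.

    The pattern is compressed with the same codebook and located with a
    plain byte search. A hit is accepted only if the next byte starts a
    new codeword (or the text ends), so a short codeword is never
    confused with the prefix of a longer one.
    """
    code = codebook.get(word)
    if code is None:
        return []  # word not in the vocabulary, so it cannot occur
    hits = []
    start = compressed.find(code)
    while start != -1:
        end = start + len(code)
        if end == len(compressed) or compressed[end] & 0x80:
            hits.append(start)
        start = compressed.find(code, start + 1)
    return hits


words = "to be or not to be that is the question".split()
codebook = build_codebook(words)
compressed = compress(words, codebook)
print(search_word(compressed, codebook, "be"))
```

Because the first byte of the pattern has its high bit set, every candidate match necessarily begins at a word boundary in the compressed text. The boundary check at the end of the match is needed here only because the stand-in rank code is not prefix-free; a prefix-free code such as Huffman's would not require it.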

Citations

613 A method for the construction of minimum-redundancy codes – Huffman - 1952
249 Fast text searching allowing errors – Wu, Manber - 1992
174 Overview of the Third Text REtrieval Conference – Harman - 1995
134 On the Complexity of Finite Sequences – Lempel, Ziv - 1976
111 A locally adaptive data compression scheme – Bentley, Sleator, et al. - 1986
83 Information Retrieval - Computational and Theoretical Aspects – Heaps - 1978
80 A very fast substring search algorithm – Sunday - 1990
70 String matching in Lempel-Ziv compressed strings – Farach, Thorup - 1998
68 Let sleeping files lie: Pattern matching in Z-compressed files – Amir, Benson, et al. - 1996
68 Adding compression to a full-text retrieval system – Zobel, Moffat - 1995
58 Efficient two-dimensional compressed matching – Amir, Benson - 1992
46 A faster algorithm for approximate string matching – Baeza-Yates, Navarro - 1996
42 A text compression scheme that allows fast searching directly in the compressed file – Manber - 1994
33 Large Text Searching Allowing Errors – Araújo, Navarro, et al. - 1997
19 Efficient algorithms for Lempel-Ziv encoding – Gasieniec, Karpinski, et al. - 1996
18 Indexing compressed text – Moura, Navarro, et al. - 1997
16 Multiple approximate string matching – Baeza-Yates, Navarro - 1997
11 Word-based text compression (Software: Practice and Experience) – Moffat - 1989
2 Pattern matching problem for strings with short descriptions – Karpinski, Shinohara, et al. - 1997