by Edleno Silva de Moura, Gonzalo Navarro, Nivio Ziviani, Ricardo Baeza-Yates
Belo Horizonte, Brasil
Abstract:
We present a fast compression and decompression technique for natural language texts. The novelty is that exact search can be performed directly on the compressed text, using any known sequential pattern matching algorithm. Approximate search can also be done efficiently without any decoding. The compression scheme uses a semi-static word-based model and Huffman coding in which the coding alphabet is byte-oriented rather than bit-oriented. We use the first bit of each byte to mark the beginning of a word, which allows the compressed pattern to be searched for directly in the compressed text. We achieve a compression ratio of about 33% for typical English texts. When searching for simple patterns, our experiments show that running our algorithm on the compressed text is almost twice as fast as running agrep on the uncompressed version of the same text. When searching for complex or approximate patterns, our algorithm is up to 8 times faster than agrep.
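The scheme described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`tokenize`, `build_codes`, `compress`, `decompress`, `find_word`) are hypothetical, and the tokenization (alternating runs of word and non-word characters, each treated as a symbol) is a simplifying assumption. The sketch builds a 128-ary Huffman code over token frequencies, stores 7 code bits per byte, and sets the high bit of the first byte of every codeword, so a codeword can be located with an ordinary byte-string search.

```python
import heapq
import re
from collections import Counter
from itertools import count

RADIX = 128  # 7 payload bits per byte; the 8th bit is the word-start tag


def tokenize(text):
    # Assumption: words and separators are both symbols, so decoding is lossless.
    return re.findall(r"\w+|\W+", text)


def build_codes(tokens):
    """Return {token: bytes} from a 128-ary Huffman tree over token frequencies."""
    tie = count()  # tie-breaker so the heap never compares tree nodes
    heap = [(freq, next(tie), tok) for tok, freq in Counter(tokens).items()]
    # Pad with zero-frequency dummies so every merge takes exactly RADIX nodes.
    while len(heap) < 2 or (len(heap) - 1) % (RADIX - 1) != 0:
        heap.append((0, next(tie), None))
    heapq.heapify(heap)
    while len(heap) > 1:
        group = [heapq.heappop(heap) for _ in range(RADIX)]
        total = sum(freq for freq, _, _ in group)
        heapq.heappush(heap, (total, next(tie), [node for _, _, node in group]))
    codes = {}
    stack = [(heap[0][2], [])]
    while stack:
        node, digits = stack.pop()
        if isinstance(node, list):  # internal node: one child per 7-bit digit
            for digit, child in enumerate(node):
                stack.append((child, digits + [digit]))
        elif node is not None:  # real token (dummies are None)
            codes[node] = bytes([digits[0] | 0x80] + digits[1:])
    return codes


def compress(text, codes):
    return b"".join(codes[tok] for tok in tokenize(text))


def decompress(data, codes):
    decode = {code: tok for tok, code in codes.items()}
    out, start = [], 0
    for i in range(1, len(data) + 1):
        # A tagged byte (or the end of the data) closes the previous codeword.
        if i == len(data) or data[i] & 0x80:
            out.append(decode[data[start:i]])
            start = i
    return "".join(out)


def find_word(data, codes, word):
    """Byte offsets of `word` in the compressed data, found without decoding."""
    pattern = codes.get(word)
    hits = []
    if pattern is None:
        return hits
    pos = data.find(pattern)
    while pos != -1:
        hits.append(pos)
        pos = data.find(pattern, pos + 1)
    return hits


if __name__ == "__main__":
    text = "the cat and the dog and the bird"
    codes = build_codes(tokenize(text))
    data = compress(text, codes)
    print(len(text), len(data), find_word(data, codes, "the"))
```

Because only the first byte of a codeword carries the tag bit and the Huffman code is prefix-free, a byte-level match of a word's codeword can only occur where that exact word was encoded. In this sketch `bytes.find` stands in for the sequential pattern matching algorithm mentioned in the abstract; any other exact matcher could be substituted.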