## Lightweight natural language text compression. Information Retrieval (2007)

Citations: 27 (21 self)

### BibTeX

@MISC{Brisaboa07lightweightnatural,

author = {Nieves R. Brisaboa and Antonio Fariña and Gonzalo Navarro and José R. Paramá},

title = {Lightweight natural language text compression. Information Retrieval},

year = {2007}

}

### Abstract

Variants of Huffman codes where words are taken as the source symbols are currently the most attractive choices to compress natural language text databases. In particular, Tagged Huffman Code by Moura et al. offers fast direct searching on the compressed text and random access capabilities, in exchange for producing around 11% larger compressed files. This work describes End-Tagged Dense Code and (s, c)-Dense Code, two new semistatic statistical methods for compressing natural language texts. These techniques permit simpler and faster encoding and obtain better compression ratios than Tagged Huffman Code, while maintaining its fast direct search and random access capabilities. We show that Dense Codes improve Tagged Huffman Code compression ratio by about 10%, reaching only 0.6% overhead over the optimal Huffman compression ratio. Being simpler, Dense Codes are generated 45% to 60% faster than Huffman codes. This makes Dense Codes a very attractive alternative to Huffman code variants for various reasons: they are simpler to program, faster to build, of almost optimal size, and as fast and easy to search as the best Huffman variants, which are not so close to the optimal size.
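The abstract names End-Tagged Dense Code but does not spell out the encoding. The following is a minimal sketch of a byte-oriented version as it is commonly described: the last byte of every codeword has a value in 128..255 (the "end tag"), all earlier bytes are in 0..127, and words are numbered by decreasing frequency. The function names, the 0-based ranking, and the whitespace-free handling of separators are assumptions made for illustration and are not taken from this record.

```python
from collections import Counter


def etdc_encode(rank):
    """Return the codeword (bytes) for a 0-based frequency rank."""
    if rank < 0:
        raise ValueError("rank must be non-negative")
    # The last byte carries the tag: its value lies in 128..255.
    out = [128 + rank % 128]
    x = rank // 128
    # Earlier bytes lie in 0..127; they are produced last-to-first.
    while x > 0:
        x -= 1
        out.append(x % 128)
        x //= 128
    return bytes(reversed(out))


def etdc_decode(data):
    """Return the list of ranks stored in a compressed byte string."""
    ranks, acc = [], 0
    for b in data:
        if b >= 128:                 # tagged byte: the codeword ends here
            ranks.append(acc * 128 + (b - 128))
            acc = 0
        else:                        # continuation byte
            acc = acc * 128 + b + 1
    return ranks


def compress(words):
    """Semistatic compression: rank words by frequency, then encode."""
    counts = Counter(words)
    # Ties are broken alphabetically only to keep the sketch deterministic.
    vocab = sorted(counts, key=lambda w: (-counts[w], w))
    rank = {w: i for i, w in enumerate(vocab)}
    return vocab, b"".join(etdc_encode(rank[w]) for w in words)


def decompress(vocab, data):
    return [vocab[i] for i in etdc_decode(data)]


def count_codeword(data, rank):
    """Count a word's occurrences by scanning the compressed bytes directly."""
    cw = etdc_encode(rank)
    count, pos = 0, data.find(cw)
    while pos != -1:
        # A hit is a real codeword only if the previous byte ended a codeword.
        if pos == 0 or data[pos - 1] >= 128:
            count += 1
        pos = data.find(cw, pos + 1)
    return count
```

In this sketch the 128 most frequent words get one-byte codewords, the next 128² get two bytes, and so on. The check on the preceding byte in `count_codeword` is what rules out false matches when one codeword is a suffix of another, which is the property that allows searching without decompressing.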

### Citations

1138 | A Universal Algorithm for Sequential Data Compression
- Ziv, Lempel
- 1977
Citation Context ... (Wan 2003). Other methods with competitive compression ratios on natural language text, yet unable of searching the compressed text faster than the uncompressed text, include Ziv-Lempel compression (Ziv and Lempel 1977, Ziv and Lempel 1978) (implemented for example in Gnu gzip), Burrows-Wheeler compression (Burrows and Wheeler 1994) (implemented for example in Seward’s bzip2), and statistical modeling with arithmet... |

980 |
Human behavior and the principle of least effort
- Zipf
- 1949
Citation Context ...on of words in natural language is much more skewed than that of characters, following a Zipf Law (that is, the frequency of the i-th most frequent word is proportional to 1/i^θ, for some 1 < θ < 2 (Zipf 1949, Baeza-Yates and Ribeiro-Neto 1999)), and the separators are even more skewed. As a result, compression ratios get around 25%, which is close to what can be obtained with any other compression method... |

942 |
A method for the construction of minimum redundancy codes
- Huffman
- 1952
Citation Context ...ulary word is assigned a codeword. This process is different for each method: • The PHC encoding phase is the application of the Huffman technique (Moffat and Katajainen 1995, Moffat and Turpin 1996, Huffman 1952). Encoding takes O(n) time overall. • The SCDC encoding phase has two parts: The first computes the list of accumulated frequencies and searches for the optimal s and c values. Its cost is O(n) in pr... |

730 | Compression of individual sequences via variable-rate coding
- Ziv, Lempel
- 1978
Citation Context ...thods with competitive compression ratios on natural language text, yet unable of searching the compressed text faster than the uncompressed text, include Ziv-Lempel compression (Ziv and Lempel 1977, Ziv and Lempel 1978) (implemented for example in Gnu gzip), Burrows-Wheeler compression (Burrows and Wheeler 1994) (implemented for example in Seward’s bzip2), and statistical modeling with arithmetic coding (Carpinelli... |

617 |
Text Compression
- Bell, Cleary, et al.
- 1990
Citation Context ...arch as the best Huffman variants, which are not so close to the optimal size. Keywords: Text databases, natural language text compression, searching compressed text. 1 Introduction Text compression (Bell et al. 1990) permits representing a document using less space. This is useful not only to save disk space, but more importantly, to save disk transfer and network transmission time. In recent years, compression ... |

572 |
A Fast String-Searching Algorithm
- Boyer, Moore
- 1977
Citation Context ...ern word or phrase and then running any classical string matching algorithm for the compressed pattern on the compressed text. In particular, one can use those algorithms able of skipping characters (Boyer and Moore 1977, Navarro and Raffinot 2002). This is not possible with Plain Huffman Code, because of the false matches problem. On Tagged Huffman Code false matches are impossible thanks to the flag bits. It is int... |

565 | A Block-sorting Lossless Data Compression Algorithm
- Burrows, Wheeler
- 1994
Citation Context ...ng the compressed text faster than the uncompressed text, include Ziv-Lempel compression (Ziv and Lempel 1977, Ziv and Lempel 1978) (implemented for example in Gnu gzip), Burrows-Wheeler compression (Burrows and Wheeler 1994) (implemented for example in Seward’s bzip2), and statistical modeling with arithmetic coding (Carpinelli et al. 1999). 3 End-Tagged Dense Code We obtain End-Tagged Dense Code (ETDC) by a simple chan... |

317 | Fast text searching allowing errors - Wu, Manber - 1992 |

189 | A tool to search through entire file systems
- Manber, Wu
- 1994
Citation Context ...ms in the wrong order. Moreover, some space-time tradeoffs in inverted indexes are based on grouping documents into blocks, and therefore sequential scanning is necessary even on single-word queries (Manber and Wu 1994, Navarro et al. 2000). Although partial decompression followed by searching is a solution, direct search of the compressed text is much more efficient (Ziviani et al. 2000). Classic compression techn... |

131 | Agrep – a fast approximate pattern-matching tool - Wu, Manber - 1992 |

126 |
Information retrieval: Computational and theoretical aspects
- Heaps
- 1978
Citation Context ...hods must encode together with the compressed text) is not significant on large text collections, as the vocabulary grows slowly (O(N^β) symbols on a text of N words, for some β ≈ 0.5, by Heaps Law (Heaps 1978, Baeza-Yates and Ribeiro-Neto 1999)). This solution is acceptable for compressed text databases. With respect to searching those Huffman codes, essentially one can compress the pattern and search the ... |

93 | Flexible Pattern Matching in Strings – Practical on-line search algorithms for texts and biological sequences - Navarro, Raffinot - 2002 |

81 | Fast and flexible word searching on compressed text - Moura, Navarro, et al. - 2000 |

65 |
An informational theory of the statistical structure of language. Communication Theory
- Mandelbrot
- 1953
Citation Context ...hich upper bounds the coding inefficiency of ETDC with respect to a b-ary Huffman. Several studies about bounds on Dense Codes and b-ary Huffman codes applied to Zipf (Zipf 1949) and Zipf-Mandelbrot (Mandelbrot 1953) distributions can be found in (Navarro and Brisaboa 2006, Fariña 2005). As shown in Section 6, ETDC improves Tagged Huffman Code compression ratio by more than 8%. Its difference with respect to Pla... |

62 | A text compression scheme that allows fast searching directly in the compressed file - Manber - 1994 |

49 | Adding compression to block addressing inverted indices. Inf. Retrieval - Navarro, Moura, et al. - 2000 |

45 |
On the implementation of minimum redundancy prefix codes
- Moffat, Turpin
- 1997
Citation Context ... 2. Encoding. Each vocabulary word is assigned a codeword. This process is different for each method: • The PHC encoding phase is the application of the Huffman technique (Moffat and Katajainen 1995, Moffat and Turpin 1996, Huffman 1952). Encoding takes O(n) time overall. • The SCDC encoding phase has two parts: The first computes the list of accumulated frequencies and searches for the optimal s and c values. Its cost... |

43 | Boyer-Moore string matching over Ziv-Lempel compressed text - Navarro, Tarhio - 2000 |

37 | Factor oracle: a new structure for pattern matching. Theory and
- Allauzen, Crochemore, et al.
- 1999
Citation Context ...ferent algorithms were tested to search the uncompressed text: i) our own implementation of the Set Horspool’s algorithm, ii) author’s implementation of Set Backward Oracle Matching algorithm (SBOM) (Allauzen et al. 1999), and iii) the agrep software (Wu and Manber 1992b, Wu and Manber 1992a), a fast approximate pattern-matching tool which allows, among other things, searching a text for multiple patterns. Agrep sear... |

33 |
Word-based text compression
- Moffat
- 1989
Citation Context ...language texts, this yields poor compression ratios (around 65%). The key idea to the success of semistatic compression on natural language text databases was to consider words as the source symbols (Moffat 1989) (as well as separators, defined as maximal text substrings among consecutive words). The distribution of words in natural language is much more skewed than that of characters, following a Zipf Law (... |

27 |
In-place calculation of minimum-redundancy codes
- Moffat, Katajainen
- 1995
Citation Context ...cal for PHC, SCDC, and ETDC. 2. Encoding. Each vocabulary word is assigned a codeword. This process is different for each method: • The PHC encoding phase is the application of the Huffman technique (Moffat and Katajainen 1995, Moffat and Turpin 1996, Huffman 1952). Encoding takes O(n) time overall. • The SCDC encoding phase has two parts: The first computes the list of accumulated frequencies and searches for the optimal ... |

25 | Fast searching on compressed text allowing errors
- Moura, Navarro, et al.
- 1998
Citation Context ...nly proven extremely effective (with compression ratios around 25%-35%), but also permitted searching the compressed text much faster (up to 8 times) than the original text (Turpin and Moffat 1997, Moura et al. 1998, Moura et al. 2000). The integration of compression and indexing techniques (Witten et al. 1999, Navarro et al. 2000, Ziviani et al. 2000) opened the door to compressed text databases, where texts an... |

23 | An efficient compression code for text databases - Brisaboa, Iglesias, et al. - 2003 |

22 |
A new algorithm for data compression
- Gage
- 1994
Citation Context ...bstitution method for direct searching we know of was proposed by Manber (1997), yet its compression ratios were poor (around 70%). This encoding was a simplified variant of Byte-Pair Encoding (BPE) (Gage 1994). BPE is a multi-pass method based on finding frequent pairs of consecutive source symbols and replacing them by a fresh source symbol. On natural language text, it obtains a poor compression ratio (... |

18 | (S,C)-dense coding: An optimized compression code for natural language text databases - Brisaboa, Fariña, et al. - 2003 |

12 |
Practical Fast Searching
- Horspool
- 1980
Citation Context ...x code) is compensated because the size of the compressed text is smaller with Dense Codes than with Tagged Huffman Code. Figure 6 gives a search algorithm based on Horspool’s variant of Boyer-Moore (Horspool 1980, Navarro and Raffinot 2002). This algorithm is especially well suited to this case (codewords of length at most 3–4, characters with relatively uniform distribution in {0...255}). 6 Empirical resu... |

12 |
Fast file search using text compression
- Turpin, Moffat
- 1997
Citation Context ...anguage texts have not only proven extremely effective (with compression ratios around 25%-35%), but also permitted searching the compressed text much faster (up to 8 times) than the original text (Turpin and Moffat 1997, Moura et al. 1998, Moura et al. 2000). The integration of compression and indexing techniques (Witten et al. 1999, Navarro et al. 2000, Ziviani et al. 2000) opened the door to compressed text databa... |

10 |
Speeding up the pattern matching machine for compressed texts
- Miyazaki, Fukamachi, et al.
- 1998
Citation Context ...on methods, however, are not entirely satisfactory either. For example, the Huffman (1952) code offers direct random access from codeword beginnings and decent decompression and direct search speeds (Miyazaki et al. 1998), yet the compression ratio of the Huffman code on natural language is poor (around 65%). The key to the success of natural language compressed text databases is the use of a semistatic word-based mod... |

9 | Browsing and searching compressed documents - Wan - 2003 |

8 | LZgrep: a Boyer-Moore string matching tool for Ziv-Lempel compressed text - Navarro, Tarhio - 2005 |

7 | New Compression Codes for Text Databases - Fariña - 2005 |

6 | String matching with stopper encoding and code splitting - Rautio, Tanninen, et al. - 2002 |

5 | Efficiently decodable and searchable natural language adaptive compression - Brisaboa, Fariña, et al. - 2005 |

5 | Speeding up string pattern matching by text compression: The dawn of a new era. Transactions of Information Processing Society of Japan - Takeda, Shibata, et al. - 2001 |

3 |
On universal codeword sets
- Lakshmanan
- 1981
Citation Context ...lias 1975) also assign codewords to source symbols in order of decreasing probability, with shorter codewords for the first positions. Other authors proposed other codes with similar characteristics (Lakshmanan 1981, Fraenkel and Klein 1996). These codes yield an average codeword length within a constant factor of the optimal average length. Unfortunately, the constant may be too large for the code to be prefera... |

3 | On the analysis of variable-to-variable length codes - Savari, Szpankowski |

2 | Compressing dynamic text collections via phrase-based coding - Brisaboa, Fariña, et al. |

2 | bit based compression using arithmetic coding, http://www.cs.mu.oz.au/~alistair/arith_coder - Carpinelli - 1999 |

2 | Pattern matching - Klein, Shapira - 2005 |

1 | Recent advances in applied probability - Baeza-Yates, Navarro - 2004 |

1 | efficient natural language adaptive compression - Brisaboa, Fariña, et al. |

1 | Practical fast searching - Horspool - 1980 |

1 | LZgrep: A Boyer-Moore String Matching Tool for Ziv-Lempel Compressed Text. Software Practice and Experience (SPE) - Navarro, Tarhio - 2005 |