## Indexing Compressed Text (1997)

Venue: | Proceedings of the 4th South American Workshop on String Processing |

Citations: | 25 - 9 self |

### Citations

1522 | A Universal Algorithm for Sequential Data Compression - Ziv, Lempel - 1977 |

1372 |
A method for the construction of minimum redundancy codes.
- Huffman
- 1952
Citation Context ... to update the statistical information as they decompress. 2.2 Compressing Words The most used encoding methods are arithmetic coding [WNC87], Ziv-Lempel coding family [ZL77, ZL78] and Huffman coding [Huf52]. Arithmetic coding can achieve better compression ratios than Huffman coding because of its ability to use fractional parts of bits. However, as in the arithmetic coding the data is encoded as ration... |

954 | Compression of Individual Sequences Via Variable-rate Coding - Ziv, Lempel - 1978 |

835 | Suffix arrays: a new method for on-line string searches.
- Manber, Myers
- 1991
Citation Context ...xing techniques using inverted lists have recently received some attention [MB95, WBN92, ZM95]. However, work on combining compression techniques and suffix arrays has not been pursued. Suffix arrays [MM90] or Pat arrays [Gon87, GBYS92] are indexing structures that achieve space and time complexity similar to inverted lists. Their main drawback is their costly construction and maintenance procedure. How... |

801 |
Arithmetic Coding for Data Compression.
- Witten, Neal, et al.
- 1987
Citation Context ...static methods than on adaptive methods because the last ones need to update the statistical information as they decompress. 2.2 Compressing Words The most used encoding methods are arithmetic coding [WNC87], Ziv-Lempel coding family [ZL77, ZL78] and Huffman coding [Huf52]. Arithmetic coding can achieve better compression ratios than Huffman coding because of its ability to use fractional parts of bits. ... |

722 | Text Compression - Bell, Cleary, et al. - 1990 |

456 |
Human Behavior and the Principle of Least Effort.
- Zipf
- 1949
Citation Context ... proposing achieves slight better compression ratios. 2.4 Analytical and Experimental Results We first analyze the zero-order entropy of natural language text when words are symbols, using Zipf's law [Zip49]. We then present experimental results showing that Huffman coding is very close to the entropy limit, thus supporting the thesis that it is a good choice for our purposes. We also compare the separat... |

210 |
Overview of the Third Text Retrieval Conference (TREC-3).
- Harman
- 1994
Citation Context ...g random access into a compressed text while their main purpose is to search sequentially the compressed file. For the experimental results we used literary texts from the 2 gigabytes trec collection [Har95]. We have chosen the following texts: ap Newswire (1989), doe - Short abstracts from doe publications, fr - Federal Register (1989), wsj - Wall Street Journal (1987, 1988, 1989) and ziff - articles fr... |

173 | A locally adaptive data compression scheme.
- Bentley, Sleator, et al.
- 1986
Citation Context ...n scheme to be used in conjunction with suffix arrays. We make three main contributions. First, we study analytically and experimentally different variations of the word-oriented compression paradigm [BSTW86]. Second, we define an encoding method that preserves the lexicographical ordering of the text words. This idea already existed from a long time ago [Knu73], but to the best of our knowledge, it had n... |

171 |
Information Retrieval: Computational and Theoretical Aspects.
- Heaps
- 1978
Citation Context ... now the performance of this indexing scheme. Filtering and compression takes $O(N)$ time, which is negligible compared to suffix array construction. The Hu-Tucker algorithm is $O(n \log n)$. As shown in [Hea78], $n = O(N^{\beta})$ for $0 < \beta < 1$. For example, for the doe collection we have $n = 9.43\,N^{0.53}$, while for ziff we have $n = 10.77\,N^{0.51}$. Therefore, Hu-Tucker time is $O(N^{\beta} \log N)$ time, which is domina... |

114 | Let sleeping files lie: pattern matching in Z-compressed files. - Amir, Benson, et al. - 1996 |

111 |
The Art of Computer Programming: Sorting and Searching (Volume 3).
- Knuth
- 1998
Citation Context ... of the word-oriented compression paradigm [BSTW86]. Second, we define an encoding method that preserves the lexicographical ordering of the text words. This idea already existed from a long time ago [Knu73], but to the best of our knowledge, it had not been applied in practice. Third, based on our encoding method, we describe a mechanism to build suffix arrays for either compressed or uncompressed texts... |

106 | String matching in Lempel-Ziv compressed strings. - Farach, Thorup - 1998 |

91 | Adding compression to a full-text retrieval system.
- Zobel, Moffat
- 1995
Citation Context ...ession scheme are not valid for textual databases. For example, the need of direct access to parts of the text immediately rules out adaptive models, which are pervasive in modern compression schemes [ZM95]. Adaptive models start with no information about the text and progressively learn about its statistical distribution as the compression process goes on. They are one-pass and store no additional info... |

76 | A text compression scheme that allows fast searching directly in the compressed file
- Manber
- 1997
Citation Context ...iginal size, and the whole index (text and suffix array) is reduced to 60%, which is less than the space of the uncompressed text with no index. Another type of text compression scheme is proposed in [Man93]. The main purpose of [Man93] is to speed up sequential searching by compressing the search key rather than decompressing the text being searched. As a consequence it requires no modification in the a... |

71 |
A Method for the Construction of Minimum Redundancy Codes
- Huffman
- 1952
Citation Context ...d to update the statistical information as they decompress. 2.2 Compressing Words The most used encoding methods are arithmetic coding [WNC87], Ziv-Lempel coding family [ZL77, ZL78] and Huffman coding [Huf52]. Arithmetic coding can achieve better compression ratios than Huffman coding because of its ability to use fractional parts of bits. However, as in the arithmetic coding the data is encoded as rationa... |

50 |
New indices for text: PAT trees and PAT arrays
- Gonnet, Baeza-Yates, et al.
- 1992
Citation Context ...he sorting efficiently when large texts are involved. Large texts do not fit in main memory and an external sort procedure has to be used. Our indexing algorithm is based on the algorithm proposed in [GBYS92] for generating large suffix arrays. The algorithm divides the text in blocks small enough to be individually indexed in main memory. It works with each block separately in three distinct phases. In t... |

43 | Large text searching allowing errors
- Araújo, Navarro, et al.
- 1997
Citation Context ... are nearly twice as fast than in the uncompressed version of the index. This scheme can be readily adapted to meet other requirements. For example, it is not difficult to mix it with the approach of [ANZ97] to allow searching for regular expressions, approximate patterns, etc. This is because that approach is mainly based on processing the vocabulary, which is stored in our index. We are currently worki... |

33 | In-situ generation of compressed inverted files - Moffat, Bell - 1995 |

23 |
Human Behavior and the Principle of Least Effort
- Zipf
- 1949
Citation Context ...re proposing achieves slight better compression ratios. 2.4 Analytical and Experimental Results We first analyze the zero-order entropy of natural language text when words are symbols, using Zipf's law [Zip49]. We then present experimental results showing that Huffman coding is very close to the entropy limit, thus supporting the thesis that it is a good choice for our purposes. We also compare the separate... |

21 | Word-based text compression. Software Practice and Experience - Moffat - 1989 |

20 | Hierarchies of indices for text searching - Baeza-Yates, Barbosa, et al. - 1996 |

19 | Let sleeping files lie: pattern matching in Z-compressed files - Amir, Benson, et al. - 1996 |

14 |
Optimal computer-search trees, and variable length alphabetic codes
- Hu, Tucker
- 1971
Citation Context ...arch tree. This is also the optimal trie we want, since we minimize the same expression (the depth is the length of the code for each leaf). The solution to this problem is presented by Hu and Tucker [HT71] and also considered in [Knu73], where the tree is built with an O(n log n) algorithm. Therefore, the complexity is the same as for obtaining a Huffman code. A natural question is how far is the resul... |

9 |
From partial to full inverted lists for text searching
- Barbosa, Ziviani
- 1995
Citation Context ...xing whole collections of documents. Another line of work we are pursuing is to handle modifications to a text database (insertions, deletions, updates). Finally, we can add search structures such as [BZ95] on top of the suffix array to improve the performance significantly (reductions by a factor of more than 5 times are reported). We have not attempted to compress the suffix array itself. This is beca... |

7 | Word-based text compression. Software Practice and Experience - Moffat - 1989 |

6 | Optimized binary search and text retrieval - Barbosa, Navarro, et al. - 1995 |

5 | Indexing and compressing full-text databases for CD-ROM - Witten, Bell, et al. - 1992 |

4 | pat 3.1: An Efficient Text Searching System. User's Manual - Gonnet - 1987 |

3 | In Situ Generation of Compressed Inverted Files - Moffat, Bell - 1995 |

2 | pat 3.1: An Efficient Text Searching System. User's Manual - Gonnet - 1987 |