Adding Compression to a Full-Text Retrieval System
, 1995
Abstract

Cited by 90 (25 self)
We describe the implementation of a data compression scheme as an integral and transparent layer within a full-text...
An efficient indexing technique for full-text database systems
 In Proceedings of 18th International Conference on Very Large Databases
, 1992
Abstract

Cited by 84 (10 self)
Full-text database systems require an index to allow fast access to documents based on their content. We propose an inverted file indexing scheme based on compression. This scheme allows users to retrieve documents using words occurring in the documents, sequences of adjacent words, and statistical ranking techniques. The compression methods chosen ensure that the storage requirements are small and that dynamic update is straightforward. The only assumption that we make is that sufficient main memory is available to support an in-memory vocabulary; given this assumption, the method we describe requires at most one disc access per query term to identify answers to queries.
Compressing Integers for Fast File Access
 The Computer Journal
, 1999
Abstract

Cited by 77 (14 self)
In this paper we show experimentally that, for large or small collections, storing integers in a compressed format reduces the time required for either sequential stream access or random access. We compare different approaches to compressing integers, including the Elias gamma and delta codes, Golomb coding, and a variable-byte integer scheme. As a conclusion, we recommend that, for fast access to integers, files be stored compressed
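As an illustration of two of the schemes named in this abstract, the following is a minimal sketch of a variable-byte code and the Elias gamma code. The byte layout (7 data bits per byte, low-order group first, high bit set on every byte except the last) is an assumption made here for the example; the paper's exact layout is not given in the excerpt, and the function names are hypothetical.

```python
from typing import List


def vbyte_encode(n: int) -> bytes:
    """Encode a non-negative integer using 7 data bits per byte.

    Assumed convention: low-order group first, high bit set on every
    byte except the last one of the integer.
    """
    if n < 0:
        raise ValueError("n must be non-negative")
    out = bytearray()
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)  # continuation byte
        n >>= 7
    out.append(n)  # final byte, high bit clear
    return bytes(out)


def vbyte_decode_stream(data: bytes) -> List[int]:
    """Decode a concatenation of variable-byte codes back into integers."""
    values = []
    current = 0
    shift = 0
    for byte in data:
        current |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7  # more bytes follow for this integer
        else:
            values.append(current)
            current, shift = 0, 0
    return values


def elias_gamma(n: int) -> str:
    """Elias gamma code of a positive integer, returned as a bit string.

    The code is (length - 1) zero bits followed by the binary form of n.
    """
    if n < 1:
        raise ValueError("Elias gamma is defined for n >= 1")
    binary = bin(n)[2:]
    return "0" * (len(binary) - 1) + binary
```

Small integers, such as the gaps between consecutive document numbers in a sorted list, occupy a single byte under the variable-byte code, which is one reason byte-aligned schemes of this kind are cheap to decode.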
Compressed text databases with efficient query algorithms based on the compressed suffix array
 Proceedings of ISAAC'00, number 1969 in LNCS
, 2000
Abstract

Cited by 68 (3 self)
A compressed text database based on the compressed suffix array is proposed. The compressed suffix array of Grossi and Vitter occupies only O(n) bits for a text of length n; however it also uses the text itself that occupies O(n log |Σ|) bits for the alphabet Σ. On the other hand, our data structure does not use the text itself, and supports important operations for text databases: inverse, search and decompress. Our algorithms can find occ occurrences of any substring P of the text
A Subquadratic Sequence Alignment Algorithm for Unrestricted Cost Matrices
, 2002
Abstract

Cited by 68 (4 self)
The classical algorithm for computing the similarity between two sequences [36, 39] uses a dynamic programming matrix, and compares two strings of size n in O(n^2) time. We address the challenge of computing the similarity of two strings in subquadratic time, for metrics which use a scoring matrix of unrestricted weights. Our algorithm applies to both local and global alignment computations. The speedup is achieved by dividing the dynamic programming matrix into variable sized blocks, as induced by Lempel-Ziv parsing of both strings, and utilizing the inherent periodic nature of both strings. This leads to an O(n^2 / log n) algorithm for an input of constant alphabet size. For most texts, the time complexity is actually O(hn^2 / log n) where h ≤ 1 is the entropy of the text. Author affiliations: Institut Gaspard-Monge, Université de Marne-la-Vallée, Cité Descartes, Champs-sur-Marne, 77454 Marne-la-Vallée Cedex 2, France, email: mac@univmlv.fr. Department of Computer Science, Haifa University, Haifa 31905, Israel, phone: (9724) 8240103, FAX: (9724) 8249331; Department of Computer and Information Science, Polytechnic University, Six MetroTech Center, Brooklyn, NY 112013840; email: landau@poly.edu; partially supported by NSF grant CCR0104307, by NATO Science Programme grant PST.CLG.977017, by the Israel Science Foundation (grants 173/98 and 282/01), by the FIRST Foundation of the Israel Academy of Science and Humanities, and by IBM Faculty Partnership Award. Department of Computer Science, Haifa University, Haifa 31905, Israel; On Education Leave from the IBM T.J.W. Research Center; email: michal@cs.haifa.il; partially supported by the Israel Science Foundation (grants 173/98 and 282/01), and by the FIRST Foundation of the Israel Academy of Science ...
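For reference, the classical quadratic baseline that this abstract improves upon can be sketched as follows. This is only the textbook dynamic program for a global alignment score, not the subquadratic algorithm of the paper; the function name, the linear gap penalty, and the way the scoring matrix is passed in (as a callable) are assumptions made for the example.

```python
from typing import Callable, Sequence


def global_alignment_score(
    a: Sequence[str],
    b: Sequence[str],
    score: Callable[[str, str], float],
    gap: float,
) -> float:
    """Classical O(len(a) * len(b)) dynamic program for global alignment.

    `score(x, y)` plays the role of the scoring matrix with arbitrary
    weights; `gap` is a (normally negative) per-symbol gap penalty.
    Only two rows of the matrix are kept in memory at a time.
    """
    # Row 0: aligning the empty prefix of `a` against prefixes of `b`.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # column 0: prefix of `a` against empty string
        for j in range(1, len(b) + 1):
            curr.append(
                max(
                    prev[j - 1] + score(a[i - 1], b[j - 1]),  # align a[i-1] with b[j-1]
                    prev[j] + gap,  # a[i-1] against a gap
                    curr[j - 1] + gap,  # b[j-1] against a gap
                )
            )
        prev = curr
    return prev[-1]
```

Each cell depends only on its three neighbours, which is what makes it possible in principle to process the matrix in blocks rather than cell by cell.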
Code and parse trees for lossless source encoding
 Communications in Information and Systems
, 2001
Abstract

Cited by 61 (1 self)
This paper surveys the theoretical literature on fixed-to-variable-length lossless source code trees, called code trees, and on variable-length-to-fixed lossless source code trees, called parse trees. Huffman coding [1] is the most well-known code tree problem, but there are a number of interesting variants of the problem formulation which lead to other combinatorial optimization problems. Huffman coding as an
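To make the code tree notion concrete, here is a minimal sketch of the standard Huffman construction, which repeatedly merges the two lightest subtrees. The function name and the dictionary-based representation are choices made for this example and are not taken from the survey.

```python
import heapq
from typing import Dict


def huffman_code(freqs: Dict[str, float]) -> Dict[str, str]:
    """Build a binary prefix code from symbol weights.

    Each heap entry is (weight, tie_breaker, partial_code_table); the
    unique tie breaker ensures the tables themselves are never compared.
    """
    if not freqs:
        return {}
    heap = [
        (weight, order, {symbol: ""})
        for order, (symbol, weight) in enumerate(freqs.items())
    ]
    heapq.heapify(heap)
    if len(heap) == 1:
        # A single symbol still needs one bit to be encodable.
        return {symbol: "0" for symbol in freqs}
    next_order = len(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        # Prefix the codes of the two merged subtrees with 0 and 1.
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w1 + w2, next_order, merged))
        next_order += 1
    return heap[0][2]
```

For weights such as `{"a": 5, "b": 2, "c": 1, "d": 1}` the most frequent symbol receives the shortest codeword, and no codeword is a prefix of another.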
A General Practical Approach to Pattern Matching over Ziv-Lempel Compressed Text
, 1998
Abstract

Cited by 59 (10 self)
We address the problem of string matching on Ziv-Lempel compressed text. The goal is to search for a pattern in a text without uncompressing it. This is a highly relevant issue to keep compressed text databases where efficient searching is still possible. We develop a general technique for string matching when the text comes as a sequence of blocks. This abstracts the essential features of Ziv-Lempel compression. We then apply the scheme to each particular type of compression. We present the first algorithm to find all the matches of a pattern in a text compressed using LZ77. When we apply our scheme to LZ78, we obtain a much more efficient search algorithm, which is faster than uncompressing the text and then searching on it. Finally, we propose a new hybrid compression scheme which is between LZ77 and LZ78, being in practice as good to compress as LZ77 and as fast to search in as LZ78. 1 Introduction. String matching is one of the most pervasive problems in computer science, with appli...
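The "sequence of blocks" view mentioned in this abstract can be illustrated with a minimal LZ78 parser, in which each block refers to an earlier block plus one new character. This is a sketch of textbook LZ78 only, not of the paper's search algorithm or its hybrid scheme; the function names and the `(reference, character)` tuple layout are assumptions made for the example.

```python
from typing import List, Tuple


def lz78_parse(text: str) -> List[Tuple[int, str]]:
    """Split `text` into LZ78 blocks of the form (reference, character).

    Reference 0 denotes the empty phrase; reference k > 0 denotes the
    k-th block produced so far. A final block may carry an empty
    character if the text ends inside an already known phrase.
    """
    dictionary = {"": 0}
    blocks = []
    phrase = ""
    for ch in text:
        if phrase + ch in dictionary:
            phrase += ch  # keep extending the longest known phrase
        else:
            blocks.append((dictionary[phrase], ch))
            dictionary[phrase + ch] = len(dictionary)
            phrase = ""
    if phrase:
        blocks.append((dictionary[phrase], ""))
    return blocks


def lz78_expand(blocks: List[Tuple[int, str]]) -> str:
    """Rebuild the original text from the blocks of `lz78_parse`."""
    phrases = [""]
    out = []
    for ref, ch in blocks:
        phrase = phrases[ref] + ch
        phrases.append(phrase)
        out.append(phrase)
    return "".join(out)
```

Because every block is an earlier block extended by one character, information computed about a block (for example, which pattern prefixes it ends with) can be derived from the block it references, which is the kind of structure a search over compressed text can exploit.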