Adding Compression to a Full-Text Retrieval System
, 1995
Cited by 90 (25 self)
We describe the implementation of a data compression scheme as an integral and transparent layer within a full-text...
An efficient indexing technique for full-text database systems
 In Proceedings of the 18th International Conference on Very Large Databases
, 1992
Cited by 84 (10 self)
Abstract: Full-text database systems require an index to allow fast access to documents based on their content. We propose an inverted file indexing scheme based on compression. This scheme allows users to retrieve documents using words occurring in the documents, sequences of adjacent words, and statistical ranking techniques. The compression methods chosen ensure that the storage requirements are small and that dynamic update is straightforward. The only assumption that we make is that sufficient main memory is available to support an in-memory vocabulary; given this assumption, the method we describe requires at most one disc access per query term to identify answers to queries.
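To make the idea of an inverted file concrete, here is a minimal sketch in Python. It is not the scheme from the paper: the function names (`build_inverted_index`, `to_gaps`, `from_gaps`) are hypothetical, and the gap transformation shown is only the usual first step before a compact integer code is applied to each posting list.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each word to the sorted list of ids of documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for word in text.lower().split():
            postings[word].add(doc_id)
    return {word: sorted(ids) for word, ids in postings.items()}


def to_gaps(doc_ids):
    """Turn a sorted id list into the first id followed by successive differences.

    Gaps are smaller than the ids themselves, so they compress better
    under a variable-length integer code.
    """
    if not doc_ids:
        return []
    return [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]


def from_gaps(gaps):
    """Invert to_gaps by taking a running sum."""
    ids, total = [], 0
    for gap in gaps:
        total += gap
        ids.append(total)
    return ids
```

With an in-memory vocabulary (the dictionary keys), answering a one-word query needs only the single posting list for that word, which is the access pattern the abstract describes.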
Compressing Integers for Fast File Access
 The Computer Journal
, 1999
Cited by 80 (14 self)
In this paper we show experimentally that, for large or small collections, storing integers in a compressed format reduces the time required for either sequential stream access or random access. We compare different approaches to compressing integers, including the Elias gamma and delta codes, Golomb coding, and a variable-byte integer scheme. As a conclusion, we recommend that, for fast access to integers, files be stored compressed.
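Two of the codes named in the abstract are short enough to sketch. The Python below is illustrative only and makes its own assumptions: the variable-byte layout (seven data bits per byte, low-order bits first, high bit set on the final byte) is one common convention and may differ from the one used in the paper, and the Elias gamma code is shown as a string of '0'/'1' characters rather than packed bits. All function names are hypothetical.

```python
def vbyte_encode(n):
    """Variable-byte code: 7 data bits per byte, high bit marks the last byte."""
    if n < 0:
        raise ValueError("n must be non-negative")
    out = bytearray()
    while n >= 128:
        out.append(n & 0x7F)  # low 7 bits, high bit clear: more bytes follow
        n >>= 7
    out.append(n | 0x80)      # high bit set: this is the final byte
    return bytes(out)


def vbyte_decode(data):
    """Decode a concatenation of variable-byte codes into a list of integers."""
    values, n, shift = [], 0, 0
    for b in data:
        if b & 0x80:
            values.append(n | ((b & 0x7F) << shift))
            n, shift = 0, 0
        else:
            n |= b << shift
            shift += 7
    return values


def elias_gamma_encode(n):
    """Elias gamma code of n >= 1: floor(log2 n) zeros, then n in binary."""
    if n < 1:
        raise ValueError("Elias gamma is defined for n >= 1")
    binary = bin(n)[2:]
    return "0" * (len(binary) - 1) + binary


def elias_gamma_decode(bits):
    """Decode a well-formed concatenation of Elias gamma codes."""
    values, i = [], 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":
            zeros += 1
            i += 1
        values.append(int(bits[i:i + zeros + 1], 2))
        i += zeros + 1
    return values
```

Variable-byte codes stay byte-aligned, which keeps decoding simple; the gamma code works at bit granularity and is shorter for very small values.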
A Subquadratic Sequence Alignment Algorithm for Unrestricted Cost Matrices
, 2002
Cited by 74 (4 self)
The classical algorithm for computing the similarity between two sequences [36, 39] uses a dynamic programming matrix, and compares two strings of size n in O(n^2) time. We address the challenge of computing the similarity of two strings in subquadratic time, for metrics which use a scoring matrix of unrestricted weights. Our algorithm applies to both local and global alignment computations. The speedup is achieved by dividing the dynamic programming matrix into variable sized blocks, as induced by Lempel-Ziv parsing of both strings, and utilizing the inherent periodic nature of both strings. This leads to an O(n^2 / log n) algorithm for an input of constant alphabet size. For most texts, the time complexity is actually O(hn^2 / log n) where h ≤ 1 is the entropy of the text.
Institut Gaspard-Monge, Université de Marne-la-Vallée, Cité Descartes, Champs-sur-Marne, 77454 Marne-la-Vallée Cedex 2, France, email: mac@univmlv.fr. Department of Computer Science, Haifa University, Haifa 31905, Israel, phone: (9724) 8240103, FAX: (9724) 8249331; Department of Computer and Information Science, Polytechnic University, Six MetroTech Center, Brooklyn, NY 11201-3840; email: landau@poly.edu; partially supported by NSF grant CCR0104307, by NATO Science Programme grant PST.CLG.977017, by the Israel Science Foundation (grants 173/98 and 282/01), by the FIRST Foundation of the Israel Academy of Science and Humanities, and by IBM Faculty Partnership Award. Department of Computer Science, Haifa University, Haifa 31905, Israel; On Education Leave from the IBM T.J.W. Research Center; email: michal@cs.haifa.il; partially supported by the Israel Science Foundation (grants 173/98 and 282/01), and by the FIRST Foundation of the Israel Academy of Science ...
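For reference, the classical quadratic baseline that the abstract starts from can be sketched as follows. This is the textbook dynamic program for global alignment, not the subquadratic block-based algorithm of the paper; the function name, the `score` callback, and the linear `gap` penalty are assumptions made for illustration.

```python
def global_alignment_score(a, b, score, gap):
    """Best global alignment score of a and b in O(len(a) * len(b)) time.

    score(x, y) gives the weight of aligning symbol x with symbol y (any
    real-valued scoring function is allowed); gap is the score added for
    each symbol aligned against a gap.
    """
    # prev[j] holds the best score for aligning the current prefix of a
    # with the first j symbols of b.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            curr.append(max(
                prev[j - 1] + score(a[i - 1], b[j - 1]),  # align a[i-1] with b[j-1]
                prev[j] + gap,                            # a[i-1] against a gap
                curr[j - 1] + gap,                        # b[j-1] against a gap
            ))
        prev = curr
    return prev[-1]
```

Every cell of the (n+1) by (n+1) matrix is computed once, which is where the O(n^2) bound comes from; only two rows are kept in memory because just the final score is returned.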
Compressed text databases with efficient query algorithms based on the compressed suffix array
 Proceedings of ISAAC'00, number 1969 in LNCS
, 2000
Cited by 69 (3 self)
A compressed text database based on the compressed suffix array is proposed. The compressed suffix array of Grossi and Vitter occupies only O(n) bits for a text of length n; however it also uses the text itself that occupies O(n log |Σ|) bits for the alphabet Σ. On the other hand, our data structure does not use the text itself, and supports important operations for text databases: inverse, search and decompress. Our algorithms can find occ occurrences of any substring P of the text
Code and parse trees for lossless source encoding
 Communications in Information and Systems
, 2001
Cited by 63 (1 self)
This paper surveys the theoretical literature on fixed-to-variable-length lossless source code trees, called code trees, and on variable-length-to-fixed lossless source code trees, called parse trees. Huffman coding [1] is the most well-known code tree problem, but there are a number of interesting variants of the problem formulation which lead to other combinatorial optimization problems. Huffman coding as an
Universal compression of memoryless sources over unknown alphabets
 IEEE Transactions on Information Theory
, 2004
Cited by 59 (22 self)
It has long been known that the compression redundancy of independent and identically distributed (i.i.d.) strings increases to infinity as the alphabet size grows. It is also apparent that any string can be described by separately conveying its symbols, and its pattern: the order in which the symbols appear. Concentrating on the latter, we show that the patterns of i.i.d. strings over all, including infinite and even unknown, alphabets, can be compressed with diminishing redundancy, both in block and sequentially, and that the compression can be performed in linear time. To establish these results, we show that the number of patterns is the Bell number, that the number of patterns with a given number of symbols is the Stirling number of the second kind, and that the redundancy of patterns can be bounded using results of Hardy and Ramanujan on the number of integer partitions. The results also imply an asymptotically optimal solution for the Good-Turing probability-estimation problem.
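The notion of a pattern, and the claim that the number of patterns of length n is the Bell number, can be checked on small cases. The sketch below is an illustration under stated assumptions, not code from the paper: `pattern`, `bell`, and `count_patterns` are hypothetical names, and the brute-force count is only feasible for very small n.

```python
from itertools import product


def pattern(s):
    """Replace each symbol by the rank of its first appearance (1-based)."""
    index = {}
    return [index.setdefault(ch, len(index) + 1) for ch in s]


def bell(n):
    """n-th Bell number, computed with the Bell triangle."""
    row = [1]
    for _ in range(n):
        nxt = [row[-1]]          # each row starts with the last entry of the previous row
        for x in row:
            nxt.append(nxt[-1] + x)
        row = nxt
    return row[0]


def count_patterns(n):
    """Count distinct patterns of length n by brute force.

    An alphabet of n symbols suffices, since a string of length n
    contains at most n distinct symbols.
    """
    return len({tuple(pattern(w)) for w in product(range(n), repeat=n)})
```

For example, `pattern("abracadabra")` yields `[1, 2, 3, 1, 4, 1, 5, 1, 2, 3, 1]`: the symbol identities are discarded and only the order of first appearances is kept.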