## Parameterised Compression for Sparse Bitmaps (1992)

Venue: Proc. ACM-SIGIR International Conference on Research and Development in Information Retrieval

Citations: 29 (8 self)

### BibTeX

@INPROCEEDINGS{Moffat92parameterisedcompression,

author = {Alistair Moffat and Justin Zobel},

title = {Parameterised Compression for Sparse Bitmaps},

booktitle = {Proc. ACM-SIGIR International Conference on Research and Development in Information Retrieval},

year = {1992},

pages = {274--285},

publisher = {ACM Press}

}

### Abstract

Full-text retrieval systems typically use either a bitmap or an inverted file to identify which documents contain which words, so that the documents containing any combination of words can be quickly located. Bitmaps of word occurrences are large, but are usually sparse, and thus are amenable to a variety of compression techniques. Here we consider techniques in which the encoding of each bitvector within the bitmap is parameterised, so that a different code can be used for each bitvector. Our experimental results show that the new methods yield better compression than previous techniques.

Categories and Subject Descriptors: E.4 [Coding and Information Theory]: Data compaction and compression; H.3.2 [Information Storage]: File organisation.

Keywords: Full-text retrieval, data compression, document database, Huffman coding, geometric distribution, inverted file.

1 Introduction

Full-text retrieval systems are used for storing and accessing document collections such as newspaper a...

### Citations

3122 |
Introduction to modern Information Retrieval
- Salton, McGill
Citation Context: ... useful to know not only whether or not the word appears, but also how many times it appears. For example, documents can be ranked more effectively, via similarity measures such as the cosine measure [16], if term frequency information is known: including term frequency information results in a 10%-20% improvement in ranking effectiveness [6]. To allow such ranking, the bitmap should store a frequency count for each word-document combination instead of a bit, and, strictly speaking, will not be a bitmap at all. We call this structure a fr... |

942 |
A method for the construction of minimum redundancy codes
- Huffman
- 1952
Citation Context: ...more parameters. In the limit, for asymptotically large document collections, the runlengths within each bitvector are best modelled by using the exact gap frequencies to drive either a Huffman coder [11] or an arithmetic coder [22]. However this does not result in the most economical representation of the bitmap for practical sized collections, since the parameters that control the coding must also b... |
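
The excerpt above speaks of the runlengths (gaps) within a bitvector and of their exact frequencies. As a minimal sketch of what those quantities are, the Python below turns a bitvector into its list of gaps and counts how often each gap occurs. The function name `gaps` and the convention that a gap counts the positions up to and including the next 1-bit (so every gap is a positive integer) are assumptions made for illustration; they are not taken from the paper.

```python
from collections import Counter

def gaps(bitvector):
    """Return the distances between successive 1-bits.

    Assumed convention: the first gap is measured from just before
    position 0, so every gap is a positive integer.
    """
    out = []
    prev = -1
    for i, bit in enumerate(bitvector):
        if bit:
            out.append(i - prev)
            prev = i
    return out

def gap_frequencies(bitvector):
    """Count how often each gap length occurs (what a Huffman or
    arithmetic coder would need as its model)."""
    return Counter(gaps(bitvector))

if __name__ == "__main__":
    bv = [0, 1, 0, 0, 1, 1, 0, 0, 0, 1]
    print(gaps(bv))
    print(gap_frequencies(bv))
```

A model driven by `gap_frequencies` needs one parameter per distinct gap, which illustrates the excerpt's point that the stored parameters can themselves become costly for collections of practical size.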

664 |
Arithmetic coding for data compression
- Witten, Neal, et al.
- 1987
Citation Context: ...t, for asymptotically large document collections, the runlengths within each bitvector are best modelled by using the exact gap frequencies to drive either a Huffman coder [11] or an arithmetic coder [22]. However this does not result in the most economical representation of the bitmap for practical sized collections, since the parameters that control the coding must also be stored, and can constitute... |

348 |
Universal codeword sets and representations of the integers
- Elias
- 1975
Citation Context: ...ome parameterless models. Surprisingly, these can also give rise to significant compression. 4 Parameterless Models To store the list of runlengths we need an encoding of the positive integers. Elias [7] described a number of such encodings. Each of the encodings has the property that small integers are allocated short codes, while larger integers are allocated longer codes. For example, his γ code ... |

228 |
Run-length encodings
- Golomb
- 1966
Citation Context: ...n the encoding of the bitvector for word w is $p_w$, the number of 1-bits present. The average inter-word gap is given by $N/p_w$; setting $b = 0.69\,N/p_w$ and taking $V_G = (b, b, b, \ldots)$ gives an encoding [9, 10, 14] in which each bitvector is represented in at most $p_w \cdot (\log (N/p_w) + 2)$ bits, which, on a per pointer basis, is just a small additive constant from the minimum number of bits possible if it is a... |
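
The excerpt above selects a single parameter b per bitvector from N and the number of 1-bits. The following Python is a sketch of a standard Golomb code under that parameter choice, not the authors' implementation. The helper names `choose_b` and `golomb_encode`, the rounding of 0.69 N/p_w to the nearest integer (with a floor of 1), and the use of a truncated binary remainder are all assumptions made for illustration.

```python
def choose_b(n_docs, p_w):
    """Pick the Golomb parameter as roughly 0.69 * N / p_w.

    Rounding to the nearest integer, and never going below 1, is an
    assumption; the excerpt does not say how b is made integral.
    """
    return max(1, round(0.69 * n_docs / p_w))

def golomb_encode(x, b):
    """Golomb code of a positive integer x with parameter b >= 1.

    Output is a string of '0'/'1' characters: a unary quotient
    (zeros terminated by a 1) followed by a truncated-binary remainder.
    """
    if x < 1 or b < 1:
        raise ValueError("x and b must be positive integers")
    q, r = divmod(x - 1, b)
    prefix = "0" * q + "1"
    k = (b - 1).bit_length()          # ceil(log2(b))
    if k == 0:                        # b == 1: pure unary, no remainder bits
        return prefix
    u = (1 << k) - b                  # how many remainders get k-1 bits
    if r < u:
        return prefix + format(r, "b").zfill(k - 1)
    return prefix + format(r + u, "b").zfill(k)

def encoded_bits(gap_list, b):
    """Total length in bits of a gap list under parameter b."""
    return sum(len(golomb_encode(g, b)) for g in gap_list)

if __name__ == "__main__":
    b = choose_b(1000, 10)
    print(b, golomb_encode(70, b), encoded_bits([70, 130, 5, 200], b))
```

Because only the count of 1-bits has to be stored alongside each bitvector, the per-vector overhead of this parameterisation is a single small integer.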

88 | Optimal source codes for geometrically distributed integer alphabets - Gallager, Van Voorhis - 1975 |

81 | Adding compression to a full-text retrieval system. Softw
- Zobel, Moffat
- 1995
Citation Context: ...e same example bitmap in about 8 Mbyte, corresponding to less than 5 bits per word pointer. We have previously described a compression regime for use with the main text of a fulltext retrieval system [15]. In conjunction with the techniques presented here for inverted file storage, a complete compressed representation of this document collection requires just 43 Mbyte, less than one third of the space... |

43 |
Development of a spelling list
- McIlroy
- 1982
Citation Context: ...n the encoding of the bitvector for word w is $p_w$, the number of 1-bits present. The average inter-word gap is given by $N/p_w$; setting $b = 0.69\,N/p_w$ and taking $V_G = (b, b, b, \ldots)$ gives an encoding [9, 10, 14] in which each bitvector is represented in at most $p_w \cdot (\log (N/p_w) + 2)$ bits, which, on a per pointer basis, is just a small additive constant from the minimum number of bits possible if it is a... |

40 |
A compression method for clustered bit-vectors
- Teuhola
- 1978
Citation Context: ...t of the words will be relatively frequent over small sections of the collection, and relatively infrequent in the remainder. No advantage of this is taken by the simple code described above. Teuhola [19] described a similar encoding that exploits this skewness in the distribution of run lengths. His 'Exp-Gol' code, a generalisation of γ, corresponds to the coding vector $V_T$ = (b, 2b, 4b, ..., 2^i... |
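
The excerpts on this page describe codes in terms of a coding vector, but the definition itself is cut off. The Python below is a sketch under an assumed reading of that paradigm: the vector lists bucket sizes, an integer is coded as the unary index of the bucket it falls in, followed by its offset within that bucket in truncated binary. All names (`vector_encode`, `v_gamma`, `v_golomb`, `v_teuhola`) are hypothetical, and taking γ to correspond to the vector (1, 2, 4, 8, ...) is likewise an assumption, though it does reproduce the γ codewords listed under Table 2 on this page.

```python
from itertools import count

def truncated_binary(d, v):
    """Code offset d (0 <= d < v) using ceil(log2 v) bits or one fewer."""
    k = (v - 1).bit_length()
    if k == 0:                 # bucket of size 1 needs no offset bits
        return ""
    u = (1 << k) - v           # offsets below u get the shorter code
    if d < u:
        return format(d, "b").zfill(k - 1)
    return format(d + u, "b").zfill(k)

def vector_encode(x, bucket_sizes):
    """Code positive integer x relative to an iterable of bucket sizes."""
    if x < 1:
        raise ValueError("x must be a positive integer")
    base = 0
    prefix = ""
    for v in bucket_sizes:
        if x <= base + v:
            return prefix + "1" + truncated_binary(x - base - 1, v)
        base += v
        prefix += "0"
    raise ValueError("coding vector does not cover x")

def v_gamma():
    return (1 << i for i in count())      # (1, 2, 4, 8, ...)

def v_golomb(b):
    return (b for _ in count())           # (b, b, b, ...)

def v_teuhola(b):
    return (b << i for i in count())      # (b, 2b, 4b, ...)

if __name__ == "__main__":
    for x in (1, 3, 20, 100):
        print(x, vector_encode(x, v_golomb(4)), vector_encode(x, v_teuhola(4)))
```

Under this reading, the doubling buckets of the Teuhola-style vector keep the unary prefix short for the occasional very long gap, while the first bucket of size b still gives short codes to the small gaps found in clustered regions.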

27 |
Implementing Ranking Strategies Using Text Signatures, ACM Transactions on Office Information Systems
- Croft, Savino
- 1988
Citation Context: ...tively, via similarity measures such as the cosine measure [16], if term frequency information is known: including term frequency information results in a 10%-20% improvement in ranking effectiveness [6]. To allow such ranking, the bitmap should store a frequency count for each word-document combination instead of a bit, and, strictly speaking, will not be a bitmap at all. We call this structure a fr... |

25 | Compression of correlated bit-vectors
- Bookstein, Klein
- 1991
Citation Context: ...ere is a small, but definite, improvement in the compression performance. 8 Comparison with Other Methods There has been a great deal of previous work on bitmap compression, and many methods proposed [1, 2, 3, 4, 5, 8, 12, 17, 19, 21]. It is interesting to compare the compression possible with the parameterised models with those previous methods. Compression % Manuals GNUbib Comact Huffman (global) 49.8 32.3 25.9 Global model 0.5 ... |

24 |
Some applications of inverted indexes on the UNIX system. Computing Science technical report 69
- Lesk
- 1978
Citation Context: ...PARCstation), including embedded formatting commands. Collection GNUbib [18] stores 64,000 citations to journal articles, technical reports, conference papers, and books, all stored in 'refer' format [13, 20]. Each citation was taken to be a 'document' for indexing purposes. The third document collection is the one we have already mentioned; it is a collection of 261,829 pages of legal text storing the co... |

15 |
Improved hierarchical bit-vector compression in document retrieval systems
- Choueka, Fraenkel, et al.
- 1986
Citation Context: ...ere is a small, but definite, improvement in the compression performance. 8 Comparison with Other Methods There has been a great deal of previous work on bitmap compression, and many methods proposed [1, 2, 3, 4, 5, 8, 12, 17, 19, 21]. It is interesting to compare the compression possible with the parameterised models with those previous methods. Compression % Manuals GNUbib Comact Huffman (global) 49.8 32.3 25.9 Global model 0.5 ... |

10 |
Compression of concordances in full-text retrieval systems
- Choueka, Fraenkel, et al.
- 1988
Citation Context: ...ere is a small, but definite, improvement in the compression performance. 8 Comparison with Other Methods There has been a great deal of previous work on bitmap compression, and many methods proposed [1, 2, 3, 4, 5, 8, 12, 17, 19, 21]. It is interesting to compare the compression possible with the parameterised models with those previous methods. Compression % Manuals GNUbib Comact Huffman (global) 49.8 32.3 25.9 Global model 0.5 ... |

8 |
Novel compression of sparse bit-strings: preliminary report
- Fraenkel, Klein
- 1984
Citation Context: ...d decoding, and both require just one pass over the input bitmap to generate a compressed representation. The γ encoding can also be thought of as being one example of a more general coding paradigm [8], described as follows.

Table 2: Examples of codes

| x | γ | δ |
|---|---|---|
| 1 | 1, | 1, |
| 2 | 01,0 | 010,0 |
| 3 | 01,1 | 010,1 |
| 4 | 001,00 | 011,00 |
| 5 | 001,01 | 011,01 |
| 6 | 001,10 | 011,10 |
| 7 | 001,11 | 011,11 |
| 8 | 0001,000 | 00100,000 |

Let V be a (possibly ... |
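
The codewords listed under Table 2 in the excerpt above follow a regular pattern, which the short Python sketch below reproduces. It assumes the usual construction of the two Elias codes: γ writes the bit length of x in unary (zeros terminated by a 1) and then the bits of x after its leading 1, while δ writes that bit length in γ instead of unary. The function names and the optional `sep` argument, which only reinserts the comma used in the table for readability, are illustrative choices.

```python
def gamma(x, sep=""):
    """Elias gamma code of a positive integer, as a '0'/'1' string."""
    if x < 1:
        raise ValueError("x must be a positive integer")
    bits = format(x, "b")                 # binary of x, leading bit is 1
    return "0" * (len(bits) - 1) + "1" + sep + bits[1:]

def delta(x, sep=""):
    """Elias delta code: gamma-coded bit length, then the trailing bits."""
    if x < 1:
        raise ValueError("x must be a positive integer")
    bits = format(x, "b")
    return gamma(len(bits)) + sep + bits[1:]

if __name__ == "__main__":
    for x in range(1, 9):
        print(x, gamma(x, sep=","), delta(x, sep=","))
```

Neither code takes a parameter, which is why the excerpts group them under parameterless models: nothing beyond the codewords themselves has to be stored.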

7 |
Huffman coding in bit-vector compression
- Jakobsson
- 1978 |

7 |
Models for compression in full-text retrieval systems
- Witten, Bell, et al.
- 1991
Citation Context: ...lly) 77.6 78.2 65.8; $V_G$ (locally) 48.8 41.1 33.8; γ for $p_w$ values 2.3 0.7 0.2; Total 51.1 41.8 33.9 (Table 4: Coding using $V_G$) equivalent compression scheme using a Bernoulli model and arithmetic coding [21], where the frequency of each word was also used to control the arithmetic coding of the text of the retrieval system. Of course, in practice the 1-bits are not uniformly and randomly distributed. The... |

6 |
Refer - A bibliography system
- Tuthill
- 1984
Citation Context: ...PARCstation), including embedded formatting commands. Collection GNUbib [18] stores 64,000 citations to journal articles, technical reports, conference papers, and books, all stored in 'refer' format [13, 20]. Each citation was taken to be a 'document' for indexing purposes. The third document collection is the one we have already mentioned; it is a collection of 261,829 pages of legal text storing the co... |

5 |
Compression of Large Inverted Files with Hyperbolic Term Distribution
- Schuegraf
- 1976 |

4 |
Generative models for bitmap sets with compression applications
- Bookstein, Klein
- 1991 |

4 |
The Melbourne University bibliography system
- Somogyi
- 1990
Citation Context: ...ows the sizes of these collections. The collection Manuals was a collection of Unix manual pages (/usr/man/man[1-8]/* on a Sun SPARCstation), including embedded formatting commands. Collection GNUbib [18] stores 64,000 citations to journal articles, technical reports, conference papers, and books, all stored in 'refer' format [13, 20]. Each citation was taken to be a 'document' for indexing purposes. ... |

3 |
Flexible compression for bitmap sets
- Bookstein, Klein
- 1991 |