## Compressing Integers for Fast File Access (1999)

Venue: The Computer Journal

Citations: 60 (14 self)

### BibTeX

@ARTICLE{Williams99compressingintegers,

author = {Hugh E. Williams and Justin Zobel},

title = {Compressing Integers for Fast File Access},

journal = {The Computer Journal},

year = {1999},

volume = {42},

pages = {193--201}

}

### Abstract

In this paper we show experimentally that, for large or small collections, storing integers in a compressed format reduces the time required for either sequential stream access or random access. We compare different approaches to compressing integers, including the Elias gamma and delta codes, Golomb coding, and a variable-byte integer scheme. As a conclusion, we recommend that, for fast access to integers, files be stored compressed.
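
As an illustration of two of the schemes named in the abstract, the following is a minimal Python sketch, not the authors' implementation. All function names are hypothetical. The gamma code uses the common convention of a run of zeros followed by the binary value, and the variable-byte layout (seven data bits per byte, least-significant group first, high bit marking the final byte) is an assumed variant that may differ from the one evaluated in the paper.

```python
def elias_gamma_encode(n: int) -> str:
    """Return the Elias gamma code of n >= 1 as a string of '0'/'1' characters."""
    if n < 1:
        raise ValueError("Elias gamma codes are defined for integers >= 1")
    binary = bin(n)[2:]  # binary form of n, always starts with '1'
    # Unary length prefix (one zero per extra bit), then the binary form itself.
    return "0" * (len(binary) - 1) + binary


def elias_gamma_decode(bits: str) -> list[int]:
    """Decode a concatenation of gamma codes; no separators are needed."""
    values, i = [], 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":  # count the unary length prefix
            zeros += 1
            i += 1
        values.append(int(bits[i:i + zeros + 1], 2))
        i += zeros + 1
    return values


def vbyte_encode(n: int) -> bytes:
    """Assumed layout: 7 data bits per byte, high bit set on the final byte."""
    if n < 0:
        raise ValueError("only non-negative integers are supported")
    out = bytearray()
    while n >= 128:
        out.append(n & 0x7F)  # low 7 bits, high bit clear: more bytes follow
        n >>= 7
    out.append(n | 0x80)      # final byte: high bit set
    return bytes(out)


def vbyte_decode(data: bytes) -> list[int]:
    """Decode a concatenation of variable-byte codes produced by vbyte_encode."""
    values, current, shift = [], 0, 0
    for byte in data:
        current |= (byte & 0x7F) << shift
        if byte & 0x80:       # final byte of this integer
            values.append(current)
            current, shift = 0, 0
        else:
            shift += 7
    return values
```

In this sketch the gamma code spends a single bit on the value 1 and grows logarithmically, whereas the variable-byte code always spends at least one whole byte but is decoded with byte-aligned operations only.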

### Citations

866 | Managing Gigabytes: Compressing and Indexing Documents and Images (Second Edition)
- Witten, Moffat, et al.
Citation Context ...ot effective. Moreover, the requirement of atomic decompression precludes the application of vertical compression techniques (such as the READ compression commonly used in images [7]) that take advantage of differences between adjacent records. For text, for example, Huffman coding with a semi-static model is the method of choice because it is fast and allows order-independent dec... |

652 |
Text Compression
- Bell, Cleary, et al.
- 1990
Citation Context ...ommon symbols and long codes to rare symbols, optimising code length overall. Adaptive schemes (where the model evolves as the data is processed) are currently favoured for general-purpose compression [5, 6], and are the basis of utilities such as compress. However, because databases are divided into small records that must be independently decompressible [1], adaptive techniques are generally not effect... |

380 |
Prediction and Entropy of Printed English
- Shannon
- 1951
Citation Context ...ts and space requirements. Space efficiency for a given data set can be measured by comparison to the information content of data, as represented by the entropy determined by Shannon's coding theorem [9]. Entropy is the ideal compression that is achievable for a given model. For a set $S$ of symbols in which each symbol $t$ has probability of occurrence $p_t$, the entropy is $E(S) = \sum_{t \in S} (-p_t \cdot \log_2 p_t)$ bits ... |

363 |
Universal codeword sets and representations of the integers
- Elias
- 1975
Citation Context ... the model is not adaptive, allow order-independent record-based decompression. Our selection of techniques results in two possible classes: first, we consider the bit-wise Elias gamma and delta codes [2], and parameterised Golomb codes [3]; second, we evaluate byte-wise storage using standard four-byte integers and a variable-byte scheme. We have found in our experiments that coding using a tailored ... |

248 |
Run-length encodings
- Golomb
- 1966
Citation Context ...er-independent record-based decompression. Our selection of techniques results in two possible classes: first, we consider the bit-wise Elias gamma and delta codes [2], and parameterised Golomb codes [3]; second, we evaluate byte-wise storage using standard four-byte integers and a variable-byte scheme. We have found in our experiments that coding using a tailored integer compression scheme can allow... |

147 | Arithmetic coding revisited
- Moffat, Neal, et al.
- 1998
Citation Context ...ommon symbols and long codes to rare symbols, optimising code length overall. Adaptive schemes (where the model evolves as the data is processed) are currently favoured for general-purpose compression [5, 6], and are the basis of utilities such as compress. However, because databases are divided into small records that must be independently decompressible [1], adaptive techniques are generally not effect... |

112 |
Universal modeling and coding
- Rissanen, Langdon
- 1981
Citation Context ...ntegers. Our conclusion is that, for fast access to files containing integers, they should be stored in a compressed format. 2. BACKGROUND Compression consists of two activities, modelling and coding [4]. A model for data to be compressed is a representation of the distinct symbols in the data and includes information such as frequency about each symbol. Coding is the process of producing a compresse... |

83 | Adding compression to a full-text retrieval system
- Zobel, Moffat
- 1995
Citation Context ...ompression schemes can allow retrieval of stored text to be faster than when uncompressed, since the computational cost of decompression can be offset by reductions in disk seeking and transfer costs [1]. In this paper we explore whether similar gains are available for numeric data. We have implemented several integer coding schemes and evaluated them on collections derived from large indexes and sci... |

19 |
A systematic approach to compressing a full text retrieval system
- Bookstein, Klein, et al.
- 1992
Citation Context ...a total of 156.4 Mb. A prime number collection of the first one million prime numbers (prime) is 3.8 Mb. Integer compression has been shown previously to offer space-efficient representation of indexes [8, 13, 14]. Inverted file indexes contain large sorted arrays of integers or postings representing occurrences of terms in a collection. Each postings array contains a sorted list of document identifiers and, i... |

17 |
Indexing Nucleotide Databases for Fast Query Evaluation
- Williams, Zobel
- 1996
Citation Context ... 14 bits. Variable-byte codes for selected integers in the range 1–30 are shown in Table 1. A typical application is the coding of index term and inverted list file offset pairs for an inverted index [10]. When storing large arrays of integers, variable-byte integers are generally not as space efficient as variable-bit schemes. However, when storing only a few integers, byte-aligned variable-byte schem... |

14 |
Exploiting clustering in inverted file compression
- Moffat, Stuiver
- 1996
Citation Context ...pplications by using semi-static models for different probability distributions, such as in inverted index posting lists where integers are sorted in increasing order and integers are often clustered [13]. 3. TEST DATA To evaluate the performance of integer coding techniques for fast file access, we use collections derived from scientific data and inverted indexes. In this selection we have focused on... |

6 | Compression of nucleotide databases for fast searching
- Williams, Zobel
- 1997
Citation Context ... section semi-static parameterised methods. Elias coding [2] is a non-parameterised method of coding integers that is, for example, used in large text database indexes [8] and specialist applications [10, 11]. Elias coding, like the other schemes described in this paper, allows unambiguous coding of integers and does not require separators between each integer of a stored array. There are two distinct Eli... |

6 |
Improved inverted file processing for large text databases
- Moffat, Zobel, et al.
- 1995
Citation Context ...he occurrences of different values have a geometric distribution. Skewed Bernoulli models, where a simple mean difference is not used, typically result in better compression than simple global models [12]. We have experimented with global Bernoulli models, but not with skewed models because of the increased complexity in calculating and storing appropriate k values. However, we wo... |