Fast and Compact Prefix Codes
Cited by 2 (2 self)
Abstract. It is well known that, given a probability distribution over n characters, in the worst case it takes Θ(n log n) bits to store a prefix code with minimum expected codeword length. However, in this paper we first show that, for any ɛ with 0 < ɛ < 1/2 and 1/ɛ = O(polylog(n)), it takes O(n log log(1/ɛ)) bits to store a prefix code with expected codeword length within an additive ɛ of the minimum. We then show that, for any constant c > 1, it takes O(n^{1/c} log n) bits to store a prefix code with expected codeword length at most c times the minimum. In both cases, our data structures allow us to encode and decode any character in O(1) time.
Small Codes
We can view any prefix code as consisting of two parts: first, the code-tree's shape (i.e., the shape of the trie containing the codewords) and, second, its leaves' labels (i.e., the permutation of the characters that sorts them by their codewords' lexicographic ranks). In this paper we briefly discuss how to compress those parts while providing efficient access. In particular, we sketch how to store a nearly optimal prefix code in nearly linear space and provide constant encode/decode time. Shape: Canonical code-trees' shapes can be encoded using little space [4]. Recently, Gagie and Nekrich [2] showed that if the longest codeword is O(w) bits, where w is the length of a machine word, then we can store any canonical code-tree's shape in O(w^2) bits such that finding a codeword given its rank, and vice versa, takes O(1) time. Permutation: Milidiú and Laber [3] showed that, for L > ⌈log n⌉ (our logarithms are in base 2), some prefix code with maximum codeword length at most L (henceforth 'L-restricted') has expected redundancy below 1/φ^{L−⌈log(n+⌈log n⌉−L)⌉−1} over Huffman's, where φ ≈ 1.618 is the golden ratio. Given any L-restricted prefix code, we can store some L-restricted prefix code with the same codeword lengths in O(n log L) bits: we can restructure the code-tree so that the leaves' depths are nondecreasing
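The restructuring mentioned above, in which leaves are ordered by nondecreasing depth, is the idea behind canonical prefix codes. The following is a minimal sketch of the standard canonical assignment, not the compact data structure from the paper; the function name `canonical_code` and the dictionary-based input are assumptions made for illustration. It assumes every length is at least 1 and that the lengths satisfy Kraft's inequality.

```python
def canonical_code(lengths):
    """Assign canonical codewords from a dict mapping symbol -> codeword length.

    Hypothetical helper for illustration. Assumes each length is >= 1 and
    that the lengths satisfy Kraft's inequality (sum of 2**-l <= 1).
    """
    # Sort by (length, symbol) so that leaf depths are nondecreasing
    # from left to right in the code-tree.
    order = sorted(lengths, key=lambda s: (lengths[s], s))
    codes = {}
    code = 0
    prev_len = lengths[order[0]]
    for sym in order:
        length = lengths[sym]
        # Moving to a deeper level appends zeros to the running counter.
        code <<= (length - prev_len)
        codes[sym] = format(code, "0{}b".format(length))
        code += 1
        prev_len = length
    return codes


example = canonical_code({"a": 1, "b": 2, "c": 3, "d": 3})
```

Because the codewords depend only on the sorted lengths, storing the length of each character (at most L, hence about log L bits each) is enough to rebuild the whole code, which is consistent with the O(n log L) figure quoted above.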
Efficient Fully-Compressed Sequence Representations
We present a data structure that stores a sequence s[1..n] over alphabet [1..σ] in nH0(s) + o(n)(H0(s)+1) bits, where H0(s) is the zero-order entropy of s. This structure supports the queries access, rank and select, which are fundamental building blocks for many other compressed data structures, in worst-case time O(lg lg σ) and average time O(lg H0(s)). The worst-case complexity matches the best previous results, yet these had been achieved with data structures using nH0(s) + o(n lg σ) bits. On highly compressible sequences the o(n lg σ) bits of the redundancy may be significant compared to the nH0(s) bits that encode the data. Our representation, instead, compresses the redundancy as well. Moreover, our average-case complexity is unprecedented. Our technique is based on partitioning the alphabet into characters of similar frequency. The subsequence corresponding to each group can then be encoded using fast uncompressed representations without harming the overall compression ratios, even in the redundancy. The result also improves upon the best current compressed representations of several other data structures. For example, we achieve (i) compressed redundancy, retaining the best time complexities, for the smallest existing full-text self-indexes; (ii) compressed permutations π with times for π() and π^{-1}() improved to loglogarithmic; and (iii) the first compressed representation of dynamic collections of disjoint sets. We also point out various applications to inverted indexes, suffix arrays, binary relations, and data compressors. Our structure is practical on large alphabets. Our experiments show that, as predicted by theory, it dominates the space/time tradeoff map of all the sequence representations, both in synthetic and application scenarios.
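For readers unfamiliar with the three queries and the entropy measure named above, here is a naive reference sketch of their semantics. It uses linear scans and is not the compressed structure described in the abstract; the function names and the 1-based indexing (chosen to match s[1..n]) are assumptions for illustration only.

```python
import math
from collections import Counter


def access(s, i):
    """Return s[i], using 1-based positions as in s[1..n]."""
    return s[i - 1]


def rank(s, c, i):
    """Return the number of occurrences of c in s[1..i]."""
    return s[:i].count(c)


def select(s, c, j):
    """Return the 1-based position of the j-th occurrence of c, or None."""
    seen = 0
    for pos, x in enumerate(s, start=1):
        if x == c:
            seen += 1
            if seen == j:
                return pos
    return None


def zero_order_entropy(s):
    """Return H0(s) = sum over characters c of (n_c / n) * lg(n / n_c)."""
    n = len(s)
    return sum((k / n) * math.log2(n / k) for k in Counter(s).values())


s = "abracadabra"
```

With this sequence, `rank(s, "a", 5)` counts the occurrences of "a" in "abrac", and `select(s, "a", 3)` finds the third "a". The compressed structure answers the same queries without scanning, within the time bounds stated in the abstract.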

