Compressed representations of sequences and full-text indexes (2007)
| Venue: | ACM Transactions on Algorithms |
| Citations: | 92 - 55 self |
BibTeX
@ARTICLE{Ferragina07compressedrepresentations,
author = {Paolo Ferragina and Giovanni Manzini and Veli Mäkinen and Gonzalo Navarro},
title = {Compressed representations of sequences and full-text indexes},
journal = {ACM Transactions on Algorithms},
year = {2007},
volume = {3},
pages = {2007}
}
Abstract
Given a sequence S = s_1 s_2 ... s_n of integers smaller than r = O(polylog(n)), we show how S can be represented using nH0(S) + o(n) bits, so that any s_q can be retrieved, and rank and select queries on S answered, in constant time. H0(S) is the zero-order empirical entropy of S, and nH0(S) is an information-theoretic lower bound on the number of bits needed to store S using a fixed encoding of its symbols. This extends previous results on binary sequences, and improves previous results on general sequences where those queries are answered in O(log r) time. For larger r, we can still represent S in nH0(S) + o(n log r) bits and answer queries in O(log r / log log n) time. Another contribution of this paper is to show how to combine our compressed representation of integer sequences with an existing compression boosting technique to design compressed full-text indexes that scale well with the size of the input alphabet Σ. Namely, we design a variant of the FM-index that indexes a string T[1, n] within nHk(T) + o(n) bits of storage, where Hk(T) is the k-th order empirical entropy of T. This space bound holds simultaneously for all k ≤ α log_{|Σ|} n, constant 0 < α < 1, and |Σ| = O(polylog(n)). This index counts the occurrences of an arbitrary pattern P[1, p] as a substring of T in O(p) time; it locates each pattern occurrence in O(log^{1+ε} n) time, for any constant 0 < ε < 1; and it reports a text substring of length ℓ in O(ℓ + log^{1+ε} n) time.
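
To make the queries named in the abstract concrete, the following is a minimal Python sketch of what H0, rank, and select compute, and of how an FM-index-style backward search counts pattern occurrences using only rank on the Burrows-Wheeler transform. It is not the paper's data structure: everything here is uncompressed and takes linear time per query, whereas the paper's contribution is answering the same queries in constant time within nH0(S) + o(n) bits. The function names (`h0`, `rank`, `select`, `bwt`, `count`), the 0-based positions, and the `$` sentinel are assumptions made for illustration.

```python
import math
from collections import Counter


def h0(seq):
    """Zero-order empirical entropy H0(S), in bits per symbol."""
    n = len(seq)
    return sum((c / n) * math.log2(n / c) for c in Counter(seq).values())


def rank(seq, c, i):
    """Number of occurrences of symbol c among the first i symbols, seq[0:i]."""
    return sum(1 for s in seq[:i] if s == c)


def select(seq, c, j):
    """0-based position of the j-th occurrence of c (j >= 1), or -1 if none."""
    seen = 0
    for pos, s in enumerate(seq):
        if s == c:
            seen += 1
            if seen == j:
                return pos
    return -1


def bwt(text):
    """Burrows-Wheeler transform via sorted rotations (quadratic, for clarity).

    Assumes text ends with a sentinel smaller than every other symbol.
    """
    n = len(text)
    rotations = sorted(text[i:] + text[:i] for i in range(n))
    return "".join(r[-1] for r in rotations)


def count(pattern, last_column):
    """Backward search: count occurrences of pattern using only rank queries."""
    # smaller[c] = number of symbols in the text strictly smaller than c
    freq = Counter(last_column)
    smaller, total = {}, 0
    for c in sorted(freq):
        smaller[c] = total
        total += freq[c]

    lo, hi = 0, len(last_column)  # half-open range of matching sorted rotations
    for c in reversed(pattern):
        if c not in smaller:
            return 0
        lo = smaller[c] + rank(last_column, c, lo)
        hi = smaller[c] + rank(last_column, c, hi)
        if lo >= hi:
            return 0
    return hi - lo
```

The loop in `count` issues two rank queries per pattern symbol, which is why constant-time rank on the transformed text yields the O(p) counting time stated in the abstract. For example, `count("ssi", bwt("mississippi$"))` evaluates to 2.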







