Compressed full-text indexes
- ACM Computing Surveys, 2007
Cited by 142 (70 self)
Full-text indexes provide fast substring search over large text collections. A serious problem of these indexes has traditionally been their space consumption. A recent trend is to develop indexes that exploit the compressibility of the text, so that their size is a function of the compressed text length. This concept has evolved into self-indexes, which in addition contain enough information to reproduce any text portion, so they replace the text. The exciting possibility of an index that takes space close to that of the compressed text, replaces it, and in addition provides fast search over it has triggered a wealth of activity, produced surprising results in a very short time, and radically changed the status of this area in less than five years. The most successful indexes nowadays are able to obtain almost optimal space and search time simultaneously. In this paper we present the main concepts underlying self-indexes. We explain the relationship between text entropy and regularities that show up in index structures and permit compressing them. Then we cover the most relevant self-indexes to date, focusing on the essential aspects of how they exploit the text compressibility and how they efficiently solve various search problems. We aim at giving the theoretical background to understand and follow the developments in this area.
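The entropy measures that such space bounds refer to can be computed directly on small inputs. The following is a minimal sketch of zeroth-order and k-th order empirical entropy, assuming the common definition in which each symbol is charged according to the k symbols that precede it; the function names `h0` and `hk` are illustrative and not taken from the survey.

```python
from collections import Counter, defaultdict
from math import log2


def h0(s):
    """Zeroth-order empirical entropy of s, in bits per symbol."""
    n = len(s)
    if n == 0:
        return 0.0
    return sum(c / n * log2(n / c) for c in Counter(s).values())


def hk(s, k):
    """k-th order empirical entropy of s, in bits per symbol.

    Symbols are grouped by the k symbols preceding them; each group is
    charged its own zeroth-order entropy, and the total is divided by len(s).
    """
    if k == 0:
        return h0(s)
    followers = defaultdict(list)
    for i in range(k, len(s)):
        followers[s[i - k:i]].append(s[i])
    return sum(len(f) * h0(f) for f in followers.values()) / len(s)


if __name__ == "__main__":
    text = "abababab"
    print(h0(text))      # 1.0: two equally frequent symbols
    print(hk(text, 1))   # 0.0: each symbol is determined by its predecessor
```

A highly regular text such as `abababab` has high zeroth-order entropy but zero first-order entropy, which is the kind of regularity a compressed index can exploit.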
Implicit compression boosting with applications to self-indexing
- In Proc. SPIRE'07, LNCS 4726, 2007
Cited by 23 (14 self)
Compression boosting (Ferragina & Manzini, SODA 2004) is a new technique to enhance zeroth order entropy compressors' performance to k-th order entropy. It works by constructing the Burrows-Wheeler transform of the input text, finding optimal partitioning of the transform, and then compressing each piece using an arbitrary zeroth order compressor. The optimal partitioning has the property that the achieved compression is boosted to k-th order entropy, for any k. The technique has an application to text indexing: Essentially, building a wavelet tree (Grossi et al., SODA 2003) for each piece in the partitioning yields a k-th order compressed full-text self-index providing efficient substring searches on the indexed text (Ferragina et al., SPIRE 2004). In this paper, we show that using explicit compression boosting with wavelet trees is not necessary; our new analysis reveals that the size of the wavelet tree built for the complete Burrows-Wheeler transformed text is, in essence, the sum of those built for the pieces in the optimal partitioning. Hence, the technique provides a way to do compression boosting implicitly, with a trivial linear time algorithm, but fixed to a specific zeroth order compressor (Raman et al., SODA 2002). In addition to having these consequences on compression and static full-text self-indexes, the analysis shows that a recent dynamic zeroth order compressed self-index (Mäkinen & Navarro, CPM 2006) occupies in fact space proportional to k-th order entropy.
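To make the pipeline in this abstract concrete, here is a rough sketch of a single wavelet tree built over the whole Burrows-Wheeler transform and used for substring counting by backward search. It rests on simplifying assumptions: the transform is computed by sorting rotations, and the bitmaps are stored as plain prefix-count lists rather than in the compressed representation (Raman et al.) on which the paper's space analysis depends. The names `bwt`, `WaveletTree`, and `count_occurrences` are illustrative.

```python
def bwt(text, sentinel="\0"):
    """Burrows-Wheeler transform via sorted rotations (illustration only)."""
    t = text + sentinel  # the sentinel must not occur in text
    rotations = sorted(t[i:] + t[:i] for i in range(len(t)))
    return "".join(r[-1] for r in rotations)


class WaveletTree:
    """Pointer-based wavelet tree with uncompressed bitmaps."""

    def __init__(self, seq, alphabet=None):
        self.alphabet = sorted(set(seq)) if alphabet is None else alphabet
        self.left = self.right = None
        if len(self.alphabet) <= 1:
            return  # leaf: all symbols here are equal
        mid = len(self.alphabet) // 2
        self.left_syms = set(self.alphabet[:mid])
        # ones[i] = how many of the first i symbols go to the right child
        self.ones = [0]
        for c in seq:
            self.ones.append(self.ones[-1] + (c not in self.left_syms))
        self.left = WaveletTree(
            [c for c in seq if c in self.left_syms], self.alphabet[:mid])
        self.right = WaveletTree(
            [c for c in seq if c not in self.left_syms], self.alphabet[mid:])

    def rank(self, c, i):
        """Number of occurrences of c among the first i symbols."""
        node = self
        while node.left is not None:
            if c in node.left_syms:
                i -= node.ones[i]
                node = node.left
            else:
                i = node.ones[i]
                node = node.right
        return i if node.alphabet == [c] else 0


def count_occurrences(pattern, tree, n):
    """Backward search over a BWT of length n represented by tree."""
    # first_row[c] = number of BWT symbols strictly smaller than c
    first_row, total = {}, 0
    for c in tree.alphabet:
        first_row[c] = total
        total += tree.rank(c, n)
    lo, hi = 0, n  # half-open range of sorted rotations
    for c in reversed(pattern):
        if c not in first_row:
            return 0
        lo = first_row[c] + tree.rank(c, lo)
        hi = first_row[c] + tree.rank(c, hi)
        if lo >= hi:
            return 0
    return hi - lo


if __name__ == "__main__":
    transformed = bwt("mississippi")
    tree = WaveletTree(transformed)
    print(count_occurrences("ssi", tree, len(transformed)))  # 2
```

The tree is built once over the complete transform, with no explicit partitioning step; whether its size approaches k-th order entropy depends on how the bitmaps are encoded, which this sketch does not attempt.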
Stronger Lempel-Ziv Based Compressed Text Indexing
- 2008
Cited by 4 (3 self)
Given a text T[1..u] over an alphabet of size σ, the full-text search problem consists in finding the occ occurrences of a given pattern P[1..m] in T. In indexed text searching we build an index on T to improve the search time, at the cost of increasing the space requirement. The current trend in indexed text searching is that of compressed full-text self-indices, which replace the text with a more space-efficient representation of it, at the same time providing indexed access to the text. Thus, we can provide efficient access within compressed space. The LZ-index of Navarro is a compressed full-text self-index able to represent T using 4uH_k(T) + o(u log σ) bits of space, where H_k(T) denotes the k-th order empirical entropy of T, for any k = o(log_σ u). This space is about four times the compressed text size. It can locate all the occ occurrences of a pattern P in T in O(m^3 log σ + (m + occ) log u) worst-case time. Although this index has been shown to be very competitive in practice, the O(m^3 log σ) term can be excessive for long patterns. Also, the factor 4 in its space complexity makes it larger than other state-of-the-art alternatives. In this paper we present stronger Lempel-Ziv based indices, improving the overall performance of the LZ-index. We achieve indices requiring (2+ε)uH_k(T) + o(u log σ) bits of space, for any constant ε > 0, which makes our indices the smallest existing LZ-indices. We simultaneously improve the search time to
Alphabet-independent compressed text indexing
- In ESA, 2011
Cited by 3 (3 self)
Self-indexes can represent a text in asymptotically optimal space under the k-th order entropy model, give access to text substrings, and support indexed pattern searches. Their time complexities are not optimal, however: they always depend on the alphabet size. In this paper we achieve, for the first time, full alphabet-independence in the time complexities of self-indexes, while retaining space optimality. We also obtain some relevant byproducts on compressed suffix trees.
G.: Space-conscious compression
- 2007
Cited by 3 (2 self)
Compression is most important when space is in short supply, so compression algorithms are often implemented in limited memory. Most analyses ignore memory constraints as an implementation detail, however, creating a gap between theory and practice. In this paper we consider the effect of memory limitations on compression algorithms. In the first part we assume the memory available is fixed and prove nearly tight upper and lower bounds on how much memory is needed to compress a string close to its k-th order entropy. In the second part we assume the memory available grows (slowly) as more and more characters are read. In this setting we show that the rate of growth of the available memory determines the speed at which the compression ratio approaches the entropy. In particular, we establish a relationship between the rate of growth of the sliding window in the LZ77 algorithm and its convergence rate.
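The role of the sliding window can be seen in a small experiment. Below is a brute-force sketch of greedy LZ77 parsing with a fixed window, assuming the classic phrase format of (distance, length, next character) triples; the paper's setting of a slowly growing window is not modelled, and the names `lz77_parse` and `lz77_decode` are illustrative.

```python
def lz77_parse(text, window):
    """Greedy LZ77 parse; each phrase copies from the last `window` characters.

    Returns a list of (distance, length, next_char) triples. Brute force,
    intended for small inputs only.
    """
    phrases = []
    i, n = 0, len(text)
    while i < n:
        best_len, best_dist = 0, 0
        for j in range(max(0, i - window), i):
            length = 0
            # The copy may overlap position i; the last character of the
            # text is always reserved to serve as next_char.
            while i + length < n - 1 and text[j + length] == text[i + length]:
                length += 1
            if length > best_len:
                best_len, best_dist = length, i - j
        phrases.append((best_dist, best_len, text[i + best_len]))
        i += best_len + 1
    return phrases


def lz77_decode(phrases):
    """Rebuild the text from (distance, length, next_char) triples."""
    out = []
    for dist, length, ch in phrases:
        for _ in range(length):
            out.append(out[-dist])  # character-by-character, so overlaps work
        out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    text = "abracadabra" * 20
    for window in (4, 16, 64, 256):
        phrases = lz77_parse(text, window)
        assert lz77_decode(phrases) == text
        print(window, len(phrases))
```

On repetitive input, a window shorter than the repeat distance forces many short phrases, while a window that covers it yields far fewer, which gives some intuition for why available memory affects how quickly the compression ratio improves.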
Fast and Compact Prefix Codes
Cited by 2 (2 self)
It is well-known that, given a probability distribution over n characters, in the worst case it takes Θ(n log n) bits to store a prefix code with minimum expected codeword length. However, in this paper we first show that, for any ε with 0 < ε < 1/2 and 1/ε = O(polylog(n)), it takes O(n log log(1/ε)) bits to store a prefix code with expected codeword length within an additive ε of the minimum. We then show that, for any constant c > 1, it takes O(n^{1/c} log n) bits to store a prefix code with expected codeword length at most c times the minimum. In both cases, our data structures allow us to encode and decode any character in O(1) time.
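For context on what "storing a prefix code" involves, the sketch below builds an optimal (Huffman) code and then re-assigns the codewords canonically, so that the code is fully determined by the codeword length of each symbol. This is the classic canonical-code technique, shown only as a baseline; it is not the data structure proposed in the paper and does not achieve its space bounds or O(1) decoding. All function names are illustrative.

```python
import heapq
from collections import Counter


def huffman_lengths(freqs):
    """Codeword length per symbol for an optimal prefix code."""
    if len(freqs) == 1:
        return {next(iter(freqs)): 1}
    # Heap items are (weight, tiebreak, symbols); the unique tiebreak
    # ensures the symbol lists are never compared.
    heap = [(f, i, [s]) for i, (s, f) in enumerate(sorted(freqs.items()))]
    heapq.heapify(heap)
    lengths = dict.fromkeys(freqs, 0)
    tiebreak = len(heap)
    while len(heap) > 1:
        f1, _, s1 = heapq.heappop(heap)
        f2, _, s2 = heapq.heappop(heap)
        for s in s1 + s2:
            lengths[s] += 1  # every merge adds one bit to these symbols
        heapq.heappush(heap, (f1 + f2, tiebreak, s1 + s2))
        tiebreak += 1
    return lengths


def canonical_code(lengths):
    """Assign codewords in order of (length, symbol), counting upward."""
    code, value, prev_len = {}, 0, 0
    for sym, length in sorted(lengths.items(), key=lambda kv: (kv[1], kv[0])):
        value <<= length - prev_len
        code[sym] = format(value, "0{}b".format(length))
        value += 1
        prev_len = length
    return code


def decode(bits, code):
    """Decode a bit string by accumulating bits until a codeword matches."""
    inverse = {word: sym for sym, word in code.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:
            out.append(inverse[current])
            current = ""
    return "".join(out)


if __name__ == "__main__":
    text = "abracadabra"
    code = canonical_code(huffman_lengths(Counter(text)))
    encoded = "".join(code[c] for c in text)
    assert decode(encoded, code) == text
    print(code, len(encoded))
```

Because only the lengths need to be stored, a canonical code already avoids writing out every codeword explicitly; the paper's results concern how much further this can be reduced when a small loss in expected codeword length is acceptable.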
On Compressing and Indexing Repetitive Sequences
- 2011
Cited by 1 (1 self)
We introduce LZ-End, a new member of the Lempel-Ziv family of text compressors, which achieves compression ratios close to those of LZ77 but performs much faster at extracting arbitrary text substrings. We then build the first self-index based on LZ77 (or LZ-End) compression, which in addition to text extraction offers fast indexed searches on the compressed text. This self-index is particularly effective for representing highly repetitive sequence collections, which arise for example when storing versioned documents, software repositories, periodic publications, and biological sequence databases.
Re-Pair Achieves High-Order Entropy
Re-Pair is a dictionary-based compression method invented in 1999 by Larsson and Moffat. Although its practical performance has been established through experiments, the method has resisted all attempts of formal analysis. In this paper we show that Re-Pair compresses a sequence T[1,n] over an alphabet of size σ and k-th order entropy H_k, to at most 2H_k + o(n log σ) bits, for any k = o(log_σ n).
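The core idea of Re-Pair, repeatedly replacing the most frequent pair of adjacent symbols with a new symbol, can be sketched in a few lines. The version below is a naive quadratic illustration, assuming the input is a string and using integers for the new symbols; it does not reproduce the linear-time data structures of Larsson and Moffat, and the names `repair` and `expand` are illustrative.

```python
from collections import Counter


def pair_counts(seq):
    """Count adjacent pairs, ignoring overlaps inside runs such as 'aaa'."""
    counts = Counter()
    skip = False
    for i in range(len(seq) - 1):
        pair = (seq[i], seq[i + 1])
        if skip and pair == (seq[i - 1], seq[i]):
            skip = False  # this occurrence overlaps the one just counted
            continue
        counts[pair] += 1
        skip = pair[0] == pair[1]
    return counts


def repair(text):
    """Naive Re-Pair. Returns (sequence, rules), rules[new] = (left, right)."""
    seq = list(text)
    rules = {}
    while True:
        counts = pair_counts(seq)
        if not counts:
            break
        pair, freq = counts.most_common(1)[0]
        if freq < 2:
            break  # no pair repeats, so nothing more to gain
        new_sym = len(rules)  # integers cannot collide with characters
        rules[new_sym] = pair
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(new_sym)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return seq, rules


def expand(seq, rules):
    """Recover the original string from the sequence and the rules."""
    out, stack = [], list(reversed(seq))
    while stack:
        sym = stack.pop()
        if sym in rules:
            left, right = rules[sym]
            stack.append(right)
            stack.append(left)
        else:
            out.append(sym)
    return "".join(out)


if __name__ == "__main__":
    text = "abracadabra abracadabra"
    seq, rules = repair(text)
    assert expand(seq, rules) == text
    print(len(text), len(seq), len(rules))
```

The compressed output consists of the final sequence together with the rules, so any size accounting has to include both.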

