Results 1  10
of
13
Compressed fulltext indexes
 ACM COMPUTING SURVEYS
, 2007
"... Fulltext indexes provide fast substring search over large text collections. A serious problem of these indexes has traditionally been their space consumption. A recent trend is to develop indexes that exploit the compressibility of the text, so that their size is a function of the compressed text l ..."
Abstract

Cited by 224 (94 self)
 Add to MetaCart
Fulltext indexes provide fast substring search over large text collections. A serious problem of these indexes has traditionally been their space consumption. A recent trend is to develop indexes that exploit the compressibility of the text, so that their size is a function of the compressed text length. This concept has evolved into selfindexes, which in addition contain enough information to reproduce any text portion, so they replace the text. The exciting possibility of an index that takes space close to that of the compressed text, replaces it, and in addition provides fast search over it, has triggered a wealth of activity and produced surprising results in a very short time, and radically changed the status of this area in less than five years. The most successful indexes nowadays are able to obtain almost optimal space and search time simultaneously. In this paper we present the main concepts underlying selfindexes. We explain the relationship between text entropy and regularities that show up in index structures and permit compressing them. Then we cover the most relevant selfindexes up to date, focusing on the essential aspects on how they exploit the text compressibility and how they solve efficiently various search problems. We aim at giving the theoretical background to understand and follow the developments in this area.
Implicit compression boosting with applications to selfindexing
 In Proc. SPIRE'07, LNCS 4726
, 2007
"... Abstract. Compression boosting (Ferragina & Manzini, SODA 2004) is a new technique to enhance zeroth order entropy compressors ’ performance to kth order entropy. It works by constructing the BurrowsWheeler transform of the input text, finding optimal partitioning of the transform, and then co ..."
Abstract

Cited by 34 (17 self)
 Add to MetaCart
(Show Context)
Abstract. Compression boosting (Ferragina & Manzini, SODA 2004) is a new technique to enhance zeroth order entropy compressors ’ performance to kth order entropy. It works by constructing the BurrowsWheeler transform of the input text, finding optimal partitioning of the transform, and then compressing each piece using an arbitrary zeroth order compressor. The optimal partitioning has the property that the achieved compression is boosted to kth order entropy, for any k. The technique has an application to text indexing: Essentially, building a wavelet tree (Grossi et al., SODA 2003) for each piece in the partitioning yields a kth order compressed fulltext selfindex providing efficient substring searches on the indexed text (Ferragina et al., SPIRE 2004). In this paper, we show that using explicit compression boosting with wavelet trees is not necessary; our new analysis reveals that the size of the wavelet tree built for the complete BurrowsWheeler transformed text is, in essence, the sum of those built for the pieces in the optimal partitioning. Hence, the technique provides a way to do compression boosting implicitly, with a trivial linear time algorithm, but fixed to a specific zeroth order compressor (Raman et al., SODA 2002). In addition to having these consequences on compression and static fulltext selfindexes, the analysis shows that a recent dynamic zeroth order compressed selfindex (Mäkinen & Navarro, CPM 2006) occupies in fact space proportional to kth order entropy. 1
Alphabetindependent compressed text indexing
 In ESA
, 2011
"... Abstract. Selfindexes can represent a text in asymptotically optimal space under the kth order entropy model, give access to text substrings, and support indexed pattern searches. Their time complexities are not optimal, however: they always depend on the alphabet size. In this paper we achieve, f ..."
Abstract

Cited by 22 (15 self)
 Add to MetaCart
(Show Context)
Abstract. Selfindexes can represent a text in asymptotically optimal space under the kth order entropy model, give access to text substrings, and support indexed pattern searches. Their time complexities are not optimal, however: they always depend on the alphabet size. In this paper we achieve, for the first time, full alphabetindependence in the time complexities of selfindexes, while retaining space optimality. We obtain also some relevant byproducts on compressed suffix trees. 1
New lower and upper bounds for representing sequences
 CoRR
"... Abstract. Sequence representations supporting queries access, select and rank are at the core of many data structures. There is a considerable gap between different upper bounds, and the few lower bounds, known for such representations, and how they interact with the space used. In this article we p ..."
Abstract

Cited by 17 (12 self)
 Add to MetaCart
(Show Context)
Abstract. Sequence representations supporting queries access, select and rank are at the core of many data structures. There is a considerable gap between different upper bounds, and the few lower bounds, known for such representations, and how they interact with the space used. In this article we prove a strong lower bound for rank, which holds for rather permissive assumptions on the space used, and give matching upper bounds that require only a compressed representation of the sequence. Within this compressed space, operations access and select can be solved within almostconstant time. 1
On Compressing and Indexing Repetitive Sequences
, 2011
"... We introduce LZEnd, a new member of the LempelZiv family of text compressors, which achieves compression ratios close to those of LZ77 but performs much faster at extracting arbitrary text substrings. We then build the first selfindex based on LZ77 (or LZEnd) compression, which in addition to te ..."
Abstract

Cited by 13 (5 self)
 Add to MetaCart
We introduce LZEnd, a new member of the LempelZiv family of text compressors, which achieves compression ratios close to those of LZ77 but performs much faster at extracting arbitrary text substrings. We then build the first selfindex based on LZ77 (or LZEnd) compression, which in addition to text extraction offers fast indexed searches on the compressed text. This selfindex is particularly effective to represent highly repetitive sequence collections, which arise for example when storing versioned documents, software repositories, periodic publications, and biological sequence databases.
Stronger LempelZiv Based Compressed Text Indexing
, 2008
"... Given a text T[1..u] over an alphabet of size σ, the fulltext search problem consists in finding the occ occurrences of a given pattern P[1..m] in T. In indexed text searching we build an index on T to improve the search time, yet increasing the space requirement. The current trend in indexed text ..."
Abstract

Cited by 12 (7 self)
 Add to MetaCart
(Show Context)
Given a text T[1..u] over an alphabet of size σ, the fulltext search problem consists in finding the occ occurrences of a given pattern P[1..m] in T. In indexed text searching we build an index on T to improve the search time, yet increasing the space requirement. The current trend in indexed text searching is that of compressed fulltext selfindices, which replace the text with a more spaceefficient representation of it, at the same time providing indexed access to the text. Thus, we can provide efficient access within compressed space. The LZindex of Navarro is a compressed fulltext selfindex able to represent T using 4uHk(T) + o(u log σ) bits of space, where Hk(T) denotes the kth order empirical entropy of T, for any k = o(log σ u). This space is about four times the compressed text size. It can locate all the occ occurrences of a pattern P in T in O(m 3 log σ+(m+occ) log u) worstcase time. Despite this index has shown to be very competitive in practice, the O(m 3 log σ) term can be excessive for long patterns. Also, the factor 4 in its space complexity makes it larger than other stateoftheart alternatives. In this paper we present stronger LempelZiv based indices, improving the overall performance of the LZindex. We achieve indices requiring (2+ǫ)uHk(T)+o(u log σ) bits of space, for any constant ǫ> 0, which makes our indices the smallest existing LZindices. We simultaneously improve the search time to
G.: Spaceconscious compression
, 2007
"... Abstract. Compression is most important when space is in short supply, so compression algorithms are often implemented in limited memory. Most analyses ignore memory constraints as an implementation detail, however, creating a gap between theory and practice. In this paper we consider the effect of ..."
Abstract

Cited by 8 (4 self)
 Add to MetaCart
(Show Context)
Abstract. Compression is most important when space is in short supply, so compression algorithms are often implemented in limited memory. Most analyses ignore memory constraints as an implementation detail, however, creating a gap between theory and practice. In this paper we consider the effect of memory limitations on compression algorithms. In the first part we assume the memory available is fixed and prove nearly tight upper and lower bounds on how much memory is needed to compress a string close to its kth order entropy. In the second part we assume the memory available grows (slowly) as more and more characters are read. In this setting we show that the rate of growth of the available memory determines the speed at which the compression ratio approaches the entropy. In particular, we establish a relationship between the rate of growth of the sliding window in the LZ77 algorithm and its convergence rate. 1
Fast and Compact Prefix Codes ⋆
"... Abstract. It is wellknown that, given a probability distribution over n characters, in the worst case it takes Θ(n log n) bits to store a prefix code with minimum expected codeword length. However, in this paper we first show that, for any ɛ with 0 < ɛ < 1/2 and 1/ɛ = O(polylog(n)), it takes ..."
Abstract

Cited by 4 (4 self)
 Add to MetaCart
(Show Context)
Abstract. It is wellknown that, given a probability distribution over n characters, in the worst case it takes Θ(n log n) bits to store a prefix code with minimum expected codeword length. However, in this paper we first show that, for any ɛ with 0 < ɛ < 1/2 and 1/ɛ = O(polylog(n)), it takes O(n log log(1/ɛ)) bits to store a prefix code with expected codeword length within an additive ɛ of the minimum. We then show that, for any constant c> 1, it takes O ( n 1/c log n) bits to store a prefix code with expected codeword length at most c times the minimum. In both cases, our data structures allow us to encode and decode any character in O(1) time. 1
RePair Achieves HighOrder Entropy
"... RePair is a dictionarybased compression method invented in 1999 by Larsson and Moffat. Although its practical performance has been established through experiments, the method has resisted all attempts of formal analysis. In this paper we show that RePair compresses a sequence T[1,n] over an alpha ..."
Abstract

Cited by 3 (0 self)
 Add to MetaCart
(Show Context)
RePair is a dictionarybased compression method invented in 1999 by Larsson and Moffat. Although its practical performance has been established through experiments, the method has resisted all attempts of formal analysis. In this paper we show that RePair compresses a sequence T[1,n] over an alphabet of size σ and kth order entropy Hk, to at most 2Hk + o(n log σ) bits, for any k = o(logσ n).
Optimal Lower and Upper Bounds for Representing Sequences
"... Sequence representations supporting queries access, select and rank are at the core of many data structures. There is a considerable gap between the various upper bounds and the few lower bounds known for such representations, and how they relate to the space used. In this article we prove a strong ..."
Abstract
 Add to MetaCart
Sequence representations supporting queries access, select and rank are at the core of many data structures. There is a considerable gap between the various upper bounds and the few lower bounds known for such representations, and how they relate to the space used. In this article we prove a strong lower bound for rank, which holds for rather permissive assumptions on the space used, and give matching upper bounds that require only a compressed representation of the sequence. Within this compressed space, operations access and select can be solved in constant or almostconstant time, which is optimal for large alphabets. Our new upper bounds dominate all of the previous work in the time/space map.