Results 1-10 of 19
Compressed full-text indexes
ACM Computing Surveys, 2007
Cited by 269 (97 self)
Full-text indexes provide fast substring search over large text collections. A serious problem of these indexes has traditionally been their space consumption. A recent trend is to develop indexes that exploit the compressibility of the text, so that their size is a function of the compressed text length. This concept has evolved into self-indexes, which in addition contain enough information to reproduce any text portion, so they replace the text. The exciting possibility of an index that takes space close to that of the compressed text, replaces it, and in addition provides fast search over it, has triggered a wealth of activity and produced surprising results in a very short time, and radically changed the status of this area in less than five years. The most successful indexes nowadays are able to obtain almost optimal space and search time simultaneously. In this paper we present the main concepts underlying self-indexes. We explain the relationship between text entropy and regularities that show up in index structures and permit compressing them. Then we cover the most relevant self-indexes up to date, focusing on the essential aspects on how they exploit the text compressibility and how they solve efficiently various search problems. We aim at giving the theoretical background to understand and follow the developments in this area.
Compressed representations of sequences and full-text indexes
ACM Transactions on Algorithms, 2007
Cited by 162 (79 self)
Given a sequence $S = s_1 s_2 \ldots s_n$ of integers smaller than $r = O(\mathrm{polylog}(n))$, we show how $S$ can be represented using $nH_0(S) + o(n)$ bits, so that we can know any $s_q$, as well as answer rank and select queries on $S$, in constant time. $H_0(S)$ is the zero-order empirical entropy of $S$, and $nH_0(S)$ provides an information-theoretic lower bound to the bit storage of any sequence $S$ via a fixed encoding of its symbols. This extends previous results on binary sequences, and improves previous results on general sequences where those queries are answered in $O(\log r)$ time. For larger $r$, we can still represent $S$ in $nH_0(S) + o(n \log r)$ bits and answer queries in $O(\log r / \log \log n)$ time. Another contribution of this paper is to show how to combine our compressed representation of integer sequences with an existing compression boosting technique to design compressed full-text indexes that scale well with the size of the input alphabet $\Sigma$. Namely, we design a variant of the FM-index that indexes a string $T[1, n]$ within $nH_k(T) + o(n)$ bits of storage, where $H_k(T)$ is the $k$-th order empirical entropy of $T$. This space bound holds simultaneously for all $k \le \alpha \log_{|\Sigma|} n$, constant $0 < \alpha < 1$, and $|\Sigma| = O(\mathrm{polylog}(n))$. This index counts the occurrences of an arbitrary pattern $P[1, p]$ as a substring of $T$ in $O(p)$ time; it locates each pattern occurrence in $O(\log^{1+\varepsilon} n)$ time, for any constant $0 < \varepsilon < 1$; and it reports a text substring of length $\ell$ in $O(\ell + \log^{1+\varepsilon} n)$ time.
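For readers unfamiliar with the queries named in this abstract, the following is a minimal sketch of what rank and select return and of how the zero-order empirical entropy $H_0(S)$ is computed. It uses naive linear-time scans, not the constant-time compressed representation the paper describes; the function names and the 0-based indexing are assumptions made for illustration only.

```python
import math
from collections import Counter


def rank(seq, symbol, i):
    """Number of occurrences of `symbol` among the first i elements of seq."""
    return sum(1 for s in seq[:i] if s == symbol)


def select(seq, symbol, j):
    """0-based position of the j-th occurrence (j >= 1) of `symbol`, or -1."""
    count = 0
    for pos, s in enumerate(seq):
        if s == symbol:
            count += 1
            if count == j:
                return pos
    return -1


def h0(seq):
    """Zero-order empirical entropy of seq, in bits per symbol."""
    n = len(seq)
    return sum((c / n) * math.log2(n / c) for c in Counter(seq).values())


if __name__ == "__main__":
    s = [0, 1, 1, 2, 1, 0]
    print(rank(s, 1, 4), select(s, 1, 3), h0(s))
```

Multiplying `h0(seq)` by `len(seq)` gives the $nH_0(S)$ term that appears in the space bounds above.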
Succinct suffix arrays based on run-length encoding
Nordic Journal of Computing, 2005
Cited by 62 (33 self)
A succinct full-text self-index is a data structure built on a text $T = t_1 t_2 \ldots t_n$, which takes little space (ideally close to that of the compressed text), permits efficient search for the occurrences of a pattern $P = p_1 p_2 \ldots p_m$ in $T$, and is able to reproduce any text substring, so the self-index replaces the text. Several remarkable self-indexes have been developed in recent years. Many of those take space proportional to $nH_0$ or $nH_k$ bits, where $H_k$ is the $k$-th order empirical entropy of $T$. The time to count how many times $P$ occurs in $T$ ranges from $O(m)$ to $O(m \log n)$. In this paper we present a new self-index, called RLFM index for "run-length FM-index", that counts the occurrences of $P$ in $T$ in $O(m)$ time when the alphabet size is $\sigma = O(\mathrm{polylog}(n))$. The RLFM index requires $nH_k \log \sigma + O(n)$ bits of space, for any $k \le \alpha \log_\sigma n$ and constant $0 < \alpha < 1$. Previous indexes that achieve $O(m)$ counting time either require more than $nH_0$ bits of space or require that $\sigma = O(1)$. We also show that the RLFM index can be enhanced to locate occurrences in the text and display text substrings in time independent of $\sigma$. In addition, we prove a close relationship between the $k$-th order entropy of the text and some regularities that show up in its suffix array and in the Burrows-Wheeler transform of $T$. This relationship is of independent interest and permits bounding the space occupancy of the RLFM index, as well as that of other existing compressed indexes. Finally, we present some practical considerations in order to implement the RLFM index, obtaining two implementations with different space-time tradeoffs. We empirically compare our indexes against the best existing implementations and show that they are practical and competitive against those.
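The regularity this abstract refers to, equal symbols clustering into runs in the Burrows-Wheeler transform, can be observed with a small sketch. This is not the RLFM index: it builds the transform naively by sorting all rotations, assumes a `$` terminator that does not occur in the text and sorts before every other symbol, and the function names are hypothetical.

```python
from itertools import groupby


def bwt(text, terminator="$"):
    """Burrows-Wheeler transform via sorted rotations (quadratic, demo only)."""
    if terminator in text:
        raise ValueError("terminator must not occur in the text")
    s = text + terminator
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def run_length_encode(s):
    """Collapse s into (symbol, run length) pairs."""
    return [(symbol, len(list(group))) for symbol, group in groupby(s)]


if __name__ == "__main__":
    transformed = bwt("banana")
    print(transformed, run_length_encode(transformed))
```

The fewer runs the transformed text has, the shorter its run-length encoding; the paper bounds the number of runs in terms of the $k$-th order entropy.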
A compressed self-index using a Ziv-Lempel dictionary
In: SPIRE. Volume 4209 of LNCS. (2006) 163–180
Cited by 21 (6 self)
A compressed full-text self-index for a text $T$, of size $u$, is a data structure used to search patterns $P$, of size $m$, in $T$ that requires reduced space, i.e. that depends on the empirical entropy ($H_k$, $H_0$) of $T$, and is, furthermore, able to reproduce any substring of $T$. In this paper we present a new compressed self-index able to locate the occurrences of $P$ in $O((m + occ) \log n)$ time, where $occ$ is the number of occurrences and $\sigma$ the size of the alphabet of $T$. The fundamental improvement over previous LZ78-based indexes is the reduction of the search time dependency on $m$ from $O(m^2)$ to $O(m)$. To achieve this result we point out the main obstacle to linear time algorithms based on LZ78 data compression and expose and explore the nature of a recurrent structure in LZ-indexes, the $T_{78}$ suffix tree. We show that our method is very competitive in practice by comparing it against the LZ-Index, the FM-index and a compressed suffix array.
DACs: Bringing Direct Access to Variable-Length Codes
2012
Cited by 14 (6 self)
We present a new variable-length encoding scheme for sequences of integers, Directly Addressable Codes (DACs), which enables direct access to any element of the encoded sequence without the need of any sampling method. Our proposal is a kind of implicit data structure that introduces synchronism in the encoded sequence without using asymptotically any extra space. We show some experiments demonstrating that the technique is not only simple, but also competitive in time and space with existing solutions in several applications, such as the representation of LCP arrays or high-order entropy-compressed sequences.
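The idea of direct access into a variable-length code can be illustrated with a simplified, layered sketch. It is not the authors' implementation: the class name `DAC`, the `chunk_bits` parameter, and the least-significant-chunk-first order are assumptions, values are assumed to be non-negative integers, and the precomputed prefix counts stand in for the succinct rank structure that a space-efficient version would keep over each bitmap.

```python
class DAC:
    """Layered chunk layout: level k holds the k-th chunk of every value
    that needs at least k + 1 chunks, plus a continuation bit per chunk."""

    def __init__(self, values, chunk_bits=2):
        self.b = chunk_bits
        self.chunks = []  # chunks[k][j]: chunk of the j-th value present at level k
        self.more = []    # more[k][j]: 1 if that value continues at level k + 1
        self.ranks = []   # ranks[k][j]: number of 1s in more[k][0:j]
        mask = (1 << chunk_bits) - 1
        current = list(values)
        while current:
            level_chunks, level_more, following = [], [], []
            for v in current:
                level_chunks.append(v & mask)
                rest = v >> chunk_bits
                level_more.append(1 if rest else 0)
                if rest:
                    following.append(rest)
            prefix = [0]
            for bit in level_more:
                prefix.append(prefix[-1] + bit)
            self.chunks.append(level_chunks)
            self.more.append(level_more)
            self.ranks.append(prefix)
            current = following

    def access(self, i):
        """Return the i-th original value without decoding its predecessors."""
        value, shift, level = 0, 0, 0
        while True:
            value |= self.chunks[level][i] << shift
            if not self.more[level][i]:
                return value
            i = self.ranks[level][i]  # position of this value at the next level
            shift += self.b
            level += 1


if __name__ == "__main__":
    dac = DAC([5, 0, 17, 3, 255], chunk_bits=2)
    print([dac.access(i) for i in range(5)])
```

Small values occupy only the first level, so the total number of stored chunks tracks the variable-length code size, while `access` visits at most one position per level.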
Word-based Statistical Compressors as Natural Language Compression Boosters
2008
Cited by 7 (1 self)
Semistatic word-based byte-oriented compression codes are known to be attractive alternatives to compress natural language texts. With compression ratios around 30%, they allow direct pattern searching on the compressed text up to 8 times faster than on its uncompressed version. In this paper we reveal that these compressors have even more benefits. We show that most of the state-of-the-art compressors such as the block-wise bzip2, those from the Ziv-Lempel family, and the predictive ppm-based ones, can benefit from compressing not the original text, but its compressed representation obtained by a word-based byte-oriented statistical compressor. In particular, our experimental results show that using Dense-Code-based compression as a preprocessing step to classical compressors like bzip2, gzip, or ppmdi, yields several important benefits. For example, the ppm family is known for achieving the best compression ratios. With a Dense coding preprocessing, ppmdi achieves even better compression ratios (the best we know of on natural language) and much faster compression/decompression than ppmdi alone. Text indexing also profits from our preprocessing step. A compressed self-index achieves much better space and time performance when preceded by a semistatic word-based compression step. We show, for example, that the AF-FM-index coupled with Tagged Huffman coding is an attractive alternative index for natural language texts.

1 Introduction. Traditionally, classical compressors used characters as the symbols to be compressed; that is, they regarded the text as a sequence of characters. Classical Huffman [11] uses a semistatic model to assign shorter codes to more frequent symbols. Unfortunately, the compression obtained when applied to natural language English text is very poor (around 65%). Other well-known compressors are the dictionary-based algorithms of
Simple Compression Code Supporting Random Access and Fast String Matching
Cited by 3 (2 self)
Given a sequence $S$ of $n$ symbols over some alphabet $\Sigma$, we develop a new compression method that (i) is very simple to implement; (ii) provides $O(1)$ time random access to any symbol of the original sequence; (iii) allows efficient pattern matching over the compressed sequence. Our simplest solution uses at most $2h + o(h)$ bits of space, where $h = n(H_0(S) + 1)$, and $H_0(S)$ is the zeroth-order empirical entropy of $S$. We discuss a number of improvements and tradeoffs over the basic method. The new method is applied to text compression. We also propose average-case optimal string matching algorithms.
FM-KZ: an even simpler alphabet-independent FM-index
Czech Technical University, Prague, 2006
Cited by 3 (1 self)
In an earlier work [6] we presented a simple FM-index variant, based on the idea of Huffman-compressing the text and then applying the Burrows-Wheeler transform over it. The main drawback of using Huffman was its lack of synchronizing properties, forcing us to supply another bit stream indicating the Huffman codeword boundaries. In this way, the resulting index needed $O(n(H_0 + 1))$ bits of space but with the constant 2 (concerning the main term). There are several options aiming to mitigate the overhead in space, with various effects on the query handling speed. In this work we propose Kautz-Zeckendorf coding as a both simple and practical replacement for Huffman. We dub the new index FM-KZ. We also present an efficient implementation of the rank operation, which is the main building brick of the FM-KZ. Experimental results show that our index provides an attractive space/time tradeoff in comparison with existing succinct data structures, and in the DNA test it even wins both in search time and space use. An additional asset of our solution is its relative simplicity.
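The synchronizing property that Huffman codes lack can be illustrated with the classical Fibonacci code, which is built on the Zeckendorf representation (every positive integer is a sum of non-consecutive Fibonacci numbers). The sketch below assumes this simplest member of the code family; the abstract does not say which variant FM-KZ uses or how symbols are mapped to codewords, so the function names and the bit-string representation are illustrative only.

```python
def fibonacci_encode(n):
    """Fibonacci code of a positive integer as a bit string ending in '11'."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    fibs = [1, 2]
    while fibs[-1] + fibs[-2] <= n:
        fibs.append(fibs[-1] + fibs[-2])
    bits = [0] * len(fibs)
    remaining = n
    # Greedy choice from the largest Fibonacci number never picks two neighbours.
    for idx in range(len(fibs) - 1, -1, -1):
        if fibs[idx] <= remaining:
            bits[idx] = 1
            remaining -= fibs[idx]
    while bits[-1] == 0:
        bits.pop()
    # The appended 1 creates the only '11' in the codeword, marking its end.
    return "".join(str(bit) for bit in bits) + "1"


def fibonacci_decode_stream(bitstring):
    """Decode a concatenation of Fibonacci codewords into a list of integers."""
    values = []
    total, fib, next_fib, previous = 0, 1, 2, "0"
    for bit in bitstring:
        if bit == "1" and previous == "1":
            values.append(total)  # '11' closes the current codeword
            total, fib, next_fib, previous = 0, 1, 2, "0"
            continue
        if bit == "1":
            total += fib
        fib, next_fib = next_fib, fib + next_fib
        previous = bit
    return values


if __name__ == "__main__":
    stream = "".join(fibonacci_encode(v) for v in [1, 4, 11, 2])
    print(stream, fibonacci_decode_stream(stream))
```

Because codeword boundaries are recognizable from the bits themselves, no separate boundary bit stream is needed, which is the overhead the abstract attributes to the Huffman-based variant.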
Boosting Text Compression with Word-based Statistical Encoding
Cited by 2 (0 self)
Semistatic word-based byte-oriented compressors are known to be attractive alternatives to compress natural language texts. With compression ratios around 30-35%, they allow fast direct searching of compressed text. In this article we reveal that these compressors have even more benefits. We show that most of the state-of-the-art compressors benefit from compressing not the original text, but the compressed representation obtained by a word-based byte-oriented statistical compressor. For example, p7zip with a dense-coding preprocessing achieves even better compression ratios and much faster compression than p7zip alone. We reach compression ratios below 17% in typical large English texts, which was obtained only by the slow PPM compressors. Furthermore, searches perform much faster if the final compressor operates over word-based compressed text. We show that typical self-indexes also profit from our preprocessing step. They achieve much better space and time performance when indexing is preceded by a compression step. Apart from using the well-known Tagged Huffman code, we present a new suffix-free Dense-Code-based compressor that compresses slightly better. We also show how some self-indexes can handle non-suffix-free codes. As a result, the compressed/indexed text requires around 35% of the space of the original text and allows indexed searches for both words and phrases.
FM-KZ: An Even Simpler Alphabet-Independent FM-Index
Cited by 2 (0 self)
In an earlier work [6] we presented a simple FM-index variant, based on the idea of Huffman-compressing the text and then applying the Burrows-Wheeler transform over it. The main drawback of using Huffman was its lack of synchronizing properties, forcing us to supply another bit stream indicating the Huffman codeword boundaries. In this way, the resulting index needed $O(n(H_0 + 1))$ bits of space but with the constant 2 (concerning the main term). There are several options aiming to mitigate the overhead in space, with various effects on the query handling speed. In this work we propose Kautz-Zeckendorf coding as a both simple and practical replacement for Huffman. We dub the new index FM-KZ. We also present an efficient implementation of the rank operation, which is the main building brick of the FM-KZ. Experimental results show that our index provides an attractive space/time tradeoff in comparison with existing succinct data structures, and in the DNA test it even wins both in search time and space use. An additional asset of our solution is its relative simplicity.