Results 1 - 10
of
13
Compressed full-text indexes
- ACM COMPUTING SURVEYS
, 2007
"... Full-text indexes provide fast substring search over large text collections. A serious problem of these indexes has traditionally been their space consumption. A recent trend is to develop indexes that exploit the compressibility of the text, so that their size is a function of the compressed text l ..."
Abstract
-
Cited by 142 (70 self)
- Add to MetaCart
Full-text indexes provide fast substring search over large text collections. A serious problem of these indexes has traditionally been their space consumption. A recent trend is to develop indexes that exploit the compressibility of the text, so that their size is a function of the compressed text length. This concept has evolved into self-indexes, which in addition contain enough information to reproduce any text portion, so they replace the text. The exciting possibility of an index that takes space close to that of the compressed text, replaces it, and in addition provides fast search over it, has triggered a wealth of activity and produced surprising results in a very short time, and radically changed the status of this area in less than five years. The most successful indexes nowadays are able to obtain almost optimal space and search time simultaneously. In this paper we present the main concepts underlying self-indexes. We explain the relationship between text entropy and regularities that show up in index structures and permit compressing them. Then we cover the most relevant self-indexes up to date, focusing on the essential aspects on how they exploit the text compressibility and how they solve efficiently various search problems. We aim at giving the theoretical background to understand and follow the developments in this area.
Succinct Representation of Balanced Parentheses, Static Trees and Planar Graphs
, 1999
"... We consider the implementation of abstract data types for the static objects: binary tree, rooted ordered tree and balanced parenthesis expression. Our representations use an amount of space within a lower order term of the information theoretic minimum and support, in constant time, a richer set ..."
Abstract
-
Cited by 116 (5 self)
- Add to MetaCart
We consider the implementation of abstract data types for the static objects: binary tree, rooted ordered tree and balanced parenthesis expression. Our representations use an amount of space within a lower order term of the information theoretic minimum and support, in constant time, a richer set of navigational operations than has previously been considered in similar work. In the case of binary trees, for instance, we can move from a node to its left or right child or to the parent in constant time while retaining knowledge of the size of the subtree at which we are positioned. The approach is applied to produce succinct representation of planar graphs in which one can test adjacency in constant time. Keywords: abstract data type, succinct representation, binary trees, balanced parenthesis, rooted ordered trees, planar graphs. AMS subject classifications: 68P05, 68Q65 1 Introduction The binary tree is among the most fundamental of data structures. While it is often the c...
Compressed representations of sequences and full-text indexes
- ACM Transactions on Algorithms
, 2007
"... Abstract. Given a sequence S = s1s2... sn of integers smaller than r = O(polylog(n)), we show how S can be represented using nH0(S) + o(n) bits, so that we can know any sq, as well as answer rank and select queries on S, in constant time. H0(S) is the zero-order empirical entropy of S and nH0(S) pro ..."
Abstract
-
Cited by 92 (55 self)
- Add to MetaCart
Abstract. Given a sequence S = s1s2... sn of integers smaller than r = O(polylog(n)), we show how S can be represented using nH0(S) + o(n) bits, so that we can know any sq, as well as answer rank and select queries on S, in constant time. H0(S) is the zero-order empirical entropy of S and nH0(S) provides an Information Theoretic lower bound to the bit storage of any sequence S via a fixed encoding of its symbols. This extends previous results on binary sequences, and improves previous results on general sequences where those queries are answered in O(log r) time. For larger r, we can still represent S in nH0(S) + o(n log r) bits and answer queries in O(log r / log log n) time. Another contribution of this paper is to show how to combine our compressed representation of integer sequences with an existing compression boosting technique to design compressed full-text indexes that scale well with the size of the input alphabet Σ. Namely, we design a variant of the FM-index that indexes a string T [1, n] within nHk(T) + o(n) bits of storage, where Hk(T) is the k-th order empirical entropy of T. This space bound holds simultaneously for all k ≤ α log |Σ | n, constant 0 < α < 1, and |Σ | = O(polylog(n)). This index counts the occurrences of an arbitrary pattern P [1, p] as a substring of T in O(p) time; it locates each pattern occurrence in O(log 1+ε n) time, for any constant 0 < ε < 1; and it reports a text substring of length ℓ in O(ℓ + log 1+ε n) time.
Succinct suffix arrays based on run-length encoding
- Nordic Journal of Computing
, 2005
"... A succinct full-text self-index is a data structure built on a text T = t1t2...tn, which takes little space (ideally close to that of the compressed text), permits efficient search for the occurrences of a pattern P = p1p2... pm in T, and is able to reproduce any text substring, so the self-index re ..."
Abstract
-
Cited by 46 (32 self)
- Add to MetaCart
A succinct full-text self-index is a data structure built on a text T = t1t2...tn, which takes little space (ideally close to that of the compressed text), permits efficient search for the occurrences of a pattern P = p1p2... pm in T, and is able to reproduce any text substring, so the self-index replaces the text. Several remarkable self-indexes have been developed in recent years. Many of those take space proportional to nH0 or nHk bits, where Hk is the kth order empirical entropy of T. The time to count how many times does P occur in T ranges from O(m) to O(m log n). In this paper we present a new self-index, called RLFM index for “run-length FM-index”, that counts the occurrences of P in T in O(m) time when the alphabet size is σ = O(polylog(n)). The RLFM index requires nHk log σ + O(n) bits of space, for any k ≤ α log σ n and constant 0 < α < 1. Previous indexes that achieve O(m) counting time either require more than nH0 bits of space or require that σ = O(1). We also show that the RLFM index can be enhanced to locate occurrences in the text and display text substrings in time independent of σ. In addition, we prove a close relationship between the kth order entropy of the text and some regularities that show up in their suffix arrays and in the Burrows-Wheeler transform of T. This relationship is of independent interest and permits bounding the space occupancy of the RLFM index, as well as that of other existing compressed indexes. Finally, we present some practical considerations in order to implement the RLFM index, obtaining two implementations with different space-time tradeoffs. We empirically compare our indexes against the best existing implementations and show that they are practical and competitive against those. 1
Practical implementation of rank and select queries
- In Poster Proceedings Volume of 4th Workshop on Efficient and Experimental Algorithms (WEA’05) (Greece
, 2005
"... Research on succinct data structures has made significant progress in recent years. An essential building block of many of those techniques is a data structure to perform rank and select operations over a bit array. The first operation tells how many bits are set up to some position, and the second ..."
Abstract
-
Cited by 30 (13 self)
- Add to MetaCart
Research on succinct data structures has made significant progress in recent years. An essential building block of many of those techniques is a data structure to perform rank and select operations over a bit array. The first operation tells how many bits are set up to some position, and the second the position of the i-th bit set. Albeit there exist constanttime solutions that require sublinear extra space, the practicality of those solutions against more naive ones has not been carefully studied. In this paper we show some results in this respect, which suggest that in many practical cases the simpler solutions are better in terms of time and extra space.
Rank and select revisited and extended
- Workshop on SpaceConscious Algorithms, University of
, 2006
"... The deep connection between the Burrows-Wheeler transform (BWT) and the socalled rank and select data structures for symbol sequences is the basis of most successful approaches to compressed text indexing. Rank of a symbol at a given position equals the number of times the symbol appears in the corr ..."
Abstract
-
Cited by 29 (17 self)
- Add to MetaCart
The deep connection between the Burrows-Wheeler transform (BWT) and the socalled rank and select data structures for symbol sequences is the basis of most successful approaches to compressed text indexing. Rank of a symbol at a given position equals the number of times the symbol appears in the corresponding prefix of the sequence. Select is the inverse, retrieving the positions of the symbol occurrences. It has been shown that improvements to rank/select algorithms, in combination with the BWT, turn into improved compressed text indexes. This paper is devoted to alternative implementations and extensions of rank and select data structures. First, we show that one can use gap encoding techniques to obtain constant time rank and select queries in essentially the same space as what is achieved by the best current direct solution (and sometimes less). Second, we extend symbol rank and select to substring rank and select, giving several space/time tradeoffs for the problem. An application of these queries is in position-restricted substring searching, where one can specify the range in the text where the search is restricted to, and only occurrences residing in that range are to be reported. In addition, arbitrary occurrences are reported in text position order. Several byproducts of our results display connections with searchable partial sums, Chazelle’s two-dimensional data structures, and Grossi et al.’s wavelet trees.
Implicit compression boosting with applications to self-indexing
- In Proc. SPIRE'07, LNCS 4726
, 2007
"... Abstract. Compression boosting (Ferragina & Manzini, SODA 2004) is a new technique to enhance zeroth order entropy compressors ’ performance to k-th order entropy. It works by constructing the Burrows-Wheeler transform of the input text, finding optimal partitioning of the transform, and then compre ..."
Abstract
-
Cited by 23 (14 self)
- Add to MetaCart
Abstract. Compression boosting (Ferragina & Manzini, SODA 2004) is a new technique to enhance zeroth order entropy compressors ’ performance to k-th order entropy. It works by constructing the Burrows-Wheeler transform of the input text, finding optimal partitioning of the transform, and then compressing each piece using an arbitrary zeroth order compressor. The optimal partitioning has the property that the achieved compression is boosted to k-th order entropy, for any k. The technique has an application to text indexing: Essentially, building a wavelet tree (Grossi et al., SODA 2003) for each piece in the partitioning yields a k-th order compressed full-text self-index providing efficient substring searches on the indexed text (Ferragina et al., SPIRE 2004). In this paper, we show that using explicit compression boosting with wavelet trees is not necessary; our new analysis reveals that the size of the wavelet tree built for the complete Burrows-Wheeler transformed text is, in essence, the sum of those built for the pieces in the optimal partitioning. Hence, the technique provides a way to do compression boosting implicitly, with a trivial linear time algorithm, but fixed to a specific zeroth order compressor (Raman et al., SODA 2002). In addition to having these consequences on compression and static full-text self-indexes, the analysis shows that a recent dynamic zeroth order compressed self-index (Mäkinen & Navarro, CPM 2006) occupies in fact space proportional to k-th order entropy. 1
Statistical encoding of succinct data structures
- In Proc. 17th CPM, LNCS 4009
, 2006
"... Abstract. In recent work, Sadakane and Grossi [SODA 2006] introduced a scheme to represent any (k log σ + log log n)) bits sequence S = s1s2... sn, over an alphabet of size σ, using nHk(S) + O ( n log σ n of space, where Hk(S) is the k-th order empirical entropy of S. The representation permits extr ..."
Abstract
-
Cited by 22 (10 self)
- Add to MetaCart
Abstract. In recent work, Sadakane and Grossi [SODA 2006] introduced a scheme to represent any (k log σ + log log n)) bits sequence S = s1s2... sn, over an alphabet of size σ, using nHk(S) + O ( n log σ n of space, where Hk(S) is the k-th order empirical entropy of S. The representation permits extracting any substring of size Θ(log σ n) in constant time, and thus it completely replaces S under the RAM model. This is extremely important because it permits converting any succinct data structure requiring o(|S|) = o(n log σ) bits in addition to S, into another requiring nHk(S) + o(n log σ) (overall) for any k = o(log σ n). They achieve this result by using Ziv-Lempel compression, and conjecture that the result can in particular be useful to implement compressed full-text indexes. In this paper we extend their result, by obtaining the same space and time complexities using a simpler scheme based on statistical encoding. We show that the scheme supports appending symbols in constant amortized time. In addition, we prove some results on the applicability of the scheme for full-text selfindexing. 1
Succinct Representation of Sequences
, 2004
"... Abstract. Given a sequence S = s1s2... sn such that 1 ≤ sq ≤ r for all q, where r = O(polylog(n)), we show how S can be represented using nH0(S)+o(n) bits (where H0(S) is the zero-order entropy of S), so that we can know any sq, as well as answer rank and select queries on S, in constant time. This ..."
Abstract
-
Cited by 12 (5 self)
- Add to MetaCart
Abstract. Given a sequence S = s1s2... sn such that 1 ≤ sq ≤ r for all q, where r = O(polylog(n)), we show how S can be represented using nH0(S)+o(n) bits (where H0(S) is the zero-order entropy of S), so that we can know any sq, as well as answer rank and select queries on S, in constant time. This extends previous results on binary sequences, and improves previous results on general sequences where those queries are answered in O(log r) time. Furthermore, we show how our technique can be applied to improve a succinct full-text index. 1

