High-order entropy-compressed text indexes
, 2003
Cited by 163 (20 self)
We present a novel implementation of compressed suffix arrays exhibiting new tradeoffs between search time and space occupancy for a given text (or sequence) of n symbols over an alphabet Σ, where each symbol is encoded by lg |Σ| bits. We show that compressed suffix arrays use just nH_h + O(n lg lg n / lg_|Σ| n) bits, while retaining full text indexing functionalities, such as searching any pattern sequence of length m in O(m lg |Σ| + polylog(n)) time. The term H_h ≤ lg |Σ| denotes the hth-order empirical entropy of the text, which means that our index is nearly optimal in space apart from lower-order terms, achieving asymptotically the empirical entropy of the text (with a multiplicative constant 1). If the text is highly compressible so that H_h = o(1) and the alphabet size is small, we obtain a text index with o(m) search time that requires only o(n) bits. Further results and tradeoffs are reported in the paper.
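As an illustration of the quantity H_h used in this abstract, the following is a minimal sketch of the standard definition of hth-order empirical entropy: group the symbols of the text by the h symbols that precede them, compute the zero-order entropy of each group, and average weighted by group size. The function names `h0` and `hk` are illustrative assumptions and do not come from the paper.

```python
from collections import Counter, defaultdict
from math import log2


def h0(seq):
    """Zero-order empirical entropy of seq, in bits per symbol."""
    n = len(seq)
    if n == 0:
        return 0.0
    return sum((c / n) * log2(n / c) for c in Counter(seq).values())


def hk(text, k):
    """k-th order empirical entropy of text, in bits per symbol.

    Each symbol is grouped under the k symbols preceding it (its context);
    the result is the size-weighted average of the groups' H_0, divided by n.
    """
    if k == 0:
        return h0(text)
    followers = defaultdict(list)
    for i in range(k, len(text)):
        followers[text[i - k:i]].append(text[i])
    return sum(len(f) * h0(f) for f in followers.values()) / len(text)
```

On a text such as "abababab", every symbol is determined by its predecessor, so the first-order entropy drops to zero while the zero-order entropy stays at one bit per symbol. This is the kind of gap that high-order entropy-compressed indexes exploit.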
Compressed full-text indexes
- ACM COMPUTING SURVEYS
, 2007
Cited by 142 (70 self)
Full-text indexes provide fast substring search over large text collections. A serious problem of these indexes has traditionally been their space consumption. A recent trend is to develop indexes that exploit the compressibility of the text, so that their size is a function of the compressed text length. This concept has evolved into self-indexes, which in addition contain enough information to reproduce any text portion, so they replace the text. The exciting possibility of an index that takes space close to that of the compressed text, replaces it, and in addition provides fast search over it, has triggered a wealth of activity and produced surprising results in a very short time, and radically changed the status of this area in less than five years. The most successful indexes nowadays are able to obtain almost optimal space and search time simultaneously. In this paper we present the main concepts underlying self-indexes. We explain the relationship between text entropy and regularities that show up in index structures and permit compressing them. Then we cover the most relevant self-indexes up to date, focusing on the essential aspects on how they exploit the text compressibility and how they solve efficiently various search problems. We aim at giving the theoretical background to understand and follow the developments in this area.
Compressed representations of sequences and full-text indexes
- ACM Transactions on Algorithms
, 2007
Cited by 92 (55 self)
Given a sequence S = s_1 s_2 ... s_n of integers smaller than r = O(polylog(n)), we show how S can be represented using nH_0(S) + o(n) bits, so that we can know any s_q, as well as answer rank and select queries on S, in constant time. H_0(S) is the zero-order empirical entropy of S and nH_0(S) provides an information-theoretic lower bound to the bit storage of any sequence S via a fixed encoding of its symbols. This extends previous results on binary sequences, and improves previous results on general sequences where those queries are answered in O(log r) time. For larger r, we can still represent S in nH_0(S) + o(n log r) bits and answer queries in O(log r / log log n) time. Another contribution of this paper is to show how to combine our compressed representation of integer sequences with an existing compression boosting technique to design compressed full-text indexes that scale well with the size of the input alphabet Σ. Namely, we design a variant of the FM-index that indexes a string T[1, n] within nH_k(T) + o(n) bits of storage, where H_k(T) is the k-th order empirical entropy of T. This space bound holds simultaneously for all k ≤ α log_|Σ| n, constant 0 < α < 1, and |Σ| = O(polylog(n)). This index counts the occurrences of an arbitrary pattern P[1, p] as a substring of T in O(p) time; it locates each pattern occurrence in O(log^(1+ε) n) time, for any constant 0 < ε < 1; and it reports a text substring of length ℓ in O(ℓ + log^(1+ε) n) time.
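To make the three queries named in this abstract concrete, here is a plain, uncompressed sketch of access, rank, and select on a sequence. It only illustrates what the queries mean; it is not the paper's structure, which answers them in constant time within nH_0(S) + o(n) bits. The class name `RankSelect` and the conventions chosen (0-based positions, exclusive upper bound in rank, 1-based occurrence index in select) are assumptions made for this example.

```python
from bisect import bisect_right
from collections import defaultdict


class RankSelect:
    """Naive rank/select over a sequence, using one sorted position list per symbol."""

    def __init__(self, seq):
        self.seq = list(seq)
        self.pos = defaultdict(list)
        for i, s in enumerate(self.seq):
            self.pos[s].append(i)  # positions are appended in increasing order

    def access(self, q):
        """Symbol stored at 0-based position q."""
        return self.seq[q]

    def rank(self, c, i):
        """Number of occurrences of c in seq[0:i] (position i itself excluded)."""
        return bisect_right(self.pos.get(c, []), i - 1)

    def select(self, c, j):
        """0-based position of the j-th occurrence of c (j starts at 1)."""
        return self.pos[c][j - 1]
```

In this sketch rank costs a binary search and select is a list lookup, at the price of storing every position explicitly, which is exactly the overhead that compressed representations are designed to avoid.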
Succinct suffix arrays based on run-length encoding
- Nordic Journal of Computing
, 2005
Cited by 46 (32 self)
A succinct full-text self-index is a data structure built on a text T = t_1 t_2 ... t_n, which takes little space (ideally close to that of the compressed text), permits efficient search for the occurrences of a pattern P = p_1 p_2 ... p_m in T, and is able to reproduce any text substring, so the self-index replaces the text. Several remarkable self-indexes have been developed in recent years. Many of those take space proportional to nH_0 or nH_k bits, where H_k is the kth order empirical entropy of T. The time to count how many times P occurs in T ranges from O(m) to O(m log n). In this paper we present a new self-index, called RLFM index for "run-length FM-index", that counts the occurrences of P in T in O(m) time when the alphabet size is σ = O(polylog(n)). The RLFM index requires nH_k log σ + O(n) bits of space, for any k ≤ α log_σ n and constant 0 < α < 1. Previous indexes that achieve O(m) counting time either require more than nH_0 bits of space or require that σ = O(1). We also show that the RLFM index can be enhanced to locate occurrences in the text and display text substrings in time independent of σ. In addition, we prove a close relationship between the kth order entropy of the text and some regularities that show up in their suffix arrays and in the Burrows-Wheeler transform of T. This relationship is of independent interest and permits bounding the space occupancy of the RLFM index, as well as that of other existing compressed indexes. Finally, we present some practical considerations in order to implement the RLFM index, obtaining two implementations with different space-time tradeoffs. We empirically compare our indexes against the best existing implementations and show that they are practical and competitive against those.
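The regularities mentioned in this abstract can be observed directly: repetitive text tends to produce long runs of equal symbols in its Burrows-Wheeler transform. The sketch below computes the transform from a naively sorted suffix array and run-length encodes it. It is a toy illustration only, not the RLFM index; the function names and the use of "\0" as an end-of-text sentinel (assumed not to occur in the text) are choices made for this example.

```python
from itertools import groupby


def bwt_from_suffix_array(text):
    """Burrows-Wheeler transform of text via a naively sorted suffix array.

    Assumes text does not contain "\0", which is appended as a sentinel
    smaller than every other symbol.
    """
    s = text + "\0"
    sa = sorted(range(len(s)), key=lambda i: s[i:])  # quadratic-space toy sort
    # The BWT symbol for suffix i is the symbol preceding it (wrapping at i = 0).
    return "".join(s[i - 1] for i in sa)


def run_length_encode(s):
    """List of (symbol, run length) pairs for the maximal runs of s."""
    return [(c, len(list(g))) for c, g in groupby(s)]
```

Comparing `len(run_length_encode(bwt_from_suffix_array(t)))` with `len(t)` for a repetitive `t` gives a quick feel for how many runs the transform leaves to store.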
Breaking a Time-and-Space Barrier in Constructing Full-Text Indices
Cited by 44 (3 self)
Suffix trees and suffix arrays are the most prominent full-text indices, and their construction algorithms are well studied. It has been open for a long time whether these indices can be constructed in both o(n log n) time and o(n log n)-bit working space, where n denotes the length of the text. In the literature, the fastest algorithm runs in O(n) time, while it requires O(n log n)-bit working space. On the other hand, the most space-efficient algorithm requires O(n)-bit working space while it runs in O(n log n) time. This paper breaks the long-standing time-and-space barrier under the unit-cost word RAM. We give an algorithm for constructing the suffix array which takes O(n) time and O(n)-bit working space, for texts with constant-size alphabets. Note that both the time and the space bounds are optimal. For constructing the suffix tree, our algorithm requires O(n log^ε n) time and O(n)-bit working space for any 0 < ε < 1. Apart from that, our algorithm can also be adopted to build other existing full-text indices, such as
Structuring labeled trees for optimal succinctness, and beyond
- In FOCS
, 2005
Cited by 44 (8 self)
Consider an ordered, static tree T on t nodes where each node has a label from alphabet set Σ. Tree T may be of arbitrary degree and of arbitrary shape. Say, we wish to support basic navigational operations such as find the parent of node u, the ith child of u, and any child of u with label α. In a seminal work over fifteen years ago, Jacobson [15] observed that pointer-based tree representations are wasteful in space and introduced the notion of succinct data structures. He studied the special case of unlabeled trees and presented a succinct data structure of 2t + o(t) bits supporting navigational operations in O(1) time. The space used is asymptotically optimal with the information-theoretic lower bound averaged over all trees. This led to a slew of results on succinct data structures for arrays, trees, strings
An alphabet-friendly FM-index
- In Proc.SPIRE’04, LNCS 3246
, 2004
Cited by 40 (19 self)
We show that, by combining an existing compression boosting technique with the wavelet tree data structure, we are able to design a variant of the FM-index which scales well with the size of the input alphabet Σ. The size of the new index built on a string T[1, n] is bounded by nH_k(T) + O((n log log n) / log_|Σ| n) bits, where H_k(T) is the k-th order empirical entropy of T. The above bound holds simultaneously for all k ≤ α log_|Σ| n and 0 < α < 1. Moreover, the index design does not depend on the parameter k, which plays a role only in analysis of the space occupancy. Using our index, the counting of the occurrences of an arbitrary pattern P[1, p] as a substring of T takes O(p log |Σ|) time. Locating each pattern occurrence takes O(log |Σ| (log^2 n / log log n)) time. Reporting a text substring of length ℓ takes O((ℓ + log^2 n / log log n) log |Σ|) time.
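The counting operation described in this abstract is usually carried out by backward search over the Burrows-Wheeler transform, processing the pattern from its last symbol to its first. The sketch below shows that procedure with a naive occurrence count (a string scan) standing in for the wavelet tree, so it does not achieve the stated time or space bounds. The function names and the "\0" sentinel (assumed absent from the text) are assumptions of this example.

```python
def bwt(text):
    """Burrows-Wheeler transform via sorted rotations (toy construction)."""
    s = text + "\0"  # sentinel, assumed smaller than every text symbol
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(r[-1] for r in rotations)


def count_occurrences(bwt_str, pattern):
    """Number of occurrences of a non-empty pattern in the indexed text."""
    # C[c] = number of symbols in the text that are strictly smaller than c.
    C, total = {}, 0
    for c in sorted(set(bwt_str)):
        C[c] = total
        total += bwt_str.count(c)
    sp, ep = 0, len(bwt_str)  # half-open range of matching suffix-array rows
    for c in reversed(pattern):
        if c not in C:
            return 0
        # Naive rank: occurrences of c in a prefix of the BWT.
        sp = C[c] + bwt_str[:sp].count(c)
        ep = C[c] + bwt_str[:ep].count(c)
        if sp >= ep:
            return 0
    return ep - sp
```

A typical use is `count_occurrences(bwt("abracadabra"), "abra")`. Replacing the two prefix scans with a rank structure over the transformed string is where a wavelet tree would come in.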
When indexing equals compression: Experiments with compressing suffix arrays and applications
, 2004
Cited by 40 (5 self)
We report on a new and improved version of high-order entropy-compressed suffix arrays, which has theoretical performance guarantees similar to those in our earlier work [16], yet represents an improvement in practice. Our experiments indicate that the resulting text index offers state-of-the-art compression. In particular, we require roughly 20% of the original text size (without requiring a separate instance of the text) and support fast and powerful searches. To our knowledge, this is the best known method in terms of space for fast searching.
A simple optimal representation for balanced parentheses
- In Proc. 15th Annual Symposium on Combinatorial Pattern Matching (CPM), LNCS v. 3109
, 2004
Cited by 33 (1 self)
Institute of Mathematical Sciences, Chennai 600 113, India.
We consider succinct, or highly space-efficient, representations of a (static) string consisting of n pairs of balanced parentheses, that support natural operations such as finding the matching parenthesis for a given parenthesis, or finding the pair of parentheses that most tightly enclose a given pair. This problem was considered by Jacobson [Proc. 30th FOCS, 549–554, 1989] and Munro and Raman [SIAM J. Comput. 31 (2001), 762–776], who gave O(n)-bit and 2n + o(n)-bit representations, respectively, that supported the above operations in O(1) time on the RAM model of computation. This data structure is a fundamental tool in succinct representations, and has applications in representing suffix trees, ordinal trees, planar graphs and permutations. We consider the practical performance of parenthesis representations. First, we give a new 2n + o(n)-bit representation that supports all the above operations in O(1) time. This representation is conceptually simpler, its space bound has a smaller o(n) term and it also has a simple and uniform o(n) time and space construction algorithm. We implement our data structure and a variant of Jacobson's, and evaluate their practical performance (speed and memory usage), when used in a succinct representation of trees derived from XML documents. As a baseline, we compare our representations against a widely-used implementation of the standard DOM (Document Object Model) representation of XML documents. Both succinct representations use orders of magnitude less space than DOM and tree traversal operations are usually only slightly slower than in DOM.
Key words: Succinct data structures, parentheses representation of trees, compressed dictionaries, XML DOM.
Preprint submitted to Theoretical Computer Science, 29 November 2006.
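The operations named in this abstract (finding the matching parenthesis, and finding the tightest enclosing pair) can be stated precisely with a linear-scan sketch over a plain string of "(" and ")". This only fixes the semantics; the representations discussed in the abstract answer the same queries in O(1) time using 2n + o(n) bits. The function names `find_close`, `find_open`, and `enclose` are assumed here for illustration.

```python
def find_close(parens, i):
    """Index of the ')' matching the '(' at index i."""
    depth = 0
    for j in range(i, len(parens)):
        depth += 1 if parens[j] == "(" else -1
        if depth == 0:
            return j
    raise ValueError("unbalanced parentheses")


def find_open(parens, j):
    """Index of the '(' matching the ')' at index j."""
    depth = 0
    for i in range(j, -1, -1):
        depth += 1 if parens[i] == ")" else -1
        if depth == 0:
            return i
    raise ValueError("unbalanced parentheses")


def enclose(parens, i):
    """Index of the '(' of the tightest pair enclosing the pair opened at i.

    Returns None when the pair opened at i is not enclosed by any other pair.
    """
    depth = 0
    for k in range(i - 1, -1, -1):
        if parens[k] == "(":
            if depth == 0:
                return k
            depth -= 1  # this '(' closes off a sibling subtree scanned earlier
        else:
            depth += 1
    return None
```

When the parenthesis string encodes an ordinal tree (one pair per node), `enclose` corresponds to moving to the parent, and `find_close` delimits a node's subtree.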

