Results 1 - 10
of
27
Compressed representations of sequences and full-text indexes
- ACM Transactions on Algorithms
, 2007
"... Abstract. Given a sequence S = s1s2... sn of integers smaller than r = O(polylog(n)), we show how S can be represented using nH0(S) + o(n) bits, so that we can know any sq, as well as answer rank and select queries on S, in constant time. H0(S) is the zero-order empirical entropy of S and nH0(S) pro ..."
Abstract
-
Cited by 92 (55 self)
- Add to MetaCart
Abstract. Given a sequence S = s1s2... sn of integers smaller than r = O(polylog(n)), we show how S can be represented using nH0(S) + o(n) bits, so that we can know any sq, as well as answer rank and select queries on S, in constant time. H0(S) is the zero-order empirical entropy of S and nH0(S) provides an Information Theoretic lower bound to the bit storage of any sequence S via a fixed encoding of its symbols. This extends previous results on binary sequences, and improves previous results on general sequences where those queries are answered in O(log r) time. For larger r, we can still represent S in nH0(S) + o(n log r) bits and answer queries in O(log r / log log n) time. Another contribution of this paper is to show how to combine our compressed representation of integer sequences with an existing compression boosting technique to design compressed full-text indexes that scale well with the size of the input alphabet Σ. Namely, we design a variant of the FM-index that indexes a string T [1, n] within nHk(T) + o(n) bits of storage, where Hk(T) is the k-th order empirical entropy of T. This space bound holds simultaneously for all k ≤ α log |Σ | n, constant 0 < α < 1, and |Σ | = O(polylog(n)). This index counts the occurrences of an arbitrary pattern P [1, p] as a substring of T in O(p) time; it locates each pattern occurrence in O(log 1+ε n) time, for any constant 0 < ε < 1; and it reports a text substring of length ℓ in O(ℓ + log 1+ε n) time.
A simple optimal representation for balanced parentheses
- In Proc. 15th Annual Symposium on Combinatorial Pattern Matching (CPM), LNCS v. 3109 (2004
, 2004
"... b Institute of Mathematical Sciences, Chennai 600 113, India. We consider succinct, or highly space-efficient, representations of a (static) string consisting of n pairs of balanced parentheses, that support natural operations such as finding the matching parenthesis for a given parenthesis, or find ..."
Abstract
-
Cited by 33 (1 self)
- Add to MetaCart
b Institute of Mathematical Sciences, Chennai 600 113, India. We consider succinct, or highly space-efficient, representations of a (static) string consisting of n pairs of balanced parentheses, that support natural operations such as finding the matching parenthesis for a given parenthesis, or finding the pair of parentheses that most tightly enclose a given pair. This problem was considered by Jacobson, [Proc. 30th FOCS, 549–554, 1989] and Munro and Raman, [SIAM J. Comput. 31 (2001), 762–776], who gave O(n)-bit and 2n + o(n)-bit representations, respectively, that supported the above operations in O(1) time on the RAM model of computation. This data structure is a fundamental tool in succinct representations, and has applications in representing suffix trees, ordinal trees, planar graphs and permutations. We consider the practical performance of parenthesis representations. First, we give a new 2n + o(n)-bit representation that supports all the above operations in O(1) time. This representation is conceptually simpler, its space bound has a smaller o(n) term and it also has a simple and uniform o(n) time and space construction algorithm. We implement our data structure and a variant of Jacobson’s, and evaluate their practical performance (speed and memory usage), when used in a succinct representation of trees derived from XML documents. As a baseline, we compare our representations against a widely-used implementation of the standard DOM (Document Object Model) representation of XML documents. Both succinct representations use orders of magnitude less space than DOM and tree traversal operations are usually only slightly slower than in DOM. Key words: Succinct data structures, parentheses representation of trees, compressed dictionaries, XML DOM. Preprint submitted to Theoretical Computer Science 29 November 2006 1
Succinct indexes for strings, binary relations and multi-labeled trees
- In Procs ACM-SIAM SODA, 2007 (this volume
, 2007
"... We define and design succinct indexes for several abstract data types (ADTs). The concept is to design auxiliary data structures called succinct indexes that occupy asymptotically less space than the information-theoretic lower bound on the space required to encode the given data, and support an ext ..."
Abstract
-
Cited by 30 (6 self)
- Add to MetaCart
We define and design succinct indexes for several abstract data types (ADTs). The concept is to design auxiliary data structures called succinct indexes that occupy asymptotically less space than the information-theoretic lower bound on the space required to encode the given data, and support an extended set of operations using the basic operators defined in the ADT. As opposed to succinct encodings, The main advantage of succinct indexes is that we make assumptions only on the ADT through which the main data is accessed, rather than the way in which the data is encoded. This allows more freedom in the encoding of the main data. In this paper, we present succinct indexes for various data types, namely strings, binary relations and multi-labeled trees. Given the support for the interface of the ADTs of these data types, we can support various useful operations efficiently by constructing succinct indexes for them. When the operators in the ADTs are supported in constant time, our results are comparable to previous results, while allowing more flexibility in the encoding of the given data. Using our techniques, we design the first succinct encoding that represents a string of length n over alphabet [σ] using nHk + o(n lg σ) bits 1 that support access / rank / select operations in o((lg lg σ) 3) time. We also design the first succinct text index using nHk + o(n lg σ) bits that supports pattern matching queries in O(m lg lg σ + occ lg 1+ɛ n lg lg σ) time, for a given pattern of length m. Previous results on these two problems either have a lg σ factor instead of lg lg σ in terms of running time, or are not compressible, but our results do not have such problems. More results are reported in the paper. 1 We use lg n to denote ⌈log2 n⌉. 1
Practical rank/select queries over arbitrary sequences
- In Proc. 15th SPIRE, LNCS 5280
, 2008
"... Abstract. We present a practical study on the compact representation of sequences supporting rank, select, and access queries. While there are several theoretical solutions to the problem, only a few have been tried out, and there is little idea on how the others would perform, especially in the cas ..."
Abstract
-
Cited by 25 (20 self)
- Add to MetaCart
Abstract. We present a practical study on the compact representation of sequences supporting rank, select, and access queries. While there are several theoretical solutions to the problem, only a few have been tried out, and there is little idea on how the others would perform, especially in the case of sequences with very large alphabets. We first present a new practical implementation of the compressed representation for bit sequences proposed by Raman, Raman, and Rao [SODA 2002], that is competitive with the existing ones when the sequences are not too compressible. It also has nice local compression properties, and we show that this makes it an excellent tool for compressed text indexing in combination with the Burrows-Wheeler transform. This shows the practicality of a recent theoretical proposal [Mäkinen and Navarro, SPIRE 2007], achieving spaces never seen before. Second, for general sequences, we tune wavelet trees for the case of very large alphabets, by removing their pointer information. We show that this gives an excellent solution for representing a sequence within zero-order entropy space, in cases where the large alphabet poses a serious challenge to typical encoding methods. We also present the first implementation of Golynski et al.’s representation [SODA 2006], which offers another interesting time/space trade-off. 1
Statistical encoding of succinct data structures
- In Proc. 17th CPM, LNCS 4009
, 2006
"... Abstract. In recent work, Sadakane and Grossi [SODA 2006] introduced a scheme to represent any (k log σ + log log n)) bits sequence S = s1s2... sn, over an alphabet of size σ, using nHk(S) + O ( n log σ n of space, where Hk(S) is the k-th order empirical entropy of S. The representation permits extr ..."
Abstract
-
Cited by 22 (10 self)
- Add to MetaCart
Abstract. In recent work, Sadakane and Grossi [SODA 2006] introduced a scheme to represent any (k log σ + log log n)) bits sequence S = s1s2... sn, over an alphabet of size σ, using nHk(S) + O ( n log σ n of space, where Hk(S) is the k-th order empirical entropy of S. The representation permits extracting any substring of size Θ(log σ n) in constant time, and thus it completely replaces S under the RAM model. This is extremely important because it permits converting any succinct data structure requiring o(|S|) = o(n log σ) bits in addition to S, into another requiring nHk(S) + o(n log σ) (overall) for any k = o(log σ n). They achieve this result by using Ziv-Lempel compression, and conjecture that the result can in particular be useful to implement compressed full-text indexes. In this paper we extend their result, by obtaining the same space and time complexities using a simpler scheme based on statistical encoding. We show that the scheme supports appending symbols in constant amortized time. In addition, we prove some results on the applicability of the scheme for full-text selfindexing. 1
Reducing the space requirement of LZ-index
- in CPM, 2006
"... Abstract. The LZ-index is a compressed full-text self-index able to represent a text T1...u, over an alphabet of size σ and with k-th order empirical entropy Hk(T), using 4uHk(T)+o(ulog σ) bits for any k = o(log σ u). It can report all the occ occurrences of a pattern P1...m in T in O(m 3 logσ +(m+o ..."
Abstract
-
Cited by 22 (17 self)
- Add to MetaCart
Abstract. The LZ-index is a compressed full-text self-index able to represent a text T1...u, over an alphabet of size σ and with k-th order empirical entropy Hk(T), using 4uHk(T)+o(ulog σ) bits for any k = o(log σ u). It can report all the occ occurrences of a pattern P1...m in T in O(m 3 logσ +(m+occ)log u) worst case time. This is the only existing data structure of size O(uHk(T)) able of spending O(logu) time per occurrence reported. Its main drawback is the factor 4 in its space complexity, which makes it larger than other stateof-the-art alternatives. In this paper we present two different approaches to reduce the space requirement of LZ-index. In both cases we achieve (2 + ε)uHk(T) + o(ulog σ) bits of space, for any constant ε> 0, and we simultaneously improve the search time to O(m 2 logm+(m+occ)logu). Both indexes support displaying any subtext of length ℓ in optimal O(ℓ/log σ u) time. In addition, we show how the space can be squeezed to (1+ε)uHk(T)+o(ulog σ) to obtain a structure with O(m 2) average search time for m � 2log σ u. 1 Introduction and Previous Work Text searching is a classical problem in Computer Science. Given a sequence of symbols T1...u (the text) over an alphabet Σ of size σ, and given another (short) sequence P1...m (the search pattern) over Σ, the full-text search problem consists in finding all the occ occurrences of P in T. Applications of full-text searching include text databases in general, which typically contain natural
A compressed self-index using a Ziv-Lempel dictionary
- In: SPIRE. Volume 4209 of LNCS. (2006) 163–180
"... Abstract. A compressed full-text self-index for a text T, of size u, is a data structure used to search patterns P, of size m, in T that requires reduced space, i.e. that depends on the empirical entropy (Hk, H0) of T, and is, furthermore, able to reproduce any substring of T. In this paper we prese ..."
Abstract
-
Cited by 17 (5 self)
- Add to MetaCart
Abstract. A compressed full-text self-index for a text T, of size u, is a data structure used to search patterns P, of size m, in T that requires reduced space, i.e. that depends on the empirical entropy (Hk, H0) of T, and is, furthermore, able to reproduce any substring of T. In this paper we present a new compressed self-index able to locate the occurrences of P in O((m + occ) log n) time, where occ is the number of occurrences and σ the size of the alphabet of T. The fundamental improvement over previous LZ78 based indexes is the reduction of the search time dependency on m from O(m 2) to O(m). To achieve this result we point out the main obstacle to linear time algorithms based on LZ78 data compression and expose and explore the nature of a recurrent structure in LZ-indexes, the T78 suffix tree. We show that our method is very competitive in practice by comparing it against the LZ-Index, the FM-index and a compressed suffix array. 1
Succinct Representation of Sequences
, 2004
"... Abstract. Given a sequence S = s1s2... sn such that 1 ≤ sq ≤ r for all q, where r = O(polylog(n)), we show how S can be represented using nH0(S)+o(n) bits (where H0(S) is the zero-order entropy of S), so that we can know any sq, as well as answer rank and select queries on S, in constant time. This ..."
Abstract
-
Cited by 12 (5 self)
- Add to MetaCart
Abstract. Given a sequence S = s1s2... sn such that 1 ≤ sq ≤ r for all q, where r = O(polylog(n)), we show how S can be represented using nH0(S)+o(n) bits (where H0(S) is the zero-order entropy of S), so that we can know any sq, as well as answer rank and select queries on S, in constant time. This extends previous results on binary sequences, and improves previous results on general sequences where those queries are answered in O(log r) time. Furthermore, we show how our technique can be applied to improve a succinct full-text index. 1
Compressed representations of permutations, and applications
- SYMPOSIUM ON THEORETICAL ASPECTS OF COMPUTER SCIENCE
"... We explore various techniques to compress a permutation π over n integers, taking advantage of ordered subsequences in π, while supporting its application π(i) and the application of its inverse π −1 (i) in small time. Our compression schemes yield several interesting byproducts, in many cases mat ..."
Abstract
-
Cited by 12 (8 self)
- Add to MetaCart
We explore various techniques to compress a permutation π over n integers, taking advantage of ordered subsequences in π, while supporting its application π(i) and the application of its inverse π −1 (i) in small time. Our compression schemes yield several interesting byproducts, in many cases matching, improving or extending the best existing results on applications such as the encoding of a permutation in order to support iterated applications π k (i) of it, of integer functions, and of inverted lists and suffix arrays.
Self-indexed text compression using straight-line programs
- In Proc. 34th MFCS
, 2009
"... Abstract. Straight-line programs (SLPs) offer powerful text compression by representing a text T [1, u] in terms of a restricted context-free grammar of n rules, so that T can be recovered in O(u) time. However, the problem of operating the grammar in compressed form has not been studied much. We pr ..."
Abstract
-
Cited by 6 (4 self)
- Add to MetaCart
Abstract. Straight-line programs (SLPs) offer powerful text compression by representing a text T [1, u] in terms of a restricted context-free grammar of n rules, so that T can be recovered in O(u) time. However, the problem of operating the grammar in compressed form has not been studied much. We present a grammar representation whose size is of the same order of that of a plain SLP representation, and can answer other queries apart from expanding nonterminals. This can be of independent interest. We then extend it to achieve the first grammar representation able of extracting text substrings, and of searching the text for patterns, in time o(n). We also give byproducts on representing binary relations. 1 Introduction and Related Work Grammar-based compression is a well-known technique since at least the seventies, and still a very active area of research. From the different variants of the idea, we focus on the case where a given text T [1, u] is replaced by a context-free grammar (CFG) G that generates just the string T. Then one can store G instead

