Results 1 - 10
of
21
Compressed full-text indexes
- ACM COMPUTING SURVEYS
, 2007
"... Full-text indexes provide fast substring search over large text collections. A serious problem of these indexes has traditionally been their space consumption. A recent trend is to develop indexes that exploit the compressibility of the text, so that their size is a function of the compressed text l ..."
Abstract
-
Cited by 142 (70 self)
- Add to MetaCart
Full-text indexes provide fast substring search over large text collections. A serious problem of these indexes has traditionally been their space consumption. A recent trend is to develop indexes that exploit the compressibility of the text, so that their size is a function of the compressed text length. This concept has evolved into self-indexes, which in addition contain enough information to reproduce any text portion, so they replace the text. The exciting possibility of an index that takes space close to that of the compressed text, replaces it, and in addition provides fast search over it, has triggered a wealth of activity and produced surprising results in a very short time, and radically changed the status of this area in less than five years. The most successful indexes nowadays are able to obtain almost optimal space and search time simultaneously. In this paper we present the main concepts underlying self-indexes. We explain the relationship between text entropy and regularities that show up in index structures and permit compressing them. Then we cover the most relevant self-indexes up to date, focusing on the essential aspects on how they exploit the text compressibility and how they solve efficiently various search problems. We aim at giving the theoretical background to understand and follow the developments in this area.
Compressed representations of sequences and full-text indexes
- ACM Transactions on Algorithms
, 2007
"... Abstract. Given a sequence S = s1s2... sn of integers smaller than r = O(polylog(n)), we show how S can be represented using nH0(S) + o(n) bits, so that we can know any sq, as well as answer rank and select queries on S, in constant time. H0(S) is the zero-order empirical entropy of S and nH0(S) pro ..."
Abstract
-
Cited by 92 (55 self)
- Add to MetaCart
Abstract. Given a sequence S = s1s2... sn of integers smaller than r = O(polylog(n)), we show how S can be represented using nH0(S) + o(n) bits, so that we can know any sq, as well as answer rank and select queries on S, in constant time. H0(S) is the zero-order empirical entropy of S and nH0(S) provides an Information Theoretic lower bound to the bit storage of any sequence S via a fixed encoding of its symbols. This extends previous results on binary sequences, and improves previous results on general sequences where those queries are answered in O(log r) time. For larger r, we can still represent S in nH0(S) + o(n log r) bits and answer queries in O(log r / log log n) time. Another contribution of this paper is to show how to combine our compressed representation of integer sequences with an existing compression boosting technique to design compressed full-text indexes that scale well with the size of the input alphabet Σ. Namely, we design a variant of the FM-index that indexes a string T [1, n] within nHk(T) + o(n) bits of storage, where Hk(T) is the k-th order empirical entropy of T. This space bound holds simultaneously for all k ≤ α log |Σ | n, constant 0 < α < 1, and |Σ | = O(polylog(n)). This index counts the occurrences of an arbitrary pattern P [1, p] as a substring of T in O(p) time; it locates each pattern occurrence in O(log 1+ε n) time, for any constant 0 < ε < 1; and it reports a text substring of length ℓ in O(ℓ + log 1+ε n) time.
Structuring labeled trees for optimal succinctness, and beyond
- In FOCS
, 2005
"... Consider an ordered, static tree T on t nodes where each node has a label from alphabet set Σ. TreeTmaybeofar bitrary degree and of arbitrary shape. Say, we wish to support basic navigational operations such as find the parent of node u,theith child of u, and any child of u with label α. In a semina ..."
Abstract
-
Cited by 44 (8 self)
- Add to MetaCart
Consider an ordered, static tree T on t nodes where each node has a label from alphabet set Σ. TreeTmaybeofar bitrary degree and of arbitrary shape. Say, we wish to support basic navigational operations such as find the parent of node u,theith child of u, and any child of u with label α. In a seminal work over fifteen years ago, Jacobson [15] observed that pointer-based tree representations are wasteful in space and introduced the notion of succinct data structures. He studied the special case of unlabeled trees and presented a succinct data structure of 2t+o(t) bits supporting navigational operations in O(1) time. The space used is asymptotically optimal with the information-theoretic lower bound averaged over all trees. This led to a slew of results on succinct data structures for arrays, trees, strings
An alphabet-friendly FM-index
- In Proc.SPIRE’04, LNCS 3246
, 2004
"... Abstract. We show that, by combining an existing compression boosting technique with the wavelet tree data structure, we are able to design a variant of the FM-index which scales well with the size of the input alphabet Σ. The size of the new index built on a string T [1, n] is bounded by nHk(T)+O � ..."
Abstract
-
Cited by 40 (19 self)
- Add to MetaCart
Abstract. We show that, by combining an existing compression boosting technique with the wavelet tree data structure, we are able to design a variant of the FM-index which scales well with the size of the input alphabet Σ. The size of the new index built on a string T [1, n] is bounded by nHk(T)+O � (n log log n) / log |Σ | n � bits, where Hk(T) is the k-th order empirical entropy of T. The above bound holds simultaneously for all k ≤ α log |Σ | n and 0 < α < 1. Moreover, the index design does not depend on the parameter k, which plays a role only in analysis of the space occupancy. Using our index, the counting of the occurrences of an arbitrary pattern P [1, p] as a substring of T takes O(p log |Σ|) time. Locating each pattern occurrence takes O(log |Σ | (log 2 n / log log n)) time. Reporting a text substring of length ℓ takes O((ℓ + log 2 n / log log n) log |Σ|) time. 1
Rank and select revisited and extended
- Workshop on SpaceConscious Algorithms, University of
, 2006
"... The deep connection between the Burrows-Wheeler transform (BWT) and the socalled rank and select data structures for symbol sequences is the basis of most successful approaches to compressed text indexing. Rank of a symbol at a given position equals the number of times the symbol appears in the corr ..."
Abstract
-
Cited by 29 (17 self)
- Add to MetaCart
The deep connection between the Burrows-Wheeler transform (BWT) and the socalled rank and select data structures for symbol sequences is the basis of most successful approaches to compressed text indexing. Rank of a symbol at a given position equals the number of times the symbol appears in the corresponding prefix of the sequence. Select is the inverse, retrieving the positions of the symbol occurrences. It has been shown that improvements to rank/select algorithms, in combination with the BWT, turn into improved compressed text indexes. This paper is devoted to alternative implementations and extensions of rank and select data structures. First, we show that one can use gap encoding techniques to obtain constant time rank and select queries in essentially the same space as what is achieved by the best current direct solution (and sometimes less). Second, we extend symbol rank and select to substring rank and select, giving several space/time tradeoffs for the problem. An application of these queries is in position-restricted substring searching, where one can specify the range in the text where the search is restricted to, and only occurrences residing in that range are to be reported. In addition, arbitrary occurrences are reported in text position order. Several byproducts of our results display connections with searchable partial sums, Chazelle’s two-dimensional data structures, and Grossi et al.’s wavelet trees.
A simple storage scheme for strings achieving entropy bounds
, 2007
"... We propose a storage scheme for a string S[1, n], drawn from an alphabet Σ, that requires space close to the k-th order empirical entropy of S, and allows to retrieve any ℓ-long substring of S in optimal O(1 + ..."
Abstract
-
Cited by 29 (4 self)
- Add to MetaCart
We propose a storage scheme for a string S[1, n], drawn from an alphabet Σ, that requires space close to the k-th order empirical entropy of S, and allows to retrieve any ℓ-long substring of S in optimal O(1 +
The myriad virtues of wavelet trees
- In Proc. of International Colloquium on Automata and Languages (ICALP
"... Abstract. Wavelet Trees have been introduced in [Grossi, Gupta and Vitter, SODA ’03] and have been rapidly recognized as a very flexible tool for the design of compressed full-text indexes and data compressors. Although several papers have investigated the beauty and usefulness of this data structur ..."
Abstract
-
Cited by 17 (6 self)
- Add to MetaCart
Abstract. Wavelet Trees have been introduced in [Grossi, Gupta and Vitter, SODA ’03] and have been rapidly recognized as a very flexible tool for the design of compressed full-text indexes and data compressors. Although several papers have investigated the beauty and usefulness of this data structure in the full-text indexing scenario, its impact on data compression has not been fully explored. In this paper we provide a complete theoretical analysis of a wide class of compression algorithms based on Wavelet Trees. We also show how to improve their asymptotic performance by introducing a novel framework, called Generalized Wavelet Trees, that aims for the best combination of binary compressors (like, Run-Length encoders) versus non-binary compressors (like, Huffman and Arithmetic encoders) and Wavelet Trees of properly-designed shapes. As a corollary, we prove high-order entropy bounds for the challenging combination of Burrows-Wheeler Transform and Wavelet Trees. 1
Compressed permuterm index
- In Proceedings 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval
"... The Permuterm index (Garfield, 1976) is a time-efficient and elegant solution to the string dictionary problem in which pattern queries may possibly include one wild-card symbol (called, Tolerant Retrieval problem). Unfortunately the Permuterm index is space inefficient because it quadruples the dic ..."
Abstract
-
Cited by 9 (7 self)
- Add to MetaCart
The Permuterm index (Garfield, 1976) is a time-efficient and elegant solution to the string dictionary problem in which pattern queries may possibly include one wild-card symbol (called, Tolerant Retrieval problem). Unfortunately the Permuterm index is space inefficient because it quadruples the dictionary size. In this paper we propose the Compressed Permuterm Index which solves the Tolerant Retrieval problem in time proportional to the length of the searched pattern, and space close to the k-th order empirical entropy of the indexed dictionary. We also design a dynamic version of this index which allows to efficiently manage insertion in, and deletion from, the dictionary of individual strings. The result is based on a simple variant of the Burrows-Wheeler Transform defined on a dictionary of strings of variable length, that allows to efficiently solve the Tolerant Retrieval problem via known (dynamic) compressed indexes [17]. We will complement our theoretical study with a rich set of experiments which show that the Compressed Permuterm Index supports fast queries within a space occupancy that is close to the one achievable by compressing the string dictionary via gzip or bzip2. This improves known approaches based on Front-Coding [19] by more than 50 % in absolute space occupancy, still guaranteeing comparable query time.
The engineering of a compression boosting library: Theory vs practice in BWT compression
- In Proc. 14th European Symposium on Algorithms (ESA ’06
, 2006
"... Abstract. Data Compression is one of the most challenging arenas both for algorithm design and engineering. This is particularly true for Burrows and Wheeler Compression a technique that is important in itself and for the design of compressed indexes. There has been considerable debate on how to des ..."
Abstract
-
Cited by 8 (6 self)
- Add to MetaCart
Abstract. Data Compression is one of the most challenging arenas both for algorithm design and engineering. This is particularly true for Burrows and Wheeler Compression a technique that is important in itself and for the design of compressed indexes. There has been considerable debate on how to design and engineer compression algorithms based on the BWT paradigm. In particular, Move-to-Front Encoding is generally believed to be an “inefficient ” part of the Burrows-Wheeler compression process. However, only recently two theoretically superior alternatives to Move-to-Front have been proposed, namely Compression Boosting and Wavelet Trees. The main contribution of this paper is to provide the first experimental comparison of these three techniques, giving a much needed methodological contribution to the current debate. We do so by providing a carefully engineered compression boosting library that can be used, on the one hand, to investigate the myriad new compression algorithms that can be based on boosting, and on the other hand, to make the first experimental assessment of how Move-to-Front behaves with respect to its recently proposed competitors. The main conclusion is that Boosting, Wavelet Trees and Move-to-Front yield quite close compression performance. Finally, our extensive experimental study of boosting technique brings to light a new fact overlooked in 10 years of experiments in the area: a fast adapting order-zero compressor is enough to provide state of the art BWT compression by simply compressing the run length encoded transform. In other words, Move-to-Front, Wavelet Trees, and Boosters can all be by-passed by a fast learner.
A simpler analysis of Burrows-Wheeler based compression
- In Proc. of the 17th Symposium on Combinatorial Pattern Matching (CPM ’06). Springer-Verlag LNCS
, 2006
"... In this paper we present a new technique for worst-case analysis of compression algorithms which are based on the Burrows-Wheeler Transform. We deal mainly with the algorithm proposed by Burrows and Wheeler in their first paper on the subject [6], called bw0. This algorithm consists of the following ..."
Abstract
-
Cited by 6 (0 self)
- Add to MetaCart
In this paper we present a new technique for worst-case analysis of compression algorithms which are based on the Burrows-Wheeler Transform. We deal mainly with the algorithm proposed by Burrows and Wheeler in their first paper on the subject [6], called bw0. This algorithm consists of the following three essential steps: 1) Obtain the Burrows-Wheeler Transform of the text, 2) Convert the transform into a sequence of integers using the move-to-front algorithm, 3) Encode the integers using Arithmetic code or any order-0 encoding (possibly with run-length encoding). We achieve a strong upper bound on the worst-case compression ratio of this algorithm. This bound is significantly better than bounds known to date and is obtained via simple analytical techniques. Specifically, we show that for any input string s, and µ> 1, the length of the compressed string is bounded by µ · |s|Hk(s)+ log(ζ(µ)) · |s | + µgk + O(log n) where Hk is the k-th order empirical entropy, gk is a constant depending only on k and on the size of the alphabet, and ζ(µ) = 1 1 1 µ+ 2 µ+... is the standard zeta function. As part of the analysis we prove a result on the compressibility of integer sequences, which is of independent interest. Finally, we apply our techniques to prove a worst-case bound on the compression ratio of a compression algorithm based on the Burrows-Wheeler Transform followed by distance coding, for which worst-case guarantees have never been given. We prove that the length of the compressed string is bounded by 1.7286 · |s|Hk(s) + gk + O(log n). This bound is better than the bound we give for bw0.

