Results 1-10 of 28
Compressed full-text indexes
ACM Computing Surveys, 2007
Cited by 173 (78 self)
Full-text indexes provide fast substring search over large text collections. A serious problem of these indexes has traditionally been their space consumption. A recent trend is to develop indexes that exploit the compressibility of the text, so that their size is a function of the compressed text length. This concept has evolved into self-indexes, which in addition contain enough information to reproduce any text portion, so they replace the text. The exciting possibility of an index that takes space close to that of the compressed text, replaces it, and in addition provides fast search over it, has triggered a wealth of activity and produced surprising results in a very short time, and radically changed the status of this area in less than five years. The most successful indexes nowadays are able to obtain almost optimal space and search time simultaneously. In this paper we present the main concepts underlying self-indexes. We explain the relationship between text entropy and regularities that show up in index structures and permit compressing them. Then we cover the most relevant self-indexes to date, focusing on the essential aspects of how they exploit the text compressibility and how they efficiently solve various search problems. We aim at giving the theoretical background to understand and follow the developments in this area.
Compressed representations of sequences and full-text indexes
ACM Transactions on Algorithms, 2007
Cited by 110 (62 self)
Abstract. Given a sequence $S = s_1 s_2 \ldots s_n$ of integers smaller than $r = O(\mathrm{polylog}(n))$, we show how $S$ can be represented using $nH_0(S) + o(n)$ bits, so that we can know any $s_q$, as well as answer rank and select queries on $S$, in constant time. $H_0(S)$ is the zero-order empirical entropy of $S$ and $nH_0(S)$ provides an information-theoretic lower bound to the bit storage of any sequence $S$ via a fixed encoding of its symbols. This extends previous results on binary sequences, and improves previous results on general sequences where those queries are answered in $O(\log r)$ time. For larger $r$, we can still represent $S$ in $nH_0(S) + o(n \log r)$ bits and answer queries in $O(\log r / \log\log n)$ time. Another contribution of this paper is to show how to combine our compressed representation of integer sequences with an existing compression boosting technique to design compressed full-text indexes that scale well with the size of the input alphabet $\Sigma$. Namely, we design a variant of the FM-index that indexes a string $T[1, n]$ within $nH_k(T) + o(n)$ bits of storage, where $H_k(T)$ is the $k$-th order empirical entropy of $T$. This space bound holds simultaneously for all $k \le \alpha \log_{|\Sigma|} n$, constant $0 < \alpha < 1$, and $|\Sigma| = O(\mathrm{polylog}(n))$. This index counts the occurrences of an arbitrary pattern $P[1, p]$ as a substring of $T$ in $O(p)$ time; it locates each pattern occurrence in $O(\log^{1+\varepsilon} n)$ time, for any constant $0 < \varepsilon < 1$; and it reports a text substring of length $\ell$ in $O(\ell + \log^{1+\varepsilon} n)$ time.
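To make the space bound in the abstract above concrete, the following is a minimal Python sketch (not taken from the paper) that computes the zero-order empirical entropy of a sequence and the resulting n·H0(S) bit count. The function names `empirical_entropy_h0` and `h0_bits` are hypothetical, and the sketch assumes the standard definition H0(S) = Σ_c (n_c/n) log2(n/n_c), where n_c is the number of occurrences of symbol c.

```python
from collections import Counter
from math import log2


def empirical_entropy_h0(seq):
    """Zero-order empirical entropy H0(S), in bits per symbol.

    Assumes the standard definition:
    H0(S) = sum over distinct symbols c of (n_c / n) * log2(n / n_c).
    """
    n = len(seq)
    if n == 0:
        return 0.0
    counts = Counter(seq)
    return sum((nc / n) * log2(n / nc) for nc in counts.values())


def h0_bits(seq):
    """n * H0(S): the zero-order entropy bound, in bits, for the whole sequence."""
    return len(seq) * empirical_entropy_h0(seq)
```

A sequence with a single distinct symbol has entropy zero, while a sequence that uses two symbols equally often needs one bit per symbol under this measure.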
Structuring labeled trees for optimal succinctness, and beyond
In FOCS, 2005
Cited by 53 (8 self)
Consider an ordered, static tree T on t nodes where each node has a label from alphabet set Σ. Tree T may be of arbitrary degree and of arbitrary shape. Say, we wish to support basic navigational operations such as find the parent of node u, the i-th child of u, and any child of u with label α. In a seminal work over fifteen years ago, Jacobson [15] observed that pointer-based tree representations are wasteful in space and introduced the notion of succinct data structures. He studied the special case of unlabeled trees and presented a succinct data structure of 2t + o(t) bits supporting navigational operations in O(1) time. The space used is asymptotically optimal with the information-theoretic lower bound averaged over all trees. This led to a slew of results on succinct data structures for arrays, trees, strings
An alphabet-friendly FM-index
In Proc. SPIRE’04, LNCS 3246, 2004
Cited by 43 (19 self)
Abstract. We show that, by combining an existing compression boosting technique with the wavelet tree data structure, we are able to design a variant of the FM-index which scales well with the size of the input alphabet $\Sigma$. The size of the new index built on a string $T[1, n]$ is bounded by $nH_k(T) + O\left((n \log\log n) / \log_{|\Sigma|} n\right)$ bits, where $H_k(T)$ is the $k$-th order empirical entropy of $T$. The above bound holds simultaneously for all $k \le \alpha \log_{|\Sigma|} n$ and $0 < \alpha < 1$. Moreover, the index design does not depend on the parameter $k$, which plays a role only in the analysis of the space occupancy. Using our index, the counting of the occurrences of an arbitrary pattern $P[1, p]$ as a substring of $T$ takes $O(p \log |\Sigma|)$ time. Locating each pattern occurrence takes $O(\log |\Sigma| \, (\log^2 n / \log\log n))$ time. Reporting a text substring of length $\ell$ takes $O((\ell + \log^2 n / \log\log n) \log |\Sigma|)$ time.
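The counting operation mentioned in the abstract above is usually realised by backward search over the Burrows-Wheeler transform. The following Python sketch illustrates the general idea only: it is not the paper's index, it builds the transform by sorting rotations and answers rank by scanning, so it has none of the stated space or time bounds. The names `bwt` and `count_occurrences` and the `"\0"` sentinel are assumptions made for this example; the text and pattern are assumed not to contain the sentinel, and the pattern is assumed non-empty.

```python
def bwt(text, sentinel="\0"):
    """Burrows-Wheeler transform via sorted rotations (illustration only).

    Assumes `sentinel` does not occur in `text` and sorts before every
    other symbol.
    """
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)


def count_occurrences(text, pattern):
    """Count occurrences of a non-empty `pattern` in `text` by backward search."""
    last = bwt(text)

    # first_row[c] = number of symbols in the transform strictly smaller than c.
    first_row = {}
    total = 0
    for c in sorted(set(last)):
        first_row[c] = total
        total += last.count(c)

    def rank(c, i):
        # Naive rank: occurrences of c in last[:i]. A real index would use a
        # compressed rank structure (for example a wavelet tree) here.
        return last[:i].count(c)

    lo, hi = 0, len(last)  # current half-open range of matching suffixes
    for c in reversed(pattern):
        if c not in first_row:
            return 0
        lo = first_row[c] + rank(c, lo)
        hi = first_row[c] + rank(c, hi)
        if lo >= hi:
            return 0
    return hi - lo
```

Each pattern symbol costs two rank queries, which is why the counting time of such an index is the pattern length multiplied by the cost of one rank operation.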
Succinct indexes for strings, binary relations, and multi-labeled trees
In: Proc. SODA, 2007
Cited by 42 (11 self)
We define and design succinct indexes for several abstract data types (ADTs). The concept is to design auxiliary data structures that ideally occupy asymptotically less space than the information-theoretic lower bound on the space required to encode the given data, and support an extended set of operations using the basic operators defined in the ADT. The main advantage of succinct indexes as opposed to succinct (integrated data/index) encodings is that we make assumptions only on the ADT through which the main data is accessed, rather than the way in which the data is encoded. This allows more freedom in the encoding of the main data. In this paper, we present succinct indexes for various data types, namely strings, binary relations and multi-labeled trees. Given the support for the interface of the ADTs of these data types, we can support various useful operations efficiently by constructing succinct indexes for them. When the operators in the ADTs are supported in constant time, our results are comparable to previous results, while allowing more flexibility in the encoding of the given data. Using our techniques, we design a succinct encoding that represents a string of length $n$ over an alphabet of size $\sigma$ using $nH_k(S) + \lg\sigma \cdot o(n) + O\left(\frac{n \lg\sigma}{\lg\lg\lg\sigma}\right)$ bits to support access/rank/select operations in $o((\lg\lg\sigma)^{1+\epsilon})$ time, for any fixed constant $\epsilon > 0$. We also design a succinct text index using $nH_0(S) + O\left(\frac{n \lg\sigma}{\lg\lg\sigma}\right)$ bits that
Rank and select revisited and extended
Workshop on Space-Conscious Algorithms, University of, 2006
Cited by 33 (17 self)
The deep connection between the Burrows-Wheeler transform (BWT) and the so-called rank and select data structures for symbol sequences is the basis of most successful approaches to compressed text indexing. Rank of a symbol at a given position equals the number of times the symbol appears in the corresponding prefix of the sequence. Select is the inverse, retrieving the positions of the symbol occurrences. It has been shown that improvements to rank/select algorithms, in combination with the BWT, turn into improved compressed text indexes. This paper is devoted to alternative implementations and extensions of rank and select data structures. First, we show that one can use gap encoding techniques to obtain constant-time rank and select queries in essentially the same space as what is achieved by the best current direct solution (and sometimes less). Second, we extend symbol rank and select to substring rank and select, giving several space/time tradeoffs for the problem. An application of these queries is in position-restricted substring searching, where one can specify the range in the text where the search is restricted to, and only occurrences residing in that range are to be reported. In addition, arbitrary occurrences are reported in text position order. Several byproducts of our results display connections with searchable partial sums, Chazelle’s two-dimensional data structures, and Grossi et al.’s wavelet trees.
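The rank and select definitions given in the abstract above can be written down directly. The following Python sketch is a linear-scan reference implementation only, not the gap-encoded constant-time structures the paper describes. The function names and the indexing conventions (0-based positions, prefixes of length i, occurrence numbers starting at 1) are assumptions chosen for this example.

```python
def rank(seq, symbol, i):
    """Number of occurrences of `symbol` in the prefix seq[0:i]."""
    return sum(1 for x in seq[:i] if x == symbol)


def select(seq, symbol, j):
    """0-based position of the j-th occurrence (j >= 1) of `symbol`.

    Returns -1 if `symbol` occurs fewer than j times.
    """
    seen = 0
    for pos, x in enumerate(seq):
        if x == symbol:
            seen += 1
            if seen == j:
                return pos
    return -1
```

Under these conventions the two operations are inverse in the sense that `rank(seq, c, select(seq, c, j) + 1) == j` whenever the j-th occurrence of `c` exists.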
A simple storage scheme for strings achieving entropy bounds
2007
Cited by 33 (5 self)
We propose a storage scheme for a string S[1, n], drawn from an alphabet Σ, that requires space close to the k-th order empirical entropy of S, and allows retrieval of any ℓ-long substring of S in optimal O(1 +
The myriad virtues of wavelet trees
In Proc. of International Colloquium on Automata and Languages (ICALP)
Cited by 19 (6 self)
Abstract. Wavelet Trees have been introduced in [Grossi, Gupta and Vitter, SODA ’03] and have been rapidly recognized as a very flexible tool for the design of compressed full-text indexes and data compressors. Although several papers have investigated the beauty and usefulness of this data structure in the full-text indexing scenario, its impact on data compression has not been fully explored. In this paper we provide a complete theoretical analysis of a wide class of compression algorithms based on Wavelet Trees. We also show how to improve their asymptotic performance by introducing a novel framework, called Generalized Wavelet Trees, that aims for the best combination of binary compressors (like Run-Length encoders) versus non-binary compressors (like Huffman and Arithmetic encoders) and Wavelet Trees of properly designed shapes. As a corollary, we prove high-order entropy bounds for the challenging combination of Burrows-Wheeler Transform and Wavelet Trees.
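For readers unfamiliar with the data structure named in the abstract above, the following Python sketch shows a plain balanced wavelet tree that answers symbol rank queries. It is an illustrative assumption, not the Generalized Wavelet Trees of the paper: bit vectors are stored uncompressed with explicit prefix counts, and the class name `WaveletTree` and its layout are hypothetical.

```python
class WaveletTree:
    """Balanced wavelet tree over a sequence, supporting rank(c, i)."""

    def __init__(self, seq, alphabet=None):
        if alphabet is None:
            alphabet = sorted(set(seq))
        self.alphabet = alphabet
        self.left = None
        self.right = None
        if len(alphabet) > 1:
            mid = len(alphabet) // 2
            left_symbols = set(alphabet[:mid])
            # Bit 0 routes a symbol to the left child, bit 1 to the right child.
            bits = [0 if c in left_symbols else 1 for c in seq]
            # prefix[i] = number of 1-bits among the first i bits.
            self.prefix = [0]
            for b in bits:
                self.prefix.append(self.prefix[-1] + b)
            self.left = WaveletTree(
                [c for c in seq if c in left_symbols], alphabet[:mid])
            self.right = WaveletTree(
                [c for c in seq if c not in left_symbols], alphabet[mid:])

    def rank(self, c, i):
        """Occurrences of symbol c among the first i symbols (0 <= i <= n)."""
        node = self
        while len(node.alphabet) > 1:
            mid = len(node.alphabet) // 2
            ones = node.prefix[i]
            if c in node.alphabet[:mid]:
                i -= ones          # zeros seen so far map into the left child
                node = node.left
            else:
                i = ones           # ones seen so far map into the right child
                node = node.right
        # At a leaf every remaining position holds the leaf's single symbol.
        return i if node.alphabet and node.alphabet[0] == c else 0
```

A rank query touches one node per level, so with a balanced shape it performs a number of binary rank steps logarithmic in the alphabet size.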
Alphabet-independent compressed text indexing
In ESA, 2011
Cited by 14 (10 self)
Abstract. Self-indexes can represent a text in asymptotically optimal space under the k-th order entropy model, give access to text substrings, and support indexed pattern searches. Their time complexities are not optimal, however: they always depend on the alphabet size. In this paper we achieve, for the first time, full alphabet-independence in the time complexities of self-indexes, while retaining space optimality. We also obtain some relevant byproducts on compressed suffix trees.
The engineering of a compression boosting library: Theory vs practice in BWT compression
In Proc. 14th European Symposium on Algorithms (ESA ’06), 2006
Cited by 11 (6 self)
Abstract. Data Compression is one of the most challenging arenas both for algorithm design and engineering. This is particularly true for Burrows and Wheeler Compression, a technique that is important in itself and for the design of compressed indexes. There has been considerable debate on how to design and engineer compression algorithms based on the BWT paradigm. In particular, Move-to-Front Encoding is generally believed to be an “inefficient” part of the Burrows-Wheeler compression process. However, only recently have two theoretically superior alternatives to Move-to-Front been proposed, namely Compression Boosting and Wavelet Trees. The main contribution of this paper is to provide the first experimental comparison of these three techniques, giving a much needed methodological contribution to the current debate. We do so by providing a carefully engineered compression boosting library that can be used, on the one hand, to investigate the myriad new compression algorithms that can be based on boosting, and on the other hand, to make the first experimental assessment of how Move-to-Front behaves with respect to its recently proposed competitors. The main conclusion is that Boosting, Wavelet Trees and Move-to-Front yield quite close compression performance. Finally, our extensive experimental study of the boosting technique brings to light a new fact overlooked in 10 years of experiments in the area: a fast adapting order-zero compressor is enough to provide state-of-the-art BWT compression by simply compressing the run-length encoded transform. In other words, Move-to-Front, Wavelet Trees, and Boosters can all be bypassed by a fast learner.
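Two of the building blocks named in the abstract above, Move-to-Front encoding and run-length encoding, are simple enough to sketch. The Python code below is a textbook-style illustration under stated assumptions, not code from the library the paper describes: the function names are hypothetical, the initial Move-to-Front table is passed in explicitly as `alphabet`, and every input symbol is assumed to appear in that table.

```python
def mtf_encode(data, alphabet):
    """Move-to-Front: replace each symbol by its current index in a
    recency-ordered table, then move that symbol to the front."""
    table = list(alphabet)
    codes = []
    for c in data:
        idx = table.index(c)
        codes.append(idx)
        table.insert(0, table.pop(idx))
    return codes


def mtf_decode(codes, alphabet):
    """Inverse of mtf_encode, given the same initial table."""
    table = list(alphabet)
    out = []
    for idx in codes:
        c = table.pop(idx)
        out.append(c)
        table.insert(0, c)
    return out


def run_length_encode(data):
    """Collapse maximal runs of equal symbols into (symbol, length) pairs."""
    runs = []
    for c in data:
        if runs and runs[-1][0] == c:
            runs[-1][1] += 1
        else:
            runs.append([c, 1])
    return [(c, n) for c, n in runs]
```

On inputs with long runs of equal symbols, which is the typical shape of a Burrows-Wheeler transformed text, Move-to-Front emits mostly zeros and run-length encoding emits few pairs, which is what makes a simple order-zero coder effective afterwards.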