Results 1  10
of
161
Highorder entropycompressed text indexes
, 2003
"... We present a novel implementation of compressed suffix arrays exhibiting new tradeoffs between search time and space occupancy for a given text (or sequence) of n symbols over an alphabet Σ, where each symbol is encoded by lg Σ  bits. We show that compressed suffix arrays use just nHh + O(n lg lg ..."
Abstract

Cited by 194 (22 self)
 Add to MetaCart
We present a novel implementation of compressed suffix arrays exhibiting new tradeoffs between search time and space occupancy for a given text (or sequence) of n symbols over an alphabet Σ, where each symbol is encoded by lg Σ  bits. We show that compressed suffix arrays use just nHh + O(n lg lg n / lg Σ  n) bits, while retaining full text indexing functionalities, such as searching any pattern sequence of length m in O(m lg Σ  + polylog(n)) time. The term Hh ≤ lg Σ  denotes the hthorder empirical entropy of the text, which means that our index is nearly optimal in space apart from lowerorder terms, achieving asymptotically the empirical entropy of the text (with a multiplicative constant 1). If the text is highly compressible so that Hh = o(1) and the alphabet size is small, we obtain a text index with o(m) search time that requires only o(n) bits. Further results and tradeoffs are reported in the paper. 1
Compressed fulltext indexes
 ACM COMPUTING SURVEYS
, 2007
"... Fulltext indexes provide fast substring search over large text collections. A serious problem of these indexes has traditionally been their space consumption. A recent trend is to develop indexes that exploit the compressibility of the text, so that their size is a function of the compressed text l ..."
Abstract

Cited by 173 (78 self)
 Add to MetaCart
Fulltext indexes provide fast substring search over large text collections. A serious problem of these indexes has traditionally been their space consumption. A recent trend is to develop indexes that exploit the compressibility of the text, so that their size is a function of the compressed text length. This concept has evolved into selfindexes, which in addition contain enough information to reproduce any text portion, so they replace the text. The exciting possibility of an index that takes space close to that of the compressed text, replaces it, and in addition provides fast search over it, has triggered a wealth of activity and produced surprising results in a very short time, and radically changed the status of this area in less than five years. The most successful indexes nowadays are able to obtain almost optimal space and search time simultaneously. In this paper we present the main concepts underlying selfindexes. We explain the relationship between text entropy and regularities that show up in index structures and permit compressing them. Then we cover the most relevant selfindexes up to date, focusing on the essential aspects on how they exploit the text compressibility and how they solve efficiently various search problems. We aim at giving the theoretical background to understand and follow the developments in this area.
Compressed representations of sequences and fulltext indexes
 ACM Transactions on Algorithms
, 2007
"... Abstract. Given a sequence S = s1s2... sn of integers smaller than r = O(polylog(n)), we show how S can be represented using nH0(S) + o(n) bits, so that we can know any sq, as well as answer rank and select queries on S, in constant time. H0(S) is the zeroorder empirical entropy of S and nH0(S) pro ..."
Abstract

Cited by 110 (62 self)
 Add to MetaCart
Abstract. Given a sequence S = s1s2... sn of integers smaller than r = O(polylog(n)), we show how S can be represented using nH0(S) + o(n) bits, so that we can know any sq, as well as answer rank and select queries on S, in constant time. H0(S) is the zeroorder empirical entropy of S and nH0(S) provides an Information Theoretic lower bound to the bit storage of any sequence S via a fixed encoding of its symbols. This extends previous results on binary sequences, and improves previous results on general sequences where those queries are answered in O(log r) time. For larger r, we can still represent S in nH0(S) + o(n log r) bits and answer queries in O(log r / log log n) time. Another contribution of this paper is to show how to combine our compressed representation of integer sequences with an existing compression boosting technique to design compressed fulltext indexes that scale well with the size of the input alphabet Σ. Namely, we design a variant of the FMindex that indexes a string T [1, n] within nHk(T) + o(n) bits of storage, where Hk(T) is the kth order empirical entropy of T. This space bound holds simultaneously for all k ≤ α log Σ  n, constant 0 < α < 1, and Σ  = O(polylog(n)). This index counts the occurrences of an arbitrary pattern P [1, p] as a substring of T in O(p) time; it locates each pattern occurrence in O(log 1+ε n) time, for any constant 0 < ε < 1; and it reports a text substring of length ℓ in O(ℓ + log 1+ε n) time.
Succinct suffix arrays based on runlength encoding
 Nordic Journal of Computing
, 2005
"... A succinct fulltext selfindex is a data structure built on a text T = t1t2...tn, which takes little space (ideally close to that of the compressed text), permits efficient search for the occurrences of a pattern P = p1p2... pm in T, and is able to reproduce any text substring, so the selfindex re ..."
Abstract

Cited by 53 (32 self)
 Add to MetaCart
A succinct fulltext selfindex is a data structure built on a text T = t1t2...tn, which takes little space (ideally close to that of the compressed text), permits efficient search for the occurrences of a pattern P = p1p2... pm in T, and is able to reproduce any text substring, so the selfindex replaces the text. Several remarkable selfindexes have been developed in recent years. Many of those take space proportional to nH0 or nHk bits, where Hk is the kth order empirical entropy of T. The time to count how many times does P occur in T ranges from O(m) to O(m log n). In this paper we present a new selfindex, called RLFM index for “runlength FMindex”, that counts the occurrences of P in T in O(m) time when the alphabet size is σ = O(polylog(n)). The RLFM index requires nHk log σ + O(n) bits of space, for any k ≤ α log σ n and constant 0 < α < 1. Previous indexes that achieve O(m) counting time either require more than nH0 bits of space or require that σ = O(1). We also show that the RLFM index can be enhanced to locate occurrences in the text and display text substrings in time independent of σ. In addition, we prove a close relationship between the kth order entropy of the text and some regularities that show up in their suffix arrays and in the BurrowsWheeler transform of T. This relationship is of independent interest and permits bounding the space occupancy of the RLFM index, as well as that of other existing compressed indexes. Finally, we present some practical considerations in order to implement the RLFM index, obtaining two implementations with different spacetime tradeoffs. We empirically compare our indexes against the best existing implementations and show that they are practical and competitive against those. 1
Structuring labeled trees for optimal succinctness, and beyond
 In FOCS
, 2005
"... Consider an ordered, static tree T on t nodes where each node has a label from alphabet set Σ. TreeTmaybeofar bitrary degree and of arbitrary shape. Say, we wish to support basic navigational operations such as find the parent of node u,theith child of u, and any child of u with label α. In a semina ..."
Abstract

Cited by 53 (8 self)
 Add to MetaCart
Consider an ordered, static tree T on t nodes where each node has a label from alphabet set Σ. TreeTmaybeofar bitrary degree and of arbitrary shape. Say, we wish to support basic navigational operations such as find the parent of node u,theith child of u, and any child of u with label α. In a seminal work over fifteen years ago, Jacobson [15] observed that pointerbased tree representations are wasteful in space and introduced the notion of succinct data structures. He studied the special case of unlabeled trees and presented a succinct data structure of 2t+o(t) bits supporting navigational operations in O(1) time. The space used is asymptotically optimal with the informationtheoretic lower bound averaged over all trees. This led to a slew of results on succinct data structures for arrays, trees, strings
Succinct data structures for flexible text retrieval systems
 Journal of Discrete Algorithms
, 2007
"... University, Fukuoka, Japan. We propose succinct data structures for text retrieval systems supporting document listing queries and ranking queries based on the tf*idf (term frequency times inverse document frequency) scores of documents. Traditional data structures for these problems support querie ..."
Abstract

Cited by 53 (1 self)
 Add to MetaCart
University, Fukuoka, Japan. We propose succinct data structures for text retrieval systems supporting document listing queries and ranking queries based on the tf*idf (term frequency times inverse document frequency) scores of documents. Traditional data structures for these problems support queries only for some predetermined keywords. Recently Muthukrishnan proposed a data structure for document listing queries for arbitrary patterns at the cost of data structure size. For computing the tf*idf scores there has been no efficient data structures for arbitrary patterns. Our new data structures support these queries using small space. The space is only 2/ɛ times the size of compressed documents plus 10n bits for a document collection of length n, for any 0 <ɛ ≤ 1. This is much smaller than the previous O(n log n) bit data structures. Query time is O(m+q log ɛ n) for listing and computing tf*idf scores for all q documents containing a given pattern of length m. Our data structures are flexible in a sense that they support queries for arbitrary patterns.
K.: Practical EntropyCompressed Rank/Select Dictionary
 Proceedings of ALENEX’07, ACM
, 2007
"... Rank/Select dictionaries are data structures for an ordered set S ⊂ {0, 1,..., n − 1} to compute rank(x, S) (the number of elements in S which are no greater than x), and select(i, S) (the ith smallest element in S), which are the fundamental components of succinct data structures of strings, trees ..."
Abstract

Cited by 51 (1 self)
 Add to MetaCart
Rank/Select dictionaries are data structures for an ordered set S ⊂ {0, 1,..., n − 1} to compute rank(x, S) (the number of elements in S which are no greater than x), and select(i, S) (the ith smallest element in S), which are the fundamental components of succinct data structures of strings, trees, graphs, etc. In those data structures, however, only asymptotic behavior has been considered and their performance for real data is not satisfactory. In this paper, we propose novel four Rank/Select dictionaries, esp, recrank, vcode and sdarray, each of which is small if the number of elements in S is small, and indeed close to nH0(S) (H0(S) ≤ 1 is the zeroth order empirical entropy of S) in practice, and its query time is superior to the previous ones. Experimental results reveal the characteristics of our data structures and also show that these data structures are superior to existing implementations in both size and query time. 1
Breaking a TimeandSpace Barrier in Constructing FullText Indices
"... Suffix trees and suffix arrays are the most prominent fulltext indices, and their construction algorithms are well studied. It has been open for a long time whether these indicescan be constructed in both o(n log n) time and o(n log n)bit working space, where n denotes the length of the text. Int ..."
Abstract

Cited by 50 (3 self)
 Add to MetaCart
Suffix trees and suffix arrays are the most prominent fulltext indices, and their construction algorithms are well studied. It has been open for a long time whether these indicescan be constructed in both o(n log n) time and o(n log n)bit working space, where n denotes the length of the text. Inthe literature, the fastest algorithm runs in O(n) time, whileit requires O(n log n)bit working space. On the other hand,the most spaceefficient algorithm requires O(n)bit working space while it runs in O(n log n) time. This paper breaks the longstanding timeandspace barrier under the unitcost word RAM. We give an algorithm for constructing the suffix array which takes O(n) time and O(n)bit working space, for texts with constantsize alphabets. Note that both the time and the space bounds are optimal. For constructing the suffix tree, our algorithm requires O(n logffl n) time and O(n)bit working space forany 0! ffl! 1. Apart from that, our algorithm can alsobe adopted to build other existing fulltext indices, such as
An alphabetfriendly FMindex
 In Proc.SPIRE’04, LNCS 3246
, 2004
"... Abstract. We show that, by combining an existing compression boosting technique with the wavelet tree data structure, we are able to design a variant of the FMindex which scales well with the size of the input alphabet Σ. The size of the new index built on a string T [1, n] is bounded by nHk(T)+O � ..."
Abstract

Cited by 43 (19 self)
 Add to MetaCart
Abstract. We show that, by combining an existing compression boosting technique with the wavelet tree data structure, we are able to design a variant of the FMindex which scales well with the size of the input alphabet Σ. The size of the new index built on a string T [1, n] is bounded by nHk(T)+O � (n log log n) / log Σ  n � bits, where Hk(T) is the kth order empirical entropy of T. The above bound holds simultaneously for all k ≤ α log Σ  n and 0 < α < 1. Moreover, the index design does not depend on the parameter k, which plays a role only in analysis of the space occupancy. Using our index, the counting of the occurrences of an arbitrary pattern P [1, p] as a substring of T takes O(p log Σ) time. Locating each pattern occurrence takes O(log Σ  (log 2 n / log log n)) time. Reporting a text substring of length ℓ takes O((ℓ + log 2 n / log log n) log Σ) time. 1