Results 1  10
of
248
Compressed fulltext indexes
 ACM COMPUTING SURVEYS
, 2007
"... Fulltext indexes provide fast substring search over large text collections. A serious problem of these indexes has traditionally been their space consumption. A recent trend is to develop indexes that exploit the compressibility of the text, so that their size is a function of the compressed text l ..."
Abstract

Cited by 224 (94 self)
 Add to MetaCart
(Show Context)
Fulltext indexes provide fast substring search over large text collections. A serious problem of these indexes has traditionally been their space consumption. A recent trend is to develop indexes that exploit the compressibility of the text, so that their size is a function of the compressed text length. This concept has evolved into selfindexes, which in addition contain enough information to reproduce any text portion, so they replace the text. The exciting possibility of an index that takes space close to that of the compressed text, replaces it, and in addition provides fast search over it, has triggered a wealth of activity and produced surprising results in a very short time, and radically changed the status of this area in less than five years. The most successful indexes nowadays are able to obtain almost optimal space and search time simultaneously. In this paper we present the main concepts underlying selfindexes. We explain the relationship between text entropy and regularities that show up in index structures and permit compressing them. Then we cover the most relevant selfindexes up to date, focusing on the essential aspects on how they exploit the text compressibility and how they solve efficiently various search problems. We aim at giving the theoretical background to understand and follow the developments in this area.
Compressed suffix arrays and suffix trees with applications to text indexing and string matching
, 2005
"... The proliferation of online text, such as found on the World Wide Web and in online databases, motivates the need for spaceefficient text indexing methods that support fast string searching. We model this scenario as follows: Consider a text T consisting of n symbols drawn from a fixed alphabet Σ. ..."
Abstract

Cited by 211 (18 self)
 Add to MetaCart
The proliferation of online text, such as found on the World Wide Web and in online databases, motivates the need for spaceefficient text indexing methods that support fast string searching. We model this scenario as follows: Consider a text T consisting of n symbols drawn from a fixed alphabet Σ. The text T can be represented in n lg Σ  bits by encoding each symbol with lg Σ  bits. The goal is to support fast online queries for searching any string pattern P of m symbols, with T being fully scanned only once, namely, when the index is created at preprocessing time. The text indexing schemes published in the literature are greedy in terms of space usage: they require Ω(n lg n) additional bits of space in the worst case. For example, in the standard unit cost RAM, suffix trees and suffix arrays need Ω(n) memory words, each of Ω(lg n) bits. These indexes are larger than the text itself by a multiplicative factor of Ω(lg Σ  n), which is significant when Σ is of constant size, such as in ascii or unicode. On the other hand, these indexes support fast searching, either in O(m lg Σ) timeorinO(m +lgn) time, plus an outputsensitive cost O(occ) for listing the occ pattern occurrences. We present a new text index that is based upon compressed representations of suffix arrays and suffix trees. It achieves a fast O(m / lg Σ  n +lgɛ Σ  n) search time in the worst case, for any constant
Compressed representations of sequences and fulltext indexes
 ACM Transactions on Algorithms
, 2007
"... Abstract. Given a sequence S = s1s2... sn of integers smaller than r = O(polylog(n)), we show how S can be represented using nH0(S) + o(n) bits, so that we can know any sq, as well as answer rank and select queries on S, in constant time. H0(S) is the zeroorder empirical entropy of S and nH0(S) pro ..."
Abstract

Cited by 147 (72 self)
 Add to MetaCart
(Show Context)
Abstract. Given a sequence S = s1s2... sn of integers smaller than r = O(polylog(n)), we show how S can be represented using nH0(S) + o(n) bits, so that we can know any sq, as well as answer rank and select queries on S, in constant time. H0(S) is the zeroorder empirical entropy of S and nH0(S) provides an Information Theoretic lower bound to the bit storage of any sequence S via a fixed encoding of its symbols. This extends previous results on binary sequences, and improves previous results on general sequences where those queries are answered in O(log r) time. For larger r, we can still represent S in nH0(S) + o(n log r) bits and answer queries in O(log r / log log n) time. Another contribution of this paper is to show how to combine our compressed representation of integer sequences with an existing compression boosting technique to design compressed fulltext indexes that scale well with the size of the input alphabet Σ. Namely, we design a variant of the FMindex that indexes a string T [1, n] within nHk(T) + o(n) bits of storage, where Hk(T) is the kth order empirical entropy of T. This space bound holds simultaneously for all k ≤ α log Σ  n, constant 0 < α < 1, and Σ  = O(polylog(n)). This index counts the occurrences of an arbitrary pattern P [1, p] as a substring of T in O(p) time; it locates each pattern occurrence in O(log 1+ε n) time, for any constant 0 < ε < 1; and it reports a text substring of length ℓ in O(ℓ + log 1+ε n) time.
Practical EntropyCompressed Rank/Select Dictionary
 PROCEEDINGS OF ALENEX’07, ACM
, 2007
"... Rank/Select dictionaries are data structures for an ordered set S ⊂ {0, 1,..., n − 1} to compute rank(x, S) (the number of elements in S which are no greater than x), and select(i, S) (the ith smallest element in S), which are the fundamental components of succinct data structures of strings, trees ..."
Abstract

Cited by 69 (1 self)
 Add to MetaCart
(Show Context)
Rank/Select dictionaries are data structures for an ordered set S ⊂ {0, 1,..., n − 1} to compute rank(x, S) (the number of elements in S which are no greater than x), and select(i, S) (the ith smallest element in S), which are the fundamental components of succinct data structures of strings, trees, graphs, etc. In those data structures, however, only asymptotic behavior has been considered and their performance for real data is not satisfactory. In this paper, we propose novel four Rank/Select dictionaries, esp, recrank, vcode and sdarray, each of which is small if the number of elements in S is small, and indeed close to nH0(S) (H0(S) ≤ 1 is the zeroth order empirical entropy of S) in practice, and its query time is superior to the previous ones. Experimental results reveal the characteristics of our data structures and also show that these data structures are superior to existing implementations in both size and query time.
Engineering a lightweight suffix array construction algorithm (Extended Abstract)
"... In this paper we consider the problem of computing the suffix array of a text T [1, n]. This problem consists in sorting the suffixes of T in lexicographic order. The suffix array [16] (or pat array [9]) is a simple, easy to code, and elegant data structure used for several fundamental string matchi ..."
Abstract

Cited by 66 (4 self)
 Add to MetaCart
In this paper we consider the problem of computing the suffix array of a text T [1, n]. This problem consists in sorting the suffixes of T in lexicographic order. The suffix array [16] (or pat array [9]) is a simple, easy to code, and elegant data structure used for several fundamental string matching problems involving both linguistic texts and biological data [4, 11]. Recently, the interest in this data structure has been revitalized by its use as a building block for three novel applications: (1) the BurrowsWheeler compression algorithm [3], which is a provably [17] and practically [20] effective compression tool; (2) the construction of succinct [10, 19] and compressed [7, 8] indexes; the latter can store both the input text and its fulltext index using roughly the same space used by traditional compressors for the text alone; and (3) algorithms for clustering and ranking the answers to user queries in websearch engines [22]. In all these applications the construction of the suffix array is the computational bottleneck both in time and space. This motivated our interest in designing yet another suffix array construction algorithm which is fast and "lightweight" in the sense that it uses small space...
Practical implementation of rank and select queries
 In Poster Proceedings Volume of 4th Workshop on Efficient and Experimental Algorithms (WEA’05) (Greece
, 2005
"... Research on succinct data structures has made significant progress in recent years. An essential building block of many of those techniques is a data structure to perform rank and select operations over a bit array. The first operation tells how many bits are set up to some position, and the second ..."
Abstract

Cited by 58 (20 self)
 Add to MetaCart
(Show Context)
Research on succinct data structures has made significant progress in recent years. An essential building block of many of those techniques is a data structure to perform rank and select operations over a bit array. The first operation tells how many bits are set up to some position, and the second the position of the ith bit set. Albeit there exist constanttime solutions that require sublinear extra space, the practicality of those solutions against more naive ones has not been carefully studied. In this paper we show some results in this respect, which suggest that in many practical cases the simpler solutions are better in terms of time and extra space.
Succinct suffix arrays based on runlength encoding
 Nordic Journal of Computing
, 2005
"... A succinct fulltext selfindex is a data structure built on a text T = t1t2...tn, which takes little space (ideally close to that of the compressed text), permits efficient search for the occurrences of a pattern P = p1p2... pm in T, and is able to reproduce any text substring, so the selfindex re ..."
Abstract

Cited by 58 (32 self)
 Add to MetaCart
A succinct fulltext selfindex is a data structure built on a text T = t1t2...tn, which takes little space (ideally close to that of the compressed text), permits efficient search for the occurrences of a pattern P = p1p2... pm in T, and is able to reproduce any text substring, so the selfindex replaces the text. Several remarkable selfindexes have been developed in recent years. Many of those take space proportional to nH0 or nHk bits, where Hk is the kth order empirical entropy of T. The time to count how many times does P occur in T ranges from O(m) to O(m log n). In this paper we present a new selfindex, called RLFM index for “runlength FMindex”, that counts the occurrences of P in T in O(m) time when the alphabet size is σ = O(polylog(n)). The RLFM index requires nHk log σ + O(n) bits of space, for any k ≤ α log σ n and constant 0 < α < 1. Previous indexes that achieve O(m) counting time either require more than nH0 bits of space or require that σ = O(1). We also show that the RLFM index can be enhanced to locate occurrences in the text and display text substrings in time independent of σ. In addition, we prove a close relationship between the kth order entropy of the text and some regularities that show up in their suffix arrays and in the BurrowsWheeler transform of T. This relationship is of independent interest and permits bounding the space occupancy of the RLFM index, as well as that of other existing compressed indexes. Finally, we present some practical considerations in order to implement the RLFM index, obtaining two implementations with different spacetime tradeoffs. We empirically compare our indexes against the best existing implementations and show that they are practical and competitive against those. 1
Succinct indexes for strings, binary relations, and multilabeled trees
 IN: PROC. SODA
, 2007
"... We define and design succinct indexes for several abstract data types (ADTs). The concept is to design auxiliary data structures that ideally occupy asymptotically less space than the informationtheoretic lower bound on the space required to encode the given data, and support an extended set of ope ..."
Abstract

Cited by 56 (21 self)
 Add to MetaCart
We define and design succinct indexes for several abstract data types (ADTs). The concept is to design auxiliary data structures that ideally occupy asymptotically less space than the informationtheoretic lower bound on the space required to encode the given data, and support an extended set of operations using the basic operators defined in the ADT. The main advantage of succinct indexes as opposed to succinct (integrated data/index) encodings is that we make assumptions only on the ADT through which the main data is accessed, rather than the way in which the data is encoded. This allows more freedom in the encoding of the main data. In this paper, we present succinct indexes for various data types, namely strings, binary relations and multilabeled trees. Given the support for the interface of the ADTs of these data types, we can support various useful operations efficiently by constructing succinct indexes for them. When the operators in the ADTs are supported in constant time, our results are comparable to previous results, while allowing more flexibility in the encoding of the given data. Using our techniques, we design a succinct encoding that represents a string of length n over an alphabet of size σ using nHk(S)+lgσ·o(n)+O ( nlgσ lglglgσ) bits to support access/rank/select operations in o((lglgσ)1+ɛ) time, for any fixed constant ɛ> 0. We also design a succinct text index using nH0(S)+O ( nlgσ) bits that lglgσ
Structuring labeled trees for optimal succinctness, and beyond
 In FOCS
, 2005
"... Consider an ordered, static tree T on t nodes where each node has a label from alphabet set Σ. TreeTmaybeofar bitrary degree and of arbitrary shape. Say, we wish to support basic navigational operations such as find the parent of node u,theith child of u, and any child of u with label α. In a semina ..."
Abstract

Cited by 56 (8 self)
 Add to MetaCart
(Show Context)
Consider an ordered, static tree T on t nodes where each node has a label from alphabet set Σ. TreeTmaybeofar bitrary degree and of arbitrary shape. Say, we wish to support basic navigational operations such as find the parent of node u,theith child of u, and any child of u with label α. In a seminal work over fifteen years ago, Jacobson [15] observed that pointerbased tree representations are wasteful in space and introduced the notion of succinct data structures. He studied the special case of unlabeled trees and presented a succinct data structure of 2t+o(t) bits supporting navigational operations in O(1) time. The space used is asymptotically optimal with the informationtheoretic lower bound averaged over all trees. This led to a slew of results on succinct data structures for arrays, trees, strings