Results 1  10
of
32
Compressed fulltext indexes
 ACM COMPUTING SURVEYS
, 2007
"... Fulltext indexes provide fast substring search over large text collections. A serious problem of these indexes has traditionally been their space consumption. A recent trend is to develop indexes that exploit the compressibility of the text, so that their size is a function of the compressed text l ..."
Abstract

Cited by 224 (94 self)
 Add to MetaCart
Fulltext indexes provide fast substring search over large text collections. A serious problem of these indexes has traditionally been their space consumption. A recent trend is to develop indexes that exploit the compressibility of the text, so that their size is a function of the compressed text length. This concept has evolved into selfindexes, which in addition contain enough information to reproduce any text portion, so they replace the text. The exciting possibility of an index that takes space close to that of the compressed text, replaces it, and in addition provides fast search over it, has triggered a wealth of activity and produced surprising results in a very short time, and radically changed the status of this area in less than five years. The most successful indexes nowadays are able to obtain almost optimal space and search time simultaneously. In this paper we present the main concepts underlying selfindexes. We explain the relationship between text entropy and regularities that show up in index structures and permit compressing them. Then we cover the most relevant selfindexes up to date, focusing on the essential aspects on how they exploit the text compressibility and how they solve efficiently various search problems. We aim at giving the theoretical background to understand and follow the developments in this area.
Compressed representations of sequences and fulltext indexes
 ACM Transactions on Algorithms
, 2007
"... Abstract. Given a sequence S = s1s2... sn of integers smaller than r = O(polylog(n)), we show how S can be represented using nH0(S) + o(n) bits, so that we can know any sq, as well as answer rank and select queries on S, in constant time. H0(S) is the zeroorder empirical entropy of S and nH0(S) pro ..."
Abstract

Cited by 147 (72 self)
 Add to MetaCart
(Show Context)
Abstract. Given a sequence S = s1s2... sn of integers smaller than r = O(polylog(n)), we show how S can be represented using nH0(S) + o(n) bits, so that we can know any sq, as well as answer rank and select queries on S, in constant time. H0(S) is the zeroorder empirical entropy of S and nH0(S) provides an Information Theoretic lower bound to the bit storage of any sequence S via a fixed encoding of its symbols. This extends previous results on binary sequences, and improves previous results on general sequences where those queries are answered in O(log r) time. For larger r, we can still represent S in nH0(S) + o(n log r) bits and answer queries in O(log r / log log n) time. Another contribution of this paper is to show how to combine our compressed representation of integer sequences with an existing compression boosting technique to design compressed fulltext indexes that scale well with the size of the input alphabet Σ. Namely, we design a variant of the FMindex that indexes a string T [1, n] within nHk(T) + o(n) bits of storage, where Hk(T) is the kth order empirical entropy of T. This space bound holds simultaneously for all k ≤ α log Σ  n, constant 0 < α < 1, and Σ  = O(polylog(n)). This index counts the occurrences of an arbitrary pattern P [1, p] as a substring of T in O(p) time; it locates each pattern occurrence in O(log 1+ε n) time, for any constant 0 < ε < 1; and it reports a text substring of length ℓ in O(ℓ + log 1+ε n) time.
Indexing Text using the ZivLempel Trie
 Journal of Discrete Algorithms
, 2002
"... Let a text of u characters over an alphabet of size be compressible to n symbols by the LZ78 or LZW algorithm. We show that it is possible to build a data structure based on the ZivLempel trie that takes 4n log 2 n(1+o(1)) bits of space and reports the R occurrences of a pattern of length m in ..."
Abstract

Cited by 67 (43 self)
 Add to MetaCart
(Show Context)
Let a text of u characters over an alphabet of size be compressible to n symbols by the LZ78 or LZW algorithm. We show that it is possible to build a data structure based on the ZivLempel trie that takes 4n log 2 n(1+o(1)) bits of space and reports the R occurrences of a pattern of length m in worst case time O(m log(m)+(m+R)log n).
Practical implementation of rank and select queries
 In Poster Proceedings Volume of 4th Workshop on Efficient and Experimental Algorithms (WEA’05) (Greece
, 2005
"... Research on succinct data structures has made significant progress in recent years. An essential building block of many of those techniques is a data structure to perform rank and select operations over a bit array. The first operation tells how many bits are set up to some position, and the second ..."
Abstract

Cited by 58 (20 self)
 Add to MetaCart
Research on succinct data structures has made significant progress in recent years. An essential building block of many of those techniques is a data structure to perform rank and select operations over a bit array. The first operation tells how many bits are set up to some position, and the second the position of the ith bit set. Albeit there exist constanttime solutions that require sublinear extra space, the practicality of those solutions against more naive ones has not been carefully studied. In this paper we show some results in this respect, which suggest that in many practical cases the simpler solutions are better in terms of time and extra space.
Succinct suffix arrays based on runlength encoding
 Nordic Journal of Computing
, 2005
"... A succinct fulltext selfindex is a data structure built on a text T = t1t2...tn, which takes little space (ideally close to that of the compressed text), permits efficient search for the occurrences of a pattern P = p1p2... pm in T, and is able to reproduce any text substring, so the selfindex re ..."
Abstract

Cited by 58 (32 self)
 Add to MetaCart
A succinct fulltext selfindex is a data structure built on a text T = t1t2...tn, which takes little space (ideally close to that of the compressed text), permits efficient search for the occurrences of a pattern P = p1p2... pm in T, and is able to reproduce any text substring, so the selfindex replaces the text. Several remarkable selfindexes have been developed in recent years. Many of those take space proportional to nH0 or nHk bits, where Hk is the kth order empirical entropy of T. The time to count how many times does P occur in T ranges from O(m) to O(m log n). In this paper we present a new selfindex, called RLFM index for “runlength FMindex”, that counts the occurrences of P in T in O(m) time when the alphabet size is σ = O(polylog(n)). The RLFM index requires nHk log σ + O(n) bits of space, for any k ≤ α log σ n and constant 0 < α < 1. Previous indexes that achieve O(m) counting time either require more than nH0 bits of space or require that σ = O(1). We also show that the RLFM index can be enhanced to locate occurrences in the text and display text substrings in time independent of σ. In addition, we prove a close relationship between the kth order entropy of the text and some regularities that show up in their suffix arrays and in the BurrowsWheeler transform of T. This relationship is of independent interest and permits bounding the space occupancy of the RLFM index, as well as that of other existing compressed indexes. Finally, we present some practical considerations in order to implement the RLFM index, obtaining two implementations with different spacetime tradeoffs. We empirically compare our indexes against the best existing implementations and show that they are practical and competitive against those. 1
Two space saving tricks for linear time LCP computation
, 2004
"... Abstract. In this paper we consider the linear time algorithm of Kasai et al. [6] for the computation of the Longest Common Prefix (LCP) array given the text and the suffix array. We show that this algorithm can be implemented without any auxiliary array in addition to the ones required for the inpu ..."
Abstract

Cited by 37 (3 self)
 Add to MetaCart
(Show Context)
Abstract. In this paper we consider the linear time algorithm of Kasai et al. [6] for the computation of the Longest Common Prefix (LCP) array given the text and the suffix array. We show that this algorithm can be implemented without any auxiliary array in addition to the ones required for the input (the text and the suffix array) and the output (the LCP array). Thus, for a text of length n, we reduce the space occupancy of this algorithm from 13n bytes to 9n bytes. We also consider the problem of computing the LCP array by “overwriting” the suffix array. For this problem we propose an algorithm whose space occupancy can be bounded in terms of the empirical entropy of the input text. Experiments show that for linguistic texts our algorithm uses roughly 7n bytes. Our algorithm makes use of the BurrowsWheeler Transform even if it does not represent any data in compressed form. To our knowledge this is the first application of the BurrowsWheeler Transform outside the domain of data compression. The source code for the algorithms described in this paper has been included in the lightweight suffix sorting package [13] which is freely available under the GNU GPL. 1
A Simple AlphabetIndependent FMIndex
 INTERNATIONAL JOURNAL OF FOUNDATIONS OF COMPUTER SCIENCE
"... We design a succinct fulltext index based on the idea of Huffmancompressing the text and then applying the BurrowsWheeler transform over it. The resulting structure can be searched as an FMindex, with the benefit of removing the sharp dependence on the alphabet size, σ, present in that structu ..."
Abstract

Cited by 18 (9 self)
 Add to MetaCart
(Show Context)
We design a succinct fulltext index based on the idea of Huffmancompressing the text and then applying the BurrowsWheeler transform over it. The resulting structure can be searched as an FMindex, with the benefit of removing the sharp dependence on the alphabet size, σ, present in that structure. On a text of length n with zeroorder entropy H0, our index needs O(n(H0 + 1)) bits of space, without any significant dependence on σ. The average search time for a pattern of length m is O(m(H0 + 1)), under reasonable assumptions. Each position of a text occurrence can be located in worst case time O((H0 + 1)log n), while any text substring of length L can be retrieved in O((H0 + 1)L) average time in addition to the previous worst case time. Our index provides a relevant space/time tradeoff between existing succinct data structures, with the additional interest of being easy to implement. We also explore other coding variants alternative to Huffman and exploit their synchronization properties. Our experimental results on various types of texts show that our indexes are highly competitive in the space/time tradeoff map.
Advantages of backward searching — efficient secondary memory and distributed implementation of compressed suffix arrays
, 2004
"... Abstract. One of the most relevant succinct suffix array proposals in the literature is the Compressed Suffix Array (CSA) of Sadakane [ISAAC 2000]. The CSA needs n(H0 + O(log log σ)) bits of space, where n is the text size, σ is the alphabet size, and H0 the zeroorder entropy of the text. The numbe ..."
Abstract

Cited by 17 (11 self)
 Add to MetaCart
(Show Context)
Abstract. One of the most relevant succinct suffix array proposals in the literature is the Compressed Suffix Array (CSA) of Sadakane [ISAAC 2000]. The CSA needs n(H0 + O(log log σ)) bits of space, where n is the text size, σ is the alphabet size, and H0 the zeroorder entropy of the text. The number of occurrences of a pattern of length m can be computed in O(m log n) time. Most notably, the CSA does not need the text separately available to operate. The CSA simulates a binary search over the suffix array, where the query is compared against text substrings. These are extracted from the same CSA by following irregular access patterns over the structure. Sadakane [SODA 2002] has proposed using backward searching on the CSA in similar fashion as the FMindex of Ferragina and Manzini [FOCS 2000]. He has shown that the CSA can be searched in O(m) time whenever σ = O(polylog(n)). In this paper we consider some other consequences of backward searching applied to CSA. The most remarkable one is that we do not need, unlike all previous proposals, any complicated sublinear structures based on the fourRussians technique (such as constant time rank and select queries on bit arrays). We show that sampling and compression are enough to achieve O(m log n) query time using less space than the original structure. It is also possible to trade structure space for search time. Furthermore, the regular access pattern of backward searching permits an efficient secondary memory implementation, so that the search can be done with O(m log B n) disk accesses, being B the disk block size. Finally, it permits a distributed implementation with optimal speedup and negligible communication effort.
New Search Algorithms and Time/Space Tradeoffs for Succinct Suffix Arrays
, 2004
"... Abstract This paper is about compressed fulltext indexes. That is, our goal is to represent fulltext indexes in as small space as possible and, at the same time, retain the functionality of the index. The most important functionality for a fulltext index is the ability to find out whether a given ..."
Abstract

Cited by 13 (9 self)
 Add to MetaCart
(Show Context)
Abstract This paper is about compressed fulltext indexes. That is, our goal is to represent fulltext indexes in as small space as possible and, at the same time, retain the functionality of the index. The most important functionality for a fulltext index is the ability to find out whether a given pattern string occurs inside the text string on which the index is built. In addition to supporting this existence query, fulltext indexes usually support counting queries and reporting queries; the former is for counting the number of times the pattern occurs in the text, and the latter is for reporting the exact locations of the occurrences. Suffix trees and arrays are wellknown fulltext indexes that support the above queries nearly optimally. This optimality refers only to the time complexity of the queries, since in space requirement neither are optimal; both structures occupy O(n log n) bits, where n is the length of the text. Notice that the text itself can be represented in n log oe bits, where oe is the alphabet size. Since the text (in some form) is crucial for the fulltext index, it is convenient to express the size of an index as the total size of the structure plus the text. Then obviously O(n log oe) space for a fulltext index would be optimal. For compressible texts it is still possible to achieve space requirement that is proportional to the entropy of the text.
An Experimental Study of a Compressed Index
, 2001
"... The size of electronic data is currently growing at a faster rate than computer memory and disk storage capacities. For this reason compression appears always as an attractive choice, if not mandatory. However space overhead is not the only resource to be optimized when managing large data collectio ..."
Abstract

Cited by 12 (7 self)
 Add to MetaCart
The size of electronic data is currently growing at a faster rate than computer memory and disk storage capacities. For this reason compression appears always as an attractive choice, if not mandatory. However space overhead is not the only resource to be optimized when managing large data collections; in fact data turn out to be useful only when properly indexed to support search operations that e#ciently extract the userrequested information.