Results 1-10 of 26
Compressed full-text indexes
 ACM Computing Surveys
, 2007
Cited by 180 (81 self)
Full-text indexes provide fast substring search over large text collections. A serious problem of these indexes has traditionally been their space consumption. A recent trend is to develop indexes that exploit the compressibility of the text, so that their size is a function of the compressed text length. This concept has evolved into self-indexes, which in addition contain enough information to reproduce any text portion, so they replace the text. The exciting possibility of an index that takes space close to that of the compressed text, replaces it, and in addition provides fast search over it, has triggered a wealth of activity, produced surprising results in a very short time, and radically changed the status of this area in less than five years. The most successful indexes nowadays are able to obtain almost optimal space and search time simultaneously. In this paper we present the main concepts underlying self-indexes. We explain the relationship between text entropy and the regularities that show up in index structures and permit compressing them. Then we cover the most relevant self-indexes to date, focusing on how they exploit the text compressibility and how they solve various search problems efficiently. We aim at giving the theoretical background needed to understand and follow the developments in this area.
Indexing Text Using the Ziv-Lempel Trie
 Journal of Discrete Algorithms
, 2002
Cited by 66 (43 self)
Let a text of u characters over an alphabet of size σ be compressible to n symbols by the LZ78 or LZW algorithm. We show that it is possible to build a data structure based on the Ziv-Lempel trie that takes 4n log_2 n (1 + o(1)) bits of space and reports the R occurrences of a pattern of length m in worst-case time O(m log m + (m + R) log n).
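The n in the space bound above is the number of phrases in the LZ78 parse: each phrase extends a previously seen phrase by one character, which is what makes a trie over the phrases so compact. A minimal sketch of the parsing (illustrative code only, not the paper's index; the function name is ours):

```python
def lz78_parse(text):
    """LZ78 parsing: each phrase is a previously seen phrase plus one new
    character. Returns the phrase list; len(result) is the n of the bound."""
    trie = {}          # phrase string -> phrase number (empty phrase implicit)
    phrases = []
    cur = ""
    for ch in text:
        if cur + ch in trie:
            cur += ch                  # keep extending the longest known phrase
        else:
            trie[cur + ch] = len(trie) + 1
            phrases.append(cur + ch)   # emit new phrase: known prefix + 1 char
            cur = ""
    if cur:
        phrases.append(cur)            # leftover (repeats an earlier phrase)
    return phrases

p = lz78_parse("aaabbabaabaaabab")
assert "".join(p) == "aaabbabaabaaabab"   # phrases concatenate back to the text
```

Note how the phrase count grows much more slowly than the text length on compressible inputs, which is exactly why stating the index size in terms of n pays off.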
Rank and select revisited and extended
 Workshop on Space-Conscious Algorithms, University of
, 2006
Cited by 32 (17 self)
The deep connection between the Burrows-Wheeler transform (BWT) and the so-called rank and select data structures for symbol sequences is the basis of the most successful approaches to compressed text indexing. The rank of a symbol at a given position equals the number of times the symbol appears in the corresponding prefix of the sequence. Select is the inverse, retrieving the positions of the symbol occurrences. It has been shown that improvements to rank/select algorithms, in combination with the BWT, translate into improved compressed text indexes. This paper is devoted to alternative implementations and extensions of rank and select data structures. First, we show that one can use gap encoding techniques to obtain constant-time rank and select queries in essentially the same space as the best current direct solution (and sometimes less). Second, we extend symbol rank and select to substring rank and select, giving several space/time tradeoffs for the problem. One application of these queries is position-restricted substring searching, where one can specify the range of the text to which the search is restricted, so that only occurrences residing in that range are reported. In addition, arbitrary occurrences are reported in text position order. Several byproducts of our results display connections with searchable partial sums, Chazelle's two-dimensional data structures, and Grossi et al.'s wavelet trees.
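The rank/select definitions in the abstract are easy to state as naive linear-time code; the paper's contribution is answering them in constant time with small space, which this sketch does not attempt (function names are ours):

```python
def rank(seq, symbol, i):
    """Occurrences of symbol in seq[0:i], i.e. in the prefix of length i."""
    return sum(1 for c in seq[:i] if c == symbol)

def select(seq, symbol, j):
    """0-based position of the j-th occurrence (j >= 1); inverse of rank."""
    count = 0
    for pos, c in enumerate(seq):
        if c == symbol:
            count += 1
            if count == j:
                return pos
    raise ValueError("fewer than j occurrences of symbol")

s = "abracadabra"
assert rank(s, "a", 5) == 2       # 'a' appears twice in "abrac"
assert select(s, "a", 2) == 3     # the second 'a' sits at index 3
assert rank(s, "a", select(s, "a", 2) + 1) == 2   # the inverse relationship
```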
Position-restricted substring searching
 Lecture Notes in Computer Science
, 2006
Cited by 19 (3 self)
A full-text index is a data structure built over a text string T[1, n]. The most basic functionality provided is (a) counting how many times a pattern string P[1, m] appears in T and (b) locating all those occ positions. There exist several indexes that solve (a) in O(m) time and (b) in O(occ) time. In this paper we propose two new queries: (c) counting how many times P[1, m] appears in T[l, r] and (d) locating all those occ_{l,r} positions. These can be solved using (a) and (b), but this requires O(occ) time. We present two solutions to (c) and (d) in this paper. The first is an index that requires O(n log n) bits of space and answers (c) in O(m + log n) time and (d) in O(log n) time per occurrence (that is, O(occ_{l,r} log n) time overall). A variant of the first solution answers (c) in O(m + log log n) time and (d) in constant time per occurrence, but requires O(n log^{1+ε} n) bits of space for any constant ε > 0. The second solution requires O(nm log σ) bits of space, solving (c) in O(m⌈log σ / log log n⌉) time and (d) in O(m⌈log σ / log log n⌉) time per occurrence.
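The O(occ)-time baseline the abstract improves on (answer (c)/(d) by running (b) and filtering) can be sketched in a few lines; this is deliberately the naive approach, with function names of our own choosing, not the paper's indexes:

```python
def locate(text, pattern):
    """Query (b): all starting positions of pattern in text, by naive scan."""
    m = len(pattern)
    return [i for i in range(len(text) - m + 1) if text[i:i + m] == pattern]

def locate_restricted(text, pattern, l, r):
    """Query (d), naively: locate everything, keep starts inside [l, r].
    Costs O(occ) even when occ_{l,r} is tiny -- the cost the paper avoids."""
    return [i for i in locate(text, pattern) if l <= i <= r]

t = "abcabcabc"
assert locate(t, "abc") == [0, 3, 6]
assert locate_restricted(t, "abc", 2, 7) == [3, 6]
```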
Advantages of backward searching — efficient secondary memory and distributed implementation of compressed suffix arrays
, 2004
Cited by 17 (11 self)
Abstract. One of the most relevant succinct suffix array proposals in the literature is the Compressed Suffix Array (CSA) of Sadakane [ISAAC 2000]. The CSA needs n(H0 + O(log log σ)) bits of space, where n is the text size, σ is the alphabet size, and H0 is the zero-order entropy of the text. The number of occurrences of a pattern of length m can be computed in O(m log n) time. Most notably, the CSA does not need the text to be separately available in order to operate. The CSA simulates a binary search over the suffix array, where the query is compared against text substrings. These are extracted from the CSA itself by following irregular access patterns over the structure. Sadakane [SODA 2002] has proposed using backward searching on the CSA in a fashion similar to the FM-index of Ferragina and Manzini [FOCS 2000]. He has shown that the CSA can be searched in O(m) time whenever σ = O(polylog(n)). In this paper we consider some other consequences of backward searching applied to the CSA. The most remarkable one is that, unlike all previous proposals, we do not need any complicated sublinear structures based on the four-Russians technique (such as constant-time rank and select queries on bit arrays). We show that sampling and compression are enough to achieve O(m log n) query time using less space than the original structure. It is also possible to trade structure space for search time. Furthermore, the regular access pattern of backward searching permits an efficient secondary memory implementation, so that the search can be done with O(m log_B n) disk accesses, where B is the disk block size. Finally, it permits a distributed implementation with optimal speedup and negligible communication effort.
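Backward searching processes the pattern right to left, maintaining a suffix-array interval; one interval update per pattern symbol is the regular access pattern the abstract exploits. A toy FM-index-style sketch (not Sadakane's CSA: it stores the BWT explicitly and counts naively where a real index would use rank structures; names are ours):

```python
def bwt(text):
    """Burrows-Wheeler transform via sorted rotations; text must end in '$'."""
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rot[-1] for rot in rotations)

def backward_count(bwt_str, pattern):
    """Count pattern occurrences by backward search over the BWT."""
    counts = {}
    for c in bwt_str:
        counts[c] = counts.get(c, 0) + 1
    C, total = {}, 0                   # C[c] = # symbols smaller than c
    for c in sorted(counts):
        C[c] = total
        total += counts[c]
    lo, hi = 0, len(bwt_str)           # suffix-array interval [lo, hi)
    for c in reversed(pattern):        # one backward step per pattern symbol
        if c not in C:
            return 0
        lo = C[c] + bwt_str[:lo].count(c)   # naive rank of c before lo
        hi = C[c] + bwt_str[:hi].count(c)   # naive rank of c before hi
        if lo >= hi:
            return 0
    return hi - lo

b = bwt("abracadabra$")
assert backward_count(b, "abra") == 2
```

Each step touches the same kind of prefix-count information, which is what makes the disk and distributed layouts in the paper possible.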
A Simple Alphabet-Independent FM-Index
 International Journal of Foundations of Computer Science
Cited by 15 (7 self)
We design a succinct full-text index based on the idea of Huffman-compressing the text and then applying the Burrows-Wheeler transform over it. The resulting structure can be searched as an FM-index, with the benefit of removing the sharp dependence on the alphabet size, σ, present in that structure. On a text of length n with zero-order entropy H0, our index needs O(n(H0 + 1)) bits of space, without any significant dependence on σ. The average search time for a pattern of length m is O(m(H0 + 1)), under reasonable assumptions. Each position of a text occurrence can be located in worst-case time O((H0 + 1) log n), while any text substring of length L can be retrieved in O((H0 + 1)L) average time in addition to the previous worst-case time. Our index provides a relevant space/time tradeoff among existing succinct data structures, with the additional interest of being easy to implement. We also explore coding variants alternative to Huffman and exploit their synchronization properties. Our experimental results on various types of texts show that our indexes are highly competitive in the space/time tradeoff map.
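The first stage of the pipeline, Huffman coding, is what ties the space bound to H0: frequent symbols get short codewords, so the binary text has length close to n*H0 bits. A minimal sketch of that stage only (our own toy builder, not the paper's index; it omits the subsequent BWT):

```python
import heapq
from collections import Counter

def huffman_code(text):
    """Build a prefix-free binary Huffman code for the symbols of text."""
    # heap entries: (frequency, tiebreaker, {symbol: codeword-so-far})
    heap = [(f, i, {s: ""}) for i, (s, f) in enumerate(Counter(text).items())]
    heapq.heapify(heap)
    tick = len(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)      # two least frequent subtrees
        f2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c1.items()}   # prepend branch bits
        merged.update({s: "1" + w for s, w in c2.items()})
        heapq.heappush(heap, (f1 + f2, tick, merged))
        tick += 1
    return heap[0][2]

code = huffman_code("abracadabra")
# the most frequent symbol ('a', 5 occurrences) gets a shortest codeword
assert len(code["a"]) <= min(len(code[c]) for c in "brcd")
```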
Run-length compressed indexes are superior for highly repetitive sequence collections
 In Proc. 15th SPIRE, LNCS 5280
, 2008
Cited by 14 (8 self)
Abstract. A repetitive sequence collection is one where portions of a base sequence of length n are repeated many times with small variations, forming a collection of total length N. Examples of such collections are version control data and genome sequences of individuals, where the differences can be expressed by lists of basic edit operations. This paper is devoted to studying ways to store massive sets of highly repetitive sequence collections in a space-efficient manner, so that both retrieval of the content and queries on the content of the sequences can be provided time-efficiently. We show that the state-of-the-art entropy-bound full-text self-indexes do not yet provide satisfactory space bounds for this specific task. We engineer some new structures that use run-length encoding and give empirical evidence that these structures are superior to the current structures.
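Run-length encoding itself is the simple primitive underneath: collapse each maximal run of equal symbols into a (symbol, length) pair. In the paper's structures it is applied not to the raw text but to index components (such as the BWT) whose number of runs stays small when the collection is repetitive. A minimal sketch of the primitive, with our own function names:

```python
from itertools import groupby

def run_length_encode(seq):
    """Collapse each maximal run of equal symbols into (symbol, length)."""
    return [(sym, len(list(run))) for sym, run in groupby(seq)]

def run_length_decode(runs):
    """Inverse: expand the (symbol, length) pairs back into a string."""
    return "".join(sym * length for sym, length in runs)

s = "aaaabbbaaac"
runs = run_length_encode(s)
assert runs == [("a", 4), ("b", 3), ("a", 3), ("c", 1)]
assert run_length_decode(runs) == s
```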
Space-efficient construction of LZ-index
 In Proc. ISAAC’05
, 2005
Cited by 14 (10 self)
Abstract. A compressed full-text self-index is a data structure that replaces a text and in addition gives indexed access to it, while taking space proportional to the compressed text size. The LZ-index, in particular, requires 4uHk(1 + o(1)) bits of space, where u is the text length in characters and Hk is its k-th order empirical entropy. Although in practice the LZ-index needs 1.0-1.5 times the text size, its construction requires much more main memory (around 5 times the text size), which limits its applicability to large texts. In this paper we present a practical space-efficient algorithm to construct the LZ-index, requiring (4 + ε)uHk + o(u) bits of space, for any constant 0 < ε < 1, and O(σu) time, where σ is the alphabet size. Our experimental results show that our method is efficient in practice, needing an amount of memory close to that of the final index.
New Search Algorithms and Time/Space Tradeoffs for Succinct Suffix Arrays
, 2004
Cited by 13 (9 self)
Abstract. This paper is about compressed full-text indexes. That is, our goal is to represent full-text indexes in as little space as possible while, at the same time, retaining the functionality of the index. The most important functionality of a full-text index is the ability to find out whether a given pattern string occurs inside the text string on which the index is built. In addition to supporting this existence query, full-text indexes usually support counting queries and reporting queries; the former is for counting the number of times the pattern occurs in the text, and the latter is for reporting the exact locations of the occurrences. Suffix trees and suffix arrays are well-known full-text indexes that support the above queries nearly optimally. This optimality refers only to the time complexity of the queries, since in space requirement neither is optimal; both structures occupy O(n log n) bits, where n is the length of the text. Notice that the text itself can be represented in n log σ bits, where σ is the alphabet size. Since the text (in some form) is crucial for the full-text index, it is convenient to express the size of an index as the total size of the structure plus the text. Then obviously O(n log σ) space for a full-text index would be optimal. For compressible texts it is still possible to achieve a space requirement that is proportional to the entropy of the text.
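The uncompressed O(n log n)-bit baseline the abstract refers to, the plain suffix array, supports counting by binary search over lexicographically sorted suffixes. A toy sketch (quadratic-time construction and explicit suffix strings, purely for clarity; the sentinel trick below assumes no text symbol compares above '\xff'):

```python
from bisect import bisect_left, bisect_right

def suffix_array(text):
    """All suffix start positions, sorted by the suffixes (O(n^2 log n) toy)."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def count_occurrences(text, sa, pattern):
    """Counting query: size of the suffix-array range whose suffixes
    start with pattern, found by two binary searches."""
    suffixes = [text[i:] for i in sa]              # explicit only for clarity
    lo = bisect_left(suffixes, pattern)
    hi = bisect_right(suffixes, pattern + "\xff")  # upper bound for the range
    return hi - lo

t = "mississippi"
sa = suffix_array(t)
assert sa[:3] == suffix_array(t)[:3]   # deterministic build
assert count_occurrences(t, sa, "ssi") == 2
assert count_occurrences(t, sa, "iss") == 2
```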
Wavelet Trees for All
Cited by 12 (6 self)
The wavelet tree is a versatile data structure that serves a number of purposes, from string processing to geometry. It can be regarded as a device that represents a sequence, a reordering, or a grid of points. In addition, its space adapts to various entropy measures of the data it encodes, enabling compressed representations. New competitive solutions to a number of problems, based on wavelet trees, are appearing every year. In this survey we give an overview of wavelet trees and the surprising number of applications in which we have found them useful: basic and weighted point grids, sets of rectangles, strings, permutations, binary relations, graphs, inverted indexes, document retrieval indexes, full-text indexes, XML indexes, and general numeric sequences.
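The core idea is simple to sketch: split the alphabet in two halves, store one bit per sequence position saying which half its symbol falls in, and recurse on each half. Rank then descends one level per bit of the symbol. A toy pointer-based, uncompressed version with naive bit counting (real wavelet trees use succinct bitvectors with constant-time rank; class and method names are ours):

```python
class WaveletTree:
    """Toy wavelet tree over a string, supporting
    rank(symbol, i) = occurrences of symbol in the prefix of length i."""

    def __init__(self, seq, alphabet=None):
        self.alphabet = sorted(set(seq)) if alphabet is None else alphabet
        if len(self.alphabet) <= 1:
            self.bits = None                # leaf: a single symbol
            return
        mid = len(self.alphabet) // 2
        left_set = set(self.alphabet[:mid])
        self.bits = [0 if c in left_set else 1 for c in seq]
        self.left = WaveletTree([c for c in seq if c in left_set],
                                self.alphabet[:mid])
        self.right = WaveletTree([c for c in seq if c not in left_set],
                                 self.alphabet[mid:])

    def rank(self, symbol, i):
        if self.bits is None:               # leaf: every position is symbol
            return i
        mid = len(self.alphabet) // 2
        if symbol in self.alphabet[:mid]:   # descend left with 0-rank
            return self.left.rank(symbol, self.bits[:i].count(0))
        return self.right.rank(symbol, self.bits[:i].count(1))

wt = WaveletTree("abracadabra")
assert wt.rank("a", 11) == 5
assert wt.rank("r", 5) == 1
```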