Results 1  10
of
46
Compressed fulltext indexes
 ACM COMPUTING SURVEYS
, 2007
"... Fulltext indexes provide fast substring search over large text collections. A serious problem of these indexes has traditionally been their space consumption. A recent trend is to develop indexes that exploit the compressibility of the text, so that their size is a function of the compressed text l ..."
Abstract

Cited by 269 (97 self)
 Add to MetaCart
(Show Context)
Fulltext indexes provide fast substring search over large text collections. A serious problem of these indexes has traditionally been their space consumption. A recent trend is to develop indexes that exploit the compressibility of the text, so that their size is a function of the compressed text length. This concept has evolved into selfindexes, which in addition contain enough information to reproduce any text portion, so they replace the text. The exciting possibility of an index that takes space close to that of the compressed text, replaces it, and in addition provides fast search over it, has triggered a wealth of activity and produced surprising results in a very short time, and radically changed the status of this area in less than five years. The most successful indexes nowadays are able to obtain almost optimal space and search time simultaneously. In this paper we present the main concepts underlying selfindexes. We explain the relationship between text entropy and regularities that show up in index structures and permit compressing them. Then we cover the most relevant selfindexes up to date, focusing on the essential aspects on how they exploit the text compressibility and how they solve efficiently various search problems. We aim at giving the theoretical background to understand and follow the developments in this area.
Compressed suffix arrays and suffix trees with applications to text indexing and string matching
, 2005
"... The proliferation of online text, such as found on the World Wide Web and in online databases, motivates the need for spaceefficient text indexing methods that support fast string searching. We model this scenario as follows: Consider a text T consisting of n symbols drawn from a fixed alphabet Σ. ..."
Abstract

Cited by 239 (19 self)
 Add to MetaCart
The proliferation of online text, such as found on the World Wide Web and in online databases, motivates the need for spaceefficient text indexing methods that support fast string searching. We model this scenario as follows: Consider a text T consisting of n symbols drawn from a fixed alphabet Σ. The text T can be represented in n lg Σ  bits by encoding each symbol with lg Σ  bits. The goal is to support fast online queries for searching any string pattern P of m symbols, with T being fully scanned only once, namely, when the index is created at preprocessing time. The text indexing schemes published in the literature are greedy in terms of space usage: they require Ω(n lg n) additional bits of space in the worst case. For example, in the standard unit cost RAM, suffix trees and suffix arrays need Ω(n) memory words, each of Ω(lg n) bits. These indexes are larger than the text itself by a multiplicative factor of Ω(lg Σ  n), which is significant when Σ is of constant size, such as in ascii or unicode. On the other hand, these indexes support fast searching, either in O(m lg Σ) timeorinO(m +lgn) time, plus an outputsensitive cost O(occ) for listing the occ pattern occurrences. We present a new text index that is based upon compressed representations of suffix arrays and suffix trees. It achieves a fast O(m / lg Σ  n +lgɛ Σ  n) search time in the worst case, for any constant
Indexing Text using the ZivLempel Trie
 Journal of Discrete Algorithms
, 2002
"... Let a text of u characters over an alphabet of size be compressible to n symbols by the LZ78 or LZW algorithm. We show that it is possible to build a data structure based on the ZivLempel trie that takes 4n log 2 n(1+o(1)) bits of space and reports the R occurrences of a pattern of length m in ..."
Abstract

Cited by 72 (45 self)
 Add to MetaCart
(Show Context)
Let a text of u characters over an alphabet of size be compressible to n symbols by the LZ78 or LZW algorithm. We show that it is possible to build a data structure based on the ZivLempel trie that takes 4n log 2 n(1+o(1)) bits of space and reports the R occurrences of a pattern of length m in worst case time O(m log(m)+(m+R)log n).
Succinct suffix arrays based on runlength encoding
 Nordic Journal of Computing
, 2005
"... A succinct fulltext selfindex is a data structure built on a text T = t1t2...tn, which takes little space (ideally close to that of the compressed text), permits efficient search for the occurrences of a pattern P = p1p2... pm in T, and is able to reproduce any text substring, so the selfindex re ..."
Abstract

Cited by 62 (33 self)
 Add to MetaCart
A succinct fulltext selfindex is a data structure built on a text T = t1t2...tn, which takes little space (ideally close to that of the compressed text), permits efficient search for the occurrences of a pattern P = p1p2... pm in T, and is able to reproduce any text substring, so the selfindex replaces the text. Several remarkable selfindexes have been developed in recent years. Many of those take space proportional to nH0 or nHk bits, where Hk is the kth order empirical entropy of T. The time to count how many times does P occur in T ranges from O(m) to O(m log n). In this paper we present a new selfindex, called RLFM index for “runlength FMindex”, that counts the occurrences of P in T in O(m) time when the alphabet size is σ = O(polylog(n)). The RLFM index requires nHk log σ + O(n) bits of space, for any k ≤ α log σ n and constant 0 < α < 1. Previous indexes that achieve O(m) counting time either require more than nH0 bits of space or require that σ = O(1). We also show that the RLFM index can be enhanced to locate occurrences in the text and display text substrings in time independent of σ. In addition, we prove a close relationship between the kth order entropy of the text and some regularities that show up in their suffix arrays and in the BurrowsWheeler transform of T. This relationship is of independent interest and permits bounding the space occupancy of the RLFM index, as well as that of other existing compressed indexes. Finally, we present some practical considerations in order to implement the RLFM index, obtaining two implementations with different spacetime tradeoffs. We empirically compare our indexes against the best existing implementations and show that they are practical and competitive against those. 1
Space Efficient Suffix Trees
, 1998
"... We first give a representation of a suffix tree that uses n lg n + O(n) bits of space and supports searching for a pattern in the given text (from a fixed size alphabet) in O(m) time, where n is the size of the text and m is the size of the pattern. The structure is quite simple and answers a questi ..."
Abstract

Cited by 61 (4 self)
 Add to MetaCart
We first give a representation of a suffix tree that uses n lg n + O(n) bits of space and supports searching for a pattern in the given text (from a fixed size alphabet) in O(m) time, where n is the size of the text and m is the size of the pattern. The structure is quite simple and answers a question raised by Muthukrishnan in [17]. Previous compact representations of suffix trees had a higher lower order term in space and had some expectation assumption [3], or required more time for searching [5]. Then, surprisingly, we show that we can even do better, by developing a structure that uses a suffix array (and so ndlg ne bits) and an additional o(n) bits. String searching can be done in this structure also in O(m) time. Besides supporting string searching, we can also report the number of occurrences of the pattern in the same time using no additional space. In this case the space occupied...
A General Practical Approach to Pattern Matching over ZivLempel Compressed Text
, 1998
"... . We address the problem of string matching on ZivLempel compressed text. The goal is to search a pattern in a text without uncompressing it. This is a highly relevant issue to keep compressed text databases where efficient searching is still possible. We develop a general technique for string matc ..."
Abstract

Cited by 59 (9 self)
 Add to MetaCart
. We address the problem of string matching on ZivLempel compressed text. The goal is to search a pattern in a text without uncompressing it. This is a highly relevant issue to keep compressed text databases where efficient searching is still possible. We develop a general technique for string matching when the text comes as a sequence of blocks. This abstracts the essential features of ZivLempel compression. We then apply the scheme to each particular type of compression. We present the first algorithm to find all the matches of a pattern in a text compressed using LZ77. When we apply our scheme to LZ78, we obtain a much more efficient search algorithm, which is faster than uncompressing the text and then searching on it. Finally, we propose a new hybrid compression scheme which is between LZ77 and LZ78, being in practice as good to compress as LZ77 and as fast to search in as LZ78. 1 Introduction String matching is one of the most pervasive problems in computer science, with appli...
Fast lightweight suffix array construction and checking
 14th Annual Symposium on Combinatorial Pattern Matching
, 2003
"... We describe an algorithm that, for any v 2 [2; n], constructs the suffix array of a string of length n in O(vn + n log n) time using O(v + n= p v) space in addition to the input (the string) and the output (the suffix array). By setting v = log n, we obtain an O(n log n) time algorithm using O n= p ..."
Abstract

Cited by 35 (5 self)
 Add to MetaCart
(Show Context)
We describe an algorithm that, for any v 2 [2; n], constructs the suffix array of a string of length n in O(vn + n log n) time using O(v + n= p v) space in addition to the input (the string) and the output (the suffix array). By setting v = log n, we obtain an O(n log n) time algorithm using O n= p
Geometric burrowswheeler transform: Linking range searching and text indexing
 In DCC
"... We introduce a new variant of the popular BurrowsWheeler transform (BWT) called Geometric BurrowsWheeler Transform (GBWT). Unlike BWT, which merely permutes the text, GBWT converts the text into a set of points in 2dimensional geometry. Using this transform, we can answer to many open questions i ..."
Abstract

Cited by 34 (3 self)
 Add to MetaCart
(Show Context)
We introduce a new variant of the popular BurrowsWheeler transform (BWT) called Geometric BurrowsWheeler Transform (GBWT). Unlike BWT, which merely permutes the text, GBWT converts the text into a set of points in 2dimensional geometry. Using this transform, we can answer to many open questions in compressed text indexing: (1) Can compressed data structures be designed in external memory with similar performance as the uncompressed counterparts? (2) Can compressed data structures be designed for position restricted pattern matching [16]? We also introduce a reverse transform, called Points2Text, which converts a set of points into text. This transform allows us to derive the first known lower bounds in compressed text indexing. We show strong equivalence between data structural problems in geometric range searching and text pattern matching. This provides a way to derive new results in compressed text indexing by translating the results from range searching. 1
Fullycompressed suffix trees
 IN: PACS 2000. LNCS
, 2000
"... Suffix trees are by far the most important data structure in stringology, with myriads of applications in fields like bioinformatics and information retrieval. Classical representations of suffix trees require O(n log n) bits of space, for a string of size n. This is considerably more than the nlog ..."
Abstract

Cited by 31 (20 self)
 Add to MetaCart
Suffix trees are by far the most important data structure in stringology, with myriads of applications in fields like bioinformatics and information retrieval. Classical representations of suffix trees require O(n log n) bits of space, for a string of size n. This is considerably more than the nlog 2 σ bits needed for the string itself, where σ is the alphabet size. The size of suffix trees has been a barrier to their wider adoption in practice. Recent compressed suffix tree representations require just the space of the compressed string plus Θ(n) extra bits. This is already spectacular, but still unsatisfactory when σ is small as in DNA sequences. In this paper we introduce the first compressed suffix tree representation that breaks this linearspace barrier. Our representation requires sublinear extra space and supports a large set of navigational operations in logarithmic time. An essential ingredient of our representation is the lowest common ancestor (LCA) query. We reveal important connections between LCA queries and suffix tree navigation.
A compressed selfindex using a ZivLempel dictionary
 In: SPIRE. Volume 4209 of LNCS. (2006) 163–180
"... Abstract. A compressed fulltext selfindex for a text T, of size u, is a data structure used to search patterns P, of size m, in T that requires reduced space, i.e. that depends on the empirical entropy (Hk, H0) of T, and is, furthermore, able to reproduce any substring of T. In this paper we prese ..."
Abstract

Cited by 21 (6 self)
 Add to MetaCart
(Show Context)
Abstract. A compressed fulltext selfindex for a text T, of size u, is a data structure used to search patterns P, of size m, in T that requires reduced space, i.e. that depends on the empirical entropy (Hk, H0) of T, and is, furthermore, able to reproduce any substring of T. In this paper we present a new compressed selfindex able to locate the occurrences of P in O((m + occ) log n) time, where occ is the number of occurrences and σ the size of the alphabet of T. The fundamental improvement over previous LZ78 based indexes is the reduction of the search time dependency on m from O(m 2) to O(m). To achieve this result we point out the main obstacle to linear time algorithms based on LZ78 data compression and expose and explore the nature of a recurrent structure in LZindexes, the T78 suffix tree. We show that our method is very competitive in practice by comparing it against the LZIndex, the FMindex and a compressed suffix array. 1