Compressed full-text indexes
ACM Computing Surveys, 2007
Cited by 173 (79 self)
Full-text indexes provide fast substring search over large text collections. A serious problem of these indexes has traditionally been their space consumption. A recent trend is to develop indexes that exploit the compressibility of the text, so that their size is a function of the compressed text length. This concept has evolved into self-indexes, which in addition contain enough information to reproduce any text portion, so they replace the text. The exciting possibility of an index that takes space close to that of the compressed text, replaces it, and in addition provides fast search over it has triggered a wealth of activity, produced surprising results in a very short time, and radically changed the status of this area in less than five years. The most successful indexes nowadays are able to obtain almost optimal space and search time simultaneously. In this paper we present the main concepts underlying self-indexes. We explain the relationship between text entropy and the regularities that show up in index structures and permit compressing them. Then we cover the most relevant self-indexes to date, focusing on the essential aspects of how they exploit text compressibility and how they efficiently solve various search problems. We aim at giving the theoretical background needed to understand and follow the developments in this area.
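
The text entropy this abstract refers to is usually measured as the k-th order empirical entropy H_k, the quantity in which several space bounds in this listing are stated. The following is a minimal Python sketch, assuming the standard definition (H_0 is the entropy of the symbol frequencies; H_k averages the H_0 of the symbols that follow each length-k context). The function names `h0` and `hk` are illustrative and do not come from the paper.

```python
from collections import Counter, defaultdict
from math import log2

def h0(seq):
    """Zeroth-order empirical entropy of seq, in bits per symbol."""
    n = len(seq)
    if n == 0:
        return 0.0
    return sum(c / n * log2(n / c) for c in Counter(seq).values())

def hk(text, k):
    """k-th order empirical entropy of text, in bits per symbol."""
    if k == 0:
        return h0(text)
    n = len(text)
    if n == 0:
        return 0.0
    # Group every symbol by the k symbols that precede it.
    followers = defaultdict(list)
    for i in range(k, n):
        followers[text[i - k:i]].append(text[i])
    # Weighted sum of the zeroth-order entropy of each group.
    return sum(len(f) * h0(f) for f in followers.values()) / n
```

On a highly regular text such as "abababab", `h0` reports one bit per symbol while `hk` with k = 1 reports zero, which is the kind of gap that compressed indexes exploit.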

Implicit compression boosting with applications to self-indexing
In Proc. SPIRE'07, LNCS 4726, 2007
Cited by 29 (16 self)
Compression boosting (Ferragina & Manzini, SODA 2004) is a new technique to enhance the performance of zeroth-order entropy compressors to k-th order entropy. It works by constructing the Burrows-Wheeler transform of the input text, finding an optimal partitioning of the transform, and then compressing each piece using an arbitrary zeroth-order compressor. The optimal partitioning has the property that the achieved compression is boosted to k-th order entropy, for any k. The technique has an application to text indexing: essentially, building a wavelet tree (Grossi et al., SODA 2003) for each piece in the partitioning yields a k-th order compressed full-text self-index providing efficient substring searches on the indexed text (Ferragina et al., SPIRE 2004). In this paper, we show that using explicit compression boosting with wavelet trees is not necessary; our new analysis reveals that the size of the wavelet tree built for the complete Burrows-Wheeler transformed text is, in essence, the sum of those built for the pieces in the optimal partitioning. Hence, the technique provides a way to do compression boosting implicitly, with a trivial linear-time algorithm, but fixed to a specific zeroth-order compressor (Raman et al., SODA 2002). In addition to having these consequences for compression and static full-text self-indexes, the analysis shows that a recent dynamic zeroth-order compressed self-index (Mäkinen & Navarro, CPM 2006) in fact occupies space proportional to k-th order entropy.
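
The idea behind boosting can be illustrated on a toy scale: the Burrows-Wheeler transform places next to each other the symbols that are followed by the same context, so compressing each context's piece with a zeroth-order model costs no more than compressing the whole transform at once. The Python sketch below is an illustration under simplifying assumptions, not the paper's method: it uses naive sorted rotations, a `$` sentinel assumed to sort before every text symbol, and a fixed-length context partition rather than the optimal partitioning. All function names are hypothetical.

```python
from collections import Counter
from itertools import groupby
from math import log2

def h0_bits(seq):
    """Total zeroth-order entropy of seq in bits, i.e. len(seq) * H_0(seq)."""
    n = len(seq)
    return sum(c * log2(n / c) for c in Counter(seq).values())

def sorted_rotations(text, sentinel="$"):
    """Sorted rotations of text + sentinel (naive; assumes sentinel sorts first)."""
    s = text + sentinel
    return sorted(s[i:] + s[:i] for i in range(len(s)))

def bwt(text):
    """Burrows-Wheeler transform: the last symbol of each sorted rotation."""
    return "".join(rot[-1] for rot in sorted_rotations(text))

def context_partition(text, k):
    """Split the BWT into pieces whose symbols share the same k following symbols."""
    rots = sorted_rotations(text)
    return ["".join(r[-1] for r in grp)
            for _, grp in groupby(rots, key=lambda r: r[:k])]
```

For example, the pieces returned by `context_partition("banana", 1)` concatenate back to `bwt("banana")`, and summing `h0_bits` over the pieces never exceeds `h0_bits` of the whole transform.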

A Lempel-Ziv text index on secondary storage
In Proc. CPM, LNCS 4580, 2007
Cited by 10 (6 self)
Full-text searching consists in locating the occurrences of a given pattern P[1..m] in a text T[1..u], both sequences over an alphabet of size σ. In this paper we define a new index for full-text searching on secondary storage, based on the Lempel-Ziv compression algorithm and requiring 8uH_k + o(u log σ) bits of space, where H_k denotes the k-th order empirical entropy of T, for any k = o(log_σ u). Our experimental results show that our index is significantly smaller than any other practical secondary-memory data structure: 1.4–2.3 times the text size including the text, which means 39%–65% of the size of traditional indexes like String B-trees [Ferragina and Grossi, JACM 1999]. In exchange, our index requires more disk accesses to locate the pattern occurrences. Our index is able to report up to 600 occurrences per disk access, for a disk page of 32 kilobytes. If we only need to count pattern occurrences, the space can be reduced to about 1.04–1.68 times the text size, requiring about 20–60 disk accesses, depending on the pattern length.

Implementing the LZ-index: Theory versus practice
Cited by 10 (8 self)
The LZ-index is a theoretical proposal of a lightweight data structure for text indexing, based on the Ziv-Lempel trie. If a text of u characters over an alphabet of size σ is compressible to n symbols using the LZ78 algorithm, then the LZ-index takes 4n log_2 n (1 + o(1)) bits of space (that is, 4 times the entropy of the text) and reports the R occurrences of a pattern of length m in worst-case time O(m^3 log σ + (m + R) log n). In this paper we face the challenge of obtaining a practical implementation of the LZ-index, which is not at all straightforward from the theoretical proposal. We end up with a prototype that takes the promised space and has average search time O(σ m log u + √uR). This prototype is shown to be faster than other competing approaches when we take into account the time to report the positions or text contexts of the occurrences found. We show in detail the process of implementing the index, which involves interesting lessons of theory versus practice.
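
The n in the space bound above is the number of phrases produced by the LZ78 parsing, and the Ziv-Lempel trie is the trie of those phrases. The following Python sketch shows only the parsing step, using a plain dictionary as the trie; it is an illustration of LZ78 itself, not of the LZ-index or of the prototype described in the paper. The names `lz78_parse` and `lz78_decode` are hypothetical.

```python
def lz78_parse(text):
    """Return the LZ78 phrases of text as (parent_phrase_id, char) pairs.

    Phrase ids start at 1; id 0 denotes the empty phrase (the trie root).
    """
    trie = {}       # maps (node_id, char) -> child node_id
    phrases = []
    node = 0
    for ch in text:
        child = trie.get((node, ch))
        if child is not None:
            node = child                      # keep extending the current phrase
        else:
            phrases.append((node, ch))        # new phrase = known phrase + one char
            trie[(node, ch)] = len(phrases)
            node = 0
    if node != 0:
        phrases.append((node, ""))            # trailing phrase that repeats an earlier one
    return phrases

def lz78_decode(phrases):
    """Rebuild the text from the (parent_phrase_id, char) pairs."""
    strings = [""]
    for parent, ch in phrases:
        strings.append(strings[parent] + ch)
    return "".join(strings[1:])
```

Calling `len(lz78_parse(text))` gives the phrase count n for a given text, and `lz78_decode` recovers the original text from the phrase list.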

Stronger Lempel-Ziv Based Compressed Text Indexing
2008
Cited by 8 (5 self)
Given a text T[1..u] over an alphabet of size σ, the full-text search problem consists in finding the occ occurrences of a given pattern P[1..m] in T. In indexed text searching we build an index on T to improve the search time, at the cost of increasing the space requirement. The current trend in indexed text searching is that of compressed full-text self-indices, which replace the text with a more space-efficient representation of it, at the same time providing indexed access to the text. Thus, we can provide efficient access within compressed space. The LZ-index of Navarro is a compressed full-text self-index able to represent T using 4uH_k(T) + o(u log σ) bits of space, where H_k(T) denotes the k-th order empirical entropy of T, for any k = o(log_σ u). This space is about four times the compressed text size. It can locate all the occ occurrences of a pattern P in T in O(m^3 log σ + (m + occ) log u) worst-case time. Although this index has been shown to be very competitive in practice, the O(m^3 log σ) term can be excessive for long patterns. Also, the factor 4 in its space complexity makes it larger than other state-of-the-art alternatives. In this paper we present stronger Lempel-Ziv based indices, improving the overall performance of the LZ-index. We achieve indices requiring (2 + ε)uH_k(T) + o(u log σ) bits of space, for any constant ε > 0, which makes our indices the smallest existing LZ-indices. We simultaneously improve the search time to

An Improved Succinct Representation for Dynamic k-ary Trees
Cited by 7 (2 self)
k-ary trees are a fundamental data structure in many text-processing algorithms (e.g., text searching). The traditional pointer-based representation of trees is space consuming, and hence only relatively small trees can be kept in main memory. Nowadays, however, many applications need to store a huge amount of information. In this paper we present a succinct representation for dynamic k-ary trees of n nodes, requiring 2n + n log k + o(n log k) bits of space, which is close to the information-theoretic lower bound. Unlike alternative representations, where the operations on the tree can usually be computed in O(log n) time, our data structure is able to take advantage of asymptotically smaller values of k, supporting the basic operations parent and child in O(log k + log log n) time, which is o(log n) time whenever log k = o(log n). Insertions and deletions of leaves in the tree are supported in O((log k + log log n)(1 + log k / log(log k + log log n))) amortized time. Our representation also supports more specialized operations (like subtree size, depth, etc.), and provides a new trade-off when k = O(1), allowing faster updates (in O(log log n) amortized time, versus the amortized time of O((log log n)^(1+ε)), for ε > 0, from Raman and Rao [21]), at the cost of slower basic operations (in O(log log n) time, versus the O(1) time of [21]).

Practical Approaches to Reduce the Space Requirement of Lempel-Ziv-Based Compressed Text Indices
Cited by 4 (3 self)
Given a text T[1..u] over an alphabet of size σ, the full-text search problem consists in finding the occ occurrences of a given pattern P[1..m] in T. The current trend in indexed text searching is that of compressed full-text self-indices, which replace the text with a space-efficient representation of it, while at the same time providing indexed access to the text. The LZ-index of Navarro is a compressed full-text self-index based on the LZ78 compression algorithm. This index requires about 4 times the size of the compressed text, i.e., 4uH_k(T) + o(u log σ) bits of space, where H_k(T) is the k-th order empirical entropy of text T. This index has been shown to be very competitive in practice for locating pattern occurrences and extracting text snippets. However, the LZ-index is larger than competing schemes and does not offer space/time tuning options, which limits its applicability in many practical scenarios. In this paper we study several ways to reduce the space of the LZ-index, from a practical point of view and in different application scenarios. The main idea used to reduce the space is to regard the original index as a navigation scheme that allows us to move between index components. Then we perform an abstract optimization on this scheme, defining alternative schemes that support the same navigation while reducing the original redundancy. We obtain reduced LZ-indices requiring 3uH_k(T) + o(u log σ) and (2 + ε)uH_k(T) + o(u log σ) bits of space, for any 0 < ε < 1. Our LZ-indices have an average locating time of O(m^2 +

Space-efficient construction of Lempel-Ziv compressed text indexes
2009
Cited by 3 (1 self)
A compressed full-text self-index is a data structure that replaces a text and in addition gives indexed access to it, while taking space proportional to the compressed text size. This is very important nowadays, since one can accommodate the index of very large texts entirely in main memory, avoiding the slower access to secondary storage. In particular, the LZ-index [G. Navarro, Journal of Discrete Algorithms, 2004] stands out for its good performance at extracting text passages and locating pattern occurrences. Given a text T[1..u] over an alphabet of size σ, the LZ-index requires 4uH_k(T) + o(u log σ) bits of space, where H_k(T) is the k-th order empirical entropy of T. Although in practice the LZ-index needs 1.0–1.5 times the text size, its construction requires much more main memory (around 5 times the text size), which limits its applicability to texts that are not too large. In this paper we present a space-efficient algorithm to construct the LZ-index in O(u(log σ + log log u)) time and requiring 4uH_k(T) + o(u log σ) bits of space. Our experimental results show that our method is efficient in practice, needing an amount of memory close to that of the final index, and outperforming by far the construction time of other compressed indexes. We also adapt our algorithm to construct some recent reduced versions of the LZ-index, showing that these can also be built without using extra space on top of that required by the final index. We study an alternative model in which we are given only a limited amount of main memory to carry out the indexing process (less than that required by the final index). We show how to build all the LZ-index alternatives in

Compressed Dynamic Tries with Applications to LZ-Compression in Sublinear Time and Space
Cited by 1 (0 self)
The dynamic trie is a fundamental data structure which finds applications in many areas. This paper proposes a compressed version of the dynamic trie data structure. Our data structure is not only space efficient, it also allows pattern searching in o(|P|) time and leaf insertion/deletion in o(log n) time, where |P| is the length of the pattern and n is the size of the trie. To demonstrate the usefulness of the new data structure, we apply it to the LZ-compression problem. For a string S of length s over an alphabet A of size σ, the previously best known algorithms for computing the Ziv-Lempel encoding (LZ78) of S either run in: (1) O(s) time and O(s log s) bits of working space; or (2) O(sσ) time and O(sH_k + s log σ / log_σ s) bits of working space, where H_k is the k-th order entropy of the text. No previous algorithm runs in sublinear time. Our new data structure implies an LZ-compression algorithm which runs in sublinear time and uses optimal working space. More precisely, the LZ-compression algorithm uses O(s(log σ + log log_σ s) / log_σ s) bits of working space and runs in O(s (log log s)^2 / (log_σ s log log log s)) worst-case time, which is sublinear when σ = 2^(o(log s · log log log s / (log log s)^2)).