Results 11-20 of 29
Space-efficient construction of LZ-index
In Proc. ISAAC’05, 2005
Cited by 14 (10 self)
A compressed full-text self-index is a data structure that replaces a text and in addition gives indexed access to it, while taking space proportional to the compressed text size. The LZ-index, in particular, requires $4uH_k(1 + o(1))$ bits of space, where $u$ is the text length in characters and $H_k$ is its $k$-th order empirical entropy. Although in practice the LZ-index needs 1.0-1.5 times the text size, its construction requires much more main memory (around 5 times the text size), which limits its applicability to large texts. In this paper we present a practical space-efficient algorithm to construct the LZ-index, requiring $(4+\epsilon)uH_k + o(u)$ bits of space, for any constant $0 < \epsilon < 1$, and $O(\sigma u)$ time, where $\sigma$ is the alphabet size. Our experimental results show that our method is efficient in practice, needing an amount of memory close to that of the final index.

1 Introduction and Previous Work

A full-text database is a system providing fast access to a large mass of textual data. The simplest (yet realistic and rather common) scenario is as follows. The text collection is regarded as a unique sequence of characters $T_{1 \ldots u}$ over an alphabet $\Sigma$ of size $\sigma$,
A Simple Alphabet-Independent FM-Index
International Journal of Foundations of Computer Science
Cited by 14 (7 self)
We design a succinct full-text index based on the idea of Huffman-compressing the text and then applying the Burrows-Wheeler transform over it. The resulting structure can be searched as an FM-index, with the benefit of removing the sharp dependence on the alphabet size, $\sigma$, present in that structure. On a text of length $n$ with zero-order entropy $H_0$, our index needs $O(n(H_0 + 1))$ bits of space, without any significant dependence on $\sigma$. The average search time for a pattern of length $m$ is $O(m(H_0 + 1))$, under reasonable assumptions. Each position of a text occurrence can be located in worst-case time $O((H_0 + 1)\log n)$, while any text substring of length $L$ can be retrieved in $O((H_0 + 1)L)$ average time in addition to the previous worst-case time. Our index provides a relevant space/time tradeoff between existing succinct data structures, with the additional interest of being easy to implement. We also explore other coding variants alternative to Huffman and exploit their synchronization properties. Our experimental results on various types of texts show that our indexes are highly competitive in the space/time tradeoff map.
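The space bound in the abstract above is stated in terms of the zero-order empirical entropy $H_0$. As a rough illustration only (this is the standard definition, not code from the paper; the function name `zero_order_entropy` is a hypothetical helper), $H_0$ can be computed from symbol frequencies as follows:

```python
from collections import Counter
from math import log2


def zero_order_entropy(text: str) -> float:
    """Zero-order empirical entropy H0 of `text`, in bits per symbol.

    H0 = sum over distinct symbols c of (n_c / n) * log2(n / n_c),
    where n_c is the number of occurrences of c and n = len(text).
    """
    n = len(text)
    if n == 0:
        return 0.0
    counts = Counter(text)
    return sum((c / n) * log2(n / c) for c in counts.values())


if __name__ == "__main__":
    for sample in ("aaaa", "abab", "abracadabra"):
        h0 = zero_order_entropy(sample)
        # n * H0 is the zero-order entropy lower bound, in bits, for the whole text.
        print(sample, round(h0, 4), round(len(sample) * h0, 2))
```

A text with a single repeated symbol has $H_0 = 0$, while a text over $\sigma$ equally frequent symbols has $H_0 = \log_2 \sigma$, which is why bounds of the form $O(n(H_0 + 1))$ carry the additive $+1$ term.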
A Space and Time Efficient Algorithm for Constructing Compressed Suffix Arrays
Cited by 14 (1 self)
With the first human DNA being decoded into a sequence of about 2.8 billion characters, much biological research has centered on analyzing this sequence. Theoretically speaking, it is now feasible to accommodate an index for human DNA in main memory so that any pattern can be located efficiently. This is due to the recent breakthrough on compressed suffix arrays, which reduces the space requirement from $O(n \log n)$ bits to $O(n)$ bits for indexing a text of $n$ characters. However, constructing compressed suffix arrays is still not an easy task, because we still have to compute suffix arrays first and need a working memory of $O(n \log n)$ bits (i.e., more than 13 Gigabytes for human DNA). This paper initiates the study of constructing compressed suffix arrays directly from the text. The main contribution is a construction algorithm that uses only $O(n)$ bits of working memory, with time complexity $O(n \log n)$. Our construction algorithm is also time and space efficient for texts with large alphabets such as Chinese or Japanese. Precisely, when the alphabet size is $|\Sigma|$, the working space becomes $O(n(H_0 + 1))$ bits, where $H_0$ denotes the order-0 entropy of the text and is at most $\log |\Sigma|$; the time complexity remains $O(n \log n)$, which is independent of $|\Sigma|$.
New Search Algorithms and Time/Space Tradeoffs for Succinct Suffix Arrays
2004
Cited by 12 (9 self)
This paper is about compressed full-text indexes. That is, our goal is to represent full-text indexes in as small space as possible and, at the same time, retain the functionality of the index. The most important functionality for a full-text index is the ability to find out whether a given pattern string occurs inside the text string on which the index is built. In addition to supporting this existence query, full-text indexes usually support counting queries and reporting queries; the former is for counting the number of times the pattern occurs in the text, and the latter is for reporting the exact locations of the occurrences. Suffix trees and arrays are well-known full-text indexes that support the above queries nearly optimally. This optimality refers only to the time complexity of the queries, since in space requirement neither is optimal; both structures occupy $O(n \log n)$ bits, where $n$ is the length of the text. Notice that the text itself can be represented in $n \log \sigma$ bits, where $\sigma$ is the alphabet size. Since the text (in some form) is crucial for the full-text index, it is convenient to express the size of an index as the total size of the structure plus the text. Then obviously $O(n \log \sigma)$ space for a full-text index would be optimal. For compressible texts it is still possible to achieve space requirement that is proportional to the entropy of the text.
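The three query types named in the abstract above (existence, counting, reporting) can be illustrated on a plain, uncompressed suffix array. This is only a sketch under simplifying assumptions: the construction sorts whole suffixes, which is far from space- or time-efficient, and the function names are hypothetical rather than taken from the paper.

```python
def build_suffix_array(text: str) -> list[int]:
    """Start positions of all suffixes of `text`, in lexicographic order.

    Naive construction: fine for a demonstration, not for large texts.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def suffix_range(text: str, sa: list[int], pattern: str) -> tuple[int, int]:
    """Half-open range [start, end) of suffix-array entries prefixed by `pattern`."""
    m = len(pattern)

    # First suffix whose length-m prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # First suffix whose length-m prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return start, lo


def exists(text: str, sa: list[int], pattern: str) -> bool:
    """Existence query: does `pattern` occur in `text`?"""
    start, end = suffix_range(text, sa, pattern)
    return start < end


def count(text: str, sa: list[int], pattern: str) -> int:
    """Counting query: number of occurrences of `pattern`."""
    start, end = suffix_range(text, sa, pattern)
    return end - start


def locate(text: str, sa: list[int], pattern: str) -> list[int]:
    """Reporting query: sorted start positions of all occurrences."""
    start, end = suffix_range(text, sa, pattern)
    return sorted(sa[start:end])
```

Storing `sa` explicitly takes $n$ integers of about $\log n$ bits each, which is the $O(n \log n)$-bit cost the abstract contrasts with the $n \log \sigma$ bits of the text itself.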
Improved dynamic rank-select entropy-bound structures
in Proc. of the Latin American Theoretical Informatics (LATIN
Cited by 12 (2 self)
Operations rank and select over a sequence of symbols have many applications to the design of succinct and compressed data structures to manage text collections, structured text, binary relations, trees, graphs, and so on. We are interested in the case where the collections can be updated via insertions and deletions of symbols. Two current solutions stand out as the best in the tradeoff of space versus time (considering all the operations). One, by Mäkinen and Navarro, achieves compressed space (i.e., $nH_0 + o(n \log \sigma)$ bits) and $O(\log n \log \sigma)$ worst-case time for all the operations, where $n$ is the sequence length, $\sigma$ is the alphabet size, and $H_0$ is the zero-order entropy of the sequence. The other solution, by Lee and Park, achieves $O(\log n (1 + \frac{\log \sigma}{\log \log n}))$ amortized time and uncompressed space, i.e. $n \log \sigma + O(n) + o(n \log \sigma)$ bits. In this paper we show that the best of both worlds can be achieved. We combine the solutions to obtain $nH_0 + o(n \log \sigma)$ bits of space and $O(\log n (1 + \frac{\log \sigma}{\log \log n}))$ worst-case time for all the operations. Apart from the best current solution, we obtain some byproducts that might be
Wavelet Trees for All
Cited by 10 (5 self)
The wavelet tree is a versatile data structure that serves a number of purposes, from string processing to geometry. It can be regarded as a device that represents a sequence, a reordering, or a grid of points. In addition, its space adapts to various entropy measures of the data it encodes, enabling compressed representations. New competitive solutions to a number of problems, based on wavelet trees, are appearing every year. In this survey we give an overview of wavelet trees and the surprising number of applications in which we have found them useful: basic and weighted point grids, sets of rectangles, strings, permutations, binary relations, graphs, inverted indexes, document retrieval indexes, full-text indexes, XML indexes, and general numeric sequences.
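To make the "device that represents a sequence" view concrete, here is a minimal pointer-based wavelet tree sketch supporting `access` and `rank`. It is an illustration under simplifying assumptions, not the survey's implementation: bitmaps are plain Python lists with precomputed prefix counts instead of succinct rank-capable bitvectors, so it shows the recursion but none of the space savings. The class and method names are hypothetical.

```python
class WaveletTree:
    """Toy wavelet tree over a sequence of comparable, hashable symbols."""

    def __init__(self, seq, alphabet=None):
        self.alphabet = sorted(set(seq)) if alphabet is None else list(alphabet)
        if len(self.alphabet) <= 1:
            self.bits = None  # leaf: all symbols here are identical
            return
        mid = len(self.alphabet) // 2
        left_syms = set(self.alphabet[:mid])
        # Bit 0 sends a symbol to the left child, bit 1 to the right child.
        self.bits = [0 if c in left_syms else 1 for c in seq]
        # ones[i] = number of 1-bits in bits[0:i].
        self.ones = [0]
        for b in self.bits:
            self.ones.append(self.ones[-1] + b)
        self.left = WaveletTree(
            [c for c in seq if c in left_syms], self.alphabet[:mid])
        self.right = WaveletTree(
            [c for c in seq if c not in left_syms], self.alphabet[mid:])

    def access(self, i):
        """Symbol at position i (0-based) of the original sequence."""
        if self.bits is None:
            return self.alphabet[0]
        if self.bits[i] == 0:
            return self.left.access(i - self.ones[i])
        return self.right.access(self.ones[i])

    def rank(self, c, i):
        """Number of occurrences of symbol c in the prefix seq[0:i]."""
        if self.bits is None:
            return i if self.alphabet and self.alphabet[0] == c else 0
        mid = len(self.alphabet) // 2
        if c in self.alphabet[:mid]:
            return self.left.rank(c, i - self.ones[i])
        return self.right.rank(c, self.ones[i])
```

Each query walks one root-to-leaf path, so with a balanced tree over $\sigma$ symbols it touches about $\log_2 \sigma$ bitmaps.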
A simpler analysis of Burrows-Wheeler based compression
In Proc. of the 17th Symposium on Combinatorial Pattern Matching (CPM ’06). Springer-Verlag LNCS, 2006
Cited by 10 (0 self)
In this paper we present a new technique for worst-case analysis of compression algorithms which are based on the Burrows-Wheeler Transform. We deal mainly with the algorithm proposed by Burrows and Wheeler in their first paper on the subject [6], called bw0. This algorithm consists of the following three essential steps: 1) Obtain the Burrows-Wheeler Transform of the text, 2) Convert the transform into a sequence of integers using the move-to-front algorithm, 3) Encode the integers using arithmetic code or any order-0 encoding (possibly with run-length encoding). We achieve a strong upper bound on the worst-case compression ratio of this algorithm. This bound is significantly better than bounds known to date and is obtained via simple analytical techniques. Specifically, we show that for any input string $s$, and $\mu > 1$, the length of the compressed string is bounded by $\mu \cdot |s| H_k(s) + \log(\zeta(\mu)) \cdot |s| + \mu g_k + O(\log n)$, where $H_k$ is the $k$-th order empirical entropy, $g_k$ is a constant depending only on $k$ and on the size of the alphabet, and $\zeta(\mu) = \frac{1}{1^\mu} + \frac{1}{2^\mu} + \cdots$ is the standard zeta function. As part of the analysis we prove a result on the compressibility of integer sequences, which is of independent interest. Finally, we apply our techniques to prove a worst-case bound on the compression ratio of a compression algorithm based on the Burrows-Wheeler Transform followed by distance coding, for which worst-case guarantees have never been given. We prove that the length of the compressed string is bounded by $1.7286 \cdot |s| H_k(s) + g_k + O(\log n)$. This bound is better than the bound we give for bw0.
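The first two steps of bw0 listed in the abstract above can be sketched as follows. This is a naive illustration, not the paper's code: the transform sorts all rotations explicitly (quadratic space), the end-of-text sentinel is an assumption of this sketch, and step 3 (the order-0 entropy coder) is omitted. The function names are hypothetical.

```python
def bwt(text: str, sentinel: str = "\0") -> str:
    """Burrows-Wheeler Transform via explicit sorting of all rotations.

    `sentinel` must not occur in `text` and must sort before every symbol
    of `text`; it marks the end of the text so the transform is invertible.
    """
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def move_to_front(text: str, alphabet: list[str]) -> list[int]:
    """Replace each symbol by its current position in a self-organizing list.

    After a symbol is emitted it moves to the front, so runs of equal
    symbols in the BWT output become runs of zeros.
    """
    table = list(alphabet)
    output = []
    for symbol in text:
        index = table.index(symbol)
        output.append(index)
        table.insert(0, table.pop(index))
    return output


if __name__ == "__main__":
    transformed = bwt("banana", sentinel="$")
    ranks = move_to_front(transformed, sorted(set(transformed)))
    print(transformed, ranks)
```

The integer sequence produced by `move_to_front` is what step 3 would feed to an arithmetic coder or another order-0 encoder.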
Fast compression with a static model in high-order entropy
In Proceedings of the IEEE Data Compression Conference, Snowbird, UT, 2004
Cited by 8 (3 self)
We report on a simple encoding format called wzip for decompressing block-sorting transforms, such as the Burrows-Wheeler Transform (BWT). Our compressor uses the simple notions of gamma encoding and RLE, organized with a wavelet tree, to achieve a slightly better compression ratio than bzip2 in less time. In fact, our compression/decompression time is dependent on $H_h$, the $h$-th order empirical entropy. This relationship of performance to the compressibility of data is a key new idea among compression algorithms. Another key contribution of our compressor is its simplicity. Our compressor can also operate as a full-text index with a small amount of data, while still preserving backward compatibility with just the compressor.
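The two building blocks named in the abstract above, gamma encoding and run-length encoding (RLE), can be sketched generically as below. This is not the wzip format, whose layout the abstract does not specify; it only shows Elias gamma codes applied to run lengths, and all function names are hypothetical.

```python
from itertools import groupby


def elias_gamma(n: int) -> str:
    """Elias gamma code of a positive integer, as a string of '0'/'1'.

    The code is (len(binary) - 1) zeros followed by the binary form of n.
    """
    if n < 1:
        raise ValueError("Elias gamma is defined for integers >= 1")
    binary = bin(n)[2:]
    return "0" * (len(binary) - 1) + binary


def run_lengths(seq: str) -> list[tuple[str, int]]:
    """Run-length encode `seq` into (symbol, run length) pairs."""
    return [(symbol, len(list(group))) for symbol, group in groupby(seq)]


def gamma_coded_runs(seq: str) -> list[tuple[str, str]]:
    """Pair each run's symbol with the gamma code of its length."""
    return [(symbol, elias_gamma(length)) for symbol, length in run_lengths(seq)]
```

Gamma codes spend about $2\log_2 \ell + 1$ bits on a run of length $\ell$, so the long runs typical of BWT output are cheap to store.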
Rank/Select on Dynamic Compressed Sequences and Applications
2008
Cited by 5 (2 self)
Operations rank and select over a sequence of symbols have many applications to the design of succinct and compressed data structures managing text collections, structured text, binary relations, trees, graphs, and so on. We are interested in the case where the collections can be updated via insertions and deletions of symbols. Two current solutions stand out as the best in the tradeoff of space versus time (when considering all the operations). One solution, by Mäkinen and Navarro, achieves compressed space (i.e., $nH_0 + o(n \log \sigma)$ bits) and $O(\log n \log \sigma)$ worst-case time for all the operations, where $n$ is the sequence length, $\sigma$ is the alphabet size, and $H_0$ is the zero-order entropy of the sequence. The other solution, by Lee and Park, achieves $O(\log n (1 + \frac{\log \sigma}{\log \log n}))$ amortized time and uncompressed space, i.e. $n \log_2 \sigma + O(n) + o(n \log \sigma)$ bits. In this paper we show that the best of both worlds can be achieved. We combine the solutions to obtain $nH_0 + o(n \log \sigma)$ bits of space and $O(\log n (1 + \frac{\log \sigma}{\log \log n}))$ worst-case time for all the operations. Apart from the best current solution to the problem, we obtain several byproducts of independent interest applicable to partial sums, text indexes, suffix arrays, the Burrows-Wheeler transform, and others.
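For readers unfamiliar with the operations, the following sketch pins down the interface the abstract above discusses: rank, select, insert, and delete on a sequence of symbols. It is a deliberately naive baseline with linear-time operations and uncompressed storage, not any of the solutions cited; the class name and the indexing conventions (0-based positions, 1-based occurrence counts for `select`) are assumptions of this sketch.

```python
class NaiveDynamicSequence:
    """Reference semantics for dynamic rank/select; every operation is O(n)."""

    def __init__(self, symbols=""):
        self.data = list(symbols)

    def rank(self, c, i):
        """Number of occurrences of c among the first i symbols."""
        return self.data[:i].count(c)

    def select(self, c, j):
        """0-based position of the j-th occurrence of c (j counts from 1)."""
        seen = 0
        for position, symbol in enumerate(self.data):
            if symbol == c:
                seen += 1
                if seen == j:
                    return position
        raise ValueError(f"fewer than {j} occurrences of {c!r}")

    def insert(self, i, c):
        """Insert symbol c so that it ends up at position i."""
        self.data.insert(i, c)

    def delete(self, i):
        """Remove the symbol at position i."""
        del self.data[i]
```

A baseline like this is mainly useful as a test oracle: a compressed dynamic structure should return the same answers while meeting the space and time bounds quoted in the abstract.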
An algorithmic framework for compression and text indexing
Cited by 5 (0 self)
We present a unified algorithmic framework to obtain nearly optimal space bounds for text compression and compressed text indexing, apart from lower-order terms. For a text $T$ of $n$ symbols drawn from an alphabet $\Sigma$, our bounds are stated in terms of the $h$th-order empirical entropy of the text, $H_h$. In particular, we provide a tight analysis of the Burrows-Wheeler transform (bwt) establishing a bound of $nH_h + M(T,\Sigma,h)$ bits, where $M(T,\Sigma,h)$ denotes the asymptotical number of bits required to store the empirical statistical model for contexts of order $h$ appearing in $T$. Using the same framework, we also obtain an implementation of the compressed suffix array (csa) which achieves $nH_h + M(T,\Sigma,h) + O(n \lg \lg n / \lg_{|\Sigma|} n)$ bits of space while still retaining competitive full-text indexing functionality. The novelty of the proposed framework lies in its use of the finite set model instead of the empirical probability model (as in previous work), giving us new insight into the design and analysis of our algorithms. For example, we show that our analysis gives improved bounds since $M(T,\Sigma,h) \le \min\{g'_h \lg(n/g'_h + 1),\ H^*_h n + \lg n + g''_h\}$, where $g'_h = O(|\Sigma|^{h+1})$ and $g''_h = O(|\Sigma|^{h+1} \lg |\Sigma|^{h+1})$ do not depend on the text length $n$, while $H^*_h \ge H_h$ is the modified $h$th-order empirical entropy of $T$. Moreover, we show a strong relationship between a compressed full-text index and the succinct dictionary problem. We also examine the importance of lower-order terms, as these can dwarf any savings achieved by high-order entropy. We report further results and tradeoffs on high-order entropy-compressed text indexes in the paper.