Results 1–10 of 12
Compressing and indexing labeled trees, with applications
, 2009
Abstract

Cited by 22 (1 self)
Consider an ordered, static tree T where each node has a label from alphabet Σ. Tree T may be of arbitrary degree and shape. Our goal is to design a compressed storage scheme for T that supports basic navigational operations among the immediate neighbors of a node (i.e. parent, i-th child, or any child with some label, ...) as well as more sophisticated path-based search operations over its labeled structure. We present a novel approach to this problem by designing what we call the XBW-transform of the tree, in the spirit of the well-known Burrows-Wheeler transform for strings [1994]. The XBW-transform uses path-sorting to linearize the labeled tree T into two coordinated arrays, one capturing the structure and the other the labels. For the first time, by using the properties of the XBW-transform, our compressed indexes go beyond the information-theoretic lower bound, and support navigational and path-search operations over labeled trees within (near-)optimal time bounds and entropy-bounded space. Our XBW-transform is simple and likely to spur new results in the theory of tree compression and indexing, as well as interesting application contexts. As an example, we use the XBW-transform to design and implement a compressed index for XML documents whose compression ratio is significantly better than the one achievable by state-of-the-art tools, and whose query time performance is order ...
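The path-sorting idea behind the XBW-transform can be illustrated with a short sketch. This is a simplified reading of the construction, assuming trees given as (label, children) tuples; the names `xbw`, `S_last` and `S_alpha` follow the spirit of the paper, but the code is illustrative, not the paper's implementation.

```python
# Sketch of the XBW-transform: linearize a labeled tree into two
# coordinated arrays by sorting nodes on their upward label paths.
# Assumes trees given as (label, [children]) tuples; names illustrative.

def xbw(tree):
    """Return (S_last, S_alpha): S_last[i] is True iff row i's node is
    the last child of its parent, S_alpha[i] is its label; rows are
    sorted by the label path from the node's parent up to the root."""
    rows = []  # (upward path of parent, is last child, label)

    def collect(node, upward, last):
        label, children = node
        rows.append((upward, last, label))
        for i, child in enumerate(children):
            collect(child, label + upward, i == len(children) - 1)

    collect(tree, "", True)          # nodes are collected in pre-order
    rows.sort(key=lambda r: r[0])    # Python's sort is stable, as required
    return [r[1] for r in rows], [r[2] for r in rows]
```

For example, on the tree ("A", [("B", []), ("C", [("D", [])])]) the upward paths are "", "A", "A" and "CA", so the sorted label array is A, B, C, D with the structure array marking B as a non-last child.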
Lightweight data indexing and compression in external memory
 In Proc. 8th Latin American Symposium on Theoretical Informatics (LATIN)
, 2010
Abstract

Cited by 21 (3 self)
Abstract. In this paper we describe algorithms for computing the BWT and for building (compressed) indexes in external memory. The innovative feature of our algorithms is that they are lightweight, in the sense that, for an input of size n, they use only n bits of disk working space, while all previous approaches use Θ(n log n) bits of disk working space. Moreover, our algorithms access disk data only via sequential scans, thus taking full advantage of modern disk features that make sequential disk accesses much faster than random accesses. We also present a scan-based algorithm for inverting the BWT that uses Θ(n) bits of working space, and a lightweight internal-memory algorithm for computing the BWT which is the fastest in the literature when the available working space is o(n) bits. Finally, we prove lower bounds on the complexity of computing and inverting the BWT via sequential scans in terms of the classic product: internal-memory space × number of passes over the disk data.
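The external-memory, scan-based machinery cannot be captured in a few lines, but the LF-mapping that any BWT inversion ultimately computes can be sketched in internal memory. This is a generic textbook-style sketch, not the paper's lightweight algorithm; `inverse_bwt` and its arguments are illustrative names.

```python
# Sketch of BWT inversion via the LF-mapping. Assumes the transform was
# built over a text ending in a unique sentinel '$' and that `row`, the
# index of the original text in the sorted rotation matrix, is known.

def inverse_bwt(last, row):
    """Recover the original text from the BWT last column."""
    n = len(last)
    # Stable-sorting the positions of the last column yields the first
    # column; order[j] is the last-column position of first-column row j.
    order = sorted(range(n), key=lambda i: last[i])
    lf = [0] * n
    for j, i in enumerate(order):
        lf[i] = j                      # LF-mapping: last row i -> first row j
    out = []
    for _ in range(n):                 # walk the text backwards
        out.append(last[row])
        row = lf[row]
    return "".join(reversed(out))
```

For example, the BWT of "banana$" is "annb$aa" with the original text at row 4 of the sorted rotation matrix, and the walk above recovers "banana$".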
On Compressing the Textual Web
Abstract

Cited by 12 (2 self)
Nowadays we know how to effectively compress most basic components of any modern search engine, such as the graphs arising from the Web structure and/or its usage, the posting lists, and the dictionary of terms. But we are not aware of any study which has deeply addressed the issue of compressing the raw Web pages. Many Web applications use simple compression algorithms – e.g. gzip, or word-based Move-to-Front or Huffman coders – and conclude that, even compressed, raw data take more space than Inverted Lists. In this paper we investigate two typical scenarios of use of data compression for large Web collections. In the first scenario, the compressed pages are stored on disk and we only need to support the fast scanning of large parts of the compressed collection (such as for map-reduce paradigms). In the second scenario, we consider fast access to individual pages of the compressed collection that is distributed among the RAMs of many PCs (such as for search engines and miners). For the first scenario, we provide a thorough experimental comparison among state-of-the-art compressors, indicating the pros and cons of the available solutions. For the second scenario, we compare compressed-storage solutions with the new technology of compressed self-indexes [45]. Our results show that Web pages are more compressible than expected and, consequently, that some common beliefs in this area should be reconsidered. Our results are novel for the large spectrum of tested approaches and the size of datasets, and provide a threefold contribution: a non-trivial baseline for designing new compressed-storage solutions, a guide for software developers faced with Web-page storage, and a natural complement to the recent figures on inverted-list compression achieved by [57, 58].
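The second scenario above (fast access to individual pages held in RAM) hinges on compressing pages as independently decompressible units. A minimal sketch, using Python's zlib as a stand-in for the compressors tested in the paper; `compress_pages` and `fetch_page` are illustrative names, not the paper's tools.

```python
# Sketch of compressed storage with per-page random access: each page is
# compressed independently, so fetching one page touches nothing else.

import zlib

def compress_pages(pages):
    """Compress each page independently; return (blobs, index), where
    index records per-page compressed sizes (enough to seek in a
    concatenated layout)."""
    blobs = [zlib.compress(p.encode("utf-8")) for p in pages]
    index = [len(b) for b in blobs]
    return blobs, index

def fetch_page(blobs, i):
    """Random access: decompress only page i."""
    return zlib.decompress(blobs[i]).decode("utf-8")
```

The trade-off the paper measures is exactly the one visible here: independent blocks enable random access but forgo cross-page redundancy, which a scan-oriented whole-collection compressor can exploit.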
A simpler analysis of Burrows-Wheeler based compression
 In Proc. of the 17th Symposium on Combinatorial Pattern Matching (CPM ’06). Springer-Verlag LNCS
, 2006
Abstract

Cited by 9 (0 self)
In this paper we present a new technique for worst-case analysis of compression algorithms which are based on the Burrows-Wheeler Transform. We deal mainly with the algorithm proposed by Burrows and Wheeler in their first paper on the subject [6], called bw0. This algorithm consists of the following three essential steps: 1) obtain the Burrows-Wheeler Transform of the text; 2) convert the transform into a sequence of integers using the move-to-front algorithm; 3) encode the integers using arithmetic code or any order-0 encoding (possibly with run-length encoding). We achieve a strong upper bound on the worst-case compression ratio of this algorithm. This bound is significantly better than bounds known to date and is obtained via simple analytical techniques. Specifically, we show that for any input string s and µ > 1, the length of the compressed string is bounded by µ·|s|·H_k(s) + log(ζ(µ))·|s| + µ·g_k + O(log n), where H_k is the k-th order empirical entropy, g_k is a constant depending only on k and on the size of the alphabet, and ζ(µ) = 1/1^µ + 1/2^µ + ... is the standard zeta function. As part of the analysis we prove a result on the compressibility of integer sequences, which is of independent interest. Finally, we apply our techniques to prove a worst-case bound on the compression ratio of a compression algorithm based on the Burrows-Wheeler Transform followed by distance coding, for which worst-case guarantees have never been given. We prove that the length of the compressed string is bounded by 1.7286·|s|·H_k(s) + g_k + O(log n). This bound is better than the bound we give for bw0.
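The three steps of bw0 can be sketched directly. A minimal illustration, with an ideal order-0 bit count standing in for the arithmetic coder of step 3; `bwt`, `mtf` and `order0_bits` are illustrative names, and the quadratic rotation sort is didactic, not the linear-time construction used in practice.

```python
# Sketch of the bw0 pipeline: BWT, then move-to-front, then order-0 coding.

import math

def bwt(s):
    """Step 1: last column of the sorted rotation matrix, plus the row
    index of the original string (s must end with a unique '$')."""
    n = len(s)
    rot = sorted(range(n), key=lambda i: s[i:] + s[:i])
    return "".join(s[(i - 1) % n] for i in rot), rot.index(0)

def mtf(s):
    """Step 2: encode each symbol by its rank in a self-adjusting list."""
    table = sorted(set(s))
    out = []
    for c in s:
        r = table.index(c)
        out.append(r)
        table.pop(r)
        table.insert(0, c)
    return out

def order0_bits(seq):
    """Step 3 (idealized): order-0 code length in bits, the quantity an
    arithmetic coder approaches."""
    n = len(seq)
    counts = {}
    for x in seq:
        counts[x] = counts.get(x, 0) + 1
    return sum(c * math.log2(n / c) for c in counts.values())
```

After the BWT, equal symbols cluster, so the MTF output is dominated by small integers, which is what makes the order-0 stage effective.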
On the bit-complexity of Lempel-Ziv compression
Abstract

Cited by 9 (2 self)
One of the most famous and investigated lossless data-compression schemes is the one introduced by Lempel and Ziv about 30 years ago [37]. This compression scheme is known as the “dictionary-based compressor” and consists of squeezing an input string by replacing some of its substrings with (shorter) codewords which are actually pointers to a dictionary of phrases built as the string is processed. Surprisingly enough, although many fundamental results are nowadays known about the speed and effectiveness of this compression process (see e.g. [23, 29] and references therein), “we are not aware of any parsing scheme that achieves optimality when the LZ77-dictionary is in use under any constraint on the codewords other than being of equal length” [29, p. 159]. Here optimality means achieving the minimum number of bits in compressing each individual input string, without any assumption on its generating source. In this paper we investigate three issues pertaining to the bit-complexity of LZ-based compressors, and we design algorithms which achieve bit-optimality in the compressed output size while taking efficient/optimal time and optimal space. These theoretical results are sustained by experiments comparing our novel LZ-based compressors against the most popular compression tools (like gzip and bzip2) and state-of-the-art compressors (like the booster of [13, 12]).
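The parsing process the quote refers to can be illustrated with a greedy LZ77 parser over an unbounded dictionary (the whole processed prefix). This is a quadratic-time didactic sketch of the classical greedy rule, not the bit-optimal parser the paper designs; `lz77_parse` and `lz77_decode` are illustrative names.

```python
# Sketch of greedy LZ77 parsing: at each position, take the longest match
# into the already-processed prefix, emitting (offset, length, next_char)
# triples; offset 0 with length 0 encodes a bare literal.

def lz77_parse(s):
    i, phrases = 0, []
    while i < len(s):
        best_len, best_off = 0, 0
        for j in range(i):                    # candidate match start
            l = 0
            # matches may overlap the current position, as in LZ77
            while i + l < len(s) - 1 and s[j + l] == s[i + l]:
                l += 1
            if l > best_len:
                best_len, best_off = l, i - j
        phrases.append((best_off, best_len, s[i + best_len]))
        i += best_len + 1
    return phrases

def lz77_decode(phrases):
    out = []
    for off, length, c in phrases:
        for _ in range(length):
            out.append(out[-off])             # incremental copy handles overlap
        out.append(c)
    return "".join(out)
```

The bit-complexity question arises because the greedy rule minimizes the number of phrases, not the number of bits once offsets and lengths are encoded with variable-length codewords.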
Move-to-Front, Distance Coding, and Inversion Frequencies Revisited
, 2007
Abstract

Cited by 5 (1 self)
Move-to-Front, Distance Coding and Inversion Frequencies are three somewhat related techniques used to process the output of the Burrows-Wheeler Transform. In this paper we analyze these techniques from the point of view of how effective they are at compressing low-entropy strings, that is, strings which have many regularities and are therefore highly compressible. This is a non-trivial task since many compressors have non-constant overheads that become non-negligible when the input string is highly compressible. Because of the properties of the Burrows-Wheeler transform, being locally optimal ensures that an algorithm compresses low-entropy strings effectively. Informally, local optimality implies that an algorithm is able to effectively compress an arbitrary partition of the input string. We show that in their original formulation neither Move-to-Front, nor Distance Coding, nor Inversion Frequencies is locally optimal. Then, we describe simple variants of the above algorithms which are locally optimal. To achieve local optimality with Move-to-Front it suffices to combine it with Run-Length Encoding. To achieve local optimality with Distance Coding and Inversion Frequencies we use a novel “escape and re-enter” strategy.
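The first of the combinations above (Move-to-Front plus Run-Length Encoding) can be sketched as follows. Encoding only the runs of zero-ranks is one common convention (often called RLE0); `mtf_rle0` and the 'Z' marker are illustrative, not the paper's formulation.

```python
# Sketch of Move-to-Front followed by run-length encoding of zero-runs:
# each maximal run of k zero-ranks becomes a single ('Z', k) pair, so
# long runs of repeated symbols cost O(1) output tokens.

def mtf_rle0(s):
    table = sorted(set(s))
    ranks = []
    for c in s:                       # plain move-to-front pass
        r = table.index(c)
        ranks.append(r)
        table.pop(r)
        table.insert(0, c)
    out, run = [], 0
    for r in ranks:                   # collapse zero-runs
        if r == 0:
            run += 1
        else:
            if run:
                out.append(('Z', run))
                run = 0
            out.append(r)
    if run:
        out.append(('Z', run))
    return out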
Revisiting Bounded Context Block-Sorting Transformations
 Published online in Wiley InterScience (www.interscience.wiley.com). DOI: 10.1002/spe
Abstract
The Burrows-Wheeler Transform (bwt) produces a permutation of a string X, denoted X*, by sorting the n cyclic rotations of X into full lexicographical order and taking the last column of the resulting n × n matrix to be X*. The transformation is reversible in O(n) time. In this paper, we consider an alteration to the process, called k-bwt, where rotations are only sorted to a depth k. We propose new approaches to the forward and reverse transform, and show the methods are efficient in practice. More than a decade ago, two algorithms were independently discovered for reversing k-bwt, both of which run in O(nk) time. Two recent algorithms have lowered the bounds for the reverse transformation to O(n log k) and O(n) respectively. We examine the practical performance of these reversal algorithms. We find the original O(nk) approach is most efficient in practice, and investigate new approaches, aimed at further speeding reversal, which store precomputed context boundaries in the compressed file. By explicitly encoding the context boundaries, we present an O(n) reversal technique that is both efficient and effective. Finally, our study elucidates an inherently cache-friendly – and hitherto unobserved – behaviour in the reverse k-bwt, which could lead to new applications of the k-bwt transform. In contrast to previous empirical studies, we show the partial transform can be reversed significantly faster than the full transform, without significantly affecting compression effectiveness.
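The forward k-bwt admits a compact sketch: rotations are compared only on their first k symbols, with ties left in text order by a stable sort. A didactic, quadratic-space sketch with an illustrative name; for a text with a unique sentinel, depth k = n coincides with the full bwt.

```python
# Sketch of the forward k-bwt: sort rotations to depth k only, breaking
# ties by original rotation order (Python's sort is stable), and take
# the last column of the partially sorted rotation matrix.

def kbwt(s, k):
    n = len(s)
    d = s + s                                  # d[i:i+k] = first k symbols of rotation i
    rot = sorted(range(n), key=lambda i: d[i:i + k])
    return "".join(s[(i - 1) % n] for i in rot)
```

For "banana$", depth 7 gives the full BWT "annb$aa", while depth 1 groups rotations by first symbol only and yields a different, cheaper-to-compute permutation.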
Post-BWT Stages of the ...
"... The lossless BurrowsWheeler compression algorithm has received considerable attention over recent years for both its simplicity and effectiveness. It is based on a permutation of the input sequence − the BurrowsWheeler transformation − which groups symbols with a similar context close together. In ..."
Abstract
The lossless Burrows-Wheeler compression algorithm has received considerable attention over recent years for both its simplicity and effectiveness. It is based on a permutation of the input sequence – the Burrows-Wheeler transformation – which groups symbols with a similar context close together. In the original version, this permutation was followed by a Move-To-Front transformation and a final entropy-coding stage. Later versions used different algorithms placed after the Burrows-Wheeler transformation, since these following stages have a significant influence on the compression rate. This article describes different algorithms and improvements for these post-BWT stages, including a new context-based approach. Results for compression rates are presented together with compression and decompression times on the Calgary corpus, the Canterbury corpus, the large Canterbury corpus and the Lukas 2D 16-bit medical image corpus.