Results 1–10 of 17
Reducing the Space Requirement of Suffix Trees
 Software – Practice and Experience, 1999
Abstract

Cited by 117 (10 self)
We show that suffix trees store various kinds of redundant information. We exploit these redundancies to obtain more space efficient representations. The most space efficient of our representations requires 20 bytes per input character in the worst case, and 10.1 bytes per input character on average for a collection of 42 files of different type. This is an advantage of more than 8 bytes per input character over previous work. Our representations can be constructed without extra space, and as fast as previous representations. The asymptotic running times of suffix tree applications are retained. Copyright © 1999 John Wiley & Sons, Ltd. KEY WORDS: data structures; suffix trees; implementation techniques; space reduction
Boosting textual compression in optimal linear time
 Journal of the ACM, 2005
Abstract

Cited by 39 (19 self)
Abstract. We provide a general boosting technique for Textual Data Compression. Qualitatively, it takes a good compression algorithm and turns it into an algorithm with better compression. Extended abstracts related to this article appeared in Proceedings of CPM 2001 and Proceedings of ACM-SIAM SODA 2004, and were combined due to their strong relatedness and complementarity. The work of P. Ferragina was partially supported by the Italian MIUR projects “Algorithms for the Next ...
Universal Lossless Source Coding With the Burrows-Wheeler Transform
 IEEE Transactions on Information Theory, 2002
Abstract

Cited by 38 (3 self)
The Burrows-Wheeler Transform (BWT) is a reversible sequence transformation used in a variety of practical lossless source-coding algorithms. In each, the BWT is followed by a lossless source code that attempts to exploit the natural ordering of the BWT coefficients. BWT-based compression schemes are widely touted as low-complexity algorithms giving lossless coding rates better than those of the Ziv-Lempel codes (commonly known as LZ'77 and LZ'78) and almost as good as those achieved by prediction by partial matching (PPM) algorithms. To date, the coding performance claims have been made primarily on the basis of experimental results. This work gives a theoretical evaluation of BWT-based coding. The main results of this theoretical evaluation include: 1) statistical characterizations of the BWT output on both finite strings and sequences of length n, 2) a variety of very simple new techniques for BWT-based lossless source coding, and 3) proofs of the universality and bounds on the rates of convergence of both new and existing BWT-based codes for finite-memory and stationary ergodic sources. The end result is a theoretical justification and validation of the experimentally derived conclusions: BWT-based lossless source codes achieve universal lossless coding performance that converges to the optimal coding performance more quickly than the rate of convergence observed in Ziv-Lempel style codes and, for some BWT-based codes, within a constant factor of the optimal rate of convergence for finite-memory sources.
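The reversible transform this abstract analyzes can be sketched in a few lines. The following is a minimal, naive illustration of the BWT itself (not any of the codes evaluated in the paper), assuming an end-of-string sentinel '$' that sorts before every other input character:

```python
def bwt(s: str) -> str:
    """Forward BWT: append a sentinel, sort all rotations,
    and emit the last column of the sorted rotation matrix."""
    s = s + "$"
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(r[-1] for r in rotations)

def ibwt(last: str) -> str:
    """Inverse BWT: repeatedly prepend the last column to the
    table and re-sort; the row ending in '$' is the original."""
    table = [""] * len(last)
    for _ in range(len(last)):
        table = sorted(c + row for c, row in zip(last, table))
    row = next(r for r in table if r.endswith("$"))
    return row[:-1]  # strip the sentinel

# Example: bwt("banana") == "annb$aa", and ibwt recovers "banana".
```

This naive version costs O(n^2 log n) time; practical implementations instead build the sorted rotations via a suffix array or suffix tree, as the papers in this listing discuss.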
Modifications of the Burrows and Wheeler Data Compression Algorithm
 Proceedings of the IEEE Data Compression Conference, 1999
Abstract

Cited by 24 (3 self)
In this paper we improve upon these previous results on the BW-algorithm. Based on the context tree model, we consider the specific statistical properties of the data at the output of the BWT. We describe six important properties, three of which have not been described elsewhere. These considerations lead to modifications of the coding method, which in turn improve the coding efficiency. We briefly describe how to compute the BWT with low complexity in time and space, using suffix trees in two different representations. Finally, we present experimental results on the compression rate and running time of our method, and compare these results to previous achievements. More references on the methods described in this paper can be found in [1, 5].
Linear Time Universal Coding and Time Reversal of Tree Sources via FSM Closure
 IEEE Transactions on Information Theory, 2004
Abstract

Cited by 13 (2 self)
Tree models are efficient parametrizations of finite-memory processes, offering potentially significant model cost savings. The information theory literature has focused mostly on redundancy aspects of the universal estimation and coding of these models. In this paper, we investigate representations and supporting data structures for finite-memory processes, as well as the major impact these structures have on the computational complexity of the universal algorithms in which they are used. We first generalize the class of tree models, and then define and investigate the properties of the finite state machine (FSM) closure of a tree, which is the smallest FSM that generates all the processes generated by the tree. The interaction between FSM closures, generalized context trees, and classical data structures such as compact suffix trees brings together the information-theoretic and the computational aspects, leading to an implementation in linear encoding/decoding time of the semi-predictive approach to the Context algorithm, a lossless universal coding scheme in the class of tree models. An optimal context selection rule and the corresponding context transitions are computationally not more expensive than the various steps involved in the implementation of the Burrows-Wheeler transform (BWT) and use, in fact, similar tools. We also present a reversible transform that displays the same "context deinterleaving" feature as the BWT but is naturally based on an optimal context tree. FSM closures are also applied to an investigation of the effect of time reversal on tree models, motivated in part by the following question: When compressing a data sequence using a universal scheme in the class of tree models, can it make a difference whether we read the sequence from...
LIPT: A Lossless Text Transform to improve compression
 In Proceedings of the International Conference on Information Technology: Coding and Computing, Las Vegas, 2001
Abstract

Cited by 9 (2 self)
We propose an approach to develop a dictionary-based reversible lossless text transformation, called LIPT (Length Index Preserving Transform), which can be applied to a source text to improve existing algorithms' ability to compress. In LIPT, the length of the input word and the offset of the word in the dictionary are denoted with letters of the alphabet. Our encoding scheme makes use of the recurrence of words of the same length in the English language to create context in the transformed text that the entropy coders can exploit. LIPT achieves some compression at the preprocessing stage as well, and retains enough context and redundancy for the compression algorithms to give better results. Bzip2 with LIPT gives 5.24% improvement in average BPC over Bzip2 without LIPT, and PPMD with LIPT gives 4.46% improvement in average BPC over PPMD without LIPT, for our test corpus.
PPM performance with BWT complexity: A fast and effective data compression algorithm
 Proceedings of the IEEE, 2000
Abstract

Cited by 8 (0 self)
This paper introduces a new data compression algorithm. The goal underlying this new code design is to achieve a single lossless compression algorithm with the excellent compression ratios of the Prediction by Partial Mapping (PPM) algorithms and the low complexity of codes based on the Burrows-Wheeler Transform (BWT). Like the BWT-based codes, the proposed algorithm requires worst-case O(n) computational complexity and memory; in contrast, the unbounded-context PPM algorithm, called PPM*, requires worst-case O(n^2) computational complexity. Like PPM*, the proposed algorithm allows the use of unbounded contexts. Using standard data sets for comparison, the proposed algorithm achieves compression performance better than that of the BWT-based codes and comparable to that of PPM*. In particular, the proposed algorithm yields an average rate of 2.29 bits per character (bpc) on the Calgary corpus; this result compares favorably with the 2.33 and 2.34 bpc of PPM5 and PPM* (PPM algorithms), the 2.43 bpc of BW94 (the original BWT-based code), and the 3.64 and 2.69 bpc of compress and gzip (popular Unix compression algorithms based on Lempel–Ziv (LZ) coding techniques) on the same data set. The given code does not, however, match the best reported compression performance, 2.12 bpc with PPMZ9, listed on the Calgary corpus results web page at the time of this publication. Results on the Canterbury corpus give a similar relative standing. The proposed algorithm gives an average rate of 2.15 bpc on the Canterbury corpus, while the Canterbury corpus web page gives average rates of 1.99 bpc for PPMZ9, 2.11 bpc for PPM5, 2.15 bpc for PPM7, 2.23 bpc for BZIP2 (a popular BWT-based code), and 3.31 and 2.53 bpc for compress and gzip, respectively. Keywords: Burrows-Wheeler Transform, lossless source coding, prediction by partial mapping algorithm, suffix trees, text compression.
The Burrows-Wheeler Transform: Theory and Practice
 Lecture Notes in Computer Science, 1999
Abstract

Cited by 5 (1 self)
In this paper we describe the Burrows-Wheeler Transform (BWT), a completely new approach to data compression which is the basis of some of the best compressors available today. Although it is easy to intuitively understand why the BWT helps compression, the analysis of BWT-based algorithms requires a careful study of every single algorithmic component. We describe two algorithms which use the BWT and we show that their compression ratio can be bounded in terms of the k-th order empirical entropy of the input string for any k ≥ 0. Intuitively, this means that these algorithms are able to make use of all the regularity which is in the input string.
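The algorithms whose compression ratios are analyzed here follow the BWT with a recoding stage before the final entropy coder. As a hedged sketch of one common such stage, here is move-to-front (MTF) recoding, which turns the BWT's runs of repeated symbols into small integers; the function names are illustrative, not taken from the paper:

```python
def mtf_encode(data: bytes) -> list:
    """Encode each byte as its current rank in a move-to-front list.
    Recently seen bytes sit near the front, so repeats yield small ranks."""
    alphabet = list(range(256))
    ranks = []
    for b in data:
        i = alphabet.index(b)
        ranks.append(i)
        alphabet.pop(i)
        alphabet.insert(0, b)  # move the byte to the front
    return ranks

def mtf_decode(ranks) -> bytes:
    """Invert MTF by replaying the same list updates."""
    alphabet = list(range(256))
    out = bytearray()
    for i in ranks:
        b = alphabet.pop(i)
        out.append(b)
        alphabet.insert(0, b)
    return bytes(out)

# Example: mtf_encode(b"aaa") == [97, 0, 0] — the run collapses to zeros,
# which a zero-order entropy coder then compresses well.
```

A typical BWT-based pipeline is therefore BWT, then MTF (or a variant), then run-length and entropy coding; the bounds discussed above apply to such compositions.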
Space Efficient Linear Time Computation of the Burrows and Wheeler Transformation
 Numbers, Information and Complexity, Festschrift in honour of Rudolf Ahlswede's 60th Birthday, 1999
Abstract

Cited by 4 (1 self)
In this paper, we further improve on [7], and show that a suffix tree based method requires on average about the same amount of space as the non-linear methods mentioned above. The improvement is achieved by exploiting the fact that, in practice, the BW-algorithm processes long input strings in blocks of a limited size (for this reason some researchers use the notion of "Block-Sorting" algorithm). Assuming a maximal block size of 2