Results 1 - 10
of
37
Reducing the Space Requirement of Suffix Trees
- Software – Practice and Experience
, 1999
"... We show that suffix trees store various kinds of redundant information. We exploit these redundancies to obtain more space efficient representations. The most space efficient of our representations requires 20 bytes per input character in the worst case, and 10.1 bytes per input character on average ..."
Abstract
-
Cited by 109 (10 self)
- Add to MetaCart
We show that suffix trees store various kinds of redundant information. We exploit these redundancies to obtain more space efficient representations. The most space efficient of our representations requires 20 bytes per input character in the worst case, and 10.1 bytes per input character on average for a collection of 42 files of different type. This is an advantage of more than 8 bytes per input character over previous work. Our representations can be constructed without extra space, and as fast as previous representations. The asymptotic running times of suffix tree applications are retained. Copyright © 1999 John Wiley & Sons, Ltd. KEY WORDS: data structures; suffix trees; implementation techniques; space reduction
On prediction using variable order Markov models
- JOURNAL OF ARTIFICIAL INTELLIGENCE RESEARCH
, 2004
"... This paper is concerned with algorithms for prediction of discrete sequences over a finite alphabet, using variable order Markov models. The class of such algorithms is large and in principle includes any lossless compression algorithm. We focus on six prominent prediction algorithms, including Cont ..."
Abstract
-
Cited by 42 (1 self)
- Add to MetaCart
This paper is concerned with algorithms for prediction of discrete sequences over a finite alphabet, using variable order Markov models. The class of such algorithms is large and in principle includes any lossless compression algorithm. We focus on six prominent prediction algorithms, including Context Tree Weighting (CTW), Prediction by Partial Match (PPM) and Probabilistic Suffix Trees (PSTs). We discuss the properties of these algorithms and compare their performance using real life sequences from three domains: proteins, English text and music pieces. The comparison is made with respect to prediction quality as measured by the average log-loss. We also compare classification algorithms based on these predictors with respect to a number of large protein classification tasks. Our results indicate that a “decomposed” CTW (a variant of the CTW algorithm) and PPM outperform all other algorithms in sequence prediction tasks. Somewhat surprisingly, a different algorithm, which is a modification of the Lempel-Ziv compression algorithm, significantly outperforms all algorithms on the protein classification problems.
Compact Suffix Array
, 2000
"... Suffix array is a data structure that can be used to index a large text le so that queries of its content can be answered quickly. Basically a suffix array is an array of all suffixes of the text in the lexicographic order. Whether or not a word occurs in the text can be answered in logarithmic time ..."
Abstract
-
Cited by 29 (10 self)
- Add to MetaCart
Suffix array is a data structure that can be used to index a large text le so that queries of its content can be answered quickly. Basically a suffix array is an array of all suffixes of the text in the lexicographic order. Whether or not a word occurs in the text can be answered in logarithmic time by binary search over the suffix array. In this work we present a method to compress a suffix array such that the search time remains logarithmic. Our experiments show that in some cases a suffix array can be compressed by our method such that the total space requirement is about half of the original.
Linear-Time, Incremental Hierarchy Inference for Compression
- Data Compression Conference, Snowbird, Utah, IEEE Computer Society
, 1997
"... this paper, we present three new results that characterize SEQUITUR's computational and compression performance. First, we prove that SEQUITUR operates in time linear in n, the length of the input sequence, despite its ability to build a hierarchy as deep as log(n). Second, we show that a sequence c ..."
Abstract
-
Cited by 27 (3 self)
- Add to MetaCart
this paper, we present three new results that characterize SEQUITUR's computational and compression performance. First, we prove that SEQUITUR operates in time linear in n, the length of the input sequence, despite its ability to build a hierarchy as deep as log(n). Second, we show that a sequence can be compressed incrementally, improving on the non-incremental algorithm that was described by Nevill-Manning et al. (1994), and making on-line compression feasible. Third, we present an intriguing result that emerged during benchmarking; whereas PPMC (Moffat, 1990) outperforms SEQUITUR on most files in the Calgary corpus, SEQUITUR regains the lead when tested on multimegabyte sequences. We make some tentative conclusions about the underlying reasons for this phenomenon, and about the nature of current compression benchmarking.
A Fast Algorithm for Making Suffix Arrays and for Burrows-Wheeler Transformation
- IN PROCEEDINGS OF THE IEEE DATA COMPRESSION CONFERENCE, SNOWBIRD, UTAH, MARCH 30 - APRIL 1
, 1998
"... We propose a fast and memory efficient algorithm for sorting suffixes of a text in lexicographic order. It is important to sort suffixes because an arrayof indexes of suffixes is called suffix array and it is a memory efficient alternative of the suffix tree. Sorting suffixes is also used for the ..."
Abstract
-
Cited by 24 (3 self)
- Add to MetaCart
We propose a fast and memory efficient algorithm for sorting suffixes of a text in lexicographic order. It is important to sort suffixes because an arrayof indexes of suffixes is called suffix array and it is a memory efficient alternative of the suffix tree. Sorting suffixes is also used for the Burrows-Wheeler transformation in the Block Sorting text compression, therefore fast sorting algorithms are desired. We compare
The at Most k-Deep Factor Tree
, 2004
"... We present a new data structure to index strings that is very similar to a su#x tree. ..."
Abstract
-
Cited by 13 (3 self)
- Add to MetaCart
We present a new data structure to index strings that is very similar to a su#x tree.
Improvements to Burrows-Wheeler Compression Algorithm
, 2000
"... In 1994 Burrows and Wheeler presented a new algorithm for lossless data compression. The compression ratio that can be achieved using their algorithm is comparable with the best known other algorithms, whilst its complexity is relatively small. In this paper we explain the internals of this algorith ..."
Abstract
-
Cited by 9 (3 self)
- Add to MetaCart
In 1994 Burrows and Wheeler presented a new algorithm for lossless data compression. The compression ratio that can be achieved using their algorithm is comparable with the best known other algorithms, whilst its complexity is relatively small. In this paper we explain the internals of this algorithm and discuss its various modifications that have been presented so far. Then we propose new improvements for its effectiveness. They allow us for obtaining the compression ratio equal to 2.271 bpc for the Calgary Corpus files, which is the best result in the class of Burrows-Wheeler Transform based algorithms.
Universal Data Compression Based on the Burrows-Wheeler Transformation: Theory and Practice
, 2000
"... ..."
PPM performance with BWT complexity: A fast and effective data compression algorithm
- Proceedings of the IEEE
, 2000
"... This paper introduces a new data compression algorithm. The goal underlying this new code design is to achieve a single lossless compression algorithm with the excellent compression ratios of the Prediction by Partial Mapping (PPM) algorithms and the low complexity of codes based on the Burrows Whee ..."
Abstract
-
Cited by 7 (0 self)
- Add to MetaCart
This paper introduces a new data compression algorithm. The goal underlying this new code design is to achieve a single lossless compression algorithm with the excellent compression ratios of the Prediction by Partial Mapping (PPM) algorithms and the low complexity of codes based on the Burrows Wheeler Transform (BWT). Like the BWT-based codes, the proposed algorithm requires worst case O(n) computational complexity and memory; in contrast, the unbounded-context PPM algorithm, called PPM 3, requires worst case O(n 2) computational complexity. Like PPM 3, the proposed algorithm allows the use of unbounded contexts. Using standard data sets for comparison, the proposed algorithm achieves compression performance better than that of the BWT-based codes and comparable to that of PPM 3. In particular, the proposed algorithm yields an average rate of 2.29 bits per character (bpc) on the Calgary corpus; this result compares favorably with the 2.33 and 2.34 bpc of PPM5 and PPM 3 (PPM algorithms), the 2.43 bpc of BW94 (the original BWT-based code), and the 3.64 and 2.69 bpc of compress and gzip (popular Unix compression algorithms based on Lempel–Ziv (LZ) coding techniques) on the same data set. The given code does not, however, match the best reported compression performance—2.12 bpc with PPMZ9—listed on the Calgary corpus results web page at the time of this publication. Results on the Canterbury corpus give a similar relative standing. The proposed algorithm gives an average rate of 2.15 bpc on the Canterbury corpus, while the Canterbury corpus web page gives average rates of 1.99 bpc for PPMZ9, 2.11 bpc for PPM5, 2.15 bpc for PPM7, 2.23 bpc for BZIP2 (a popular BWT-based code), and 3.31 and 2.53 bpc for compress and gzip, respectively. Keywords—Burrows Wheeler Transform, lossless source coding, prediction by partial mapping algorithm, suffix trees, text compression. I.
Lossless compression based on the Sequence Memoizer
- In Data Compression Conference 2010
, 2010
"... In this work we describe a sequence compression method based on combining a Bayesian nonparametric sequence model with entropy encoding. The model, a hierarchy of Pitman-Yor processes of unbounded depth previously proposed by Wood et al. [2009] in the context of language modelling, allows modelling ..."
Abstract
-
Cited by 7 (3 self)
- Add to MetaCart
In this work we describe a sequence compression method based on combining a Bayesian nonparametric sequence model with entropy encoding. The model, a hierarchy of Pitman-Yor processes of unbounded depth previously proposed by Wood et al. [2009] in the context of language modelling, allows modelling of long-range dependencies by allowing conditioning contexts of unbounded length. We show that incremental approximate inference can be performed in this model, thereby allowing it to be used in a text compression setting. The resulting compressor reliably outperforms several PPM variants on many types of data, but is particularly effective in compressing data that exhibits power law properties. 1

