Results 1–10 of 18
Unbounded Length Contexts for PPM
The Computer Journal, 1995
Abstract

Cited by 111 (7 self)
uses considerably greater computational resources (both time and space). The next section describes the basic PPM compression scheme. Following that we motivate the use of contexts of unbounded length, introduce the new method, and show how it can be implemented using a trie data structure. Then we give some results that demonstrate an improvement of about 6% over the old method. Finally, a recently published and seemingly unrelated compression scheme [2] is related to the unbounded-context idea that forms the essential innovation of PPM*.

1 PPM: Prediction by partial match

The basic idea of PPM is to use the last few characters in the input stream to predict the upcoming one. Models that condition their predictions on a few immediately preceding symbols are called "finite-context" models of order k, where k is the number of preceding symbols used. PPM employs a suite of fixed-order context models with different values of k
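The suite of fixed-order models and the fallback to shorter contexts can be illustrated with a short sketch (a toy illustration only, with invented class and method names; real PPM additionally assigns explicit escape probabilities, applies exclusions, and drives an arithmetic coder):

```python
from collections import defaultdict

class SimplePPM:
    """Toy sketch of PPM-style prediction: a suite of fixed-order
    context models. Prediction tries the longest seen context first
    and "escapes" to shorter ones when the current context is novel.
    This sketch only normalizes raw counts."""

    def __init__(self, max_order=3):
        self.max_order = max_order
        # counts[k][context][symbol] = times symbol followed context
        self.counts = [defaultdict(lambda: defaultdict(int))
                       for _ in range(max_order + 1)]

    def update(self, history, symbol):
        # Record the symbol under every context of order 0..max_order.
        for k in range(min(self.max_order, len(history)) + 1):
            ctx = history[len(history) - k:]
            self.counts[k][ctx][symbol] += 1

    def predict(self, history):
        # Longest matching context wins; escape to shorter on a miss.
        for k in range(min(self.max_order, len(history)), -1, -1):
            ctx = history[len(history) - k:]
            dist = self.counts[k][ctx]
            if dist:
                total = sum(dist.values())
                return {s: c / total for s, c in dist.items()}
        return {}  # nothing observed yet

model = SimplePPM(max_order=2)
text = "abracadabra"
for i, ch in enumerate(text):
    model.update(text[:i], ch)
```

After training on "abracadabra", the order-2 context "br" has only ever been followed by "a", so `model.predict("br")` puts all its mass on "a"; an unseen context like "xy" escapes down to the order-0 model.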
On prediction using variable order Markov models
JOURNAL OF ARTIFICIAL INTELLIGENCE RESEARCH, 2004
Abstract

Cited by 56 (1 self)
This paper is concerned with algorithms for prediction of discrete sequences over a finite alphabet, using variable order Markov models. The class of such algorithms is large and in principle includes any lossless compression algorithm. We focus on six prominent prediction algorithms, including Context Tree Weighting (CTW), Prediction by Partial Match (PPM) and Probabilistic Suffix Trees (PSTs). We discuss the properties of these algorithms and compare their performance using real-life sequences from three domains: proteins, English text and music pieces. The comparison is made with respect to prediction quality as measured by the average log-loss. We also compare classification algorithms based on these predictors with respect to a number of large protein classification tasks. Our results indicate that a “decomposed” CTW (a variant of the CTW algorithm) and PPM outperform all other algorithms in sequence prediction tasks. Somewhat surprisingly, a different algorithm, which is a modification of the Lempel-Ziv compression algorithm, significantly outperforms all algorithms on the protein classification problems.
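The average log-loss used as the comparison metric is the per-symbol ideal code length, -(1/n) Σᵢ log₂ p(xᵢ | x₁...xᵢ₋₁). A minimal sketch of the metric (the function name is ours):

```python
import math

def average_log_loss(probs):
    """Average log-loss (bits per symbol) of a sequence of predicted
    probabilities p(x_i | x_1..x_{i-1}) that a model assigned to the
    symbols that actually occurred. Lower is better; it equals the
    per-symbol code length an arithmetic coder driven by the model
    would use."""
    return -sum(math.log2(p) for p in probs) / len(probs)

# A model that assigns probability 0.5 to every observed symbol
# needs exactly 1 bit per symbol:
print(average_log_loss([0.5, 0.5, 0.5, 0.5]))  # → 1.0
```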
Second step algorithms in the Burrows-Wheeler compression algorithm
Software Practice and Experience, 2001
Abstract

Cited by 24 (0 self)
In this paper we fix our attention on the second step algorithms of the Burrows-Wheeler compression algorithm, which in the original version is the Move To Front transform. We discuss many of its replacements presented so far, and compare compression results obtained using them. Then we propose a new algorithm that yields a better compression ratio than the previous ones.
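The Move To Front transform mentioned above is simple enough to state in a few lines (a naive sketch; the function name and the default alphabet choice are ours):

```python
def move_to_front(data, alphabet=None):
    """Move To Front transform: each symbol is replaced by its current
    index in a working list, and then moved to the front of that list.
    Applied after a BWT, runs of identical symbols become runs of
    zeros and small indices, which entropy-code well."""
    table = list(alphabet) if alphabet else sorted(set(data))
    out = []
    for ch in data:
        i = table.index(ch)
        out.append(i)
        table.pop(i)
        table.insert(0, ch)
    return out

print(move_to_front("bbbaaac"))  # → [1, 0, 0, 1, 0, 0, 2]
```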
On-Line Stochastic Processes in Data Compression, 1996
Abstract

Cited by 15 (6 self)
The ability to predict the future based upon the past in finite-alphabet sequences has many applications, including communications, data security, pattern recognition, and natural language processing. By Shannon's theory and the breakthrough development of arithmetic coding, any sequence, a_1 a_2 ... a_n, can be encoded in a number of bits that is essentially equal to the minimal information-lossless code length, sum_i -log_2 p(a_i | a_1 ... a_{i-1}). The goal of universal on-line modeling, and therefore of universal data compression, is to deduce the model of the input sequence a_1 a_2 ... a_n that can estimate each p(a_i | a_1 ... a_{i-1}) knowing only a_1 a_2 ... a_{i-1} so that the ex...
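The code-length sum above becomes concrete once any on-line model is plugged in for p(a_i | a_1 ... a_{i-1}); the sketch below uses a simple adaptive order-0 Laplace estimator as a stand-in (the function name and the choice of estimator are ours, not the paper's):

```python
import math
from collections import Counter

def ideal_code_length(seq):
    """Sum of -log2 p(a_i | a_1 ... a_{i-1}) in bits, with a simple
    adaptive order-0 Laplace estimator standing in for the model:
    p(a | prefix) = (count of a in prefix + 1) / (len(prefix) + alphabet size).
    Arithmetic coding achieves this length to within a couple of bits."""
    alphabet_size = len(set(seq))
    counts = Counter()
    total = 0
    bits = 0.0
    for a in seq:
        p = (counts[a] + 1) / (total + alphabet_size)
        bits += -math.log2(p)
        counts[a] += 1
        total += 1
    return bits

# A fully predictable single-symbol sequence costs 0 bits under
# this estimator, since p is always exactly 1:
print(ideal_code_length("aaaa"))  # → 0.0
```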
Improvements to Burrows-Wheeler Compression Algorithm, 2000
Abstract

Cited by 11 (3 self)
In 1994 Burrows and Wheeler presented a new algorithm for lossless data compression. The compression ratio that can be achieved using their algorithm is comparable with that of the best known algorithms, whilst its complexity is relatively small. In this paper we explain the internals of this algorithm and discuss the various modifications of it that have been presented so far. Then we propose new improvements to its effectiveness, which allow us to obtain a compression ratio of 2.271 bpc on the Calgary Corpus files, the best result in the class of Burrows-Wheeler Transform based algorithms.
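The forward transform at the heart of the algorithm can be sketched naively (an O(n² log n) illustration using explicit rotations; production implementations use suffix sorting instead):

```python
def bwt(s, eof="\0"):
    """Naive Burrows-Wheeler Transform sketch: append a unique end
    marker, sort all rotations of the string, and take the last
    column. The output groups together characters that share a
    following context, which is what makes the later stages
    (e.g. Move To Front plus an entropy coder) effective."""
    s = s + eof
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(r[-1] for r in rotations)

print(bwt("banana"))  # → "annb\0aa"
```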
Lossless compression based on the Sequence Memoizer
In Data Compression Conference, 2010
Abstract

Cited by 11 (4 self)
In this work we describe a sequence compression method based on combining a Bayesian nonparametric sequence model with entropy encoding. The model, a hierarchy of Pitman-Yor processes of unbounded depth previously proposed by Wood et al. [2009] in the context of language modelling, allows modelling of long-range dependencies by allowing conditioning contexts of unbounded length. We show that incremental approximate inference can be performed in this model, thereby allowing it to be used in a text compression setting. The resulting compressor reliably outperforms several PPM variants on many types of data, but is particularly effective in compressing data that exhibits power-law properties.
PPM Performance with BWT Complexity: A New Method for Lossless Data Compression
Abstract

Cited by 9 (0 self)
This work combines a new fast context-search algorithm with the lossless source coding models of PPM to achieve a lossless data compression algorithm with the linear context-search complexity and memory of BWT and Ziv-Lempel codes and the compression performance of PPM-based algorithms. Both sequential and nonsequential encoding are considered. The proposed algorithm yields an average rate of 2.27 bits per character (bpc) on the Calgary corpus, comparing favorably to the 2.33 and 2.34 bpc of PPM5 and PPM* and the 2.43 bpc of BW94, but not matching the 2.12 bpc of PPMZ9, which, at the time of this publication, gives the greatest compression of all algorithms reported on the Calgary corpus results page. The proposed algorithm gives an average rate of 2.14 bpc on the Canterbury corpus. The Canterbury corpus web page gives average rates of 1.99 bpc for PPMZ9, 2.11 bpc for PPM5, 2.15 bpc for PPM7, and 2.23 bpc for BZIP2 (a BWT-based code) on the same data set.
A Percolating State Selector for Suffix-Tree Context Models
In Proceedings Data Compression Conference. IEEE Computer, 1997
Abstract

Cited by 5 (3 self)
This paper introduces into practice and empirically evaluates a set of techniques for performing information-theoretic state selection that have been developing in asymptotic results for over a decade. State selection, which actually implements the selection of an entire model from among a set of competing models, is performed at least trivially by all of the suffix-tree FSMs used for on-line probability estimation. The set of state-selection techniques presented here combines orthogonally with the other sets of design options covered in the companion papers, "A Generalization and Improvement to PPM's Blending," and, "An Executable Taxonomy of On-Line Modeling Algorithms," written by this author. The main results of this paper are: a novel dynamic programming solution that does not resort to the suboptimal hill-climbing or global order bounds that are used in other techniques; the successful combination of information-theoretic state selection and mixtures, which include ...
On compression of parse trees, 2001
Abstract

Cited by 5 (0 self)
We consider methods for compressing parse trees, especially techniques based on statistical modeling. We regard the sequence of productions corresponding to a suffix of the path from the root of a tree to a node as the context of that node. The contexts are augmented with branching information of the nodes. By applying the text compression algorithm PPM to such contexts we achieve good compression results. We experimentally compare the PPM approach with other methods.
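The context definition described above can be sketched on a toy tree representation (the (label, children) tuple encoding, the function name, and the max_depth truncation are our assumptions for illustration, not the paper's notation):

```python
def node_contexts(tree, max_depth=3, path=()):
    """For each node of a parse tree, yield (label, context), where
    the context is a suffix (here: the last max_depth entries) of the
    sequence of production labels on the path from the root to the
    node. Trees are encoded as (label, [children]) tuples."""
    label, children = tree
    yield label, path[-max_depth:]  # the node and its path context
    for child in children:
        yield from node_contexts(child, max_depth, path + (label,))

# A toy parse tree: S -> NP VP, NP -> Det N, VP -> V
tree = ("S", [("NP", [("Det", []), ("N", [])]),
              ("VP", [("V", [])])])
for label, ctx in node_contexts(tree, max_depth=2):
    print(label, ctx)
```

Feeding these (context, label) pairs to a PPM-style predictor in place of character contexts is the basic idea; the branching information mentioned in the abstract would be appended to each context.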
An Open Interface for Probabilistic Models of Text
In Data Compression Conference Proceedings, 1999
Abstract

Cited by 4 (2 self)
An Application Program Interface (API) for modelling sequential text is described. The API is intended to shield the user from details of the modelling and probability estimation process. This should enable different implementations of models to be replaced transparently in application programs. The motivation for this API is work on the use of textual models for applications in addition to strict data compression, e.g. determination of the source of text, spelling correction or segmentation of text by inserting spaces. The API is probabilistic: that is, it supplies the probability of the next symbol in the sequence. It is general enough to deal accurately with models that include escapes for probabilities. The concepts abstracted by the API are explained together with details of the API calls.
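A minimal sketch of such a probabilistic interface (the class and method names here are hypothetical, not the API calls the paper specifies):

```python
from abc import ABC, abstractmethod

class TextModel(ABC):
    """Hypothetical sketch of a probabilistic text-model API in the
    spirit the abstract describes: callers see only next-symbol
    probabilities, so model implementations can be swapped
    transparently behind the interface."""

    @abstractmethod
    def update(self, symbol):
        """Feed one observed symbol to the model."""

    @abstractmethod
    def probability(self, symbol):
        """Return p(symbol | everything seen so far)."""

class Order0Model(TextModel):
    """Trivial implementation: adaptive order-0 frequency counts."""

    def __init__(self):
        self.counts = {}
        self.total = 0

    def update(self, symbol):
        self.counts[symbol] = self.counts.get(symbol, 0) + 1
        self.total += 1

    def probability(self, symbol):
        # Laplace smoothing over a byte alphabet, so unseen symbols
        # (the "escape" case the abstract mentions) keep non-zero mass.
        return (self.counts.get(symbol, 0) + 1) / (self.total + 256)

m = Order0Model()
for ch in "aab":
    m.update(ch)
```

An application such as a speller or a text segmenter would hold only a `TextModel` reference, so an order-0 counter, a PPM model, or any other estimator could be substituted without changing the caller.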