Unbounded length contexts for PPM
 in Proc. Data Compression Conf., DCC95
, 1995
Models of English text
, 1997
Cited by 53 (8 self)
The problem of constructing models of English text is considered. A number of applications of such models, including cryptology, spelling correction and speech recognition, are reviewed. The best current models of English text have been the result of research into compression. Not only is this an important application of such models, but the amount of compression provides a measure of how well such models perform. Three main classes of models are considered: character-based models, word-based models, and models which use auxiliary information in the form of parts of speech. These models are compared in terms of their memory usage and compression.
Semantically Motivated Improvements for PPM Variants
 The Computer Journal
, 1997
Cited by 25 (3 self)
This paper explains how to significantly improve the compression performance of any PPM variant
PPM performance with BWT complexity: A fast and effective data compression algorithm
 Proceedings of the IEEE
, 2000
Cited by 10 (0 self)
This paper introduces a new data compression algorithm. The goal underlying this new code design is to achieve a single lossless compression algorithm with the excellent compression ratios of the Prediction by Partial Mapping (PPM) algorithms and the low complexity of codes based on the Burrows-Wheeler Transform (BWT). Like the BWT-based codes, the proposed algorithm requires worst-case O(n) computational complexity and memory; in contrast, the unbounded-context PPM algorithm, called PPM*, requires worst-case O(n^2) computational complexity. Like PPM*, the proposed algorithm allows the use of unbounded contexts. Using standard data sets for comparison, the proposed algorithm achieves compression performance better than that of the BWT-based codes and comparable to that of PPM*. In particular, the proposed algorithm yields an average rate of 2.29 bits per character (bpc) on the Calgary corpus; this result compares favorably with the 2.33 and 2.34 bpc of PPM5 and PPM* (PPM algorithms), the 2.43 bpc of BW94 (the original BWT-based code), and the 3.64 and 2.69 bpc of compress and gzip (popular Unix compression algorithms based on Lempel–Ziv (LZ) coding techniques) on the same data set. The given code does not, however, match the best reported compression performance (2.12 bpc with PPMZ9) listed on the Calgary corpus results web page at the time of this publication. Results on the Canterbury corpus give a similar relative standing. The proposed algorithm gives an average rate of 2.15 bpc on the Canterbury corpus, while the Canterbury corpus web page gives average rates of 1.99 bpc for PPMZ9, 2.11 bpc for PPM5, 2.15 bpc for PPM7, 2.23 bpc for BZIP2 (a popular BWT-based code), and 3.31 and 2.53 bpc for compress and gzip, respectively. Keywords: Burrows-Wheeler Transform, lossless source coding, prediction by partial mapping algorithm, suffix trees, text compression.
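The bits-per-character (bpc) figures quoted above are compressed size in bits divided by input length in characters. The following is a minimal sketch of that measurement, assuming one byte per character and using Python's standard zlib (the LZ77-based method behind gzip) and bz2 (a BWT-based code) modules. The sample text and the helper name `bits_per_character` are illustrative assumptions, and the rates it prints are not comparable to the corpus results reported in the abstract.

```python
import bz2
import zlib


def bits_per_character(original: bytes, compressed: bytes) -> float:
    """Average number of compressed bits spent per input byte."""
    if not original:
        raise ValueError("original must be non-empty")
    return 8 * len(compressed) / len(original)


# Illustrative input only; real comparisons use corpora such as Calgary.
sample = b"the quick brown fox jumps over the lazy dog " * 200

rates = {
    "zlib (LZ77-based)": bits_per_character(sample, zlib.compress(sample, 9)),
    "bz2 (BWT-based)": bits_per_character(sample, bz2.compress(sample, 9)),
}

for name, rate in rates.items():
    print(f"{name}: {rate:.3f} bpc")
```

An uncompressed byte stream costs 8 bpc by this measure, so any value below 8 indicates a saving.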
A Percolating State Selector for Suffix-Tree Context Models
 In Proceedings Data Compression Conference. IEEE Computer
, 1997
Cited by 5 (3 self)
This paper introduces into practice and empirically evaluates a set of techniques for performing information-theoretic state selection that have been developing in asymptotic results for over a decade. State selection, which actually implements the selection of an entire model from among a set of competing models, is performed at least trivially by all of the suffix-tree FSMs used for on-line probability estimation. The set of state-selection techniques presented here combines orthogonally with the other sets of design options covered in the companion papers, "A Generalization and Improvement to PPM's Blending" and "An Executable Taxonomy of On-Line Modeling Algorithms," written by this author. The main results of this paper are: a novel dynamic programming solution that does not resort to the suboptimal hill-climbing or global order bounds that are used in other techniques; the successful combination of information-theoretic state selection and mixtures, which include ...
An Open Interface for Probabilistic Models of Text
 In Data Compression Conference, Proceedings
, 1999
Cited by 5 (2 self)
An Application Program Interface (API) for modelling sequential text is described. The API is intended to shield the user from details of the modelling and probability estimation process. This should enable different implementations of models to be replaced transparently in application programs. The motivation for this API is work on the use of textual models for applications in addition to strict data compression, e.g. determination of the source of text, spelling correction or segmentation of text by inserting spaces. The API is probabilistic: that is, it supplies the probability of the next symbol in the sequence. It is general enough to deal accurately with models that include escapes for probabilities. The concepts abstracted by the API are explained together with details of the API calls.
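As a rough illustration of the kind of interface the abstract describes, the sketch below exposes a model through two calls: one that returns the probability of the next symbol and one that updates the model. The class name `Order0Model`, the method names, and the escape rule are hypothetical choices made for this example and are not the API calls defined in the paper.

```python
import math
from collections import Counter


class Order0Model:
    """Adaptive order-0 byte model with an escape to a uniform fallback.

    Hypothetical interface: probability(symbol) and update(symbol).
    """

    ALPHABET_SIZE = 256

    def __init__(self) -> None:
        self.counts = Counter()
        self.total = 0

    def probability(self, symbol: int) -> float:
        """Probability that the next symbol equals `symbol`."""
        uniform = 1.0 / self.ALPHABET_SIZE
        if self.total == 0:
            return uniform
        distinct = len(self.counts)
        denom = self.total + distinct
        # Escape mass is proportional to the number of distinct symbols seen
        # (an assumption for this sketch) and is spread uniformly.
        escape = distinct / denom
        return self.counts[symbol] / denom + escape * uniform

    def update(self, symbol: int) -> None:
        """Record that `symbol` occurred."""
        self.counts[symbol] += 1
        self.total += 1


def code_length_bits(model: Order0Model, data: bytes) -> float:
    """Ideal code length of `data` under `model`, updating as it goes."""
    bits = 0.0
    for symbol in data:
        bits -= math.log2(model.probability(symbol))
        model.update(symbol)
    return bits


print(f"{code_length_bits(Order0Model(), b'abracadabra'):.2f} bits")
```

Because the caller only sees probabilities, the same `code_length_bits` loop could score text against any model that offers these two calls, which is the kind of transparent replacement the abstract has in mind.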
Combining PPM models using a text mining approach
 In Storer and Cohn [128
, 2001
Cited by 3 (0 self)
This paper introduces a novel switching method which can be used to combine two or more PPM models. The work derives from our earlier work on modelling English and text mining, and the approach takes advantage of both to help improve the compression performance significantly. The performance of the combination of models is at least as good as (and in many cases significantly better than) the best-performing of the individual models.
1 Introduction
The PPM data compression scheme has consistently set the standard in lossless compression of text since it was originally described by Cleary & Witten back in 1984. Moffat's (1990) implementation, PPMC, set the benchmark for over a decade, and currently, an implementation of the PPMD algorithm (Howard, 1993) has the distinction of being the best "all-round" compression scheme (ACT, 2000). Other variations on a very productive research theme include improved blending algorithms (Bunton, 1996), improved escape estimation for the finely tun...
Symbol-driven compression of Burrows-Wheeler transformed text
, 2000
Cited by 3 (0 self)
Despite the enormous growth in storage capacity in recent years, the search for fast and efficient text compression algorithms continues. As processor speed is increasing at a higher rate than disk access time is decreasing, there is now even more reason to store information in a compressed form than there was previously. Prediction by Partial Matching (PPM), first published in 1984, was a significant step forward in the quest for efficient text compression. The Burrows-Wheeler transform (BWT), introduced ten years later, has been the next significant breakthrough; its best implementations rank alongside those of PPM. In most BWT implementations, transformed text is converted to a string of ranks with a move-to-front (MTF) or similar mechanism before being compressed. Ranks are then encoded with an Order model or a hierarchy of such models, with some substrings of repeated ranks encoded as run lengths. Although these rank-based methods perform very well, the transformation to MTF numbers blurs the distinction between individual symbols and is a possible cause of ineffectiveness. Instead of relying on symbol ranking, we examine the problem of modelling the transformed text as a sequence of segments with iid symbols, using three different techniques.
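The conventional BWT-then-MTF pipeline that the abstract takes as its starting point can be sketched as follows. This is a naive illustration that sorts all rotations, assuming a NUL sentinel that does not occur in the input; practical implementations use suffix sorting instead, and the function names here are illustrative.

```python
from typing import List


def bwt(text: str, sentinel: str = "\0") -> str:
    """Burrows-Wheeler transform via sorted rotations (quadratic, for clarity)."""
    if sentinel in text:
        raise ValueError("sentinel must not occur in the input")
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    # The transform is the last column of the sorted rotation matrix.
    return "".join(rotation[-1] for rotation in rotations)


def move_to_front(text: str) -> List[int]:
    """Replace each symbol by its current rank, then move it to rank 0."""
    alphabet = sorted(set(text))
    ranks = []
    for ch in text:
        rank = alphabet.index(ch)
        ranks.append(rank)
        alphabet.insert(0, alphabet.pop(rank))
    return ranks


transformed = bwt("banana")
print(repr(transformed))           # 'annb\x00aa'
print(move_to_front(transformed))  # [1, 3, 0, 3, 3, 3, 0]
```

Note how the rank 3 in the output stands for three different symbols in turn. This is the loss of symbol identity that the abstract points to as a possible weakness of rank-based coding.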
An Executable Taxonomy of On-Line Modeling Algorithms
, 1997
Cited by 2 (1 self)
This paper gives an overview of our decomposition of a group of existing and novel on-line sequence modeling algorithms into component parts. Our decomposition, and its implementation, show that these algorithms can be implemented as a cross product of predominantly independent sets. The result is all of the following: a test bed for executing controlled experiments with algorithm components, a framework that unifies existing techniques and defines novel techniques, and a taxonomy for describing on-line sequence modeling algorithms precisely and completely in a way that enables meaningful comparison. Keywords: data compression, universal coding, on-line stochastic modeling, statistical inference, finite-state automata. A version of this paper appears in Proceedings of the DCC, March 1997. This report contains minor corrections to the DCC97 version. Suzanne Bunton, The University of Washington.
A Generalization and Improvement to PPM's "Blending"
, 1997
Cited by 2 (2 self)
The best-performing method in the data compression literature for computing probability estimates of sequences on-line using a suffix-tree model is the blending technique used by PPM. Blending can be viewed as a bottom-up recursive procedure for computing a mixture, barring one missing term for each level of the recursion, where a mixture is basically a weighted average of several probability estimates. We show by decomposition into an inheritance evaluation time and a mixture weighting function that mixtures generalize the techniques used in PPM variants. Doubly controlled experiments with our executable taxonomy of on-line sequence modeling algorithms and the Calgary Corpus demonstrate the impact of varying inheritance evaluation time, mixture weighting function, and including update exclusion. Keywords: data compression, universal coding, on-line stochastic modeling, statistical inference, finite-state automata. Portions of this paper also appear in Proceedings of the DCC, March ...
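The description of a mixture as a weighted average computed by a bottom-up recursion can be made concrete with a small sketch. It shows a full mixture, not PPM's blending with its missing term per level, and the per-order estimates and escape probabilities are made-up numbers chosen only to exercise the arithmetic. The function names are assumptions for this example.

```python
from typing import List, Sequence, Tuple


def blended_probability(
    estimates: Sequence[float], escapes: Sequence[float], base: float
) -> float:
    """Bottom-up recursion: p = (1 - e_k) * p_k + e_k * (lower-order result).

    `estimates[k]` and `escapes[k]` belong to order k, lowest order first;
    `base` is the fallback estimate below order 0 (e.g. uniform).
    """
    p = base
    for p_k, e_k in zip(estimates, escapes):
        p = (1.0 - e_k) * p_k + e_k * p
    return p


def mixture_weights(escapes: Sequence[float]) -> Tuple[float, List[float]]:
    """Equivalent explicit weights: (weight of base, weights per order)."""
    weights = []
    carried = 1.0
    for e_k in reversed(escapes):  # walk from the highest order down
        weights.append(carried * (1.0 - e_k))
        carried *= e_k
    weights.reverse()
    return carried, weights


# Made-up inputs for orders 0, 1, 2 and a uniform base over 256 symbols.
estimates = [0.1, 0.3, 0.6]
escapes = [0.5, 0.25, 0.2]
base = 1.0 / 256

recursive = blended_probability(estimates, escapes, base)
base_weight, weights = mixture_weights(escapes)
explicit = base_weight * base + sum(w * p for w, p in zip(weights, estimates))
print(recursive, explicit)
```

The two computations agree and the weights sum to one, which is the sense in which the recursion computes a weighted average of the per-order estimates.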