On-line stochastic processes in data compression (1996)

by S Bunton

Results 1 - 10 of 12

Unbounded Length Contexts for PPM

by John G. Cleary, W. J. Teahan, Ian H. Witten - The Computer Journal, 1995
Abstract - Cited by 103 (7 self)
... uses considerably greater computational resources (both time and space). The next section describes the basic PPM compression scheme. Following that we motivate the use of contexts of unbounded length, introduce the new method, and show how it can be implemented using a trie data structure. Then we give some results that demonstrate an improvement of about 6% over the old method. Finally, a recently-published and seemingly unrelated compression scheme [2] is related to the unbounded-context idea that forms the essential innovation of PPM*.
1 PPM: Prediction by partial match
The basic idea of PPM is to use the last few characters in the input stream to predict the upcoming one. Models that condition their predictions on a few immediately preceding symbols are called "finite-context" models of order k, where k is the number of preceding symbols used. PPM employs a suite of fixed-order context models with different values of k ...
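
As a rough illustration of the "finite-context models of order k" idea in this abstract, the Python sketch below keeps one frequency table per order and looks up the longest context that has been seen before. The class and method names are hypothetical, and the sketch leaves out escape probabilities and the unbounded-context extension of PPM*, so it should not be read as the paper's implementation.

```python
from collections import Counter, defaultdict


class FiniteContextModels:
    """Suite of order-0..max_order context models (hypothetical helper)."""

    def __init__(self, max_order):
        self.max_order = max_order
        # counts[k][context] maps each following symbol to its frequency.
        self.counts = [defaultdict(Counter) for _ in range(max_order + 1)]

    def train(self, text):
        """Record, for every order k, which symbol followed each k-symbol context."""
        for i, symbol in enumerate(text):
            for k in range(self.max_order + 1):
                if i >= k:
                    self.counts[k][text[i - k:i]][symbol] += 1

    def longest_match(self, history):
        """Return (order, counts) for the longest suffix of history seen in training."""
        for k in range(min(self.max_order, len(history)), -1, -1):
            context = history[len(history) - k:]
            if context in self.counts[k]:
                return k, self.counts[k][context]
        return 0, Counter()


if __name__ == "__main__":
    model = FiniteContextModels(max_order=2)
    model.train("abracadabra")
    print(model.longest_match("ab"))
```

A real PPM coder would turn these counts into probabilities and escape to shorter contexts when the upcoming symbol is novel in the current one.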

Semantically Motivated Improvements for PPM Variants

by Suzanne Bunton - The Computer Journal, 1997
Abstract - Cited by 23 (3 self)
This paper explains how to significantly improve the compression performance of any PPM variant

PPM performance with BWT complexity: A fast and effective data compression algorithm

by Michelle Effros - Proceedings of the IEEE, 2000
Abstract - Cited by 7 (0 self)
This paper introduces a new data compression algorithm. The goal underlying this new code design is to achieve a single lossless compression algorithm with the excellent compression ratios of the Prediction by Partial Mapping (PPM) algorithms and the low complexity of codes based on the Burrows Wheeler Transform (BWT). Like the BWT-based codes, the proposed algorithm requires worst case O(n) computational complexity and memory; in contrast, the unbounded-context PPM algorithm, called PPM*, requires worst case O(n^2) computational complexity. Like PPM*, the proposed algorithm allows the use of unbounded contexts. Using standard data sets for comparison, the proposed algorithm achieves compression performance better than that of the BWT-based codes and comparable to that of PPM*. In particular, the proposed algorithm yields an average rate of 2.29 bits per character (bpc) on the Calgary corpus; this result compares favorably with the 2.33 and 2.34 bpc of PPM5 and PPM* (PPM algorithms), the 2.43 bpc of BW94 (the original BWT-based code), and the 3.64 and 2.69 bpc of compress and gzip (popular Unix compression algorithms based on Lempel–Ziv (LZ) coding techniques) on the same data set. The given code does not, however, match the best reported compression performance (2.12 bpc with PPMZ9) listed on the Calgary corpus results web page at the time of this publication. Results on the Canterbury corpus give a similar relative standing. The proposed algorithm gives an average rate of 2.15 bpc on the Canterbury corpus, while the Canterbury corpus web page gives average rates of 1.99 bpc for PPMZ9, 2.11 bpc for PPM5, 2.15 bpc for PPM7, 2.23 bpc for BZIP2 (a popular BWT-based code), and 3.31 and 2.53 bpc for compress and gzip, respectively.
Keywords: Burrows Wheeler Transform, lossless source coding, prediction by partial mapping algorithm, suffix trees, text compression.
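
The rates quoted in this abstract are in bits per character (bpc): eight times the compressed size in bytes, divided by the number of input characters. The Python sketch below computes that figure using the standard-library `zlib` module (the DEFLATE format also used by gzip) purely as a stand-in compressor. The function name is hypothetical, and the sketch does not reproduce any of the algorithms or corpus results compared in the paper.

```python
import zlib


def bits_per_character(data: bytes) -> float:
    """Compressed size in bits divided by the number of input bytes."""
    if not data:
        raise ValueError("data must be non-empty")
    compressed = zlib.compress(data, 9)  # level 9: best compression
    return 8 * len(compressed) / len(data)


if __name__ == "__main__":
    sample = b"the quick brown fox jumps over the lazy dog " * 200
    print(f"{bits_per_character(sample):.3f} bpc")
```

Lower is better: uncompressed 8-bit text sits at 8 bpc, and highly repetitive input approaches 0 bpc.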

A Percolating State Selector for Suffix-Tree Context Models

by Suzanne Bunton - In Proceedings Data Compression Conference. IEEE Computer, 1997
Abstract - Cited by 5 (3 self)
This paper introduces into practice and empirically evaluates a set of techniques for performing information-theoretic state selection that have been developing in asymptotic results for over a decade. State selection, which actually implements the selection of an entire model from among a set of competing models, is performed at least trivially by all of the suffix-tree FSMs used for on-line probability estimation. The set of state-selection techniques presented here combines orthogonally with the other sets of design options covered in the companion papers, "A Generalization and Improvement to PPM's Blending" and "An Executable Taxonomy of On-Line Modeling Algorithms," written by this author. The main results of this paper are:
  • a novel dynamic programming solution that does not resort to the suboptimal hill-climbing or global order bounds that are used in other techniques,
  • the successful combination of information-theoretic state selection and mixtures, which include ...

An Open Interface for Probabilistic Models of Text

by John G. Cleary, W. J. Teahan - In Data Compression Conference, Proceedings, 1999
Abstract - Cited by 4 (2 self)
An Application Program Interface (API) for modelling sequential text is described. The API is intended to shield the user from details of the modelling and probability estimation process. This should enable different implementations of models to be replaced transparently in application programs. The motivation for this API is work on the use of textual models for applications in addition to strict data compression, e.g. determination of the source of text, spelling correction or segmentation of text by inserting spaces. The API is probabilistic: that is, it supplies the probability of the next symbol in the sequence. It is general enough to deal accurately with models that include escapes for probabilities. The concepts abstracted by the API are explained together with details of the API calls.
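
To make the idea of a model-hiding, probabilistic interface concrete, here is a minimal Python sketch. The abstract does not list the actual API calls, so `TextModel`, `probability`, `update`, and both model classes are hypothetical names chosen for illustration. The sketch only shows how an application can consume next-symbol probabilities without knowing which model produced them.

```python
import math
from abc import ABC, abstractmethod


class TextModel(ABC):
    """Hypothetical interface; not the API calls defined in the paper."""

    @abstractmethod
    def probability(self, symbol):
        """Probability that symbol comes next, given the symbols seen so far."""

    @abstractmethod
    def update(self, symbol):
        """Advance the model past symbol."""


class UniformModel(TextModel):
    """Assigns the same probability to every symbol of a fixed alphabet."""

    def __init__(self, alphabet):
        self.alphabet = set(alphabet)

    def probability(self, symbol):
        return 1.0 / len(self.alphabet)

    def update(self, symbol):
        pass  # nothing to learn


class AdaptiveOrder0Model(TextModel):
    """Order-0 symbol counts with add-one smoothing over a fixed alphabet."""

    def __init__(self, alphabet):
        self.counts = {s: 1 for s in alphabet}
        self.total = len(self.counts)

    def probability(self, symbol):
        return self.counts[symbol] / self.total

    def update(self, symbol):
        self.counts[symbol] += 1
        self.total += 1


def code_length_bits(model, text):
    """Ideal code length of text under any TextModel, in bits."""
    bits = 0.0
    for symbol in text:
        bits -= math.log2(model.probability(symbol))
        model.update(symbol)
    return bits


if __name__ == "__main__":
    for model in (UniformModel("ab"), AdaptiveOrder0Model("ab")):
        print(type(model).__name__, code_length_bits(model, "aaaa"))
```

Because `code_length_bits` depends only on the interface, the same function could score candidate sources of a text or candidate segmentations, which are the kinds of applications the abstract mentions.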

Symbol-driven compression of burrows wheeler transformed text

by Anthony Ian Wirth, 2000
Abstract - Cited by 3 (0 self)
Despite the enormous growth in storage capacity in recent years, the search for fast and efficient text compression algorithms continues. As processor speed is increasing at a higher rate than disk access time is decreasing, there is now even more reason to store information in a compressed form than there was previously. Prediction by Partial Matching (PPM), first published in 1984, was a significant step forward in the quest for efficient text compression. The Burrows Wheeler transform (BWT), introduced ten years later, has been the next significant breakthrough; its best implementations rank alongside those of PPM. In most BWT implementations, transformed text is converted to a string of ranks with a move-to-front (MTF) or similar mechanism before being compressed. Ranks are then encoded with an Order-0 model or a hierarchy of such models, with some substrings of repeated ranks encoded as run lengths. Although these rank based methods perform very well, the transformation to MTF numbers blurs the distinction between individual symbols and is a possible cause of ineffectiveness. Instead of relying on symbol ranking, we examine the problem of modelling the transformed text as a sequence of segments with iid symbols, using three different techniques.
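
The conventional pipeline this abstract describes (BWT followed by move-to-front ranking) can be sketched in a few lines of Python. This is a naive illustration with hypothetical function names: the rotation sort is far slower than the suffix-based constructions real implementations use, and the thesis's own symbol-driven segment models are not shown.

```python
def bwt(text, sentinel="\0"):
    """Naive Burrows-Wheeler transform: last column of the sorted rotations."""
    if sentinel in text:
        raise ValueError("sentinel must not occur in text")
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def move_to_front(text, alphabet):
    """Replace each symbol by its current rank, then move it to rank 0."""
    table = list(alphabet)
    ranks = []
    for symbol in text:
        rank = table.index(symbol)
        ranks.append(rank)
        table.insert(0, table.pop(rank))
    return ranks


if __name__ == "__main__":
    transformed = bwt("banana", sentinel="$")
    print(transformed)
    print(move_to_front(transformed, "$abn"))
```

Runs of identical symbols in the BWT output become runs of rank 0, which is why the ranks suit an order-0 coder, and also why the identity of the underlying symbol is no longer visible to that coder.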

Combining PPM models using a text mining approach

by W. J. Teahan, David J. Harper - In Storer and Cohn [128, 2001
Abstract - Cited by 3 (0 self)
This paper introduces a novel switching method which can be used to combine two or more PPM models. The work derives from our earlier work on modelling English and text mining, and the approach takes advantage of both to help improve the compression performance significantly. The performance of the combination of models is at least as good as (and in many cases significantly better than) the best performed of the individual models.
1 Introduction
The PPM data compression scheme has consistently set the standard in lossless compression of text since it was originally described by Cleary & Witten back in 1984. Moffat's (1990) implementation, PPMC, set the benchmark for over a decade, and currently, an implementation of the PPMD algorithm (Howard, 1993) has the distinction of being the best "all-round" compression scheme (ACT, 2000). Other variations on a very productive research theme include improved blending algorithms (Bunton, 1996), improved escape estimation for the finely tun...

A generalization and improvement to PPM's blending

by Suzanne Bunton, 1997
Abstract - Cited by 2 (2 self)
The best-performing method in the data compression literature for computing probability estimates of sequences on-line using a suffix-tree model is the blending technique used by PPM [CW84, Mof90]. Blending can be viewed as a bottom-up recursive procedure for computing a mixture, barring one missing term for each level of the recursion, where a mixture is basically a weighted average of several probability estimates. In [Bun97] we have shown by decomposition into an inheritance weight &{A, B, C, D} and an inheritance evaluation time, Mh, that mixtures generalize the techniques used in DMC variants [CH87], as well as PPM variants, and thus these techniques, along with other variants of mixtures, are interchangeable. Table 1 shows the relative effectiveness of most combinations of mixture weighting functions and inheritance evaluation times. Table 2 is a study on the value of using update exclusion, especially in models using state selection.
Table 1: How average compression performance on the Calgary Corpus as a whole is affected by varying mixture inheritance times and mixture weight functions, in models with and without (percolating) state selection.
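
The contrast between a full mixture and PPM-style blending can be sketched as follows. This is one reading of the abstract's "one missing term for each level", under stated assumptions: the escape weight is a PPMC-style estimate, no exclusion is applied, and all function names are hypothetical. It is not the paper's generalized formulation with inheritance weights and evaluation times.

```python
def escape_weight(counts):
    """Assumed escape estimate (PPMC-style): distinct / (total + distinct)."""
    distinct = len(counts)
    total = sum(counts.values())
    return distinct / (total + distinct)


def full_mixture(symbol, levels, alphabet_size):
    """Weighted average over all orders; levels is ordered lowest order first."""
    estimate = 1.0 / alphabet_size  # uniform base estimate below order 0
    for counts in levels:
        if not counts:
            continue  # context never seen: inherit the estimate unchanged
        w = escape_weight(counts)
        local = counts.get(symbol, 0) / sum(counts.values())
        estimate = (1.0 - w) * local + w * estimate
    return estimate


def ppm_blend(symbol, levels, alphabet_size):
    """Same recursion, but the inherited term is dropped when the symbol was seen."""
    estimate = 1.0 / alphabet_size
    for counts in levels:
        if not counts:
            continue
        w = escape_weight(counts)
        if symbol in counts:
            estimate = (1.0 - w) * counts[symbol] / sum(counts.values())
        else:
            estimate = w * estimate  # escape to the lower-order estimate
    return estimate


if __name__ == "__main__":
    # Symbol counts for the order-0 context and one order-1 context.
    levels = [{"a": 3, "b": 1}, {"a": 1}]
    for s in "abcd":
        print(s, full_mixture(s, levels, 4), ppm_blend(s, levels, 4))
```

In this sketch the full mixture sums to one over the alphabet, while the blended estimates sum to less than one because of the dropped terms; PPM implementations typically recover some of that mass with exclusion, which is omitted here.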

An Executable Taxonomy of On-Line Modeling Algorithms

by Suzanne Bunton, 1997
Abstract - Cited by 2 (1 self)
This paper gives an overview of our decomposition of a group of existing and novel on-line sequence modeling algorithms into component parts. Our decomposition, and its implementation, show that these algorithms can be implemented as a cross product of predominantly independent sets. The result is all of the following: a test bed for executing controlled experiments with algorithm components, a framework that unifies existing techniques and defines novel techniques, and a taxonomy for describing on-line sequence modeling algorithms precisely and completely in a way that enables meaningful comparison.
Keywords: data compression, universal coding, on-line stochastic modeling, statistical inference, finite-state automata
Footnote 1: A version of this paper appears in Proceedings of the DCC, March 1997. This report contains minor corrections to the DCC97 version.
Author affiliation: The University of Washington

A Characterization of the Dynamic Markov Compression FSM with Finite Conditioning Contexts

by Suzanne Bunton, 1994
Abstract - Cited by 1 (1 self)
[Figure 2: Observable Structure in DMC Models. Legend entries: suffix(s_i), prefix(s_i); model M_6.]
For any state s_i, suffix(s_i) is the original destination of the transition that was redirected to s_i when s_i was created; prefix(s_i) is the source of the transition which was redirected to s_i, when s_i was added to the model; and symbol(s_i) labels the transition that was originally redirected to s_i, and any subsequently added transitions into s_i. The context of s_i, context(s_i), labels each state. The non-reflexive transitions of model M_6, pictured in Figure 1, are omitted. However, the reflexive transitions of M_6 are included here to illustrate the consistent substructures they define in the DMC model. There are always |A| reflexive transitions in the model. (Here A = {a, b, c}.) When a reflexive transition is redirected by cloning, the newly added state will have a reflexive transition with the same symbol. For an...