The Design and Analysis of Efficient Lossless Data Compression Systems (1993)

by Paul Glor Howard

Results 1 - 10 of 31

Unbounded Length Contexts for PPM

by John G. Cleary, W. J. Teahan, Ian H. Witten - The Computer Journal, 1995
Cited by 103 (7 self)
uses considerably greater computational resources (both time and space). The next section describes the basic PPM compression scheme. Following that we motivate the use of contexts of unbounded length, introduce the new method, and show how it can be implemented using a trie data structure. Then we give some results that demonstrate an improvement of about 6% over the old method. Finally, a recently-published and seemingly unrelated compression scheme [2] is related to the unbounded-context idea that forms the essential innovation of PPM*.

1 PPM: Prediction by partial match

The basic idea of PPM is to use the last few characters in the input stream to predict the upcoming one. Models that condition their predictions on a few immediately preceding symbols are called "finite-context" models of order k, where k is the number of preceding symbols used. PPM employs a suite of fixed-order context models with different values of k
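A single order-k finite-context model, as described in the abstract above, can be sketched as follows. This is only an illustration of one fixed-order model: the function names `train_order_k` and `predict` are hypothetical, and the way PPM combines several orders (escaping to shorter contexts) is not shown.

```python
from collections import Counter, defaultdict


def train_order_k(text, k):
    """Count which symbol follows each length-k context in text."""
    counts = defaultdict(Counter)
    for i in range(k, len(text)):
        counts[text[i - k:i]][text[i]] += 1
    return counts


def predict(counts, context):
    """Relative frequencies of the symbols seen after context (empty if unseen)."""
    seen = counts.get(context, Counter())
    total = sum(seen.values())
    return {symbol: n / total for symbol, n in seen.items()}


counts = train_order_k("abracadabra", 2)
prediction = predict(counts, "ab")  # "ab" is followed by "r" both times it occurs
```

A context that never occurred yields an empty prediction here; a full PPM implementation would instead fall back to a shorter context.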

A Compression-based Algorithm for Chinese Word Segmentation

by W. J. Teahan, Yingying Wen, Rodger McNab, Ian H. Witten - Computational Linguistics
Cited by 48 (7 self)
This paper describes a general scheme for segmenting text by inferring the position of word boundaries, thus supplying a necessary preprocessing step for applications like those mentioned above. Unlike other approaches, which involve a dictionary of legal words and are therefore language-specific, it works by using a corpus of already segmented text for training and thus can easily be retargeted for any language for which a suitable corpus of segmented material is available. To infer word boundaries, a general adaptive text compression technique is used that predicts upcoming characters on the basis of their preceding context. Spaces are inserted into positions where their presence enables the text to be compressed more effectively. This approach means that we can capitalize on existing research in text compression to create good models for word segmentation. To build a segmenter for a new language, the only resource required is a corpus of segmented text to train the compression model...
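The idea of inserting spaces where they make the text cheaper to encode can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the order-1 character model with add-one smoothing stands in for the adaptive compression model the abstract mentions, the exhaustive search over all space placements is only practical for very short strings and is not the authors' search procedure, and the names `train`, `bits`, and `segment` are hypothetical.

```python
import math
from collections import Counter, defaultdict
from itertools import product


def train(segmented_text, k=1):
    """Collect order-k character counts from already segmented text."""
    counts = defaultdict(Counter)
    padded = " " * k + segmented_text
    for i in range(k, len(padded)):
        counts[padded[i - k:i]][padded[i]] += 1
    return counts


def bits(text, counts, k=1, alphabet_size=64):
    """Code length of text in bits under the smoothed order-k model."""
    padded = " " * k + text
    total = 0.0
    for i in range(k, len(padded)):
        seen = counts.get(padded[i - k:i], Counter())
        p = (seen[padded[i]] + 1) / (sum(seen.values()) + alphabet_size)
        total -= math.log2(p)
    return total


def segment(unsegmented, counts, k=1):
    """Try every placement of spaces in a non-empty string; keep the cheapest."""
    best = None
    for gaps in product([False, True], repeat=len(unsegmented) - 1):
        candidate = unsegmented[0]
        for ch, gap in zip(unsegmented[1:], gaps):
            candidate += (" " if gap else "") + ch
        cost = bits(candidate, counts, k)
        if best is None or cost < best[0]:
            best = (cost, candidate)
    return best[1]


counts = train("the cat sat on the mat the cat ate")
result = segment("thecat", counts)
```

Because the unsegmented string is itself one of the candidates, the chosen segmentation never costs more bits than leaving the text untouched.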

On prediction using variable order Markov models

by Ron Begleiter, Ran El-Yaniv, Golan Yona - Journal of Artificial Intelligence Research, 2004
Cited by 42 (1 self)
This paper is concerned with algorithms for prediction of discrete sequences over a finite alphabet, using variable order Markov models. The class of such algorithms is large and in principle includes any lossless compression algorithm. We focus on six prominent prediction algorithms, including Context Tree Weighting (CTW), Prediction by Partial Match (PPM) and Probabilistic Suffix Trees (PSTs). We discuss the properties of these algorithms and compare their performance using real life sequences from three domains: proteins, English text and music pieces. The comparison is made with respect to prediction quality as measured by the average log-loss. We also compare classification algorithms based on these predictors with respect to a number of large protein classification tasks. Our results indicate that a “decomposed” CTW (a variant of the CTW algorithm) and PPM outperform all other algorithms in sequence prediction tasks. Somewhat surprisingly, a different algorithm, which is a modification of the Lempel-Ziv compression algorithm, significantly outperforms all algorithms on the protein classification problems.
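The average log-loss used as the quality measure can be computed as in the sketch below. The predictor interface (a function taking the history and the next symbol and returning a probability) and the names `average_log_loss` and `uniform_predictor` are assumptions for illustration, not the paper's code.

```python
import math


def average_log_loss(sequence, predictor):
    """Mean of -log2 P(x_i | x_1..x_{i-1}) over the sequence, in bits per symbol."""
    total = 0.0
    for i, symbol in enumerate(sequence):
        p = predictor(sequence[:i], symbol)
        total += -math.log2(p)
    return total / len(sequence)


def uniform_predictor(history, symbol, alphabet_size=4):
    """Baseline that ignores the history and spreads probability evenly."""
    return 1.0 / alphabet_size


loss = average_log_loss("ACGTAC", uniform_predictor)  # 2 bits per symbol
```

Any predictor that assigns more than 1/4 probability on average to the symbols that actually occur would score below this 2-bit baseline on a four-letter alphabet.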

Spam filtering using statistical data compression models

by Andrej Bratko, Gordon V. Cormack, Bogdan Filipič, Thomas R. Lynam, Blaž Zupan - Journal of Machine Learning Research, 2006
Cited by 33 (12 self)
Spam filtering poses a special problem in text categorization, of which the defining characteristic is that filters face an active adversary, which constantly attempts to evade filtering. Since spam evolves continuously and most practical applications are based on online user feedback, the task calls for fast, incremental and robust learning algorithms. In this paper, we investigate a novel approach to spam filtering based on adaptive statistical data compression models. The nature of these models allows them to be employed as probabilistic text classifiers based on character-level or binary sequences. By modeling messages as sequences, tokenization and other error-prone preprocessing steps are omitted altogether, resulting in a method that is very robust. The models are also fast to construct and incrementally updateable. We evaluate the filtering performance of two different compression algorithms: dynamic Markov compression and prediction by partial matching. The results of our empirical evaluation indicate that compression models outperform currently established spam filters, as well as a number of methods proposed in previous studies.
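Using a compression model as a classifier amounts to training one model per class and assigning a message to the class whose model encodes it in the fewest bits. The sketch below assumes a simple order-2 character model with add-one smoothing as a stand-in; it is neither dynamic Markov compression nor PPM, and `CharModel` and `classify` are hypothetical names.

```python
import math
from collections import Counter, defaultdict


class CharModel:
    """Order-k character model with add-one smoothing (a stand-in for DMC/PPM)."""

    def __init__(self, k=2, alphabet_size=256):
        self.k = k
        self.alphabet_size = alphabet_size
        self.counts = defaultdict(Counter)

    def update(self, text):
        """Incrementally add the character statistics of text to the model."""
        padded = " " * self.k + text
        for i in range(self.k, len(padded)):
            self.counts[padded[i - self.k:i]][padded[i]] += 1

    def bits(self, text):
        """Code length of text under the model, in bits."""
        padded = " " * self.k + text
        total = 0.0
        for i in range(self.k, len(padded)):
            seen = self.counts.get(padded[i - self.k:i], Counter())
            p = (seen[padded[i]] + 1) / (sum(seen.values()) + self.alphabet_size)
            total -= math.log2(p)
        return total


def classify(message, models):
    """Pick the label whose model compresses the message best."""
    return min(models, key=lambda label: models[label].bits(message))


spam, ham = CharModel(), CharModel()
spam.update("buy cheap pills now buy cheap pills")
ham.update("meeting notes attached see agenda")
models = {"spam": spam, "ham": ham}
label = classify("cheap pills", models)
```

No tokenization step appears anywhere: the models see raw character sequences, and `update` can be called again whenever new labeled messages arrive.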

Text Mining: A new frontier for lossless compression

by Ian H. Witten, Zane Bray, Malika Mahoui, Bill Teahan - In Data Compression Conference, 1999
Cited by 28 (5 self)
This paper aims to promote text compression as a key technology for text mining

Semantically Motivated Improvements for PPM Variants

by Suzanne Bunton - The Computer Journal, 1997
Cited by 23 (3 self)
This paper explains how to significantly improve the compression performance of any PPM variant

Switching between two universal source coding algorithms

by Paul A. J. Volf, Frans M. J. Willems - In Data Compression Conference, 1998
Cited by 21 (1 self)
This paper discusses a switching method which can be used to combine two sequential universal source coding algorithms. The switching method treats these two algorithms as black boxes and can only use their estimates of the probability distributions for the consecutive symbols of the source sequence. Three weighting algorithms based on this switching method are presented. Empirical results show that all three weighting algorithms give a performance better than the performance of the source coding algorithms they combine.

The Context Trees of Block Sorting Compression

by N. Jesper Larsson - In Proceedings of the IEEE Data Compression Conference, Snowbird, Utah, March 30 - April 1, 1998
Cited by 16 (0 self)
The Burrows-Wheeler transform (BWT) and block sorting compression are closely related to the context trees of PPM. The usual approach of treating BWT as merely a permutation is not able to fully exploit this relation. We show that
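For readers unfamiliar with the transform, a naive version of the BWT and its inverse is sketched below. It sorts all rotations explicitly, which is quadratic in space and suitable only for short strings; the `$` end marker and the names `bwt` and `inverse_bwt` are assumptions for illustration, and the context-tree construction of the paper is not reproduced.

```python
def bwt(text, end_marker="$"):
    """Naive Burrows-Wheeler transform: last column of the sorted rotations."""
    assert end_marker not in text
    s = text + end_marker
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(last_column, end_marker="$"):
    """Rebuild the text by repeatedly prepending the last column and sorting."""
    n = len(last_column)
    table = [""] * n
    for _ in range(n):
        table = sorted(last_column[i] + table[i] for i in range(n))
    original = next(row for row in table if row.endswith(end_marker))
    return original[:-1]


transformed = bwt("banana")
restored = inverse_bwt(transformed)
```

Each output symbol is the one that precedes its rotation in the text, so sorting the rotations places symbols that share the same following context next to each other. That grouping is the link to context-based models such as PPM.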

On-Line Stochastic Processes in Data Compression

by Suzanne Bunton, 1996
Cited by 14 (6 self)
The ability to predict the future based upon the past in finite-alphabet sequences has many applications, including communications, data security, pattern recognition, and natural language processing. By Shannon's theory and the breakthrough development of arithmetic coding, any sequence, $a_1 a_2 \cdots a_n$, can be encoded in a number of bits that is essentially equal to the minimal information-lossless codelength, $\sum_i -\log_2 p(a_i \mid a_1 \cdots a_{i-1})$. The goal of universal on-line modeling, and therefore of universal data compression, is to deduce the model of the input sequence $a_1 a_2 \cdots a_n$ that can estimate each $p(a_i \mid a_1 \cdots a_{i-1})$ knowing only $a_1 a_2 \cdots a_{i-1}$ so that the ex...

Spam filtering using compression models

by Andrej Bratko, Bogdan Filipič, 2005
Cited by 13 (2 self)
Spam filtering poses a special problem in text categorization, of which the defining characteristic is that filters face an active adversary, which constantly attempts to evade filtering. Since spam evolves continuously and most practical applications are based on online user feedback, the task calls for fast, incremental and robust learning algorithms. This paper summarizes our experiments for the TREC 2005 spam track, in which we consider the use of adaptive statistical data compression models for the spam filtering task. The nature of these models allows them to be employed as Bayesian text classifiers based on character sequences. Since messages are modeled as sequences of characters, tokenization and other error-prone preprocessing steps are omitted altogether, resulting in a method that is very robust. The models are also fast to construct and incrementally updateable. We present experimental results indicating that compression models perform well in comparison to established spam filters. We also show that the method is extremely robust to noise, which should make such filters difficult to defeat.