On prediction using variable order Markov models
Journal of Artificial Intelligence Research, 2004
Cited by 96 (1 self)
This paper is concerned with algorithms for prediction of discrete sequences over a finite alphabet, using variable order Markov models. The class of such algorithms is large and in principle includes any lossless compression algorithm. We focus on six prominent prediction algorithms, including Context Tree Weighting (CTW), Prediction by Partial Match (PPM) and Probabilistic Suffix Trees (PSTs). We discuss the properties of these algorithms and compare their performance using real-life sequences from three domains: proteins, English text and music pieces. The comparison is made with respect to prediction quality as measured by the average log-loss. We also compare classification algorithms based on these predictors with respect to a number of large protein classification tasks. Our results indicate that a “decomposed” CTW (a variant of the CTW algorithm) and PPM outperform all other algorithms in sequence prediction tasks. Somewhat surprisingly, a different algorithm, which is a modification of the Lempel-Ziv compression algorithm, significantly outperforms all algorithms on the protein classification problems.
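The average log-loss mentioned in this abstract can be illustrated with a small sketch. The code below is not one of the six algorithms the paper compares: it assumes a plain fixed-order Markov predictor with add-one smoothing, sequences given as Python strings, and hypothetical function names (`train_counts`, `average_log_loss`). It only shows how the measure itself, the mean of -log2 P(symbol | context) over a test sequence, is computed.

```python
import math
from collections import Counter, defaultdict


def train_counts(sequence, order):
    """Count how often each symbol follows each context of up to `order` symbols."""
    counts = defaultdict(Counter)
    for i, symbol in enumerate(sequence):
        counts[sequence[max(0, i - order):i]][symbol] += 1
    return counts


def average_log_loss(train, test, order=2):
    """Average log-loss of the test string, in bits per symbol.

    Assumption: a fixed-order model with add-one smoothing stands in for
    the variable order predictors (CTW, PPM, PST, ...) named in the abstract.
    """
    alphabet_size = len(set(train) | set(test))
    counts = train_counts(train, order)
    bits = 0.0
    for i, symbol in enumerate(test):
        seen = counts.get(test[max(0, i - order):i], Counter())
        p = (seen[symbol] + 1) / (sum(seen.values()) + alphabet_size)
        bits -= math.log2(p)
    return bits / len(test)
```

A lower value means the predictor assigns higher probability to what actually occurs, which is why the same quantity also measures the compression rate an ideal coder would achieve with that predictor.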
Spam filtering using statistical data compression models
Journal of Machine Learning Research, 2006
Cited by 71 (12 self)
Spam filtering poses a special problem in text categorization, of which the defining characteristic is that filters face an active adversary, which constantly attempts to evade filtering. Since spam evolves continuously and most practical applications are based on online user feedback, the task calls for fast, incremental and robust learning algorithms. In this paper, we investigate a novel approach to spam filtering based on adaptive statistical data compression models. The nature of these models allows them to be employed as probabilistic text classifiers based on character-level or binary sequences. By modeling messages as sequences, tokenization and other error-prone preprocessing steps are omitted altogether, resulting in a method that is very robust. The models are also fast to construct and incrementally updateable. We evaluate the filtering performance of two different compression algorithms: dynamic Markov compression and prediction by partial matching. The results of our empirical evaluation indicate that compression models outperform currently established spam filters, as well as a number of methods proposed in previous studies.
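The idea of using a compression model as a character-level classifier can be sketched as follows. This is only an illustration under stated assumptions: a simple order-k character model with add-one smoothing replaces the dynamic Markov compression and PPM models evaluated in the paper, and the names `CharModel`, `update`, `code_length` and `classify` are hypothetical. A message is assigned to the class whose model would encode it in the fewest bits.

```python
import math
from collections import Counter, defaultdict


class CharModel:
    """Order-k character model with add-one smoothing.

    Assumption: a stand-in for an adaptive compression model such as
    DMC or PPM; it is not the models used in the paper.
    """

    def __init__(self, order=2, alphabet_size=256):
        self.order = order
        self.alphabet_size = alphabet_size
        self.counts = defaultdict(Counter)

    def update(self, text):
        """Incrementally add the statistics of one more training message."""
        for i, ch in enumerate(text):
            self.counts[text[max(0, i - self.order):i]][ch] += 1

    def code_length(self, text):
        """Number of bits an ideal coder driven by this model would need."""
        bits = 0.0
        for i, ch in enumerate(text):
            seen = self.counts.get(text[max(0, i - self.order):i], Counter())
            p = (seen[ch] + 1) / (sum(seen.values()) + self.alphabet_size)
            bits -= math.log2(p)
        return bits


def classify(message, models):
    """Return the label whose model compresses the message best."""
    return min(models, key=lambda label: models[label].code_length(message))
```

No tokenization is involved: the message is scored character by character, and calling `update` on newly labeled messages is all that is needed to keep a model current.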
A Compression-based Algorithm for Chinese Word Segmentation
Computational Linguistics
Cited by 65 (7 self)
This paper describes a general scheme for segmenting text by inferring the position of word boundaries, thus supplying a necessary preprocessing step for applications like those mentioned above. Unlike other approaches, which involve a dictionary of legal words and are therefore language-specific, it works by using a corpus of already segmented text for training and thus can easily be retargeted for any language for which a suitable corpus of segmented material is available. To infer word boundaries, a general adaptive text compression technique is used that predicts upcoming characters on the basis of their preceding context. Spaces are inserted into positions where their presence enables the text to be compressed more effectively. This approach means that we can capitalize on existing research in text compression to create good models for word segmentation. To build a segmenter for a new language, the only resource required is a corpus of segmented text to train the compression model...
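The rule "insert a space where it makes the text cheaper to encode" can be sketched with a much simpler model than the adaptive compression scheme the abstract refers to. The code below assumes an order-1 (bigram) character model with add-one smoothing, trained on segmented text, and input that uses only characters seen in training; all function names are hypothetical. With an order-1 model each gap can be decided independently, so no search over segmentations is needed; a higher-order model would require one.

```python
import math
from collections import Counter, defaultdict


def train_bigrams(segmented_text):
    """Count character bigrams (spaces included) in already segmented text."""
    counts = defaultdict(Counter)
    for prev, cur in zip(segmented_text, segmented_text[1:]):
        counts[prev][cur] += 1
    return counts, len(set(segmented_text))


def cost(counts, alphabet_size, prev, cur):
    """Bits needed to encode `cur` after `prev` (add-one smoothing)."""
    seen = counts.get(prev, Counter())
    return -math.log2((seen[cur] + 1) / (sum(seen.values()) + alphabet_size))


def segment(text, counts, alphabet_size):
    """Insert a space into each gap where doing so lowers the code length."""
    if not text:
        return text
    out = [text[0]]
    for prev, cur in zip(text, text[1:]):
        joined = cost(counts, alphabet_size, prev, cur)
        split = (cost(counts, alphabet_size, prev, " ")
                 + cost(counts, alphabet_size, " ", cur))
        if split < joined:
            out.append(" ")
        out.append(cur)
    return "".join(out)
```

Retargeting to another language only means calling `train_bigrams` on a different segmented corpus, which mirrors the point the abstract makes about needing no dictionary.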
Text Mining: A new frontier for lossless compression
In Data Compression Conference, 1999
Cited by 37 (7 self)
This paper aims to promote text compression as a key technology for text mining
Switching between two universal source coding algorithms
In Data Compression Conference, 1998
Cited by 27 (1 self)
This paper discusses a switching method which can be used to combine two sequential universal source coding algorithms. The switching method treats these two algorithms as black boxes and can only use their estimates of the probability distributions for the consecutive symbols of the source sequence. Three weighting algorithms based on this switching method are presented. Empirical results show that all three weighting algorithms perform better than the source coding algorithms they combine.
Semantically Motivated Improvements for PPM Variants
The Computer Journal, 1997
Cited by 25 (3 self)
This paper explains how to significantly improve the compression performance of any PPM variant
The Context Trees of Block Sorting Compression
In Proceedings of the IEEE Data Compression Conference, Snowbird, Utah, March 30 - April 1, 1998
Cited by 24 (0 self)
The Burrows-Wheeler transform (BWT) and block sorting compression are closely related to the context trees of PPM. The usual approach of treating BWT as merely a permutation is not able to fully exploit this relation. We show that
Probability estimation for PPM
In Proceedings NZCSRSC'95, 1995. Available from http://www.cs.waikato.ac.nz/wjt
Cited by 21 (1 self)
The state of the art in lossless text compression is the PPM data compression scheme. Two approaches to the problem of selecting the context models used in the scheme are described. One uses an a priori upper bound on the lengths of the contexts, while the other method is unbounded. Several techniques that improve the probability estimation are described, including four new methods: partial update exclusions for the unbounded approach, deterministic scaling, recency scaling and multiple probability estimators. Each of these methods improves the performance for both the bounded and unbounded approaches. In addition, further savings are possible by combining the two approaches.
1 Introduction
The state of the art in lossless text compression is the PPM data compression scheme [1, 4]. PPM, or prediction by partial matching, is an adaptive statistical modeling technique based on blending together different length context models to predict the next character in the input sequence. The sche...
Using compression to identify acronyms in text
Computer Science, University of Waikato, 2000
Cited by 17 (6 self)
Text mining is about looking for patterns in natural language text, and may be defined as the process of analyzing text to extract information from it for particular purposes. In previous work, we claimed that compression is a key technology for text mining, and backed this up with a study that showed how particular kinds of lexical tokens—names,