Results 1–10 of 36
Unbounded Length Contexts for PPM
The Computer Journal, 1995
Abstract

Cited by 111 (7 self)
uses considerably greater computational resources (both time and space). The next section describes the basic PPM compression scheme. Following that we motivate the use of contexts of unbounded length, introduce the new method, and show how it can be implemented using a trie data structure. Then we give some results that demonstrate an improvement of about 6% over the old method. Finally, a recently published and seemingly unrelated compression scheme [2] is related to the unbounded-context idea that forms the essential innovation of PPM*. 1. PPM: Prediction by partial match. The basic idea of PPM is to use the last few characters in the input stream to predict the upcoming one. Models that condition their predictions on a few immediately preceding symbols are called "finite-context" models of order k, where k is the number of preceding symbols used. PPM employs a suite of fixed-order context models with different values of k ...
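The finite-context scheme described in this abstract can be sketched in a few lines. This is a toy illustration, not the paper's implementation: the class name, the simple escape-to-shorter-context rule, and the probability floor are all assumptions (real PPM uses explicit escape probabilities and arithmetic coding).

```python
from collections import Counter, defaultdict

class PPMSketch:
    """Toy finite-context predictor: keeps order-0..order-k character
    counts and escapes to shorter contexts when a prediction fails."""

    def __init__(self, max_order=2):
        self.max_order = max_order
        # counts[k][context] -> Counter of next characters
        self.counts = [defaultdict(Counter) for _ in range(max_order + 1)]

    def train(self, text):
        for i, ch in enumerate(text):
            for k in range(self.max_order + 1):
                if i >= k:
                    self.counts[k][text[i - k:i]][ch] += 1

    def predict(self, history, ch):
        # Longest matching context first; "escape" to a shorter context
        # whenever the character was never seen in the current one.
        for k in range(min(self.max_order, len(history)), -1, -1):
            c = self.counts[k].get(history[len(history) - k:])
            if c and c[ch] > 0:
                return c[ch] / sum(c.values())
        return 1e-6  # crude floor for characters never seen at all

model = PPMSketch(max_order=2)
model.train("abracadabra")
p = model.predict("ab", "r")  # 'r' follows "ab" both times it occurs, so p = 1.0
```

The PPM* variant in the paper removes the fixed bound on k, using a trie to track contexts of unbounded length.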
A Compression-based Algorithm for Chinese Word Segmentation
 Computational Linguistics
Abstract

Cited by 56 (7 self)
This paper describes a general scheme for segmenting text by inferring the position of word boundaries, thus supplying a necessary preprocessing step for applications like those mentioned above. Unlike other approaches, which involve a dictionary of legal words and are therefore language-specific, it works by using a corpus of already segmented text for training and thus can easily be retargeted for any language for which a suitable corpus of segmented material is available. To infer word boundaries, a general adaptive text compression technique is used that predicts upcoming characters on the basis of their preceding context. Spaces are inserted into positions where their presence enables the text to be compressed more effectively. This approach means that we can capitalize on existing research in text compression to create good models for word segmentation. To build a segmenter for a new language, the only resource required is a corpus of segmented text to train the compression model...
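The space-insertion step can be illustrated with a deliberately small sketch. Everything concrete here is an assumption for illustration: the paper uses an adaptive PPM model, while this uses a plain order-2 character model with Laplace smoothing over an assumed 27-symbol alphabet, and decides a single candidate position in isolation.

```python
import math
from collections import Counter, defaultdict

def train(corpus, order=2):
    """Count next-character frequencies for each length-2 context."""
    counts = defaultdict(Counter)
    for i in range(order, len(corpus)):
        counts[corpus[i - order:i]][corpus[i]] += 1
    return counts

def code_length(text, counts, order=2, alphabet=27):
    """Bits needed to encode text under the model (Laplace-smoothed)."""
    bits = 0.0
    for i in range(order, len(text)):
        c = counts[text[i - order:i]]
        p = (c[text[i]] + 1) / (sum(c.values()) + alphabet)
        bits += -math.log2(p)
    return bits

# Train on already-segmented text, then keep a candidate space only if it
# lets the passage be coded in fewer bits.
counts = train("the cat sat on the mat the cat ran ")
with_space = code_length("the cat ", counts)
without_space = code_length("thecat ", counts)
keep_space = with_space < without_space  # True for this toy corpus
```

In the paper the decision is made jointly over the whole text by searching for the space placement that minimizes total code length, not greedily per position.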
Augmenting Naive Bayes Classifiers with Statistical Language Models
2003
Abstract

Cited by 47 (0 self)
We augment naive Bayes models with statistical n-gram language models to address shortcomings of the standard naive Bayes text classifier. The result is a generalized naive Bayes classifier.
Text Mining: A new frontier for lossless compression
In Data Compression Conference, 1999
Abstract

Cited by 33 (6 self)
This paper aims to promote text compression as a key technology for text mining
Structures of String Matching and Data Compression
1999
Abstract

Cited by 29 (0 self)
This doctoral dissertation presents a range of results concerning efficient algorithms and data structures for string processing, including several schemes contributing to sequential data compression. It comprises both theoretical results and practical implementations. We study the suffix tree data structure, presenting an efficient representation and several generalizations. This includes augmenting the suffix tree to fully support sliding window indexing (including a practical implementation) in linear time. Furthermore, we consider a variant that indexes naturally word-partitioned data, and present a linear-time construction algorithm for a tree that represents only suffixes starting at word boundaries, requiring space linear in the number of words. By applying our sliding window indexing techniques, we achieve an efficient implementation for dictionary-based compression based on the LZ77 algorithm. Furthermore, considering predictive source...
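The dissertation applies sliding-window indexing to LZ77-style compression, where each phrase is the longest match into a window of recent text. A naive sketch with a brute-force scan standing in for the suffix-tree machinery (the greedy parse and the length-2 match threshold are assumptions):

```python
def lz77_factor(text, window=1024):
    """Greedy LZ77 parse: emit (offset, length) for matches into the
    sliding window, or a literal character when no useful match exists."""
    i, factors = 0, []
    while i < len(text):
        start = max(0, i - window)
        best_len, best_pos = 0, -1
        for j in range(start, i):
            l = 0
            # Overlapping matches (j + l reaching past i) are allowed in LZ77.
            while i + l < len(text) and text[j + l] == text[i + l]:
                l += 1
            if l > best_len:
                best_len, best_pos = l, j
        if best_len >= 2:
            factors.append((i - best_pos, best_len))
            i += best_len
        else:
            factors.append(text[i])
            i += 1
    return factors

parse = lz77_factor("abababab")  # ['a', 'b', (2, 6)]
```

The point of the sliding-window suffix tree is to replace the quadratic scan above with linear-time longest-match queries as the window slides.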
Text categorization using compression models
In Proceedings of DCC-00, IEEE Data Compression Conference, Snowbird, US, 2000
Abstract

Cited by 27 (1 self)
Text categorization, or the assignment of natural language texts to predefined categories based on their content, is of growing importance as the volume of information available on the internet continues to overwhelm us. The use of predefined categories implies a “supervised learning” approach to categorization, where already-classified articles, which ...
Text Classification and Segmentation Using Minimum Cross-Entropy
2000
Abstract

Cited by 25 (0 self)
Several methods for classifying and segmenting text are described. These are based on ranking text sequences by their cross-entropy, calculated using a fixed-order character-based Markov model adapted from the PPM text compression algorithm. Experimental results show that the methods are a significant improvement over previously used methods in a number of areas. For example, text can be classified with a very high degree of accuracy by authorship, language, dialect and genre. Highly accurate text segmentation is also possible: the accuracy of the PPM-based Chinese word segmenter is close to 99% on Chinese news text; similarly, a PPM-based method of segmenting text by language achieves an accuracy of over 99%.
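The ranking scheme reduces to training one character model per class and choosing the class whose model assigns the text the fewest bits per character. A minimal sketch, assuming order-2 counts with Laplace smoothing in place of full PPM (corpora, labels, and alphabet size are toy values):

```python
import math
from collections import Counter, defaultdict

def train(text, order=2):
    counts = defaultdict(Counter)
    for i in range(order, len(text)):
        counts[text[i - order:i]][text[i]] += 1
    return counts

def cross_entropy(text, counts, order=2, alphabet=64):
    """Average bits per character of text under the class model."""
    bits, n = 0.0, 0
    for i in range(order, len(text)):
        c = counts[text[i - order:i]]
        p = (c[text[i]] + 1) / (sum(c.values()) + alphabet)
        bits += -math.log2(p)
        n += 1
    return bits / max(n, 1)

models = {
    "english": train("the quick brown fox jumps over the lazy dog " * 3),
    "digits": train("0123456789 9876543210 0123456789 " * 3),
}
sample = "the lazy fox"
# Classify by minimum cross-entropy across the class models.
best = min(models, key=lambda label: cross_entropy(sample, models[label]))
```

The same machinery extends to the paper's other tasks: segmentation becomes a search over candidate boundary placements for the lowest-cross-entropy sequence.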
Nonparametric Entropy Estimation for Stationary Processes and Random Fields, with Applications to English Text
1998
Abstract

Cited by 15 (5 self)
We discuss a family of estimators for the entropy rate of a stationary ergodic process and prove their pointwise and mean consistency under a Doeblin-type mixing condition. The estimators are Cesàro averages of longest match-lengths, and their consistency follows from a generalized ergodic theorem due to Maker. We provide examples of their performance on English text, and we generalize our results to countable alphabet processes and to random fields.
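The match-length mechanism can be sketched naively: the longest match L_i from position i back into the past grows like log2(i)/H for an ergodic source, so averaging L_i/log2(i) and taking the reciprocal estimates the entropy rate H. The O(n^2) scan and the +1 adjustments below are assumptions made for a short illustration, not the authors' exact estimator.

```python
import math

def longest_match(text, i):
    """Length of the longest prefix of text[i:] that also appears as a
    contiguous substring of the past, text[:i]."""
    l = 0
    while i + l < len(text) and text[i:i + l + 1] in text[:i]:
        l += 1
    return l

def entropy_estimate(text):
    """Reciprocal of the average of (L_i + 1) / log2(i + 1), an
    estimate of the entropy rate in bits per character."""
    total, n = 0.0, 0
    for i in range(2, len(text)):
        total += (longest_match(text, i) + 1) / math.log2(i + 1)
        n += 1
    return n / total

# A highly repetitive string has long matches, so its estimate is near zero.
est = entropy_estimate("ab" * 200)
```

For real use, suffix-tree or suffix-array indexing replaces the quadratic substring scan.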
Correcting English text using PPM models
Proc. Data Compression Conference, edited by J.A. Storer and J.H. Reif. IEEE Computer Society Press, Los Alamitos, CA, 1998
Abstract

Cited by 13 (7 self)
This paper describes a method for correcting English text using a PPM model. A method that segments words in English text is introduced and is shown to be a significant improvement over previously used methods. A similar technique is also applied as a post-processing stage after pages have been recognized by a state-of-the-art commercial OCR system. We show that the accuracy of the OCR system can be increased from 96.3% to 96.9%, a decrease of about 14 errors per page.
On compression-based text classification
In Proc. ECIR-05, 300–314, 2005
Abstract

Cited by 11 (0 self)
Compression-based text classification methods are easy to apply, requiring virtually no preprocessing of the data. Most such methods are character-based, and thus have the potential to automatically capture non-word features of a document, such as punctuation, word stems, and features spanning more than one word. However, compression-based classification methods have drawbacks (such as slow running time), and not all such methods are equally effective. We present the results of a number of experiments designed to evaluate the effectiveness and behavior of different compression-based text classification methods on English text. Among our experiments are some specifically designed to test whether the ability to capture non-word (including super-word) features causes character-based text compression methods to achieve more accurate classification.