Unbounded Length Contexts for PPM
- The Computer Journal
, 1995
Cited by 103 (7 self)
uses considerably greater computational resources (both time and space). The next section describes the basic PPM compression scheme. Following that we motivate the use of contexts of unbounded length, introduce the new method, and show how it can be implemented using a trie data structure. Then we give some results that demonstrate an improvement of about 6% over the old method. Finally, a recently published and seemingly unrelated compression scheme [2] is related to the unbounded-context idea that forms the essential innovation of PPM*.
1 PPM: Prediction by partial match
The basic idea of PPM is to use the last few characters in the input stream to predict the upcoming one. Models that condition their predictions on a few immediately preceding symbols are called "finite-context" models of order k, where k is the number of preceding symbols used. PPM employs a suite of fixed-order context models with different values of k
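The idea of a suite of finite-context models of order k can be illustrated with a minimal Python sketch. It is a simplification and not the PPM or PPM* algorithm from the paper: it keeps counts for every order from 0 to a hypothetical `max_order` and predicts from the longest context seen so far, without the escape probabilities that PPM uses to blend orders. The class and method names are assumptions made for illustration.

```python
from collections import defaultdict, Counter


class BackoffContextModel:
    """Toy suite of order-0..max_order context models (not full PPM: no escapes)."""

    def __init__(self, max_order=2):
        self.max_order = max_order
        # counts[k][context] counts the symbols that followed each length-k context.
        self.counts = [defaultdict(Counter) for _ in range(max_order + 1)]

    def train(self, text):
        for i, sym in enumerate(text):
            for k in range(self.max_order + 1):
                if i >= k:
                    self.counts[k][text[i - k:i]][sym] += 1

    def predict(self, history):
        """Return symbol probabilities from the longest previously seen context."""
        for k in range(min(self.max_order, len(history)), -1, -1):
            ctx = history[len(history) - k:]
            seen = self.counts[k].get(ctx)
            if seen:
                total = sum(seen.values())
                return {s: c / total for s, c in seen.items()}
        return {}  # nothing has been observed yet


model = BackoffContextModel(max_order=2)
model.train("abracadabra")
print(model.predict("ab"))  # order-2 context "ab" was always followed by "r"
```

When the order-2 context has never been seen, the sketch falls back to order 1 and then order 0. Real PPM instead reserves probability mass for an escape symbol at each order.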
Text Classification and Segmentation Using Minimum Cross-Entropy
, 2000
Cited by 18 (0 self)
Several methods for classifying and segmenting text are described. These are based on ranking text sequences by their cross-entropy calculated using a fixed order character-based Markov model adapted from the PPM text compression algorithm. Experimental results show that the methods are a significant improvement over previously used methods in a number of areas. For example, text can be classified with a very high degree of accuracy by authorship, language, dialect and genre. Highly accurate text segmentation is also possible: the accuracy of the PPM-based Chinese word segmenter is close to 99% on Chinese news text; similarly, a PPM-based method of segmenting text by language achieves an accuracy of over 99%.
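Ranking by cross-entropy can be sketched in a few lines of Python. This is an illustrative assumption rather than the paper's method: it uses a fixed-order character model with add-one smoothing over a hypothetical 256-symbol alphabet in place of PPM's escape mechanism, and the function names and training strings are invented for the example.

```python
import math
from collections import defaultdict, Counter


def train_char_model(text, order=2):
    """Count which character follows each length-`order` context."""
    counts = defaultdict(Counter)
    for i in range(order, len(text)):
        counts[text[i - order:i]][text[i]] += 1
    return counts


def cross_entropy(model, text, order=2, alphabet_size=256):
    """Average bits per character of `text` under `model` (add-one smoothing)."""
    bits = 0.0
    n = 0
    for i in range(order, len(text)):
        seen = model.get(text[i - order:i], Counter())
        p = (seen[text[i]] + 1) / (sum(seen.values()) + alphabet_size)
        bits -= math.log2(p)
        n += 1
    return bits / n if n else float("inf")


def classify(models, text, order=2):
    """Pick the label whose model gives the lowest cross-entropy."""
    return min(models, key=lambda label: cross_entropy(models[label], text, order))


models = {
    "en": train_char_model("the cat sat on the mat and the dog sat on the log"),
    "de": train_char_model("der hund und die katze sitzen auf der matte"),
}
print(classify(models, "the dog sat on the mat"))
```

The label with the lowest cross-entropy is the one whose training text best predicts the input. The same ranking could be applied to candidate segmentations instead of class labels.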
Correcting English text using PPM models
- Proc Data Compression Conference, edited by J.A. Storer and J.H. Reif. IEEE Computer Society Press, Los Alamitos, CA
, 1998
Cited by 13 (7 self)
This paper describes a method for correcting English text using a PPM model. A method that segments words in English text is introduced and is shown to be a significant improvement over previously used methods. A similar technique is also applied as a post-processing stage after pages have been recognized by a state-of-the-art commercial OCR system. We show that the accuracy of the OCR system can be increased from 96.3% to 96.9%, a decrease of about 14 errors per page.
Implementing the Context Tree Weighting Method for Text Compression
- In Data Compression Conference
, 2000
Cited by 13 (2 self)
The context tree weighting method is a universal compression algorithm for FSMX sources. Although it is expected to achieve a good compression ratio in practice, it is difficult to implement, and in many cases an implementation serves only to estimate the compression ratio. Willems and Tjalkens showed a practical implementation that uses conditional probabilities rather than block probabilities, but it applies only to binary alphabet sequences. We extend the method to multi-alphabet sequences and show a simple implementation using PPM techniques. We also propose a method to optimize a parameter of context tree weighting for the binary alphabet case. Experimental results on texts and DNA sequences show that the performance of PPM can be improved by combining it with context tree weighting, and that DNA sequences can be compressed to less than 2.0 bpc.
Nonparametric Entropy Estimation for Stationary Processes and Random Fields, with Applications to English Text
, 1998
Cited by 12 (5 self)
We discuss a family of estimators for the entropy rate of a stationary ergodic process and prove their pointwise and mean consistency under a Doeblin-type mixing condition. The estimators are Cesàro averages of longest match-lengths, and their consistency follows from a generalized ergodic theorem due to Maker. We provide examples of their performance on English text, and we generalize our results to countable alphabet processes and to random fields.
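A rough Python sketch of a match-length entropy estimate follows. The exact form is an assumption, since the abstract does not give the formula: at each position the sketch takes one plus the length of the longest match that starts within the previous `window` symbols, divides by log2 of the window size, averages over positions, and inverts the average. The function names and the naive quadratic search are illustrative only.

```python
import math


def longest_match_length(text, i, window):
    """Longest l such that text[i:i+l] also starts at some j in [i-window, i)."""
    best = 0
    for j in range(max(0, i - window), i):
        l = 0
        while i + l < len(text) and text[j + l] == text[i + l]:
            l += 1
        best = max(best, l)
    return best


def entropy_estimate(text, window):
    """Inverse of the average of (1 + match length) / log2(window), in bits per symbol.

    Assumes window >= 2 and len(text) > window.
    """
    ratios = [
        (1 + longest_match_length(text, i, window)) / math.log2(window)
        for i in range(window, len(text))
    ]
    return len(ratios) / sum(ratios)


periodic = "ab" * 50
prose = "the quick brown fox jumps over the lazy dog while a small grey cat watches"
print(entropy_estimate(periodic, 16), entropy_estimate(prose, 16))
```

Highly repetitive input yields long matches and therefore a low estimate, while less predictable text yields short matches and a higher one. On samples this short the numbers are only indicative.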
Unifying Text Search And Compression - Suffix Sorting, Block Sorting and Suffix Arrays
, 2000
Cited by 6 (0 self)
Today many electronic documents, such as newspaper articles, dictionaries, books, and DNA sequences, are available and stored in databases. Many more documents are found on the Internet and in e-mail. Fast queries over such huge amounts of documents, and compression to reduce the cost of storing or transferring them, are therefore important. In this thesis, a unified method for improving the efficiency of search and compression for huge text data is proposed. All search and compression methods used in this thesis are related to a data structure called the suffix array. The suffix array is a text search data structure, and it is also used in a text compression method called block sorting. Both are promising methods and have been studied extensively. Currently, a data structure called the inverted file is used for queries over huge amounts of documents. Though it is widely used, the query unit is a document in order to reduce disk space to sto...
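As a minimal illustration of the suffix array as a search structure, the Python sketch below builds the array by naively sorting suffixes and finds a pattern with two binary searches. It is not the construction or search method of the thesis; the function names are hypothetical, and the naive build is only suitable for small inputs.

```python
def build_suffix_array(text):
    """Start positions of all suffixes of `text`, in lexicographic order (naive build)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, sa, pattern):
    """Return the sorted start positions of `pattern`, using binary search on `sa`."""
    m = len(pattern)
    # Lower bound: first suffix whose length-m prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    # Upper bound: first suffix whose length-m prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])


text = "banana"
sa = build_suffix_array(text)
print(sa)                                 # [5, 3, 1, 0, 4, 2]
print(find_occurrences(text, sa, "ana"))  # [1, 3]
```

Because all suffixes that share a prefix are adjacent in the sorted order, every occurrence of a pattern lies in one contiguous range of the array.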
The Complexity and Entropy of Literary Styles
, 1996
Cited by 5 (1 self)
Since Shannon's original experiment in 1951, several methods have been applied to the problem of determining the entropy of English text. These methods were based either on prediction by human subjects, or on computer-implemented parametric models for the data, of a certain Markov order. We ask why computer-based experiments almost always yield much higher entropy estimates than the ones produced by humans. We argue that there are two main reasons for this discrepancy. First, the long-range correlations of English text are not captured by Markovian models and, second, computer-based models only take advantage of the text statistics without being able to "understand" the contextual structure and the semantics of the given text. The second question we address is what does the "entropy" of a text say about the author's literary style. In particular, is there an intuitive notion of "complexity of style" that is captured by the entropy? We present preliminary results based on a non-parametric entropy estimation algorithm that offer partial answers to these questions. These results indicate that taking long-range correlations into account significantly improves the entropy estimates. We get an estimate of 1.77 bits-per-character for a one-million-character sample taken from Jane Austen's works. Also comparing the estimates obtained from several different texts provides some insight into the interpretation of the notion of "entropy" when applied to English text rather than to random processes, and the relationship between the entropy and the "literary complexity" of an author's style. Advantages of this entropy estimation method are that it does not require prior training, it is uniformly good over different styles and languages, and it seems to converge reasonably fast.
A note on brain actuated spelling with the Berlin Brain-Computer Interface
- Universal Access in HCI, Part II, HCII 2007
, 2007
Cited by 4 (2 self)
Brain-Computer Interfaces (BCIs) are systems capable of decoding neural activity in real time, thereby allowing a computer application to be directly controlled by the brain. Since the characteristics of such direct brain-to-computer interaction are limited in several aspects, one major challenge in BCI research is intelligent front-end design. Here we present the mental text entry application ‘Hex-o-Spell’ which incorporates principles of Human-Computer Interaction research into BCI feedback design. The system utilises the high visual display bandwidth to help compensate for the extremely limited control bandwidth which operates with only two mental states, where the timing of the state changes encodes most of the information. The display is visually appealing, and control is robust. The effectiveness and robustness of the interface was demonstrated at the CeBIT 2006 (world’s largest IT fair) where two subjects operated the mental text entry system at a speed of up to 7.6 char/min.
Applying Compression to Natural Language Processing
- SPAE : The Corpus of Spoken Professional American-English. I have
, 1997
Cited by 3 (1 self)
A number of powerful modelling techniques have been developed in recent years to compress natural language text. The best of these are adaptive models operating on the character and word level which are able to perform almost as well as humans at predicting text. We show how to apply character based methods to five areas where language modelling is critical, providing novel solutions to each of these problems.
The Markov Expert for Finding Episodes in Time Series
- Mitzenmacher, unpublished technical report
Cited by 3 (0 self)
We describe a domain-independent, unsupervised algorithm for refined segmentation of time series data into meaningful episodes, focusing on the problem of text segmentation. The VOTING EXPERTS algorithm of Cohen et al. [1] achieves results with fairly low rates of error. The MARKOV EXPERT is a new approach that improves the performance of VOTING EXPERTS by further refining those results with votes from an additional expert. The new expert applies a Markov-based segmentation method inspired by the approach of Teahan et al. [2], using the results of VOTING EXPERTS' frequency and entropy experts as a sample corpus from which to draw prefix/suffix frequency data. In contrast with the setting of Teahan et al., in this setting the sample corpus is small and somewhat inaccurate, but despite its errors, it is directly similar to the intended output in terms of non-space characters. The result is a high quality domain-independent segmentation algorithm that performs substantially better than VOTING EXPERTS.

