Results 1 - 8 of 8
A Compression-based Algorithm for Chinese Word Segmentation
- Computational Linguistics
Cited by 48 (7 self)
This paper describes a general scheme for segmenting text by inferring the position of word boundaries, thus supplying a necessary preprocessing step for applications like those mentioned above. Unlike other approaches, which involve a dictionary of legal words and are therefore language-specific, it works by using a corpus of already segmented text for training and thus can easily be retargeted for any language for which a suitable corpus of segmented material is available. To infer word boundaries, a general adaptive text compression technique is used that predicts upcoming characters on the basis of their preceding context. Spaces are inserted into positions where their presence enables the text to be compressed more effectively. This approach means that we can capitalize on existing research in text compression to create good models for word segmentation. To build a segmenter for a new language, the only resource required is a corpus of segmented text to train the compression model...
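The idea of inserting a space wherever it makes the text cheaper to encode can be illustrated with a much simpler model than the one the paper uses. The sketch below assumes an order-1 character model with add-one smoothing in place of an adaptive compression model; `BigramModel`, `code_length`, and `segment` are hypothetical names chosen for this illustration. With an order-1 context the decision at each gap is local; a higher-order model would need a search over boundary placements.

```python
import math
from collections import Counter


class BigramModel:
    """Order-1 character model with add-one smoothing (a simple stand-in for PPM)."""

    def __init__(self, segmented_text):
        # Count each symbol together with the symbol before it; the text is
        # treated as if it started right after a space.
        self.pair = Counter(zip(" " + segmented_text, segmented_text))
        self.ctx = Counter(" " + segmented_text[:-1])
        self.vocab = len(set(segmented_text) | {" "}) + 1  # +1 slot for unseen symbols

    def cost(self, prev, sym):
        """Bits needed to code `sym` when it follows `prev`."""
        p = (self.pair[(prev, sym)] + 1) / (self.ctx[prev] + self.vocab)
        return -math.log2(p)


def code_length(model, text):
    """Total bits needed to code `text` under the model."""
    total, prev = 0.0, " "
    for sym in text:
        total += model.cost(prev, sym)
        prev = sym
    return total


def segment(model, text):
    """Insert a space at every gap where doing so shortens the code length."""
    out = list(text[:1])
    for prev, cur in zip(text, text[1:]):
        with_space = model.cost(prev, " ") + model.cost(" ", cur)
        if with_space < model.cost(prev, cur):
            out.append(" ")
        out.append(cur)
    return "".join(out)


# Train on a tiny segmented corpus, then segment unspaced text.
model = BigramModel("the cat sat on the mat the cat ate the rat")
print(segment(model, "thecatatetherat"))
```

The only training resource is the segmented corpus passed to `BigramModel`, which mirrors the retargeting argument in the abstract: a different language needs a different corpus, not different code.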
Text Mining: A new frontier for lossless compression
- In Data Compression Conference, 1999
Cited by 28 (5 self)
This paper aims to promote text compression as a key technology for text mining.
Text Classification and Segmentation Using Minimum Cross-Entropy
- 2000
Cited by 18 (0 self)
Several methods for classifying and segmenting text are described. These are based on ranking text sequences by their cross-entropy calculated using a fixed order character-based Markov model adapted from the PPM text compression algorithm. Experimental results show that the methods are a significant improvement over previously used methods in a number of areas. For example, text can be classified with a very high degree of accuracy by authorship, language, dialect and genre. Highly accurate text segmentation is also possible: the accuracy of the PPM-based Chinese word segmenter is close to 99% on Chinese news text; similarly, a PPM-based method of segmenting text by language achieves an accuracy of over 99%.
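Ranking by cross-entropy can be sketched as follows: train one character model per class, measure the average number of bits per character each model needs for the test text, and pick the class with the smallest value. This is a minimal sketch that assumes a fixed-order model with add-one smoothing instead of the PPM-derived model in the paper; `CharModel` and `classify` are hypothetical names.

```python
import math
from collections import Counter


class CharModel:
    """Fixed-order character Markov model with add-one smoothing (not PPM)."""

    def __init__(self, text, order=2):
        self.order = order
        padded = "\0" * order + text  # pad so the first symbols have a context
        self.ngram = Counter(padded[i:i + order + 1] for i in range(len(text)))
        self.ctx = Counter(padded[i:i + order] for i in range(len(text)))
        self.vocab = len(set(text)) + 1  # +1 slot for unseen symbols

    def cross_entropy(self, text):
        """Average bits per character needed to code `text` with this model."""
        padded = "\0" * self.order + text
        bits = 0.0
        for i in range(len(text)):
            ctx = padded[i:i + self.order]
            gram = padded[i:i + self.order + 1]
            p = (self.ngram[gram] + 1) / (self.ctx[ctx] + self.vocab)
            bits -= math.log2(p)
        return bits / max(len(text), 1)


def classify(models, text):
    """Return the label whose model gives the lowest cross-entropy."""
    return min(models, key=lambda label: models[label].cross_entropy(text))


# Two toy "sources" with different character patterns.
models = {"A": CharModel("ab" * 20), "B": CharModel("aabb" * 10)}
print(classify(models, "abababab"))
```

The same ranking applies whether the labels stand for authors, languages, dialects, or genres; only the training text per label changes.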
An Open Interface for Probabilistic Models of Text
- In Data Compression Conference, Proceedings, 1999
Cited by 4 (2 self)
An Application Program Interface (API) for modelling sequential text is described. The API is intended to shield the user from details of the modelling and probability estimation process. This should enable different implementations of models to be replaced transparently in application programs. The motivation for this API is work on the use of textual models for applications in addition to strict data compression, e.g. determination of the source of text, spelling correction or segmentation of text by inserting spaces. The API is probabilistic: that is, it supplies the probability of the next symbol in the sequence. It is general enough to deal accurately with models that include escapes for probabilities. The concepts abstracted by the API are explained together with details of the API calls.
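The abstract does not list the API calls, so the following is only an illustration of the general shape such an interface might take: a model reports the probability of the next symbol and is then told which symbol occurred. All names here (`TextModel`, `prob`, `update`, `UniformModel`, `AdaptiveOrder0`) are hypothetical and are not taken from the paper.

```python
import math
from abc import ABC, abstractmethod
from collections import Counter


class TextModel(ABC):
    """Hypothetical interface: predict the next symbol, then learn from it."""

    @abstractmethod
    def prob(self, symbol: str) -> float:
        """Probability that `symbol` comes next."""

    @abstractmethod
    def update(self, symbol: str) -> None:
        """Tell the model which symbol actually occurred."""


class UniformModel(TextModel):
    """Every symbol is equally likely; nothing is learned."""

    def __init__(self, alphabet_size=256):
        self.n = alphabet_size

    def prob(self, symbol):
        return 1.0 / self.n

    def update(self, symbol):
        pass


class AdaptiveOrder0(TextModel):
    """Symbol frequencies learned on the fly, with add-one smoothing."""

    def __init__(self, alphabet_size=256):
        self.n = alphabet_size
        self.counts = Counter()
        self.total = 0

    def prob(self, symbol):
        return (self.counts[symbol] + 1) / (self.total + self.n)

    def update(self, symbol):
        self.counts[symbol] += 1
        self.total += 1


def code_length(model: TextModel, text: str) -> float:
    """Bits needed to code `text`; works with any TextModel implementation."""
    bits = 0.0
    for symbol in text:
        bits -= math.log2(model.prob(symbol))
        model.update(symbol)
    return bits


print(code_length(UniformModel(), "aaaaaaaa"))
print(code_length(AdaptiveOrder0(), "aaaaaaaa"))
```

Because `code_length` depends only on the interface, swapping one model for another requires no change to the application code, which is the kind of transparency the abstract describes.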
Combining PPM models using a text mining approach
- In Storer and Cohn [128, 2001
Cited by 3 (0 self)
This paper introduces a novel switching method which can be used to combine two or more PPM models. The work derives from our earlier work on modelling English and text mining, and the approach takes advantage of both to help improve the compression performance significantly. The performance of the combination of models is at least as good as (and in many cases significantly better than) the best performing of the individual models. 1 Introduction The PPM data compression scheme has consistently set the standard in lossless compression of text since it was originally described by Cleary & Witten back in 1984. Moffat's (1990) implementation, PPMC, set the benchmark for over a decade, and currently, an implementation of the PPMD algorithm (Howard, 1993) has the distinction of being the best "all-round" compression scheme (ACT, 2000). Other variations on a very productive research theme include improved blending algorithms (Bunton, 1996), improved escape estimation for the finely tun...
Implementing the context tree weighting method for context recognition
- In Proc. Data Compression Conf., Snowbird, UT, Mar. 2004, p. 536
Cited by 1 (0 self)
The context tree weighting method (CTW) is a statistics-based universal data compression algorithm that is capable of achieving superior performance compared to Lempel-Ziv based algorithms [1], [2]. Motivated by this fact, we investigate the usability of CTW for applications involving content recognition. Recently, various authors have explored the application of other data compression algorithms for content recognition, e.g. see [3], [4], [5]. Given a test file that needs to be classified among a set of several reference files that represent different classes, the reference file which leads to the best compression of the test file when both files are appended is selected as the most probable match. Moreover, we modify CTW for content recognition purposes by introducing the concept of context tree freezing after the reference sequence is encoded to avoid learning the memory structure of the appended test sequence. Results show that CTW with the proposed freezing technique achieves a clearly superior performance compared to a wide range of other compression algorithms for content recognition problems such as language recognition, authorship attribution, and DNA data classification. For more details, the reader is referred to the full paper version available at [6].
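The append-and-compress test described above can be sketched with an off-the-shelf compressor. The example below uses Python's `zlib` (a Lempel-Ziv based compressor, not CTW) purely to show the selection rule; `extra_bytes` and `recognise` are hypothetical helper names. Note that `zlib` keeps adapting while it codes the test data, so this sketch does not reproduce the freezing technique.

```python
import zlib


def extra_bytes(reference: bytes, test: bytes) -> int:
    """Additional compressed bytes needed when `test` is appended to `reference`."""
    with_test = len(zlib.compress(reference + test, 9))
    alone = len(zlib.compress(reference, 9))
    return with_test - alone


def recognise(references: dict, test: bytes) -> str:
    """Pick the reference class under which the test data compresses best."""
    return min(references, key=lambda label: extra_bytes(references[label], test))


references = {
    "prose": b"the quick brown fox jumps over the lazy dog. " * 20,
    "digits": b"0123456789 " * 80,
}
print(recognise(references, b"the quick brown fox jumps over the lazy dog. "))
print(recognise(references, b"0123456789 0123456789 0123456789 "))
```

Subtracting the size of the compressed reference isolates the cost attributable to the test data, so references of different lengths can be compared on the same footing.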
Using Language Models for Generic Entity Extraction
- In International Conference on Machine Learning Workshop on Text Mining, 1999
This paper describes the use of statistical language modeling techniques, such as are commonly used for text compression, to extract meaningful, low-level information about the location of semantic tokens, or "entities," in text. We begin by marking up several different token types in training documents (for example, people's names, dates and time periods, phone numbers, and sums of money). We form a language model for each token type and examine how accurately it identifies new tokens. We then apply a search algorithm to insert token boundaries in a way that maximizes compression of the entire test document. The technique can be applied to hierarchically-defined tokens, leading to a kind of "soft parsing" that will, we believe, be able to identify structured items such as references and tables in HTML or plain text, based on nothing more than a few marked-up examples in training documents. 1. INTRODUCTION Text mining is about looking for patterns in text, and may...
Protein is Incompressible
- In Data Compression Conference, 1999
This paper develops these ideas. We begin by reviewing the structure of protein and its relation to DNA. Then we briefly review PPM, a standard compression scheme from which our new method has been developed. Next we describe the new scheme, CP for Compress Protein. Following that, we describe experiments used to evaluate its performance on three genomes, and finally draw some conclusions.

