Text Classification and Segmentation Using Minimum Cross-Entropy (2000)
Citations: 18 (0 self)
BibTeX
@MISC{Teahan00textclassification,
author = {W. J. Teahan},
title = {Text Classification and Segmentation Using Minimum Cross-Entropy},
year = {2000}
}
Abstract
Several methods for classifying and segmenting text are described. These are based on ranking text sequences by their cross-entropy, calculated using a fixed-order, character-based Markov model adapted from the PPM text compression algorithm. Experimental results show that the methods are a significant improvement over previously used methods in a number of areas. For example, text can be classified with a very high degree of accuracy by authorship, language, dialect and genre. Highly accurate text segmentation is also possible -- the accuracy of the PPM-based Chinese word segmenter is close to 99% on Chinese news text; similarly, a PPM-based method of segmenting text by language achieves an accuracy of over 99%.
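
The ranking idea in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: it replaces PPM's escape mechanism with plain add-one smoothing, and the class name `CharMarkovModel`, the function `classify`, the order of 2, the alphabet size of 256, and the two tiny training samples are all assumptions made for illustration. One model is trained per category, and a text is assigned to the category whose model gives it the lowest cross-entropy in bits per character.

```python
import math
from collections import defaultdict


class CharMarkovModel:
    """Fixed-order character model with add-one smoothing.

    Assumption: add-one smoothing stands in for PPM's escape mechanism,
    which this sketch does not implement.
    """

    def __init__(self, order=2, alphabet_size=256):
        self.order = order
        self.alphabet_size = alphabet_size
        self.context_counts = defaultdict(int)  # context -> occurrences
        self.symbol_counts = defaultdict(int)   # (context, char) -> occurrences

    def train(self, text):
        # Pad with spaces so the first characters also have a full context.
        padded = " " * self.order + text
        for i in range(self.order, len(padded)):
            context = padded[i - self.order:i]
            self.context_counts[context] += 1
            self.symbol_counts[(context, padded[i])] += 1

    def cross_entropy(self, text):
        """Average number of bits per character needed to code `text`."""
        if not text:
            return 0.0
        padded = " " * self.order + text
        total_bits = 0.0
        for i in range(self.order, len(padded)):
            context = padded[i - self.order:i]
            seen = self.symbol_counts.get((context, padded[i]), 0)
            total = self.context_counts.get(context, 0)
            prob = (seen + 1) / (total + self.alphabet_size)
            total_bits -= math.log2(prob)
        return total_bits / len(text)


def classify(text, models):
    """Return the label whose model yields the minimum cross-entropy."""
    return min(models, key=lambda label: models[label].cross_entropy(text))


# Hypothetical toy training data; real use would need far larger corpora.
english = CharMarkovModel()
english.train("the quick brown fox jumps over the lazy dog and the cat sat on the mat")
french = CharMarkovModel()
french.train("le renard brun saute par dessus le chien paresseux et le chat est sur le tapis")
models = {"english": english, "french": french}

print(classify("the lazy dog", models))
print(classify("le chien paresseux", models))
```

The same per-category scoring could in principle drive segmentation, by choosing the sequence of boundaries or labels that minimises the total number of bits, but that search is not shown here.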







