Text Classification and Segmentation Using Minimum Cross-Entropy
W. J. Teahan
School of Computing and Mathematical Sciences, The Robert Gordon University, Aberdeen, Scotland
Several methods for classifying and segmenting text are described. These are based on ranking text sequences by their cross-entropy, calculated using a fixed-order character-based Markov model adapted from the PPM text compression algorithm. Experimental results show that the methods are a significant improvement over previously used methods in a number of areas. For example, text can be classified with a very high degree of accuracy by authorship, language, dialect and genre. Highly accurate text segmentation is also possible: the accuracy of the PPM-based Chinese word segmenter is close to 99% on Chinese news text, and a PPM-based method of segmenting text by language achieves an accuracy of over 99%.
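The idea of ranking text by cross-entropy can be illustrated with a minimal sketch. It is not the PPM-based model of the paper: it assumes a plain fixed-order character model with add-one smoothing in place of PPM's escape mechanism, and the names `CharMarkovModel`, `classify`, `order` and `alphabet_size` are hypothetical, chosen only for this illustration. One model is trained per class, and a text is assigned to the class whose model codes it in the fewest bits per character.

```python
import math
from collections import defaultdict


class CharMarkovModel:
    """Fixed-order character model (illustrative; add-one smoothing, not PPM)."""

    def __init__(self, order=2, alphabet_size=256):
        self.order = order                  # number of context characters
        self.alphabet_size = alphabet_size  # assumed symbol count for smoothing
        self.counts = defaultdict(lambda: defaultdict(int))
        self.totals = defaultdict(int)

    def train(self, text):
        # Pad with spaces so the first characters also have a full context.
        padded = " " * self.order + text
        for i in range(self.order, len(padded)):
            context = padded[i - self.order:i]
            self.counts[context][padded[i]] += 1
            self.totals[context] += 1

    def cross_entropy(self, text):
        """Average number of bits per character needed to code `text`."""
        padded = " " * self.order + text
        bits = 0.0
        for i in range(self.order, len(padded)):
            context = padded[i - self.order:i]
            count = self.counts.get(context, {}).get(padded[i], 0)
            total = self.totals.get(context, 0)
            # Add-one smoothing: unseen characters keep a non-zero probability.
            probability = (count + 1) / (total + self.alphabet_size)
            bits -= math.log2(probability)
        return bits / max(len(text), 1)


def classify(text, models):
    """Return the label of the model with minimum cross-entropy on `text`."""
    return min(models, key=lambda label: models[label].cross_entropy(text))


# Toy training data, invented for this example only.
english = CharMarkovModel()
english.train("the quick brown fox jumps over the lazy dog and the cat sat on the mat ")
french = CharMarkovModel()
french.train("le renard brun rapide saute par dessus le chien paresseux et le chat est sur le tapis ")

models = {"english": english, "french": french}
label_en = classify("the lazy dog and the cat", models)
label_fr = classify("le chat est sur le tapis", models)
```

The same ranking applies to any set of class labels (author, dialect, genre), given one training text per class. Segmentation follows a similar principle, choosing among candidate segmentations the one with the lowest cross-entropy, but that search is not shown here.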