On compression-based text classification (2005)
| Venue: | In Proc. ECIR-05, 300–314 |
| Citations: | 7 - 0 self |
BibTeX
@INPROCEEDINGS{Marton05oncompression-based,
author = {Yuval Marton and Ning Wu and Lisa Hellerstein},
title = {On compression-based text classification},
booktitle = {Proc. ECIR-05},
pages = {300--314},
year = {2005}
}
Abstract
Compression-based text classification methods are easy to apply, requiring virtually no preprocessing of the data. Most such methods are character-based, and thus have the potential to automatically capture non-word features of a document, such as punctuation, word stems, and features spanning more than one word. However, compression-based classification methods have drawbacks (such as slow running time), and not all such methods are equally effective. We present the results of a number of experiments designed to evaluate the effectiveness and behavior of different compression-based text classification methods on English text. Among our experiments are some specifically designed to test whether the ability to capture non-word (including super-word) features causes character-based text compression methods to achieve more accurate classification.







