Distributional Word Clusters vs. Words for Text Categorization (2002)

Abstract:

We study an approach to text categorization that combines distributional clustering of words and a Support Vector Machine (SVM) classifier. This word-cluster representation is computed using the recently introduced Information Bottleneck method, which generates a compact and efficient representation of documents. When combined with the classification power of the SVM, this method yields high performance in text categorization. This novel combination of SVM with word-cluster representation is compared with SVM-based categorization using the simpler bag-of-words (BOW) representation. The comparison is performed over three known datasets. On one of these datasets (the 20 Newsgroups) the method based on word clusters significantly outperforms the word-based representation in terms of categorization accuracy or representation efficiency. On the two other sets (Reuters-21578 and WebKB) the word-based representation slightly outperforms the word-cluster representation. We investigate the potential reasons for this behavior and relate it to structural differences between the datasets.
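To make the two representations concrete, here is a minimal Python sketch of how a document might be encoded as bag-of-words counts versus word-cluster counts. It is an illustration only, not the paper's implementation: the `WORD_TO_CLUSTER` table, the function names, and the example document are all hypothetical, and the hard word-to-cluster assignment is written by hand rather than computed with the Information Bottleneck method. In the approach described above, vectors of either kind would then be passed to an SVM classifier.

```python
from collections import Counter

# Hypothetical hard assignment of words to clusters. In the paper the
# clusters come from the Information Bottleneck method; here they are
# hand-written purely for illustration.
WORD_TO_CLUSTER = {
    "goalie": 0, "puck": 0, "hockey": 0,
    "orbit": 1, "shuttle": 1, "nasa": 1,
}


def bag_of_words(tokens):
    """Bag-of-words representation: one count per distinct word."""
    return Counter(tokens)


def cluster_counts(tokens, word_to_cluster, n_clusters):
    """Word-cluster representation: one count per cluster.

    Words without a cluster assignment are ignored.
    """
    counts = [0] * n_clusters
    for tok in tokens:
        cluster = word_to_cluster.get(tok)
        if cluster is not None:
            counts[cluster] += 1
    return counts


doc = "the goalie stopped the puck".split()
bow = bag_of_words(doc)
clusters = cluster_counts(doc, WORD_TO_CLUSTER, n_clusters=2)
```

The cluster vector has one dimension per cluster rather than one per vocabulary word, which is the sense in which the representation is more compact.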
