The power of word clusters for text classification
In 23rd European Colloquium on Information Retrieval Research, 2001
Cited by 53 (6 self)
The recently introduced Information Bottleneck method [21] provides an information-theoretic framework for extracting features of one variable that are relevant to the values of another variable. Several previous works have suggested applying this method to document clustering, gene expression data analysis, spectral analysis, and more. In this work we present a novel implementation of this method for supervised text classification. Specifically, we apply the information bottleneck method to find word clusters that preserve the information about document categories and use these clusters as features for classification. Previous work [1] used a similar clustering procedure to show that word clusters can significantly reduce the feature space dimensionality, with only a minor change in classification accuracy. In this work we present similar results and go further to show that when the training sample is small, word clusters can yield a significant improvement in classification accuracy over the performance using the words directly.
Using Unlabeled Data to Improve Text Classification
2001
Cited by 41 (0 self)
One key difficulty with text classification learning algorithms is that they require many hand-labeled examples to learn accurately. This dissertation demonstrates that supervised learning algorithms that use a small number of labeled examples and many inexpensive unlabeled examples can create high-accuracy text classifiers. By assuming that documents are created by a parametric generative model, Expectation-Maximization (EM) finds local maximum a posteriori models and classifiers from all the data -- labeled and unlabeled. These generative models do not capture all the intricacies of text; however on some domains this technique substantially improves classification accuracy, especially when labeled data are sparse. Two problems arise from this basic approach. First, unlabeled data can hurt performance in domains where the generative modeling assumptions are too strongly violated. In this case the assumptions can be made more representative in two ways: by modeling sub-topic class structure, and by modeling super-topic hierarchical class relationships. By doing so, model probability and classification accuracy come into correspondence, allowing unlabeled data to improve classification performance. The second problem is that even with a representative model, the improvements given by unlabeled data do not sufficiently compensate for a paucity of labeled data. Here, limited labeled data provide EM initializations that lead to low-probability models. Performance can be significantly improved by using active learning to select high-quality initializations, and by using alternatives to EM that avoid low-probability local maxima.
ifile: An Application of Machine Learning to E-Mail Filtering
Proc. KDD Workshop on Text Mining, 2000
Cited by 35 (0 self)
The rise of the World Wide Web and the ever-increasing amounts of machine-readable text have caused text classification to become an important aspect of machine learning. One specific application that has the potential to affect almost every user of the Internet is e-mail filtering. The WorldTalk Corporation estimates that over 60 million business people use e-mail [6]. Many more use e-mail purely on a personal basis, and the pool of e-mail users is growing daily. And yet, automated techniques for learning to filter e-mail have yet to significantly affect the e-mail market. Here, I attack problems that plague practical e-mail filtering and suggest solutions that will bring us closer to the acceptance of using automated classification techniques to filter personal e-mail. I also present a filtering system, ifile, that is both effective and efficient, and which has been adapted to a popular e-mail client. Results are presented from a number of experiments and show that a system such as ifile could become a...
Translingual Information Retrieval: Learning from Bilingual Corpora
Artificial Intelligence, 1997
Cited by 34 (5 self)
Translingual information retrieval (TLIR) consists of providing a query in one language and searching document collections in one or more different languages. This paper introduces new TLIR methods and reports on comparative TLIR experiments with these new methods and with previously reported ones in a realistic setting. Methods fall into two categories: query translation and statistical-IR approaches establishing translingual associations. The results show that using bilingual corpora for automated extraction of term equivalences in context outperforms dictionary-based methods. Translingual versions of the Generalized Vector Space Model (GVSM) and Latent Semantic Indexing (LSI) also perform well, as do translingual pseudo relevance feedback (PRF) and Example-Based Term-in-context Translation (EBT). All showed relatively small performance loss between monolingual and translingual versions, ranging from 87% to 101% of monolingual IR performance. Query translation based on a general...
Text Clustering Based on Background Knowledge
2003
Cited by 18 (4 self)
Text document clustering plays an important role in providing intuitive navigation and browsing mechanisms by organizing large amounts of information into a small number of meaningful clusters. Standard partitional or agglomerative clustering methods efficiently compute results to this end.
Intelligent Document Classification
Intelligent Data Analysis, 2000
Cited by 15 (4 self)
In this work we investigate some technical questions related to the application of neural networks in document classification. First, we discuss the effects of different averaging protocols for the χ² statistic used to remove non-informative terms. This is an especially relevant issue for the neural network technique, which requires an aggressive dimensionality reduction to be feasible. Second, we estimate the importance of performance fluctuations due to inherent randomness in the training process of a neural network, a point not properly addressed in previous works. Finally, we compare the neural network results with those obtained using the best methods for this application. For this we optimize the network architecture by evaluating much larger nets than previously considered in similar studies in the literature.
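As a rough illustration of the term-scoring step mentioned in this abstract, the sketch below computes a chi-square score for one term and one class from a 2x2 document-count table, then combines per-class scores in two ways. The function names and the two aggregation rules (a prior-weighted average and a maximum) are assumptions chosen for illustration; the abstract does not say which averaging protocols the paper compares.

```python
def chi_square(a: int, b: int, c: int, d: int) -> float:
    """Chi-square score for one (term, class) pair.

    a: documents in the class that contain the term
    b: documents outside the class that contain the term
    c: documents in the class that lack the term
    d: documents outside the class that lack the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        # A degenerate table (an empty row or column) carries no evidence.
        return 0.0
    return n * (a * d - c * b) ** 2 / denom


def average_chi_square(per_class_scores, class_priors):
    """Hypothetical protocol 1: average of per-class scores weighted by class priors."""
    return sum(p * s for p, s in zip(class_priors, per_class_scores))


def max_chi_square(per_class_scores):
    """Hypothetical protocol 2: keep the best per-class score."""
    return max(per_class_scores)
```

Terms would then be ranked by the combined score and the lowest-scoring ones dropped; the cutoff is a tuning choice not specified here.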
On the use of Bernoulli mixture models for text classification
Pattern Recognition, 2001
Cited by 11 (4 self)
Mixture modelling of class-conditional densities is a standard pattern recognition technique. Although most research on mixture models has concentrated on mixtures for continuous data, emerging pattern recognition applications demand extending research efforts to other data types. This paper focuses on the application of mixtures of multivariate Bernoulli distributions to binary data. More concretely, a text classification task aimed at improving language modelling for machine translation is considered.
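To make the model in this abstract concrete, here is a minimal sketch of the log-likelihood of one binary document vector under a mixture of multivariate Bernoulli distributions. The names `weights` and `prototypes` are illustrative assumptions, not taken from the paper, and parameter estimation (typically done with EM) is not shown.

```python
import math


def bernoulli_log_prob(x, p):
    """Log-probability of binary vector x under independent Bernoulli parameters p.

    Each p[d] must lie strictly between 0 and 1, otherwise math.log fails.
    """
    return sum(math.log(pd) if xd else math.log(1.0 - pd) for xd, pd in zip(x, p))


def mixture_log_likelihood(x, weights, prototypes):
    """log sum_k weights[k] * Bernoulli(x; prototypes[k]), computed via log-sum-exp."""
    logs = [math.log(w) + bernoulli_log_prob(x, p)
            for w, p in zip(weights, prototypes)]
    m = max(logs)  # subtract the maximum before exponentiating, for numerical stability
    return m + math.log(sum(math.exp(l - m) for l in logs))
```

With one such mixture fitted per class, a document could be assigned to the class that maximises the log class prior plus this log-likelihood.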
Category levels in hierarchical text categorization
Proceedings of the Third Conference on Empirical Methods in Natural Language Processing (EMNLP-3), 1998
Cited by 11 (1 self)
We consider the problem of assigning level numbers (weights) to hierarchically organized categories during the process of text categorization. These levels control the ability of the categories to attract documents during the categorization process. The levels are adjusted
Uncertainty-based Noise Reduction and Term Selection in Text Categorization
Proceedings 24th BCS-IRSG European Colloquium on IR Research, Springer LNCS 2291, 2002
Cited by 11 (6 self)
This paper introduces a new criterion for term selection, which is based on the notion of Uncertainty. Term selection according to this criterion is performed by the elimination of noisy terms on a class-by-class basis, rather than by selecting the most significant ones.

