Results 1 - 10
of
25
Text Classification from Labeled and Unlabeled Documents using EM
- Machine Learning
, 1999
"... . This paper shows that the accuracy of learned text classifiers can be improved by augmenting a small number of labeled training documents with a large pool of unlabeled documents. This is important because in many text classification problems obtaining training labels is expensive, while large qua ..."
Abstract
-
Cited by 632 (16 self)
- Add to MetaCart
. This paper shows that the accuracy of learned text classifiers can be improved by augmenting a small number of labeled training documents with a large pool of unlabeled documents. This is important because in many text classification problems obtaining training labels is expensive, while large quantities of unlabeled documents are readily available. We introduce an algorithm for learning from labeled and unlabeled documents based on the combination of Expectation-Maximization (EM) and a naive Bayes classifier. The algorithm first trains a classifier using the available labeled documents, and probabilistically labels the unlabeled documents. It then trains a new classifier using the labels for all the documents, and iterates to convergence. This basic EM procedure works well when the data conform to the generative assumptions of the model. However these assumptions are often violated in practice, and poor performance can result. We present two extensions to the algorithm that improve ...
An introduction to variable and feature selection
- Journal of Machine Learning Research
, 2003
"... Variable and feature selection have become the focus of much research in areas of application for which datasets with tens or hundreds of thousands of variables are available. ..."
Abstract
-
Cited by 431 (8 self)
- Add to MetaCart
Variable and feature selection have become the focus of much research in areas of application for which datasets with tens or hundreds of thousands of variables are available.
Learning from Labeled and Unlabeled Data using Graph Mincuts
, 2001
"... Many application domains suffer from not having enough labeled training data for learning. However, large amounts of unlabeled examples can often be gathered cheaply. As a result, there has been a great deal of work in recent years on how unlabeled data can be used to aid classification. We consi ..."
Abstract
-
Cited by 202 (5 self)
- Add to MetaCart
Many application domains suffer from not having enough labeled training data for learning. However, large amounts of unlabeled examples can often be gathered cheaply. As a result, there has been a great deal of work in recent years on how unlabeled data can be used to aid classification. We consider an algorithm based on finding minimum cuts in graphs, that uses pairwise relationships among the examples in order to learn from both labeled and unlabeled data. Our algorithm
Learning with Labeled and Unlabeled Data
, 2001
"... In this paper, on the one hand, we aim to give a review on literature dealing with the problem of supervised learning aided by additional unlabeled data. On the other hand, being a part of the author's first year PhD report, the paper serves as a frame to bundle related work by the author as well as ..."
Abstract
-
Cited by 134 (1 self)
- Add to MetaCart
In this paper, on the one hand, we aim to give a review on literature dealing with the problem of supervised learning aided by additional unlabeled data. On the other hand, being a part of the author's first year PhD report, the paper serves as a frame to bundle related work by the author as well as numerous suggestions for potential future work. Therefore, this work contains more speculative and partly subjective material than the reader might expect from a literature review. We give a rigorous definition of the problem and relate it to supervised and unsupervised learning. The crucial role of prior knowledge is put forward, and we discuss the important notion of input-dependent regularization. We postulate a number of baseline methods, being algorithms or algorithmic schemes which can more or less straightforwardly be applied to the problem, without the need for genuinely new concepts. However, some of them might serve as basis for a genuine method. In the literature revi...
Enhancing Supervised Learning with Unlabeled Data
, 2000
"... In many practical learning scenarios, there is a small amount of labeled data along with a large pool of unlabeled data. Many supervised learning algorithms have been developed and extensively studied. We present a new "co-training" strategy for using unlabeled data to improve the performance ..."
Abstract
-
Cited by 94 (0 self)
- Add to MetaCart
In many practical learning scenarios, there is a small amount of labeled data along with a large pool of unlabeled data. Many supervised learning algorithms have been developed and extensively studied. We present a new "co-training" strategy for using unlabeled data to improve the performance of standard supervised learning algorithms. Unlike much of the prior work, such as the co-training procedure of Blum and Mitchell (1998), we do not assume there are two redundant views both of which are sufficient for perfect classification. The only requirement our co-training strategy places on each supervised learning algorithm is that its hypothesis partitions the example space into a set of equivalence classes (e.g. for a decision tree each leaf defines an equivalence class). We evaluate our co-training strategy via experiments using data from the UCI repository. 1. Introduction In many practical learning scenarios, there is a small amount of labeled data along with a lar...
The role of Occam’s Razor in knowledge discovery
- Data Mining and Knowledge Discovery
, 1999
"... Abstract. Many KDD systems incorporate an implicit or explicit preference for simpler models, but this use of “Occam’s razor ” has been strongly criticized by several authors (e.g., Schaffer, 1993; Webb, 1996). This controversy arises partly because Occam’s razor has been interpreted in two quite di ..."
Abstract
-
Cited by 70 (1 self)
- Add to MetaCart
Abstract. Many KDD systems incorporate an implicit or explicit preference for simpler models, but this use of “Occam’s razor ” has been strongly criticized by several authors (e.g., Schaffer, 1993; Webb, 1996). This controversy arises partly because Occam’s razor has been interpreted in two quite different ways. The first interpretation (simplicity is a goal in itself) is essentially correct, but is at heart a preference for more comprehensible models. The second interpretation (simplicity leads to greater accuracy) is much more problematic. A critical review of the theoretical arguments for and against it shows that it is unfounded as a universal principle, and demonstrably false. A review of empirical evidence shows that it also fails as a practical heuristic. This article argues that its continued use in KDD risks causing significant opportunities to be missed, and should therefore be restricted to the comparatively few applications where it is appropriate. The article proposes and reviews the use of domain constraints as an alternative for avoiding overfitting, and examines possible methods for handling the accuracy–comprehensibility trade-off.
Using Unlabeled Data to Improve Text Classification
, 2001
"... One key difficulty with text classification learning algorithms is that they require many hand-labeled examples to learn accurately. This dissertation demonstrates that supervised learning algorithms that use a small number of labeled examples and many inexpensive unlabeled examples can create high- ..."
Abstract
-
Cited by 41 (0 self)
- Add to MetaCart
One key difficulty with text classification learning algorithms is that they require many hand-labeled examples to learn accurately. This dissertation demonstrates that supervised learning algorithms that use a small number of labeled examples and many inexpensive unlabeled examples can create high-accuracy text classifiers. By assuming that documents are created by a parametric generative model, Expectation-Maximization (EM) finds local maximum a posteriori models and classifiers from all the data -- labeled and unlabeled. These generative models do not capture all the intricacies of text; however on some domains this technique substantially improves classification accuracy, especially when labeled data are sparse. Two problems arise from this basic approach. First, unlabeled data can hurt performance in domains where the generative modeling assumptions are too strongly violated. In this case the assumptions can be made more representative in two ways: by modeling sub-topic class structure, and by modeling super-topic hierarchical class relationships. By doing so, model probability and classification accuracy come into correspondence, allowing unlabeled data to improve classification performance. The second problem is that even with a representative model, the improvements given by unlabeled data do not sufficiently compensate for a paucity of labeled data. Here, limited labeled data provide EM initializations that lead to low-probability models. Performance can be significantly improved by using active learning to select high-quality initializations, and by using alternatives to EM that avoid low-probability local maxima.
Occam's Two Razors: The Sharp and the Blunt
- In Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining
, 1998
"... Occam's razor has been the subject of much controversy. This paper argues that this is partly because it has been interpreted in two quite different ways, the first of which (simplicity is a goal in itself) is essentially correct, while the second (simplicity leads to greater accuracy) is not. The p ..."
Abstract
-
Cited by 23 (3 self)
- Add to MetaCart
Occam's razor has been the subject of much controversy. This paper argues that this is partly because it has been interpreted in two quite different ways, the first of which (simplicity is a goal in itself) is essentially correct, while the second (simplicity leads to greater accuracy) is not. The paper reviews the large variety of theoretical arguments and empirical evidence for and against the "second razor," and concludes that the balance is strongly against it. In particular, it builds on the case of (Schaffer, 1993) and (Webb, 1996) by considering additional theoretical arguments and recent empirical evidence that the second razor fails in most domains. A version of the first razor more appropriate to KDD is proposed, and we argue that continuing to apply the second razor risks causing significant opportunities to be missed. 1 Occam's Two Razors William of Occam's famous razor states that "Nunquam ponenda est pluralitas sin necesitate," which, approximately translated, means "En...
Metric-Based Methods for Adaptive Model Selection and Regularization
- Machine Learning
, 2001
"... We present a general approach to model selection and regularization that exploits unlabeled data to adaptively control hypothesis complexity in supervised learning tasks. The idea is to impose a metric structure on hypotheses by determining the discrepancy between their predictions across the di ..."
Abstract
-
Cited by 17 (0 self)
- Add to MetaCart
We present a general approach to model selection and regularization that exploits unlabeled data to adaptively control hypothesis complexity in supervised learning tasks. The idea is to impose a metric structure on hypotheses by determining the discrepancy between their predictions across the distribution of unlabeled data. We show how this metric can be used to detect untrustworthy training error estimates, and devise novel model selection strategies that exhibit theoretical guarantees against over-tting (while still avoiding under- tting). We then extend the approach to derive a general training criterion for supervised learning|yielding an adaptive regularization method that uses unlabeled data to automatically set regularization parameters. This new criterion adjusts its regularization level to the specic set of training data received, and performs well on a variety of regression and conditional density estimation tasks. The only proviso for these methods is that s...
Characterizing the Generalization Performance of Model Selection Strategies
- In ICML-97
, 1997
"... : We investigate the structure of model selection problems via the bias/variance decomposition. In particular, we characterize the essential structure of a model selection task by the bias and variance profiles it generates over the sequence of hypothesis classes. This leads to a new understanding o ..."
Abstract
-
Cited by 15 (4 self)
- Add to MetaCart
: We investigate the structure of model selection problems via the bias/variance decomposition. In particular, we characterize the essential structure of a model selection task by the bias and variance profiles it generates over the sequence of hypothesis classes. This leads to a new understanding of complexity-penalization methods: First, the penalty terms in effect postulate a particular profile for the variances as a function of model complexity--- if the postulated and true profiles do not match, then systematic under-fitting or over-fitting results, depending on whether the penalty terms are too large or too small. Second, it is usually best to penalize according to the true variances of the task, and therefore no fixed penalization strategy is optimal across all problems. We then use this bias/variance characterization to identify the notion of easy and hard model selection problems. In particular, we show that if the variance profile grows too rapidly in relation to the biases t...

