Semi-Supervised Learning Literature Survey
2006
Cited by 268 (7 self)
We review the literature on semi-supervised learning, which is an area in machine learning and, more generally, artificial intelligence. There has been a whole spectrum of interesting ideas on how to learn from both labeled and unlabeled data, i.e., semi-supervised learning. This document is a chapter excerpt from the author’s doctoral thesis (Zhu, 2005). However, the author plans to update the online version frequently to incorporate the latest developments in the field. Please obtain the latest version at http://www.cs.wisc.edu/~jerryzhu/pub/ssl_survey.pdf
Building text classifiers using positive and unlabeled examples
In: Intl. Conf. on Data Mining, 2003
Cited by 46 (8 self)
This paper studies the problem of building text classifiers using positive and unlabeled examples. The key feature of this problem is that there are no negative examples for learning. Recently, a few techniques for solving this problem were proposed in the literature. These techniques are based on the same idea, which builds a classifier in two steps. Each existing technique uses a different method for each step. In this paper, we first introduce some new methods for the two steps, and perform a comprehensive evaluation of all possible combinations of methods for the two steps. We then propose a more principled approach to solving the problem based on a biased formulation of SVM, and show experimentally that it is more accurate than the existing techniques.
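The abstract does not spell out the biased formulation. As a rough illustration only, a biased soft-margin SVM is commonly written with separate error penalties for the positive set P and for the unlabeled set U, whose members are treated as negatives. The symbols below (C_+, C_-, xi_i, and the labels y_i) are notation assumed for this sketch and are not taken from the listing:

```latex
\min_{w,\,b,\,\xi}\quad \frac{1}{2}\|w\|^{2}
  + C_{+}\sum_{i \in P}\xi_{i}
  + C_{-}\sum_{i \in U}\xi_{i}
\qquad \text{s.t.}\quad
  y_{i}\,(w^{\top}x_{i} + b) \ge 1 - \xi_{i},\quad \xi_{i} \ge 0,
\qquad y_{i} = +1 \ (i \in P),\quad y_{i} = -1 \ (i \in U).
```

In a formulation of this kind, C_+ would typically be set larger than C_-, since errors on known positives are certain mistakes while an "error" on an unlabeled example may simply be a hidden positive.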
Learning Classifiers from Only Positive and Unlabeled Data
2008
Cited by 15 (2 self)
The input to an algorithm that learns a binary classifier normally consists of two sets of examples, where one set consists of positive examples of the concept to be learned, and the other set consists of negative examples. However, it is often the case that the available training data are an incomplete set of positive examples, and a set of unlabeled examples, some of which are positive and some of which are negative. The problem solved in this paper is how to learn a standard binary classifier given a nontraditional training set of this nature. Under the assumption that the labeled examples are selected randomly from the positive examples, we show that a classifier trained on positive and unlabeled examples predicts probabilities that differ by only a constant factor from the true conditional probabilities of being positive. We show how to use this result in two different ways to learn a classifier from a nontraditional training set. We then apply these two new methods to solve a real-world problem: identifying protein records that should be included in an incomplete specialized molecular biology database. Our experiments in this domain show that models trained using the new methods perform better than the current state-of-the-art biased SVM method for learning from positive and unlabeled examples.
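The constant-factor result can be illustrated with a minimal sketch. Write s = 1 for "labeled" and y = 1 for "positive". If labeled examples are drawn at random from the positives, then p(s=1 | x) = c · p(y=1 | x) with c = p(s=1 | y=1). The estimator of c used here (the average score over held-out labeled positives), the function names, and the example scores are all assumptions made for illustration, not details given in the abstract:

```python
from statistics import mean


def estimate_label_frequency(scores_on_labeled_positives):
    """Estimate c = p(s=1 | y=1) as the mean score g(x) over held-out labeled positives.

    Assumption: g(x) approximates p(s=1 | x), the probability that x is labeled.
    """
    return mean(scores_on_labeled_positives)


def adjust_probabilities(scores, c):
    """Turn g(x) ~ p(s=1 | x) into an estimate of p(y=1 | x) = g(x) / c, clipped to [0, 1]."""
    return [min(1.0, max(0.0, g / c)) for g in scores]


# Hypothetical scores from a classifier trained to separate labeled from unlabeled examples.
held_out_positive_scores = [0.5, 0.25, 0.75]
c = estimate_label_frequency(held_out_positive_scores)

# Rescale the scores of new examples by the estimated constant.
adjusted = adjust_probabilities([0.1, 0.25, 0.5], c)
```

Clipping is a practical guard added for this sketch: an estimated c that is too small can push g(x) / c above 1.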
A Method for Inferring Label Sampling Mechanisms In Semi-Supervised Learning
2005
Cited by 10 (0 self)
We consider the situation in semi-supervised learning, where the "label sampling" mechanism stochastically depends on the true response (as well as potentially on the features). We suggest a method of moments for estimating this stochastic dependence using the unlabeled data. This is potentially useful for two distinct purposes: (a) as an input to a supervised learning procedure which can be used to "de-bias" its results using labeled data only, and (b) as a potentially interesting learning task in itself.
Learning from positive and unlabeled examples with different data distributions
Proceedings of ECML 2005, 2005
Cited by 5 (2 self)
We study the problem of learning from positive and unlabeled examples. Although several techniques exist for dealing with this problem, they all assume that positive examples in the positive set P and the positive examples in the unlabeled set U are generated from the same distribution. This assumption may be violated in practice. For example, suppose one wants to collect all printer pages from the Web. One can use the printer pages from one site as the set P of positive pages and use product pages from another site as U. One wants to classify the pages in U into printer pages and non-printer pages. Although printer pages from the two sites have many similarities, they can also be quite different because different sites often present similar products in different styles and have different focuses. In such cases, existing methods perform poorly. This paper proposes a novel technique, A-EM, to deal with the problem. Experimental results with product page classification demonstrate the effectiveness of the proposed technique.
A Simple Probabilistic Approach to Learning from Positive and Unlabeled Examples
In Proc. of the 5th Annual UK Workshop on Computational Intelligence, 2005
Cited by 5 (2 self)
We propose a simple probabilistic approach to learning from positive and unlabeled examples, and show experimentally that it can approximate or outperform other state-of-the-art approaches to this problem in spite of its simplicity. By employing a linear-time learning algorithm such as PrTFIDF, our approach can be highly efficient and scalable.
Semi-Supervised Novelty Detection
2010
Cited by 3 (0 self)
A common setting for novelty detection assumes that labeled examples from the nominal class are available, but that labeled examples of novelties are unavailable. The standard (inductive) approach is to declare novelties where the nominal density is low, which reduces the problem to density level set estimation. In this paper, we consider the setting where an unlabeled and possibly contaminated sample is also available at learning time. We argue that novelty detection in this semi-supervised setting is naturally solved by a general reduction to a binary classification problem. In particular, a detector with a desired false positive rate can be achieved through a reduction to Neyman-Pearson classification. Unlike the inductive approach, semi-supervised novelty detection (SSND) yields detectors that are optimal (e.g., statistically consistent) regardless of the distribution on novelties. Therefore, in novelty detection, unlabeled data have a substantial impact on the theoretical properties of the decision rule. We validate the practical utility of SSND with an extensive experimental study. We also show that SSND provides distribution-free, learning-theoretic solutions to two well known problems in hypothesis testing. First, our results provide a general solution to the general two-sample problem, that is, the problem of determining whether two random samples arise from the same distribution. Second, a specialization of SSND coincides with the standard p-value approach to multiple testing under the so-called random effects model. Unlike standard rejection regions based on thresholded p-values, the general SSND framework allows for adaptation to arbitrary alternative distributions in multiple dimensions.
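The Neyman-Pearson step of the reduction can be sketched as follows. Suppose a scoring function has already been obtained, for example from a classifier trained to separate the nominal sample from the unlabeled sample, and a threshold must be chosen so that at most a fraction alpha of held-out nominal points are flagged as novel. The function names and the empirical-quantile rule below are assumptions made for illustration. The rule controls only the empirical false positive rate on the held-out sample and is not the paper's procedure:

```python
import math


def np_threshold(nominal_scores, alpha):
    """Return a threshold taken from the held-out nominal scores such that
    at most a fraction alpha of those scores lie strictly above it."""
    ordered = sorted(nominal_scores)
    n = len(ordered)
    # Number of held-out nominal points that may be flagged as novel.
    allowed = math.floor(alpha * n)
    return ordered[n - allowed - 1]


def detect(scores, threshold):
    """Flag a point as novel when its score exceeds the threshold."""
    return [s > threshold for s in scores]


# Hypothetical held-out nominal scores (higher means "looks more like the unlabeled sample").
nominal_scores = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
threshold = np_threshold(nominal_scores, alpha=0.2)

# Empirical false positive rate on the held-out nominal sample.
false_positive_rate = sum(detect(nominal_scores, threshold)) / len(nominal_scores)
```

With ten distinct scores and alpha = 0.2, the threshold is the third-largest score, so exactly two nominal points are flagged.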
Learning Classifiers without Negative Examples: A Reduction Approach
Cited by 1 (1 self)
The problem of PU Learning, i.e., learning classifiers with positive and unlabelled examples (but not negative examples), is very important in information retrieval and data mining. We address this problem through a novel approach: reducing it to the problem of learning classifiers for some meaningful multivariate performance measures. In particular, we show how a powerful machine learning algorithm, the Support Vector Machine, can be adapted to solve this problem. The effectiveness and efficiency of the proposed approach have been confirmed by our experiments on three real-world datasets.
Negative Training Data can be Harmful to Text Classification
Cited by 1 (0 self)
This paper studies the effects of training data on binary text classification and postulates that negative training data is not needed and may even be harmful for the task. Traditional binary classification involves building a classifier using labeled positive and negative training examples. The classifier is then applied to classify test instances into positive and negative classes. A fundamental assumption is that the training and test data are identically distributed. However, this assumption may not hold in practice. In this paper, we study a particular problem where the positive data is identically distributed but the negative data may or may not be so. Many practical text classification and retrieval applications fit this model. We argue that in this setting negative training data should not be used, and that PU learning can be employed to solve the problem. Empirical evaluation has been conducted to support our claim. This result is important as it may fundamentally change the current binary classification paradigm.
Incremental Learning from Positive Examples
Classical supervised learning techniques are generally based on an inductive mechanism able to generalise a model from a set of positive examples, assuring its consistency with respect to a set of negative examples. When learning from positive evidence only, the problem of over-generalisation arises. This paper proposes a general technique for incremental multi-class learning from positive examples only, which has been embedded in the learning system INTHELEX. The idea is to incrementally treat the positive evidence for a class as negative evidence for all other classes until the environment explicitly declares the contrary. An application of the proposed technique to the agent learning domain has been provided. The proposed framework has been used to simulate an agent that incrementally learns and revises a logical model of a task by imitating skilled agents. In particular, demonstrations are incrementally received and used as training examples while the agent interacts in a stochastic environment. The experimental results support the validity of the proposed approach on this application domain.

