Semi-Supervised Learning Literature Survey
2006
Cited by 447 (8 self)
We review the literature on semi-supervised learning, which is an area in machine learning and, more generally, artificial intelligence. There has been a whole spectrum of interesting ideas on how to learn from both labeled and unlabeled data, i.e. semi-supervised learning. This document is a chapter excerpt from the author’s doctoral thesis (Zhu, 2005). However, the author plans to update the online version frequently to incorporate the latest development in the field. Please obtain the latest version at http://www.cs.wisc.edu/~jerryzhu/pub/ssl_survey.pdf
Multi-Manifold Semi-Supervised Learning
Cited by 74 (6 self)
We study semi-supervised learning when the data consists of multiple intersecting manifolds. We give a finite sample analysis to quantify the potential gain of using unlabeled data in this multi-manifold setting. We then propose a semi-supervised learning algorithm that separates different manifolds into decision sets, and performs supervised learning within each set. Our algorithm involves a novel application of Hellinger distance and size-constrained spectral clustering. Experiments demonstrate the benefit of our multi-manifold semi-supervised learning approach.
Unlabeled data: Now it helps, now it doesn’t
Cited by 18 (1 self)
Empirical evidence shows that in favorable situations semi-supervised learning (SSL) algorithms can capitalize on the abundance of unlabeled training data to improve the performance of a learning task, in the sense that fewer labeled training data are needed to achieve a target error bound. However, in other situations unlabeled data do not seem to help. Recent attempts at theoretically characterizing SSL gains provide only partial and sometimes apparently conflicting explanations of whether, and to what extent, unlabeled data can help. In this paper, we attempt to bridge the gap between the practice and theory of semi-supervised learning. We develop a finite sample analysis that characterizes the value of unlabeled data and quantifies the performance improvement of SSL compared to supervised learning. We show that there are large classes of problems for which SSL can significantly outperform supervised learning, in finite sample regimes and sometimes also in terms of error convergence rates.
Does Unlabeled Data Provably Help? Worst-case Analysis of the Sample Complexity of Semi-Supervised Learning
Cited by 15 (2 self)
We study the potential benefits that unlabeled data offer the learner in classification prediction. We compare learning in the semi-supervised model to the standard, supervised PAC (distribution free) model, considering both the realizable and the unrealizable (agnostic) settings. Roughly speaking, our conclusion is that access to unlabeled samples cannot provide sample size guarantees that are better than those obtainable without access to unlabeled data, unless one postulates very strong assumptions about the distribution of the labels. In particular, we prove that for basic hypothesis classes over the real line, if the distribution of unlabeled data is ‘smooth’, knowledge of that distribution cannot improve the labeled sample complexity by more than a constant factor (e.g., 2). We conjecture that a similar phenomenon holds for any hypothesis class and any unlabeled data distribution. We also discuss the utility of semi-supervised learning under the common cluster assumption concerning the distribution of labels, and show that even in the most accommodating cases, where data is generated by two unimodal label-homogeneous distributions, common SSL paradigms may be misleading and inflict poor prediction performance.
A Discriminative Model for Semi-Supervised Learning
2008
Cited by 9 (1 self)
Supervised learning, that is, learning from labeled examples, is an area of Machine Learning that has reached substantial maturity. It has generated general-purpose and practically-successful algorithms and the foundations are quite well understood and captured by theoretical frameworks such as the PAC-learning model and the Statistical Learning Theory framework. However, for many contemporary practical problems such as classifying web pages or detecting spam, there is often additional information available in the form of unlabeled data, which is often much cheaper and more plentiful than labeled data. As a consequence, there has recently been substantial interest in semi-supervised learning (using unlabeled data together with labeled data), since any useful information that reduces the amount of labeled data needed can be a significant benefit. Several techniques have been developed for doing this, along with experimental results on a variety of different learning problems. Unfortunately, the standard learning frameworks for reasoning about supervised learning do not capture the key aspects and the assumptions underlying these semi-supervised learning methods. In this paper we describe an augmented version of the PAC model designed for semi-supervised learning, that can be used to reason about many of the different approaches taken over the past
Learning with online constraints: shifting concepts and active learning
PhD thesis, MIT Computer Science and Artificial Intelligence Lab, 2006
Cited by 8 (5 self)
Many practical problems such as forecasting, real-time decision making, streaming data applications, and resource-constrained learning, can be modeled as learning with online constraints. This thesis is concerned with analyzing and designing algorithms for learning under the following online constraints: i) The algorithm has only sequential, or one-at-a-time, access to data. ii) The time and space complexity of the algorithm must not scale with the number of observations. We analyze learning with online constraints in a variety of settings, including active learning. The active learning model is applicable to any domain in which unlabeled data is easy to come by and there exists a (potentially difficult or expensive) mechanism by which to attain labels. First, we
A Selective Sampling Strategy for Label Ranking
2006
Cited by 5 (0 self)
We propose a novel active learning strategy based on the compression framework of [9] for label ranking functions which, given an input instance, predict a total order over a predefined set of alternatives. Our approach is theoretically motivated by an extension to ranking and active learning of Kääriäinen’s generalization bounds using unlabeled data [7], initially developed in the context of classification. The bounds we obtain suggest a selective sampling strategy, provided that a sufficiently, yet reasonably, large initial labeled dataset is available. Experiments on Information Retrieval corpora from automatic text summarization and question answering show that the proposed approach substantially reduces the labeling effort in comparison to random and heuristic-based sampling strategies.
A comparison of tight generalization error bounds
In Proceedings of the 22nd International Conference on Machine Learning, 2005
Cited by 4 (1 self)
We investigate the empirical applicability of several bounds (a number of which are new) on the true error rate of learned classifiers which hold whenever the examples are chosen independently at random from a fixed distribution. The collection of tricks we use includes: 1. A technique using unlabeled data for a tight derandomization of randomized bounds. 2. A tight form of the progressive validation bound. 3. The exact form of the test set bound. The bounds are implemented in the semibound package and are freely available.
New Theoretical Frameworks for Machine Learning
2007
Cited by 3 (0 self)
This thesis develops and analyzes theoretical frameworks for new emerging paradigms of Machine Learning including Semi-supervised, Active, and Similarity-based Learning. These are areas of significant practical importance and significant activity in Machine Learning, and a number of different algorithmic approaches have been developed for each of them. Standard Learning Theory frameworks such as PAC or Statistical Learning Theory models tend to not capture these learning approaches, hence developing sound and rigorous models that provide a thorough understanding of these new paradigms is desirable. The purpose of this thesis is to propose and to study new theoretical frameworks and algorithms for better understanding and extending some of these learning approaches. In addition, this dissertation also presents new applications of techniques from Machine Learning Theory to new emerging areas of Computer Science at large, such as Auction and Mechanism Design. In Machine Learning, there has been growing interest in using unlabeled data together with labeled data due to the availability of large amounts of unlabeled data in many applications. As a result, a number of different algorithmic approaches have been developed for this
Finite sample analysis of semi-supervised learning
2008
Cited by 2 (2 self)
Empirical evidence shows that in favorable situations semi-supervised learning (SSL) algorithms can capitalize on the abundance of unlabeled training data to improve the performance of a learning task, in the sense that fewer labeled training data are needed to achieve a target error bound. However, in other situations unlabeled data do not seem to help. Recent attempts at theoretically characterizing SSL gains provide only partial and sometimes apparently conflicting explanations of whether, and to what extent, unlabeled data can help. In this paper, we attempt to bridge the gap between the practice and theory of semi-supervised learning. We develop a finite sample analysis that characterizes the value of unlabeled data and quantifies the performance improvement of SSL compared to supervised learning. We show that there are large classes of problems for which SSL can significantly outperform supervised learning, in finite sample regimes and sometimes also in terms of error convergence rates.