Semi-Supervised Learning Literature Survey
, 2006
Cited by 447 (8 self)
We review the literature on semi-supervised learning, which is an area in machine learning and, more generally, artificial intelligence. There has been a whole
spectrum of interesting ideas on how to learn from both labeled and unlabeled data, i.e. semi-supervised learning. This document is a chapter excerpt from the author’s
doctoral thesis (Zhu, 2005). However, the author plans to update the online version frequently to incorporate the latest developments in the field. Please obtain the latest
version at http://www.cs.wisc.edu/~jerryzhu/pub/ssl_survey.pdf
Correcting sample selection bias by unlabeled data
Cited by 130 (9 self)
We consider the scenario where training and test data are drawn from different distributions, commonly referred to as sample selection bias. Most algorithms for this setting try to first recover sampling distributions and then make appropriate corrections based on the distribution estimate. We present a nonparametric method which directly produces resampling weights without distribution estimation. Our method works by matching distributions between training and testing sets in feature space. Experimental results demonstrate that our method works well in practice.
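The abstract above gives no formulas, so the following is only a rough sketch of what "matching distributions in feature space" can look like. It assumes a kernel-mean-matching style objective: choose non-negative weights for the training points so that their weighted kernel mean is close to the kernel mean of the test points. The Gaussian kernel, the bandwidth `sigma`, the box bound `upper`, the projected-gradient solver, and every function name are assumptions made for illustration, and the sketch omits the normalization constraint that a full quadratic-programming formulation would normally include.

```python
import math

def gaussian_kernel(a, b, sigma):
    """Gaussian (RBF) kernel between two feature vectors given as sequences."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def mean_matching_weights(train, test, sigma=0.5, upper=10.0, steps=300):
    """Approximately minimize 0.5 * b'Kb - kappa'b subject to 0 <= b_i <= upper.

    K is the train/train kernel matrix and kappa_i is n_train / n_test times
    the summed kernel value between training point i and all test points.
    The solver is plain projected gradient descent (an illustrative choice).
    """
    n_tr, n_te = len(train), len(test)
    K = [[gaussian_kernel(a, b, sigma) for b in train] for a in train]
    kappa = [n_tr / n_te * sum(gaussian_kernel(a, t, sigma) for t in test)
             for a in train]
    # The largest row sum bounds the largest eigenvalue of K, so this step
    # size is small enough for gradient descent on the quadratic objective.
    step = 1.0 / max(sum(row) for row in K)
    beta = [1.0] * n_tr  # start from uniform weights
    for _ in range(steps):
        grad = [sum(K[i][j] * beta[j] for j in range(n_tr)) - kappa[i]
                for i in range(n_tr)]
        # Gradient step followed by projection onto the box [0, upper].
        beta = [min(upper, max(0.0, b - step * g)) for b, g in zip(beta, grad)]
    return beta

# Hypothetical 1-D data: training points spread over [-2, 2],
# test points concentrated around 1.
train = [[-2.0 + 0.1 * i] for i in range(41)]
test = [[1.0 + 0.05 * j] for j in range(-5, 6)]
weights = mean_matching_weights(train, test)
```

With this toy data, training points close to the test cluster should end up with larger weights than points far from it. A weighted learner could then use `weights` as per-example importance weights.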
An improved categorization of classifier’s sensitivity on sample selection bias
 In Proceedings of the Fifth IEEE International Conference on Data Mining
, 2005
Cited by 17 (0 self)
A recent paper categorizes classifier learning algorithms according to their sensitivity to a common type of sample selection bias where the chance of an example being selected into the training sample depends on its feature vector x but not (directly) on its class label y. A classifier learner is categorized as “local” if it is insensitive to this type of sample selection bias, otherwise, it is considered “global”. In that paper, the true model is not clearly distinguished from the model that the algorithm outputs. In their discussion of Bayesian classifiers, logistic regression and hard-margin SVMs, the true model (or the model that generates the true class label for every example) is implicitly assumed to be contained in the model space of the learner, and the true class probabilities and model estimated class probabilities are assumed to asymptotically converge as the training data set size increases. However, in the discussion of naive Bayes, decision trees and soft-margin SVMs, the model space is assumed not to contain the true model, and these three algorithms are instead argued to be “global learners”. We argue that most classifier learners may or may not be affected by sample selection bias; this depends on the dataset as well as the heuristics or inductive bias implied by the learning algorithm and their appropriateness to the particular dataset.
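The type of bias described above, where selection depends on x but not directly on y, can be illustrated with a small exact calculation. The sketch below uses made-up population counts and selection rates (all names and numbers are hypothetical) and exact fractions rather than random sampling. It shows that the true conditional P(y | x) is the same in the biased sample as in the population, while the feature distribution P(x) shifts. Whether a given learner is affected then depends on how its model relates to the true one, which is the point the abstract argues.

```python
from fractions import Fraction

# Hypothetical population counts: (feature value x, class label y) -> count.
population = {
    ("a", 0): 300, ("a", 1): 100,
    ("b", 0): 200, ("b", 1): 400,
}

# Assumed selection rates that depend on x only, never on y.
selection_rate = {"a": Fraction(9, 10), "b": Fraction(1, 10)}

def conditional(counts):
    """Return P(y = 1 | x) for every feature value x."""
    result = {}
    for x in {key[0] for key in counts}:
        total = counts[(x, 0)] + counts[(x, 1)]
        result[x] = Fraction(counts[(x, 1)]) / total
    return result

def marginal(counts):
    """Return P(x) for every feature value x."""
    total = sum(counts.values())
    result = {}
    for (x, _y), count in counts.items():
        result[x] = result.get(x, 0) + Fraction(count) / total
    return result

# Expected counts in the biased training sample.
biased = {(x, y): count * selection_rate[x]
          for (x, y), count in population.items()}

print("P(y=1|x) population:", conditional(population))
print("P(y=1|x) biased:    ", conditional(biased))
print("P(x) population:    ", marginal(population))
print("P(x) biased:        ", marginal(biased))
```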
On sample selection bias and its efficient correction via model averaging and unlabeled examples
 In Proc. of SIAM Data Mining Conference
, 2007
Cited by 10 (5 self)
Sample selection bias is a common problem encountered when using data mining algorithms for many real-world applications. Traditionally, it is assumed that training and test data are sampled from the same probability distribution, the so-called “stationary or non-biased distribution assumption.” However, this assumption is often violated in reality. Typical examples include marketing solicitation, fraud detection, drug testing, loan approval, and school enrollment. For these applications, the only labeled data available for training is a biased representation, in various ways, of the future data on which the inductive model will predict. Intuitively, some examples sampled frequently into the training data may actually be infrequent in the testing data, and vice versa. When this happens, an inductive model constructed from a biased training set may not be as accurate on unbiased testing data as it would be if there had been no selection bias in the training data. In this paper, we first improve and clarify a previously proposed categorization of sample selection bias. In particular, we show that, except under very restricted conditions, sample selection bias is a common problem in many real-world situations. We then analyze various effects of sample selection bias on inductive modeling, in particular, how the “true” conditional probability P(y|x) to be modeled by inductive learners can be misrepresented in the biased training data, which subsequently misleads a learning algorithm. To solve inaccuracy problems due to sample selection bias, we explore how to use model averaging of (1) conditional probabilities P(y|x), (2) feature probabilities P(x), and (3) joint probabilities P(x, y) to reduce the influence of sample selection bias on model accuracy. In particular, we explore how to use unlabeled data in a semi-supervised learning framework to improve the accuracy of descriptive models constructed from biased training samples.
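As a minimal illustration of the first option listed above, averaging conditional probabilities P(y|x), the sketch below takes several models and averages their class-probability estimates for one example. The uniform weighting, the dictionary representation, and the two toy models are assumptions made here for illustration. The abstract does not specify how the paper combines its models, and averaging of P(x) or P(x, y) is not shown.

```python
def average_conditionals(models, x):
    """Average P(y | x) estimates from several models with uniform weights.

    Each model is a callable that maps an example x to a dictionary
    {class label: estimated probability}.
    """
    estimates = [model(x) for model in models]
    labels = set().union(*estimates)  # every label seen by any model
    return {
        y: sum(est.get(y, 0.0) for est in estimates) / len(estimates)
        for y in labels
    }

# Two hypothetical models that disagree on the same example.
def model_a(x):
    return {"pos": 0.9, "neg": 0.1}

def model_b(x):
    return {"pos": 0.5, "neg": 0.5}

averaged = average_conditionals([model_a, model_b], x=None)
print(averaged)
```

If each input model returns probabilities that sum to one, the uniform average also sums to one, so no renormalization is needed in this simple case.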
ReverseTesting: An efficient framework to select amongst classifiers under sample selection bias
 In KDD’06
, 2006
Cited by 5 (3 self)
One of the most important assumptions made by many classification algorithms is that the training and test sets are drawn from the same distribution, i.e., the so-called “stationary distribution assumption” that the future and the past data sets are identical from a probabilistic standpoint. In many domains of real-world applications, such as marketing solicitation, fraud detection, drug testing, loan approval, subpopulation surveys, and school enrollment, among others, this is rarely the case. This is because the only labeled sample available for training is biased in different ways due to a variety of practical reasons and limitations. In these circumstances, traditional methods to evaluate the expected generalization error of classification algorithms, such as structural risk minimization, ten-fold cross-validation, and leave-one-out validation, usually return poor estimates of which classification algorithm, when trained on a biased dataset, will be the most accurate on a future unbiased dataset, among a number of competing candidates. Sometimes, the estimated order of the learning algorithms’ accuracy could be so poor that it is not even better than random guessing. Therefore, a method to determine the most accurate learner is needed for data mining under sample selection bias for many real-world applications. We present such an approach that can determine which learner will perform the best on an unbiased test set, given a possibly biased training set, at a fraction of the computational cost of cross-validation-based approaches.
Type Independent Correction of Sample Selection Bias via Structural Discovery and Rebalancing
Cited by 5 (4 self)
Sample selection bias is a common problem in many real-world applications, where training data are obtained under realistic constraints that make them follow a different distribution from the future testing data. For example, in hospital clinical studies, it is common practice to build models using eligible volunteers as the training data, and then apply the model to the entire population. Because these volunteers are usually not selected at random, the training set may not be drawn from the same distribution as the test set. Thus, such a dataset suffers from “sample selection bias” or “covariate shift”. In the past few years, much work has been proposed to reduce sample selection bias, mainly by statically matching the distribution between training set and test set. But in this paper, we do not explore the different distributions directly. Instead, we propose
An Improved Categorization of Classifier's Sensitivity on Sample Selection Bias
 In Proceedings of the Fifth IEEE International Conference on Data Mining
, 2005
A recent paper categorizes classifier learning algorithms according to their sensitivity to a common type of sample selection bias where the chance of an example being selected into the training sample depends on its feature vector x but not (directly) on its class label y. A classifier learner is categorized as "local" if it is insensitive to this type of sample selection bias, otherwise, it is considered "global". In that paper, the true model is not clearly distinguished from the model that the algorithm outputs. In their discussion of Bayesian classifiers, logistic regression and hard-margin support vector machines, the true model (or the model that generates the true class label for every example) is implicitly assumed to be contained in the model space of the learner, and the true class probabilities and model estimated class probabilities are assumed to asymptotically converge as the training data set size increases. However, in the discussion of naive Bayes, decision trees and soft-margin support vector machines, the model space is assumed not to contain the true model, and these three classification algorithms are instead argued to be "global learners". Here we argue that most classifier learning algorithms, including those just discussed, may or may not be affected by sample selection bias; this will depend on the dataset as well as the heuristics or inductive bias implied by the classifier learning algorithm and their appropriateness to the particular dataset. We make use of our earlier experimental results and produce additional results to illustrate our claims.
Domain Adaptation
, 2011
This is to certify that I have examined this copy of a doctoral dissertation by
Automatic annotation of interactions in meetings using . . .
People spend many hours in meetings during their working lives. The growing need for help in keeping records in meetings and searching through them has been recognized, and several groups around the world are working on a meeting browser or a summarization tool. In this research, we propose the development of a classification system that uses machine learning techniques to segment and detect meeting acts, which are high-level interactions among meeting participants as a group (e.g. negotiation, reporting, discussion, planning). As in other data-driven tasks, this requires a large amount of data, but labeling data can be costly, time-consuming and error-prone. To address this problem, semi-supervised learning techniques are often applied, in which a small amount of data is labeled and used to train a classifier together with a large body of unlabeled data. In this study, we propose to use and extend a novel semi-supervised learning algorithm, the contrast classifier approach, which exploits the contrast between the distributions of labeled and unlabeled data. We will also present our research plan to investigate the impact of different labeling mechanisms on the performance of existing and proposed semi-supervised learning techniques, especially in the presence of imbalanced class distribution.