## Learning with Labeled and Unlabeled Data (2001)

Citations: 165 (3 self)

### BibTeX

@TECHREPORT{Seeger01learningwith,
  author      = {Matthias Seeger},
  title       = {Learning with Labeled and Unlabeled Data},
  institution = {},
  year        = {2001}
}


### Abstract

This paper has two aims. On the one hand, it reviews the literature on supervised learning aided by additional unlabeled data. On the other hand, as part of the author's first-year PhD report, it serves as a frame to bundle related work by the author as well as numerous suggestions for potential future work. This work therefore contains more speculative and partly subjective material than the reader might expect from a literature review. We give a rigorous definition of the problem and relate it to supervised and unsupervised learning. The crucial role of prior knowledge is put forward, and we discuss the important notion of input-dependent regularization. We postulate a number of baseline methods: algorithms or algorithmic schemes which can more or less straightforwardly be applied to the problem, without the need for genuinely new concepts. However, some of them might serve as a basis for a genuine method. In the literature revi...

### Citations

8980 | Statistical Learning Theory
- Vapnik
- 1998
Citation Context ...inaccurate) as a way of inference in the first place. Vapnik tries to motivate SLT transduction by presenting bounds specifically tailored for the transduction setting. While reading the formidable book [95], we have been fascinated by the way Vapnik presents his results for inductive inference. He starts from philosophical principles about the nature of learning and induction, derives PAC bounds and com...

8089 | Maximum likelihood from incomplete data via the EM algorithm
- Dempster, Laird, et al.
- 1977
Citation Context ...Expectation-maximization on a joint density model. Rather than treating D_u as genuinely unlabeled data, we can also view the labels on these points as missing data. Expectation-maximization (EM) (see [26],[2]) is a general technique for maximum likelihood estimation in the presence of latent variables or missing data. The idea of the basic batch version of EM is simple. We can distinguish between a co...
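The batch EM scheme described in this excerpt can be sketched concretely. Below is a minimal NumPy illustration for a two-component 1-D Gaussian mixture, treating the component label as the missing variable; this is our own sketch, not code from the report, and all names are ours.

```python
import numpy as np

def em_gmm(x, n_iter=100):
    """Batch EM for a two-component 1-D Gaussian mixture.

    E-step: compute responsibilities P(k | x_i, theta), the posterior
    over the "missing" component label for each point.
    M-step: weighted maximum-likelihood re-estimates of the weights,
    means and variances.  Each iteration cannot decrease the marginal
    log-likelihood of the observed data.
    """
    # crude initialisation from the data quartiles
    mu = np.array([np.percentile(x, 25.0), np.percentile(x, 75.0)])
    var = np.array([x.var(), x.var()])
    pi = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: responsibilities r[i, k]
        dens = pi * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) \
               / np.sqrt(2.0 * np.pi * var)
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: weighted ML updates
        nk = resp.sum(axis=0)
        pi = nk / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    return pi, mu, var

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2.0, 1.0, 500), rng.normal(3.0, 1.0, 500)])
pi, mu, var = em_gmm(x)
```

On this synthetic data the recovered component means land near the true values -2 and 3; in the labeled-unlabeled setting, labeled points would simply have their responsibilities clamped to the observed class.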

4828 | Neural Networks for Pattern Recognition
- Bishop
- 1995
Citation Context ...he basic "drive" in the optimization is always to fit the data in the best possible way. Examples of unsupervised techniques include latent subspace models like principal component analysis (PCA) (e.g. [11]), factor analysis (e.g. [29]) or principal curves [39]. Here, we introduce a latent "compression" variable u, living in a low-dimensional space, furthermore impose a noisy functional relationship on ...
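As a concrete instance of the latent-subspace models mentioned in this excerpt, here is a minimal PCA sketch via the SVD of the centred data matrix. This is our own illustration (NumPy assumed), not a method from the report.

```python
import numpy as np

def pca(X, k):
    """PCA as a latent-subspace model: centre the data, then project
    onto the top-k right singular vectors (directions of maximal
    variance); the projections are the latent 'compression' variable u."""
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:k]            # (k, d) basis of the latent subspace
    scores = Xc @ components.T     # (n, k) latent coordinates u
    return scores, components

rng = np.random.default_rng(1)
t = rng.normal(size=200)
# points near the 1-D subspace spanned by (1, 2), plus small noise
X = np.column_stack([t, 2.0 * t]) + 0.1 * rng.normal(size=(200, 2))
scores, comps = pca(X, 1)
```

The recovered first component is (up to sign) close to the generating direction (1, 2)/sqrt(5).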

3529 | Optimization by simulated annealing
- Kirkpatrick, Gelatt, et al.
- 1983
Citation Context ...ed by carefully choosing the initial model P, but on many models this is as difficult as finding a good fit to the data in the first place. A standard technique to attack such problems is simulated annealing [52]. In the context of EM, the basic idea is to run a sequence of EM algorithms on the data, each having its own model and data submanifold. After convergence of one algorithm, we use the solution to ini...
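A generic simulated-annealing loop of the kind cited here can be sketched as follows. This is the standard textbook variant with a geometric cooling schedule, not the annealed-EM procedure the report discusses; the objective function and all parameter values are purely illustrative.

```python
import math
import random

def simulated_annealing(f, x0, t0=5.0, cooling=0.995, n_steps=4000, seed=0):
    """Minimise f: propose a Gaussian move, always accept improvements,
    and accept uphill moves with probability exp(-delta / T).  T shrinks
    geometrically, so the search gradually hardens into local descent."""
    rng = random.Random(seed)
    x, fx = x0, f(x0)
    x_best, f_best = x, fx
    t = t0
    for _ in range(n_steps):
        cand = x + rng.gauss(0.0, 1.0)
        fc = f(cand)
        if fc < fx or rng.random() < math.exp(-(fc - fx) / t):
            x, fx = cand, fc
            if fx < f_best:
                x_best, f_best = x, fx
        t *= cooling
    return x_best, f_best

# double-well objective: a spurious local minimum on the left,
# global minimum at x = +2; we deliberately start in the wrong well
f = lambda x: (x * x - 4.0) ** 2 + 0.5 * (x - 2.0) ** 2
x_best, f_best = simulated_annealing(f, x0=-2.0)
```

Early on, the high temperature lets the chain accept uphill moves and potentially cross the barrier between the wells, which plain gradient descent from the same start cannot do.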

2284 | A tutorial on support vector machines for pattern recognition
- Burges
- 1998
Citation Context ...(x) in one of the most powerful currently available classes of discriminative classifiers, namely kernel methods such as Gaussian processes (e.g. [100],[97],[54]) or Support Vector machines (e.g. [96],[14]). In a nutshell, kernel methods are diagnostic schemes (see subsection 1.3.2) in which the prior distribution over the latent function is a Gaussian process, specified by a positive definite covaria...

1631 | Experiments with a new boosting algorithm
- Freund, Schapire
- 1996
Citation Context ... the named entities satisfying the Co-Training requirements. The most interesting part of the paper is the development of co-boosting, namely an extension of the very powerful AdaBoost algorithm (see [30],[72]) for supervised classification to attack the labeled-unlabeled problem. This extension is surprisingly simple, yet very elegant, and the algorithm could prove very competitive among existing lab...

1441 | Making large-Scale SVM Learning Practical
- Joachims
- 1999
Citation Context ...te descent algorithms. The most substantial drawback is, however, that the scheme does not seem to be "kernelizable", i.e. the algorithm cannot be used together with a feature space mapping. Joachims [49] presents a greedy approximative implementation of Vapnik's transduction scheme, again for the case of linear discriminants (or SVM). The algorithm is not guaranteed (or expected) to find the true optim...

1273 | Spline models for observational data - Wahba - 1990

1245 | Combining Labeled and Unlabeled Data with Co-training
- Blum, Mitchell
- 1998
Citation Context ...m over X. The "... question of how unlabeled examples can be used to augment labeled data seems a slippery one from the point of view of standard PAC assumptions" (citation from Blum and Mitchell [12]). PAC bounds analyze deviations between training and generalization error for certain predictors, drawn from a hypothesis set of limited complexity. Complexity measures for hypothesis sets, such as t...
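The Co-Training idea of Blum and Mitchell can be sketched with two toy per-view learners: each learner pseudo-labels the unlabeled points it is most confident about, and those points join a shared labeled pool. This is our own minimal illustration (nearest-centroid learners with a margin-based confidence), not the authors' algorithm verbatim.

```python
import numpy as np

class Centroid:
    """Toy per-view learner: nearest class centroid."""
    def fit(self, X, y):
        self.c = np.array([X[y == k].mean(axis=0) for k in (0, 1)])
        return self
    def predict(self, X):
        d = ((X[:, None, :] - self.c) ** 2).sum(axis=2)
        return d.argmin(axis=1)
    def margin(self, X):
        d = ((X[:, None, :] - self.c) ** 2).sum(axis=2)
        return np.abs(d[:, 0] - d[:, 1])   # confidence proxy

def co_train(X1, X2, y_init, labeled_idx, n_rounds=10, per_round=10):
    """Co-training sketch on two redundant views X1, X2.  Each round,
    each learner pseudo-labels its most confident unlabeled points;
    both learners are then retrained on the grown labeled pool."""
    labeled = set(labeled_idx)
    y = y_init.copy()
    h1 = h2 = None
    for _ in range(n_rounds):
        idx = np.fromiter(labeled, dtype=int)
        h1 = Centroid().fit(X1[idx], y[idx])
        h2 = Centroid().fit(X2[idx], y[idx])
        for h, X in ((h1, X1), (h2, X2)):
            pool = np.array([i for i in range(len(y)) if i not in labeled])
            if len(pool) == 0:
                return h1, h2, y
            take = pool[np.argsort(-h.margin(X[pool]))[:per_round]]
            y[take] = h.predict(X[take])
            labeled.update(int(i) for i in take)
    return h1, h2, y

rng = np.random.default_rng(0)
n = 100
y_true = np.array([0] * n + [1] * n)
centers = np.where(y_true[:, None] == 0, -2.0, 2.0)
X1 = centers + 0.5 * rng.normal(size=(2 * n, 2))   # view 1
X2 = centers + 0.5 * rng.normal(size=(2 * n, 2))   # view 2
y_init = -np.ones(2 * n, dtype=int)   # -1 marks "unlabeled"
y_init[0], y_init[n] = 0, 1           # one labeled point per class
h1, h2, y_pseudo = co_train(X1, X2, y_init, labeled_idx=[0, n])
```

With two well-separated, conditionally independent views, a single labeled example per class suffices here to pseudo-label the rest accurately.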

1240 | Statistical decision theory and Bayesian analysis. Springer series in Statistics
- Berger
- 1985
Citation Context ...complex of model families describing the part of the hierarchy which specifies the joint prior distribution for a subset of the latent variables, is sometimes referred to as a hierarchical prior. Berger [9] gives a good introduction to Bayesian analysis. There are two important, basic mechanisms for introducing latent variables to achieve better compression. The principle of divide and conquer states ...

1113 | Pattern recognition and neural networks
- Ripley
- 1996
Citation Context ... about the relationship, the observed data does not contain any information on how to generalize to unseen data. Classification schemes can be grouped into two major classes (see [23],[70]), following either the diagnostic or the sampling paradigm. Methods within the diagnostic paradigm will be referred to as diagnostic methods (or discriminative methods), while schemes within the samp...

994 | A probabilistic theory of pattern recognition
- Devroye, Györfi, et al.
- 1996
Citation Context ...ction, but with the posterior replaced by the MRE distribution. To us non-experts, however, these different bounds employ quite similar techniques. For example, there exists an induction bound (see [27]) which proceeds via a double sample, such that the original and the "ghost" sample have very different sizes. This seems to be very close to the situation from where a transduction bound would start. ...

879 | Mixture Models
- Mclachlan, Basford
- 1988
Citation Context ...dels, in the latter case the model family is tightly regularized by an appropriate prior P(θ) on the model parameter θ. The noise model is usually a Gaussian. Other examples are mixture models (e.g. [58],[90],[69]) where the latent variable is a grouping variable from a finite set (similar to the class label in supervised classification), and the conditional models come from simple families such as Gaus...

803 | Estimation of Dependencies Based on Empirical Data
- Vapnik
- 1979
Citation Context ... in the experimental results reported in [91]. 3.7 Transduction. Transductive inference, as opposed to inductive inference, is a principle which has been introduced into learning theory by Vapnik (see [94],[96]). Suppose we are given a labeled training set D_l as well as a set of test points D_u, and we are required to predict the labels of the test points. The traditional way is to propose the exi...

803 | Text classification from labeled and unlabeled documents using - Nigam, McCallum, et al. - 2000 |

764 | A view of the EM algorithm that justifies incremental sparse and other variants
- Neal, Hinton
- 1998
Citation Context ...t losing the convergence guarantees. EM is a special case of an alternating minimization procedure in the context of information geometry (see [21]), as has been observed by several authors (e.g. [2],[41]). Several important problems in information theory, such as computation of the capacity of a (discrete memoryless) channel or of the rate-distortion function, can be shown to be equivalent to the fol...

724 | Hierarchical mixtures of experts and EM algorithm
- Jordan, Jacobs
- 1994
Citation Context ...kept definitive (or "clamped") on the observed labels). The resulting predictor is $\sum_k P(t|x, k, \hat\theta)\, P(k|x, \hat\theta)$, where $P(k|x, \theta) \propto \pi_k P(x|k, \theta)$, similar to a mixture-of-experts architecture (see [50],[98]). However, in the latter, the gating models for P(k|x) are diagnostic rather than generative, and the whole architecture is trained to maximize the conditional likelihood of the data rather th...

698 | Improved Boosting Algorithms using Confidence-rated Predictions
- Schapire, Singer
- 1999
Citation Context ...amed entities satisfying the Co-Training requirements. The most interesting part of the paper is the development of co-boosting, namely an extension of the very powerful AdaBoost algorithm (see [29],[73]) for supervised classification to attack the labeled-unlabeled problem. This extension is surprisingly simple, yet very elegant, and the algorithm could prove very competitive among existing labeled-...

624 | Statistical Analysis of Finite Mixture Distributions
- Titterington, Smith, et al.
- 1985
Citation Context ... in the latter case the model family is tightly regularized by an appropriate prior P(θ) on the model parameter θ. The noise model is usually a Gaussian. Other examples are mixture models (e.g. [58],[90],[69]) where the latent variable is a grouping variable from a finite set (similar to the class label in supervised classification), and the conditional models come from simple families such as Gaussians...

611 | Learning in Graphical Models - Jordan, editor - 1999

596 | An Introduction to Computational Learning Theory
- Kearns, Vazirani
- 1994
Citation Context ...r opinion they should still be classified as belonging to the diagnostic paradigm. Theoretical studies of supervised learning methods within the probably approximately correct (PAC) framework (e.g. [51]) focus on diagnostic schemes and consequently ignore the input distribution P(x) in that they either do not restrict it at all or assume it to be uniform over X. The "... question of how unlabel...

528 | Active learning with statistical models
- Cohn, Ghahramani, et al.
- 1996
Citation Context ...e a sample from P(t|x) if P(x) is very small. All these points are discussed in detail in [31], section 1. MacKay [53] discusses Bayesian active learning for multi-layer perceptrons. Cohn et al [19] introduce the general problem, then focus on joint density models of the kind discussed in [35] (see also [38] and subsection 2.2). A very general query filtering algorithm is query by committee (QBC) ...

505 | Mixture densities, maximum likelihood and the EM algorithm
- Redner, Walker
- 1984
Citation Context ...he latter case the model family is tightly regularized by an appropriate prior P(θ) on the model parameter θ. The noise model is usually a Gaussian. Other examples are mixture models (e.g. [58],[90],[69]) where the latent variable is a grouping variable from a finite set (similar to the class label in supervised classification), and the conditional models come from simple families such as Gaussians with...

480 | Bayesian classification (AutoClass): Theory and results
- Cheeseman, Stutz
- 1995
Citation Context ... problems are supervised and unsupervised learning. We have already discussed these large classes in subsection 1.1. A very prominent project for Bayesian unsupervised learning is AutoClass (see [38],[17]); it might be used for a straightforward implementation of the baseline methods discussed in subsection 2.1. We found the discussion in [86] of quantization (probably the most important special case of u...

455 | Nonparametric Regression and Generalized Linear Models: A roughness penalty approach
- Green, Silverman
- 1994
Citation Context ...tion. A MAP approximation to Bayesian Gaussian process classification (e.g. [99]), also called generalized penalized maximum likelihood, can be seen as logistic regression in a feature space (see e.g. [37]). In such purely diagnostic settings, unlabeled data cannot help narrowing our belief in the latent function, see subsection 1.3.2. Anderson [3] circumvents this problem by choosing a parameterizatio...

434 | The information bottleneck method
- Tishby, Pereira, et al.
- 1999
Citation Context ...the recently proposed information bottleneck learning algorithm can also be regarded as such a procedure between three convex sets, and therefore has the same theoretical basis as the EM algorithm (see [89]). 14 We do not discuss these conditions in detail here; they are usually fulfilled in practice. 15 To be able to talk about "smoothness" and "environments", we first have to impose a manifold structure ...

433 | Unsupervised models for named entity classification
- Collins, Singer
- 1999
Citation Context ...r only weakly conditionally dependent, given t, we would expect that D_u can boost the performance, compared to classification based on D_l alone, given somewhat weaker assumptions. Collins and Singer [20] apply the Co-Training paradigm to the problem of named entity classification. Here, one is interested in classifying entities which are ... (Footnote 26: Thanks to Chris Williams for pointing this out.)

397 | Exploiting generative models in discriminative classifiers
- Jaakkola, Haussler
- 1998
Citation Context ... available prior knowledge into the procedures suggested in [77], but we are very interested in following up this very recent line of research. 3.5 The Fisher kernel. The Fisher kernel, as proposed in [46], is the first general and principled attempt to exploit information from a generative model fitted to the input distribution P(x) in one of the most powerful currently available classes of discriminativ...

397 | Mixtures of probabilistic principal component analysers
- Tipping, M
- 1999
Citation Context ...s come from simple families such as Gaussians with structurally restricted covariance matrices. Combinations of mixture and latent subspace models have also been considered in numerous variants (e.g. [88],[34],[93],[5]). Finally note how, within all these models, complexity can be regulated at various levels. The relations between latent and observable variables are kept simple by choosing relatively...

368 | Convolution Kernels on Discrete Structures
- Haussler
- 1999
Citation Context ... kernel methods are diagnostic schemes (see subsection 1.3.2) in which the prior distribution over the latent function is a Gaussian process, specified by a positive definite covariance kernel (e.g. [40]). The covariance kernel induces a "natural" distance in a feature space, and the Fisher kernel attempts to adapt this distance in a highly genuine and interesting way to information about the distrib...

334 | Selective sampling using the query by committee algorithm
- Freund, Seung, et al.
- 1997
Citation Context ..., the authors try to combine an EM algorithm on a joint probability model (see [65] and subsection 2.2) with an active learning strategy, namely the query-by-committee (QBC) algorithm ([82], see also [31] and subsection 4.1), to attack an instance of the labeled-unlabeled problem in text classification. The idea is to overcome stability problems of standard EM by injecting unlabeled points one at a tim...

318 | Query by committee
- Seung, Opper, et al.
- 1992
Citation Context ...n 3.3). In [55], the authors try to combine an EM algorithm on a joint probability model (see [65] and subsection 2.2) with an active learning strategy, namely the query-by-committee (QBC) algorithm ([82], see also [31] and subsection 4.1), to attack an instance of the labeled-unlabeled problem in text classification. The idea is to overcome stability problems of standard EM by injecting unlabeled poin...
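The query-by-committee selection rule referred to here can be sketched as follows. The committee below is a set of 1-D threshold classifiers sampled from the version space; this setup is our illustrative choice, not the construction in [82].

```python
import numpy as np

class Stump:
    """1-D threshold classifier: predict class 1 iff x >= t."""
    def __init__(self, t):
        self.t = t
    def predict(self, x):
        return (x >= self.t).astype(int)

def qbc_select(x_pool, committee):
    """Query by committee: return the index of the pool point on which
    the committee disagrees most (minority-vote fraction, maximal when
    the vote splits 50/50)."""
    votes = np.stack([h.predict(x_pool) for h in committee])
    p1 = votes.mean(axis=0)                  # fraction voting class 1
    return int(np.minimum(p1, 1.0 - p1).argmax())

# after seeing a negative example at 0 and a positive one at 1, any
# threshold in (0, 1) is consistent; sample the committee from there
rng = np.random.default_rng(0)
committee = [Stump(t) for t in rng.uniform(0.0, 1.0, size=25)]
x_pool = np.array([-3.0, 0.5, 4.0, 9.0])
picked = qbc_select(x_pool, committee)
```

The committee agrees on every pool point outside the disagreement region (0, 1), so the point at 0.5 is selected: labeling it shrinks the version space the most.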

295 |
Principal curves
- Hastie, Stuetzle
- 1989
Citation Context ... data in the best possible way. Examples of unsupervised techniques include latent subspace models like principal component analysis (PCA) (e.g. [11]), factor analysis (e.g. [29]) or principal curves [39]. Here, we introduce a latent "compression" variable u, living in a low-dimensional space, furthermore impose a noisy functional relationship on x|u. The functional relationships are represented eithe...

280 | GTM: The generative topographic mapping
- Bishop, Svensén, et al.
- 1998
Citation Context ...ly be modeled as coming from an underlying low-dimensional manifold (or, more generally, from a mixture of such manifolds), convolved with Gaussian noise. The generative topographic mapping (GTM) [10] is a very powerful architecture in such situations, obtaining the latent manifold as a smooth nonlinear mapping of a uniform distribution over a low-dimensional space, represented by a regular grid. ...

257 | Employing EM in pool-based active learning for text classification
- McCallum
- 1998
Citation Context ...echniques, provides an extensive case study and contains a very detailed section on related work. A later paper [64] extends the case study, including more robust EM variants (see subsection 3.3). In [55], the authors try to combine an EM algorithm on a joint probability model (see [65] and subsection 2.2) with an active learning strategy, namely the query-by-committee (QBC) algorithm ([82], see also ...

229 |
Elements of information theory. Wiley series in telecommunications
- Cover, Thomas
- 1991
Citation Context ...under this conditional distribution, and then choose a new model which maximizes this criterion. To be more specific, let z_v be the observed, z_h the hidden variables. By Jensen's inequality (e.g. [21]), applied to the concave log, we have $\log P(z_v|\theta) = \log \int P(z_v, z_h|\theta)\, dz_h \ge \mathbb{E}_{z_h \sim Q(z_h)}\!\left[\log \frac{P(z_v, z_h|\theta)}{Q(z_h)}\right]$ (5) for any distribution Q(z_h). For fixed z_v and the curren...

226 |
Differential-Geometrical Methods in Statistics
- Amari
- 1985
Citation Context ...scovery"). Here, the model family {P(x|θ) | θ ∈ Θ}, Θ ⊂ R^p, is assumed to be a smooth, low-dimensional manifold embedded in R^p (this view on model families comes from information geometry, see e.g. [1],[64]). The task is to learn the prior P(θ) which enforces this assumption, from multiple tasks. In this work, the manifold is modeled by connecting locally linear patches using kernel smoothing. Alt...

225 | The EM algorithm for mixtures of factor analyzers
- Ghahramani, GE
- 1996
Citation Context ... from simple families such as Gaussians with structurally restricted covariance matrices. Combinations of mixture and latent subspace models have also been considered in numerous variants (e.g. [88],[34],[93],[5]). Finally note how, within all these models, complexity can be regulated at various levels. The relations between latent and observable variables are kept simple by choosing relatively narro...

222 | Gaussian processes for regression
- Williams, Rasmussen
- 1996
Citation Context ...om a generative model fitted to the input distribution P(x) in one of the most powerful currently available classes of discriminative classifiers, namely kernel methods such as Gaussian processes (e.g. [100],[97],[54]) or Support Vector machines (e.g. [96],[14]). In a nutshell, kernel methods are diagnostic schemes (see subsection 1.3.2) in which the prior distribution over the latent function is a Ga...

195 | Prediction with Gaussian processes: From linear regression to linear prediction and beyond
- Williams
- 1999
Citation Context ...ikelihood of the data D_l. A Bayesian approach would place a prior on the linear function and compute the posterior distribution. A MAP approximation to Bayesian Gaussian process classification (e.g. [99]), also called generalized penalized maximum likelihood, can be seen as logistic regression in a feature space (see e.g. [37]). In such purely diagnostic settings, unlabeled data cannot help narrowing...

184 | Supervised learning from incomplete data via an EM approach
- Ghahramani, Jordan
- 1994
Citation Context ...[31], section 1. MacKay [53] discusses Bayesian active learning for multi-layer perceptrons. Cohn et al [19] introduce the general problem, then focus on joint density models of the kind discussed in [35] (see also [38] and subsection 2.2). A very general query filtering algorithm is query by committee (QBC) (see [82], also [31]), which has already been mentioned in subsection 3.2. QBC is sequential in ...

174 | Semi-supervised support vector machines
- Bennett, Demiriz
- 1999
Citation Context ..., in our case y(x) = w^T x + b), with the latter obviously closely related to the generalization error, we do not know of such a link between the ρ-margin and the generalization error. Bennett et al [7] suggest a variant of Vapnik's scheme for the case of linear discriminants (i.e. SVM). They focus on a variant of SVM which employs the 1-norm ‖w‖₁ = Σ_j |w_j| for penalization (instead of the Euclidea...

160 | Using the Fisher kernel method to detect remote protein homologies
- Jaakkola, Diekhans, et al.
- 1999
Citation Context ... might improve upon the basic Fisher kernel, although these are not yet sufficiently tested empirically. The Fisher kernel has been applied successfully to discrimination between protein families ([46],[43]), where the proteins are represented by their amino acid sequence and families are fitted using hidden Markov models (HMM). It has also been applied to document retrieval [42]. Attempts to apply the Fi...

153 | N.: Agglomerative information bottleneck
- Slonim, Tishby
- 2000
Citation Context ...ject for Bayesian unsupervised learning is AutoClass (see [38],[18]); it might be used for a straightforward implementation of the baseline methods discussed in subsection 2.1. We found the discussion in [84] of quantization (probably the most important special case of unsupervised learning) in the context of source compression and rate distortion theory very helpful. 4.1 Active learning. In (pool-based) a...

150 | Bayesian methods for adaptive models
- MacKay
- 1992
Citation Context ...lysis, we can still use the IRLS method to compute this. (Footnote 12: The MAP approximation to Bayesian analysis is briefly discussed, in another context, in subsection 2.4; details can be found e.g. in [53].) The advantage of this method over labeling the clusters by assuming that k acts as separator between x and t, as discussed above in this subsection, is that clusters can be split b...

148 | Variational inference for Bayesian mixtures of factor analysers
- Ghahramani, Beal
Citation Context ...le families such as Gaussians with structurally restricted covariance matrices. Combinations of mixture and latent subspace models have also been considered in numerous variants (e.g. [88],[34],[93],[5]). Finally note how, within all these models, complexity can be regulated at various levels. The relations between latent and observable variables are kept simple by choosing relatively narrow model f...

139 | Is Learning the Nth Thing Any Easier Than Learning the First
- Thrun
- 1996
Citation Context ... regularization) guided by available prior knowledge or assumptions. Response coaching can be seen as a special case of the problem of learning how to learn or multitask learning (e.g. [74],[4],[14],[88],[62]). The relationship x ↦ z is a second task which is learned together with the primary one in an attempt to employ information flow through latent, shared variables. A very general approach to th...

136 | Information geometry and alternating minimization procedures, Statistics and Decisions (Supplement 1) - Csiszar, Tusnady - 1984

132 | An introduction to latent variable models
- Everitt
- 1984
Citation Context ...ation is always to fit the data in the best possible way. Examples of unsupervised techniques include latent subspace models like principal component analysis (PCA) (e.g. [10]), factor analysis (e.g. [28]) or principal curves [39]. Here, we introduce a latent "compression" variable u, living in a low-dimensional space, furthermore impose a noisy functional relationship on x|u. The functional relations...

130 | Self-organizing neural network that discovers surfaces in random-dot stereograms
- Becker, Hinton
- 1992
Citation Context ...et very effective) idea, therefore it does not come as a surprise that related ideas have been used in earlier work on unsupervised learning. We begin by reviewing some of this work. Becker and Hinton [6] propose the IMAX strategy to learn coherence structure in data. Quoting [7], the approach is "to maximize some measure of agreement between the outputs of two groups of units which receive inputs phy...