## Semi-Supervised Learning Literature Survey (2006)

Citations: 452 (8 self)

### BibTeX

```bibtex
@MISC{Zhu06semi-supervisedlearning,
  author = {Xiaojin Zhu},
  title  = {Semi-Supervised Learning Literature Survey},
  year   = {2006}
}
```

### Abstract

We review the literature on semi-supervised learning, an area of machine learning and, more generally, artificial intelligence. There has been a whole spectrum of interesting ideas on how to learn from both labeled and unlabeled data, i.e. semi-supervised learning. This document is a chapter excerpt from the author’s doctoral thesis (Zhu, 2005); however, the author plans to update the online version frequently to incorporate the latest developments in the field. Please obtain the latest version at http://www.cs.wisc.edu/~jerryzhu/pub/ssl_survey.pdf

### Citations

9002 | Statistical Learning Theory
- Vapnik
- 1998
Citation Context ...esulting classifiers are defined over the whole space. The name TSVM originates from the intention to work only on the observed data (though people use them for induction anyway), which according to (Vapnik, 1998) is solving a simpler problem. People sometimes use the analogy that transductive learning is a take-home exam, while inductive learning is an in-class exam. • In this survey semi-supervised learning refe...

8132 | Maximum likelihood from incomplete data via the em algorithm
- Dempster, Laird, et al.
- 1977
Citation Context ...ccuracy, while (c)’s is much better. 2.3 EM Local Maxima Even if the mixture model assumption is correct, in practice mixture components are identified by the Expectation-Maximization (EM) algorithm (Dempster et al., 1977). EM is prone to local maxima. If a local maximum is far from the global maximum, unlabeled data may again hurt learning. Remedies include smart choice of starting point by active learning (Nigam, 20...
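As a toy illustration of the point above (my own sketch, not from the survey): EM for a two-component 1-D Gaussian mixture with fixed unit variances and equal weights. A symmetric start recovers both clusters, while a start with both means equal is a degenerate fixed point that collapses to the overall data mean.

```python
import numpy as np

def em_gmm_1d(x, mu_init, n_iter=100, sigma=1.0):
    """EM for a two-component 1-D Gaussian mixture with fixed, equal
    variances and equal mixing weights; only the means are estimated."""
    mu = np.array(mu_init, dtype=float)
    for _ in range(n_iter):
        # E-step: responsibility of each component for each point.
        dens = np.exp(-0.5 * ((x[:, None] - mu[None, :]) / sigma) ** 2)
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: means become responsibility-weighted averages.
        mu = (resp * x[:, None]).sum(axis=0) / resp.sum(axis=0)
    return mu

# Two well-separated clusters of unlabeled data.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-4, 1, 200), rng.normal(4, 1, 200)])

good = em_gmm_1d(x, mu_init=[-1.0, 1.0])  # symmetric start: finds both modes
bad = em_gmm_1d(x, mu_init=[0.0, 0.0])    # equal means: stuck at the data mean
```

With equal initial means the responsibilities stay at 0.5 forever, so both means converge to the same value and the two components are never separated.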

4286 | A tutorial on hidden Markov models and selected applications in speech recognition
- Rabiner
Citation Context ...quences and trees. 10.1 Generative Models One example of generative models for semi-supervised sequence learning is the Hidden Markov Model (HMM), in particular the Baum-Welch HMM training algorithm (Rabiner, 1989). It is essentially the sequence version of the EM algorithm on mixture models as mentioned in section 2. The Baum-Welch algorithm has a long history, well before the recent emergence of interest in semi...

2611 | Normalized cuts and image segmentation
- Shi, Malik
- 2000
Citation Context ...., 2004a). The previous work (Zhou et al., 2005b) is a special case with the 2-step random walk transition matrix. In the absence of labels, the algorithm is the generalization of the normalized cut (Shi & Malik, 2000) on directed graphs. Lu and Getoor (2003) convert the link structure in a directed graph into per-node features, and combine them with per-node object features in logistic regression. They also use a...

2392 | Latent Dirichlet allocation
- Blei, Ng, et al.
Citation Context ...Each document in turn has a fixed topic proportion (a multinomial on a higher level). However there is no link between the topic proportions in different documents. Latent Dirichlet Allocation (LDA) (Blei et al., 2003) is one step further. It assumes the topic proportion of each document is drawn from a Dirichlet distribution. With variational approximation, each document is represented by a posterior Dirichlet ov...

1692 | A global geometric framework for nonlinear dimensionality reduction, Science 290 - Tenenbaum, Silva, et al. - 2000

1620 | Nonlinear dimensionality reduction by locally linear embedding - Roweis, Saul - 2000

1245 | Combining labeled and unlabeled data with co-training
- Blum, Mitchell
- 1998
Citation Context ...r specific base learners, there have been some analyses of convergence. See e.g. (Haffari & Sarkar, 2007; Culp & Michailidis, 2007). 4 Co-Training and Multiview Learning 4.1 Co-Training Co-training (Blum & Mitchell, 1998) (Mitchell, 1999) assumes that (i) features can be split into two sets; (ii) each sub-feature set is sufficient to train a good classifier; (iii) the two sets are conditionally independent given the ...
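A minimal sketch of the co-training loop built on assumptions (i)–(iii): two views of the data, two seed labels, and a deliberately simple nearest-centroid learner standing in for the base classifiers. The toy data and the centroid learner are my own illustrative assumptions, not the setup of Blum & Mitchell.

```python
import numpy as np

class CentroidClassifier:
    """Toy base learner: predict the class of the nearest centroid;
    confidence is the margin between the two centroid distances."""
    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.centroids_ = np.array([X[y == c].mean(axis=0) for c in self.classes_])
        return self

    def predict_with_confidence(self, X):
        d = np.linalg.norm(X[:, None, :] - self.centroids_[None, :, :], axis=2)
        return self.classes_[d.argmin(axis=1)], np.abs(d[:, 0] - d[:, 1])

def co_train(X1, X2, seed_labels, rounds=10, k=3):
    """Each round, the classifier trained on each view labels its k most
    confident unlabeled points and adds them to the shared labeled pool."""
    n = X1.shape[0]
    labels = dict(seed_labels)                  # index -> label
    for _ in range(rounds):
        for X in (X1, X2):                      # alternate the two views
            unlabeled = [i for i in range(n) if i not in labels]
            if not unlabeled:
                return labels
            idx = sorted(labels)
            clf = CentroidClassifier().fit(X[idx], np.array([labels[i] for i in idx]))
            pred, conf = clf.predict_with_confidence(X[unlabeled])
            for j in np.argsort(-conf)[:k]:     # most confident first
                labels[unlabeled[j]] = pred[j]
    return labels

# Two redundant views of two well-separated classes, one seed label each.
rng = np.random.default_rng(1)
y_true = np.array([0] * 20 + [1] * 20)
shift = np.where(y_true[:, None] == 0, -2.0, 2.0)
X1 = rng.normal(0, 0.5, (40, 2)) + shift
X2 = rng.normal(0, 0.5, (40, 2)) + shift
labels = co_train(X1, X2, {0: 0, 20: 1})
```

Because each view is sufficient on its own here, each classifier's confident self-labels are reliable training data for the other view, which is the mechanism co-training relies on.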

1103 | On spectral clustering: Analysis and an algorithm
- Ng, Jordan, et al.
Citation Context ...el or constraint information, and therefore applicable for transductive classification. The data points are mapped into a new space spanned by the first k eigenvectors of the normalized Laplacian in (Ng et al., 2001), with special normalization. Clustering is then performed with traditional methods (like k-means) in this new space. This is very similar to kernel PCA. Fowlkes et al. (2004) use the Nyström method ...

800 | Text classification from labeled and unlabeled documents using EM - Nigam, McCallum, et al. - 2000

761 | A comparison of event models for naive bayes text classification
- McCallum, Nigam
- 1998
Citation Context ...ify the two components. For instance, the mixtures on the second and third line give the same p(x), but they classify x = 0.5 differently. Gaussian is identifiable. Mixture of multivariate Bernoulli (McCallum & Nigam, 1998a) is not identifiable. More discussions on identifiability and semi-supervised learning can be found in e.g. (Ratsaby & Venkatesh, 1995) and (Corduneanu & Jaakkola, 2001). 2.2 Model Correctness If th...

738 | Laplacian eigenmaps for dimensionality reduction and data representation
- Belkin, Niyogi
Citation Context .... Representative methods include Isomap (Tenenbaum et al., 2000), locally linear embedding (LLE) (Roweis & Saul, 2000) (Saul & Roweis, 2003), Hessian LLE (Donoho & Grimes, 2003), Laplacian eigenmaps (Belkin & Niyogi, 2003), and semidefinite embedding (SDE) (Weinberger & Saul, 2004) (Weinberger et al., 2004) (Weinberger et al., 2005). If one has some labeled data, for example in the form of the target low-dimensional r...

681 | Transductive inference for text classification using support vector machines
- Joachims
- 1999
Citation Context ...emiriz, 1999) (Demirez & Bennett, 2000) (Fung & Mangasarian, 1999) either cannot handle more than a few hundred unlabeled examples, or did not do so in experiments. The SVM-light TSVM implementation (Joachims, 1999) is the first widely used software. De Bie and Cristianini (De Bie & Cristianini, 2004; De Bie & Cristianini, 2006b) relax the TSVM training problem, and transductive learning problems in general to ...

531 | Probabilistic latent semantic analysis
- Hofmann
- 1999
Citation Context ...hen class separation is linear and along the principal component directions, and unlabeled data helps by reducing the variance in estimating such directions. Probabilistic Latent Semantic Analysis (PLSA) (Hofmann, 1999) is an important improvement over LSI. Each word in a document is generated by a ‘topic’ (a multinomial, i.e. unigram). Different words in the document may be generated by different topics. Each docu...

493 | Semi-supervised learning using Gaussian fields and harmonic functions
- ZHU, GHAHRAMANI, et al.
- 2003
Citation Context ...y information, are applied to semi-supervised learning successfully. However dissimilarity is only briefly discussed, with many questions remaining open. There is a finite form of a Gaussian process in (Zhu et al., 2003c), in fact a joint Gaussian distribution on the labeled and unlabeled points with the covariance matrix derived from the graph Laplacian. Semi-supervised learning happens in the process model, not...

490 | Unsupervised word sense disambiguation rivaling supervised methods - Yarowsky - 1995

444 | Machine learning
- Mitchell
- 1996
Citation Context ...g? Now let us turn our attention from machine learning to human learning. It is possible that understanding of the human cognitive model will lead to novel machine learning approaches (Langley, 2006; Mitchell, 2006). We ask the question: Do humans do semi-supervised learning? My hypothesis is yes. We humans accumulate ‘unlabeled’ input data, which we use (often unconsciously) to help building world pop...

437 | Learning with local and global consistency - Zhou, Bousquet, et al. - 2004

434 | Unsupervised models for named entity classification - Collins, Singer - 1999

396 | Exploiting generative models in discriminative classifiers
- Jaakkola, Haussler
- 1999
Citation Context ...nerative model for classification, each labeled example is converted into a fixed-length Fisher score vector, i.e. the derivatives of log likelihood w.r.t. model parameters, for all component models (Jaakkola & Haussler, 1998). These Fisher score vectors are then used in a discriminative classifier like an SVM, which empirically has high accuracy. 3 Self-Training Self-training is a commonly used technique for semi-supe...

374 | A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts - Pang, Lee - 2004

333 | Manifold regularization: A geometric framework for learning from labeled and unlabeled examples - Belkin, Niyogi, et al.

320 | Segmentation using eigenvectors: a unifying view
- Weiss
- 1999
Citation Context ...rs of the graph Laplacian can unfold the data manifold to form meaningful clusters. This is the intuition behind spectral clustering. There are several criteria on what constitutes a good clustering (Weiss, 1999). The normalized cut (Shi & Malik, 2000) seeks to minimize

Ncut(A, B) = \frac{cut(A, B)}{assoc(A, V)} + \frac{cut(A, B)}{assoc(B, V)} \quad (24)

The continuous relaxation of the cluster indicator vector can be derived f...
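A small numeric sketch of the normalized cut criterion Ncut(A, B) = cut(A, B)/assoc(A, V) + cut(A, B)/assoc(B, V), on a toy graph of my own construction (not from Weiss or Shi & Malik): two triangles joined by one weak edge, where cutting the weak edge gives a much smaller Ncut than slicing through a triangle.

```python
import numpy as np

def ncut(W, A):
    """Normalized cut of the partition (A, V \\ A) of a weighted graph W:
    Ncut = cut(A,B)/assoc(A,V) + cut(A,B)/assoc(B,V)."""
    n = W.shape[0]
    mask = np.zeros(n, dtype=bool)
    mask[list(A)] = True
    cut = W[mask][:, ~mask].sum()     # weight crossing the partition
    assoc_A = W[mask].sum()           # weight from A to all of V
    assoc_B = W[~mask].sum()          # weight from B to all of V
    return cut / assoc_A + cut / assoc_B

# Two triangles (0,1,2) and (3,4,5) joined by a weak edge 2-3.
W = np.zeros((6, 6))
for i, j in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5)]:
    W[i, j] = W[j, i] = 1.0
W[2, 3] = W[3, 2] = 0.1

good = ncut(W, {0, 1, 2})   # cut only the weak edge
bad = ncut(W, {0, 1})       # cut through a triangle
```

The denominators are what distinguish Ncut from a plain minimum cut: they penalize partitions whose sides have little total association, so the criterion prefers balanced clusters.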

317 | A framework for learning predictive structures from multiple tasks and unlabeled data - Ando, Zhang - 2005

289 | A tutorial on spectral clustering - Luxburg

283 | Semi-Supervised Learning
- CHAPELLE, SCHÖLKOPF, et al.
- 2006
Citation Context ...ints and the goal is clustering, is only briefly discussed later in the survey. We will follow the above convention in the survey. Q: Where can I learn more? A: A book on semi-supervised learning is (Chapelle et al., 2006c). An older survey can be found in (Seeger, 2001). I gave a tutorial at ICML 2007, the slides can be found at http://pages.cs.wisc.edu/~jerryzhu/icml07tutorial.html. 2 Generative Models Generative...

266 | Learning from labeled and unlabeled data using graph mincuts - Blum, Chawla - 2001

256 | Employing EM in pool-based active learning for text classification
- McCallum, Nigam
- 1998
Citation Context ...ify the two components. For instance, the mixtures on the second and third line give the same p(x), but they classify x = 0.5 differently. Gaussian is identifiable. Mixture of multivariate Bernoulli (McCallum & Nigam, 1998a) is not identifiable. More discussions on identifiability and semi-supervised learning can be found in e.g. (Ratsaby & Venkatesh, 1995) and (Corduneanu & Jaakkola, 2001). 2.2 Model Correctness If th...

254 | Think globally, fit locally: unsupervised learning of low dimensional manifolds
- Saul, Roweis
- 2004
Citation Context ... dimensional space is closely related to spectral graph semi-supervised learning. Representative methods include Isomap (Tenenbaum et al., 2000), locally linear embedding (LLE) (Roweis & Saul, 2000) (Saul & Roweis, 2003), Hessian LLE (Donoho & Grimes, 2003), Laplacian eigenmaps (Belkin & Niyogi, 2003), and semidefinite embedding (SDE) (Weinberger & Saul, 2004) (Weinberger et al., 2004) (Weinberger et al., 2005). If ...

206 | Partially labeled classification with markov random walks - Szummer, Jaakkola - 2001

192 | Analyzing the effectiveness and applicability of co-training - Nigam, Ghani - 2000

191 | Spectral grouping using the Nyström method
- Fowlkes, Belongie, et al.
- 2004
Citation Context ... in (Delalleau et al., 2005) the authors propose an induction scheme to classify a new point x by

f(x) = \frac{\sum_{i \in L \cup U} w_{xi} f(x_i)}{\sum_{i \in L \cup U} w_{xi}} \quad (17)

This can be viewed as an application of the Nyström method (Fowlkes et al., 2004). Yu et al. (2004) report an early attempt on semi-supervised induction using RBF basis functions in a regularization framework. In (Belkin et al., 2004b), the function f does not have to be restrict...
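The induction rule quoted above, labeling a new point by the weight-averaged value of f over all points in L ∪ U, fits in a few lines. The Gaussian edge weights and the toy data below are my own illustrative assumptions, not the exact setup of Delalleau et al.

```python
import numpy as np

def induce(x_new, X, f, sigma=1.0):
    """Induction formula (17): f(x) is the weighted average of f over all
    labeled and unlabeled points, with Gaussian weights
    w_xi = exp(-||x - x_i||^2 / sigma^2)."""
    w = np.exp(-np.sum((X - x_new) ** 2, axis=1) / sigma ** 2)
    return np.dot(w, f) / w.sum()

# f holds a transductive solution on L ∪ U: values near 0 for one class,
# near 1 for the other, on two small clusters.
X = np.array([[0.0, 0.0], [0.2, 0.1], [3.0, 3.0], [3.1, 2.9]])
f = np.array([0.0, 0.0, 1.0, 1.0])

near_zero = induce(np.array([0.1, 0.0]), X, f)  # close to the 0-cluster
near_one = induce(np.array([3.0, 3.1]), X, f)   # close to the 1-cluster
```

Because the weights decay with distance, a test point inherits its label almost entirely from the nearby cluster, with no need to recompute the graph solution.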

189 | Transductive learning via spectral graph partitioning
- JOACHIMS
- 2003
Citation Context ... is used for approximation. The authors also propose a way to classify unseen points. This spectrum transformation is relatively simple. 6.1.8 Spectral Graph Transducer The spectral graph transducer (Joachims, 2003) can be viewed with a loss function and regularizer

\min_f (f - \gamma)^\top C (f - \gamma) + f^\top L f \quad (14)
\text{s.t. } f^\top \mathbf{1} = 0 \text{ and } f^\top f = n \quad (15)

where \gamma_i = \sqrt{l_-/l_+} for positive labeled data, -\sqrt{l_+/l_-} for negative data, l− b...

186 | Self-taught learning: transfer learning from unlabeled data - Raina, Battle, et al. - 2007

184 | Diffusion kernels on graphs and other discrete input spaces
- Kondor, Lafferty
- 2002
Citation Context ... the Laplacian. Chapelle et al. (2002) and Smola and Kondor (2003) both show the spectral transformation of a Laplacian results in kernels suitable for semi-supervised learning. The diffusion kernel (Kondor & Lafferty, 2002) corresponds to a spectrum transform of the Laplacian with

r(\lambda) = \exp\left(-\frac{\sigma^2}{2} \lambda\right) \quad (12)

The regularized Gaussian process kernel \Delta + I/\sigma^2 in (Zhu et al., 2003c) corresponds to

r(\lambda) = \frac{1}{\lambda + \sigma} \quad (13)

Similar...
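The spectrum transform in (12) can be applied directly to the eigendecomposition of a graph Laplacian. Below is a small sketch of my own (the path graph and sigma value are illustrative assumptions): transform each eigenvalue by r(λ) = exp(−σ²λ/2) and reassemble the kernel matrix.

```python
import numpy as np

def diffusion_kernel(W, sigma=1.0):
    """Diffusion kernel: eigendecompose the unnormalized Laplacian
    L = D - W, apply r(lam) = exp(-sigma^2 * lam / 2) to each eigenvalue,
    and reassemble K = U r(Lam) U^T."""
    L = np.diag(W.sum(axis=1)) - W
    lam, U = np.linalg.eigh(L)
    return U @ np.diag(np.exp(-sigma ** 2 * lam / 2)) @ U.T

# Path graph 0-1-2: nodes 0 and 1 are adjacent, 0 and 2 are two hops apart.
W = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
K = diffusion_kernel(W, sigma=1.0)
```

The transform shrinks the high-frequency (large-λ) components of the Laplacian the most, so the resulting kernel assigns higher similarity to nodes that are close on the graph, which is exactly the smoothness assumption graph-based semi-supervised methods rely on.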

182 | Hessian eigenmaps: locally linear embedding techniques for high-dimensional data
- Donoho, Grimes
- 2003
Citation Context ...ed to spectral graph semi-supervised learning. Representative methods include Isomap (Tenenbaum et al., 2000), locally linear embedding (LLE) (Roweis & Saul, 2000) (Saul & Roweis, 2003), Hessian LLE (Donoho & Grimes, 2003), Laplacian eigenmaps (Belkin & Niyogi, 2003), and semidefinite embedding (SDE) (Weinberger & Saul, 2004) (Weinberger et al., 2004) (Weinberger et al., 2005). If one has some labeled data, for exampl...

174 | Semi–supervised support vector machines
- Bennett, Demiriz
- 1998
Citation Context ...um margin boundary would be the one with solid lines. However finding the exact transductive SVM solution is NP-hard. Major effort has focused on efficient approximation algorithms. Early algorithms (Bennett & Demiriz, 1999) (Demirez & Bennett, 2000) (Fung & Mangasarian, 1999) either cannot handle more than a few hundred unlabeled examples, or did not do so in experiments. The SVM-light TSVM implementation (Joachims, 19...

170 | Kernels and regularization on graphs - Smola, Kondor - 2003

167 | Unsupervised learning of image manifolds by semidefinite programming
- Weinberger, Saul
- 2004
Citation Context ... 2000), locally linear embedding (LLE) (Roweis & Saul, 2000) (Saul & Roweis, 2003), Hessian LLE (Donoho & Grimes, 2003), Laplacian eigenmaps (Belkin & Niyogi, 2003), and semidefinite embedding (SDE) (Weinberger & Saul, 2004) (Weinberger et al., 2004) (Weinberger et al., 2005). If one has some labeled data, for example in the form of the target low-dimensional representation for a few data points, the dimensionality redu...

154 | Colorization using optimization - Levin, Lischinski, et al.

152 | Cluster kernels for semi-supervised learning
- Chapelle, Weston, et al.
- 2003
Citation Context ...graph computation every time one encounters new points. Zhu et al. (2003c) propose that a new test point be classified by its nearest neighbor in L∪U. This is sensible when U is sufficiently large. In (Chapelle et al., 2002) the authors approximate a new point by a linear combination of labeled and unlabeled points. Similarly in (Delalleau et al., 2005) the authors propose an induction scheme to classify a new point x ...

123 | Integrating topics and syntax - Griffiths, Steyvers, et al. - 2005

122 | Semi-supervised classification by low density separation - Chapelle, Zien - 2005

122 | Maximum entropy discrimination
- JAAKKOLA, MEILA, et al.
- 1999
Citation Context ...n labeled examples, yet maximally ignorant on unrelated examples. Zhang and Oles (2000) point out that TSVMs may not behave well under some circumstances. The maximum entropy discrimination approach (Jaakkola et al., 1999) also maximizes the margin, and is able to take into account unlabeled data, with SVM as a special case. 5.2 Gaussian Processes Lawrence and Jordan (2005) proposed a Gaussian process approach, which ...

120 | Enhancing supervised learning with unlabeled data - Goldman, Zhou - 2000

116 | Learning subjective nouns using extraction pattern bootstrapping - Riloff, Wiebe, et al.

115 | Regularization and Semi-Supervised Learning on Large Graphs - Belkin, Matveeva, et al. - 2004

115 | Learning a kernel matrix for nonlinear dimensionality reduction
- Weinberger, Sha, et al.
- 2004
Citation Context ...edding (LLE) (Roweis & Saul, 2000) (Saul & Roweis, 2003), Hessian LLE (Donoho & Grimes, 2003), Laplacian eigenmaps (Belkin & Niyogi, 2003), and semidefinite embedding (SDE) (Weinberger & Saul, 2004) (Weinberger et al., 2004) (Weinberger et al., 2005). If one has some labeled data, for example in the form of the target low-dimensional representation for a few data points, the dimensionality reduction problem becomes semi...

108 | Beyond the point cloud: from transductive to semi-supervised learning
- Sindhwani, Niyogi, et al.
- 2005
Citation Context ...labeled data set, and are required to make similar predictions on any given unlabeled instance. Multiview learning has a long history (de Sa, 1993). It has been applied to semi-supervised regression (Sindhwani et al., 2005b; Brefeld et al., 2006), and the more challenging structured output spaces (Brefeld et al., 2005; Brefeld & Scheffer, 2006). Some theoretical analysis on the value of agreement among multiple learner...

102 | A mixture of experts classifier with learning based on both labelled and unlabelled data
- Miller, Uyar
- 1997
Citation Context ...ation a topic may contain several sub-topics, and will be better modeled by multiple multinomials instead of a single one (Nigam et al., 2000). Some other examples are (Shahshahani & Landgrebe, 1994) (Miller & Uyar, 1997). Another solution is to down-weight unlabeled data (Corduneanu & Jaakkola, 2001), which is also used by Nigam et al. (2000), and by Callison-Burch et al. (2004) who estimate word alignment for mac...