Semi-Supervised Learning Literature Survey
, 2006
We review the literature on semi-supervised learning, which is an area in machine learning and, more generally, artificial intelligence. There has been a whole
spectrum of interesting ideas on how to learn from both labeled and unlabeled data, i.e. semi-supervised learning. This document is a chapter excerpt from the author’s
doctoral thesis (Zhu, 2005). However, the author plans to update the online version frequently to incorporate the latest developments in the field. Please obtain the latest
version at http://www.cs.wisc.edu/~jerryzhu/pub/ssl_survey.pdf
Learning to combine bottom-up and top-down segmentation
 in: European Conference on Computer Vision
Abstract. Bottom-up segmentation based only on low-level cues is a notoriously difficult problem. This difficulty has led to recent top-down segmentation algorithms that are based on class-specific image information. Despite the success of top-down algorithms, they often give coarse segmentations that can be significantly refined using low-level cues. This raises the question of how to combine both top-down and bottom-up cues in a principled manner. In this paper we approach this problem using supervised learning. Given a training set of ground truth segmentations we train a fragment-based segmentation algorithm which takes into account both bottom-up and top-down cues simultaneously, in contrast to most existing algorithms which train top-down and bottom-up modules separately. We formulate the problem in the framework of Conditional Random Fields (CRF) and derive a feature induction algorithm for CRF, which allows us to efficiently search over thousands of candidate fragments. Whereas pure top-down algorithms often require hundreds of fragments, our simultaneous learning procedure yields algorithms with a handful of fragments that are combined with low-level cues to efficiently compute high quality segmentations.
Structured prediction, dual extragradient and Bregman projections
 Journal of Machine Learning Research
, 2006
We present a simple and scalable algorithm for maximum-margin estimation of structured output models, including an important class of Markov networks and combinatorial models. We formulate the estimation problem as a convex-concave saddle-point problem that allows us to use simple projection methods based on the dual extragradient algorithm (Nesterov, 2003). The projection step can be solved using dynamic programming or combinatorial algorithms for min-cost convex flow, depending on the structure of the problem. We show that this approach provides a memory-efficient alternative to formulations based on reductions to a quadratic program (QP). We analyze the convergence of the method and present experiments on two very different structured prediction tasks: 3D image segmentation and word alignment, illustrating the favorable scaling properties of our algorithm.
Bayesian conditional random fields
 In Conference on Artificial Intelligence and Statistics (AISTATS), 2005.
, 2005
We propose Bayesian Conditional Random Fields (BCRFs) for classifying interdependent and structured data, such as sequences, images or webs. BCRFs are a Bayesian approach to training and inference with conditional random fields, which were previously trained by maximizing likelihood (ML) (Lafferty et al., 2001). Our framework eliminates the problem of overfitting, and offers the full advantages of a Bayesian treatment. Unlike the ML approach, we estimate the posterior distribution of the model parameters during training, and average over this posterior during inference. We apply an extension of the EP method, the power EP method, to incorporate the partition function. For algorithmic stability and accuracy, we flatten the approximation structures to avoid two-level approximations. We demonstrate the superior prediction accuracy of BCRFs over conditional random fields trained with ML or MAP on synthetic and real datasets.
A Review of Kernel Methods in Machine Learning
, 2006
We review recent methods for learning with positive definite kernels. All these methods formulate learning and estimation problems as linear tasks in a reproducing kernel Hilbert space (RKHS) associated with a kernel. We cover a wide range of methods, ranging from simple classifiers to sophisticated methods for estimation with structured data.
Minimizing and learning energy functions for side-chain prediction
 In RECOMB2007
, 2007
Side-chain prediction is an important subproblem of the general protein folding problem. Despite much progress in side-chain prediction, performance is far from satisfactory. As an example, the ROSETTA protocol, which uses simulated annealing to select the minimum-energy conformations, correctly predicts the first two side-chain angles for approximately 72% of the buried residues in a standard data set. Is further improvement more likely to come from better search methods, or from better energy functions? Given that exact minimization of the energy is NP-hard, it is difficult to get a systematic answer to this question. In this paper, we present a novel search method and a novel method for learning energy functions from training data that are both based on Tree-Reweighted Belief Propagation (TRBP). We find that TRBP can find the global optimum of the ROSETTA energy function in a few minutes of computation for approximately 85% of the proteins in a standard benchmark set. TRBP can also effectively bound the partition function, which enables using the Conditional Random Fields (CRF) framework for learning. Interestingly, finding the global minimum does not significantly improve side-chain prediction for
Semi-supervised learning for structured output variables
 ICML06, 23rd International Conference on Machine Learning
, 2006
The problem of learning a mapping between input and structured, interdependent output variables covers sequential, spatial, and relational learning as well as predicting recursive structures. Joint feature representations of the input and output variables have paved the way to leveraging discriminative learners such as SVMs for this class of problems. We address the problem of semi-supervised learning in joint input-output spaces. The co-training approach is based on the principle of maximizing the consensus among multiple independent hypotheses; we develop this principle into a semi-supervised support vector learning algorithm for joint input-output spaces and arbitrary loss functions. Experiments investigate the benefit of semi-supervised structured models in terms of accuracy and F1 score.
Augmented statistical models for speech recognition
 in Proc. ICASSP
, 2006
Recently there has been significant interest in developing new acoustic models for speech recognition. One such model, which allows complex dependencies to be represented, is the augmented statistical model. This incorporates additional dependencies by constructing a local exponential expansion of a standard HMM. Unfortunately, the resulting model often has an intractable normalisation term, rendering training difficult for all but binary classification tasks. In this paper, conditional augmented (CAug) models are proposed as an attractive alternative. Instead of modelling utterance likelihoods and inferring decision boundaries, CAug models directly model the posterior probability of class labels, conditioned on the utterance. The resulting model is easy to normalise and can be trained using conditional maximum likelihood estimation. In addition, as a convex model, the optimisation converges to a global maximum.
Semi-supervised learning for natural language
 MASTER’S THESIS, MIT
, 2005
Statistical supervised learning techniques have been successful for many natural language processing tasks, but they require labeled datasets, which can be expensive to obtain. On the other hand, unlabeled data (raw text) is often available “for free” in large quantities. Unlabeled data has shown promise in improving the performance of a number of tasks, e.g. word sense disambiguation, information extraction, and natural language parsing. In this thesis, we focus on two segmentation tasks, named-entity recognition and Chinese word segmentation. The goal of named-entity recognition is to detect and classify names of people, organizations, and locations in a sentence. The goal of Chinese word segmentation is to find the word boundaries in a sentence that has been written as a string of characters without spaces. Our approach is as follows: In a preprocessing step, we use raw text to cluster words and calculate mutual information statistics. The output of this step is then used as features in a supervised model, specifically a global linear model trained using
Augmented Statistical Models for Classifying Sequence Data
, 2006
Declaration: This dissertation is the result of my own work and includes nothing that is the outcome of work done in collaboration. It has not been submitted in whole or in part for a degree at any other university. Some of the work has been published previously in conference proceedings [66,69], two journal articles [36,68], two workshop papers [35,67] and a technical report [65]. The length of this thesis including appendices, bibliography, footnotes, tables and equations is approximately 60,000 words. This thesis contains 27 figures and 20 tables.