## Self-Paced Learning for Latent Variable Models

### BibTeX

@MISC{Pawan_self-pacedlearning,

author = {M. Pawan Kumar and Benjamin Packer and Daphne Koller},

title = {Self-Paced Learning for Latent Variable Models},

year = {}

}

### Abstract

Latent variable models are a powerful tool for addressing several tasks in machine learning. However, the algorithms for learning the parameters of latent variable models are prone to getting stuck in a bad local optimum. To alleviate this problem, we build on the intuition that, rather than considering all samples simultaneously, the algorithm should be presented with the training data in a meaningful order that facilitates learning. The order of the samples is determined by how easy they are. The main challenge is that we are often not provided with a readily computable measure of the easiness of samples. We address this issue by proposing a novel, iterative self-paced learning algorithm where each iteration simultaneously selects easy samples and learns a new parameter vector. The number of samples selected is governed by a weight that is annealed until the entire training data has been considered. We empirically demonstrate that the self-paced learning algorithm outperforms the state-of-the-art method for learning a latent structural SVM on four applications: object localization, noun phrase coreference, motif finding, and handwritten digit recognition.
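
The select-then-fit loop described in the abstract can be sketched in Python as follows. This is a minimal illustration, not the authors' latent structural SVM: the model is a hypothetical one-parameter ridge regression, and the selection rule (keep samples whose squared loss is below `1 / k`, then shrink `k` by a factor `mu` so that more samples qualify) is an assumption made for illustration, since the abstract only states that a weight is annealed. The names `self_paced_fit`, `k`, `mu`, and `reg` are likewise hypothetical.

```python
def self_paced_fit(xs, ys, k=1.0, mu=1.3, reg=1e-3, max_rounds=100):
    """Fit y ~ w * x, admitting samples easiest-first (illustrative sketch)."""
    n = len(xs)
    w = 0.0          # assumed starting point; any initial parameter would do
    history = []     # (number of selected samples, w) after each round
    v = [0] * n
    for _ in range(max_rounds):
        # Selection step (w fixed): a sample counts as "easy" when its
        # current loss falls below the assumed threshold 1 / k.
        losses = [(y - w * x) ** 2 for x, y in zip(xs, ys)]
        v = [1 if loss < 1.0 / k else 0 for loss in losses]
        # Learning step (v fixed): closed-form ridge fit on selected samples.
        num = sum(vi * x * y for vi, x, y in zip(v, xs, ys))
        den = sum(vi * x * x for vi, x in zip(v, xs)) + reg
        w = num / den
        history.append((sum(v), w))
        if sum(v) == n:
            break    # the entire training set has been considered
        k /= mu      # anneal: a smaller k raises the threshold 1 / k
    return w, v, history


# Toy data: four points on y = 2x plus one outlier at x = 5.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 6.0, 8.0, 30.0]
w, v, history = self_paced_fit(xs, ys)
```

On this toy data the four clean points are admitted before the outlier, so the intermediate fits stay close to `w = 2` until the threshold grows large enough to include every sample. The sketch runs a single select-and-fit pass per annealing level; iterating the two steps to convergence at each level would be an equally reasonable design.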

### Citations

9054 | Maximum likelihood from incomplete data via the EM algorithm (with discussion)
- Dempster, Laird, et al.
- 1977
Citation Context ...s). Learning the parameters of a latent variable model often requires solving a non-convex optimization problem. Some common approaches for obtaining an approximate solution include the well-known EM [8] and CCCP algorithms [9, 23, 24]. However, these approaches are prone to getting stuck in a bad local minimum with high training and generalization error. Machine learning literature is filled with sc...

1895 | Histograms of oriented gradients for human detection
- Dalal
- 2005
Citation Context ...ance $\epsilon$. the form $f_w(x) = \operatorname{argmax}_{y \in Y,\, h \in H} w^\top \Phi(x, y, h)$. Here, $\Phi(x, y, h)$ is the joint feature vector. For instance, in our ‘car’ model learning example, the joint feature vector can be modeled as the HOG [7] descriptor extracted using pixels in the bounding box $h$. The parameters $w$ are learned by solving the following optimization problem: $\min_{w,\, \xi_i \ge 0} \frac{1}{2} \|w\|^2 + \frac{C}{n} \sum_{i=1}^{n} \xi_i$, s.t. $\max w^\top \Phi(x_i, y_i, h_i)$ −...

1508 | Bayesian data analysis
- Gelman, Carlin, et al.
- 1995
Citation Context ... maximize likelihood: $\max_w \sum_i \log \Pr(x_i, y_i; w) = \max_w \left( \sum_i \log \Pr(x_i, y_i, h_i; w) - \sum_i \log \Pr(h_i \mid x_i, y_i; w) \right)$. (1) A common approach for this task is to use the EM method [8] or one of its many variants [12]. Outlined in Algorithm 1, EM iterates between finding the expected value of the latent variables $h$ and maximizing objective (1) subject to this expectation. We refer the reader to [8] for more detail...

1319 | Combining labeled and unlabeled data with co-training
- Blum, Mitchell
- 1998
Citation Context ... chosen. Another related learning regime is co-training, which works by alternately training classifiers such that the most confidently labeled samples from one classifier are used to train the other [5, 17]. Our approach differs from co-training in that in our setting the latent variables are simply used to assist in predicting the target labels, which are always observed, whereas co-training deals with...

871 | Nonlinear Programming: Theory and Algorithms
- Bazaraa, Sherali, et al.
- 1993
Citation Context ...f the other set can be obtained by solving a convex optimization problem. In our case, the two sets of variables are $w$ and $v$. Biconvex problems have a vast literature, with both global [11] and local [2] optimization techniques. In this work, we use alternative convex search (ACS) [2], which alternately optimizes $w$ and $v$ while keeping the other set of variables fixed. We found in our experiments th...

561 | Active learning with statistical models
- Cohn, Ghahramani, et al.
- 1996
Citation Context ...r in their sample selection criteria. For example, Tong and Koller [21] suggest choosing a sample that is close to the margin (a “hard” sample), corresponding to anti-curriculum learning. Cohn et al. [6] advocate the use of the most uncertain sample with respect to the current classifier. However, unlike our setting, in active learning the labels of all the samples are not known when the samples are ...

552 | Support vector machine active learning with applications to text classification
- Tong, Koller
- 2002
Citation Context ...lso has a similar flavor to active learning, which chooses a sample to learn from at each iteration. Active learning approaches differ in their sample selection criteria. For example, Tong and Koller [21] suggest choosing a sample that is close to the margin (a “hard” sample), corresponding to anti-curriculum learning. Cohn et al. [6] advocate the use of the most uncertain sample with respect to the c...

461 | Max-margin Markov networks
- Taskar, Guestrin, et al.
- 2003
Citation Context ... parameters requires us to solve a convex SSVM learning problem (where the output $y_i$ is now concatenated with the hidden variable $h_i^*$) for which several efficient algorithms exist in the literature [14, 20, 22]. Algorithm 2: The CCCP algorithm for parameter estimation of latent SSVM. input $D = \{(x_1, y_1), \cdots, (x_n, y_n)\}$, $w_0$, $\epsilon$. 1: $t \leftarrow 0$ 2: repeat 3: Update $h_i^* = \operatorname{argmax}_{h_i \in H} w_t^\top \Phi(x_i, y_i, h_i)$. 4: Update $w_{t+1}$ b...

352 | A discriminatively trained, multiscale, deformable part model
- Felzenszwalb, McAllester, et al.
- 2008
Citation Context ...ers of a latent variable model often requires solving a non-convex optimization problem. Some common approaches for obtaining an approximate solution include the well-known EM [8] and CCCP algorithms [9, 23, 24]. However, these approaches are prone to getting stuck in a bad local minimum with high training and generalization error. Machine learning literature is filled with scenarios in which one is required...

329 | Support vector machine learning for interdependent and structured output spaces
- Tsochantaridis, Hofmann, et al.
- 2004
Citation Context ... parameters requires us to solve a convex SSVM learning problem (where the output $y_i$ is now concatenated with the hidden variable $h_i^*$) for which several efficient algorithms exist in the literature [14, 20, 22]. Algorithm 2: The CCCP algorithm for parameter estimation of latent SSVM. input $D = \{(x_1, y_1), \cdots, (x_n, y_n)\}$, $w_0$, $\epsilon$. 1: $t \leftarrow 0$ 2: repeat 3: Update $h_i^* = \operatorname{argmax}_{h_i \in H} w_t^\top \Phi(x_i, y_i, h_i)$. 4: Update $w_{t+1}$ b...

280 | Improving machine learning approaches to coreference resolution
- Ng, Cardie
- 2002
Citation Context ...le object. This task was formulated within the SSVM framework in [10] and extended to include latent variables in [23]. Formally, the input vector $x$ consists of the pairwise features $x_{ij}$ suggested in [16] between all pairs of noun phrases $i$ and $j$ in the document. The output $y$ represents a clustering of the nouns. A hidden variable $h$ specifies a forest over the nouns such that each tree in the forest c...

210 | Analyzing the Effectiveness and Applicability of Co-Training
- NIGAM, R
Citation Context ... chosen. Another related learning regime is co-training, which works by alternately training classifiers such that the most confidently labeled samples from one classifier are used to train the other [5, 17]. Our approach differs from co-training in that in our setting the latent variables are simply used to assist in predicting the target labels, which are always observed, whereas co-training deals with...

184 | Cutting-plane training of structural SVMs
- Joachims, Finley, et al.
Citation Context ... parameters requires us to solve a convex SSVM learning problem (where the output $y_i$ is now concatenated with the hidden variable $h_i^*$) for which several efficient algorithms exist in the literature [14, 20, 22]. Algorithm 2: The CCCP algorithm for parameter estimation of latent SSVM. input $D = \{(x_1, y_1), \cdots, (x_n, y_n)\}$, $w_0$, $\epsilon$. 1: $t \leftarrow 0$ 2: repeat 3: Update $h_i^* = \operatorname{argmax}_{h_i \in H} w_t^\top \Phi(x_i, y_i, h_i)$. 4: Update $w_{t+1}$ b...

160 | Numerical Continuation Methods: An Introduction
- Allgower, Georg
- 1990
Citation Context ...related to curriculum learning in that both regimes suggest processing the samples in a meaningful order. Bengio et al. [3] noted that curriculum learning can be seen as a type of continuation method [1]. However, in their work, they circumvented the challenge of obtaining such an ordering by using datasets where there is a clear distinction between easy and hard samples (for example, classifying equ...

122 | Learning structural SVMs with latent variables
- Yu, Joachims
- 2009
Citation Context ...ers of a latent variable model often requires solving a non-convex optimization problem. Some common approaches for obtaining an approximate solution include the well-known EM [8] and CCCP algorithms [9, 23, 24]. However, these approaches are prone to getting stuck in a bad local minimum with high training and generalization error. Machine learning literature is filled with scenarios in which one is required...

101 | The concave-convex procedure
- Yuille, Rangarajan
- 2003

69 | Tangent prop: a formalism for specifying selected invariances in adaptive networks
- Simard, Victorri, et al.
- 1991
Citation Context .... In other words, $Y = \{0, 1, \cdots, 9\}$. It is well-known that the accuracy of digit recognition can be greatly improved by explicitly modeling the deformations present in each image; for example, see [18]. For simplicity, we assume that the deformations are restricted to an arbitrary rotation of the image, where the angle of rotation is not known beforehand. This angle (which takes a value from a fini...

68 | Supervised clustering with support vector machines
- Finley, Joachims
- 2005
Citation Context ...ouns in a document, the goal of noun phrase coreference is to provide a clustering of the nouns such that each cluster refers to a single object. This task was formulated within the SSVM framework in [10] and extended to include latent variables in [23]. Formally, the input vector $x$ consists of the pairwise features $x_{ij}$ suggested in [16] between all pairs of noun phrases $i$ and $j$ in the document. The o...

50 | Curriculum learning
- Bengio, Louradour, et al.
Citation Context ...d, for example, by testing on a validation set). However, this approach is ad hoc and computationally expensive as one may be required to use several runs to obtain an accurate solution. Bengio et al. [3] recently proposed an alternative method for training with non-convex objectives, called curriculum learning. The idea is inspired by the way children are taught: start with easier concepts (for examp...

44 | A primal-relaxed dual global optimization approach
- Floudas, Visweswaran
- 1993
Citation Context ...optimal value of the other set can be obtained by solving a convex optimization problem. In our case, the two sets of variables are $w$ and $v$. Biconvex problems have a vast literature, with both global [11] and local [2] optimization techniques. In this work, we use alternative convex search (ACS) [2], which alternately optimizes $w$ and $v$ while keeping the other set of variables fixed. We found in our ...

43 | Maximum margin clustering made practical
- Zhang, Tsang, et al.
- 2009
Citation Context ...an image of size 28 × 28). For efficiency, we use PCA to reduce the dimensionality of each sample to 10. We perform binary classification on four difficult digit pairs (1-7, 2-7, 3-8, and 8-9), as in [25]. The training set size for each digit in the standard dataset ranges from 5,851 to 6,742, and the test sets range from 974 to 1,135 digits. The rotation modeled by the hidden variable can take one of 11 disc...

33 | Gradient-based learning applied to document recognition
- LeCun, Bottou, et al.
- 1998
Citation Context ...dden variables simply involves a search over a discrete set of angles. Similar to the motif finding experiment, we use the standard 0-1 classification loss. Dataset. We use the standard MNIST dataset [15], which represents each handwritten digit as a vector of length 784 (that is, an image of size 28 × 28). For efficiency, we use PCA to reduce the dimensionality of each sample to 10. We perform binary...

21 | On the convergence of the concave-convex procedure
- Sriperumbudur, Lanckriet
- 2009
Citation Context ...a convex and a concave function. This observation leads to a concave-convex procedure (CCCP) [24] outlined in Algorithm 2, which has been shown to converge to a local minimum or saddle point solution [19]. The algorithm has two main steps: (i) imputing the hidden variables (step 3), which corresponds to approximating the concave function by a linear upper bound; and (ii) updating the value of the para...

12 | Shape-based object localization for descriptive classification
- Heitz, Elidan, et al.
- 2009
Citation Context ...n $\Delta(y, \hat{y})$ is again the standard 0-1 classification loss. Dataset. We use images of 6 different mammals (approximately 45 images per mammal) that have been previously employed for object localization [13]. We split the images of each category into approximately 90% for training and 10% for testing. Results. We use five different folds to compare our method with the state-of-the-art CCCP algorithm. For...

6 | Variation in homeodomain DNA binding revealed by high-resolution analysis of sequence preferences
- Berger, Badis, et al.
- 2008
Citation Context ... hidden variables simply involves a search for the starting position of the motif. The loss function $\Delta$ is the standard 0-1 classification loss. Dataset. We use the publicly available UniProbe dataset [4] that provides positive and negative DNA sequences for 177 proteins. For this work, we chose five proteins at random. The total number of sequences per protein is roughly 40,000. For all the sequence...