## Learning Structural SVMs with Latent Variables

Citations: 134 (2 self)

### BibTeX

@MISC{Yu_learningstructural,
    author = {Chun-nam John Yu and Thorsten Joachims},
    title = {Learning Structural SVMs with Latent Variables},
    year = {}
}

### Abstract

It is well known in statistics and machine learning that the combination of latent (or hidden) variables and observed variables offers more expressive power than models with observed variables alone. Latent variables

### Citations

10429 | Statistical Learning Theory
- Vapnik
- 1998
Citation Context ...ctured output $f_{\vec{w}}(x)$ differs from the correct output. We want to learn a prediction rule that incurs small average loss on future inputs. According to the Empirical Risk Minimization (ERM) principle [9], we should search for a parameter vector $\vec{w}$ with low empirical risk $\sum_{i=1}^{n} \Delta(y_i, f_{\vec{w}}(x_i))$. But in general this is very difficult due to non-convexity and discontinuity of the loss function $\Delta$ and the...

9486 | Maximum likelihood from incomplete data via the EM algorithm
- Dempster, Laird, et al.
- 1977
Citation Context ...lem while treating the latent variables as completely observed. This is similar to the iterative process of Expectation Maximization (EM) [2]. But unlike EM which maximizes the expected log likelihood under the marginal distribution of the latent variables, we are minimizing the loss against a single latent variable $h_i^*$ that best explain...

978 | Optimizing search engines using clickthrough data
- Joachims
- 2002
Citation Context ...ed model is used as the initial weight vector to optimize for precision@k. We can see from Table 3 that our Latent Structural SVM approach performs better than the Ranking SVM (Herbrich et al., 2000; Joachims, 2002) on precision@1,3,5,10, one of the stronger baselines in the LETOR 3.0 benchmark. We also essentially tie with ListNet (Cao et al., 2007), one of the best overall ranking methods in the LETOR 3.0 benc...

609 | Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment
- Lawrence, Altschul, et al.
- 1993
Citation Context ...itive, otherwise they are labeled as negative. Altogether we have 124 positive examples and 75 negative examples. Popular methods for motif finding include methods based on EM [1] and Gibbs-sampling [3]. For this particular yeast dataset we believe a discriminative approach, especially one incorporating large-margin separation, is beneficial because of the close relationship and DNA sequence similar...

376 | A discriminatively trained, multiscale, deformable part model
- Felzenszwalb, McAllester, et al.
- 2008
Citation Context ...to obtain tighter non-convex loss bounds on structured learning (Chapelle et al., 2008). In the computer vision community there are recent works on training Hidden CRF using the max-margin criterion (Felzenszwalb et al., 2008; Wang & Mori, 2008). In these works they focus on classification problems only and their training problem formulations are a special case of our proposal below. Interestingly, the algorithm in (Felze...

347 | Support vector machine learning for interdependent and structured output spaces
- Tsochantaridis, Hofmann, et al.
- 2004
Citation Context ...ze the non-convex objective in training. The use of latent variables is less well-explored in the case of large-margin structured output learning such as Max-Margin Markov Networks or Structural SVMs [7, 8]. These models are non-probabilistic and offer excellent performance in many structured prediction tasks in the fully observed case. Currently, they do not support the use of latent variables, which e...

338 | Large margin rank boundaries for ordinal regression
- Herbrich, Graepel, et al.
- 2000
Citation Context ...for each fold the trained model is used as the initial weight vector to optimize for precision@k. We can see from Table 3 that our Latent Structural SVM approach performs better than the Ranking SVM (Herbrich et al., 2000; Joachims, 2002) on precision@1,3,5,10, one of the stronger baselines in the LETOR 3.0 benchmark. We also essentially tie with ListNet (Cao et al., 2007), one of the best overall ranking methods in th...

291 | Improving machine learning approaches to co-reference resolution
- Ng, Cardie
- 2002
Citation Context ...erent directly without scanning through the whole text. Following the intuition that humans might determine if two noun phrases are coreferent by reasoning transitively over strong coreference links (Ng & Cardie, 2002), we model the problem of noun phrase coreference as a single-link agglomerative clustering problem. Each input x contains all n noun phrases in a document, and all the pairwise features $x_{ij}$ between ...

242 | Unsupervised learning of multiple motifs in biopolymers using expectation maximization
- Bailey, Elkan
- 1995
Citation Context ...xamples and 75 negative examples. In addition we have 6460 sequences from the yeast intergenic regions for background model estimation. Popular methods for motif finding include methods based on EM (Bailey & Elkan, 1995) and Gibbs sampling. For this particular yeast dataset we believe a discriminative approach, especially one incorporating large-margin separation, is beneficial because of the close relationship and D...

169 | A model-theoretic coreference scoring scheme
- Vilain, Burger, et al.
- 1995
Citation Context ...inley & Joachims, 2005). Pair loss is the proportion of all $O(n^2)$ edges incorrectly classified. MITRE loss is a loss proposed for evaluating noun phrase coreference that is related to the F1-score (Vilain et al., 1995). We can see from the first two lines in the table that our method performs well on the Pair loss but worse on the MITRE loss when compared with the SVM correlation clustering approach. Error analysi...

159 | Learning to rank: from pairwise approach to listwise approach
- Cao, Qin, et al.
- 2007
Citation Context ...ach performs better than the Ranking SVM (Herbrich et al., 2000; Joachims, 2002) on precision@1,3,5,10, one of the stronger baselines in the LETOR 3.0 benchmark. We also essentially tie with ListNet (Cao et al., 2007), one of the best overall ranking methods in the LETOR 3.0 benchmark. As a sanity check, we also report the performance of the initial weight vectors used for initializing the CCCP. The Latent Structu...

113 | LETOR: Benchmark dataset for research on learning to rank for information retrieval. In: SIGIR '07 Workshop on Learning to Rank for Information Retrieval
- Liu, Xu, et al.
- 2007
Citation Context ...label y has to be respected (so that $x_i$ comes before $x_j$ when either $y_i > y_j$, or $y_i = y_j$ and $w \cdot x_i > w \cdot x_j$). To evaluate our algorithm, we ran experiments on the OHSUMED tasks of the LETOR 3.0 dataset (Liu et al., 2007). We use the per-query-normalized version of the features in all our training and testing below, and employ exactly the same training, test, and validation sets split as given. For this application i...

106 | Proximity control in bundle methods for convex nondifferentiable minimization
- Kiwiel
- 1990
Citation Context ...regularized loss against a single latent variable $h_i^*$ that best explains $(x_i, y_i)$. In our implementation, we used an improved version of the cutting plane algorithm called the proximal bundle method (Kiwiel, 1990) to solve the standard Structural SVM problem in Equation (7). In our experience the proximal bundle method usually converges using fewer iterations than the cutting plane algorithm (Joachims et al.,...

106 | The concave-convex procedure
- Yuille, Rangarajan
- 2003
Citation Context ... annotations in a discriminative manner (Petrov & Klein, 2007). The non-convex likelihood functions of these problems are usually optimized using gradient-based methods. The Concave-Convex Procedure (Yuille & Rangarajan, 2003) employed in our work is a general framework for minimizing non-convex functions which falls into the class of DC (Difference of Convex) programming. In recent years there have been numerous appli...

75 | Supervised clustering with support vector machines
- Finley, Joachims
- 2005
Citation Context ...tent Structural SVM. 5.2. Noun Phrase Coreference via Clustering In noun phrase coreference resolution we would like to determine which noun phrases in a text refer to the same real-world entity. In (Finley & Joachims, 2005) the task is formulated as a correlation clustering problem trained with Structural SVMs. In correlation clustering the objective function maximizes the sum of pairwise similarities. However this mig...

66 | Hidden conditional random fields for gesture recognition
- Wang, Quattoni, et al.
Citation Context ... In structured output prediction, there have been various applications of latent variable models. For example, it has been used for capturing interesting substructures or parts in object recognition [10], for automatic refinement of grammars in parsing [6], and for dimensionality reduction in people tracking [4]. Almost all of these latent variable models are probabilistic in nature and use EM or gra...

61 | Trading convexity for scalability
- Collobert, Sinz, et al.
Citation Context ... In recent years there have been numerous applications of the algorithm in machine learning, including training non-convex SVMs and transductive SVMs (Collobert et al., 2006). The approach in (Smola et al., 2005) employs CCCP to handle missing data in SVMs and Gaussian Processes and is closely related to our work. However our approach is non-probabilistic and avoids the ...

57 | Kernel methods for missing variables
- Smola, Vishwanathan, et al.
- 2005
Citation Context ...applications of the algorithm in machine learning, including training non-convex SVMs and transductive SVMs (Collobert et al., 2006). The approach in (Smola et al., 2005) employs CCCP to handle missing data in SVMs and Gaussian Processes and is closely related to our work. However our approach is non-probabilistic and avoids the computation of partition functions, wh...

40 | Discriminative log-linear grammars with latent variables
- Petrov, Klein
- 2008
Citation Context ...ious applications of latent variable models. For example, it has been used for capturing interesting substructures or parts in object recognition [10], for automatic refinement of grammars in parsing [6], and for dimensionality reduction in people tracking [4]. Almost all of these latent variable models are probabilistic in nature and use EM or gradient-based methods to optimize the non-convex object...

40 | Max-margin hidden conditional random fields for human action recognition
- Wang, Mori
- 2009
Citation Context ...x loss bounds on structured learning (Chapelle et al., 2008). In the computer vision community there are recent works on training Hidden CRF using the max-margin criterion (Felzenszwalb et al., 2008; Wang & Mori, 2008). In these works they focus on classification problems only and their training problem formulations are a special case of our proposal below. Interestingly, the algorithm in (Felzenszwalb et al., 200...

28 | People Tracking with the Laplacian Eigenmaps Latent Variable Model
- Lu, Carreira-Perpinan, et al.
- 2008
Citation Context ... it has been used for capturing interesting substructures or parts in object recognition [10], for automatic refinement of grammars in parsing [6], and for dimensionality reduction in people tracking [4]. Almost all of these latent variable models are probabilistic in nature and use EM or gradient-based methods to optimize the non-convex objective in training. The use of latent variables is less well...

14 | Transductive support vector machines for structured variables
- Zien, Brefeld, et al.
Citation Context ... are not part of the output. The loss that we are interested in for these tasks does not depend on the latent variables. (iii) it distinguishes our approach from transductive structured output learning [12]. When the loss function $\Delta$ depends only on the fully observed label $y_i$, it rules out the possibility of transductive learning, but the restriction also results in simpler optimization problems compare...

14 | Tighter bounds for structured estimation
- Chapelle, Do, et al.
- 2008
Citation Context ...artition functions, which is particularly attractive for structured prediction. Very recently the CCCP algorithm has also been applied to obtain tighter non-convex loss bounds on structured learning (Chapelle et al., 2008). In the computer vision community there are recent works on training Hidden CRF using the max-margin criterion (Felzenszwalb et al., 2008; Wang & Mori, 2008). In these works they focus on classifica...

9 | Unsupervised Learning of Multiple Motifs
- Bailey, Elkan
- 1995
Citation Context ...they are labeled as positive, otherwise they are labeled as negative. Altogether we have 124 positive examples and 75 negative examples. Popular methods for motif finding include methods based on EM [1] and Gibbs-sampling [3]. For this particular yeast dataset we believe a discriminative approach, especially one incorporating large-margin separation, is beneficial because of the close relationship a...

5 | GIMSAN: a Gibbs motif finder with significance analysis
- Ng, Keich
Citation Context ...odel is re-trained 5 times using 5 different random seeds. We picked models having the best accuracy on the validation fold and report its accuracy on the test fold. As control we ran a Gibbs sampler [5] on the same dataset. It reports good results on motif lengths l = 11 and l = 17, which we compare our algorithm against. The Gibbs sampler is given the unfair advantage that it has access to a separa...
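
The Pair loss mentioned in the Vilain et al. citation context above ("the proportion of all O(n^2) edges incorrectly classified") can be illustrated with a short Python sketch. This is not the authors' code. The function name `pair_loss` and the encoding of a clustering as a list of cluster ids (one per noun phrase) are assumptions made for illustration, and an "edge" is read here as an unordered pair of noun phrases whose same-cluster or different-cluster status is compared between the true and predicted clusterings.

```python
from itertools import combinations


def pair_loss(true_clusters, predicted_clusters):
    """Fraction of unordered item pairs whose coreference status is wrong.

    Both arguments are sequences of cluster ids, one per item (hypothetical
    encoding). A pair (i, j) counts as an error when the two clusterings
    disagree on whether i and j belong to the same cluster.
    """
    if len(true_clusters) != len(predicted_clusters):
        raise ValueError("both clusterings must label the same items")

    pairs = list(combinations(range(len(true_clusters)), 2))
    if not pairs:
        # Fewer than two items: there are no edges to misclassify.
        return 0.0

    wrong = sum(
        (true_clusters[i] == true_clusters[j])
        != (predicted_clusters[i] == predicted_clusters[j])
        for i, j in pairs
    )
    return wrong / len(pairs)


if __name__ == "__main__":
    # Four noun phrases; the prediction wrongly merges item 2 into cluster 0.
    print(pair_loss([0, 0, 1, 1], [0, 0, 0, 1]))
```

Because only pairwise agreement is compared, the cluster ids themselves are arbitrary: relabeling the clusters in either argument leaves the loss unchanged.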