## Direct Loss Minimization for Structured Prediction

Citations: 18 (2 self)

### BibTeX

```bibtex
@MISC{Mcallester_directloss,
  author = {David McAllester and Tamir Hazan and Joseph Keshet},
  title = {Direct Loss Minimization for Structured Prediction},
  year = {}
}
```

### Abstract

In discriminative machine learning one is interested in training a system to optimize a desired measure of performance, such as the BLEU score in machine translation or the intersection-over-union score in the PASCAL segmentation evaluation. We propose here a perceptron-like learning method based on computing a difference of feature vectors between two inferred output values, where at least one of the outputs is inferred by loss-adjusted inference. The main contribution of this paper is a theorem directly relating updates of this form to the gradient of the given loss function with respect to the system parameters. This provides a theoretical foundation for certain training methods which have already gained widespread use in machine translation. Empirical results on phonetic alignment are also given, surpassing all previously reported results on this problem.
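The update described in the abstract can be sketched for a finite candidate set, where an exhaustive argmax stands in for structured inference. This is a minimal illustration, not the paper's code: the function names, the toy feature map, and the exact step-size scheme are assumptions.

```python
import numpy as np

def direct_loss_update(w, x, y_true, outputs, phi, loss, eta=0.1, eps=1.0):
    """One perceptron-like direct-loss step over a finite output set (a sketch)."""
    # Standard inference: highest-scoring output under the current weights.
    y_star = max(outputs, key=lambda y: float(w @ phi(x, y)))
    # Loss-adjusted inference: the score is perturbed toward higher-loss outputs.
    y_adj = max(outputs, key=lambda y: float(w @ phi(x, y)) + eps * loss(y_true, y))
    # Step along the feature difference; as eps -> 0 this direction approximates
    # the negative gradient of the expected loss (the paper's main theorem).
    return w + (eta / eps) * (phi(x, y_star) - phi(x, y_adj))
```

With a one-hot toy feature map and 0-1 loss, a single update moves weight mass from the loss-adjusted (worse) output toward the currently predicted one.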

### Citations

518 | Discriminative training methods for hidden markov models: Theory and experiments with perceptron algorithms - Collins |

471 | Graphical models, exponential families, and variational inference
- Wainwright, Jordan
- 2003
Citation Context ...ach to an intractable optimization problem is to define a convex relaxation of the objective function. In the case of graphical models this can be done by defining a relaxation of a marginal polytope [11]. The details of the relaxation are not important here. At a very abstract level the resulting approximate inference problem can be defined as follows where the set R is a relaxation of the set Y, and... |

462 | Max-margin Markov networks
- Taskar, Guestrin, et al.
- 2003
Citation Context ...ace the objective in (2) with some form of convex relaxation. For example, several forms of convex relaxations are used in corresponding forms of structured support vector machines (structured SVMs) (Taskar et al., 2003; Tsochantaridis et al., 2005) as described in section 2. But it should be noted that replacing the objective in (2) with a convex relaxation leads to inconsistency — the optimum of the relaxation is ... |

398 | Large margin methods for structured and interdependent output variables
- Tsochantaridis, Joachims, et al.
- 2006
Citation Context ...s, say 0-1 loss, and an arbitrary distribution ρ. In spite of the lack of approximation guarantees, it is common to replace the objective in (2) with a convex relaxation such as structural hinge loss [8, 10]. It should be noted that replacing the objective in (2) with structural hinge loss leads to inconsistency — the optimum of the relaxation is different from the optimum of (2). An alternative to a con... |
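The margin-rescaled structural hinge loss referenced in these contexts can be written as a short sketch. As above, an exhaustive max over a finite candidate set stands in for loss-augmented inference; the names are illustrative.

```python
import numpy as np

def structured_hinge(w, x, y_true, outputs, phi, loss):
    """Margin-rescaled structural hinge: a convex upper bound on the task loss,
    max_y [score(y) + loss(y_true, y)] - score(y_true). Its minimizer can differ
    from that of the true loss, which is the inconsistency noted in the snippets."""
    score = lambda y: float(w @ phi(x, y))
    return max(score(y) + loss(y_true, y) for y in outputs) - score(y_true)
```

Note the hinge is zero only when the true output beats every competitor by a margin of at least its loss.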

272 | Speaker-independent Phone Recognition Using Hidden Markov Models
- Lee, Hon
- 1989
Citation Context ...t et al., 2007). The classifier used a Gaussian kernel with σ² = 19 and a trade-off parameter C = 5.0. The complete set of 61 TIMIT phoneme symbols were mapped into 39 phoneme symbols as proposed by (Lee & Hon, 1989), and was used throughout the training process. The seven dimensional weight vector w was trained on the second set of 1796 aligned utterances. We trained twice, once for τ-sensitive loss and once fo... |

126 | An end-to-end discriminative approach to machine translation
- Liang, Bouchard-Côté, et al.
- 2006
Citation Context ...put space is discrete but the input space is continuous. After formulating our method we discovered that closely related methods have recently become popular for training machine translation systems (Liang et al., 2006; Chiang et al., 2009). Although machine translation has discrete inputs as well as discrete outputs, the training method we propose can still be used, although without theoretical guarantees. We also... |

85 | 11,001 new features for statistical machine translation
- Chiang, Knight, et al.
- 2009
Citation Context ...e but the input space is continuous. After formulating our method we discovered that closely related methods have recently become popular for training machine translation systems (Liang et al., 2006; Chiang et al., 2009). Although machine translation has discrete inputs as well as discrete outputs, the training method we propose can still be used, although without theoretical guarantees. We also present empirical re... |

41 | Automatic Segmentation and Labeling of Speech Based on Hidden Markov Models
- Brugnara, Falavigna, et al.
- 1993
Citation Context ...en the predicted alignment sequence and the manual alignment sequence is greater than τ. This loss with different values of τ was used to measure the performance of the learned alignment function in (Brugnara et al., 1993; Toledano et al., 2003; Hosom, 1998). The second loss, called τ-insensitive loss, was proposed in (Keshet et al., 2007) and is defined as follows: \(L_{\tau\text{-insensitive}}(\bar{y}, \bar{y}') = \frac{1}{|\bar{y}|} \sum_k \max\{|y_k - y'_k| - \tau, 0\}\) (17). This loss measures the... |

6 | Speaker-independent phoneme alignment using transition-dependent states
- Hosom
- 2009
Citation Context ...nual alignment sequence is greater than τ. This loss with different values of τ was used to measure the performance of the learned alignment function in (Brugnara et al., 1993; Toledano et al., 2003; Hosom, 1998). The second loss, called τ-insensitive loss, was proposed in (Keshet et al., 2007) and is defined as follows: \(L_{\tau\text{-insensitive}}(\bar{y}, \bar{y}') = \frac{1}{|\bar{y}|} \sum_k \max\{|y_k - y'_k| - \tau, 0\}\) (17). This loss measures the... |

6 | Automatic phoneme segmentation
- Toledano, Gomez, et al.
- 2003
Citation Context ...een the predicted alignment sequence and the manual alignment sequence is greater than τ. This loss with different values of τ was used to measure the performance of the learned alignment function in [1, 9, 4]. The second loss, called τ-insensitive loss, was proposed in [5] and is defined as follows: \(L_{\tau\text{-insensitive}}(\bar{y}, \bar{y}') = \frac{1}{|\bar{y}|} \sum_k \max\{|y_k - y'_k| - \tau, 0\}\) (17). This loss measures the average disagreeme... |

3 | A large margin algorithm for speech and audio segmentation
- Keshet, Shalev-Shwartz, et al.
- 2007
Citation Context ... of τ was used to measure the performance of the learned alignment function in (Brugnara et al., 1993; Toledano et al., 2003; Hosom, 1998). The second loss, called τ-insensitive loss, was proposed in (Keshet et al., 2007) and is defined as follows: \(L_{\tau\text{-insensitive}}(\bar{y}, \bar{y}') = \frac{1}{|\bar{y}|} \sum_k \max\{|y_k - y'_k| - \tau, 0\}\) (17). This loss measures the average disagreement between all the boundaries of the desired alignment sequence... |
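The two alignment losses discussed in these snippets can be sketched directly from equation (17) and the surrounding descriptions, treating an alignment as a sequence of boundary positions. The function names are illustrative, and the τ-sensitive form is reconstructed from the truncated prose (loss 1 when the maximal boundary deviation exceeds τ), so treat it as an assumption.

```python
def tau_sensitive(y, y_prime, tau):
    """1 if any predicted boundary deviates from the manual alignment by
    more than tau, else 0 (as described in the contexts above)."""
    return float(max(abs(a - b) for a, b in zip(y, y_prime)) > tau)

def tau_insensitive(y, y_prime, tau):
    """Average boundary disagreement beyond tolerance tau, i.e.
    (1/|y|) * sum_k max(|y_k - y'_k| - tau, 0), matching equation (17)."""
    return sum(max(abs(a - b) - tau, 0) for a, b in zip(y, y_prime)) / len(y)
```

For boundaries [0, 10, 20] against [1, 13, 20] with τ = 1, only the 3-frame deviation contributes, giving 2/3; the τ-sensitive loss with τ = 2 fires because the maximal deviation is 3.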

1 | Revised version of the paper that appeared at IWPT - Kluwer - 2001 |

1 | 2003. Loss Minimization for Structured Prediction Tsochantaridis - Speech, Proc - 2005 |