## A Unified Approach to Minimum Risk Training and Decoding

Citations: 2 (0 self)

### BibTeX

```bibtex
@MISC{Arun_aunified,
  author = {Abhishek Arun and Barry Haddow},
  title  = {A Unified Approach to Minimum Risk Training and Decoding},
  year   = {}
}
```

### Abstract

We present a unified approach to performing minimum risk training and minimum Bayes risk (MBR) decoding with BLEU in a phrase-based model. Key to our approach is the use of a Gibbs sampler that allows us to explore the entire probability distribution and maintain a strict probabilistic formulation across the pipeline. We also describe a new sampling algorithm called corpus sampling which allows us at training time to use BLEU instead of an approximation thereof. Our approach is theoretically sound and gives better (up to +0.6% BLEU) and more stable results than the standard MERT optimization algorithm. By comparing our approach to lattice MBR, we are also able to gain crucial insights about both methods.

### Citations

3781 |
Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images
- Geman, Geman
- 1984
Citation Context: ...tempt to solve the former. 2.1 Gibbs sampling for phrase-based MT: An alternate approximate inference method for phrase-based MT without any of the previously mentioned drawbacks is the Gibbs sampler (Geman and Geman, 1984) of Arun et al. (2009), which draws samples from the posterior distribution of the translation model. For the work presented in this paper, we use this sampler. The sampler produces a sequence of samp...
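The sampling scheme described in this context can be sketched on a toy model (a chain of binary variables with a pairwise agreement score, not the phrase-based sampler of Arun et al. (2009); all names and the scoring function are illustrative):

```python
import math
import random

def gibbs_sample(n_vars, n_samples, pair_weight=1.0, seed=0):
    """Toy Gibbs sampler over binary variables in a chain-shaped
    log-linear model: score(x) = pair_weight * #{i : x_i == x_{i+1}}.
    Each sweep resamples every variable from its exact conditional,
    holding the rest fixed -- the same scheme a phrase-based sampler
    applies to translation/alignment options."""
    rng = random.Random(seed)
    x = [rng.randint(0, 1) for _ in range(n_vars)]
    samples = []
    for _ in range(n_samples):
        for i in range(n_vars):
            def local_score(v):
                # unnormalized score of setting x[i] = v, given neighbours
                s = 0.0
                if i > 0 and x[i - 1] == v:
                    s += pair_weight
                if i < n_vars - 1 and x[i + 1] == v:
                    s += pair_weight
                return math.exp(s)
            p1 = local_score(1) / (local_score(0) + local_score(1))
            x[i] = 1 if rng.random() < p1 else 0
        samples.append(tuple(x))
    return samples
```

Expectations under the model are then estimated as averages over the returned samples, which is the role the sampler plays in both training and MBR decoding here.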

1488 | BLEU: A method for automatic evaluation of machine translation
- Papineni, Roukos, et al.
- 2002
Citation Context: ...s its risk (expected loss). This solution is often referred to as the Minimum Bayes Risk (MBR) solution (Kumar and Byrne, 2004). Since machine translation (MT) models are typically evaluated by BLEU (Papineni et al., 2002), a loss function which rewards partial matches, the MBR solution is to be preferred to the Maximum A Posteriori (MAP) solution. In most statistical MT (SMT) systems, MBR is implemented as a reranker...
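A minimal smoothed sentence-level BLEU, illustrating how the metric rewards partial n-gram matches (the add-one smoothing and this exact brevity-penalty form are assumptions for the sketch, not the variant used in the paper):

```python
import math
from collections import Counter

def sentence_bleu(hyp, ref, max_n=4):
    """Smoothed sentence-level BLEU over token lists: geometric mean
    of clipped n-gram precisions (add-one smoothed) times a brevity
    penalty. Returns a value in (0, 1]."""
    if not hyp:
        return 0.0
    log_prec = 0.0
    for n in range(1, max_n + 1):
        h = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
        r = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        # clip hypothesis n-gram counts by their counts in the reference
        clipped = sum(min(c, r[g]) for g, c in h.items())
        total = sum(h.values())
        log_prec += math.log((clipped + 1.0) / (total + 1.0)) / max_n
    bp = min(0.0, 1.0 - len(ref) / len(hyp))  # log brevity penalty
    return math.exp(bp + log_prec)
```

An exact match scores 1.0, and partial n-gram overlap still earns partial credit, which is why the MBR solution is preferable to MAP under this loss.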

893 | Moses: Open source toolkit for statistical machine translation
- Koehn, Hoang, et al.
- 2007
Citation Context: ...table, we used the association-score technique suggested by Johnson et al. (2007). Translation quality is reported using case-insensitive BLEU. 5.2 Baseline: Our baseline system is phrase-based Moses (Koehn et al., 2007) with feature weights trained using MERT. Moses and the Gibbs sampler use identical feature sets. The MERT optimization algorithm uses multiple random restarts to avoid getting stuck in a poor loca...

642 | Statistical phrase-based translation
- Koehn, Och, et al.
- 2003
Citation Context: ...der any probability estimates heavily biased (Blunsom and Osborne, 2008; Bouchard-Côté et al., 2009). Here, we present a unified approach to training and decoding in a phrase-based translation model (Koehn et al., 2003) which keeps the objective constant across the translation pipeline and so obviates the need for any extra hyper-parameter fitting. We use the phrase-based Gibbs sampler of Arun et al. (2009) at trai...

380 | Hierarchical phrase-based translation - Chiang - 2007

250 | Deterministic annealing for clustering, compression, classification, regression, and related optimization problems
- Rose
- 1998
Citation Context: ...formance on held-out data to quickly increase to a maximum and then plateau. Hypothesizing that we were being trapped in local maxima as G is non-convex, we decided to employ deterministic annealing (Rose, 1998) to smooth the objective function to ensure that the optimizer explored as large a region as possible of the space before it settled on an optimal weight set. Our instantiation of deterministic annea...
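The temperature mechanism behind deterministic annealing can be sketched as a scaled distribution, p_T(e) ∝ exp(log p(e)/T): a high T flattens the distribution so the optimizer sees a smoother surface, and lowering T gradually sharpens it toward the mode (a generic sketch, not the paper's exact annealing schedule):

```python
import math

def anneal(logprobs, T):
    """Temperature-scaled distribution p_T(e) ∝ exp(log p(e) / T).
    T > 1 flattens the distribution; T → 0 concentrates all mass
    on the mode. Uses the max-shift trick for numerical stability."""
    scaled = [lp / T for lp in logprobs]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]
```

With T = 100 the result is close to uniform; with T = 0.01 nearly all probability mass sits on the highest-scoring item.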

118 | Minimum Bayes-Risk Decoding for Statistical Machine Translation
- Kumar, Byrne
- 2004
Citation Context: ...cision theory, the optimal decision rule for any statistical model is the solution that minimizes its risk (expected loss). This solution is often referred to as the Minimum Bayes Risk (MBR) solution (Kumar and Byrne, 2004). Since machine translation (MT) models are typically evaluated by BLEU (Papineni et al., 2002), a loss function which rewards partial matches, the MBR solution is to be preferred to the Maximum A Po...

59 | Local Gain Adaptation in Stochastic Gradient Descent
- Schraudolph
- 1999
Citation Context: ...Because of the noise introduced by the sampler, we used stochastic gradient descent (SGD), with a learning rate that gets updated after each step proportionally to the difference in successive gradients (Schraudolph, 1999). While our initial formulation of minimum risk training is similar to that of Arun et al. (2009), in preliminary experiments we observed a tendency for translation performance on held-out data to qu...
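A sign-based sketch of per-parameter gain adaptation in SGD: each parameter's step size grows while successive gradients agree in sign and shrinks when they flip (a simplified variant in the spirit of Schraudolph-style local gain adaptation, not his exact update; all names and constants are illustrative):

```python
def sgd_gain_adapt(grad_fn, w0, steps=100, eta0=0.1, up=1.1, down=0.5):
    """SGD with per-parameter gains: multiply a gain by `up` when the
    current and previous gradient components have the same sign, and
    by `down` when they disagree (a sign flip suggests overshooting).
    grad_fn maps a weight list to a gradient list of the same length."""
    w = list(w0)
    gains = [eta0] * len(w)
    prev_g = [0.0] * len(w)
    for _ in range(steps):
        g = grad_fn(w)
        for i in range(len(w)):
            if g[i] * prev_g[i] > 0:      # same sign: speed up
                gains[i] *= up
            elif g[i] * prev_g[i] < 0:    # sign flip: slow down
                gains[i] *= down
            w[i] -= gains[i] * g[i]
        prev_g = g
    return w
```

On a noisy objective, the gains damp oscillating coordinates while letting consistently-signed ones take larger steps.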

57 | Improving Translation Quality by Discarding Most of the Phrasetable - Johnson, Martin, et al. - 2007

55 | Minimum Risk Annealing for Training Log-Linear Models
- Smith, Eisner
- 2006
Citation Context: ...ribution, whereas a lower value of T pushes the optimizer towards a more peaked distribution. We perform 10 to 20 iterations of SGD at each temperature. In their deterministic annealing formulation, Smith and Eisner (2006) and Li and Eisner (2009) express the parameterization of the distribution θ as γθ̂ (where γ is the scaling factor) and perform optimization in two steps, the first optimizing θ̂ and the second optimiz...

47 | A Discriminative Latent Variable Model for Statistical Machine Translation
- Blunsom, Cohn, et al.
- 2008
Citation Context: ...t high temperature settings. In the presence of diversity, the benefits of marginalization over derivations are clear: MaxTrans does better than MaxDeriv and MBR does best, confirming recent findings of Blunsom et al. (2008) and Arun et al. (2009) that MaxTrans improves over MaxDeriv decoding for models trained to account for multiple derivations. As the temperature decreases to zero, the model sharpens, effectively intent ...

32 | First- and second-order expectation semirings with applications to minimum-risk training on translation forests - Li, Eisner - 2009

21 | Efficient minimum error rate training and minimum Bayes-risk decoding for translation hypergraphs and lattices
- Kumar, Macherey, et al.
- 2009
Citation Context: ...sterior probability distribution since the approximation to Z is an unbounded term and the scaling factor is an artificial way of inducing a probability distribution. Recently, Tromble et al. (2008) and Kumar et al. (2009) have shown that using a search lattice to improve the estimation of the true probability distribution can lead to improved MBR performance. However, these approaches still rely on MERT for training ...

21 | Variational Decoding for Statistical Machine Translation - Li, Eisner, et al. - 2009

19 |
Probabilistic inference for machine translation
- Blunsom, Osborne
- 2008
Citation Context: ...hese techniques tractable is quite drastic, and is in addition to the pruning already performed during the search. Such extensive pruning is liable to render any probability estimates heavily biased (Blunsom and Osborne, 2008; Bouchard-Côté et al., 2009). Here, we present a unified approach to training and decoding in a phrase-based translation model (Koehn et al., 2003) which keeps the objective constant across the trans...

18 | Fast consensus decoding over translation forests
- DeNero, Chiang, et al.
- 2009
Citation Context: ...in spirit to expected BLEU training, but aimed to maximize the expected counts of n-grams appearing in reference translations. This training criterion is used in conjunction with consensus decoding (DeNero et al., 2009), a linear-time approximation of MBR. In contrast to the approaches above, the algorithms presented in this paper are able to explore an unpruned search space. By using corpus sampling, we can perfor...
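The expected-count idea behind consensus decoding can be sketched over a plain sample list: score each candidate against the expected n-gram counts of the evidence space, giving a linear-time alternative to summing pairwise similarities (a simplification for illustration: DeNero et al. operate on translation forests, not uniform sample lists):

```python
from collections import Counter

def consensus_decode(hypotheses, max_n=2):
    """Consensus decoding sketch: pick the hypothesis whose clipped
    n-gram counts best match the EXPECTED n-gram counts under the
    empirical distribution of `hypotheses` (repeats carry mass)."""
    n_hyps = len(hypotheses)
    expected = Counter()
    for e in hypotheses:
        toks = e.split()
        for n in range(1, max_n + 1):
            for i in range(len(toks) - n + 1):
                expected[tuple(toks[i:i + n])] += 1.0 / n_hyps

    def score(e):
        toks = e.split()
        c = Counter(tuple(toks[i:i + n])
                    for n in range(1, max_n + 1)
                    for i in range(len(toks) - n + 1))
        # credit each n-gram up to its expected count in the evidence
        return sum(min(cnt, expected[g]) for g, cnt in c.items())

    return max(set(hypotheses), key=score)
```

Each hypothesis is scored in a single pass over its own n-grams, rather than against every other hypothesis, which is what makes the approximation linear-time.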

9 | Decomposability of translation metrics for improved evaluation and efficient algorithms - Chiang, DeNeefe, et al. - 2008

8 | Consensus training for consensus decoding in machine translation - Pauls, DeNero, et al. - 2009

7 | Monte Carlo inference and maximization for phrase-based translation
- Arun, Dyer, et al.
- 2009
Citation Context: ...stics, and at test time we use it to estimate the posterior distribution required by MBR (Section 3). We experimented with two different objective functions for training (Section 4). First, following Arun et al. (2009), we define our objective at the sentence level using a sentence-level variant of BLEU. Then, in order to reduce the mismatch between training and test loss functions, we also tried directly optimisi...

3 |
Online large-margin training of syntactic and structural translation features
- 2008b
Citation Context: ...s follows: $E_{p(a,e|f)}[h] = \lim_{N \to \infty} \frac{1}{N} \sum_{i=1}^{N} h(a_i, e_i, f)$ (3). In this work, we are interested in performing MBR decoding with BLEU. We define the MBR decision rule following Tromble et al. (2008): $e^* = \arg\max_{e \in \mathcal{E}_H} \sum_{e' \in \mathcal{E}_E} \mathrm{BLEU}_e(e')\, p(e'|f)$ (4), where $\mathcal{E}_H$ refers to the hypothesis space from which translations are chosen, $\mathcal{E}_E$ refers to the evidence space used for calculating risk, and $\mathrm{BLEU}_e($...
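The MBR decision rule quoted in this context can be sketched over a sampled evidence space, with the empirical distribution of the samples standing in for p(e'|f) (the gain function below is a toy unigram overlap standing in for BLEU; all names are illustrative):

```python
def mbr_decode(hypotheses, gain):
    """Sampled MBR: pick e* = argmax_e sum_{e'} gain(e, e') p(e'|f),
    where p(e'|f) is approximated by the empirical distribution of
    the sample list `hypotheses` (repeated samples carry more mass).
    Here the hypothesis space equals the evidence space."""
    n = len(hypotheses)
    best, best_score = None, float("-inf")
    for e in set(hypotheses):
        # expected gain of choosing e under the empirical posterior
        score = sum(gain(e, e2) for e2 in hypotheses) / n
        if score > best_score:
            best, best_score = e, score
    return best, best_score

def unigram_overlap(e1, e2):
    """Toy gain standing in for sentence BLEU: Jaccard token overlap."""
    s1, s2 = set(e1.split()), set(e2.split())
    return len(s1 & s2) / max(len(s1 | s2), 1)
```

A translation that is similar to many other probable translations wins even if it is not the single most probable sample, which is exactly how MBR differs from MAP decoding.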

1 | Randomized pruning: Efficiently calculating expectations in large dynamic programs