## Variational Decoding for Statistical Machine Translation

Citations: 23 (1 self)

### BibTeX

@MISC{Li_variationaldecoding,
    author = {Zhifei Li and Jason Eisner and Sanjeev Khudanpur},
    title = {Variational Decoding for Statistical Machine Translation},
    year = {}
}

### Abstract

Statistical models in machine translation exhibit spurious ambiguity. That is, the probability of an output string is split among many distinct derivations (e.g., trees or segmentations). In principle, the goodness of a string is measured by the total probability of its many derivations. However, finding the best string (e.g., during decoding) is then computationally intractable. Therefore, most systems use a simple Viterbi approximation that measures the goodness of a string using only its most probable derivation. Instead, we develop a variational approximation, which considers all the derivations but still allows tractable decoding. Our particular variational distributions are parameterized as n-gram models. We also analytically show that interpolating these n-gram models for different n is similar to minimum-risk decoding for BLEU (Tromble et al., 2008). Experiments show that our approach improves the state of the art.
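The contrast drawn in the abstract can be illustrated with a small Python sketch. It is not the authors' implementation: the toy derivation list, its probabilities, and all function names are hypothetical, and fitting the bigram model from expected bigram counts is an assumed (though natural) way to obtain an n-gram approximation. In a real system the derivations are packed in a forest and cannot be enumerated like this.

```python
from collections import defaultdict
import math

# Hypothetical toy distribution p(y, d | x): each entry is one derivation d
# of an output string y. The probabilities are made up and sum to 1.
DERIVATIONS = [
    ("the cat sat", 0.30),  # one derivation
    ("a cat sat", 0.25),    # three derivations of the same string
    ("a cat sat", 0.25),
    ("a cat sat", 0.20),
]


def viterbi_decode(derivs):
    """Return the string of the single most probable derivation."""
    best_string, _ = max(derivs, key=lambda pair: pair[1])
    return best_string


def string_totals(derivs):
    """Sum derivation probabilities per string (the exact marginal p(y | x))."""
    totals = defaultdict(float)
    for string, prob in derivs:
        totals[string] += prob
    return dict(totals)


def fit_bigram_model(derivs):
    """Fit q(word | prev) from expected bigram counts under p (assumed fitting rule)."""
    pair_counts = defaultdict(float)
    context_counts = defaultdict(float)
    for string, prob in derivs:
        tokens = ["<s>"] + string.split() + ["</s>"]
        for prev, word in zip(tokens, tokens[1:]):
            pair_counts[(prev, word)] += prob
            context_counts[prev] += prob
    return {pair: c / context_counts[pair[0]] for pair, c in pair_counts.items()}


def bigram_log_score(model, string):
    """Log-probability of a string under the bigram model q."""
    tokens = ["<s>"] + string.split() + ["</s>"]
    return sum(math.log(model[(prev, word)]) for prev, word in zip(tokens, tokens[1:]))


def variational_decode(derivs):
    """Pick the candidate string that scores highest under the fitted bigram model."""
    model = fit_bigram_model(derivs)
    candidates = {string for string, _ in derivs}
    return max(candidates, key=lambda s: bigram_log_score(model, s))
```

On this toy input, `viterbi_decode` returns "the cat sat" because its single derivation (0.30) beats every individual derivation of "a cat sat", while the exact marginal and the bigram approximation both prefer "a cat sat", whose derivations total 0.70.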

### Citations

1692 | BLEU: a method for automatic evaluation of machine translation
- Papineni, Roukos, et al.
- 2001
Citation Context: ...ed as favoring n-grams that are likely to appear in the reference translation (because they are likely in the derivation forest). However, in order to score well on the BLEU metric for MT evaluation (Papineni et al., 2001), which gives partial credit, we would also like to favor lower-order n-grams that are likely to appear in the reference, even if this means picking some less-likely high-order n-grams. For this reason...

949 | An empirical study of smoothing techniques for language modeling
- Chen, Goodman
- 1998
Citation Context: ... the NIST MT evaluation using a sampling method based on the n-gram matches between training and test sets in the foreign side. We also used a 5-gram language model with modified Kneser-Ney smoothing (Chen and Goodman, 1998), trained on a data set consisting of 130M words in English Gigaword (LDC2007T07) and the English side of the parallel corpora. We use GIZA++ (Och and Ney, 2000), a suffix-array (Lopez, 2007), SRIL...

872 | An introduction to variational methods for graphical models
- Jordan, Ghahramani, et al.
- 1998
Citation Context: ..., 2006). Stochastic techniques such as Markov Chain Monte Carlo are exact in the limit of infinite runtime, but tend to be too slow for large problems. By contrast, deterministic variational methods (Jordan et al., 1999), including message-passing (Minka, 2005), are inexact but scale up well. They approximate the original intractable distribution with one that factorizes better or has a specific parametric form (e.g....

528 | Minimum error rate training for statistical machine translation
- Och
- 2003
Citation Context: ...y|, corresponding to a conventional word penalty feature. In the geometric interpolation above, the weight θn controls the relative veto power of the n-gram approximation and can be tuned using MERT (Och, 2003) or a minimum risk procedure (Smith and Eisner, 2006). Lastly, note that Viterbi and variational approximation are different ways to approximate the exact probability p(y | x), and each of them has p...

518 | Improved statistical alignment models
- Och, Ney
- 2000
Citation Context: ... modified Kneser-Ney smoothing (Chen and Goodman, 1998), trained on a data set consisting of 130M words in English Gigaword (LDC2007T07) and the English side of the parallel corpora. We use GIZA++ (Och and Ney, 2000), a suffix-array (Lopez, 2007), SRILM (Stolcke, 2002), and risk-based deterministic annealing (Smith and Eisner, 2006) to obtain word alignments, translation models, language models, and the optim...

489 | Pattern recognition and machine learning
- Bishop
Citation Context: ..., which considers all the derivations but still allows tractable decoding. 3.1 Approximate Inference: There are several popular approaches to approximate inference when exact inference is intractable (Bishop, 2006). Stochastic techniques such as Markov Chain Monte Carlo are exact in the limit of infinite runtime, but tend to be too slow for large problems. By contrast, deterministic variational methods (Jordan...

432 | Hierarchical phrase-based translation
- Chiang
- 2007
Citation Context: ...l latent variables (so-called nuisance variables) that are not of interest to the user. [Footnote 1: These nuisance variables may be annotated in training data, but it is more common for them to be latent even there, i.e., there is no supervision as to their “correct” values.] For example, though machine translation (MT) seeks to output a string, typical MT systems (Koehn et al., 2003; Chiang, 2007) will also recover a ...

142 | Minimum Bayes-risk decoding for statistical machine translation
- Kumar, Byrne
- 2004
Citation Context: ... particular variational distributions are parameterized as n-gram models. We also analytically show that interpolating these n-gram models for different n is similar to minimum-risk decoding for BLEU (Tromble et al., 2008). Experiments show that our approach improves the state of the art. 1 Introduction: Ambiguity is a central issue in natural language processing. Many systems try to resolve ambiguities in the input, f...

105 | Computational complexity of probabilistic disambiguation by means of tree grammars - Sima'an - 1996 |

80 | Joshua: An open source toolkit for parsing-based machine translation |

61 | Efficient algorithms for parsing the DOP model
- Goodman
- 1996
Citation Context: ... parsing (DOP), applications of Hidden Markov Models (HMMs) and mixture models, and other models with latent variables. Indeed, our methods were inspired by past work on variational decoding for DOP (Goodman, 1996) and for latent-variable parsing (Matsuzaki et al., 2005). 2 Background. 2.1 Terminology: In MT, spurious ambiguity occurs both in regular phrase-based systems (e.g., Koehn et al. (2003)), where differ...

59 | Forest rescoring: Faster decoding with integrated language models
- Huang, Chiang
- 2007
Citation Context: ...eeds to be carried out for each member of T(x), the decoding problem of (2) turns out to be NP-hard, as shown by Sima’an (1996) for a similar problem. Footnote 3: A hypergraph is analogous to a parse forest (Huang and Chiang, 2007). (A finite-state lattice is a special case.) It can be used to encode exponentially many hypotheses generated by a phrase-based MT system (e.g., Koehn et al. (2003)) or a syntax-based MT system (e.g...

57 | Minimum risk annealing for training log-linear models
- Smith, Eisner
- 2006
Citation Context: ... penalty feature. In the geometric interpolation above, the weight θn controls the relative veto power of the n-gram approximation and can be tuned using MERT (Och, 2003) or a minimum risk procedure (Smith and Eisner, 2006). Lastly, note that Viterbi and variational approximation are different ways to approximate the exact probability p(y | x), and each of them has pros and cons. Specifically, Viterbi approximation use...

54 | A discriminative latent variable model for statistical machine translation
- Blunsom, Cohn, et al.
- 2008
Citation Context: ...the marginalization for a particular y would be tractable; it is used at training time in certain training objective functions, e.g., maximizing the conditional likelihood of a reference translation (Blunsom et al., 2008). 2.3 Viterbi Approximation: To approximate the intractable decoding problem of (2), most MT systems (Koehn et al., 2003; Chiang, 2007) use a simple Viterbi approximation, y* = argmax y∈T(x) = argmax...

40 | Hierarchical phrase-based translation with suffix arrays
- Lopez
- 2007
Citation Context: ... and Goodman, 1998), trained on a data set consisting of 130M words in English Gigaword (LDC2007T07) and the English side of the parallel corpora. We use GIZA++ (Och and Ney, 2000), a suffix-array (Lopez, 2007), SRILM (Stolcke, 2002), and risk-based deterministic annealing (Smith and Eisner, 2006) to obtain word alignments, translation models, language models, and the optimal weights for combining these...

28 | N-gram Posterior Probabilities for Statistical Machine Translation - Zens, Ney - 2006 |

22 | A better n-best list: Practical determinization of weighted finite tree automata - May, Knight - 2006 |

20 | Computational complexity of problems on probabilistic grammars and transducers
- Casacuberta, Higuera
- 2000
Citation Context: ...ss of a possible MT output string should be measured by summing up the probabilities of all its derivations. Unfortunately, finding the best string is then computationally intractable (Sima’an, 1996; Casacuberta and Higuera, 2000). Therefore, most systems merely identify the single most probable derivation and report the corresponding string. This corresponds to a Viterbi approximation that measures the goodness of an outpu...

19 | Fast consensus decoding over translation forests - DeNero, Chiang, et al. - 2009 |

6 | A general technique to train language models on language models - Nederhof - 2005 |