## Lattice Rescoring Methods for Statistical Machine Translation

### BibTeX

```bibtex
@MISC{Blackwood_latticerescoring,
  author = {Graeme Blackwood},
  title  = {Lattice Rescoring Methods for Statistical Machine Translation},
  year   = {}
}
```

### Abstract

This dissertation is the result of my own work and includes nothing which is the outcome of work done in collaboration except where specifically indicated in the text. It has not been submitted in whole or in part for a degree at any other university. Some of the work has been published previously in conference proceedings (Blackwood et al., 2008a; Blackwood et al., 2008b; Blackwood et al., 2009; Kurimo et al., 2009) and a journal article (de Gispert et al., 2010), or accepted for publication in forthcoming conference proceedings (Blackwood and Byrne, 2010). The length of this thesis, including appendices, references, footnotes, tables, and equations, is approximately 53,000 words, and it contains 56 figures and 58 tables.

### Summary

Modern statistical machine translation (SMT) systems include multiple interrelated components, statistical models, and processes. Translation is often factored as a cascaded series of modules such that the output of one module serves as the input to the next; this is the SMT pipeline. Simplifying assumptions, limited training data, and pruning during search mean that the maximum likelihood hypothesis may not represent the best translation. Since any errors will be propagated through the SMT pipeline, it is better to avoid hard decisions by …

### Citations

8905 | Maximum Likelihood from Incomplete Data via the EM Algorithm - Dempster, Laird, et al. - 1977
Citation Context: …Even though the word alignment between source and target sentences in the parallel data is not explicit, the alignment probabilities can still be learned using the expectation-maximisation algorithm (Dempster et al., 1977). 4.1.3 Phrase-Based Statistical Machine Translation: In the word-based generative model of statistical machine translation (Brown et al., 1993), words are inserted, deleted, translated and reordered…

1622 | BLEU: a method for automatic evaluation of machine translation (IBM Research Report RC22176) - Papineni, Roukos - 2001

1376 | A systematic comparison of various statistical alignment models (Computational Linguistics) - Och, Ney - 2003
Citation Context: …English Europarl translation. The difference in the number of English words for the two tracks is a result of limitations in the word alignment algorithm. Word alignments were generated using GIZA++ (Och and Ney, 2003) over a stemmed version of the parallel text. Stems for each source language were obtained using the Snowball stemmer. After unioning the Viterbi alignments and replacing stems with their original…

1273 | The mathematics of statistical machine translation: Parameter estimation - Brown, Pietra, et al. - 1993
Citation Context: …learned using the expectation-maximisation algorithm (Dempster et al., 1977). 4.1.3 Phrase-Based Statistical Machine Translation: In the word-based generative model of statistical machine translation (Brown et al., 1993), words are inserted, deleted, translated and reordered according to distributions learned from the alignments. Phrase-based statistical machine translation (Koehn et al., 2003), developed from the Alignment Template…

1146 | A maximum entropy approach to natural language processing - Berger, Pietra, et al. - 1996
Citation Context: …translation rules may simplify translation decoding. Instead of inverting and decomposing via Bayes' rule, the posterior probability of a candidate translation can be modelled directly using maximum entropy (Berger et al., 1996; Papineni et al., 1998). The maximum entropy model is defined by a set of M feature functions h_m(e, f) and associated feature weights λ_m for m = 1, …, M. The direct translation probability is given b…
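
The direct model described in this context is a log-linear combination of feature functions h_m(e, f) with weights λ_m, normalised over the candidate set. A minimal sketch (function and feature names are illustrative, not from the thesis):

```python
import math

def loglinear_score(features, weights):
    """Log-linear (maximum entropy) model score: sum_m lambda_m * h_m(e, f)."""
    return sum(weights[m] * h for m, h in features.items())

def direct_translation_prob(candidates, weights):
    """Normalise exp(score) over a candidate set to obtain P(e|f).

    `candidates` maps each hypothesis e to its feature values h_m(e, f).
    """
    scores = {e: loglinear_score(h, weights) for e, h in candidates.items()}
    z = sum(math.exp(s) for s in scores.values())
    return {e: math.exp(s) / z for e, s in scores.items()}
```

In practice the normalisation term is only needed for training; at decoding time the argmax over unnormalised scores suffices.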

1011 | Head-Driven Statistical Models for Natural Language Parsing - Collins - 1999
Citation Context: …and long-span language models estimated over massive monolingual text collections such as the Google Books project. Hypothesis space constraints derived from statistical parsing (Charniak, 1997; Collins, 1999) of partial hypotheses may also lead to higher levels of SMT quality and fluency. An alternative method for improving SMT fluency is to augment or replace the MBR hypothesis space with new hypotheses…

990 | Moses: Open source toolkit for statistical machine translation - Koehn, Hoang, et al. - 2007
Citation Context: …In practice, approximations are used that render the search tractable at the expense of search errors. Most phrase-based statistical machine translation decoders, such as Pharaoh (Koehn, 2004), Moses (Koehn et al., 2007), and the decoder of Och and Ney (2004), generate translation hypotheses from left to right in target language word order. The search space has the form of a directed acyclic graph where states encode…

927 | An empirical study of smoothing techniques for language modeling - Chen, Goodman - 1998
Citation Context: …a more powerful language model (LM) than is normally possible in first-pass translation. This thesis shows how WFSTs can be used for efficient SMT lattice rescoring with sentence-specific n-gram LMs (Chen and Goodman, 1998; Huang et al., 2001) estimated over multi-billion-word training corpora; significant improvements in BLEU score are observed with respect to the baseline system (Blackwood et al., 2009). Phrasal segm…

853 | SRILM - An Extensible Language Model Toolkit - Stolcke - 2002
Citation Context: …language models estimated over large corpora result in a huge number of model parameters; it is often impossible to store all of these probabilities in memory during decoding. Count frequency cutoffs (Stolcke, 2002), probability quantisation, entropy-based pruning (Stolcke, 1998), and Bloom filters (Talbot and Osborne, 2007) can be used to reduce the memory requirements of a language model, but some of these te…

786 | Statistical methods for speech recognition - Jelinek - 1997

781 | Introduction to Algorithms, Second Edition - Cormen, Leiserson, et al. - 2001
Citation Context: …the top translation hypotheses can be generated by recording at each state back-pointers to previous partial hypotheses (Koehn, 2010). Exact decoding can be implemented using an A* search heuristic (Cormen et al., 2001). In practice, however, this is prohibitively slow, so a beam search is used instead. In order to apply pruning fairly, the states in the search space are organised into stacks or priority queues so t…

712 | Statistical phrase-based translation - Koehn, Och, et al. - 2003
Citation Context: …statistical machine translation (Brown et al., 1993), words are inserted, deleted, translated and reordered according to distributions learned from the alignments. Phrase-based statistical machine translation (Koehn et al., 2003), developed from the Alignment Template approach of Och and Ney (2004), uses phrases instead of single words as the fundamental unit of translation. In p…

700 | Estimation of Probabilities from Sparse Data for the Language Model Component of a Speech Recognizer - Katz - 1987
Citation Context: …only for n-grams with counts lower than the cutoff threshold. 3.2.2.1 Backoff Models: Most n-gram language models used in ASR and SMT are backoff models (Katz, 1987). The general form of a backoff n-gram language model defines the conditional probability P_BO(w_i | w_{i−n+1}^{i−1}) of word w_i given history w_{i−n+1}^{i−1} recursively as P_BO(w_i | w_{i−n+1}^{i−1}) = α(w_i | w_{i−n+1}^{i−1})…
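
The backoff recursion excerpted above can be sketched as follows, where `alpha` holds discounted probabilities for observed n-grams and `backoff_weight` holds the normalising weights for each history; the unigram floor value is an assumption of this toy model, not part of the Katz formulation:

```python
def backoff_prob(ngram, alpha, backoff_weight):
    """Katz-style backoff: use the discounted probability alpha if the n-gram
    was observed; otherwise back off to the (n-1)-gram, scaled by the history's
    backoff weight so the distribution still sums to one."""
    if ngram in alpha:
        return alpha[ngram]
    if len(ngram) == 1:
        return 1e-7          # unigram floor (illustrative assumption)
    history = ngram[:-1]
    return backoff_weight.get(history, 1.0) * backoff_prob(ngram[1:], alpha, backoff_weight)
```

Dropping the earliest word (`ngram[1:]`) shortens the history at each level of the recursion, exactly as the P_BO recursion above does.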

642 | Text compression - Bell, Cleary, et al. - 1990
Citation Context: …discounting or linear discounting subtract a small fixed constant or scale the observed counts so that the discounted probability mass can be reassigned to unobserved n-grams. Witten-Bell discounting (Bell et al., 1990) computes discount coefficients that are proportional to the number of distinct words that follow the n-gram history. Good-Turing discounting (Good, 1953) adjusts the observed frequencies such that a…

636 | A Statistical Approach to Machine Translation - Brown, Cocke, et al. - 1990
Citation Context: …of Statistical Machine Translation: The first influential framework for statistical machine translation described the process of translating between two languages in terms of the source-channel model (Brown et al., 1990, 1993). Foreign sentences are considered to be English sentences that have passed through a noisy communication channel corrupting their surface form. The task of translation is to recover the hidden…

511 | Minimum error rate training for statistical machine translation - Och - 2003
Citation Context: …Brown et al. (1990). Significant advances include the move to translation models based on phrases (Och, 2002; Koehn, 2010), the incorporation of discriminative training and parameter optimisation (Och and Ney, 2002; Och, 2003), and the introduction of synchronous context-free grammars capable of supporting sophisticated reordering and movement of phrases (Chiang, 2005, 2007). Depending on the genre and nature of the trans…

417 | Discriminative Training and Maximum Entropy Models for Statistical Machine Translation - Och, Ney - 2002
Citation Context: …of Brown et al. (1990). Significant advances include the move to translation models based on phrases (Och, 2002; Koehn, 2010), the incorporation of discriminative training and parameter optimisation (Och and Ney, 2002; Och, 2003), and the introduction of synchronous context-free grammars capable of supporting sophisticated reordering and movement of phrases (Chiang, 2005, 2007). Depending on the genre and nature o…

415 | A maximum likelihood approach to continuous speech recognition - Bahl, Jelinek, et al. - 1990
Citation Context: …weighted sum of probabilities computed from n-grams of different orders. The weights of the interpolated model can be optimised on a corpus of representative held-out data using deleted interpolation (Bahl et al., 1990). The interpolation weights can also be conditioned on the context. The optimised weights then indicate the reliability of the distribution at each order, given the history. 3.2.3 Language Model Smoo…

413 | Hierarchical phrase-based translation - Chiang - 2007
Citation Context: …by restricting the number of source words covered by each rule application or constraining the number and positions of production non-terminals. This also reduces the problem of spurious ambiguity (Chiang, 2007). Maximum likelihood estimates of rule probabilities can be computed from the counts using relative frequency. Hierarchical rules capture local context and reordering in a similar way to phrases in p…

403 | A study of translation edit rate with targeted human annotation - Snover, Dorr, et al. - 2006
Citation Context: …(Lavie and Denkowski, 2009). http://snowball.tartarus.org, http://wordnet.princeton.edu. 4.4.4 TER - Translation Edit Rate: Translation edit rate (TER) (Snover et al., 2006) is the minimum number of edits required to modify a translation hypothesis such that it exactly matches one of the references. If there are multiple references, TER is the smallest number of edits t…
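
The TER definition quoted here can be approximated with a plain word-level edit distance normalised by the average reference length; note that full TER as defined by Snover et al. also permits block shifts, which this sketch omits:

```python
def ter(hypothesis, references):
    """Approximate TER: word-level edit distance (insert/delete/substitute)
    to the closest reference, divided by the average reference length.
    Full TER additionally allows block shifts at unit cost (omitted here)."""
    def edits(h, r):
        # Rolling-row Levenshtein distance over word sequences.
        d = list(range(len(r) + 1))
        for i, hw in enumerate(h, 1):
            prev, d[0] = d[0], i
            for j, rw in enumerate(r, 1):
                prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (hw != rw))
        return d[-1]
    avg_len = sum(len(r) for r in references) / len(references)
    return min(edits(hypothesis, r) for r in references) / avg_len
```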

399 | A Hierarchical Phrase-Based Model for Statistical Machine Translation - Chiang - 2005
Citation Context: …discriminative training and parameter optimisation (Och and Ney, 2002; Och, 2003), and the introduction of synchronous context-free grammars capable of supporting sophisticated reordering and movement of phrases (Chiang, 2005, 2007). Depending on the genre and nature of the translation task, however, both fluency and adequacy are still often lacking in translations produced using SMT. There is certainly a great deal of ro…

399 | The population frequencies of species and the estimation of population parameters (Biometrika 40:237–264) - Good - 1953
Citation Context: …unobserved n-grams. Witten-Bell discounting (Bell et al., 1990) computes discount coefficients that are proportional to the number of distinct words that follow the n-gram history. Good-Turing discounting (Good, 1953) adjusts the observed frequencies such that an n-gram that occurs r times in the training data is treated as if it had occurred r* times. The modified counts are computed from the observed counts as…
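
The Good-Turing adjustment described in this context, r* = (r + 1) n_{r+1} / n_r, where n_r is the number of distinct n-grams observed exactly r times, can be sketched as (unsmoothed count-of-counts; real estimators smooth n_r for large r):

```python
from collections import Counter

def good_turing_counts(ngram_counts):
    """Good-Turing: an n-gram observed r times is re-estimated as
    r* = (r + 1) * n_{r+1} / n_r, with n_r the count-of-counts."""
    n = Counter(ngram_counts.values())   # n_r: how many n-grams occur r times
    return {g: (r + 1) * n[r + 1] / n[r] for g, r in ngram_counts.items()}
```

The mass removed from observed n-grams (each r* < r on average) is what gets redistributed to unseen events.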

386 | The alignment template approach to statistical machine translation - Och, Ney - 2004

381 | The estimation of stochastic context-free grammars using the Inside-Outside algorithm (Computer Speech and Language) - Lari, Young - 1990
Citation Context: …stochastic context-free grammars assign a probability to each rule; these probabilities can be estimated from training data using the inside-outside algorithm (Lari and Young, 1990). The probability of a complete parse is the product of the probabilities of rules used in the derivation. Multiple derivations may yield the same string; the probability of a string is the sum of th…

356 | A post-processing system to yield reduced word error rates: Recognizer Output Voting Error Reduction (ROVER) - Fiscus - 1997
Citation Context: …network decoding has been successfully applied to combine multiple system outputs in automatic speech recognition using recogniser output voting error reduction (ROVER) based on simple confidence measures (Fiscus, 1997). More recently, consensus decoding techniques have been demonstrated to improve the quality of machine translation (Matusov et al., 2006; Rosti et al., 2007a,b; Sim et al., 2007). The importance of…

345 | Automatic evaluation of machine translation quality using n-gram co-occurrence statistics - Doddington - 2002
Citation Context: …the same order when computing precisions, even though some words and phrases are obviously more important than others. This is particularly true when considering translation adequacy. The NIST score (Doddington, 2002) uses information weights to distinguish between informative and uninformative n-gram matches. These weights are computed for each n-gram u = w_1…w_n using counts in the reference translations: c(w_1…
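
The NIST information weights mentioned in this context are typically computed as Info(w_1…w_n) = log2(count(w_1…w_{n-1}) / count(w_1…w_n)) over the reference translations, so rarer continuations receive higher weight. A simplified illustration (not the official NIST `mteval` scorer):

```python
import math
from collections import Counter

def nist_info_weights(references, max_n=5):
    """Information weight for each reference n-gram:
    Info(w1..wn) = log2(count(w1..w_{n-1}) / count(w1..wn)).
    For unigrams the history count is the total number of reference words."""
    counts = Counter()
    total_words = 0
    for ref in references:
        total_words += len(ref)
        for n in range(1, max_n + 1):
            for i in range(len(ref) - n + 1):
                counts[tuple(ref[i:i + n])] += 1
    info = {}
    for gram, c in counts.items():
        hist = counts[gram[:-1]] if len(gram) > 1 else total_words
        info[gram] = math.log2(hist / c)
    return info
```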

317 | Finite-state transducers in language and speech processing - Mohri - 1997
Citation Context: …(Kumar et al., 2006) is a generative model of translation that applies a series of transformations specified by conditional probability distributions and encoded as weighted finite-state transducers (Mohri, 1997; Mohri et al., 2008). The TTM is based on the generative source-channel model of SMT (Brown et al., 1990), so in the following discussion the ‘target language sentence’ refers to the input sentence in…

313 | Europarl: A Parallel Corpus for Statistical Machine Translation - Koehn - 2005
Citation Context: …quality of statistical machine translation. Although parallel text collections such as the proceedings of the United Nations (Graff, 1994), Canadian Hansard (Germann, 2001), and European Parliament (Koehn, 2005) are of paramount importance in training the parameters of statistical translation models, parallel data is expensive to produce and therefore usually only available in limited quantities. Large Arab…

305 | Spoken Language Processing: A Guide to Theory, Algorithm, and System Development - Huang, Acero, et al. - 2001
Citation Context: …a more powerful language model (LM) than is normally possible in first-pass translation. This thesis shows how WFSTs can be used for efficient SMT lattice rescoring with sentence-specific n-gram LMs (Chen and Goodman, 1998; Huang et al., 2001) estimated over multi-billion-word training corpora; significant improvements in BLEU score are observed with respect to the baseline system (Blackwood et al., 2009). Phrasal segmentation models base…

302 | Speech and Language Processing: An Introduction to… - Jurafsky, Martin - 2008
Citation Context: …context-free grammars, also known as phrase-structure grammars, define a formal language of strings and associated hierarchical structure in terms of non-overlapping constituents (Manning and Schutze, 1999; Jurafsky and Martin, 2008). The rules or productions of the grammar define a relation that specifies how the single (i.e. context-free) non-terminal left-hand side of a rule can be rewritten as a mixed string of non-terminals…

299 | Improved backing-off for m-gram language modeling - Kneser, Ney - 1995
Citation Context: …sum over w_i : r > 0 since, for higher-order n, it is much more efficient to sum over the observed n-grams than the more numerous unobserved n-grams. 3.2.3.3 Kneser-Ney Smoothing: Kneser-Ney smoothing (Kneser and Ney, 1995) is one of the most popular smoothing methods in modern ASR and SMT systems. A modified version that includes interpolation with lower-order distributions has been demonstrated to provide improved LM…
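
A key idea behind Kneser-Ney smoothing is that the lower-order distribution uses continuation counts (how many distinct histories precede a word) rather than raw frequencies, so a word like "Francisco" that follows only one history gets a low unigram weight. A minimal sketch of the continuation distribution (names are illustrative):

```python
from collections import defaultdict

def continuation_probs(bigrams):
    """Kneser-Ney continuation probability for the lower-order model:
    P_cont(w) = |{w' : c(w', w) > 0}| / |{(w', w) : c(w', w) > 0}|,
    i.e. the number of distinct histories w follows, normalised by the
    total number of distinct observed bigram types."""
    preceders = defaultdict(set)
    for history, word in bigrams:
        preceders[word].add(history)
    total = sum(len(s) for s in preceders.values())
    return {w: len(s) / total for w, s in preceders.items()}
```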

157 | METEOR: An automatic metric for MT evaluation with improved correlation with human judgments - Banerjee, Lavie - 2005
Citation Context: …than BLEU and to correlate better with human assessments of adequacy for a range of different languages and test sets. 4.4.3 METEOR: METEOR (Metric for Evaluation of Translation with Explicit Ordering) (Banerjee and Lavie, 2005) compensates for weaknesses in BLEU by combining both precision and recall, computed with respect to an explicit one-to-one word alignment of the system output with each available reference. One of it…

138 | Minimum Bayes-Risk Decoding for Statistical Machine Translation - Kumar, Byrne - 2004
Citation Context: …with Weighted Finite-State Transducers: Minimum Bayes-risk (MBR) decoding has been found useful in many areas of natural language processing (Duda et al., 2000; Goel and Byrne, 2000; Goel et al., 2004; Kumar and Byrne, 2004). This chapter describes the use of MBR decoding to improve the quality of large-scale statistical machine translation systems. The general form of the MBR decoder is first defined and described. A l…

136 | Pharaoh: A Beam Search Decoder for Phrase-Based Statistical Machine Translation Models - Koehn - 2004
Citation Context: …cannot be implemented. In practice, approximations are used that render the search tractable at the expense of search errors. Most phrase-based statistical machine translation decoders, such as Pharaoh (Koehn, 2004), Moses (Koehn et al., 2007), and the decoder of Och and Ney (2004), generate translation hypotheses from left to right in target language word order. The search space has the form of a directed acycl…

133 | A smorgasbord of features for statistical machine translation - Och, Gildea, et al. - 2004
Citation Context: …ings amongst the vast number of hypotheses encoded in SMT lattices. Oracle BLEU scores computed over k-best lists have shown that many high-quality hypotheses are produced by first-pass SMT decoding (Och et al., 2004). The difficulty of enhancing the fluency of complete hypotheses is reduced by first identifying regions of high confidence in the ML translation and using these to guide the fluency refinement proce…

127 | Entropy-based pruning of backoff language models - Stolcke - 1998
Citation Context: …number of model parameters; it is often impossible to store all of these probabilities in memory during decoding. Count frequency cutoffs (Stolcke, 2002), probability quantisation, entropy-based pruning (Stolcke, 1998), and Bloom filters (Talbot and Osborne, 2007) can be used to reduce the memory requirements of a language model, but these techniques discard potentially useful information, which may degrade the qual…

124 | Large language models in machine translation - Brants, Popat, et al. - 2007
Citation Context: …quantities of language model training data; the largest experiments reported in the literature use a 5-gram LM estimated over approximately 1.8 trillion tokens of English text (Brants et al., 2007). SMT research usually views increasing monolingual data as simply facilitating higher-order n-gram language models and better parameter estimation. However, there are other complementary ways in whi…

122 | toolkit for statistical machine translation - Koehn, Hoang, et al. - 2007
Citation Context: …and the presence or absence of particular cultural conventions all combine to make high-quality automatic machine translation extremely challenging. The statistical approach to machine translation (Koehn, 2010), driven largely by the increased availability of parallel training corpora and widespread acceptance of automatic quality metrics such as BLEU (Papineni et al., 2002b), addresses many of these issue…

116 | The Hitchhiker's Guide to the Galaxy - Adams - 1979
Citation Context: …into a different language (the target). One possible future application is a multilingual, real-time speech-to-speech translation device such as the Babelfish in The Hitchhiker's Guide to the Galaxy (Adams, 1979). Such a sophisticated translation device, however, is still very much a long-term goal; machine translation, particularly speech translation, is a highly complex task with many unresolved difficulti…

99 | Statistical techniques for natural language parsing - Charniak - 1997
Citation Context: …et al., 2007a,b), and long-span language models estimated over massive monolingual text collections such as the Google Books project. Hypothesis space constraints derived from statistical parsing (Charniak, 1997; Collins, 1999) of partial hypotheses may also lead to higher levels of SMT quality and fluency. An alternative method for improving SMT fluency is to augment or replace the MBR hypothesis space with…

81 | Re-evaluating the Role of Bleu in Machine Translation Research - Callison-Burch, Osborne, et al. - 2006

80 | Arabic Tokenization, Part-of-Speech Tagging and Morphological Disambiguation in One Fell Swoop - Habash, Rambow - 2005

76 | Syntax-directed transduction - Lewis, Stearns - 1968
Citation Context: …4.3.2 Synchronous Context-Free Grammars: Hierarchical phrase-based machine translation is based on the theory of synchronous context-free grammars, also known as syntax-directed transduction grammars (Lewis and Stearns, 1968). Rules in a synchronous context-free grammar consist of a non-terminal left-hand side and two sequences of terminals and non-terminals on the right-hand side, one in the source language and one in t…

76 | Semiring frameworks and algorithms for shortest-distance problems - Mohri - 2002
Citation Context: …input label. These operations can significantly reduce the number of states and arcs. The shortest-path algorithm in the tropical semiring can be used to find the k lowest-cost paths in a transducer (Mohri, 2002). This allows the best string(s) to be efficiently extracted and provides a generic implementation of the argmax and argmin operations in probabilistic models. The prune operation discards paths base…
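
The tropical-semiring shortest-distance computation mentioned in this context (weights combine along a path by +, and alternative paths combine by min) can be illustrated with a small Dijkstra-style search; the arc representation below is an assumption of this sketch, not the OpenFst API:

```python
import heapq

def tropical_shortest_distance(arcs, start, final):
    """Single-source shortest distance in the tropical semiring (min, +):
    a path's weight is the sum of its arc costs, and the 'sum' over all
    paths is their minimum, i.e. the cost of the best (argmin) path."""
    graph = {}
    for src, dst, cost in arcs:
        graph.setdefault(src, []).append((dst, cost))
    dist = {start: 0.0}
    heap = [(0.0, start)]
    while heap:
        d, state = heapq.heappop(heap)
        if d > dist.get(state, float("inf")):
            continue                      # stale queue entry
        for dst, cost in graph.get(state, ()):
            nd = d + cost                 # extend path weight (⊗ = +)
            if nd < dist.get(dst, float("inf")):
                dist[dst] = nd            # keep the better path (⊕ = min)
                heapq.heappush(heap, (nd, dst))
    return dist.get(final, float("inf"))
```

Negative-log probabilities as arc costs make this min-sum search equivalent to extracting the maximum-probability path.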

74 | OpenFst: a general and efficient weighted finite-state transducer library - Allauzen, Riley, et al. - 2007
Citation Context: …by the intersection of two acceptors is (A1 ∩ A2)(a) = A1(a) ⊗ A2(a). Other operations that manipulate the language of strings represented by a WFST include closure, reverse, invert, and difference (Allauzen et al., 2007). 2.4.1 Optimisation and Search Procedures: General-purpose operations and algorithms are available for optimising WFSTs with respect to time and memory. Optimisation does not affect the language of a…

66 | An overview of probabilistic tree transducers for natural language processing - Knight, Graehl - 2005
Citation Context: …be translated. It is for these reasons that phrase-based methods are such an effective paradigm for SMT research, although in some tasks, such as Chinese→English translation, syntax-based approaches (Knight and Graehl, 2005) now tend to dominate. Phrase-based statistical machine translation starts with the segmentation of foreign sentence f into a sequence of I phrases f̄_1, …, f̄_I. The segmentation process is not u…

62 | A generalized CYK algorithm for parsing stochastic CFG - Chappelier, Rajman - 1998
Citation Context: …transducers (Mohri, 1997; Mohri et al., 2008). Translation in HiFST is performed in two stages. In the first stage, the source language sentence is parsed according to a variant of the CYK algorithm (Chappelier and Rajman, 1998). In the second stage, the parse tree drives the generation of a target language word lattice containing all possible translations and derivations of the source sentence. The following description of…

61 | Statistical language model adaptation: review and perspectives - Bellegarda - 2004
Citation Context: …data for estimating the parameters of the phrasal segmentation model. When in-domain data is of limited availability, count mixing (Bacchiani et al., 2004) or other language model adaptation strategies (Bellegarda, 2004) may lead to improved performance. 6.3.2.2 Phrase Penalty Tuning: The role of the phrase penalty ϕ is to encourage longer phrases in translation. Table 6.6 shows the effect of tuning this parameter. T…

61 | Statistical Machine Translation: From Single-Word Models to Alignment Templates - Och - 2002
Citation Context: …progress has been made in statistical machine translation since the original word-based formulation of Brown et al. (1990). Significant advances include the move to translation models based on phrases (Och, 2002; Koehn, 2010), the incorporation of discriminative training and parameter optimisation (Och and Ney, 2002; Och, 2003), and the introduction of synchronous context-free grammars capable of supporting…

54 | Computing Consensus Translation from Multiple Machine Translation Systems Using Enhanced Hypotheses Alignment - Matusov, Ueffing, et al. - 2006
Citation Context: …output voting error reduction (ROVER) based on simple confidence measures (Fiscus, 1997). More recently, consensus decoding techniques have been demonstrated to improve the quality of machine translation (Matusov et al., 2006; Rosti et al., 2007a,b; Sim et al., 2007). The importance of ensuring sufficient diversity amongst individual system outputs is shown to have a significant impact on consensus decoding performance in…