Automatically Improved Category Labels for Syntax-Based Statistical Machine Translation (2011)
BibTeX
@MISC{Hanneman11automaticallyimproved,
author = {Greg Hanneman},
title = {Automatically Improved Category Labels for Syntax-Based Statistical Machine Translation},
year = {2011}
}
Abstract
A common modeling choice in syntax-based statistical machine translation is the use of synchronous context-free grammars, or SCFGs. When training a translation model in a supervised setting, an SCFG is extracted from parallel text that has been statistically word-aligned and parsed by monolingual statistical parsers. However, the set of syntactic category labels used in a monolingual statistical parser is chosen independently of the machine translation task, and there is no guarantee that it is optimal for a bilingual SCFG or for machine translation at all. In this thesis, we first demonstrate that the set of category labels used in a machine translation system’s grammar strongly affects three interrelated characteristics of the system: spurious ambiguity, rule sparsity, and reordering precision. We propose using these characteristics as the basis for evaluating the properties of an SCFG both outside and within an actual translation task. Finally, as our main work, we propose three automatic relabeling methods that will create a better set of category labels for a given language pair







