Results 1 -
7 of
7
Exploiting Parallel Treebanks to Improve Phrase-Based Statistical Machine Translation
"... We use existing tools to automatically build two parallel treebanks from existing parallel corpora. We then show that combining the data extracted from both the treebanks and the corpora into a single translation model can improve the translation quality in a baseline phrasebased statistical machine ..."
Abstract
-
Cited by 14 (6 self)
- Add to MetaCart
We use existing tools to automatically build two parallel treebanks from existing parallel corpora. We then show that combining the data extracted from both the treebanks and the corpora into a single translation model can improve the translation quality in a baseline phrasebased statistical machine translation system.
Extraction of Syntactic Translation Models from Parallel Data using Syntax from Source and Target Languages
"... We propose a generic rule induction framework that is informed by syntax from both sides of a parsed parallel corpus, as sets of structural, boundary and labeling related constraints. Factoring syntax in this manner empowers our framework to work with independent annotations coming from multiple res ..."
Abstract
-
Cited by 5 (1 self)
- Add to MetaCart
We propose a generic rule induction framework that is informed by syntax from both sides of a parsed parallel corpus, as sets of structural, boundary and labeling related constraints. Factoring syntax in this manner empowers our framework to work with independent annotations coming from multiple resources and not necessarily a single syntactic structure. We then explore the issue of lexical coverage of translation models learned in different scenarios using syntax from one side vs. both sides. We specifically look at how the non-isomorphic nature of parse trees for the two languages affects coverage. We propose a novel technique for restructuring targetside parse trees, that generates alternate isomorphic target trees that preserve the syntactic boundaries of constituents that were aligned in the original parse trees. We also show that combining rules extracted by restructuring syntactic trees on both sides produces significantly better translation models. The improved precision and coverage of our syntax tables particularly fill in for the lack of lexical coverage in Syntax based Machine Translation approaches. 1
Parallel Treebanks in Phrase-Based Statistical Machine Translation
"... Abstract. Given much recent discussion and the shift in focus of the field, it is becoming apparent that the incorporation of syntax is the way forward for the current state-of-the-art in machine translation (MT). Parallel treebanks are a relatively recent innovation and appear to be ideal candidate ..."
Abstract
-
Cited by 3 (1 self)
- Add to MetaCart
Abstract. Given much recent discussion and the shift in focus of the field, it is becoming apparent that the incorporation of syntax is the way forward for the current state-of-the-art in machine translation (MT). Parallel treebanks are a relatively recent innovation and appear to be ideal candidates for MT training material. However, until recently there has been no other means to build them than by hand. In this paper, we describe how we make use of new tools to automatically build a large parallel treebank and extract a set of linguistically motivated phrase pairs from it. We show that adding these phrase pairs to the translation model of a baseline phrase-based statistical MT (PBSMT) system leads to significant improvements in translation quality. We describe further experiments on incorporating parallel treebank information into PBSMT, such as word alignments. We investigate the conditions under which the incorporation of parallel treebank data performs optimally. Finally, we discuss the potential of parallel treebanks in other paradigms of MT. 1
Exploiting Alignment Techniques in MATREX: the DCU Machine Translation System for IWSLT 2008
"... In this paper, we give a description of the machine translation (MT) system developed at DCU that was used for our third participation in the evaluation campaign of the International Workshop on Spoken Language Translation (IWSLT 2008). In this participation, we focus on various techniques for word ..."
Abstract
-
Cited by 1 (1 self)
- Add to MetaCart
In this paper, we give a description of the machine translation (MT) system developed at DCU that was used for our third participation in the evaluation campaign of the International Workshop on Spoken Language Translation (IWSLT 2008). In this participation, we focus on various techniques for word and phrase alignment to improve system quality. Specifically, we try out our word packing and syntax-enhanced word alignment techniques for the Chinese–English task and for the English–Chinese task for the first time. For all translation tasks except Arabic–English, we exploit linguistically motivated bilingual phrase pairs extracted from parallel treebanks. We smooth our translation tables with out-of-domain word translations for the Arabic–English and Chinese–English tasks in order to solve the problem of the high number of out of vocabulary items. We also carried out experiments combining both in-domain and out-of-domain data to improve system performance and, finally, we deploy a majority voting procedure combining a language modelbased method and a translation-based method for case and punctuation restoration. We participated in all the translation tasks and translated both the single-best ASR hypotheses and the correct recognition results. The translation results confirm that our new word and phrase alignment techniques are often helpful in improving translation quality, and the data combination method we proposed can significantly improve system performance. 1.
Improving Syntax Driven Translation Models by Re-structuring Divergent and Non-isomorphic Parse Tree Structures
"... Syntax-based approaches to statistical MT require syntax-aware methods for acquiring their underlying translation models from parallel data. This acquisition process can be driven by syntactic trees for either the source or target language, or by trees on both sides. Work to date has demonstrated th ..."
Abstract
-
Cited by 1 (0 self)
- Add to MetaCart
Syntax-based approaches to statistical MT require syntax-aware methods for acquiring their underlying translation models from parallel data. This acquisition process can be driven by syntactic trees for either the source or target language, or by trees on both sides. Work to date has demonstrated that using trees for both sides suffers from severe coverage problems. This is primarily due to the highly restrictive space of constituent segmentations that the trees on two sides introduce, which adversely affects the recall of the resulting translation models. Approaches that project from trees on one side, on the other hand, have higher levels of recall, but suffer from lower precision, due to the lack of syntactically-aware word alignments. In this paper we explore the issue of lexical coverage of the translation models learned in both of these scenarios. We specifically look at how the non-isomorphic nature of the parse trees for the two languages affects recall and coverage. We then propose a novel technique for restructuring target parse trees, that generates highly isomorphic target trees that preserve the syntactic boundaries of constituents that were aligned in the original parse trees. We evaluate the translation models learned from these restructured trees and show that they are significantly better than those learned using trees on both sides and trees on one side. 1

