MATREX: the DCU MT System for WMT 2008
Cited by 4 (3 self)
In this paper, we give a description of the machine translation system developed at DCU that was used for our participation in the evaluation campaign of the Third Workshop on Statistical Machine Translation at ACL 2008. We describe the modular design of our data-driven MT system with particular focus on the components used in this participation. We also describe some of the significant modules which were unused in this task. We participated in the EuroParl task for the following translation directions: Spanish–English and French–English, in which we employed our hybrid EBMT-SMT architecture to translate. We also participated in the Czech–English News and News Commentary tasks which represented a previously untested language pair for our system. We report results on the provided development and test sets.
Using a maximum entropy model to build segmentation lattices for MT
- In NAACL
Cited by 3 (0 self)
Recent work has shown that translating segmentation lattices (lattices that encode alternative ways of breaking the input to an MT system into words), rather than text in any particular segmentation, improves translation quality of languages whose orthography does not mark morpheme boundaries. However, much of this work has relied on multiple segmenters that perform differently on the same input to generate sufficiently diverse source segmentation lattices. In this work, we describe a maximum entropy model of compound word splitting that relies on a few general features that can be used to generate segmentation lattices for most languages with productive compounding. Using a model optimized for German translation, we present results showing significant improvements in translation quality in German-English, Hungarian-English, and Turkish-English translation over state-of-the-art baselines.
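To make the idea of a segmentation lattice concrete, here is a minimal sketch. It does not implement the maximum entropy model from the abstract: a toy vocabulary lookup stands in for the learned splitter, and the names `segmentation_lattice`, `paths`, and the example vocabulary are hypothetical. Each edge `(start, end, piece)` covers a character span of the input word, and every path from position 0 to the end of the word is one alternative segmentation.

```python
def segmentation_lattice(word, vocab, min_len=3):
    """Build lattice edges (start, end, piece) over character positions.

    ASSUMPTION: a toy vocabulary lookup replaces the learned splitting
    model; the unsplit word is always kept as one edge.
    """
    n = len(word)
    edges = []
    for i in range(n):
        for j in range(i + min_len, n + 1):
            piece = word[i:j]
            if piece in vocab or (i == 0 and j == n):
                edges.append((i, j, piece))

    # Forward pass: positions reachable from the start of the word.
    reach_fwd = {0}
    for i, j, _ in sorted(edges):
        if i in reach_fwd:
            reach_fwd.add(j)

    # Backward pass: positions from which the end of the word is reachable.
    reach_bwd = {n}
    for i, j, _ in sorted(edges, key=lambda e: -e[1]):
        if j in reach_bwd:
            reach_bwd.add(i)

    # Keep only edges that lie on a complete path through the word.
    return [e for e in edges if e[0] in reach_fwd and e[1] in reach_bwd]


def paths(edges, start, end):
    """Enumerate every segmentation encoded by the lattice."""
    if start == end:
        return [[]]
    out = []
    for i, j, piece in edges:
        if i == start:
            for rest in paths(edges, j, end):
                out.append([piece] + rest)
    return out


word = "tonbandaufnahme"
toy_vocab = {"ton", "band", "tonband", "aufnahme"}  # hypothetical vocabulary
lattice = segmentation_lattice(word, toy_vocab)
segmentations = paths(lattice, 0, len(word))
```

A decoder that accepts lattice input would translate all of these paths jointly instead of committing to a single split up front.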
Web-Based Machine Translation
, 2003
Cited by 2 (1 self)
This chapter has two main aims: (i) to present the state-of-the-art in Machine Translation (MT), namely Phrase-Based Statistical MT, together with the major competing paradigms used in MT research and development today; and (ii) to provide an overview of the MT research carried out by my team here at DCU, characterised here in terms of ‘hybrid MT’. In addition, we provide our views on the directions that MT research might take in the near future, and conclude the chapter with lists of further reading for the interested reader.
Bilingually Motivated Domain-Adapted Word Segmentation for Statistical Machine Translation
Cited by 1 (1 self)
We introduce a word segmentation approach to languages where word boundaries are not orthographically marked, with application to Phrase-Based Statistical Machine Translation (PB-SMT). Instead of using manually segmented monolingual domain-specific corpora to train segmenters, we make use of bilingual corpora and statistical word alignment techniques. First of all, our approach is adapted for the specific translation task at hand by taking the corresponding source (target) language into account. Secondly, this approach does not rely on manually segmented training data so that it can be automatically adapted for different domains. We evaluate the performance of our segmentation approach on PB-SMT tasks from two domains and demonstrate that our approach scores consistently among the best results across different data conditions.
Gappy Phrasal Alignment by Agreement
Cited by 1 (0 self)
We propose a principled and efficient phrase-to-phrase alignment model, useful in machine translation as well as other related natural language processing problems. In a hidden semi-Markov model, word-to-phrase and phrase-to-word translations are modeled directly by the system. Agreement between two directional models encourages the selection of parsimonious phrasal alignments, avoiding the overfitting commonly encountered in unsupervised training with multi-word units. Expanding the state space to include “gappy phrases” (such as French ne ⋆ pas) makes the alignment space more symmetric; thus, it allows agreement between discontinuous alignments. The resulting system shows substantial improvements in both alignment quality and translation quality over word-based Hidden Markov Models, while maintaining asymptotically equivalent runtime.
Large-data Statistical Machine Translation with Hadoop
, 2007
Modern statistical machine translation (SMT) is driven by large quantities of aligned bilingual sentence pairs (so-called bitexts), from which translation models are automatically learned. I propose to develop a framework to reduce the effort involved in using extremely large quantities of training data to develop SMT systems. This task decomposes into several sub-problems, which can be addressed independently: generation of word alignments, estimation of a translation model, estimation of a language model, decoding a tuning set with the estimated models (this requires efficient access to models of potentially very large size), optimization of model parameters according to some loss function, and decoding of an evaluation set with the estimated model. Currently, the research community deals with large data in three ways. First, some solutions for efficiently handling large amounts of training data have been developed, for example, in the domain of language model estimation and representation [1,2]. However, since cluster architectures are quite diverse, these solutions, if publicly available at all, tend to be ad-hoc and environment-dependent. Second, and more commonly, non-parallel implementations of SMT model estimators (such as GIZA++ and the Moses training suite) are applied to large data sets, resulting in extremely long experiment run-times, which limits the kinds of experiments that can be run. Finally, many researchers circumvent these problems entirely by using small corpora. The use of small corpora for research is so widespread that many papers draw conclusions from systems trained on orders of magnitude less training data than is actually available (e.g., [3,4]). The first phase of this project will focus on more efficient translation model estimation since this task is particularly well suited for Hadoop and because there currently is no available distributed solution to this problem. Improvements to word alignment will also be investigated.
2. Resources
The basis of this project will be the Moses decoder tool suite
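As a rough illustration of why translation model estimation fits a map/reduce pattern, the sketch below simulates the two phases in memory. It is not Hadoop code and not the proposal's implementation: the function names and the toy phrase pairs are hypothetical, and the model is the plain relative-frequency estimate p(tgt | src) = count(src, tgt) / count(src), assumed here for illustration.

```python
from collections import defaultdict


def map_phrase_pairs(extracted):
    """Map phase: emit ((src, tgt), 1) for every extracted phrase pair."""
    for src, tgt in extracted:
        yield (src, tgt), 1


def reduce_counts(pairs):
    """Reduce phase: sum the emitted values per key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def relative_frequencies(pair_counts):
    """Turn pair counts into p(tgt | src) by normalising over each source phrase."""
    src_totals = defaultdict(int)
    for (src, _), count in pair_counts.items():
        src_totals[src] += count
    return {
        (src, tgt): count / src_totals[src]
        for (src, tgt), count in pair_counts.items()
    }


# Hypothetical phrase pairs, as if extracted from word-aligned sentences.
extracted = [
    ("maison", "house"),
    ("maison", "house"),
    ("maison", "home"),
    ("chat", "cat"),
]
pair_counts = reduce_counts(map_phrase_pairs(extracted))
probs = relative_frequencies(pair_counts)
```

Because the map step treats each phrase pair independently and the reduce step only sums per key, both steps can in principle be spread over many machines.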
MaTrEx: The DCU Machine Translation System for ICON 2008
, 2008
In this paper, we give a description of the machine translation system developed at DCU that was used for our participation in the NLP Tools Contest of the International Conference on Natural Language Processing (ICON 2008). This was our first ever attempt at working on any Indian language. In this participation, we focus on various techniques for word and phrase alignment to improve system quality. For the English–Hindi translation task we exploit source-language reordering. We also carried out experiments combining both in-domain and out-of-domain data to improve the system performance and, as a post-processing step, we transliterate out-of-vocabulary items.
Improving alignment for SMT by reordering and augmenting the training corpus
We describe the LIU systems for English-German and German-English translation in the WMT09 shared task. We focus on two methods to improve the word alignment: (i) by applying Giza++ in a second phase to a reordered training corpus, where reordering is based on the alignments from the first phase, and (ii) by adding lexical data obtained as high-precision alignments from a different word aligner. These methods were studied in the context of a system that uses compound processing, a morphological sequence model for German, and a part-of-speech sequence model for English. Both methods gave some improvements to translation quality as measured by Bleu and Meteor scores, though not consistently. All systems used both out-of-domain and in-domain data as the mixed corpus had better scores in the baseline configuration.
Acquiring Translation Equivalences of Multiword Expressions by Normalized Correlation Frequencies
In this paper, we present an algorithm for extracting translations of any given multiword expression from parallel corpora. Given a multiword expression to be translated, the method involves extracting a short list of target candidate words from parallel corpora based on scores of normalized frequency, generating possible translations and filtering out common subsequences, and selecting the top-n possible translations using the Dice coefficient. Experiments show that our approach outperforms the word alignment-based and other naive association-based methods. We also demonstrate that adopting the extracted translations can significantly improve the performance of the Moses machine translation system.
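The final selection step mentioned in the abstract, ranking candidates by the Dice coefficient, can be sketched as follows. This covers only that one step, under stated assumptions: counts are taken at the sentence-pair level, matching is by whole-token substring, and the function names, the toy bitext, and the candidate list are all hypothetical. The score is Dice(x, y) = 2 · c(x, y) / (c(x) + c(y)), where c(x, y) counts sentence pairs containing both the source expression and the candidate.

```python
def contains(sentence, phrase):
    """Whole-token match: pad with spaces so 'lui ha' does not match 'colui ha'."""
    return f" {phrase} " in f" {sentence} "


def dice(cooc, freq_x, freq_y):
    """Dice coefficient from co-occurrence and marginal counts."""
    if freq_x + freq_y == 0:
        return 0.0
    return 2.0 * cooc / (freq_x + freq_y)


def rank_candidates(bitext, mwe, candidates, n=2):
    """Return the top-n (candidate, score) pairs for a source expression."""
    src_hits = [contains(src, mwe) for src, _ in bitext]
    freq_x = sum(src_hits)
    scored = []
    for cand in candidates:
        tgt_hits = [contains(tgt, cand) for _, tgt in bitext]
        freq_y = sum(tgt_hits)
        cooc = sum(1 for s, t in zip(src_hits, tgt_hits) if s and t)
        scored.append((cand, dice(cooc, freq_x, freq_y)))
    # Sort by descending score; ties keep the original candidate order.
    scored.sort(key=lambda item: -item[1])
    return scored[:n]


# Hypothetical English-Italian sentence pairs.
bitext = [
    ("he kicked the bucket", "lui ha tirato le cuoia"),
    ("she kicked the bucket yesterday", "lei ha tirato le cuoia ieri"),
    ("he kicked the ball", "lui ha calciato la palla"),
]
candidates = ["tirato le cuoia", "calciato la palla", "lui ha"]
top = rank_candidates(bitext, "kicked the bucket", candidates, n=2)
```

In this toy corpus the idiomatic translation co-occurs with the source expression in every sentence pair where either appears, so it receives the maximum score of 1.0, while the frequent but unrelated "lui ha" scores 0.5.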

