MACHINE TRANSLATION BY PATTERN MATCHING
2008
Cited by 1 (0 self)
The best systems for machine translation of natural language are based on statistical models learned from data. Conventional representation of a statistical translation model requires substantial offline computation and representation in main memory. Therefore, the principal bottlenecks to the amount of data we can exploit and the complexity of models we can use are available memory and CPU time, and the current state of the art already pushes these limits. With data size and model complexity continually increasing, a scalable solution to this problem is central to future improvement. Callison-Burch et al. (2005) and Zhang and Vogel (2005) proposed a solution that we call translation by pattern matching, which we bring to fruition in this dissertation. The training data itself serves as a proxy for the model; rules and parameters are computed on demand. This achieves our desiderata of minimal offline computation and compact representation, but depends on fast pattern matching algorithms on text. They demonstrated its application to a common model based on the translation of contiguous substrings, but left some open problems. Among these is a question: can this approach match the performance of conventional methods despite unavoidable differences that it induces in the model? We show how to answer this question affirmatively. The main
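The on-demand lookup described in this abstract can be illustrated with a minimal sketch. It assumes a suffix array over a tokenized source-side corpus as the pattern matching index; the abstract itself only says that fast pattern matching on text is required. The function names and the toy corpus are hypothetical, and this is not the dissertation's implementation.

```python
def build_suffix_array(tokens):
    """Return corpus positions sorted by the token suffix starting at each one.

    Naive construction (sorts full suffixes); adequate only for a toy corpus.
    """
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])


def find_occurrences(tokens, suffix_array, phrase):
    """Return all start positions of `phrase` in `tokens`, in corpus order.

    Two binary searches locate the contiguous block of suffixes whose
    first len(phrase) tokens equal the phrase.
    """
    m = len(phrase)

    def prefix(k):
        start = suffix_array[k]
        return tokens[start:start + m]

    # Lower bound: first suffix whose m-token prefix is >= phrase.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < phrase:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # Upper bound: first suffix whose m-token prefix is > phrase.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= phrase:
            lo = mid + 1
        else:
            hi = mid

    return sorted(suffix_array[first:lo])


# Hypothetical toy corpus, for illustration only.
corpus = "the cat sat on the mat the cat ran".split()
index = build_suffix_array(corpus)
matches = find_occurrences(corpus, index, ["the", "cat"])
```

In a full system, each matched position would then be mapped through the word alignment to extract target-side phrases and estimate their parameters at query time, instead of reading them from a precomputed phrase table.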
Two Tools for Creating and Visualizing Sub-sentential Alignments of Parallel Text
We present two web-based, interactive tools for creating and visualizing sub-sentential alignments of parallel text. Yawat is a tool to support distributed, manual word- and phrase-alignment of parallel text through an intuitive, web-based interface. Kwipc is an interface for displaying words or bilingual word pairs in parallel, word-aligned context. A key element of the tools presented here is the interactive visualization: alignment information is shown only for one pair of aligned words or phrases at a time. This allows users to explore the alignment space interactively without being overwhelmed by the amount of information available.
Machine Learning Approaches for Dealing with Limited Bilingual Data in Statistical Machine Translation
Statistical machine translation (SMT) systems have made great strides in translation quality. However, high-quality translation output depends on the availability of massive amounts of parallel text in the source and target languages. A large number of languages are considered “low-density”, either because the population speaking the language is not very large or because, even though millions of people speak the language, insufficient online resources are available in it. This tutorial covers machine learning approaches for dealing with such situations in statistical machine translation, where the amount of available bilingual data is limited. A statistical translation system can be improved and/or adapted by incorporating new training data in the form of parallel text. The problem of learning from insufficient labeled training data has been dealt with in the machine learning community under two general frameworks: (i) Semi-supervised Learning, and (ii) Active Learning. The goal of semi-supervised learning is to take advantage of abundant and cheap unlabeled data, together with labeled data, to build a high-quality mapping from examples (the input space) to labels (the output space). On the other hand, the goal of active learning is to reduce the amount of labeled data required to learn a high
Combination of Statistical Word Alignments Based on Multiple Preprocessing Schemes
We present an approach to using multiple preprocessing schemes to improve statistical word alignments. We show a relative reduction of alignment error rate of about 38%.
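For readers unfamiliar with the metric, the following is a minimal sketch of how alignment error rate (AER) and a relative reduction are commonly computed. It assumes the standard definition over sure (S) and possible (P) gold links, AER = 1 - (|A∩S| + |A∩P|) / (|A| + |S|). The function names and all numbers are hypothetical illustrations, not results or code from the paper.

```python
def alignment_error_rate(hypothesis, sure, possible):
    """AER for one set of predicted links against sure/possible gold links.

    Each link is a (source_index, target_index) pair. Sure links are
    treated as a subset of the possible links.
    """
    a = set(hypothesis)
    s = set(sure)
    p = set(possible) | s
    if not a and not s:
        return 0.0  # nothing predicted and nothing required
    return 1.0 - (len(a & s) + len(a & p)) / (len(a) + len(s))


def relative_reduction(baseline, improved):
    """Fraction by which `improved` lowers the `baseline` error rate."""
    return (baseline - improved) / baseline


# Hypothetical toy alignment: two correct sure links, one wrong link.
aer = alignment_error_rate(
    hypothesis={(0, 0), (1, 1), (2, 3)},
    sure={(0, 0), (1, 1)},
    possible={(2, 2)},
)

# Hypothetical error rates chosen only to illustrate the arithmetic.
reduction = relative_reduction(baseline=0.20, improved=0.124)
```

A relative reduction is measured against the baseline error, so a change from 0.20 to 0.124 is an absolute drop of 0.076 but a relative reduction of 38%.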
Semi-Supervised Block ITG Models for Word Alignment
Labeled training data for the word alignment task, in the form of word-aligned sentence pairs, is hard to come by for many language pairs. Hence, it is natural to draw upon semi-supervised learning methods
An Unsupervised Alignment Model for Sequence Labeling: Application to Name Transliteration
In this paper, a new sequence alignment model is proposed for name transliteration systems. In addition, several new features are introduced to enhance the overall accuracy of a name transliteration system. Discriminative methods are used to train the model. Using this model, we achieve improvements in transliteration accuracy over state-of-the-art alignment models. The 1-best name accuracy is also improved using a method that selects a name from the 10-best list based on the contents of the web. This method leads to a relative improvement of 54% over 1-best transliteration. The experiments are conducted on an English-Persian name transliteration task. Furthermore, we reproduce the results of past studies under the same conditions. Experiments conducted on English-to-Persian transliteration show that the new features provide a relative improvement of 5% over previously published results.
Alignment Models and Algorithms for Statistical Machine Translation
2010
"... This degree is submitted to the University of Cambridge ..."
Statistical Alignment Models for . . .
2007
The ever-increasing amount of parallel data opens a rich resource to multilingual natural language processing, enabling models to work on various translational aspects like detailed human annotations, syntax and semantics. With efficient statistical models, many cross-language applications have seen significant progress in recent years, such as statistical machine translation, speech-to-speech translation, cross-lingual information retrieval and bilingual lexicography. However, the current state-of-the-art statistical translation models rely heavily on word-level mixture models, a bottleneck that fails to represent the rich varieties and dependencies in translations. In contrast to word-based translation, phrase-based models are more robust in capturing various translation phenomena (e.g., local word reordering) than word-level models, and less susceptible to errors from preprocessing such as word segmentation and tokenization. Leveraging phrase-level knowledge in translation models is challenging yet rewarding: it also brings significant improvements in translation quality. Above the phrase-level are

