Results 1 - 8 of 8
Transductive learning for statistical machine translation
- In Proc. of ACL
, 2007
Cited by 12 (3 self)
Statistical machine translation systems are usually trained on large amounts of bilingual text and monolingual text in the target language. In this paper we explore the use of transductive semi-supervised methods for the effective use of monolingual data from the source language in order to improve translation quality. We propose several algorithms with this aim, and present the strengths and weaknesses of each one. We present detailed experimental evaluations on the French–English EuroParl data set and on data from the NIST Chinese–English large-data track. We show a significant improvement in translation quality on both tasks.
Parallel LFG grammars on parallel corpora: A base for practical triangulation
- In
, 2008
Cited by 4 (1 self)
This paper presents an approach to annotation projection in a multi-parallel corpus, that is, a collection of translated texts in more than two languages. Existing analysis tools, like the LFG grammars from the ParGram project, are applied to two of the languages in the corpus and the resulting annotation is projected to a third language, taking advantage of the largely parallel character of f-structure. The third language can be a low-resource language. The technique can thus be particularly beneficial for corpus-based (cross-)linguistic research. We discuss a number of ways to realize automatic corpus annotation based on multi-source projection, including direct projection and approaches with an additional generalization step that employs machine learning techniques. We present a series of detailed experiments for a sample annotation task, verb argument identification, using the German and English ParGram grammars for projection to Dutch and maximum entropy models for learning generalizations.
Unsupervised segmentation for statistical machine translation
, 2003
Cited by 3 (0 self)
An unsupervised approach is applied to segment German-English and French-English parallel corpora for statistical machine translation. The approach requires neither language- nor domain-specific knowledge. Segmentation is shown to effectively reduce the number of unknown words and singletons in the corpora, which helps improve the translation model. As a result, word error rates are lowered by 0.37% and 2.15% in the translation of German to English and French to English, respectively. The benefits of segmentation to statistical machine translation are more pronounced when the training data size is small. Acknowledgements: I would like to thank Miles Osborne and Chris Callison-Burch for their guidance, and Pronab Saha for his much-needed moral support.
Active Learning for Multilingual Statistical Machine Translation
Cited by 3 (1 self)
Statistical machine translation (SMT) models require bilingual corpora for training, and these corpora are often multilingual with parallel text in multiple languages simultaneously. We introduce an active learning task of adding a new language to an existing multilingual set of parallel text and constructing high-quality MT systems from each language in the collection into this new target language. We show that adding a new language using active learning to the EuroParl corpus provides a significant improvement compared to a random sentence selection baseline. We also provide new highly effective sentence selection methods that improve AL for phrase-based SMT in the multilingual and single language pair setting.
Family Compliance Office
- Department of Education
, 1974
Cited by 2 (0 self)
The performance of a statistical machine translation system depends on the size of the available task-specific bilingual training corpus. On the other hand, acquisition of a large high-quality bilingual parallel text for the desired domain and language pair requires a lot of time and effort and, for some language pairs, is not even possible. Besides, small corpora have certain advantages, such as low memory and time requirements for training a translation system and the possibility of manual correction or even manual creation. Therefore, investigation of statistical machine translation with small amounts of bilingual training data is receiving more and more attention. This paper gives an overview of the state of the art and presents the most recent results of translation systems trained on sparse bilingual data for two language pairs: Spanish-English, already widely explored with a number of (large) bilingual training corpora available, and Serbian-English, a rarely investigated language pair with restricted bilingual resources.
Machine Learning Approaches for Dealing with Limited Bilingual Data in Statistical Machine Translation
Statistical machine translation (SMT) systems have made great strides in translation quality. However, high-quality translation output depends on the availability of massive amounts of parallel text in the source and target language. A large number of languages are considered “low-density”, either because the population speaking the language is not very large or because, even if millions of people speak the language, insufficient online resources are available in it. This tutorial covers machine learning approaches for dealing with such situations in statistical machine translation, where the amount of available bilingual data is limited. A statistical translation system can be improved and/or adapted by incorporating new training data in the form of parallel text. The problem of learning from insufficient labeled training data has been dealt with in the machine learning community under two general frameworks: (i) Semi-supervised Learning, and (ii) Active Learning. The goal of semi-supervised learning is to take advantage of abundant and cheap unlabeled data, together with labeled data, to build a high-quality mapping from examples (the input space) to labels (the output space). On the other hand, the goal of active learning is to reduce the amount of labeled data required to learn a high
Bootstrapping Multilingual Geographical Gazetteers
In this paper an approach to automatically generating multilingual geographical name gazetteers via two bootstrapping loops on different corpora is presented. First, a small seed-list of geographical names is matched to an unannotated dataset in one language, and training data for a memory-based classifier is generated. Memory-based learning is applied to extend the gazetteer. Then a cross-over to a different language is made by matching this extended gazetteer to a corpus in a different language. Again, training data for a classifier is generated and the bootstrapping process is repeated in order to extend the gazetteer further. This process is quite similar to co-training, in which information from other sources is introduced to enhance classification. To estimate the difference between the initial seed-list and the final gazetteer, and thereby to evaluate the performance of the algorithm, both were matched to three datasets with manually annotated geographical entities.
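The two bootstrapping loops described in this abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the function names, the toy corpora, and above all the "classifier", which is only a lookup of previously seen left-context words standing in for the memory-based learner. It is not the authors' implementation.

```python
from collections import Counter


def harvest_contexts(sentences, gazetteer):
    """Match known names to the corpus and store the word preceding each hit.

    The resulting counts act as the 'memory' (a hypothetical stand-in for
    the training data of a memory-based classifier).
    """
    memory = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        for i, token in enumerate(tokens):
            if token in gazetteer and i > 0:
                memory[tokens[i - 1].lower()] += 1
    return memory


def extend_gazetteer(sentences, gazetteer, memory, min_count=1):
    """Add capitalized tokens whose left context was seen before a known name."""
    extended = set(gazetteer)
    for sentence in sentences:
        tokens = sentence.split()
        for i in range(1, len(tokens)):
            previous = tokens[i - 1].lower()
            if tokens[i][0].isupper() and memory[previous] >= min_count:
                extended.add(tokens[i])
    return extended


def bootstrap(sentences, seeds, rounds=2):
    """One bootstrapping loop: alternate context harvesting and extension."""
    gazetteer = set(seeds)
    for _ in range(rounds):
        memory = harvest_contexts(sentences, gazetteer)
        gazetteer = extend_gazetteer(sentences, gazetteer, memory)
    return gazetteer


# Toy corpora (invented for this sketch).
english = ["She lives in Paris", "He works in Berlin", "They met in Vienna"]
german = ["Sie wohnt in Berlin", "Er fliegt nach Paris", "Wir fahren nach Hamburg"]

# Loop 1: extend the seed list on the first language.
after_english = bootstrap(english, {"Paris"})
# Loop 2: cross over by matching the extended gazetteer to the second corpus.
after_german = bootstrap(german, after_english)
```

In this toy run, the seed "Paris" teaches the context "in", which pulls in the other English place names; the names shared with the German corpus then teach the context "nach", which lets the second loop find a name that never appeared in the first corpus.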

