Phrase-based Memory-based Machine Translation
This master's thesis investigates a phrase-based approach to Memory-based Machine Translation, a form of automatic translation in which lazy-learning classifiers translate fragments of the input sentence. A parallel corpus serves as the basis for training such a classifier. In the phrase-based approach the principal component of these fragments is a phrase of arbitrary length, in contrast to prior research in the field, in which this component was a single word. A key element of the research is a comparison of three methods of phrase extraction. A new decoder has been developed to deal with the characteristics unique to this approach and to reassemble the translated fragments into one final translation. This research shows that one of the proposed phrase-extraction methods is capable of outperforming previous word-based approaches, even though this gain is limited and the impact
Sudip Kumar Naskar
The Phrase-Based Statistical Machine Translation (PB-SMT) model has recently begun to include source context modeling, under the assumption that the proper lexical choice of an ambiguous word can be determined from the context in which it appears. Various types of lexical and syntactic features such as words, parts-of-speech, and supertags have been explored as effective source context in SMT. In this paper, we show that position-independent syntactic dependency relations of the head of a source phrase can be modeled as useful source context to improve target phrase selection and thereby improve overall performance of PB-SMT. On a Dutch-English translation task, by combining dependency relations and syntactic contextual features (part-of-speech), we achieved a 1.0 BLEU (Papineni et al., 2002) point improvement (3.1% relative) over the baseline.
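The abstract reports the gain both as an absolute figure (1.0 BLEU point) and as a relative one (3.1%). The sketch below only illustrates how the two figures relate. The baseline score is not stated in the abstract, so the value computed here is an inference from the two rounded numbers (about 32.3 BLEU), not a reported result, and the function name is hypothetical.

```python
def implied_baseline(absolute_gain, relative_gain_percent):
    """Back out the baseline score from an absolute and a relative gain.

    relative gain (%) = 100 * absolute gain / baseline, so
    baseline = absolute gain / (relative gain / 100).
    """
    return absolute_gain / (relative_gain_percent / 100.0)


# Figures taken from the abstract; both are rounded, so the result is approximate.
baseline = implied_baseline(1.0, 3.1)
improved = baseline + 1.0

print(f"implied baseline BLEU: {baseline:.1f}")
print(f"implied improved BLEU: {improved:.1f}")
```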
Extending Memory-Based Machine Translation to Phrases Maarten van Gompel
We present a phrase-based extension to memory-based machine translation. This form of example-based machine translation employs lazy-learning classifiers to translate fragments of the source sentence to fragments of the target sentence. Source-side fragments consist of variable-length phrases in a local context of neighboring words, translated by the classifier to a target-language phrase. We compare three methods of phrase extraction, and present a new decoder that reassembles the translated fragments into one final translation. Results show that one of the proposed phrase-extraction methods, the one used in Moses, leads to a translation system that outperforms context-sensitive word-based approaches. The differences, however, are small, arguably because the word-based approaches already capture phrasal context implicitly due to their source-side and target-side context sensitivity.
XML Schemas for Parallel Corpora
Parallel corpora are resources used in Natural Language Processing and Computational Linguistics. They are defined as a set of texts, in different languages, that are translations of each other. Note that these translations do not need to cover the full document, as some sentences may be translated into only some of the languages. For sharing resources, recent years have favored XML formats, and sharing parallel corpora is no different. Across the projects on the web that release parallel corpora for download, we can find at least three different formats. In fact, this abundance of formats has led some projects to adopt all three. This article discusses these three main formats: XML Corpus Encoding Standard, Translation Memory Exchange format and the Text Encoding Initiative. We will compare their formal definition and their XML
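As an illustration of one of the three formats named above, the following minimal sketch reads translation units from a small Translation Memory Exchange (TMX) document using Python's standard library. The sample document, its header values, and the function name are assumptions made for the example and do not come from the article. The second unit has no Portuguese variant, which mirrors the remark that a sentence may be translated into only some of the languages.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical sample data: two translation units, the second only in English.
SAMPLE_TMX = """<tmx version="1.4">
  <header creationtool="example" creationtoolversion="0.1" segtype="sentence"
          o-tmf="none" adminlang="en" srclang="en" datatype="plaintext"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Good morning.</seg></tuv>
      <tuv xml:lang="pt"><seg>Bom dia.</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="en"><seg>Thank you.</seg></tuv>
    </tu>
  </body>
</tmx>"""


def read_translation_units(tmx_text):
    """Return one dict per <tu>, mapping language code to segment text."""
    root = ET.fromstring(tmx_text)
    units = []
    for tu in root.iter("tu"):
        segments = {}
        for tuv in tu.findall("tuv"):
            lang = tuv.get(XML_LANG)
            seg = tuv.find("seg")
            if lang is not None and seg is not None:
                # itertext() also collects text nested inside inline markup.
                segments[lang] = "".join(seg.itertext())
        units.append(segments)
    return units


units = read_translation_units(SAMPLE_TMX)
for unit in units:
    print(unit)
```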
An Italian to Catalan RBMT system reusing data from existing language pairs
This paper presents an Italian→Catalan RBMT system automatically built by combining the linguistic data of the existing pairs Spanish–Catalan and Spanish–Italian. A lightweight manual postprocessing is carried out in order to fix inconsistencies in the automatically derived dictionaries and to add very frequent words that are missing according to a corpus analysis. The system is evaluated on the KDE4 corpus and outperforms Google Translate by approximately ten absolute points in terms of both TER and GTM.
Creating a reusable English – Afrikaans parallel corpora for bilingual dictionary construction
This paper investigates the possibilities of creating a bilingual English – Afrikaans dictionary by building a parallel corpus and using the Uplug tool to process it. The resulting parallel corpus, with approximately 400,000 words per language, was created partly from texts collected from the South African government and partly from the OPUS corpus. The recall and accuracy of the bilingual dictionary were evaluated based on the statistical data collected. Samples of translations were generated, compiled as questionnaires and then assessed by English – Afrikaans speaking respondents. The results yielded an accuracy of 87.2 percent and a recall of 67.3 percent for the processed dictionary. Our English – Afrikaans parallel corpora can be found at the following address:

