Results 1 - 9 of 9
Synchronizing Translated Movie Subtitles
Cited by 2 (1 self)
This paper addresses the problem of synchronizing movie subtitles, which is necessary to improve alignment quality when building a parallel corpus out of translated subtitles. In particular, synchronization is done on the basis of aligned anchor points. Previous studies have shown that cognate filters are useful for the identification of such points. However, this restricts the approach to related languages with similar alphabets. Here, we propose a dictionary-based approach using automatic word alignment. We show an improvement in alignment quality over the cognate-based approach, even for related languages.
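As a rough illustration of anchor-point synchronization, the sketch below assumes that two subtitle files differ only by a constant speed ratio and a constant offset, so that two aligned anchor points are enough to fit a linear time mapping. The function names `fit_time_mapping` and `resynchronize` are hypothetical, and the abstract does not say that the paper computes the mapping in exactly this way.

```python
def fit_time_mapping(anchor_a, anchor_b):
    """Fit target_time = ratio * source_time + offset from two anchor points.

    Each anchor is a (source_seconds, target_seconds) pair taken from
    subtitle lines that are assumed to be translations of each other.
    """
    (src1, tgt1), (src2, tgt2) = anchor_a, anchor_b
    if src1 == src2:
        raise ValueError("anchor points must have distinct source times")
    ratio = (tgt2 - tgt1) / (src2 - src1)   # relative playback speed
    offset = tgt1 - ratio * src1            # constant shift in seconds
    return ratio, offset


def resynchronize(times, ratio, offset):
    """Map source time stamps onto the target time line."""
    return [ratio * t + offset for t in times]


# Hypothetical anchors: one near the start and one near the end of the movie.
ratio, offset = fit_time_mapping((10.0, 12.0), (110.0, 116.0))
shifted = resynchronize([10.0, 60.0, 110.0], ratio, offset)
```

Anchors that lie far apart tend to give a more stable estimate of the ratio, because small timing errors at each anchor are divided by a longer interval.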
Extracting Sense-Disambiguated Example Sentences From Parallel Corpora
Cited by 1 (1 self)
Example sentences provide an intuitive means of grasping the meaning of a word, and are frequently used to complement conventional word definitions. When a word has multiple meanings, it is useful to have example sentences for specific senses (and hence definitions) of that word rather than indiscriminately lumping all of them together. In this paper, we investigate to what extent such sense-specific example sentences can be extracted from parallel corpora using lexical knowledge bases for multiple languages as a sense index. We use word sense disambiguation heuristics and a cross-lingual measure of semantic similarity to link example sentences to specific word senses. From the sentences found for a given sense, an algorithm then selects a smaller subset that can be presented to end users, taking into account both representativeness and diversity. Preliminary results show that a precision of around 80 % can be obtained for a reasonable number of word senses, and that the subset selection yields convincing results.
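The abstract does not specify the subset selection algorithm. As one possible reading of "representativeness and diversity", the sketch below uses a greedy selection in the style of maximal marginal relevance. The Jaccard word-overlap similarity, the `trade_off` parameter, and the function names are assumptions made for illustration only, not the authors' method.

```python
def jaccard(a, b):
    """Word-overlap similarity between two sentences (0.0 to 1.0)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def select_examples(sentences, k, trade_off=0.5):
    """Greedily pick up to k sentences that are typical of the pool
    (representativeness) but dissimilar to those already chosen (diversity)."""
    n = len(sentences)
    # Representativeness: average similarity to every sentence in the pool.
    rep = [sum(jaccard(s, t) for t in sentences) / n for s in sentences]
    chosen = []
    while len(chosen) < min(k, n):
        best, best_score = None, None
        for i in range(n):
            if i in chosen:
                continue
            # Redundancy: highest similarity to any sentence already selected.
            redundancy = max(
                (jaccard(sentences[i], sentences[j]) for j in chosen),
                default=0.0,
            )
            score = trade_off * rep[i] - (1 - trade_off) * redundancy
            if best_score is None or score > best_score:
                best, best_score = i, score
        chosen.append(best)
    return [sentences[i] for i in chosen]


pool = [
    "the bank approved the loan",
    "the bank approved the loan today",
    "she sat on the river bank",
]
subset = select_examples(pool, k=2)
```

With this toy pool, the second pick is the river sentence rather than the near-duplicate loan sentence, which is the behaviour the diversity term is meant to produce.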
Linguistics Gesellschaft für Linguistische Datenverarbeitung e. V. (GLDV)
Two issues per year, published semi-annually on 31 May and 31 October. Preprints and editorial plans can be viewed on the GLDV website.
Martin Volk The Automatic Translation of Film Subtitles. A Machine Translation Success Story?
Every so often one hears the complaint that 50 years of research in Machine Translation (MT) has not resulted in much progress, and that current MT systems are still unsatisfactory. A closer look reveals that web-based general-purpose MT systems are used by thousands of users every day. And, on the other hand, special-purpose MT systems have
Prospects and Trends in Data-Driven Machine Translation
In the past decade we have seen an amazing revival of machine translation (MT) as the major field of research in computational linguistics. Many reasons can be mentioned to explain this phenomenon: globalization and the success of the Internet may be one of them, forcing companies and individuals to adapt to a multilingual
Edited by
, 2009
workshop of its kind. Many things have happened since 2005. The last few years have witnessed a decline in example-based machine translation (EBMT) research and statistical machine translation (SMT) has almost completely taken over the corpus-based machine translation arena, with many EBMT practitioners moving into hybrid approaches integrating EBMT with other approaches, mostly (but not only) SMT. Not having a clear definition of what EBMT is has also contributed to this lack of visibility. In fact, research that would have been considered EBMT has been published without the EBMT label. Is the success of SMT due to the fact that it is the best way to do corpus-based machine translation or is it because many SMT software packages are readily available to researchers under free/open-source licences that allow use as well as collaborative improvement? Shouldn’t EBMT practitioners start to think about putting together their tools, their engines and their data and releasing them under open licenses to extend their use both in academia and industry? The pressure on machine translation researchers to prove their results through detailed empirical evaluation is growing. But the validity of empirical results hinges on reproducibility. Turning our experimental research into packages and tools that other researchers can use and
Iterative Sentence–Pair Extraction from Quasi–Parallel Corpora for Machine Translation
This paper addresses parallel data extraction from the quasi–parallel corpora generated in a crowd-sourcing project where ordinary people watch TV shows and movies and transcribe/translate what they hear, creating document pools in different languages. Since they do not have guidelines for naming and performing translations, it is often not clear which documents are translations of the same show/movie and which sentences are translations of each other in a given document pair. We introduce a method for automatically pairing documents in two languages and extracting parallel sentences from the paired documents. The method consists of three steps: i) document pairing, ii) sentence pair alignment of the paired documents, and iii) context extrapolation to boost the sentence pair coverage. Human evaluation of the extracted data shows that 95 % of the extracted sentences carry useful information for translation. Experimental results also show that using the extracted data provides significant gains over the baseline statistical machine translation system built with manually annotated data.
Index Terms: data extraction, comparable data, machine translation
unknown title
We describe the preparation of parallel corpora based on professional quality subtitles in seven European language pairs. The main focus is the effect of the processing steps on the size and quality of the final corpora.

