Results 1 - 10
of
49
Europarl: A Parallel Corpus for Statistical Machine Translation
"... We collected a corpus of parallel text in 11 languages from the proceedings of the European Parliament, which are published on the web 1. This corpus has found widespread use in the NLP community. Here, we focus on its acquisition and its application as training data for statistical machine translat ..."
Abstract
-
Cited by 158 (0 self)
- Add to MetaCart
We collected a corpus of parallel text in 11 languages from the proceedings of the European Parliament, which are published on the web 1. This corpus has found widespread use in the NLP community. Here, we focus on its acquisition and its application as training data for statistical machine translation (SMT). We trained SMT systems for 110 language pairs, which reveal interesting clues into the challenges ahead.
Improving machine translation performance by exploiting non-parallel corpora
- Computational Linguistics
, 2005
"... We present a novel method for discovering parallel sentences in comparable, non-parallel corpora. We train a maximum entropy classifier that, given a pair of sentences, can reliably determine whether or not they are translations of each other. Using this approach, we extract parallel data from large ..."
Abstract
-
Cited by 56 (2 self)
- Add to MetaCart
We present a novel method for discovering parallel sentences in comparable, non-parallel corpora. We train a maximum entropy classifier that, given a pair of sentences, can reliably determine whether or not they are translations of each other. Using this approach, we extract parallel data from large Chinese, Arabic, and English non-parallel newspaper corpora. We evaluate the quality of the extracted data by showing that it improves the performance of a state-of-the-art statistical machine translation system. We also show that a good-quality MT system can be built from scratch by starting with a very small parallel corpus (100,000 words) and exploiting a large non-parallel corpus. Thus, our method can be applied with great benefit to language pairs for which only scarce resources are available. 1.
Inducing Multilingual Text Analysis Tools via Robust Projection across Aligned Corpora
, 2000
"... This paper describes a system and set of algorithms for automatically inducing stand-alone monolingual part-of-speech taggers, base noun-phrase bracketers, named-entity taggers and morphological analyzers for an arbitrary foreign language. Case studies include French, Chinese, Czech and Spanish. Exi ..."
Abstract
-
Cited by 34 (1 self)
- Add to MetaCart
This paper describes a system and set of algorithms for automatically inducing stand-alone monolingual part-of-speech taggers, base noun-phrase bracketers, named-entity taggers and morphological analyzers for an arbitrary foreign language. Case studies include French, Chinese, Czech and Spanish. Existing text analysis tools for English are applied to bilingual text corpora and their output projected onto the second language via statistically derived word alignments. Simple direct annotation projection is quite noisy, however, even with optimal alignments. Thus this paper presents noise-robust tagger, bracketer and lemmatizer training procedures capable of accurate system bootstrapping from noisy and incomplete initial projections. Performance of the induced stand-alone part-of-speech tagger applied to French achieves 96% core part-of-speech (POS) tag accuracy, and the corresponding induced noun-phrase bracketer exceeds 91% F-measure. The induced morphological analyzer achieves over 99% lemmatization accuracy on the complete French verbal system. This achievement is particularly noteworthy in that it required absolutely no hand-annotated training data in the given language, and virtually no language-specific knowledge or resources beyond raw text. Performance also significantly exceeds that obtained by direct annotation projection. Keywords multilingual, text analysis, part-of-speech tagging, noun phrase bracketing, named entity, morphology, lemmatization, parallel corpora 1. TASK OVERVIEW A fundamental roadblock to developing statistical taggers, bracketers and other analyzers for many of the world's 200 major languages is the shortage or absence of annotated training data for the large majority of these languages. Ideally, one would like to lever- . [ ] [ ] IN N...
Sentence Alignment for Monolingual Comparable Corpora
- In Proceedings of the 2003 conference on Empirical methods in natural language processing
, 2003
"... We address the problem of sentence alignment for monolingual corpora, a phenomenon distinct from alignment in parallel corpora. Aligning large comparable corpora automatically would provide a valuable resource for learning of text-totext rewriting rules. We incorporate context into the search for an ..."
Abstract
-
Cited by 24 (2 self)
- Add to MetaCart
We address the problem of sentence alignment for monolingual corpora, a phenomenon distinct from alignment in parallel corpora. Aligning large comparable corpora automatically would provide a valuable resource for learning of text-totext rewriting rules. We incorporate context into the search for an optimal alignment in two complementary ways: learning rules for matching paragraphs using topic structure and further refining the matching through local alignment to find good sentence pairs. Evaluation shows that our alignment method outperforms state-of-the-art systems developed for the same task.
Towards a Unified Approach to Memory- and Statistical-Based Machine Translation
, 2001
"... We present a set of algorithms that enable us to translate natural language sentences by exploiting both a translation memory and a statistical-based translation model. Our results show that an automatically derived translation memory can be used within a statistical framework to often find t ..."
Abstract
-
Cited by 15 (0 self)
- Add to MetaCart
We present a set of algorithms that enable us to translate natural language sentences by exploiting both a translation memory and a statistical-based translation model. Our results show that an automatically derived translation memory can be used within a statistical framework to often find translations of higher probability than those found using solely a statistical model.
Identifying Cognates by Phonetic and Semantic Similarity
, 2001
"... I present a method of identifying cognates in the vocabularies of related languages. I show that a measure of phonetic similarity based on multivalued features performs better than "orthographic" measures, such as the Longest Common Subsequence Ratio (LCSR) or Dice's coefficient. I introduce a proce ..."
Abstract
-
Cited by 12 (4 self)
- Add to MetaCart
I present a method of identifying cognates in the vocabularies of related languages. I show that a measure of phonetic similarity based on multivalued features performs better than "orthographic" measures, such as the Longest Common Subsequence Ratio (LCSR) or Dice's coefficient. I introduce a procedure for estimating semantic similarity of glosses that employs keyword selection and WordNet. Tests performed on vocabularies of four Algonquian languages indicate that the method is capable of discovering on average nearly 75% percent of cognates at 50% precision.
Identification of Confusable Drug Names: A New Approach and Evaluation Methodology
- In Proceedings of COLING 2004
, 2004
"... This paper addresses the mitigation of medical errors due to the confusion of sound-alike and look-alike drug names. Our approach involves application of two new methods--- one based on orthographic similarity ("lookalike ") and the other based on phonetic similarity ("sound-alike"). We presen ..."
Abstract
-
Cited by 11 (1 self)
- Add to MetaCart
This paper addresses the mitigation of medical errors due to the confusion of sound-alike and look-alike drug names. Our approach involves application of two new methods--- one based on orthographic similarity ("lookalike ") and the other based on phonetic similarity ("sound-alike"). We present a new recall-based evaluation methodology for determining the effectiveness of different similarity measures on drug names. We show that the new orthographic measure (BI-SIM) outperforms other commonly used measures of similarity on a set containing both look-alike and sound-alike pairs, and that the feature-based phonetic approach (ALINE) outperforms orthographic approaches on a test set containing solely sound-alike confusion pairs. However, an approach that combines several different measures achieves the best results on both test sets.
Improved sentence alignment for movie subtitles
- In Proceedings of RANLP, Borovets
, 2007
"... Sentence alignment is an essential step in building a parallel corpus. In this paper a specialized approach for the alignment of movie subtitles based on time overlaps is introduced. It is used for creating an extensive multilingual parallel subtitle corpus currently containing about 21 million alig ..."
Abstract
-
Cited by 9 (2 self)
- Add to MetaCart
Sentence alignment is an essential step in building a parallel corpus. In this paper a specialized approach for the alignment of movie subtitles based on time overlaps is introduced. It is used for creating an extensive multilingual parallel subtitle corpus currently containing about 21 million aligned sentence fragments in 29 languages. Our alignment approach yields significantly higher accuracies compared to standard length-based approaches on this data. Furthermore, we can show that simple heuristics for subtitle synchronization can be used to improve the alignment accuracy even further.
Champollion: A Robust Parallel Text Sentence Aligner
- In Proceedings of LREC-2006
, 2006
"... This paper describes Champollion, a lexicon-based sentence aligner designed for robust alignment of potential noisy parallel text. Champollion increases the robustness of the alignment by assigning greater weights to less frequent translated words. Experiments on a manually aligned Chinese – English ..."
Abstract
-
Cited by 9 (2 self)
- Add to MetaCart
This paper describes Champollion, a lexicon-based sentence aligner designed for robust alignment of potential noisy parallel text. Champollion increases the robustness of the alignment by assigning greater weights to less frequent translated words. Experiments on a manually aligned Chinese – English parallel corpus show that Champollion achieves high precision and recall on noisy data. Champollion can be easily ported to new language pairs. It’s freely available to the public. 1.
METER: MEasuring TExt Reuse
"... In this paper we present results from the METER (MEasuring TExt Reuse) project whose aim is to explore issues pertaining to text reuse and derivation, especially in the context of newspapers using newswire sources. Although the reuse of text by journalists has been studied in linguistics, we are not ..."
Abstract
-
Cited by 8 (1 self)
- Add to MetaCart
In this paper we present results from the METER (MEasuring TExt Reuse) project whose aim is to explore issues pertaining to text reuse and derivation, especially in the context of newspapers using newswire sources. Although the reuse of text by journalists has been studied in linguistics, we are not aware of any investigation using existing computational methods for this particular task. We investigate the classification of newspaper articles according to their degree of dependence upon, or derivation from, a newswire source using a simple 3-level scheme designed by journalists.

