Learning a Translation Lexicon from Monolingual Corpora
- In Proceedings of ACL Workshop on Unsupervised Lexical Acquisition, 2002
Cited by 33 (0 self)
This paper presents work on the task of constructing a word-level translation lexicon purely from unrelated monolingual corpora. We combine various clues such as cognates, similar context, preservation of word similarity, and word frequency. Experimental results for the construction of a German-English noun lexicon are reported.
Base noun phrase translation using web data and the EM algorithm
- In Proceedings of CoLing, 2002
Cited by 31 (4 self)
We consider here the problem of Base Noun Phrase translation. We propose a new method to perform the task. For a given Base NP, we first search its translation candidates from the web. We next determine the possible translation(s) from among the candidates using one of the two methods that we have developed. In one method, we employ an ensemble of Naïve Bayesian Classifiers constructed with the EM Algorithm. In the other method, we use TF-IDF vectors also constructed with the EM Algorithm. Experimental results indicate that the coverage and accuracy of our method are significantly better than those of the baseline methods relying on existing technologies.
Knowledge Sources for Word-Level Translation Models
- In Proceedings of the 2001 Conference on Empirical Methods in Natural Language Processing, 2001
Cited by 26 (2 self)
We present various methods to train word-level translation models for statistical machine translation systems that use widely different knowledge sources ranging from parallel corpora and a bilingual lexicon to only monolingual corpora in two languages. Some novel methods are presented and previously published methods are reviewed. Also, a common evaluation metric enables the first quantitative comparison of these approaches.
A geometric view on bilingual lexicon extraction from comparable corpora
- In Proceedings of ACL-04, 2004
Cited by 12 (3 self)
We present a geometric view on bilingual lexicon extraction from comparable corpora, which allows to re-interpret the methods proposed so far and identify unresolved problems. This motivates three new methods that aim at solving these problems. Empirical evaluation shows the strengths and weaknesses of these methods, as well as a significant gain in the accuracy of extracted lexicons.
Extracting Parallel Sub-Sentential Fragments from Non-Parallel Corpora
- Proceedings of ACL2006, 2006
Cited by 12 (0 self)
We present a novel method for extracting parallel sub-sentential fragments from comparable, non-parallel bilingual corpora. By analyzing potentially similar sentence pairs using a signal processing-inspired approach, we detect which segments of the source sentence are translated into segments in the target sentence, and which are not. This method enables us to extract useful machine translation training data even from very non-parallel corpora, which contain no parallel sentence pairs. We evaluate the quality of the extracted data by showing that it improves the performance of a state-of-the-art statistical machine translation system.
Living off the land: The Web as a source of practice texts for learners of less prevalent languages
- Proceedings of LREC 2002, Third International Conference on Language Resources and Evaluation, Las Palmas: ELRA, 2002
Cited by 7 (1 self)
This study focuses on how to automatically locate text sources published on the World Wide Web in order to produce adequate and up-to-date learning materials for second language learners of Nordic languages. The Web is an excellent source of authentic text materials. However, the large amount of information available on the Web makes search services necessary. Hence, we are developing Squirrel, a prototype Web meta-search service, described in this paper, which collects text material in the Nordic languages according to language, topic and difficulty level. Our primary target group consists of exchange students to Nordic institutions of higher education, and their language teachers, although in the longer perspective, we would also like to be able to do something for minority language communities. We describe the basic implementation of Squirrel, and present preliminary results from trying it out. Finally we discuss the (lack of) Web resources in less prevalent languages, and how we imagine that applications like Squirrel could fit into a second or foreign language learning situation.
Improved Statistical Machine Translation Using Monolingually-Derived Paraphrases
Cited by 6 (1 self)
Untranslated words still constitute a major problem for Statistical Machine Translation (SMT), and current SMT systems are limited by the quantity of parallel training texts. Augmenting the training data with paraphrases generated by pivoting through other languages alleviates this problem, especially for the so-called “low density” languages. But pivoting requires additional parallel texts. We address this problem by deriving paraphrases monolingually, using distributional semantic similarity measures, thus providing access to larger training resources, such as comparable and unrelated monolingual corpora. We present what is to our knowledge the first successful integration of a collocational approach to untranslated words with an end-to-end, state-of-the-art SMT system, demonstrating significant translation improvements in a low-resource setting.
Processing Comparable Corpora With Bilingual Suffix Trees
- In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP-2002), 2002
Cited by 5 (0 self)
We introduce Bilingual Suffix Trees (BST), a data structure that is suitable for exploiting comparable corpora. We discuss algorithms that use BSTs in order to create parallel corpora and learn translations of unseen words from comparable corpora. Starting with a small bilingual dictionary that was derived automatically from a corpus of 5,000 parallel sentences, we have automatically extracted a corpus of 33,926 parallel phrases of size greater than 3, and learned 9 new word translations from a comparable corpus of 1.3M words (100,000 sentences).
Automatic processing of multilingual medical terminology: Applications to thesaurus enrichment and cross-language information retrieval
- Artificial Intelligence in Medicine, 33(2), 2005
Cited by 3 (0 self)
We present in this article experiments on Multi-Language Information Extraction and Access in the medical domain. Methods for extracting bilingual lexicons from parallel and comparable corpora are described and their use in Multi-Language Information Access is illustrated. Our experiments show that these automatically extracted bilingual lexicons are accurate enough for semi-automatically enriching mono- or bilingual thesauri (such as UMLS), and that their use in Cross-language Information Retrieval (CLIR) significantly improves the retrieval performance and clearly outperforms existing bilingual lexicon resources (both general lexicons and specialized ones).
Improved word alignments using the web as a corpus
- In Proceedings of RANLP’07, 2007
Cited by 3 (3 self)
We propose a novel method for improving word alignments in a parallel sentence-aligned bilingual corpus based on the idea that if two words are translations of each other then so should be many words in their local contexts. The idea is formalised using the Web as a corpus, a glossary of known word translations (dynamically augmented from the Web using bootstrapping), the vector space model, linguistically motivated weighted minimum edit distance, competitive linking, and the IBM models. Evaluation results on a Bulgarian-Russian corpus show a sizable improvement both in word alignment and in translation quality.

