Results 1 - 10
of
27
Learning a Translation Lexicon from Monolingual Corpora
- In Proceedings of ACL Workshop on Unsupervised Lexical Acquisition
, 2002
"... This paper presents work on the task of constructing a word-level translation lexicon purely from unrelated monolingual corpora. We combine various clues such as cognates, similar context, preservation of word similarity, and word frequency. Experimental results for the construction of a German-Engl ..."
Abstract
-
Cited by 33 (0 self)
- Add to MetaCart
This paper presents work on the task of constructing a word-level translation lexicon purely from unrelated monolingual corpora. We combine various clues such as cognates, similar context, preservation of word similarity, and word frequency. Experimental results for the construction of a German-English noun lexicon are reported.
Cognates Can Improve Statistical Translation Models
- In Proceedings of HLT-NAACL 2003 (companion volume
, 2003
"... We report results of experiments aimed at improving the translation quality by incorporating the cognate information into translation models. ..."
Abstract
-
Cited by 13 (2 self)
- Add to MetaCart
We report results of experiments aimed at improving the translation quality by incorporating the cognate information into translation models.
Evaluation of several phonetic similarity algorithms on the task of cognate identification
- In COLING-ACL Workshop on Linguistic Distances
, 2006
"... We investigate the problem of measuring phonetic similarity, focusing on the identification of cognates, words of the same origin in different languages. We compare representatives of two principal approaches to computing phonetic similarity: manually-designed metrics, and learning algorithms. In pa ..."
Abstract
-
Cited by 11 (0 self)
- Add to MetaCart
We investigate the problem of measuring phonetic similarity, focusing on the identification of cognates, words of the same origin in different languages. We compare representatives of two principal approaches to computing phonetic similarity: manually-designed metrics, and learning algorithms. In particular, we consider a stochastic transducer, a Pair HMM, several DBN models, and two constructed schemes. We test those approaches on the task of identifying cognates among Indoeuropean languages, both in the supervised and unsupervised context. Our results suggest that the averaged context DBN model and the Pair HMM achieve the highest accuracy given a large training set of positive examples. 1
Identifying Complex Sound Correspondences in Bilingual Wordlists
- In Proceedings of CICLing 2003: 4th International Conference on Computational Linguistics and Intelligent Text Processing
, 2003
"... The determination of recurrent sound correspondences between languages is crucial for the identi cation of cognates, which are often employed in statistical machine translation for sentence and word alignment. In this paper, an algorithm designed for extracting non-compositional compounds from ..."
Abstract
-
Cited by 6 (2 self)
- Add to MetaCart
The determination of recurrent sound correspondences between languages is crucial for the identi cation of cognates, which are often employed in statistical machine translation for sentence and word alignment. In this paper, an algorithm designed for extracting non-compositional compounds from bitexts is shown to be capable of determining complex sound correspondences in bilingual wordlists. In experimental evaluation, a C++ implementation of the algorithm achieves approximately 90% recall and precision on authentic language data.
Automatic Identification of Cognates, False Friends, and Partial Cognates
, 2006
"... Cognates are words in different languages that have similar spelling and meaning. They can help second-language learners with vocabulary expansion and reading comprehension tasks. Special attention needs to be paid to pairs of words that appear similar but are in fact false friends: they have differ ..."
Abstract
-
Cited by 5 (0 self)
- Add to MetaCart
Cognates are words in different languages that have similar spelling and meaning. They can help second-language learners with vocabulary expansion and reading comprehension tasks. Special attention needs to be paid to pairs of words that appear similar but are in fact false friends: they have different meanings in all contexts. Partial cognates are pairs of words in two languages that have the same meaning in some, but not all, contexts. Detecting the actual meaning of a partial cognate in context can be useful for Machine Translation and Computer-Assisted Language Learning tools. Our research on cognate and false-friend words between two pair of languages (French and English in our case) consists in automatically classifying a pair of words from two languages as cognates or false friends. We use Machine Learning techniques with several measures of orthographic similarity as features for classification. We study the impact of selecting different features, averaging them, and combining them through Machine Learning techniques. The methods work on different pair of languages as long as a small amount of annotated pairs of words is provided as training data. In addition to the work done on cognate and false-friend identification we propose a
Cross-lingual bootstrapping for semantic lexicons
- In Proceedings of the Spring Symposia of the American Association for Artificial Intelligence (AAAI
, 2005
"... This paper considers the problem of unsupervised semantic lexicon acquisition. We introduce a fully automatic approach which exploits parallel corpora, relies on shallow text properties, and is relatively inexpensive. Given the English FrameNet lexicon, our method exploits word alignments to generat ..."
Abstract
-
Cited by 4 (1 self)
- Add to MetaCart
This paper considers the problem of unsupervised semantic lexicon acquisition. We introduce a fully automatic approach which exploits parallel corpora, relies on shallow text properties, and is relatively inexpensive. Given the English FrameNet lexicon, our method exploits word alignments to generate frame candidate lists for new languages, which are subsequently pruned automatically using a small set of linguistically motivated filters. Evaluation shows that our approach can produce high-precision multilingual FrameNet lexicons without recourse to bilingual dictionaries or deep syntactic and semantic analysis.
Determining Recurrent Sound Correspondences by Inducing Translation Models
- In Proceedings of COLING 2002: 19th International Conference on Computational Linguistics
, 2002
"... I present a novel approach to the determination of recurrent sound correspondences in bilingual wordlists. The idea is to relate correspondences between sounds in wordlists to translational equivalences between words in bitexts (bilingual corpora). My method induces models of sound correspondence th ..."
Abstract
-
Cited by 4 (3 self)
- Add to MetaCart
I present a novel approach to the determination of recurrent sound correspondences in bilingual wordlists. The idea is to relate correspondences between sounds in wordlists to translational equivalences between words in bitexts (bilingual corpora). My method induces models of sound correspondence that are similar to models developed for statistical machine translation. The experiments show that the method is able to determine recurrent sound correspondences in bilingual wordlists in which less than 30% of the pairs are cognates. By employing the discovered correspondences, the method can identify cognates with higher accuracy than the previously reported algorithms.
Combining Evidence in Cognate Identification
, 2004
"... Cognates are words of the same origin that belong to distinct languages. The problem of automatic identification of cognates arises in language reconstruction and bitext-related tasks. The evidence of cognation may come from various information sources, such as phonetic similarity, semantic similari ..."
Abstract
-
Cited by 4 (0 self)
- Add to MetaCart
Cognates are words of the same origin that belong to distinct languages. The problem of automatic identification of cognates arises in language reconstruction and bitext-related tasks. The evidence of cognation may come from various information sources, such as phonetic similarity, semantic similarity, and recurrent sound correspondences. I discuss ways of defining the measures of the various types of similarity and propose a method of combining then into an integrated cognate identification program. The new method requires no manual parameter tuning and performs well when tested on the Indoeuropean and Algonquian lexical data.
Improving Word Alignment with Bridge Languages
"... We describe an approach to improve Statistical Machine Translation (SMT) performance using multi-lingual, parallel, sentence-aligned corpora in several bridge languages. Our approach consists of a simple method for utilizing a bridge language to create a word alignment system and a procedure for com ..."
Abstract
-
Cited by 3 (0 self)
- Add to MetaCart
We describe an approach to improve Statistical Machine Translation (SMT) performance using multi-lingual, parallel, sentence-aligned corpora in several bridge languages. Our approach consists of a simple method for utilizing a bridge language to create a word alignment system and a procedure for combining word alignment systems from multiple bridge languages. The final translation is obtained by consensus decoding that combines hypotheses obtained using all bridge language word alignments. We present experiments showing that multilingual, parallel text in Spanish, French, Russian, and Chinese can be utilized in this framework to improve translation performance on an Arabic-to-English task. 1
Improving Translation Lexicon Induction from Monolingual Corpora via Dependency Contexts and Part-of-Speech Equivalences
"... This paper presents novel improvements to the induction of translation lexicons from monolingual corpora using multilingual dependency parses. We introduce a dependency-based context model that incorporates long-range dependencies, variable context sizes, and reordering. It provides a 16 % relative ..."
Abstract
-
Cited by 3 (1 self)
- Add to MetaCart
This paper presents novel improvements to the induction of translation lexicons from monolingual corpora using multilingual dependency parses. We introduce a dependency-based context model that incorporates long-range dependencies, variable context sizes, and reordering. It provides a 16 % relative improvement over the baseline approach that uses a fixed context window of adjacent words. Its Top 10 accuracy for noun translation is higher than that of a statistical translation model trained on a Spanish-English parallel corpus containing 100,000 sentence pairs. We generalize the evaluation to other word-types, and show that the performance can be increased to 18 % relative by preserving part-of-speech equivalencies during translation. 1

