Automatic Identification of Word Translations from Unrelated English and German Corpora
, 1999
Cited by 112 (1 self)
Algorithms for the alignment of words in translated texts are well established. However, only recently new approaches have been proposed to identify word translations from non-parallel or even unrelated texts. This task is
A Statistical View on Bilingual Lexicon Extraction: From Parallel Corpora to Non-Parallel Corpora
- Parallel Text Processing
, 1998
Cited by 48 (3 self)
We present two problems for statistically extracting bilingual lexicon: (1) How can noisy parallel corpora be used? (2) How can non-parallel yet comparable corpora be used? We describe our own work and contribution in relaxing the constraint of using only clean parallel corpora. DKvec is a method for extracting bilingual lexicons from noisy parallel corpora based on arrival distances of words in noisy parallel corpora. Using DKvec on noisy parallel corpora in English/Japanese and English/Chinese, our evaluations show a 55.35% precision from a small corpus and 89.93% precision from a larger corpus. Our major contribution is in the extraction of bilingual lexicon from non-parallel corpora. We present a first such result in this area, from a new method, Convec. Convec is based on context information of a word to be translated. We show a 30% to 76% precision when top-one to top-20 translation candidates are considered. Most of the top-20 candidates are either collocations or words rela...
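The abstract above does not define "arrival distances". As a minimal sketch, assuming the term means the gap in token positions between successive occurrences of the same word, the quantity could be computed as follows. The function name `arrival_distances` is hypothetical and this is not the DKvec implementation.

```python
def arrival_distances(tokens, word):
    """Return the gaps between successive occurrences of `word` in `tokens`.

    Assumption: an "arrival distance" is the difference between the positions
    of two consecutive occurrences of the same word.
    """
    # Token positions at which the word occurs, in order.
    positions = [i for i, tok in enumerate(tokens) if tok == word]
    # Difference between each position and the one before it.
    return [later - earlier for earlier, later in zip(positions, positions[1:])]


if __name__ == "__main__":
    toy_text = "a b a c c a".split()
    print(arrival_distances(toy_text, "a"))
```

Under this assumption, a word and its translation in a noisy parallel corpus would be expected to yield similar gap sequences, which is what makes the sequences comparable across languages.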
Noun-Noun Compound Machine Translation: A Feasibility Study on Shallow Processing
, 2003
Cited by 12 (2 self)
The translation of compound nouns is a major issue in machine translation due to their frequency of occurrence and high productivity.
Multilevel bootstrapping for extracting parallel sentences from a quasi-comparable corpus
- In COLING 2004
, 2004
Cited by 9 (0 self)
We propose a completely unsupervised method for mining parallel sentences from quasi-comparable bilingual texts which have very different sizes, and which include both in-topic and off-topic documents. We discuss and analyze different bilingual corpora with various levels of comparability. We propose that while better document matching leads to better parallel sentence extraction, better sentence matching also leads to better document matching. Based on this, we use multi-level bootstrapping to improve the alignments between documents, sentences, and bilingual word pairs, iteratively. Our method is the first method that does not rely on any supervised training data, such as a sentence-aligned corpus, or temporal information, such as the publishing date of a news article. It is validated by experimental results that show a 23% improvement over a method without multilevel bootstrapping.
Looking for Candidate Translational Equivalents in Specialized, Comparable
Cited by 8 (0 self)
Previous attempts at identifying translational equivalents in comparable corpora have dealt with very large 'general language' corpora and words. We address this task in a specialized domain, medicine, starting from smaller non-parallel, comparable corpora and an initial bilingual medical lexicon. We compare the distributional contexts of source and target words, testing several weighting factors and similarity measures. On a test set of frequently occurring words, for the best combination (the Jaccard similarity measure with or without tf.idf weighting), the correct translation is ranked first for 20% of our test words, and is found in the top 10 candidates for 50% of them. An additional reverse-translation filtering step improves the precision of the top candidate translation up to 74%, with a 33% recall.
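The context comparison described in the abstract above can be illustrated with a minimal sketch. It assumes unweighted context sets, a fixed co-occurrence window, and a one-to-one seed lexicon; the names `context_sets`, `jaccard`, and `rank_candidates` and the toy data are hypothetical, and the paper's weighting and reverse-translation filtering are not reproduced.

```python
from collections import defaultdict


def context_sets(tokens, window=2):
    """Map each word to the set of words seen within `window` tokens of it."""
    contexts = defaultdict(set)
    for i, word in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                contexts[word].add(tokens[j])
    return contexts


def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b| of two sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_candidates(source_word, src_ctx, tgt_ctx, seed_lexicon):
    """Rank target words by similarity to the translated source context."""
    # Translate the source context through the seed lexicon; unknown words are dropped.
    translated = {seed_lexicon[w] for w in src_ctx[source_word] if w in seed_lexicon}
    scores = [(jaccard(translated, ctx), cand) for cand, ctx in tgt_ctx.items()]
    # Highest score first; ties are broken alphabetically for determinism.
    return [cand for _, cand in sorted(scores, key=lambda p: (-p[0], p[1]))]


if __name__ == "__main__":
    # Toy, made-up corpora and seed lexicon (not data from the paper).
    src = context_sets("le coeur pompe le sang".split())
    tgt = context_sets("the heart pumps the blood and the lung filters the air".split())
    seed = {"le": "the", "pompe": "pumps", "sang": "blood"}
    print(rank_candidates("coeur", src, tgt, seed)[:3])
```

Sets are used instead of weighted vectors to keep the sketch short; a weighted variant would replace `jaccard` with a measure defined over weight vectors.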
Cross-lingual Information Retrieval using Hidden Markov Models
- IN PROCEEDINGS OF THE 2000 JOINT SIGDAT CONFERENCE
, 2000
Cited by 7 (0 self)
This paper presents empirical results in cross-lingual information retrieval using English queries to access Chinese documents (TREC-5 and TREC-6) and Spanish documents (TREC-4). Since our interest is in languages where resources may be minimal, we use an integrated probabilistic model that requires only a bilingual dictionary as a resource. We explore how a combined probability model of term translation and retrieval can reduce the effect of translation ambiguity. In addition, we estimate an upper bound on performance, if translation ambiguity were a solved problem. We also measure performance as a function of bilingual dictionary size.
Measuring the Similarity between Compound Nouns in Different Languages Using Non-Parallel Corpora
, 2002
Cited by 6 (1 self)
This paper presents a method that measures the similarity between compound nouns in different languages to locate translation equivalents from corpora. The method uses information from unrelated corpora in different languages that do not have to be parallel. This means that many corpora can be used. The method compares the contexts of target compound nouns and translation candidates in the word or semantic attribute level. In this paper, we show how this measuring method can be applied to select the best English translation candidate for Japanese compound nouns in more than 70% of the cases.
Bilingual Parallel Corpora and Language Engineering
- IN PROC. OF WORKSHOP ON LANGUAGE ENGINEERING FOR SOUTH-ASIAN LANGUAGES
, 2001
Mixed Language Query Disambiguation
- IN ACL-99. THE 37TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS
, 1999
Cited by 4 (0 self)
We propose a mixed language query disambiguation approach by using co-occurrence information from monolingual data only. A mixed language query consists of words in a primary language and a secondary language. Our method translates the query into monolingual queries in either language. Two novel features for disambiguation, namely contextual word voting and 1-best contextual word, are introduced and compared to a baseline feature, the nearest neighbor. Average query translation accuracies for the two features are 81.37% and 83.72%, compared to the baseline accuracy of 75.50%.
Extracting loanwords from Mongolian corpora and producing a Japanese-Mongolian bilingual dictionary
- In COLING-ACL
, 2006
Cited by 4 (0 self)
This paper proposes methods for extracting loanwords from Cyrillic Mongolian corpora and producing a Japanese–Mongolian bilingual dictionary. We extract loanwords from Mongolian corpora using our own handcrafted rules. To complement the rule-based extraction, we also extract words in Mongolian corpora that are phonetically similar to Japanese Katakana words as loanwords. In addition, we correspond the extracted loanwords to Japanese words and produce a bilingual dictionary. We propose a stemming method for Mongolian to extract loanwords correctly. We verify the effectiveness of our methods experimentally.

