Finding terminology translations from non-parallel corpora (1997)

by Pascale Fung, Kathleen McKeown

Results 1 - 10 of 25

Automatic Identification of Word Translations from Unrelated English and German Corpora

by Reinhard Rapp, 1999
Cited by 112 (1 self)
Algorithms for the alignment of words in translated texts are well established. However, only recently new approaches have been proposed to identify word translations from non-parallel or even unrelated texts. This task is

A Statistical View on Bilingual Lexicon Extraction: From Parallel Corpora to Non-Parallel Corpora

by Pascale Fung - Parallel Text Processing, 1998
Cited by 48 (3 self)
We present two problems for statistically extracting bilingual lexicons: (1) How can noisy parallel corpora be used? (2) How can non-parallel yet comparable corpora be used? We describe our own work and contribution in relaxing the constraint of using only clean parallel corpora. DKvec is a method for extracting bilingual lexicons from noisy parallel corpora based on the arrival distances of words. Using DKvec on noisy parallel corpora in English/Japanese and English/Chinese, our evaluations show a 55.35% precision from a small corpus and 89.93% precision from a larger corpus. Our major contribution is in the extraction of bilingual lexicons from non-parallel corpora. We present a first such result in this area, from a new method, Convec. Convec is based on context information of a word to be translated. We show a 30% to 76% precision when top-one to top-20 translation candidates are considered. Most of the top-20 candidates are either collocations or words rela...

Noun-Noun Compound Machine Translation: A Feasibility Study on Shallow Processing

by Takaaki Tanaka, Timothy Baldwin, 2003
Cited by 12 (2 self)
The translation of compound nouns is a major issue in machine translation due to their frequency of occurrence and high productivity.

Multilevel bootstrapping for extracting parallel sentences from a quasi-comparable corpus

by Pascale Fung, Percy Cheung - In COLING 2004, 2004
Cited by 9 (0 self)
We propose a completely unsupervised method for mining parallel sentences from quasi-comparable bilingual texts which have very different sizes, and which include both in-topic and off-topic documents. We discuss and analyze different bilingual corpora with various levels of comparability. We propose that while better document matching leads to better parallel sentence extraction, better sentence matching also leads to better document matching. Based on this, we use multi-level bootstrapping to improve the alignments between documents, sentences, and bilingual word pairs, iteratively. Our method is the first method that does not rely on any supervised training data, such as a sentence-aligned corpus, or temporal information, such as the publishing date of a news article. It is validated by experimental results that show a 23% improvement over a method without multilevel bootstrapping.

Looking for Candidate Translational Equivalents in Specialized, Comparable Corpora

by Yun-Chuang Chiao, Pierre Zweigenbaum (Département de Biomathématiques, Université Paris)
Cited by 8 (0 self)
Previous attempts at identifying translational equivalents in comparable corpora have dealt with very large 'general language' corpora and words. We address this task in a specialized domain, medicine, starting from smaller non-parallel, comparable corpora and an initial bilingual medical lexicon. We compare the distributional contexts of source and target words, testing several weighting factors and similarity measures. On a test set of frequently occurring words, for the best combination (the Jaccard similarity measure with or without tf-idf weighting), the correct translation is ranked first for 20% of our test words, and is found in the top 10 candidates for 50% of them. An additional reverse-translation filtering step improves the precision of the top candidate translation up to 74%, with a 33% recall.

Cross-lingual Information Retrieval using Hidden Markov Models

by Jinxi Xu, Ralph Weischedel - In Proceedings of the 2000 Joint SIGDAT Conference, 2000
Cited by 7 (0 self)
This paper presents empirical results in cross-lingual information retrieval using English queries to access Chinese documents (TREC-5 and TREC-6) and Spanish documents (TREC-4). Since our interest is in languages where resources may be minimal, we use an integrated probabilistic model that requires only a bilingual dictionary as a resource. We explore how a combined probability model of term translation and retrieval can reduce the effect of translation ambiguity. In addition, we estimate an upper bound on performance, if translation ambiguity were a solved problem. We also measure performance as a function of bilingual dictionary size.

Measuring the Similarity between Compound Nouns in Different Languages Using Non-Parallel Corpora

by Takaaki Tanaka, 2002
Cited by 6 (1 self)
This paper presents a method that measures the similarity between compound nouns in different languages to locate translation equivalents from corpora. The method uses information from unrelated corpora in different languages that do not have to be parallel. This means that many corpora can be used. The method compares the contexts of target compound nouns and translation candidates at the word or semantic attribute level. In this paper, we show how this measuring method can be applied to select the best English translation candidate for Japanese compound nouns in more than 70% of the cases.

Bilingual Parallel Corpora and Language Engineering

by Harold Somers - In Proc. of Workshop on Language Engineering for South-Asian Languages, 2001
Cited by 5 (0 self)
Abstract not found

Mixed Language Query Disambiguation

by Pascale Fung, Xiaohu Liu, Chi Shun Cheung - In ACL-99: The 37th Annual Meeting of the Association for Computational Linguistics, 1999
Cited by 4 (0 self)
We propose a mixed language query disambiguation approach by using co-occurrence information from monolingual data only. A mixed language query consists of words in a primary language and a secondary language. Our method translates the query into monolingual queries in either language. Two novel features for disambiguation, namely contextual word voting and 1-best contextual word, are introduced and compared to a baseline feature, the nearest neighbor. Average query translation accuracies for the two features are 81.37% and 83.72%, compared to the baseline accuracy of 75.50%.

Extracting loanwords from Mongolian corpora and producing a Japanese-Mongolian bilingual dictionary

by Badam-osor Khaltar - In COLING-ACL, 2006
Cited by 4 (0 self)
This paper proposes methods for extracting loanwords from Cyrillic Mongolian corpora and producing a Japanese–Mongolian bilingual dictionary. We extract loanwords from Mongolian corpora using our own handcrafted rules. To complement the rule-based extraction, we also extract words in Mongolian corpora that are phonetically similar to Japanese Katakana words as loanwords. In addition, we map the extracted loanwords to Japanese words and produce a bilingual dictionary. We propose a stemming method for Mongolian to extract loanwords correctly. We verify the effectiveness of our methods experimentally.