A Statistical View on Bilingual Lexicon Extraction: From Parallel Corpora to Non-Parallel Corpora (1998)
| Venue: | Parallel Text Processing |
| Citations: | 48 - 3 self |
BibTeX
@INPROCEEDINGS{Fung98astatistical,
author = {Pascale Fung},
title = {A Statistical View on Bilingual Lexicon Extraction: From Parallel Corpora to Non-Parallel Corpora},
booktitle = {Parallel Text Processing},
year = {1998},
pages = {1--17},
publisher = {Springer}
}
Abstract
We present two problems in the statistical extraction of bilingual lexicons: (1) How can noisy parallel corpora be used? (2) How can non-parallel yet comparable corpora be used? We describe our own work and contribution in relaxing the constraint of using only clean parallel corpora. DKvec is a method for extracting bilingual lexicons from noisy parallel corpora based on the arrival distances of words. Using DKvec on noisy parallel corpora in English/Japanese and English/Chinese, our evaluations show 55.35% precision from a small corpus and 89.93% precision from a larger corpus. Our major contribution is in the extraction of bilingual lexicons from non-parallel corpora. We present a first such result in this area, from a new method, Convec. Convec is based on context information of a word to be translated. We show 30% to 76% precision when top-one to top-20 translation candidates are considered. Most of the top-20 candidates are either collocations or words rela...







