Models of Translational Equivalence among Words
- Computational Linguistics
, 2000
Cited by 121 (2 self)
This article presents methods for biasing statistical translation models to reflect these properties. Evaluation with respect to independent human judgments has confirmed that translation models biased in this fashion are significantly more accurate than a baseline knowledge-free model. This article also shows how a statistical translation model can take advantage of preexisting knowledge that might be available about particular language pairs. Even the simplest kinds of language-specific knowledge, such as the distinction between content words and function words, are shown to reliably boost translation model performance on some tasks. Statistical models that reflect knowledge about the model domain combine the best of both the rationalist and empiricist paradigms.
A Word-to-Word Model of Translational Equivalence
, 1997
Cited by 73 (6 self)
Many multilingual NLP applications need to translate words between different languages, but cannot afford the computational expense of inducing or applying a full translation model. For these applications, we have designed a fast algorithm for estimating a partial translation model, which accounts for translational equivalence only at the word level. The model's precision/recall trade-off can be directly controlled via one threshold parameter. This feature makes the model more suitable for applications that are not fully statistical. The model's hidden parameters can be easily conditioned on information extrinsic to the model, providing an easy way to integrate pre-existing knowledge such as part-of-speech, dictionaries, word order, etc. Our model can link word tokens in parallel texts as well as other translation models in the literature. Unlike other translation models, it can automatically produce dictionary-sized translation lexicons, and it can do so with over 99% accuracy.
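The threshold-controlled precision/recall trade-off mentioned in this abstract can be illustrated with a minimal sketch. It is not the paper's algorithm: the scores, word pairs, gold lexicon, and the function names `filter_lexicon` and `precision_recall` are all hypothetical, and the only assumption is that each candidate pair carries some confidence score.

```python
def filter_lexicon(scored_pairs, threshold):
    """Keep candidate pairs whose (hypothetical) score meets the threshold."""
    return {pair for pair, score in scored_pairs.items() if score >= threshold}


def precision_recall(predicted, gold):
    """Return (precision, recall) of a predicted lexicon against a gold lexicon."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Made-up scores for illustration only.
scored = {
    ("chat", "cat"): 0.9,
    ("chien", "dog"): 0.8,
    ("le", "dog"): 0.4,        # an incorrect pair with a low score
    ("maison", "house"): 0.3,  # a correct pair with a low score
}
gold = {("chat", "cat"), ("chien", "dog"), ("maison", "house")}

strict = precision_recall(filter_lexicon(scored, 0.5), gold)   # fewer, cleaner entries
lenient = precision_recall(filter_lexicon(scored, 0.2), gold)  # more entries, some wrong
```

With these toy numbers, raising the threshold removes the incorrect pair but also drops one correct pair, so precision rises while recall falls.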
Automatic Construction Of Clean Broad-Coverage Translation Lexicons
- In Proceedings of the 2nd Conference of the Association for Machine Translation in the Americas
Cited by 55 (9 self)
Word-level translational equivalences can be extracted from parallel texts by surprisingly simple statistical techniques. However, these techniques are easily fooled by indirect associations, pairs of unrelated words whose statistical properties resemble those of mutual translations. Indirect associations pollute the resulting translation lexicons, drastically reducing their precision. This paper presents an iterative lexicon cleaning method. On each iteration, most of the remaining incorrect lexicon entries are filtered out, without significant degradation in recall. This lexicon cleaning technique can produce translation lexicons with recall and precision both exceeding 90%, as well as dictionary-sized translation lexicons that are over 99% correct.
1 Introduction
Translation lexicons are explicit representations of translational equivalence at the word level. They are central to any machine translation system, and play a vital role in other multilingual applications, including ...
A Pattern Matching Method for Finding Noun and Proper Noun Translations from Noisy Parallel Corpora
- In Proceedings of the 33rd Annual Conference of the Association for Computational Linguistics
, 1995
Cited by 54 (5 self)
We present a pattern matching method for compiling a bilingual lexicon of nouns and proper nouns from unaligned, noisy parallel texts of Asian/Indo-European language pairs. Tagging information of one language is used. Word frequency and position information for high and low frequency words are represented in two different vector forms for pattern matching. New anchor point finding and noise elimination techniques are introduced. We obtained a 73.1% precision. We also show how the results can be used in the compilation of domain-specific noun phrases.
Compiling Bilingual Lexicon Entries from a Non-Parallel English-Chinese Corpus
- Proceedings of the Third Workshop on Very Large Corpora
Cited by 26 (2 self)
We propose a novel context heterogeneity similarity measure between words and their translations in helping to compile bilingual lexicon entries from a non-parallel English-Chinese corpus. Current algorithms for bilingual lexicon compilation rely on occurrence frequencies, length or positional statistics derived from parallel texts. There is little correlation between such statistics of a word and its translation in non-parallel corpora. On the other hand, we suggest that words with productive context in one language translate to words with productive context in another language, and words with rigid context translate into words with rigid context. Context heterogeneity measures how productive the context of a word is in a given domain, independent of its absolute occurrence frequency in the text. Based on this information, we derive statistics of bilingual word pairs from a non-parallel corpus. These statistics can be used to bootstrap a bilingual dictionary compilation algorithm.
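The abstract does not give a formula for context heterogeneity, so the following is only one plausible reading, sketched under an explicit assumption: heterogeneity is taken to be the number of distinct immediate left (or right) neighbours of a word divided by the word's occurrence count. The function name `context_heterogeneity` and the toy sentence are hypothetical.

```python
def context_heterogeneity(tokens, word):
    """Return (left, right) heterogeneity of `word` in a token list.

    Assumed definition (not taken from the abstract): distinct immediate
    neighbours on each side, normalised by the occurrence count of `word`.
    """
    left_neighbours, right_neighbours = set(), set()
    occurrences = 0
    for i, token in enumerate(tokens):
        if token != word:
            continue
        occurrences += 1
        if i > 0:
            left_neighbours.add(tokens[i - 1])
        if i + 1 < len(tokens):
            right_neighbours.add(tokens[i + 1])
    if occurrences == 0:
        return 0.0, 0.0
    return len(left_neighbours) / occurrences, len(right_neighbours) / occurrences


tokens = "the cat sat on the mat and the cat ran".split()
the_vector = context_heterogeneity(tokens, "the")
cat_vector = context_heterogeneity(tokens, "cat")
```

Because the counts are divided by the word's own frequency, the resulting pair of values does not grow simply because a word is common, which matches the abstract's description of a measure independent of absolute occurrence frequency.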
Automatic Construction of Semantic Lexicons for Learning Natural Language Interfaces
, 1999
Cited by 21 (2 self)
This paper describes a system, Wolfie (WOrd Learning From Interpreted Examples), that acquires a semantic lexicon from a corpus of sentences paired with semantic representations. The lexicon learned consists of words paired with meaning representations. Wolfie is part of an integrated system that learns to parse novel sentences into semantic representations, such as logical database queries. Experimental results are presented demonstrating Wolfie's ability to learn useful lexicons for a database interface in four different natural languages. The lexicons learned by Wolfie are compared to those acquired by a similar system developed by Siskind (1996).
Content areas: Machine Learning and Discovery, Tasks or Problems, supervised learning; Natural Language Processing, Tasks or Problems, understanding
Introduction & Overview
The application of learning methods to natural-language processing (NLP) has drawn increasing attention in recent years. Using machine learning to help automate the ...
Acquiring word-meaning mappings for natural language interfaces
- Journal of Artificial Intelligence Research
, 2003
Cited by 21 (7 self)
This paper focuses on a system, Wolfie (WOrd Learning From Interpreted Examples), that acquires a semantic lexicon from a corpus of sentences paired with semantic representations. The lexicon learned consists of phrases paired with meaning representations. Wolfie is part of an integrated system that learns to parse sentences into representations such as logical database queries. Experimental results are presented demonstrating Wolfie's ability to learn useful lexicons for a database interface in four different natural languages. The usefulness of the lexicons learned by Wolfie is compared to that of lexicons acquired by a similar system developed by Siskind (1996), with results favorable to Wolfie. A second set of experiments demonstrates Wolfie's ability to scale to larger and more difficult, albeit artificially generated, corpora. In natural language acquisition, it is difficult to gather the annotated data needed for supervised learning; however, unannotated data is fairly plentiful. Active learning methods (Cohn, Atlas, & Ladner, 1994) attempt to select for annotation and training only the most informative examples, and therefore are potentially very useful in natural language applications. However, most results to date for active learning have only considered standard classification tasks. To reduce annotation effort while maintaining accuracy, we apply active learning to semantic lexicons. We show that active learning can significantly reduce the number of annotated examples required to achieve a given level of performance.
Extracting Word Correspondences from Bilingual Corpora Based on Word Co-occurrence Information
- In Proceedings of the 16th International Conference on Computational Linguistics
, 1996
Cited by 18 (2 self)
A new method has been developed for extracting word correspondences from a bilingual corpus. First, the co-occurrence information for each word in both languages is extracted from the corpus. Then, the correlations between the co-occurrence features of the words are calculated pairwise with the assistance of a basic word bilingual dictionary. Finally, the pairs of words with the highest correlations are output selectively. This method is applicable to rather small, unaligned corpora; it can extract correspondences between compound words as well as simple words. An experiment using bilingual patent-specification corpora achieved 28% recall and 76% precision; this demonstrates that the method effectively reduces the cost of bilingual dictionary augmentation.
Automatic Extraction of Word Sequence Correspondences in Parallel Corpora
- Proc. of the 4th Annual Workshop on Very Large Corpora (WVLC-4)
, 1996
Cited by 18 (4 self)
This paper proposes a method of finding correspondences of arbitrary length word sequences in aligned parallel corpora of Japanese and English. Translation candidates of word sequences are evaluated by a similarity measure between the sequences defined by the co-occurrence frequency and independent frequency of the word sequences. The similarity measure is an extension of the Dice coefficient. An iterative method with gradual threshold lowering is proposed for getting a high quality translation dictionary. The method is tested with parallel corpora of three distinct domains and achieved over 80% accuracy.
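For orientation, the plain Dice coefficient that this abstract builds on can be sketched as follows. This shows only the standard, unextended coefficient, 2 f(x, y) / (f(x) + f(y)); the paper's extension is not specified in the abstract and is not reproduced here. The sketch also simplifies word sequences to single tokens, and the function name `dice` and the toy aligned pairs are hypothetical.

```python
def dice(aligned_pairs, source_word, target_word):
    """Standard Dice coefficient over aligned sentence pairs.

    f(x) and f(y) count the aligned pairs containing each word on its own
    side; f(x, y) counts the pairs containing both.
    """
    freq_source = freq_target = cooccurrence = 0
    for source_tokens, target_tokens in aligned_pairs:
        in_source = source_word in source_tokens
        in_target = target_word in target_tokens
        freq_source += in_source
        freq_target += in_target
        cooccurrence += in_source and in_target
    if freq_source + freq_target == 0:
        return 0.0
    return 2.0 * cooccurrence / (freq_source + freq_target)


# Toy aligned English/Japanese (romanised) pairs, made up for illustration.
aligned_pairs = [
    (["red", "car"], ["akai", "kuruma"]),
    (["red", "house"], ["akai", "ie"]),
    (["blue", "car"], ["aoi", "kuruma"]),
]
```

On this toy data, `dice(aligned_pairs, "red", "akai")` is higher than `dice(aligned_pairs, "red", "kuruma")`, since "red" and "akai" always appear together while "red" and "kuruma" share only one aligned pair.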
High-Performance Bilingual Text Alignment Using Statistical And Dictionary Information
- In Proceedings of Annual Conference of the Association for Computational Linguistics
, 1996
Cited by 15 (1 self)
This paper describes an accurate and robust text alignment system for structurally different languages. Among structurally different languages such as Japanese and English, there is a limitation on the amount of word correspondences that can be statistically acquired. The proposed method makes use of two kinds of word correspondences in aligning bilingual texts. One is a bilingual dictionary of general use. The other is the word correspondences that are statistically acquired in the alignment process. Our method gradually determines sentence pairs (anchors) that correspond to each other by relaxing parameters. The method, by combining two kinds of word correspondences, achieves adequate word correspondences for complete alignment. As a result, texts of various length and of various genres in structurally different languages can be aligned with high precision. Experimental results show our system outperforms conventional methods for various kinds of Japanese-English texts.

