Identifying Word Translations in Non-Parallel Texts
, 1995
Cited by 59 (1 self)
Common algorithms for sentence and word-alignment allow the automatic identification of word translations from parallel texts. This study suggests that the identification of word translations should also be possible with non-parallel and even unrelated texts. The method proposed is based on the assumption that there is a correlation between the patterns of word cooccurrences in texts of different languages.
1 Introduction
In a number of recent studies it has been shown that word translations can be automatically derived from the statistical distribution of words in bilingual parallel texts (e.g. Catizone, Russell & Warwick, 1989; Brown et al., 1990; Dagan, Church & Gale, 1993; Kay & Roscheisen, 1993). Most of the proposed algorithms first conduct an alignment of sentences, i.e. those pairs of sentences are located that are translations of each other. In a second step a word alignment is performed by analyzing the correspondences of words in each pair of sentences. The results achie...
Combining Corpus and Machine-Readable Dictionary Data for Building Bilingual Lexicons
, 1996
Cited by 13 (0 self)
This paper describes and discusses some theoretical and practical problems arising from developing a system to combine the structured but incomplete information from machine readable dictionaries (MRDs) with the unstructured but more complete information available in corpora for the creation of a bilingual lexical data base, presenting a methodology to integrate information from both sources into a single lexical data structure. The BICORD system (BIlingual CORpus-enhanced Dictionaries) involves linking entries in the Collins English-French and French-English bilingual dictionary with a large English-French and French-English bilingual corpus. We have concentrated on the class of action verbs of movement, building on earlier work on lexical correspondences specific to this verb class between languages (Klavans and Tzoukermann, 1989, 1990a, 1990b). We first examine the way prototypical verbs of movement are translated in the Collin...
Unsupervised Named Entity Transliteration Using Temporal and Phonetic Correlation
Cited by 10 (1 self)
In this paper we investigate unsupervised name transliteration using comparable corpora, that is, corpora where texts in the two languages deal with some of the same topics (and therefore share references to named entities) but are not translations of each other. We present two distinct methods for transliteration, one approach using an unsupervised phonetic transliteration method, and the other using the temporal distribution of candidate pairs. Each of these approaches works quite well, but by combining the approaches one can achieve even better results. We believe that the novelty of our approach lies in the phonetic-based scoring method, which is based on a combination of carefully crafted phonetic features and empirical results from the pronunciation errors of second-language learners of English. Unlike previous approaches to transliteration, this method can in principle work with any pair of languages in the absence of a training dictionary, provided one has an estimate of the pronunciation of words in text.
2005b. A hybrid approach to align sentences and words in English-Hindi parallel corpora
- In Proceedings of the ACL Workshop on "Building and Exploiting Parallel Texts", Ann Arbor
Cited by 6 (0 self)
In this paper we describe an alignment system that aligns English-Hindi texts at the sentence and word level in parallel corpora. We describe a simple sentence length approach to sentence alignment and a hybrid, multi-feature approach to perform word alignment. We use regression techniques in order to learn parameters which characterise the relationship between the lengths of two sentences in parallel text. We use a multi-feature approach with dictionary lookup as a primary technique and other methods such as local word grouping, transliteration similarity (edit-distance) and a nearest aligned neighbours approach to deal with many-to-many word alignment. Our experiments are based on the EMILLE (Enabling Minority Language Engineering) corpus. We obtained 99.09% accuracy for many-to-many sentence alignment and 77% precision and 67.79% recall for many-to-many word alignment.
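The abstract above names edit distance as the basis of its transliteration-similarity feature without giving details. A minimal sketch of that idea follows, assuming both words are already romanised and that the score is one minus the length-normalised Levenshtein distance. The function names and the normalisation are illustrative assumptions, not the authors' implementation.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed with a rolling one-row table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def transliteration_similarity(a: str, b: str) -> float:
    """Hypothetical score in [0, 1]: 1 minus the normalised edit distance."""
    if not a and not b:
        return 1.0
    distance = edit_distance(a.lower(), b.lower())
    return 1.0 - distance / max(len(a), len(b))
```

Under these assumptions, a romanised Hindi form and its English counterpart that differ in two of five letters (for example "dilli" and "delhi") would score 0.6. A word aligner could treat pairs above some threshold as candidate transliterations.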
Bilingual Parallel Corpora and Language Engineering
- IN PROC. OF WORKSHOP ON LANGUAGE ENGINEERING FOR SOUTH-ASIAN LANGUAGES
, 2001
Mining comparable bilingual text corpora for cross-language information integration
- In KDD
, 2005
Cited by 5 (1 self)
Integrating information in multiple natural languages is a challenging task that often requires manually created linguistic resources such as a bilingual dictionary or examples of direct translations of text. In this paper, we propose a general cross-lingual text mining method that does not rely on any of these resources, but can exploit comparable bilingual text corpora to discover mappings between words and documents in different languages. Comparable text corpora are collections of text documents in different languages that are about similar topics; such text corpora are often naturally available (e.g., news articles in different languages published in the same time period). The main idea of our method is to exploit frequency correlations of words in different languages in the comparable corpora and discover mappings between words in different languages. Such mappings can then be used to further discover mappings between documents in different languages, achieving cross-lingual information integration. Evaluation of the proposed method on a 120MB Chinese-English comparable news collection shows that the proposed method is effective for mapping words and documents in English and Chinese. Since our method relies only on naturally available comparable corpora, it is generally applicable to any language pair for which comparable corpora are available.
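The frequency-correlation idea in the abstract above can be illustrated with a small sketch. It assumes each word is represented by a list of per-day counts aligned on the same dates in both corpora, and it uses plain Pearson correlation as the score. The abstract does not specify the measure, so the choice of Pearson, the function names, and the toy counts are all assumptions made for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length count series."""
    if len(x) != len(y) or not x:
        raise ValueError("series must be non-empty and of equal length")
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant series carries no correlation signal
    return cov / sqrt(var_x * var_y)


def best_mappings(source_series, target_series):
    """Map each source word to the target word with the highest correlation."""
    return {
        word: max(target_series, key=lambda t: pearson(series, target_series[t]))
        for word, series in source_series.items()
    }


# Toy, made-up daily counts over the same five days in each language.
english = {"earthquake": [0, 5, 9, 2, 0], "election": [7, 1, 0, 0, 6]}
chinese = {"地震": [1, 6, 8, 1, 0], "选举": [6, 0, 1, 0, 7]}
mappings = best_mappings(english, chinese)
```

With these toy counts, each English word is paired with the Chinese word whose frequency rises and falls on the same days. Real news collections are much noisier, so a practical system would need smoothing and a way to handle frequent words that correlate with everything.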
Example-Based Machine Translation: A New Paradigm
- Translation and Information Technology
Cited by 2 (1 self)
In this article, we will give an overview of the EBMT technology. In the next section we will review the history of EBMT, with a focus on the main ideas. Since a comprehensive review of EBMT can be found in Somers (2000a), we will focus on the discussion of our viewpoints of the EBMT framework. Then we will define the notion of example and examine the major issues involved in EBMT, covering mainly the four major stages of EBMT, namely, example acquisition, example base management, example application and target sentence generation. Some of our current work in lexical-based text alignment for example acquisition is also discussed, highlighting the formulation of a similarity measure and alignment algorithm, before concluding our discussion in the last section.
Clause Alignment for Hong Kong Legal Texts: A Lexical-based Approach
- International Journal of Corpus Linguistics
, 2004
Cited by 1 (0 self)
In this paper we report on our recent work in clause alignment for English-Chinese legal texts using available lexical resources including a bilingual legal glossary and a bilingual dictionary, for the purpose of acquiring examples at various linguistic levels for example-based machine translation. We present our formulation of an appropriate measure for the similarity of a candidate pair of clauses with respect to matched lexical items and the corresponding implementation of an effective algorithm for clause alignment based on this similarity measure. Experimental results show that the similarity measure and the lexical-based clause alignment algorithm, though very simple, are very effective, with a performance of 94.6% alignment accuracy. It confirms our intuition that lexical information gives a reliable indication of correct alignment. The significance of this lexical-based approach lies in both its simplicity and effectiveness.
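The abstract above does not state the formula of its similarity measure. As a rough illustration of scoring a candidate clause pair by matched lexical items, the sketch below uses a Dice-style coefficient over tokens that a bilingual glossary links. The Dice form, the one-to-one matching, the assumption of pre-segmented Chinese tokens, and all names are hypothetical choices and are not the paper's formulation.

```python
def clause_similarity(en_tokens, zh_tokens, glossary):
    """Dice-style score: 2 * matched items / total tokens in both clauses.

    `glossary` maps a lower-cased English word to a set of Chinese terms.
    Each Chinese token may be matched at most once.
    """
    if not en_tokens or not zh_tokens:
        return 0.0
    unused = list(zh_tokens)
    matched = 0
    for word in en_tokens:
        translations = glossary.get(word.lower(), set())
        for i, candidate in enumerate(unused):
            if candidate in translations:
                matched += 1
                del unused[i]  # consume the token so it cannot match again
                break
    return 2 * matched / (len(en_tokens) + len(zh_tokens))


# Toy glossary and clause pair, invented for illustration.
glossary = {"court": {"法院"}, "order": {"命令"}}
score = clause_similarity(["the", "court", "may", "order"],
                          ["法院", "可", "命令"],
                          glossary)
```

Here two of the four English tokens find a glossary match among the three Chinese tokens, so the score is 4/7. An aligner built on such a score would compare it across candidate clause pairs and keep the highest-scoring pairing.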
Corpus-based annotated test set for Machine Translation evaluation by an Industrial User
, 1996
Cited by 1 (0 self)
This article is concerned with the building of a test data set for assisting the industrial user in machine translation evaluation. The emphasis is laid on the interest of an approach based on the study of the pragmatic characteristics of a bilingual corpus. The study of one chapter of the maintenance manual of the Super Puma helicopter made it possible to identify the pragmatic characteristics relevant in the choice of the morpho-syntactic structures and translation processes actually used. The textual test set consists of an SGML file including the source text sequences aligned with the reference translation sequences and also including the pragmatic, formal and translational characteristics in the form of annotations (labels and formal descriptions).
Maximum Entropy Model Learning of the Translation Rules
- In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and the 17th International Conference on Computational Linguistics
, 1997
This paper proposes a method for learning translation rules from parallel corpora. This method applies the maximum entropy principle to a probabilistic model of translation rules. First, we define feature functions which express statistical properties of this model. Next, in order to optimize the model, the system iterates the following steps: (1) selects a feature function which maximizes log-likelihood, and (2) adds this function to the model incrementally. As the computational cost associated with this model is too high, we propose several methods to suppress the overhead in order to realize the system. The result shows that it attained a 69.54% recall rate.

