A Word-to-Word Model of Translational Equivalence
, 1997
Cited by 73 (6 self)
Many multilingual NLP applications need to translate words between different languages, but cannot afford the computational expense of inducing or applying a full translation model. For these applications, we have designed a fast algorithm for estimating a partial translation model, which accounts for translational equivalence only at the word level. The model's precision/recall trade-off can be directly controlled via one threshold parameter. This feature makes the model more suitable for applications that are not fully statistical. The model's hidden parameters can be easily conditioned on information extrinsic to the model, providing an easy way to integrate pre-existing knowledge such as part-of-speech, dictionaries, word order, etc. Our model can link word tokens in parallel texts as well as other translation models in the literature. Unlike other translation models, it can automatically produce dictionary-sized translation lexicons, and it can do so with over 99% accuracy.
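A minimal sketch of how a single threshold can trade precision against recall in a translation lexicon. This is not the paper's algorithm: the scores, the word pairs, and the names `filter_lexicon`, `precision_recall`, `SCORES`, and `GOLD` are all hypothetical and chosen only for illustration.

```python
from typing import Dict, Set, Tuple

Pair = Tuple[str, str]

# Hypothetical link scores for candidate (source, target) word pairs.
SCORES: Dict[Pair, float] = {
    ("chien", "dog"): 0.95,
    ("chat", "cat"): 0.90,
    ("maison", "house"): 0.60,
    ("chien", "cat"): 0.40,
    ("le", "house"): 0.10,
}

# Hypothetical gold-standard lexicon used only to measure the trade-off.
GOLD: Set[Pair] = {("chien", "dog"), ("chat", "cat"), ("maison", "house")}


def filter_lexicon(scores: Dict[Pair, float], threshold: float) -> Set[Pair]:
    """Keep only the pairs whose score reaches the threshold."""
    return {pair for pair, score in scores.items() if score >= threshold}


def precision_recall(predicted: Set[Pair], gold: Set[Pair]) -> Tuple[float, float]:
    """Return (precision, recall) of the predicted pairs against the gold pairs."""
    if not predicted:
        # Nothing was predicted, so nothing predicted is wrong.
        return 1.0, 0.0
    correct = len(predicted & gold)
    recall = correct / len(gold) if gold else 0.0
    return correct / len(predicted), recall


if __name__ == "__main__":
    for threshold in (0.3, 0.5, 0.8):
        lexicon = filter_lexicon(SCORES, threshold)
        p, r = precision_recall(lexicon, GOLD)
        print(f"threshold={threshold:.1f} size={len(lexicon)} precision={p:.2f} recall={r:.2f}")
```

Raising the threshold keeps fewer, higher-scoring pairs, which tends to raise precision and lower recall; lowering it does the opposite.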
Automatic Discovery of Non-Compositional Compounds in Parallel Data
, 1997
Cited by 58 (1 self)
Automatic segmentation of text into minimal content-bearing units is an unsolved problem even for languages like English. Spaces between words offer an easy first approximation, but this approximation is not good enough for machine translation (MT), where many word sequences are not translated word-for-word. This paper presents an efficient automatic method for discovering sequences of words that are translated as a unit. The method proceeds by comparing pairs of statistical translation models induced from parallel texts in two languages. It can discover hundreds of non-compositional compounds on each iteration, and constructs longer compounds out of shorter ones. Objective evaluation on a simple machine translation task has shown the method's potential to improve the quality of MT output. The method makes few assumptions about the data, so it can be applied to parallel data other than parallel texts, such as word spellings and pronunciations.
Fast and Accurate Sentence Alignment of Bilingual Corpora
- In Stephen D
, 2002
Cited by 41 (1 self)
We present a new method for aligning sentences with their translations in a parallel bilingual corpus. Previous approaches have generally been based either on sentence length or word correspondences. Sentence-length-based methods are relatively fast and fairly accurate. Word-correspondence-based methods are generally more accurate but much slower, and usually depend on cognates or a bilingual lexicon. Our method adapts and combines these approaches, achieving high accuracy at a modest computational cost, and requiring no knowledge of the languages or the corpus beyond division into words and sentences.
MULTEXT (Multilingual Text Tools and Corpora)
- In Proceedings of the 15th International Conference on Computational Linguistics, COLING'94
, 1994
Cited by 19 (3 self)
MULTEXT (Multilingual Text Tools and Corpora) is the largest project funded in the Commission of European Communities Linguistic Research and Engineering Program. The project will contribute to the development of generally usable software tools to manipulate and analyse text corpora and to create multi-lingual text corpora with structural and linguistic markup. It will attempt to establish conventions for the encoding of such corpora, building on and contributing to the preliminary recommendations of the relevant international and European standardization initiatives. MULTEXT will also work towards establishing a set of guidelines for text software development, which will be widely published in order to enable future development by others. All tools and data developed within the project will be made freely and publicly available.
Extracting Word Correspondences from Bilingual Corpora Based on Word Co-occurrence Information
- In Proceedings of the 16th International Conference on Computational Linguistics
, 1996
Cited by 18 (2 self)
A new method has been developed for extracting word correspondences from a bilingual corpus. First, the co-occurrence information for each word in both languages is extracted from the corpus. Then, the correlations between the co-occurrence features of the words are calculated pairwise with the assistance of a basic word bilingual dictionary. Finally, the pairs of words with the highest correlations are output selectively. This method is applicable to rather small, unaligned corpora; it can extract correspondences between compound words as well as simple words. An experiment using bilingual patent-specification corpora achieved 28% recall and 76% precision; this demonstrates that the method effectively reduces the cost of bilingual dictionary augmentation.
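The three steps described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the paper's method: the abstract does not name its correlation measure, so cosine similarity is swapped in here, and the toy corpora, the seed dictionary, and the names `cooccurrence`, `cosine`, and `best_match` are hypothetical.

```python
import math
from typing import Dict, List

# Toy unaligned corpora (hypothetical data), one token list per sentence.
EN = [["the", "pump", "moves", "water"],
      ["the", "pump", "needs", "oil"],
      ["the", "valve", "blocks", "water"]]
FR = [["la", "vanne", "bloque", "eau"],
      ["la", "pompe", "exige", "huile"],
      ["la", "pompe", "pousse", "eau"]]

# Hypothetical basic bilingual dictionary used to make the features comparable.
SEED: Dict[str, str] = {"water": "eau", "oil": "huile"}


def cooccurrence(sentences: List[List[str]], word: str, seeds: List[str]) -> List[int]:
    """Count, per seed word, the sentences in which it co-occurs with `word`."""
    return [sum(1 for s in sentences if word in s and seed in s and seed != word)
            for seed in seeds]


def cosine(u: List[int], v: List[int]) -> float:
    """Cosine similarity, standing in for the unspecified correlation measure."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def best_match(word: str, candidates: List[str]) -> str:
    """Return the target candidate whose co-occurrence profile best matches `word`."""
    src_seeds = list(SEED)
    tgt_seeds = [SEED[s] for s in src_seeds]  # same order, so dimensions line up
    src_vec = cooccurrence(EN, word, src_seeds)
    return max(candidates,
               key=lambda c: cosine(src_vec, cooccurrence(FR, c, tgt_seeds)))


if __name__ == "__main__":
    for w in ("pump", "valve"):
        print(w, "->", best_match(w, ["pompe", "vanne"]))
```

The seed dictionary maps each co-occurrence dimension in one language to the matching dimension in the other, which is why no sentence alignment is needed in this sketch.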
A Matching Technique in Example-Based Machine Translation
, 1994
Cited by 16 (0 self)
This paper addresses an important problem in Example-Based Machine Translation (EBMT), namely how to measure similarity between a sentence fragment and a set of stored examples. A new method is proposed that measures similarity according to both surface structure and content. A second contribution is the use of clustering to make retrieval of the best matching example from the database more efficient. Results on a large number of test cases from the CELEX database are presented.
Bi-Textual Aids for Translators
- University of Waterloo
, 1992
Cited by 10 (0 self)
While machine translation can successfully tackle some highly restricted sublanguages, it is in most cases more productive to turn to support tools for human translators. The functions taken over by existing translator's workstations are rather peripheral with respect to the core aspects of the translation task. However, recent developments show that it is possible to automatically produce explicit (partial) representations of the translation correspondences that link pairs of source and target texts. These representations, called bitexts, provide the foundation required for the design of support tools that delve deeper into the realm of translation proper, such as: a) a translation memory that can be accessed by various means, including bilingual concordancing; b) translation critiquing tools capable of detecting correspondence errors such as omissions or deceptive cognates; and c) translator-oriented speech recognition systems capable of taking advantage of correspondence constraints wi...
Aligning parallel texts: Do methods developed for English-French generalize to Asian languages
- In Proceedings of Pacific Asia Conference on Formal and Computational Linguistics
, 1993
Rapid Development of an Afrikaans-English Speech-to-Speech Translator
, 2005
Cited by 6 (2 self)
In this paper we investigate the rapid deployment of a two-way Afrikaans to English Speech-to-Speech Translation system. We discuss the approaches and amount of work involved to port a system to a new language pair, i.e. the steps required to rapidly adapt ASR, MT and TTS components to Afrikaans under limited time and data constraints. The resulting system represents the first prototype built for Afrikaans to English speech translation.
Automatic processing of multilingual medical terminology: Applications to thesaurus enrichment and cross-language information retrieval
- Artificial Intelligence in Medicine, 33(2
, 2005
Cited by 3 (0 self)
We present in this article experiments on Multi-Language Information Extraction and Access in the medical domain. Methods for extracting bilingual lexicons from parallel and comparable corpora are described and their use in Multi-Language Information Access is illustrated. Our experiments show that these automatically extracted bilingual lexicons are accurate enough for semi-automatically enriching mono- or bilingual thesauri (such as UMLS), and that their use in Cross-language Information Retrieval (CLIR) significantly improves the retrieval performance and clearly outperforms existing bilingual lexicon resources (both general lexicons and specialized ones).

