Introduction to the Special Issue on Computational Linguistics using Large Corpora
- Computational Linguistics, 1993
Termight: Identifying and Translating Technical Terminology
- 1994
Abstract
- Cited by 80 (1 self)
We propose a semi-automatic tool, termight, that helps professional translators and terminologists identify technical terms and their translations. The tool makes use of part-of-speech tagging and word-alignment programs to extract candidate terms and their translations. Although the extraction programs are far from perfect, it isn't too hard for the user to filter out the wheat from the chaff. The extraction algorithms emphasize completeness. Alternative proposals are likely to miss important but infrequent terms/translations. To reduce the burden on the user during the filtering phase, candidates are presented in a convenient order, along with some useful concordance evidence, in an interface that is designed to minimize keystrokes. Termight is currently being used by the trans-
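As a rough illustration of extracting candidate terms from part-of-speech tagged text, the sketch below collects maximal runs of nouns and ranks them by frequency. It is not the termight implementation: the Penn Treebank tag names, the "two or more consecutive nouns" rule, and the function name candidate_terms are all assumptions made for this example, and the input is expected to be tagged already.

```python
from collections import Counter

# Assumed tagset: Penn Treebank noun tags. The abstract does not name a tagset.
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}


def candidate_terms(tagged_tokens, min_len=2):
    """Return (term, count) pairs for maximal noun runs of at least min_len words.

    tagged_tokens is a sequence of (word, tag) pairs. Results are sorted by
    descending frequency so the most common candidates are reviewed first.
    """
    counts = Counter()
    run = []
    # A sentinel with an empty tag flushes a noun run that ends the input.
    for word, tag in list(tagged_tokens) + [("", "")]:
        if tag in NOUN_TAGS:
            run.append(word.lower())
        else:
            if len(run) >= min_len:
                counts[" ".join(run)] += 1
            run = []
    return counts.most_common()


if __name__ == "__main__":
    # Hypothetical tagged input, written by hand for the example.
    tagged = [
        ("The", "DT"), ("dialog", "NN"), ("box", "NN"), ("opens", "VBZ"),
        ("the", "DT"), ("font", "NN"), ("size", "NN"), ("menu", "NN"),
        (".", "."), ("Close", "VB"), ("the", "DT"), ("dialog", "NN"),
        ("box", "NN"),
    ]
    for term, count in candidate_terms(tagged):
        print(count, term)
```

A ranked list of this kind keeps infrequent candidates instead of thresholding them away, leaving the final filtering to the user.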
Robust Bilingual Word Alignment for Machine Aided Translation
- In Proceedings of the Workshop on Very Large Corpora, 1993
Abstract
- Cited by 64 (2 self)
We have developed a new program called word_align for aligning parallel text, text such as the Canadian Hansards that are available in two or more languages. The program takes the output of char_align (Church, 1993), a robust alternative to sentence-based alignment programs, and applies word-level constraints using a version of Brown et al.'s Model 2 (Brown et al., 1993), modified and extended to deal with robustness issues. Word_align was tested on a subset of Canadian Hansards supplied by Simard (Simard et al., 1992). The combination of word_align plus char_align reduces the variance (average square error) by a factor of 5 over char_align alone. More importantly, because word_align and char_align were designed to work robustly on texts that are smaller and more noisy than the Hansards, it has been possible to successfully deploy the programs at AT&T Language Line Services, a commercial translation service, to help them with difficult terminology.
Building Probabilistic Models for Natural Language
- 1996
Abstract
- Cited by 60 (1 self)
Building models of language is a central task in natural language processing. Traditionally, language has been modeled with manually-constructed grammars that describe which strings are grammatical and which are not; however, with the recent availability of massive amounts of on-line text, statistically-trained models are an attractive alternative. These models are generally probabilistic, yielding a score reflecting sentence frequency instead of a binary grammaticality judgement. Probabilistic models of language are a fundamental tool in speech recognition for resolving acoustically ambiguous utterances. For example, we prefer the transcription forbear to four bear as the former string is far more frequent in English text. Probabilistic models also have application in optical character recognition, handwriting recognition, spelling correction, part-of-speech tagging, and machine translation. In this thesis, we investigate three problems involving the probabilistic modeling of languag...
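The forbear / four bear example can be made concrete with a very small bigram model. The sketch below is only an illustration of scoring strings by probability: the toy training sentences are invented for the example, add-one smoothing is an assumed choice (the abstract does not name a smoothing method), and train_bigram and log_prob are hypothetical helper names.

```python
import math
from collections import Counter


def train_bigram(sentences):
    """Count bigrams and their left contexts over whitespace-tokenized sentences."""
    contexts, bigrams, vocab = Counter(), Counter(), {"</s>"}
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        contexts.update(tokens[:-1])            # every token that has a successor
        bigrams.update(zip(tokens, tokens[1:]))
        vocab.update(sentence.split())
    return contexts, bigrams, vocab


def log_prob(sentence, model):
    """Add-one smoothed log probability of a sentence under the bigram model."""
    contexts, bigrams, vocab = model
    size = len(vocab) + 1                       # one extra slot for unseen words
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    return sum(
        math.log((bigrams[(a, b)] + 1) / (contexts[a] + size))
        for a, b in zip(tokens, tokens[1:])
    )


if __name__ == "__main__":
    # Invented toy corpus, not real frequency data.
    corpus = [
        "please forbear from comment",
        "we must forbear",
        "i saw four dogs",
        "the bear slept",
    ]
    model = train_bigram(corpus)
    for candidate in ("we must forbear", "we must four bear"):
        print(candidate, log_prob(candidate, model))
```

On this toy corpus the transcription containing forbear receives the higher score, which is the kind of preference a recognizer would use to choose between acoustically similar candidates.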
Alignment of Shared Forests for Bilingual Corpora
- In Proceedings of the 16th International Conference on Computational Linguistics, 1996
Abstract
- Cited by 24 (1 self)
Research in example-based machine translation (EBMT) has been hampered by the lack of efficient tree alignment algorithms for bilingual corpora. This paper describes an alignment algorithm for EBMT whose running time is quadratic in the size of the input parse trees. The algorithm uses dynamic programming to score all possible matching nodes between structure-sharing trees or forests. We describe the algorithm, various optimizations, and our implementation.
1 Introduction
The development of a machine translation (MT) system requires the lengthy manual preparation of bilingual lexicons and transfer rules. Research over the past few years using parallel sentence-aligned bilingual corpora suggests ways in which this manual effort can be partly replaced by corpus-based training. Some of this research has treated the sentences as unstructured word sequences to be aligned; this work has primarily involved the acquisition of bilingual lexical correspondences (Chen, 1993), although there has ...
Recurrent Patterns in Technical Documentation
- 1992
Abstract
- Cited by 5 (4 self)
This paper addresses some of the problems involved in the production and translation of technical documentation. The techniques and methods developed within Natural Language Processing in general and Machine Translation in particular still have a long way to go before we can see any commercial products general enough to automatically translate unrestricted text. Instead of merely aiming for the perfect MT system, we should also focus on how to make use of existing and simple techniques and the capacity of today's hardware to make the production of technical documentation faster, better and cheaper. Even a twenty per cent gain in efficiency compared to manual translation is considerable by any industry standard. In this paper I describe a tool that pre-processes the source text and provides various kinds of information to support the decision of whether translation tools should be applied at all. Examples from analyses show that up to 43 per cent of a text could be r...
A Program for Aligning Sentences in Bilingual Corpora
- Computational Linguistics, 1991
Abstract
This paper will describe a method and a program (align) for aligning sentences based on a simple statistical model of character lengths. The program uses the fact that longer sentences in one language tend to be translated into longer sentences in the other language, and that shorter sentences tend to be translated into shorter sentences. A probabilistic score is assigned to each proposed correspondence of sentences, based on the scaled difference of lengths of the two sentences (in characters) and the variance of this difference. This probabilistic score is used in a dynamic programming framework to find the maximum likelihood alignment of sentences. It is remarkable that such a simple approach works as well as it does. An evaluation was performed based on a trilingual corpus of economic reports issued by the Union Bank of Switzerland (UBS) in English, French, and German. The method correctly aligned all but 4% of the sentences. Moreover, it is possible to extract a large subcorpus that has a much smaller error rate. By selecting the best-scoring 80% of the alignments, the error rate is reduced from 4% to 0.7%. There were more errors on the English-French subcorpus than on the English-German subcorpus, showing that error rates will depend on the corpus considered; however, both were small enough to hope that the method will be useful for many language pairs. To further research on bilingual corpora, a much larger sample of Canadian Hansards (approximately 90 million words, half in English and half in French) has been aligned with the align program and will be available through the Data Collection Initiative of the Association for Computational Linguistics (ACL/DCI). In addition, in order to facilitate replication of the align program, an appendix is provided with ...
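The length-based scoring and dynamic programming described in this abstract might be sketched as follows. This is a minimal illustration under stated assumptions, not the align program itself: the variance constant, the length ratio, the match-type priors, and the names match_cost and align_sentences are all chosen for the example and are not given in the abstract.

```python
import math

# Assumed parameters for illustration only.
MEAN_RATIO = 1.0   # expected target characters per source character
VARIANCE = 6.8     # assumed variance of the length difference per character
# Assumed prior probability of each (source sentences, target sentences) match type.
PRIORS = {(1, 1): 0.89, (1, 0): 0.005, (0, 1): 0.005, (2, 1): 0.045, (1, 2): 0.045}


def match_cost(src_len, tgt_len, prior):
    """Negative log probability of pairing spans with these character lengths."""
    if src_len == 0 and tgt_len == 0:
        return 0.0
    mean = (src_len + tgt_len / MEAN_RATIO) / 2.0
    delta = (tgt_len - src_len * MEAN_RATIO) / math.sqrt(mean * VARIANCE)
    # Two-sided tail probability of a standard normal at |delta|.
    tail = math.erfc(abs(delta) / math.sqrt(2.0))
    return -math.log(max(tail, 1e-300) * prior)


def align_sentences(src, tgt):
    """Return the minimum-cost list of (source indices, target indices) pairs."""
    n, m = len(src), len(tgt)
    src_len = [len(s) for s in src]
    tgt_len = [len(t) for t in tgt]
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    # Forward dynamic programming: extend each reachable cell by every match type.
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for (di, dj), prior in PRIORS.items():
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                step = match_cost(sum(src_len[i:ni]), sum(tgt_len[j:nj]), prior)
                if cost[i][j] + step < cost[ni][nj]:
                    cost[ni][nj] = cost[i][j] + step
                    back[ni][nj] = (i, j)
    # Follow back-pointers from the end to recover the alignment.
    pairs = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        pairs.append((list(range(pi, i)), list(range(pj, j))))
        i, j = pi, pj
    pairs.reverse()
    return pairs


if __name__ == "__main__":
    english = ["Hello.", "This is a considerably longer sentence about alignment.", "Bye."]
    french = ["Bonjour.", "Ceci est une phrase nettement plus longue sur l'alignement.", "Salut."]
    print(align_sentences(english, french))
```

Only sentence lengths enter the score, so the sketch needs no lexicon; the per-pair costs could also be kept and sorted to select a best-scoring subset, in the spirit of the 80% selection the abstract mentions.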

