Results 1 - 10 of 30
A Systematic Comparison of Various Statistical Alignment Models
Computational Linguistics, 2003
Cited by 805 (22 self)
... this article the problem of finding the word alignment of a bilingual sentence-aligned corpus by using language-independent statistical methods. There is a vast literature on this topic, and many different systems have been suggested to solve this problem. Our work follows and extends the methods introduced by Brown, Della Pietra, Della Pietra, and Mercer (1993) by using refined statistical models for the translation process. The basic idea of this approach is to develop a model of the translation process with the word alignment as a hidden variable of this process, to apply statistical estimation theory to compute the "optimal" model parameters, and to perform alignment search to compute the best word alignment
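The three steps named in this abstract (a translation model with the alignment as a hidden variable, parameter estimation, alignment search) can be illustrated with a minimal sketch. It assumes the simplest lexical model of the Brown et al. family (IBM Model 1), trained with EM on a toy corpus; it is not the refined models the article compares. The function names, the toy sentences, and the omission of the NULL source word are all choices made for this illustration.

```python
from collections import defaultdict


def train_model1(pairs, iterations=10):
    """Estimate t[(f, e)] = P(target word f | source word e) with EM.

    The alignment is never observed; each E-step spreads one count for
    every target word over the source words of the same sentence pair.
    """
    target_vocab = {f for _, tgt in pairs for f in tgt}
    # Uniform starting point (illustrative assumption).
    t = defaultdict(lambda: 1.0 / len(target_vocab))
    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        for src, tgt in pairs:
            for f in tgt:
                norm = sum(t[(f, e)] for e in src)
                for e in src:
                    posterior = t[(f, e)] / norm  # P(f aligned to e | pair)
                    count[(f, e)] += posterior
                    total[e] += posterior
        # M-step: renormalise the expected counts per source word.
        for (f, e), c in count.items():
            t[(f, e)] = c / total[e]
    return t


def best_alignment(t, src, tgt):
    """Alignment search: link each target word to its most probable source position."""
    return [max(range(len(src)), key=lambda i: t[(f, src[i])]) for f in tgt]


# Toy sentence-aligned corpus (hypothetical data).
corpus = [
    (["the", "house"], ["das", "haus"]),
    (["the", "book"], ["das", "buch"]),
    (["a", "book"], ["ein", "buch"]),
]
t = train_model1(corpus)
print(best_alignment(t, ["the", "house"], ["das", "haus"]))
```

Under Model 1 each target word is aligned independently, so the search reduces to a per-word argmax; richer models in this family need a real search procedure.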
Bootstrapping Parsers via Syntactic Projection across Parallel Texts
Natural Language Engineering, 2005
Cited by 61 (2 self)
Broad coverage, high quality parsers are available for only a handful of languages. A prerequisite for developing broad coverage parsers for more languages is the annotation of text with the desired linguistic representations (also known as “treebanking”). However, syntactic annotation is a labor-intensive and time-consuming process, and it is difficult to find linguistically annotated text in sufficient quantities. In this article, we explore using parallel text to help solve the problem of creating syntactic annotation in more languages. The central idea is to annotate the English side of a parallel corpus, project the analysis to the second language, and then train a stochastic analyzer on the resulting noisy annotations. We discuss our background assumptions, describe an initial study on the “projectability” of syntactic relations, and then present two experiments in which stochastic parsers are developed with minimal human intervention via projection from English.
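The projection step described in this abstract can be sketched as follows. This is a simplified illustration, not the authors' procedure: it assumes a one-to-one word alignment given as a dictionary and a dependency analysis given as child-to-head indices, and it drops any edge with an unaligned endpoint. The function name and the toy indices are hypothetical.

```python
def project_dependencies(src_heads, alignment):
    """Copy dependency edges from the annotated side to the other language.

    src_heads: maps a source child index to its source head index.
    alignment: maps a source index to a target index (one-to-one assumed).
    Returns a child -> head map over target indices.
    """
    projected = {}
    for child, head in src_heads.items():
        # Keep an edge only when both of its ends have a counterpart.
        if child in alignment and head in alignment:
            projected[alignment[child]] = alignment[head]
    return projected


# Source side: word 1 heads words 0 and 2 (subject, verb, object order).
src_heads = {0: 1, 2: 1}
# Toy alignment into a verb-final target sentence: the verb maps to position 2.
alignment = {0: 0, 1: 2, 2: 1}
print(project_dependencies(src_heads, alignment))
```

Real alignments are noisy and often one-to-many, which is why the abstract speaks of training on "noisy annotations"; this sketch ignores those cases.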
Improving machine translation performance by exploiting non-parallel corpora
Computational Linguistics, 2005
Cited by 56 (2 self)
We present a novel method for discovering parallel sentences in comparable, non-parallel corpora. We train a maximum entropy classifier that, given a pair of sentences, can reliably determine whether or not they are translations of each other. Using this approach, we extract parallel data from large Chinese, Arabic, and English non-parallel newspaper corpora. We evaluate the quality of the extracted data by showing that it improves the performance of a state-of-the-art statistical machine translation system. We also show that a good-quality MT system can be built from scratch by starting with a very small parallel corpus (100,000 words) and exploiting a large non-parallel corpus. Thus, our method can be applied with great benefit to language pairs for which only scarce resources are available.
Using Cross-Language Cues For Story-Specific Language Modeling
In Proc. ICSLP, 2002
Cited by 11 (7 self)
We propose methods to exploit contemporary news articles in a resource-rich language, together with cross-language information retrieval and machine translation, to sharpen language models for a news story in a language with fewer linguistic resources. We report experimental results on story-specific Chinese language models that use cues from a parallel corpus of English news stories. We demonstrate that even with fairly crude cross-language information retrieval, level-1 machine translation and simple linear interpolation, a significant (18%) reduction in perplexity may be obtained over a Chinese trigram model. We also demonstrate that this method of sharpening the Chinese language model is complementary to other techniques like topic-dependent modeling, and the two in combination result in an even greater reduction in perplexity (28%).
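The "simple linear interpolation" and the perplexity measure mentioned in this abstract can be illustrated with a minimal sketch. It assumes unigram models for brevity (the abstract's baseline is a trigram model), and the probabilities, the weight of 0.5, and the function names are invented for the example; none of them come from the paper.

```python
import math


def interpolate(p_base, p_cross, lam):
    """Mix two word distributions: lam * p_cross(w) + (1 - lam) * p_base(w)."""
    vocab = set(p_base) | set(p_cross)
    return {
        w: lam * p_cross.get(w, 0.0) + (1.0 - lam) * p_base.get(w, 0.0)
        for w in vocab
    }


def perplexity(model, words):
    """exp of the average negative log-probability the model assigns to the words."""
    log_sum = sum(math.log(model[w]) for w in words)
    return math.exp(-log_sum / len(words))


# Hypothetical static model and story-specific model over a three-word vocabulary.
p_base = {"a": 0.5, "b": 0.25, "c": 0.25}
p_cross = {"a": 0.1, "b": 0.1, "c": 0.8}
story = ["c", "c", "a", "c"]

mixed = interpolate(p_base, p_cross, lam=0.5)
print(perplexity(p_base, story), perplexity(mixed, story))
```

Because both inputs sum to one and the weights sum to one, the mixture is again a proper distribution; the interpolation weight would normally be tuned on held-out text rather than fixed.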
Lexical triggers and latent semantic analysis for crosslingual language model adaptation
ACM Transactions on Asian Language Information Processing, 2004
Cited by 10 (1 self)
In-domain texts for estimating statistical language models are not easily found for most languages of the world. We present two techniques to take advantage of in-domain text resources in other languages. First, we extend the notion of lexical triggers, which have been used monolingually for language model adaptation, to the cross-lingual problem, permitting the construction of sharper language models for a target-language document by drawing statistics from related documents in a resource-rich language. Next, we show that cross-lingual latent semantic analysis is similarly capable of extracting useful statistics for language modeling. Neither technique requires explicit translation capabilities between the two languages! We demonstrate significant reductions in both perplexity and word error rate on a Mandarin speech recognition task by using these techniques.
Automatic Construction of English/Chinese Parallel Corpora
Journal of the American Society for Information Science and Technology, 2003
Cited by 8 (1 self)
As the demand for global information increases significantly, multilingual corpora have become a valuable linguistic resource for applications to cross-lingual information retrieval and natural language processing. In order to cross the boundaries that exist between different languages, dictionaries are the most typical tools. However, the general-purpose dictionary is less sensitive in both genre and domain. It is also impractical to manually construct tailored bilingual dictionaries or sophisticated multilingual thesauri for large applications. Corpus-based approaches, which do not have the limitation of dictionaries, provide a statistical translation model with which to cross the language boundary. There are many domain-specific parallel or comparable corpora that are employed in machine translation and cross-lingual information
Multi-Align: Combining Linguistic and Statistical Techniques To Improve Alignments for Adaptable MT
In Proceedings of AMTA’2004, 2004
Cited by 8 (2 self)
The continuously growing MT market faces the challenge of translating new languages, diverse genres, and different domains using a variety of available linguistic resources.
Induction of the Morphology of Natural Language: Unsupervised Morpheme Segmentation with Application to Automatic Speech Recognition
2006
Cited by 5 (2 self)
ISBN 951-22-8210-0 (printed version) ISBN 951-22-8211-9 (electronic version)
Translation as Annotation
Proceedings of the AI*IA 2003 Workshop "Topics and Perspectives of Natural Language Processing in ...", 2003
Cited by 5 (3 self)
In this paper we illustrate an approach to the creation of high quality linguistically annotated resources based on the exploitation of aligned parallel corpora. This approach is based on the key notion that translating a text can be seen as a linguistic annotation task which is easier than manual annotation with formal schemes. After translation, formal annotations can be automatically derived from aligned translated texts. We will show that translations can be exploited in various interesting ways to speed up and automate the linguistic annotation of texts. If none of the texts is already annotated, information from aligned texts can be exploited to carry out the annotation from scratch. Conversely, if the texts in one language have been annotated and the others have not, annotations can be transferred from one language to the other. The transfer-based method allows for the exploitation of existing (mostly English) annotated resources to bootstrap the creation of annotated corpora in new languages with highly reduced human effort.
Exploiting Hidden Meanings Using Bilingual Text
In A. Gelbukh (Ed.), Lecture Notes in Computer Science 2945: Computational Linguistics and Intelligent Text Processing: Fifth International Conference, CICLing 2004 Proceedings (pp. 283–299), 2004
Cited by 5 (0 self)
The last decade has taught computational linguists that high performance on broad-coverage natural language processing tasks is best obtained using supervised learning techniques, which require annotation of large quantities of training data. But annotated text is hard to obtain.

