Results 1 - 10
of
47
A Phrase-Based, Joint Probability Model for Statistical Machine Translation
- In Proceedings of EMNLP
, 2002
"... We present a joint probability model for statistical machine translation, which automatically learns word and phrase equivalents from bilingual corpora. Translations produced with parameters estimated using the joint model are more accurate than translations produced using IBM Model 4. ..."
Abstract
-
Cited by 135 (2 self)
- Add to MetaCart
We present a joint probability model for statistical machine translation, which automatically learns word and phrase equivalents from bilingual corpora. Translations produced with parameters estimated using the joint model are more accurate than translations produced using IBM Model 4.
The Web as a Parallel Corpus
- Computational Linguistics
, 2003
"... Parallel corpora have become an essential resource for work in multilingual natural language processing. In this report, we describe our work using the STRAND system for mining parallel text on the World Wide Web, first reviewing the original algorithm and results and then presenting a set of signif ..."
Abstract
-
Cited by 101 (3 self)
- Add to MetaCart
Parallel corpora have become an essential resource for work in multilingual natural language processing. In this report, we describe our work using the STRAND system for mining parallel text on the World Wide Web, first reviewing the original algorithm and results and then presenting a set of significant enhancements. These enhancements include the use of supervised learning based on structural features of documents to improve classification performance, a new content-based measure of translational equivalence, and adaptation of the system to take advantage of the Internet Archive for mining parallel text from the Web on a large scale.
Bootstrapping Parsers via Syntactic Projection across Parallel Texts
- Natural Language Engineering
, 2005
"... Broad coverage, high quality parsers are available for only a handful of languages. A prerequisite for developing broad coverage parsers for more languages is the annotation of text with the desired linguistic representations (also known as “treebanking”). However, syntactic annotation is a labor in ..."
Abstract
-
Cited by 61 (2 self)
- Add to MetaCart
Broad coverage, high quality parsers are available for only a handful of languages. A prerequisite for developing broad coverage parsers for more languages is the annotation of text with the desired linguistic representations (also known as “treebanking”). However, syntactic annotation is a labor intensive and time-consuming process, and it is difficult to find linguistically annotated text in sufficient quantities. In this article, we explore using parallel text to help solving the problem of creating syntactic annotation in more languages. The central idea is to annotate the English side of a parallel corpus, project the analysis to the second language, and then train a stochastic analyzer on the resulting noisy annotations. We discuss our background assumptions, describe an initial study on the “projectability ” of syntactic relations, and then present two experiments in which stochastic parsers are developed with minimal human intervention via projection from English. 1
A Noisy-Channel Approach to Question Answering
, 2003
"... We introduce a probabilistic noisychannel model for question answering and we show how it can be exploited in the context of an end-to-end QA system. Our noisy-channel system outperforms a stateof -the-art rule-based QA system that uses similar resources. We also show that the model we propos ..."
Abstract
-
Cited by 42 (3 self)
- Add to MetaCart
We introduce a probabilistic noisychannel model for question answering and we show how it can be exploited in the context of an end-to-end QA system. Our noisy-channel system outperforms a stateof -the-art rule-based QA system that uses similar resources. We also show that the model we propose is flexible enough to accommodate within one mathematical framework many QA-specific resources and techniques, which range from the exploitation of WordNet, structured, and semi-structured databases to reasoning, and paraphrasing.
Learning a Translation Lexicon from Monolingual Corpora
- In Proceedings of ACL Workshop on Unsupervised Lexical Acquisition
, 2002
"... This paper presents work on the task of constructing a word-level translation lexicon purely from unrelated monolingual corpora. We combine various clues such as cognates, similar context, preservation of word similarity, and word frequency. Experimental results for the construction of a German-Engl ..."
Abstract
-
Cited by 33 (0 self)
- Add to MetaCart
This paper presents work on the task of constructing a word-level translation lexicon purely from unrelated monolingual corpora. We combine various clues such as cognates, similar context, preservation of word similarity, and word frequency. Experimental results for the construction of a German-English noun lexicon are reported.
A survey of statistical machine translation
, 2007
"... Statistical machine translation (SMT) treats the translation of natural language as a machine learning problem. By examining many samples of human-produced translation, SMT algorithms automatically learn how to translate. SMT has made tremendous strides in less than two decades, and many popular tec ..."
Abstract
-
Cited by 30 (3 self)
- Add to MetaCart
Statistical machine translation (SMT) treats the translation of natural language as a machine learning problem. By examining many samples of human-produced translation, SMT algorithms automatically learn how to translate. SMT has made tremendous strides in less than two decades, and many popular techniques have only emerged within the last few years. This survey presents a tutorial overview of state-of-the-art SMT at the beginning of 2007. We begin with the context of the current research, and then move to a formal problem description and an overview of the four main subproblems: translational equivalence modeling, mathematical modeling, parameter estimation, and decoding. Along the way, we present a taxonomy of some different approaches within these areas. We conclude with an overview of evaluation and notes on future directions.
Improving Statistical MT through Morphological Analysis
- In Proc. of Empirical Methods in Natural Language Processing (EMNLP
, 2005
"... In statistical machine translation, estimating word-to-word alignment probabilities for the translation model can be difficult due to the problem of sparse data: most words in a given corpus occur at most a handful of times. With a highly inflected language such as Czech, this problem can be particu ..."
Abstract
-
Cited by 29 (0 self)
- Add to MetaCart
In statistical machine translation, estimating word-to-word alignment probabilities for the translation model can be difficult due to the problem of sparse data: most words in a given corpus occur at most a handful of times. With a highly inflected language such as Czech, this problem can be particularly severe. In addition, much of the morphological variation seen in Czech words is not reflected in either the morphology or syntax of a language like English. In this work, we show that using morphological analysis to modify the Czech input can improve a Czech-English machine translation system. We investigate several different methods of incorporating morphological information, and show that a system that combines these methods yields the best results. Our final system achieves a BLEU score of.333, as compared to.270 for the baseline word-to-word system. 1
Knowledge Sources for Word-Level Translation Models
- In Proceedings of the 2001 Conference on Empirical Methods in Natural Language Processing
, 2001
"... We present various methods to train word-level translation models for statistical machine translation systems that use widely different knowledge sources ranging from parallel corpora and a bilingual lexicon to only monolingual corpora in two languages. Some novel methods are presented and previousl ..."
Abstract
-
Cited by 26 (2 self)
- Add to MetaCart
We present various methods to train word-level translation models for statistical machine translation systems that use widely different knowledge sources ranging from parallel corpora and a bilingual lexicon to only monolingual corpora in two languages. Some novel methods are presented and previously published methods are reviewed. Also, a common evaluation metric enables the first quantitative comparison of these approaches.
A backoff model for bootstrapping resources for non-English languages
- In Proceedings of HLT-EMNLP
, 2005
"... The lack of annotated data is an obstacle to the development of many natural language processing applications; the problem is especially severe when the data is non-English. Previous studies suggested the possibility of acquiring resources for non-English languages by bootstrapping from high quality ..."
Abstract
-
Cited by 18 (0 self)
- Add to MetaCart
The lack of annotated data is an obstacle to the development of many natural language processing applications; the problem is especially severe when the data is non-English. Previous studies suggested the possibility of acquiring resources for non-English languages by bootstrapping from high quality English NLP tools and parallel corpora; however, the success of these approaches seems limited for dissimilar language pairs. In this paper, we propose a novel approach of combining a bootstrapped resource with a small amount of manually annotated data. We compare the proposed approach with other bootstrapping methods in the context of training a Chinese Part-of-Speech tagger. Experimental results show that our proposed approach achieves a significant improvement over EM and self-training and systems that are only trained on manual annotations. 1
Prague Czech-English Dependency Treebank: Syntactically Annotated Resources for Machine Translation
- In Proceedings of EAMT 10th Annual Conference
, 2004
"... This paper introduces the Prague Czech-English Dependency Treebank (PCEDT), a new Czech-English parallel resource suitable for experiments in structural machine translation. We describe the process of building the core parts of the resources – a bilingual syntactically annotated corpus and translati ..."
Abstract
-
Cited by 17 (1 self)
- Add to MetaCart
This paper introduces the Prague Czech-English Dependency Treebank (PCEDT), a new Czech-English parallel resource suitable for experiments in structural machine translation. We describe the process of building the core parts of the resources – a bilingual syntactically annotated corpus and translation dictionaries. A part of the Penn Treebank has been translated into Czech, the dependency annotation of the Czech translation has been done automatically from plain text. The annotation of Penn Treebank has been tranformed into dependency annotation scheme. A subset of corresponding Czech and English sentences has been annotated by humans. First experiments in Czech-English machine translation using these data have already been carried out. The resources being created at Charles University in Prague are scheduled for release as Linguistic Data Consortium data collection in 2004. 1.

