METEOR: An Automatic Metric for MT Evaluation with High Levels of Correlation with Human Judgments (2007)

by Alon Lavie, Abhaya Agarwal

Results 1 - 10 of 246 citing documents

Improving statistical machine translation using word sense disambiguation

by Marine Carpuat, Dekai Wu - In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, 2007
"... We show for the first time that incorporating the predictions of a word sense disambigua-tion system within a typical phrase-based statistical machine translation (SMT) model consistently improves translation quality across all three different IWSLT Chinese-English test sets, as well as producing st ..."
Abstract - Cited by 128 (7 self)
We show for the first time that incorporating the predictions of a word sense disambiguation system within a typical phrase-based statistical machine translation (SMT) model consistently improves translation quality across all three different IWSLT Chinese-English test sets, as well as producing statistically significant improvements on the larger NIST Chinese-English MT task, and moreover never hurts performance on any test set, according not only to BLEU but to all eight most commonly used automatic evaluation metrics. Recent work has challenged the assumption that word sense disambiguation (WSD) systems are useful for SMT. Yet SMT translation quality still obviously suffers from inaccurate lexical choice. In this paper, we address this problem by investigating a new strategy for integrating WSD into an SMT system, one that performs fully phrasal multi-word disambiguation. Instead of directly incorporating a Senseval-style WSD system, we redefine the WSD task to match the exact same phrasal translation disambiguation task faced by phrase-based SMT systems. Our results provide the first known empirical evidence that lexical semantics are indeed useful for SMT, despite claims to the contrary.

Citation Context

...all the very different metrics. In addition to the widely used BLEU (Papineni et al., 2002) and NIST (Doddington, 2002) scores, we also evaluate translation quality with the recently proposed Meteor (Banerjee and Lavie, 2005) and four edit-distance style metrics, Word Error Rate (WER), Position-independent word Error Rate (PER) (Tillmann et al., 1997), CDER, which allows block reordering (Leusch et al., 2006), and Tran...
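
A toy sketch of the integration strategy the abstract above describes, assuming a log-linear decoder: the "senses" of a source phrase are taken to be its candidate target phrases, and a context-sensitive WSD prediction is combined with the baseline phrase-table probability as an extra feature. All names, probabilities, and the stub WSD scores below are illustrative, not Carpuat and Wu's actual models.

    import math
    from typing import List, Tuple

    def phrase_score(p_translation: float, p_wsd: float,
                     wsd_weight: float = 0.5) -> float:
        """Combine the baseline phrase-table probability with the
        context-sensitive WSD prediction as one log-linear feature."""
        return math.log(p_translation) + wsd_weight * math.log(p_wsd)

    # Candidate target phrases for one source phrase in one sentence
    # context: (gloss, phrase-table probability, WSD prediction).
    candidates: List[Tuple[str, float, float]] = [
        ("bank (river)",     0.6, 0.1),  # frequent overall, wrong here
        ("bank (financial)", 0.4, 0.9),  # preferred by WSD in context
    ]
    best = max(candidates, key=lambda c: phrase_score(c[1], c[2]))
    print(best[0])  # bank (financial)

The point of the reformulation is visible even in this toy: the static phrase table prefers the globally frequent candidate, while the context-aware score flips the choice.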

Better Hypothesis Testing for Statistical Machine Translation: Controlling for Optimizer Instability.

by Jonathan H. Clark, Chris Dyer, Alon Lavie, Noah A. Smith - In Proceedings of the Association for Computational Linguistics, 2011
"... Abstract In statistical machine translation, a researcher seeks to determine whether some innovation (e.g., a new feature, model, or inference algorithm) improves translation quality in comparison to a baseline system. To answer this question, he runs an experiment to evaluate the behavior of the t ..."
Abstract - Cited by 124 (15 self)
In statistical machine translation, a researcher seeks to determine whether some innovation (e.g., a new feature, model, or inference algorithm) improves translation quality in comparison to a baseline system. To answer this question, he runs an experiment to evaluate the behavior of the two systems on held-out data. In this paper, we consider how to make such experiments more statistically reliable. We provide a systematic analysis of the effects of optimizer instability, an extraneous variable that is seldom controlled for, on experimental outcomes, and make recommendations for reporting results more accurately.

Citation Context

...00)
BLEU ↑   System A   48.4   1.6   0.2   0.5
         System B   49.9   1.5   0.1   0.4
MET ↑    System A   63.3   0.9    -    0.4
         System B   63.8   0.9    -    0.5
TER ↓    System A   30.2   1.1    -    0.6
         System B   28.7   1.0    -    0.2
WMT German-English (n = 50)
BLEU ↑   System A   18.5   0.3   0.0   0.1
         System B   18.7   0.3   0.0   0.2
MET ↑    System A   49.0   0.2    -    0.2
         System B   50.0   0.2    -    0.1
TER ↓    System A   65.5   0.4    -    0.3
         System B   64.9   0.4    -    0.4

Table 1: Measured standard deviations of different automatic metrics due to test-set and optimizer variability. sdev is reported only for the tuning objective function BLEU. Results are reported using BLEU (Papineni et al., 2002), METEOR (Banerjee and Lavie, 2005; Denkowski and Lavie, 2010), and TER (Snover et al., 2006).

4.1 Extraneous variables in one system. In this section, we describe and measure (on the example systems just described) three extraneous variables that should be considered when evaluating a translation system. We quantify these variables in terms of standard deviation s, since it is expressed in the same units as the original metric. Refer to Table 1 for the statistics. Local optima effects (sdev): The first extraneous variable we discuss is the stochasticity of the optimizer. As discussed above, different optimization runs find differ...
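
The variances the excerpt quantifies can be estimated mechanically once per-run and per-segment scores are available. A minimal sketch under simplifying assumptions (the function names are mine; it treats the corpus score as a mean of segment scores, which holds for decomposable metrics but not for corpus BLEU, where one would bootstrap the n-gram sufficient statistics instead):

    import random
    import statistics
    from typing import Sequence

    def s_test(run_scores: Sequence[float]) -> float:
        """Optimizer variability: std dev of the corpus-level metric
        across n independent tuning runs, scored on the same test set."""
        return statistics.stdev(run_scores)

    def s_sel(segment_scores: Sequence[float],
              resamples: int = 1000, seed: int = 0) -> float:
        """Test-set selection variability: bootstrap the corpus score
        by resampling segments with replacement from one run."""
        rng = random.Random(seed)
        n = len(segment_scores)
        means = [statistics.fmean(rng.choices(segment_scores, k=n))
                 for _ in range(resamples)]
        return statistics.stdev(means)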

A survey of statistical machine translation

by Adam Lopez, 2007
"... Statistical machine translation (SMT) treats the translation of natural language as a machine learning problem. By examining many samples of human-produced translation, SMT algorithms automatically learn how to translate. SMT has made tremendous strides in less than two decades, and many popular tec ..."
Abstract - Cited by 93 (6 self)
Statistical machine translation (SMT) treats the translation of natural language as a machine learning problem. By examining many samples of human-produced translation, SMT algorithms automatically learn how to translate. SMT has made tremendous strides in less than two decades, and many popular techniques have only emerged within the last few years. This survey presents a tutorial overview of state-of-the-art SMT at the beginning of 2007. We begin with the context of the current research, and then move to a formal problem description and an overview of the four main subproblems: translational equivalence modeling, mathematical modeling, parameter estimation, and decoding. Along the way, we present a taxonomy of some different approaches within these areas. We conclude with an overview of evaluation and notes on future directions.
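
The "mathematical modeling" and "decoding" subproblems the abstract enumerates are conventionally anchored in the noisy-channel formulation; the equations below are the textbook statement of that setup, not a quotation from the survey. Decoding seeks the target sentence e that maximizes the posterior given source sentence f:

    \hat{e} = \arg\max_{e} P(e \mid f) = \arg\max_{e} P(f \mid e)\, P(e)

Modern systems generalize this to a log-linear model over feature functions h_i with weights \lambda_i, so that parameter estimation covers both the component models and the weights:

    \hat{e} = \arg\max_{e} \sum_{i} \lambda_i\, h_i(e, f)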

Meteor 1.3: Automatic Metric for Reliable Optimization and Evaluation of Machine Translation Systems

by Michael Denkowski, Alon Lavie
"... This paper describes Meteor 1.3, our submission ..."
Abstract - Cited by 59 (2 self) - Add to MetaCart
This paper describes Meteor 1.3, our submission

Citation Context

...nts of translation quality as well as a more balanced Tuning version shown to outperform BLEU in minimum error rate training for a phrase-based Urdu-English system. 1 Introduction The Meteor metric (Banerjee and Lavie, 2005; Denkowski and Lavie, 2010b) has been shown to have high correlation with human judgments in evaluations such as the 2010 ACL Workshop on Statistical Machine Translation and NIST Metrics MATR (Callis...
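
For readers tracing the lineage back to the cited 2005 metric, here is a minimal sketch of the original METEOR score (Banerjee and Lavie, 2005) using exact surface matches only; the full metric adds stemming and WordNet synonymy, and Meteor 1.3 adds paraphrase matching and tunable parameters. The greedy aligner below is a simplification (METEOR searches for the alignment with the fewest chunks), and all helper names are mine.

    from typing import List, Tuple

    def _greedy_align(cand: List[str], ref: List[str]) -> List[Tuple[int, int]]:
        """Left-to-right exact unigram matching, one-to-one."""
        used = [False] * len(ref)
        pairs = []
        for i, w in enumerate(cand):
            for j, r in enumerate(ref):
                if not used[j] and r == w:
                    used[j] = True
                    pairs.append((i, j))
                    break
        return pairs

    def meteor_2005(candidate: str, reference: str) -> float:
        cand, ref = candidate.lower().split(), reference.lower().split()
        pairs = _greedy_align(cand, ref)
        m = len(pairs)
        if m == 0:
            return 0.0
        p, r = m / len(cand), m / len(ref)
        fmean = 10 * p * r / (r + 9 * p)  # recall-weighted harmonic mean
        # A chunk is a maximal run of matches contiguous in both strings;
        # fewer chunks means the matched words are in better order.
        chunks = 1
        for (i1, j1), (i2, j2) in zip(pairs, pairs[1:]):
            if i2 != i1 + 1 or j2 != j1 + 1:
                chunks += 1
        penalty = 0.5 * (chunks / m) ** 3
        return fmean * (1 - penalty)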

A Study of Translation Edit Rate with Targeted Human Annotation

by Matthew Snover, Bonnie Dorr, Richard Schwartz, Linnea Micciulla, John Makhoul - In Proceedings of the Association for Machine Translation in the Americas (AMTA 2006), 2006
"... We define a new, intuitive measure for evaluating machine translation output that avoids the knowledge intensiveness of more meaning-based approaches, and the labor-intensiveness of human judgments. Translation Error Rate (TER) measures the amount of editing that a human would have to perform to cha ..."
Abstract - Cited by 53 (4 self)
We define a new, intuitive measure for evaluating machine translation output that avoids the knowledge intensiveness of more meaning-based approaches, and the labor-intensiveness of human judgments. Translation Edit Rate (TER) measures the amount of editing that a human would have to perform to change a system output so it exactly matches a reference translation. We also compute a human-targeted TER (or HTER), where the minimum TER of the translation is computed against a human 'targeted reference' that preserves the meaning (provided by the reference translations) and is fluent, but is chosen to minimize the TER score for a particular system output. We show that: (1) The single-reference variant of TER correlates as well with human judgments of MT quality as the four-reference variant of BLEU; (2) The human-targeted HTER yields a 33% error-rate reduction and is shown to be very well correlated with human judgments; (3) The four-reference variant of TER and the single-reference variant of HTER yield higher correlations with human judgments than BLEU; (4) HTER yields higher correlations with human judgments than METEOR or its human-targeted variant (HMETEOR); and (5) The four-reference variant of TER correlates as well with a single human judgment as a second human judgment does, while HTER, HBLEU, and HMETEOR correlate significantly better with a human judgment than a second human judgment does.
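
The definition in the abstract reduces to an edit-distance computation. A hedged sketch: without the block-shift edit (TER's distinguishing move, found by greedy search in the authors' tool), the score below is ordinary word-level Levenshtein distance divided by reference length, i.e., WER; it illustrates the normalization, not the full algorithm.

    def ter_no_shifts(hypothesis: str, reference: str) -> float:
        """Word-level edit distance / reference length. Real TER also
        counts moving a contiguous phrase as a single edit."""
        hyp, ref = hypothesis.split(), reference.split()
        # Standard Levenshtein dynamic program over words.
        d = [[0] * (len(ref) + 1) for _ in range(len(hyp) + 1)]
        for i in range(len(hyp) + 1):
            d[i][0] = i
        for j in range(len(ref) + 1):
            d[0][j] = j
        for i in range(1, len(hyp) + 1):
            for j in range(1, len(ref) + 1):
                sub = d[i - 1][j - 1] + (hyp[i - 1] != ref[j - 1])
                d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
        return d[len(hyp)][len(ref)] / max(len(ref), 1)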

Diagnosing meaning errors in short answers to reading comprehension questions

by Stacey Bailey, Detmar Meurers - In Proceedings of the 3rd Workshop on Innovative Use of NLP for Building Educational Applications, held at ACL 2008. Columbus, Ohio: Association for Computational Linguistics, 2008
"... A common focus of systems in Intelligent Computer-Assisted Language Learning (ICALL) is to provide immediate feedback to language learners working on exercises. Most of this research has focused on providing feedback on the form of the learner input. Foreign language practice and second language acq ..."
Abstract - Cited by 27 (22 self)
A common focus of systems in Intelligent Computer-Assisted Language Learning (ICALL) is to provide immediate feedback to language learners working on exercises. Most of this research has focused on providing feedback on the form of the learner input. Foreign language practice and second language acquisition research, on the other hand, emphasizes the importance of exercises that require the learner to manipulate meaning. The ability of an ICALL system to diagnose and provide feedback on the meaning conveyed by a learner response depends on how well it can deal with the response variation allowed by an activity. We focus on short-answer reading comprehension questions which have a clearly defined target response but for which the learner may convey the meaning of the target in multiple ways. As the empirical basis of our work, we collected an English as a Second Language (ESL) learner corpus of short-answer reading comprehension questions, for which two graders provided target answers and correctness judgments. On this basis, we developed a Content-Assessment Module (CAM), which performs shallow semantic analysis to diagnose meaning errors. It reaches an accuracy of 88% for semantic error detection and 87% on semantic error diagnosis on a held-out test data set.

Citation Context

... the meaning of a sentence. Thus, a response generally contains multiple concepts. ... machine translation evaluation (e.g., Banerjee and Lavie, 2005; Lin and Och, 2004), paraphrase recognition (e.g., Brockett and Dolan, 2005; Hatzivassiloglou et al., 1999), and automatic grading (e.g., Leacock, 2004; Marín, 2004). To illustrate the general idea, ...

Incremental hypothesis alignment for building confusion networks with application to machine translation system combination

by Antti-Veikko I. Rosti, Bing Zhang, Spyros Matsoukas, Richard Schwartz - In Proceedings of the Third Workshop on Statistical Machine Translation, 2008
"... Confusion network decoding has been the most successful approach in combining outputs from multiple machine translation (MT) systems in the recent DARPA GALE and NIST Open MT evaluations. Due to the varying word order between outputs from different MT systems, the hypothesis alignment presents the b ..."
Abstract - Cited by 26 (1 self)
Confusion network decoding has been the most successful approach in combining outputs from multiple machine translation (MT) systems in the recent DARPA GALE and NIST Open MT evaluations. Due to the varying word order between outputs from different MT systems, the hypothesis alignment presents the biggest challenge in confusion network decoding. This paper describes an incremental alignment method to build confusion networks based on the translation edit rate (TER) algorithm. This new algorithm yields significant BLEU score improvements over other recent alignment methods on the GALE test sets and was used in BBN's submission to the WMT08 shared translation task.

Citation Context

... the Arabic GALE Phase 2 evaluation setup are first presented. The translation quality is measured by three MT evaluation metrics: TER (Snover et al., 2006), BLEU (Papineni et al., 2002), and METEOR (Lavie and Agarwal, 2007). 3.1 Results on Arabic GALE Outputs For the Arabic GALE Phase 2 evaluation, nine systems were combined. Five systems were phrase-based, two hierarchical, one syntax-based, and one rule-based. All sta...
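
To make the data structure concrete: a confusion network is a linear sequence of slots, each holding alternative words (including an empty arc) with votes from the aligned system outputs. The toy sketch below assumes the alignment step is already done, which is exactly the hard part the paper addresses with incremental TER-based alignment; names and data are illustrative, and real systems weight arcs by system confidences rather than raw counts.

    from collections import Counter
    from typing import List

    def consensus(slots: List[Counter]) -> List[str]:
        """Pick the majority word in each slot; drop empty (epsilon) arcs."""
        out = []
        for votes in slots:
            word, _ = votes.most_common(1)[0]
            if word:
                out.append(word)
        return out

    # Three system outputs aligned slot-by-slot to a skeleton hypothesis
    # ("" marks an epsilon arc where a system had no word in that slot).
    aligned = [
        ["the", "cat", "sat",  ""],
        ["the", "cat", "sits", "down"],
        ["a",   "cat", "sat",  "down"],
    ]
    slots = [Counter(col) for col in zip(*aligned)]
    print(consensus(slots))  # ['the', 'cat', 'sat', 'down']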

The RWTH statistical machine translation system for the IWSLT 2006 evaluation

by Arne Mauser, Richard Zens, Evgeny Matusov, Hermann Ney - In Proc. Int. Workshop on Spoken Language Translation, 2006
"... We give an overview of the RWTH phrase-based statistical machine translation system that was used in the evaluation campaign of the International Workshop on Spoken Language Translation (IWSLT) 2006. The system was ranked first with respect to the BLEU measure in all language pairs it was used Using ..."
Abstract - Cited by 24 (15 self)
We give an overview of the RWTH phrase-based statistical machine translation system that was used in the evaluation campaign of the International Workshop on Spoken Language Translation (IWSLT) 2006. The system was ranked first with respect to the BLEU measure in all language pairs in which it was used. Using a two-pass approach, we first generate the N best translation candidates. The second pass consists of rescoring and reranking these candidates. We will give a description of the search algorithm as well as of the models used in each pass. We will also describe our method for dealing with punctuation restoration, in order to overcome the difficulties of spoken language translation. This work also includes a brief description of the system combination done by the partners participating in the European TC-Star project.

Citation Context

...ll the experiments, we report the two accuracy measures BLEU [19] and NIST [20] as well as the two error rates WER and PER. For the primary submissions, we also report the accuracy measure Meteor [21]. All those criteria are computed with respect to multiple references. 5.1. Primary submissions The translation results of the RWTH primary submissions are summarized in Table 4. 5.2. Analysis of the ...
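
The two-pass setup the abstract describes can be pictured in a few lines: the first pass emits N candidate translations with per-model feature scores, and the second pass reranks them under a log-linear combination. Feature names and weights below are placeholders, not RWTH's actual models.

    from typing import Dict, List

    def rescore_nbest(nbest: List[Dict[str, float]],
                      weights: Dict[str, float]) -> int:
        """Return the index of the best candidate: argmax of the
        weighted sum of (log-domain) feature scores."""
        def score(feats: Dict[str, float]) -> float:
            return sum(weights[name] * val for name, val in feats.items())
        return max(range(len(nbest)), key=lambda i: score(nbest[i]))

    candidates = [
        {"lm": -12.3, "tm": -8.1, "length": -7.0},
        {"lm": -10.9, "tm": -9.4, "length": -6.0},
    ]
    weights = {"lm": 1.0, "tm": 0.8, "length": 0.2}
    print(rescore_nbest(candidates, weights))  # 1

In practice the weights are tuned on held-out data (e.g., by minimum error rate training) against a metric such as BLEU.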

Dependency-Based Automatic Evaluation for Machine Translation

by Karolina Owczarzak, Josef van Genabith, Andy Way - In Proceedings of SSST, NAACL-HLT/AMTA Workshop on Syntax and Structure in Statistical Translation, 2007
"... We present a novel method for evaluating the output of Machine Translation (MT), based on comparing the dependency structures of the translation and reference rather than their surface string forms. Our method uses a treebank-based, widecoverage, probabilistic Lexical-Functional Grammar (LFG) parser ..."
Abstract - Cited by 23 (1 self)
We present a novel method for evaluating the output of Machine Translation (MT), based on comparing the dependency structures of the translation and reference rather than their surface string forms. Our method uses a treebank-based, wide-coverage, probabilistic Lexical-Functional Grammar (LFG) parser to produce a set of structural dependencies for each translation-reference sentence pair, and then calculates the precision and recall for these dependencies. Our dependency-based evaluation, in contrast to most popular string-based evaluation metrics, will not unfairly penalize perfectly valid syntactic variations in the translation. In addition to allowing for legitimate syntactic differences, we use paraphrases in the evaluation process to account for lexical variation. In comparison with other metrics on 16,800 sentences of Chinese-English newswire text, our method reaches high correlation with human scores. An experiment with two translations of 4,000 sentences from Spanish-English Europarl shows that, in contrast to most other metrics, our method does not display a high bias towards statistical models of translation.

Citation Context

... Comparing the LFG-based evaluation method with other popular metrics: BLEU, NIST, General Text Matcher (GTM) (Turian et al., 2003), Translation Error Rate (TER) (Snover et al., 2006), and METEOR (Banerjee and Lavie, 2005), we show that combining dependency representations with paraphrases leads to a more accurate evaluation that correlates better with human judgment. The remainder of this paper is organized as follow...
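
The core computation the abstract describes is small: parse both sentences, collect (relation, head, dependent) triples, and score precision and recall of the hypothesis triples against the reference triples. A sketch with hand-written triples standing in for LFG parser output (the triple format and the example are illustrative):

    from typing import Set, Tuple

    Triple = Tuple[str, str, str]  # (relation, head, dependent)

    def dependency_f1(hyp: Set[Triple], ref: Set[Triple]) -> float:
        """F1 over dependency triples of hypothesis vs. reference."""
        matched = len(hyp & ref)
        if matched == 0:
            return 0.0
        p, r = matched / len(hyp), matched / len(ref)
        return 2 * p * r / (p + r)

    # "Yesterday, John resigned" vs. "John resigned yesterday": the
    # surface strings differ, so n-gram metrics penalize the pair, but
    # the dependency triples coincide and the score is a perfect 1.0.
    ref = {("subj", "resign", "john"), ("adj", "resign", "yesterday")}
    hyp = {("adj", "resign", "yesterday"), ("subj", "resign", "john")}
    print(dependency_f1(hyp, ref))  # 1.0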

Training a Multilingual Sportscaster: Using Perceptual Context to Learn Language

by David L. Chen, Joohyun Kim, Raymond J. Mooney - Journal of Artificial Intelligence Research, 2010
"... We present a novel framework for learning to interpret and generate language using only perceptual context as supervision. We demonstrate its capabilities by developing a system that learns to sportscast simulated robot soccer games in both English and Korean without any language-specific prior know ..."
Abstract - Cited by 23 (4 self)
We present a novel framework for learning to interpret and generate language using only perceptual context as supervision. We demonstrate its capabilities by developing a system that learns to sportscast simulated robot soccer games in both English and Korean without any language-specific prior knowledge. Training employs only ambiguous supervision consisting of a stream of descriptive textual comments and a sequence of events extracted from the simulation trace. The system simultaneously establishes correspondences between individual comments and the events that they describe while building a translation model that supports both parsing and generation. We also present a novel algorithm for learning which events are worth describing. Human evaluations of the generated commentaries indicate they are of reasonable quality and in some cases even on par with those produced by humans for our limited domain.

Citation Context

... using higher-order N-grams. Commentaries in our domain are often short, so there are frequently no higher-order N-gram matches between generated sentences and target NL sentences. The METEOR metric (Banerjee & Lavie, 2005) was designed to resolve various weaknesses of the BLEU and NIST metrics, and it is more focused on word-to-word matches between the reference ...
