Automatic Evaluation of Machine Translation Quality Using Longest Common Subsequence and Skip-Bigram Statistics
, 2004
Cited by 53 (1 self)
In this paper we describe two new objective automatic evaluation methods for machine translation. The first method is based on the longest common subsequence between a candidate translation and a set of reference translations. Longest common subsequence takes sentence-level structure similarity into account naturally and identifies the longest co-occurring in-sequence n-grams automatically. The second method relaxes strict n-gram matching to skip-bigram matching. A skip-bigram is any pair of words in their sentence order. Skip-bigram co-occurrence statistics measure the overlap of skip-bigrams between a candidate translation and a set of reference translations. The empirical results show that both methods correlate with human judgments very well in both adequacy and fluency.
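The two quantities named in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes whitespace-tokenized sentences and a single reference, and the function names (`lcs_length`, `skip_bigrams`, `skip_bigram_overlap`) and the example sentences are hypothetical. It computes only the raw LCS length and the raw skip-bigram overlap count, not the normalized scores a full metric would report.

```python
from collections import Counter
from itertools import combinations


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def skip_bigrams(tokens):
    """Multiset of all word pairs in sentence order, with any gap between them."""
    return Counter(combinations(tokens, 2))


def skip_bigram_overlap(candidate, reference):
    """Number of skip-bigrams shared by candidate and reference (clipped counts)."""
    shared = skip_bigrams(candidate) & skip_bigrams(reference)
    return sum(shared.values())


# Illustrative sentences, assumed for this sketch only.
reference = "police killed the gunman".split()
candidate = "police kill the gunman".split()

print(lcs_length(candidate, reference))           # 3
print(skip_bigram_overlap(candidate, reference))  # 3
```

Here the common subsequence is "police the gunman", and the three shared skip-bigrams are (police, the), (police, gunman) and (the, gunman).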
The Significance of Recall in Automatic Metrics for MT Evaluation
- In Proceedings of the 6th Conference of the Association for Machine Translation in the Americas (AMTA-2004), 2004
Cited by 20 (5 self)
Recent research has shown that a balanced harmonic mean (F1 measure) of unigram precision and recall outperforms the widely used BLEU and NIST metrics for Machine Translation evaluation in terms of correlation with human judgments of translation quality. We show that significantly better correlations can be achieved by placing more weight on recall than on precision. While this may seem unexpected, since BLEU and NIST focus on n-gram precision and disregard recall, our experiments show that correlation with human judgments is highest when almost all of the weight is assigned to recall. We also show that stemming is significantly beneficial not just to simpler unigram precision and recall based metrics, but also to BLEU and NIST.
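The idea of shifting weight from precision to recall can be sketched as a weighted harmonic mean. This is a minimal illustration, not the paper's code: the parametrization (a single weight `alpha` on recall, with 0.5 giving the balanced F1), the function names, and the example sentences are all assumptions made for this sketch, and it uses plain unigram matching without stemming.

```python
from collections import Counter


def unigram_pr(candidate, reference):
    """Unigram precision and recall using clipped match counts."""
    matches = sum((Counter(candidate) & Counter(reference)).values())
    precision = matches / len(candidate) if candidate else 0.0
    recall = matches / len(reference) if reference else 0.0
    return precision, recall


def weighted_f(precision, recall, alpha=0.5):
    """Weighted harmonic mean: 1/F = alpha/recall + (1 - alpha)/precision.

    alpha is the weight on recall (an assumed parametrization);
    alpha = 0.5 is the balanced F1, alpha near 1 is dominated by recall.
    """
    denom = alpha * precision + (1 - alpha) * recall
    return precision * recall / denom if denom else 0.0


# Illustrative sentences, assumed for this sketch only.
candidate = "the cat sat".split()
reference = "the cat sat on the mat".split()

p, r = unigram_pr(candidate, reference)
print(p, r)                         # 1.0 0.5
print(weighted_f(p, r, alpha=0.5))  # balanced F1
print(weighted_f(p, r, alpha=0.9))  # most of the weight on recall
```

With a short candidate that is fully contained in the reference, precision is perfect but recall is low, so the score drops as `alpha` moves toward recall.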
Extending the Bleu MT evaluation method with frequency weightings
- In Proceedings of ACL
, 2004
Cited by 17 (3 self)
We present the results of an experiment on extending the automatic method of Machine Translation evaluation BLEU with statistical weights for lexical items, such as tf.idf scores. We show that this extension gives additional information about evaluated texts; in particular it allows us to measure translation Adequacy, which, for statistical MT systems, is often overestimated by the baseline BLEU method. The proposed model uses a single human reference translation, which increases the usability of the proposed method for practical purposes. The model suggests a linguistic interpretation which relates frequency weights and human intuition about translation Adequacy and Fluency.
BLANC: Learning Evaluation Metrics for MT
, 2005
Cited by 12 (1 self)
We introduce BLANC, a family of dynamic, trainable evaluation metrics for machine translation. Flexible, parametrized models can be learned from past data and automatically optimized to correlate well with human judgments for different criteria (e.g. adequacy, fluency) using different correlation measures. Towards this end, we discuss ACS (all common skip-ngrams), a practical algorithm with trainable parameters that estimates reference-candidate translation overlap by computing a weighted sum of all common skip-ngrams in polynomial time. We show that the BLEU and ROUGE metric families are special cases of BLANC, and we compare correlations with human judgments across these three metric families. We analyze the algorithmic complexity of ACS and argue that it is more powerful in modeling both local meaning and sentence-level structure, while offering the same practicality as the established algorithms it generalizes.
ORANGE: a Method for Evaluating Automatic Evaluation Metrics for Machine Translation
, 2004
Cited by 7 (0 self)
Comparisons of automatic evaluation metrics for machine translation are usually conducted on corpus level using correlation statistics such as Pearson’s product moment correlation coefficient or Spearman’s rank order correlation coefficient between human scores and automatic scores. However, such comparisons rely on human judgments of translation qualities such as adequacy and fluency. Unfortunately, these judgments are often inconsistent and very expensive to acquire. In this paper, we introduce a new evaluation method, ORANGE, for evaluating automatic machine translation evaluation metrics automatically without extra human involvement other than using a set of reference translations. We also show the results of comparing several existing automatic metrics and three new automatic metrics using ORANGE.
Towards the evaluation of referring expression generation
- In Proceedings of the 4th Australasian Language Technology Workshop
Cited by 3 (3 self)
The Natural Language Generation community is currently engaged in discussion as to whether and how to introduce one or several shared evaluation tasks, as are found in other fields of Natural Language Processing. As one of the most well-defined subtasks in NLG, the generation of referring expressions looks like a strong candidate for piloting such shared tasks. Based on our earlier evaluation of a number of existing algorithms for the generation of referring expressions, we explore in this paper some problems that arise in designing an evaluation task in this field, and try to identify general considerations that need to be met in evaluating generation subtasks.
Evaluation in natural language generation: Lessons from referring expression generation. Traitement Automatique des Langues 48(1)
, 2007
Cited by 3 (0 self)
ABSTRACT. As one of the most well-defined subtasks in Natural Language Generation (NLG), the generation of referring expressions looks like a strong candidate for piloting shared evaluation tasks. Unlike in other areas of Natural Language Processing, it is still unclear what benefit the introduction of such tasks might have for the field of NLG. Based on an earlier evaluation of a number of well-established algorithms for the generation of referring expressions, this paper explores several problems that arise in designing evaluation for this task, and identifies general considerations that need to be met in evaluating Natural Language Generation subtasks. RÉSUMÉ. The generation of referring expressions, one of the best-defined subtasks of natural language generation, appears to be a serious candidate for setting up shared evaluation tasks, in an area of natural language processing where the value of such tasks remains an open question. Based on the results of an evaluation of some of the main known algorithms for the generation of referring expressions, this article explores several problems raised by evaluation and presents some general considerations to take into account when evaluating natural language generation subtasks.
A Fluency Error Categorization Scheme to Guide Automated Machine Translation Evaluation. AMTA: Machine Translation: From Real Users to Research
, 2004
Cited by 1 (0 self)
Existing automated MT evaluation methods often require expert human translations. These are produced for every language pair evaluated and, due to this expense, subsequent evaluations tend to rely on the same texts, which do not necessarily reflect real MT use. In contrast, we are designing an automated MT evaluation system, intended for use by post-editors, purchasers and developers, that requires nothing but the raw MT output. Furthermore, our research is based on texts that reflect corporate use of MT. This paper describes our first step in system design: a hierarchical classification scheme of fluency errors in English MT output, to enable us to identify error types and frequencies, and guide the selection of errors for automated detection. We present results from the statistical analysis of 20,000 words of MT output, manually annotated using our classification scheme, and describe correlations between error frequencies and human scores for fluency and adequacy.
Extending MT evaluation tools with translation complexity metrics
Cited by 1 (1 self)
In this paper we report on the results of an experiment in designing resource-light metrics that predict the potential translation complexity of a text or a corpus of homogeneous texts for state-of-the-art MT systems. We show that the best prediction of translation complexity is given by the average number of syllables per word (ASW). The translation complexity metrics based on this parameter are used to normalise automated MT evaluation scores such as BLEU, which otherwise are variable across texts of different types. The suggested approach makes a fairer comparison between the MT systems evaluated on different corpora. The translation complexity metric was integrated into two automated MT evaluation packages – BLEU and the Weighted N-gram model. The extended MT evaluation tools are available from the first author’s web site.

