## BLEUSP, INVWER, CDER: Three improved MT evaluation measures

### BibTeX

@MISC{Leusch_bleusp,invwer,,
    author = {Gregor Leusch and Hermann Ney},
    title = {BLEUSP, INVWER, CDER: Three improved MT evaluation measures},
    year = {}
}

### Abstract

We present three modifications of well-established automatic machine translation evaluation measures to improve the correlation between those measures and human evaluation. Following Lin & Och, we present an improved version of the BLEU score, which uses a smoothed geometric mean for combining different n-gram precisions. We use segment boundary markers to increase the weight of words near the segment boundaries in the BLEU score. Our second MT evaluation measure is a variant of the WER which allows for block movements, but does not demand complete and disjoint coverage of the source sentence. As this might be problematic if MT systems are tuned on this score, we later investigate a linear combination of this measure with PER. Finally, we describe an edit distance similar to TER, which also allows for block reordering. Our measure uses a full search, but with the constraint that block operations must be bracketed. We describe this measure using a Bracketing Transduction Grammar, and sketch a polynomial-time algorithm for its calculation. We also modify the WER-like measures such that they use word-dependent substitution costs instead of fixed ones to model the similarity between words. Experimental comparison of these measures shows that our new measures correlate significantly better with human judgment than the original measures.
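As an illustration of the word-dependent substitution costs mentioned in the abstract, the following is a minimal sketch of a WER-like measure whose substitution cost depends on how similar two words are spelled. The function names (`char_edit_distance`, `substitution_cost`, `weighted_wer`) and the particular cost (character edit distance divided by the longer word's length) are assumptions made for this example; the abstract does not specify the cost function the authors used.

```python
from typing import List


def char_edit_distance(a: str, b: str) -> int:
    """Classical Levenshtein distance between two strings (unit costs)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # match / substitute
        prev = cur
    return prev[-1]


def substitution_cost(e: str, e_tilde: str) -> float:
    """Hypothetical spelling-based cost in [0, 1]; 0 for identical words."""
    if e == e_tilde:
        return 0.0
    return char_edit_distance(e, e_tilde) / max(len(e), len(e_tilde))


def weighted_wer(candidate: List[str], reference: List[str]) -> float:
    """Word-level edit distance with word-dependent substitution costs,
    normalized by the reference length. Insertions and deletions cost 1."""
    if not reference:
        raise ValueError("reference must not be empty")
    prev = [float(j) for j in range(len(reference) + 1)]
    for i, cw in enumerate(candidate, 1):
        cur = [float(i)]
        for j, rw in enumerate(reference, 1):
            cur.append(min(prev[j] + 1.0,
                           cur[j - 1] + 1.0,
                           prev[j - 1] + substitution_cost(cw, rw)))
        prev = cur
    return prev[-1] / len(reference)
```

With this cost, replacing "cat" by "cats" is penalized far less than replacing it by an unrelated word, which is the intended effect of modelling word similarity.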

### Citations

1221 | Binary codes capable of correcting deletions, insertions, and reversals - Levenshtein - 1966
Citation Context ...rations are depicted. Only long jump edges from the best path are drawn. words should not be penalized too hard by an MT evaluation measure. WER, which is based on the classical Levenshtein distance (Levenshtein, 1966), penalizes block reorderings rather hard – each word that has been shifted usually needs to be deleted in its old position, and inserted in its new position. One approach here is to extend the Leven...

280 | Rank Correlation Methods - Kendall, Gibbons - 1990
Citation Context ...d to all translations in the corpus by a human judge. We measured the absolute prediction as Pearson’s correlation coefficient r (Casella and Berger, 1990), and the ranking capability as Kendall’s τ (Kendall, 1970). The latter has the big advantage over other coefficients like Spearman’s ρ that it handles ties in a well-defined and reasonable manner. As there are only seven different outcomes for adequacy, but...

150 | The Stanford GraphBase: A platform for combinatorial computing - Knuth - 1993
Citation Context ...m is to replace $c_{SUB}$ by $c_{SUB}(e, \tilde{e})$ in Subsection 4.2. For PER, it is no longer possible to use a linear time algorithm in the general case. Instead, we use a modification of the Hungarian algorithm (Knuth, 1993). The question is now how to define the word-dependent substitution costs. A pragmatic approach is to compare the spelling of the words to be substituted with each other. The more similar the spelling...

52 | Evaluation of machine translation and its evaluation - Turian, Shen, et al. - 2003
Citation Context ...achieve polynomial run-time is to restrict the number of admissible block permutations, for example as in Section 4. Alternatively, a heuristic or approximative distance can be calculated, as in GTM (Turian et al., 2003). An implementation of both approaches at the same time can be found in TER. In the following section we will present another approach which has a suitable run-time, while still maintaining completen...

49 | Block edit models for approximate string matching - Lopresti, Tomkins - 1997

42 | Improved word-level system combination for machine translation - Rosti, Matsoukas, et al. - 2007

34 | A Study of Translation Error Rate with Targeted Human Annotation - Snover, Dorr, et al. - 2005
Citation Context ...ted in its old position, and inserted in its new position. One approach here is to extend the Levenshtein distance by an additional operation, namely block movement (or shift, as it is called in TER (Snover et al., 2005)). Note that the number of blocks in a sentence is equal to the number of gaps among the blocks plus one. Thus, the block movements can equivalently be expressed as long jump operations that jump ove...

33 | A novel string-to-string distance measure with applications to machine translation evaluation - Leusch, Ueffing, et al. - 2003
Citation Context ... (INVWER), by normalizing it by the reference length. The distance can be calculated by an algorithm similar to a 2-dimensional CYK algorithm in time $O(I^3 J^3)$ and space $O(I^2 J^2)$, as described in (Leusch et al., 2003). Because the algorithm has basically a time complexity in $\Theta(I^6)$ if $I \approx J$, it can become quite slow for long sentences. Because of this, we split sentences longer than 30 words, parallel in candidat...

27 | CDER: Efficient MT Evaluation Using Block Movements - Leusch, Ueffing, et al. - 2006
Citation Context ...l solution can be solved in $O(I^2 \cdot J)$ time, where $I$ is the length of the candidate sentence and $J$ the length of the reference sentence. Within this paper, we will refer to this distance as $d_{CD}$. In (Leusch et al., 2006) we showed how it can be computed in $O(I \cdot J)$ time using a modification of the Levenshtein algorithm. We also studied the reverse direction of the described measure; that is, we dropped the coverage ...

21 | Incremental hypothesis alignment for building confusion networks with application to machine translation system combination - Rosti, Zhang, et al. - 2008

18 | A Re-examination of Machine Learning Approaches for Sentence-Level MT Evaluation - Albrecht, Hwa - 2007

18 | Orange: a Method for Evaluating Automatic Evaluation Metrics for Machine Translation - Lin, Och

16 | Bleu: a method for automatic evaluation of machine translation - Papineni, Roukos, et al. - 2001

8 | Improving alignments for better confusion networks for combining machine translation systems - Ayan, Zheng, et al. - 2008

8 | Preprocessing and normalization for automatic evaluation of machine translation - Leusch, Ueffing, et al. - 2005

5 | Automatic evaluation measures for statistical machine translation system optimization - Mauser, Hasan, et al. - 2008
Citation Context ...ombination of up to 53 measures and features in a regression model and a classifier as evaluation measure. Also, a linear combination of BLEU and TER has been successfully used for tuning MT systems (Mauser et al., 2008; Rosti et al., 2008). In our approach, we only are interested in the linear combination of two MT evaluation measures, particularly the combination of CDER and PER. We expect this combination to have...
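
The last citation context mentions a linear combination of two evaluation measures, CDER and PER. The sketch below shows what such a combination could look like. It assumes a common bag-of-words formulation of PER (position-independent error rate), which is not quoted from the paper, and it takes the other score (for example a CDER value computed elsewhere) as a plain number. The names `per` and `combine` and the default weight of 0.5 are hypothetical choices for this example.

```python
from collections import Counter
from typing import List


def per(candidate: List[str], reference: List[str]) -> float:
    """Position-independent error rate under a bag-of-words assumption:
    words match regardless of position, and surplus or missing words
    count as errors. Normalized by the reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    # Multiset intersection counts words shared by both sentences.
    matches = sum((Counter(candidate) & Counter(reference)).values())
    errors = max(len(candidate), len(reference)) - matches
    return errors / len(reference)


def combine(score_a: float, score_b: float, weight: float = 0.5) -> float:
    """Linear interpolation of two error measures; `weight` applies to
    score_a and (1 - weight) to score_b. The weight is an assumption."""
    return weight * score_a + (1.0 - weight) * score_b
```

For instance, `combine(cder_score, per(candidate, reference), weight)` would interpolate an externally computed CDER score with PER; how the weight should be chosen is not stated in the excerpt above.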