## BLEUSP, INVWER, CDER: Three improved MT evaluation measures

### BibTeX

@MISC{Leusch_bleusp,invwer,,
    author = {Gregor Leusch and Hermann Ney},
    title = {BLEUSP, INVWER, CDER: Three improved MT evaluation measures},
    year = {}
}

### Abstract

We present three modifications of well-established automatic machine translation evaluation measures, intended to improve the correlation between those measures and human evaluation. Following Lin & Och, we present an improved version of the BLEU score, which uses a smoothed geometric mean for combining different n-gram precisions. We use segment boundary markers to increase the weight of words near the segment boundaries in the BLEU score. Our second MT evaluation measure is a variant of the WER which allows for block movements, but does not demand complete and disjoint coverage of the source sentence. As this might be problematic if MT systems are tuned on this score, we later investigate a linear combination of this measure with PER. Finally, we describe an edit distance similar to TER, which also allows for block reordering. Our measure uses a full search, but with the constraint that block operations must be bracketed. We describe this measure using a Bracketing Transduction Grammar, and sketch a polynomial-time algorithm for its calculation. We also modify the WER-like measures such that they use word-dependent substitution costs instead of fixed ones to model the similarity between words. Experimental comparison of these measures shows that our new measures correlate significantly better with human judgment than the original measures.
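To make the first idea concrete, the following is a minimal Python sketch of a segment-level BLEU variant with a smoothed geometric mean and segment boundary markers. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not give the exact smoothing scheme or padding scheme, so the add-one smoothing of every n-gram precision, the padding with `n - 1` markers on each side, the marker tokens `<s>` and `</s>`, and the function names `ngrams` and `smoothed_bleu` are all hypothetical choices. A single reference and the standard brevity penalty are also assumed.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def smoothed_bleu(candidate, reference, max_n=4, pad=True):
    """Segment-level BLEU sketch with add-one smoothed n-gram precisions.

    Assumptions (not taken from the paper): add-one smoothing on every
    order n, a single reference, and, if pad is True, n - 1 boundary
    markers on each side so that every word occurs in exactly n n-grams.
    """
    if not candidate:
        return 0.0
    log_precision_sum = 0.0
    for n in range(1, max_n + 1):
        if pad and n > 1:
            cand = ["<s>"] * (n - 1) + candidate + ["</s>"] * (n - 1)
            ref = ["<s>"] * (n - 1) + reference + ["</s>"] * (n - 1)
        else:
            cand, ref = candidate, reference
        cand_counts = ngrams(cand, n)
        ref_counts = ngrams(ref, n)
        # Clipped matches: each candidate n-gram counts at most as often
        # as it occurs in the reference.
        matches = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        # Add-one smoothing keeps the geometric mean non-zero even when
        # a higher-order n-gram has no match in a short segment.
        log_precision_sum += math.log((matches + 1) / (total + 1))
    # Standard brevity penalty on the unpadded lengths.
    if len(candidate) >= len(reference):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(reference) / len(candidate))
    return brevity_penalty * math.exp(log_precision_sum / max_n)


reference = "a b c d e".split()
boundary_error = "x b c d e".split()  # wrong word at the segment boundary
middle_error = "a b x d e".split()    # wrong word in the middle
```

Without padding, a word at the segment boundary takes part in fewer higher-order n-grams than a word in the middle, so in this sketch `smoothed_bleu(boundary_error, reference, pad=False)` comes out higher than the same call for `middle_error`. With padding, both errors affect the same number of n-grams and the two scores coincide, which is one way to read the abstract's remark about increasing the weight of words near the segment boundaries.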

