## Human Evaluation of Machine Translation Through Binary System Comparisons

Citations: 3 (0 self)

### BibTeX

    @MISC{Vilar_humanevaluation,
      author = {David Vilar and Gregor Leusch and Rafael E. Banchs},
      title = {Human Evaluation of Machine Translation Through Binary System Comparisons},
      year = {}
    }

### Abstract

We introduce a novel evaluation scheme for the human evaluation of different machine translation systems. Our method is based on direct comparison of two sentences at a time by human judges. These binary judgments are then used to decide between all possible rankings of the systems. The advantages of this new method are the lower dependency on extensive evaluation guidelines, and a tighter focus on a typical evaluation task, namely the ranking of systems. Furthermore we argue that machine translation evaluations should be regarded as statistical processes, both for human and automatic evaluation. We show how confidence ranges for state-of-the-art evaluation measures such as WER and TER can be computed accurately and efficiently without having to resort to Monte Carlo estimates. We give an example of our new evaluation scheme, as well as a comparison with classical automatic and human evaluation on data from a recent international evaluation campaign.
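As a rough illustration of the scheme the abstract describes, here is a minimal Python sketch in which a standard sorting routine queries a comparator in place of the human judge. The system names and the quality table are invented for the example; in the real scheme each comparison would aggregate sentence-level binary judgments by human evaluators.

```python
# Minimal sketch: ranking MT systems via binary comparisons.
# The comparator stands in for a human judge; names and scores are invented.
import functools

def judge(system_a: str, system_b: str) -> int:
    """Stand-in for a human judge: -1 if system_a is preferred, 1 otherwise.
    Here decided by a fixed (fake) quality table."""
    quality = {"sys1": 0.42, "sys2": 0.55, "sys3": 0.37}
    return -1 if quality[system_a] > quality[system_b] else 1

systems = ["sys1", "sys2", "sys3"]
# Any comparison-based sort yields a full ranking from binary judgments.
ranking = sorted(systems, key=functools.cmp_to_key(judge))
print(ranking)  # ['sys2', 'sys1', 'sys3'], best to worst
```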

### Citations

2532 | An introduction to the bootstrap
- Efron, Tibshirani
- 1993
Citation Context: …the total evaluation score for a binary comparison of systems X and Y is $R_{X,Y} := \frac{1}{m}\sum_{i=1}^{m} r_{i,X,Y}$ (3), with $m$ the number of evaluated sentences. For this case, namely $R$ being an arithmetic mean, (Efron and Tibshirani, 1993) gives an explicit formula for the estimated standard error of the score $R_{X,Y}$. To simplify the notation, we will use $R$ instead of $R_{X,Y}$ from now on, and $r_i$ instead of $r_{i,X,Y}$. $se[R] = \frac{1}{\sqrt{m-1}}\,$…
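The standard-error formula in this excerpt is cut off. A minimal sketch of both computations, with made-up judgment data, assuming $se[R]$ takes the usual plug-in form for the standard error of a mean, $\sqrt{\sum_i (r_i - R)^2 / (m(m-1))}$:

```python
import math

def mean_and_se(r):
    """Score R = arithmetic mean of per-sentence binary judgments r_i,
    and the plug-in standard error of the mean:
    se[R] = sqrt( sum_i (r_i - R)^2 / (m * (m - 1)) )."""
    m = len(r)
    R = sum(r) / m
    se = math.sqrt(sum((x - R) ** 2 for x in r) / (m * (m - 1)))
    return R, se

# Invented data: 1 = judge preferred system X, 0 = judge preferred system Y
r = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]
R, se = mean_and_se(r)
print(round(R, 2), round(se, 3))  # 0.7 0.153
```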

2195 | The Art of Computer Programming
- Knuth
- 2000
Citation Context: …binary comparisons are very time consuming, we want to minimize the absolute number of comparisons needed. This minimization should be carried out in the strict sense, not just in an asymptotic manner. (Knuth, 1973) discusses this issue in detail. It is relatively straightforward to show that, in the worst case, the minimum number of comparisons to be carried out to sort n elements is at least ⌈log n!⌉ (for which…
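The ⌈log₂ n!⌉ lower bound mentioned here is easy to evaluate directly; a small sketch:

```python
import math

def comparison_lower_bound(n: int) -> int:
    """Information-theoretic lower bound on the worst-case number of
    comparisons needed to sort n elements: ceil(log2(n!))."""
    return math.ceil(math.log2(math.factorial(n)))

for n in (3, 5, 12, 13):
    print(n, comparison_lower_bound(n))  # 3 3 / 5 7 / 12 29 / 13 33
```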

1469 | BLEU: A Method for Automatic Evaluation of Machine Translation
- Papineni, Roukos, et al.
- 2002
Citation Context: …can be computed. The most widely known are the Word Error Rate (WER), the Position-independent word Error Rate (PER), the NIST score (Doddington, 2002) and, especially in recent years, the BLEU score (Papineni et al., 2002) and the Translation Error Rate (TER) (Snover et al., 2005). All of these measures compare the system output with one or more gold-standard references and produce a numerical value (score or error rate)…
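Of the measures listed here, WER has the simplest definition: the word-level Levenshtein distance between hypothesis and reference, normalized by the reference length. A minimal sketch for a single reference, with no preprocessing or normalization:

```python
def wer(hypothesis: str, reference: str) -> float:
    """Word Error Rate: word-level edit distance divided by the number
    of reference words (single reference, whitespace tokenization)."""
    hyp, ref = hypothesis.split(), reference.split()
    # dp[i][j] = edit distance between hyp[:i] and ref[:j]
    dp = [[0] * (len(ref) + 1) for _ in range(len(hyp) + 1)]
    for i in range(len(hyp) + 1):
        dp[i][0] = i
    for j in range(len(ref) + 1):
        dp[0][j] = j
    for i in range(1, len(hyp) + 1):
        for j in range(1, len(ref) + 1):
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution/match
    return dp[len(hyp)][len(ref)] / len(ref)

# One missing word against a six-word reference: WER = 1/6
print(wer("the cat sat on mat", "the cat sat on the mat"))
```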

309 | Automatic Evaluation of Machine Translation Quality Using N-gram Co-occurrence Statistics
- Doddington
- 2002
Citation Context: …automatic measures which try to assess the quality of the translation can be computed. The most widely known are the Word Error Rate (WER), the Position-independent word Error Rate (PER), the NIST score (Doddington, 2002) and, especially in recent years, the BLEU score (Papineni et al., 2002) and the Translation Error Rate (TER) (Snover et al., 2005). All of these measures compare the system output with one or more…

68 | Re-evaluating the role of BLEU in machine translation research
- Callison-Burch, Osborne, et al.
- 2006
Citation Context: …produced one. Once such reference translations are available, the evaluation can be carried out in a quick, efficient and reproducible manner. However, automatic measures also have big disadvantages; (Callison-Burch et al., 2006) describes some of them. A major problem is that a given sentence in one language can have several correct translations in another language and thus, the measure of similarity with one or even a small…

64 | Manual and Automatic Evaluation of Machine Translation between European Languages
- Koehn, Monz
- 2006

47 | Bootstrap estimates for confidence intervals in ASR performance evaluation
- Bisani, Ney
- 2004
Citation Context: …below), or from the estimate for the comparison system R′. A universal estimation method is the bootstrap estimate: the core idea is to create replications of R by random sampling from the data set (Bisani and Ney, 2004). Bootstrapping is generally possible for all evaluation measures. With a high number of replicates, se[R] and E[R0] can be estimated with satisfactory precision. For a certain class of evaluation measures…
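A minimal sketch of the bootstrap idea described here, using invented per-sentence scores: resample the scores with replacement, recompute the mean score R on each replicate, and take the standard deviation across replicates as the estimate of se[R].

```python
import random
import statistics

def bootstrap_se(r, n_replicates=1000, seed=0):
    """Bootstrap estimate of the standard error of the mean score R:
    resample with replacement, recompute the mean per replicate,
    and return the standard deviation of the replicate means."""
    rng = random.Random(seed)
    m = len(r)
    replicates = [
        statistics.fmean(rng.choices(r, k=m)) for _ in range(n_replicates)
    ]
    return statistics.stdev(replicates)

# Invented binary judgments; result should be close to the closed-form
# standard error of the mean for the same data.
r = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]
print(round(bootstrap_se(r), 3))
```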

34 | A Study of Translation Error Rate with Targeted Human Annotation
- Snover, Dorr, et al.
- 2005
Citation Context: …(WER), the Position-independent word Error Rate (PER), the NIST score (Doddington, 2002) and, especially in recent years, the BLEU score (Papineni et al., 2002) and the Translation Error Rate (TER) (Snover et al., 2005). All of these measures compare the system output with one or more gold-standard references and produce a numerical value (score or error rate) which measures the similarity between the machine translation…

17 | The method of paired comparisons for social values
- Thurstone
- 1927
Citation Context: …choose the best one out of them in a more or less definite way (footnote 2: with the exception of cross-language information retrieval and similar tasks). In social sciences, a similar method has been proposed by (Thurstone, 1927). §3.1 Comparison of Two Systems: For the comparison of two MT systems, a set of translated sentence pairs is selected. Each of these pairs consists of the translations of a particular source sentence…

16 | The bargaining problem
- Nash, Jr
- 1950
Citation Context: …⌈log n!⌉ (for which n log n is an approximation). It is not always possible to reach this minimum, however, as was proven e.g. for the case n = 12 in (Wells, 1971) and for n = 13 in (Peczarski, 2002). (Ford Jr and Johnson, 1959) propose an algorithm called merge insertion which comes very close to the theoretical limit. This algorithm is sketched in Figure 1. There are also algorithms with a better asymptotic runtime (Bui and Thanh, 1985)…
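How close merge insertion comes to the theoretical limit can be checked numerically. A sketch using the known closed form for the worst-case comparison count of the Ford-Johnson algorithm, F(n) = Σ_{k=1..n} ⌈log₂(3k/4)⌉ (given in Knuth's TAOCP, vol. 3):

```python
import math

def lower_bound(n: int) -> int:
    """ceil(log2(n!)): information-theoretic comparison lower bound."""
    return math.ceil(math.log2(math.factorial(n)))

def merge_insertion_comparisons(n: int) -> int:
    """Worst-case comparisons of Ford-Johnson merge insertion, via the
    closed form F(n) = sum_{k=1..n} ceil(log2(3k/4))."""
    return sum(math.ceil(math.log2(3 * k / 4)) for k in range(1, n + 1))

# n = 5 meets the bound exactly; n = 12 needs one extra comparison,
# matching the (Wells, 1971) result cited in the text.
for n in (5, 12, 13):
    print(n, lower_bound(n), merge_insertion_comparisons(n))
# 5 7 7 / 12 29 30 / 13 33 34
```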

15 | Elements of combinatorial computing
- Wells
- 1971
Citation Context: …to be carried out to sort n elements is at least ⌈log n!⌉ (for which n log n is an approximation). It is not always possible to reach this minimum, however, as was proven e.g. for the case n = 12 in (Wells, 1971) and for n = 13 in (Peczarski, 2002). (Ford Jr and Johnson, 1959) propose an algorithm called merge insertion which comes very close to the theoretical limit. This algorithm is sketched in Figure 1.…

8 | Preprocessing and normalization for automatic evaluation of machine translation
- Leusch, Ueffing, et al.
- 2005

2 | Significant improvements to the Ford-Johnson algorithm for sorting
- Bui, Thanh
- 1985
Citation Context: …(Ford Jr and Johnson, 1959) propose an algorithm called merge insertion which comes very close to the theoretical limit. This algorithm is sketched in Figure 1. There are also algorithms with a better asymptotic runtime (Bui and Thanh, 1985), but they only take effect for values of n too large for our purposes (e.g., more than 100). Thus, using the algorithm from Figure 1 we can obtain the ordering of the systems with a (nearly) optimal…