
## A Comparison of the Optimality of Statistical Significance Tests for Information Retrieval Evaluation (2013)

Venue: ACM SIGIR

Citations: 3 (1 self)

### Citations

156 | How reliable are the results of large-scale information retrieval experiments?
- Zobel
- 1998
Citation Context: ...ce levels, but the Wilcoxon test does better for the usual levels and best overall. Surprisingly, the permutation test does not seem to be the most exact at any significance level. Zobel [7] compared the t-test, Wilcoxon test and ANOVA at α = 0.05, though with only one random split in 25-25 topics. He found lower error rates with the t-test than with the Wilcoxon test, and generally lowe...
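
The paired permutation (randomization) test discussed in this context can be sketched as follows. This is a generic sign-flipping implementation in Python with made-up per-topic scores, not code from the paper:

```python
import numpy as np

def permutation_test(scores_a, scores_b, trials=10000, seed=0):
    """Two-sided paired randomization test on the mean difference.

    Under the null hypothesis the sign of each per-topic difference
    is arbitrary, so we flip signs at random and count how often the
    permuted mean difference is at least as extreme as the observed one.
    """
    rng = np.random.default_rng(seed)
    diffs = np.asarray(scores_a, dtype=float) - np.asarray(scores_b, dtype=float)
    observed = abs(diffs.mean())
    flips = rng.choice([-1.0, 1.0], size=(trials, diffs.size))
    permuted = np.abs((flips * diffs).mean(axis=1))
    # +1 in numerator and denominator: the observed labeling counts as one permutation
    return (np.sum(permuted >= observed) + 1) / (trials + 1)
```

With only 10 topics there are just 2^10 = 1024 distinct sign assignments, which bounds how small the p-value can get; this is one reason the test need not be the most exact in practice at small topic-set sizes.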

116 | A comparison of statistical significance tests for information retrieval evaluation.
- Smucker, Allan, et al.
- 2007
Citation Context: ...oretically provides exact p-values. Because IR evaluations violate most of the assumptions, it is very important to know how robust these tests are in practice and which one is optimal. Previous work [4, 5] compared these five tests with TREC Ad Hoc data, reaching the following conclusions: a) the bootstrap, t-test and permutation test largely agree with each other, so there is hardly any practical diff...
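
Two of the five tests under comparison, the paired t-test and the Wilcoxon signed-rank test, can be run directly with SciPy. The per-topic average-precision scores below are hypothetical, for illustration only:

```python
from scipy import stats

# hypothetical per-topic average precision for two retrieval runs
run_a = [0.31, 0.45, 0.28, 0.52, 0.39, 0.44, 0.36, 0.48, 0.33, 0.41]
run_b = [0.27, 0.40, 0.30, 0.47, 0.35, 0.38, 0.34, 0.42, 0.31, 0.36]

# paired t-test on the per-topic score differences
t_stat, t_p = stats.ttest_rel(run_a, run_b)

# Wilcoxon signed-rank test on the same pairs
w_stat, w_p = stats.wilcoxon(run_a, run_b)
```

Both tests operate on the same per-topic pairing; they differ in whether the magnitudes of the differences (t-test) or only their ranks (Wilcoxon) enter the statistic.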

115 | Information retrieval system evaluation: effort, sensitivity, and reliability. In SIGIR.
- Sanderson, Zobel
- 2005
Citation Context: ...the Wilcoxon test, and generally lower than the nominal 0.05 level. Given that the latter showed higher power and has more relaxed assumptions, he recommended it over the t-test. Sanderson and Zobel [3] ran a larger study, also with splits of up to 25-25 topics. They found that the sign test has higher error rates than the Wilcoxon test, which has itself higher error rates than the t-test. They also ...
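
The sign test mentioned here reduces each topic to a win or a loss, discarding magnitudes, which is why its error rates can differ markedly from the Wilcoxon test's. A minimal exact two-sided implementation, assuming the common (but not universal) convention of dropping ties:

```python
from math import comb

def sign_test(scores_a, scores_b):
    """Two-sided exact sign test: keeps only the sign of each
    per-topic difference; tied topics are dropped."""
    wins = sum(1 for a, b in zip(scores_a, scores_b) if a > b)
    losses = sum(1 for a, b in zip(scores_a, scores_b) if a < b)
    n = wins + losses
    if n == 0:
        return 1.0
    k = min(wins, losses)
    # exact binomial tail at p = 0.5, doubled for a two-sided test
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```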

43 | Evaluating evaluation metrics based on the bootstrap.
- Sakai
- 2006
Citation Context: ...am [1] used 124-124 topic splits and various significance levels. They found the Wilcoxon test more powerful than the t-test and sign test, and the t-test safer than the Wilcoxon and sign test. Sakai [2] proposed the bootstrap method for IR evaluation, but did not compare it with other tests. Smucker et al. [4] compared the same five tests we study in this paper, arguing that the t-test, permutation ...
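
A minimal sketch of a paired bootstrap test in the spirit of this line of work. This shift-method version resamples mean-centered differences and is an illustration of the general idea, not Sakai's exact procedure:

```python
import numpy as np

def bootstrap_test(scores_a, scores_b, trials=10000, seed=0):
    """Paired bootstrap test on the mean per-topic difference.

    The differences are shifted to have zero mean (imposing the null
    hypothesis), then resampled with replacement; the p-value is the
    fraction of resamples whose mean is at least as extreme as the
    observed mean difference.
    """
    rng = np.random.default_rng(seed)
    diffs = np.asarray(scores_a, dtype=float) - np.asarray(scores_b, dtype=float)
    observed = abs(diffs.mean())
    centered = diffs - diffs.mean()
    idx = rng.integers(0, diffs.size, size=(trials, diffs.size))
    means = np.abs(centered[idx].mean(axis=1))
    return (np.sum(means >= observed) + 1) / (trials + 1)
```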

10 |
Topic set size redux.
- Voorhees
- 2009
Citation Context: ...re developed and judged by the same assessors for the most part, and they were developed using the same methodology and pooling protocol with roughly the same number of runs contributing to the pools [6]. Additionally, all three tracks used disks 4 and 5 as document collection. Therefore, we can consider these two sets of 50 topics as two different samples drawn from the same universe of topics. We r...

6 | Validity and Power of t-test for Comparing MAP and GMAP.
- Cormack, Lynam
- 2007
Citation Context: ...evel when using 50 topic sets. Voorhees [6] also observed error rates below the nominal 0.05 level for the t-test, but more unstable effectiveness measures resulted in higher rates. Cormack and Lynam [1] used 124-124 topic splits and various significance levels. They found the Wilcoxon test more powerful than the t-test and sign test, and the t-test safer than the Wilcoxon and sign test. Sakai [2] pr...
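
The split-based error-rate methodology described in this context (Zobel's 25-25 splits, Cormack and Lynam's 124-124 splits) can be sketched generically. Here `conflict_rate` is a hypothetical helper that counts how often two disjoint topic halves reach significant but opposite conclusions under the paired t-test; the exact experimental designs in the cited papers differ in details:

```python
import numpy as np
from scipy import stats

def conflict_rate(scores_a, scores_b, half=25, splits=1000, alpha=0.05, seed=0):
    """Split-half error estimate: repeatedly partition the topics into
    two disjoint halves, run a paired t-test on each half, and count
    how often the halves are both significant but in opposite
    directions."""
    rng = np.random.default_rng(seed)
    a = np.asarray(scores_a, dtype=float)
    b = np.asarray(scores_b, dtype=float)
    conflicts = 0
    for _ in range(splits):
        perm = rng.permutation(a.size)
        h1, h2 = perm[:half], perm[half:2 * half]
        t1, p1 = stats.ttest_rel(a[h1], b[h1])
        t2, p2 = stats.ttest_rel(a[h2], b[h2])
        if p1 < alpha and p2 < alpha and np.sign(t1) != np.sign(t2):
            conflicts += 1
    return conflicts / splits
```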

6 | Agreement Among Statistical Significance Tests for Information Retrieval Evaluation at Varying Sample Sizes
- Smucker, Allan, et al.
- 2009
Citation Context: ...oretically provides exact p-values. Because IR evaluations violate most of the assumptions, it is very important to know how robust these tests are in practice and which one is optimal. Previous work [4, 5] compared these five tests with TREC Ad Hoc data, reaching the following conclusions: a) the bootstrap, t-test and permutation test largely agree with each other, so there is hardly any practical diff...