Results 1–10 of 11
Statistical Comparisons of Classifiers over Multiple Data Sets
, 2006
Abstract

Cited by 243 (0 self)
While methods for comparing two learning algorithms on a single data set have been scrutinized for quite some time already, the issue of statistical tests for comparisons of more algorithms on multiple data sets, which is even more essential to typical machine learning studies, has been all but ignored. This article reviews the current practice and then theoretically and empirically examines several suitable tests. Based on that, we recommend a set of simple, yet safe and robust non-parametric tests for statistical comparisons of classifiers: the Wilcoxon signed-ranks test for comparison of two classifiers and the Friedman test with the corresponding post-hoc tests for comparison of more classifiers over multiple data sets. Results of the latter can also be neatly presented with the newly introduced CD (critical difference) diagrams.
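The recommended multiple-classifier procedure can be sketched in a few lines of Python. The accuracies below are hypothetical, and `q_alpha = 2.343` is the tabulated Nemenyi critical value q_0.05 for three classifiers; this is a minimal illustration, not the paper's implementation:

```python
import math

# Hypothetical accuracies: rows = data sets, columns = classifiers A, B, C.
scores = [
    [0.85, 0.80, 0.70],
    [0.90, 0.82, 0.75],
    [0.78, 0.77, 0.70],
    [0.88, 0.86, 0.80],
    [0.92, 0.84, 0.78],
]

def average_ranks(rows):
    """Rank classifiers on each data set (1 = best), then average per column."""
    k = len(rows[0])
    totals = [0.0] * k
    for row in rows:
        order = sorted(range(k), key=lambda j: -row[j])
        for rank, j in enumerate(order, start=1):
            totals[j] += rank          # no ties in this toy data
    return [t / len(rows) for t in totals]

def friedman_statistic(rows):
    """Friedman chi-square: 12N/(k(k+1)) * (sum of R_j^2 - k(k+1)^2/4)."""
    n, k = len(rows), len(rows[0])
    ranks = average_ranks(rows)
    return 12 * n / (k * (k + 1)) * (sum(r * r for r in ranks) - k * (k + 1) ** 2 / 4)

def nemenyi_cd(k, n, q_alpha=2.343):
    """Critical difference: two classifiers differ significantly if their
    average ranks differ by at least CD = q_alpha * sqrt(k(k+1) / (6N))."""
    return q_alpha * math.sqrt(k * (k + 1) / (6 * n))
```

On a CD diagram, classifiers whose average ranks are within `nemenyi_cd(k, n)` of each other are connected by a bar, indicating no significant difference.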
UNDERSTANDING REPLICATION: CONFIDENCE INTERVALS, p VALUES, AND WHAT’S LIKELY TO HAPPEN NEXT TIME
Abstract
Science loves replication: We conclude an effect is real if we believe replications would also show the effect. It is therefore crucial to understand replication. However, there is strong evidence of severe, widespread misconception about p values and confidence intervals, two of the main statistical tools that guide us in deciding whether an observed effect is real. I propose we teach about replication directly. I describe three approaches: via confidence intervals (What is the chance the original confidence interval will capture the mean of a repeat of the experiment?); via p values (Given an initial p value, what is the distribution of p values for replications of the experiment?); and via Peter Killeen’s p_rep, which is the average probability that a replication will give a result in the same direction. In each case I will demonstrate an interactive graphical simulation designed to make the tricky ideas of replication vividly accessible. “Confirmation comes from repetition. … Repetition is the basis for judging … significance and confidence.” (Tukey, 1969, pp. 84–85) “Given the problems of statistical induction, we must finally rely, as have the older sciences, on replication.” (Cohen, 1994, p. 1002) REPLICATION IS AT THE HEART OF SCIENCE; WE SHOULD TEACH IT EXPLICITLY
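The second and third approaches can be sketched with a small Monte Carlo simulation. This assumes the simple device of treating the observed z statistic as the true effect (the paper's interactive simulations are richer than this sketch):

```python
import math
import random

def p_two_sided(z):
    """Two-sided p value of a z statistic, via the normal CDF (math.erf)."""
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

def replication_p_values(z_obs, n_rep=10000, seed=0):
    """Treat the observed z as the true effect and draw replication z's from
    N(z_obs, 1); the returned p values scatter widely even when the initial
    result was exactly significant."""
    rng = random.Random(seed)
    return [p_two_sided(rng.gauss(z_obs, 1)) for _ in range(n_rep)]

def p_rep(z_obs):
    """Killeen's p_rep, Phi(z_obs / sqrt(2)): the probability that a replication
    shows an effect in the same direction (z / sqrt(2) / sqrt(2) = z / 2 in erf)."""
    return 0.5 * (1 + math.erf(z_obs / 2))

ps = replication_p_values(1.96)   # an initial p of about .05 corresponds to z of about 1.96
```

With an initial p of about .05, roughly half the simulated replications fail to reach p < .05 again, which is exactly the misconception-busting point the abstract makes.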
Probabilistic Inference: Test and Multiple Tests
, 2009
Abstract
In this paper, we take the view that real-world scientific inference about an assertion of interest on unknown quantities is to produce a probability triplet (p, q, r), conditioned on available data. The probabilities p and q are for and against the truth of the assertion, whereas r = 1 − p − q is the remaining probability, called the probability of “don’t know”. Such a (p, q, r) formulation provides a promising way of representing realistic uncertainty assessment in statistical inference. With a brief discussion of what we call inferential models for producing (p, q, r) probability triplets for assertions, we focus on a particular inferential model for inference about an unobserved sorted uniform sample. We show how this inferential model can be used for (i) single tests, (ii) robust estimation of the empirical null distribution in the context of the local FDR method of Bradley Efron, and (iii) large-scale simultaneous hypothesis problems, including the many-normal-means problem and the problem of identifying significantly expressed genes in microarray data analysis. These examples indicate that hypothesis testing problems can be formulated and solved in a new framework of probabilistic inference.
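The triplet's defining constraint, r = 1 − p − q with all three components non-negative, can be captured directly in a small value type. This is only a representation of the abstract's definition, not the paper's inferential-model machinery:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Triplet:
    """(p, q, r): probabilities for, against, and 'don't know' about an assertion."""
    p: float
    q: float

    def __post_init__(self):
        if self.p < 0 or self.q < 0 or self.p + self.q > 1:
            raise ValueError("need p, q >= 0 and p + q <= 1 so that r = 1 - p - q >= 0")

    @property
    def r(self):
        # The 'don't know' mass is whatever probability is left over.
        return 1.0 - self.p - self.q
```

A classical Bayesian posterior is the special case r = 0; a nonzero r is what lets the formulation say "don't know".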
STATISTICAL INFERENCE WITH A SINGLE OBSERVATION OF N(θ, 1)
Abstract
We consider some fundamental issues in statistical inference by focusing on inference about the unknown mean θ of the Gaussian model N(θ, 1) with unit variance from a single observed data point X. A closer look at this seemingly simple inference problem reveals a limitation of objective Bayesian posteriors in that they cannot be interpreted as valid posteriors when combining certain types of information. A new solution to inference about θ from X is proposed. The proposed method is based on the fiducial distribution of θ given X, but with a new Weak Belief rule of combination for constraint-type information. It is shown that the proposed approach is promising for constrained statistical inference.
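The fiducial starting point mentioned in the abstract is easy to simulate: writing X = θ + Z with Z ~ N(0, 1) and solving for θ gives θ = X − Z, so the fiducial distribution of θ given X is N(X, 1). The sketch below shows only this baseline; the paper's Weak Belief combination of constraint-type information is the part it does not attempt:

```python
import random

def fiducial_sample(x_obs, n=10000, seed=0):
    """Fiducial draws for theta given a single observation X = x_obs from
    N(theta, 1): since X = theta + Z with Z ~ N(0, 1), theta = x_obs - Z,
    i.e. theta has a N(x_obs, 1) fiducial distribution."""
    rng = random.Random(seed)
    return [x_obs - rng.gauss(0, 1) for _ in range(n)]

draws = fiducial_sample(1.5)
mean = sum(draws) / len(draws)   # should be close to the observed x
```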
PSYCHOLOGICAL SCIENCE (General Article): An Alternative to Null Hypothesis Significance Tests
Abstract
ABSTRACT: The statistic p_rep estimates the probability of replicating an effect. It captures traditional publication criteria for signal-to-noise ratio, while avoiding parametric inference and the resulting Bayesian dilemma. In concert with effect size and replication intervals, p_rep provides all of the information now used in evaluating research, while avoiding many of the pitfalls of traditional statistical inference. Psychologists, who rightly pride themselves on their methodological expertise, have become increasingly embarrassed by “the survival of a flawed method” (Krueger, 2001) at the heart of their inferential procedures. Null-hypothesis significance tests (NHSTs) provide criteria for separating signal from noise in the majority of published research. They are based on inferred sampling distributions, given a hypothetical value for a parameter such as a population mean (μ) or a difference of means between an experimental group (μE) and a control group (μC; e.g., H0: μE − μC = 0). Analysis starts with a statistic on the obtained data, such as the difference in the sample means, D. D is a point on the line with probability mass of zero, so it is necessary to relate that point to some interval in order to engage probability theory. Neyman and Pearson (1933) introduced critical intervals over which the probability of observing a statistic is less than a stipulated significance level, α (e.g., z scores in [−∞, −2] and [+2, +∞], over which α < .05). If a statistic falls within those intervals, it is deemed significantly different from that expected under the null hypothesis. Fisher (1959) preferred to calculate the probability of obtaining a statistic larger than D over the interval [D, +∞]. This probability, p(x ≥ D | H0), is called the p value of the statistic. Researchers typically hope to obtain a p value sufficiently small (viz., less than α) so that they can reject the null hypothesis.
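Fisher's tail probability p(x ≥ D | H0) for a difference of sample means reduces to a normal tail area when the variances are known. The group sizes and observed difference below are hypothetical, chosen only to make the arithmetic concrete:

```python
import math

def phi(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def one_sided_p(d, se):
    """Fisher's tail probability p(x >= D | H0) for an observed mean
    difference d with standard error se, under H0: mu_E - mu_C = 0."""
    return 1 - phi(d / se)

# Hypothetical groups with unit variance and n = 25 each:
# the standard error of the difference is sqrt(1/25 + 1/25).
d = 0.6
se = math.sqrt(1 / 25 + 1 / 25)
p = one_sided_p(d, se)   # compared against a stipulated alpha, e.g. .05
```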
KEEL Data-Mining Software Tool: Data Set Repository, Integration of Algorithms and Experimental Analysis Framework
The Philosophical Relevance of Statistics