## Permutation Tests for Studying Classifier Performance

Citations: 5 (0 self)

### BibTeX

@MISC{Ojala_permutationtests,

author = {Markus Ojala and Gemma C. Garriga},

title = {Permutation Tests for Studying Classifier Performance},

year = {}

}

### Abstract

We explore the framework of permutation-based p-values for assessing the behavior of the classification error. In this paper we study two simple permutation tests. The first test estimates the null distribution by permuting the labels in the data; this has been used extensively in classification problems in computational biology. The second test produces permutations of the features within classes, inspired by restricted randomization techniques traditionally used in statistics. We study the properties of these tests and present an extensive empirical evaluation on real and synthetic data. Our analysis shows that studying the classification error via permutation tests is effective; in particular, the restricted permutation test clearly reveals whether the classifier exploits the interdependency between the features in the data.

Keywords: classification, labeled data, permutation tests, restricted randomization, significance testing
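The two tests described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the statistic is a leave-one-out 1-nearest-neighbour error chosen for brevity (the citation contexts on this page mention stratified 10-fold cross-validation error instead), and the `+1` correction in the empirical p-value is a common convention assumed here.

```python
import numpy as np

def loo_1nn_error(X, y):
    """Leave-one-out error of a 1-nearest-neighbour classifier (illustrative statistic)."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)  # a sample may not be its own neighbour
    return float(np.mean(y[d.argmin(axis=1)] != y))

def permute_labels(X, y, rng):
    """Test 1: shuffle the class labels, keep the features fixed."""
    return X, rng.permutation(y)

def permute_features_within_class(X, y, rng):
    """Test 2: shuffle each feature column independently inside each class."""
    Xp = X.copy()
    for c in np.unique(y):
        idx = np.flatnonzero(y == c)
        for j in range(X.shape[1]):
            Xp[idx, j] = X[rng.permutation(idx), j]
    return Xp, y

def permutation_p_value(X, y, randomize, statistic=loo_1nn_error, k=200, seed=0):
    """Fraction of randomized datasets whose error is at most the observed error."""
    rng = np.random.default_rng(seed)
    observed = statistic(X, y)
    null = np.array([statistic(*randomize(X, y, rng)) for _ in range(k)])
    # Assumed convention: add 1 to numerator and denominator so p is never 0.
    return (np.sum(null <= observed) + 1) / (k + 1)
```

Calling `permutation_p_value(X, y, permute_labels)` gives the label-permutation p-value, and passing `permute_features_within_class` gives the restricted one. Under the reading in the abstract, a small p-value from the second test suggests that the classifier relies on dependencies between features within a class, since that is the only structure the within-class shuffle destroys.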

### Citations

3351 |
Controlling the false discovery rate: a practical and powerful approach to multiple testing
- Benjamini, Hochberg
- 1995
Citation Context ...al classifiers with less over-fitting problems. The evaluation of the different models in this local search strategy is done via permutation tests, using the framework of multiple hypothesis testing (Benjamini and Hochberg, 1995; Holm, 1979). The first test used corresponds to permuting labels (that is, Test 1), while the second test is a conditional randomization test. Conditionally randomization tests permute the labels in th... |

3339 |
Data Mining: Practical machine learning tools and techniques. 2nd Edition
- Witten, Frank
- 2005
Citation Context ...e stratified 10-fold cross-validation error. We study the behavior of four classifiers: 1-Nearest Neighbor, Decision Tree, Naive Bayes and Support Vector Machine. We use Weka 3.6 data mining software [11] with the default parameters of those classification algorithms. The Decision Tree classifier is similar to C4.5 algorithm, and the default kernel used with Support Vector Machine is linear. Figure 3 ... |

3231 | An Introduction to the Bootstrap - Efron, Tibshirani - 1993 |

1043 |
Bootstrap methods: Another look at the Jackknife. The Annals of Statistics, 7: 1–26
- Efron
- 1979
Citation Context ...ion in decision trees [9]. However, the related literature has not performed extensive experimental studies for this traditional test in more general cases. Sub-sampling methods such as bootstrapping [10] use randomizations to study the properties of the underlying distribution instead of testing the data against some null model. The goal of this paper is to study permutation tests for assessing the b... |

775 | A simple sequentially rejective multiple test procedure. Scand J Stat 1979; 6: 65–70 - Holm |

708 |
UCI machine learning repository
- Asuncion, Newman
- 2007
Citation Context ...tion (4) for two different values of significance level α. 5. Empirical Results In this section, we give extensive empirical results on 33 various real data sets from UCI machine learning repository (Asuncion and Newman, 2007). Basic characteristics of the data sets are described in Table 2. The data sets are divided into three categories based on their size: small, medium and large. Some data sets contain only nominal or... |

200 |
Permutation Tests: A Practical Guide to Resampling Methods for Testing Hypotheses
- Good
- 1994
Citation Context ...o difference between the classes. The null distribution under this null hypothesis is estimated by permuting the labels of the dataset. This corresponds also to the most traditional statistic methods [7], where the results on a control group are compared against the result on a treatment group. This simple test has been proven effective already for selecting relevant genes in small data samples [8] o... |

112 |
UCI machine learning repository. http://www.ics.uci.edu/~mlearn/MLRepository.html
- Asuncion, Newman
- 2007
Citation Context ...ta and the randomized datasets, which results into a very high p-value. V. EMPIRICAL RESULTS In this section we give empirical results on 22 various real datasets from UCI machine learning repository [12]. The datasets contain nominal or/and numeric features as well as missing values. In most datasets the features are measured in different scales, thus it is only reasonable to consider column-wise per... |

97 | Is cross-validation valid for small-sample microarray classification?
- Braga-Neto, Dougherty
Citation Context ... – Dataset D2 Figure 1. Examples of two 16 × 8 nominal datasets D1 and D2 each having two classes. The last column in both datasets denotes the class labels (+, –) of the samples in the rows. samples [1]–[4]. Also classical generalization bounds are not appropriate when the dimensionality of the data is too high. Indeed, for many other general cases, it is useful to have other statistics associated t... |

54 |
Prediction error estimation: a comparison of resampling methods
- Molinaro, Simon, et al.
- 2005
Citation Context ...ill be analyzed with detail later on in the paper. In the recent years, a number of papers have suggested to use permutation-based p-values for assessing the competence of a classifier [2], [3], [5], [6]. Essentially the permutation test procedure measures how likely the observed accuracy would be obtained by chance. A p-value represents the fraction of random datasets under a certain null hypothesis... |

51 | Sequential monte carlo methods for statistical analysis of tables - Chen, Diaconis, et al. |

39 | Simultaneous regression shrinkage, variable selection, and supervised clustering of predictors with OSCAR. Biometrics 64 - Bondell, Reich - 2008 |

36 | Assessing data mining results via swap randomization
- Gionis, Mannila, et al.
Citation Context ...ost datasets the features are measured in different scales, thus it is only reasonable to consider column-wise permutations, leaving out of consideration some recent data mining randomization methods [13], [14]. We use stratified 10-fold cross-validation error as the statistic. In all cases, we calculate the empirical p-values over 1000 randomized samples and use the threshold of α = 0.01 to determine... |

21 | Permutation tests for classification: towards statistical significance in image-based studies
- Golland, Fischl
- 2003
Citation Context ...mple will be analyzed in detail later on in Section 3.3. In recent years, a number of papers have suggested to use permutation-based p-values for assessing the competence of a classifier (Golland and Fischl, 2003; Golland et al., 2005; Hsing et al., 2003; Jensen, 1992; Molinaro et al., 2005). Essentially, the permutation test procedure measures how likely the observed accuracy would be obtained by chance. A p... |

19 | Support vector machine for functional data classification
- Villa, Rossi
- 2005
Citation Context ...eneralization bounds are not directly appropriate when the dimensionality of the data is too high; for these reasons, some recent approaches using filtering and regularization alleviate this problem (Rossi and Villa, 2006; Berlinet et al., 2008). Indeed, for many other general cases, it is useful to have other statistics associated to the error in order to understand better the behavior of the classifier. For example,... |

18 | Permutation Tests for Classification
- Mukherjee, Golland, et al.
- 2003
Citation Context ... This example will be analyzed with detail later on in the paper. In the recent years, a number of papers have suggested to use permutation-based p-values for assessing the competence of a classifier [2], [3], [5], [6]. Essentially the permutation test procedure measures how likely the observed accuracy would be obtained by chance. A p-value represents the fraction of random datasets under a certain ... |

18 | Using a permutation test for attribute selection in decision trees
- Frank, Witten
- 1998
Citation Context ...ompared against the result on a treatment group. This simple test has been proven effective already for selecting relevant genes in small data samples [8] or for attribute selection in decision trees [9]. However, the related literature has not performed extensive experimental studies for this traditional test in more general cases. Sub-sampling methods such as bootstrapping [10] use randomizations t... |

15 |
Induction with randomization testing: decision-oriented analysis of large data sets
- Jensen
- 1992
Citation Context ...ple will be analyzed with detail later on in the paper. In the recent years, a number of papers have suggested to use permutation-based p-values for assessing the competence of a classifier [2], [3], [5], [6]. Essentially the permutation test procedure measures how likely the observed accuracy would be obtained by chance. A p-value represents the fraction of random datasets under a certain null hypot... |

14 |
Cross-validation and bootstrapping are unreliable in small sample classification
- Isaksson, Wallman, et al.
Citation Context ...ataset D2 Figure 1. Examples of two 16 × 8 nominal datasets D1 and D2 each having two classes. The last column in both datasets denotes the class labels (+, –) of the samples in the rows. samples [1]–[4]. Also classical generalization bounds are not appropriate when the dimensionality of the data is too high. Indeed, for many other general cases, it is useful to have other statistics associated to th... |

14 | Pruning decision trees and lists
- Frank
- 2000
Citation Context ...ults on a treatment group. This simple test has been proven effective already for selecting relevant genes in small data samples (Maglietta et al., 2007) or for attribute selection in decision trees (Frank, 2000; Frank and Witten, 1998). However, the related literature has not performed extensive experimental studies for this traditional test in more general cases. The goal of this paper is to study permutat... |

10 | Relation between permutation-test P values and classifier error estimates
- Hsing, Attoor, et al.
- 2003
Citation Context ... example will be analyzed with detail later on in the paper. In the recent years, a number of papers have suggested to use permutation-based p-values for assessing the competence of a classifier [2], [3], [5], [6]. Essentially the permutation test procedure measures how likely the observed accuracy would be obtained by chance. A p-value represents the fraction of random datasets under a certain null ... |

9 | Assessing data mining results via swap randomization - Gionis, Mannila, Mielikäinen, Tsaparas - 2006 |

7 |
Sequential tests of statistical hypotheses. The Annals of
- Wald
- 1945
Citation Context ... test, or the value of the standard deviation in the critical point of p=α where α is the significance level. Alternatively, a sequential probability ratio test can be used (Besag and Clifford, 1991; Wald, 1945; Fay et al., 2007), where we sample randomizations of D until it is possible to accept or reject the null hypothesis. With these tests, often already 30 samples are enough for statistical inference w... |

6 | Randomization of real-valued matrices for assessing the significance of data mining results
- Ojala, Vuokko, et al.
- 2008
Citation Context ...tasets the features are measured in different scales, thus it is only reasonable to consider column-wise permutations, leaving out of consideration some recent data mining randomization methods [13], [14]. We use stratified 10-fold cross-validation error as the statistic. In all cases, we calculate the empirical p-values over 1000 randomized samples and use the threshold of α = 0.01 to determine the s... |

6 | Randomization of real-valued matrices for assessing the significance of data mining results - Ojala, Vuokko, et al. - 2008 |

3 | Selection of relevant genes in cancer diagnosis based on their prediction accuracy - Maglietta - 2007 |

2 | On using truncated sequential probability ratio test boundaries for Monte Carlo implementation of hypothesis tests
- Fay, Kim, et al.
- 2007
Citation Context ...e value of the standard deviation in the critical point of p=α where α is the significance level. Alternatively, a sequential probability ratio test can be used (Besag and Clifford, 1991; Wald, 1945; Fay et al., 2007), where we sample randomizations of D until it is possible to accept or reject the null hypothesis. With these tests, often already 30 samples are enough for statistical inference with significance l... |

2 |
Permutation tests for classification
- Golland, Liang, et al.
- 2005
Citation Context ... other (Hsing et al., 2003). However, it has been argued that evaluating a single classifier with an error measurement is ineffective for small amount of data samples (Braga-Neto and Dougherty, 2004; Golland et al., 2005; Isaksson et al., 2008). Also classical generalization bounds are not directly appropriate when the dimensionality of the data is too high; for these reasons, some recent approaches using filtering a... |

1 | Annales de l’ISUP - wavelets |

1 | Selection of relevant genes in cancer diagnosis based on their prediction accuracy - Ancona |