## Ordering and finding the best of K>2 supervised learning algorithms (2006)

Venue: IEEE Transactions on Pattern Analysis and Machine Intelligence

Citations: 6 (2 self)

### BibTeX

@ARTICLE{Yildiz06orderingand,
  author  = {Olcay Taner Yildiz and Ethem Alpaydin},
  title   = {Ordering and finding the best of K>2 supervised learning algorithms},
  journal = {IEEE Transactions on Pattern Analysis and Machine Intelligence},
  year    = {2006},
  volume  = {28},
  number  = {3},
  pages   = {392--402}
}


### Abstract

Given a data set and a number of supervised learning algorithms, we would like to find the algorithm with the smallest expected error. Existing pairwise tests allow a comparison of two algorithms only; range tests and ANOVA check whether multiple algorithms have the same expected error and cannot be used for finding the smallest. We propose a methodology, the MultiTest algorithm, whereby we order supervised learning algorithms taking into account 1) the result of pairwise statistical tests on expected error (what the data tells us), and 2) our prior preferences, e.g., due to complexity. We define the problem in graph-theoretic terms and propose an algorithm to find the “best” learning algorithm in terms of these two criteria or, in the more general case, to order learning algorithms in terms of their “goodness.” Simulation results using five classification algorithms on 30 data sets indicate the utility of the method. Our proposed method can be generalized to regression and other loss functions by using a suitable pairwise test.

Index Terms: Machine learning, classifier design and evaluation, experimental design.

### Citations

4931 | C4.5: Programs for Machine Learning
- Quinlan
- 1993

Citation Context: ...oGistic Classification algorithm with a linear model. We use gradient-descent for learning. Discrete features are converted to numeric features by 1-of-n encoding. It has c(d+1) parameters. 4. C4.5 [21] is the archetypal decision tree method. We use postpruning with 20 percent of the data reserved for pruning. The tree changes depending on the problem but generally the complexity of the tree is betw...

2046 | The Elements of Statistical Learning
- Hastie, Tibshirani, et al.
- 2001

Citation Context: ...instance [2]. It has c·d parameters. This corresponds to assuming that classes are Gaussian distributed with a shared covariance matrix whose diagonals are equal and whose off-diagonals are 0. 3. LGC [20] is the LoGistic Classification algorithm with a linear model. We use gradient-descent for learning. Discrete features are converted to numeric features by 1-of-n encoding. It has c(d+1) parameters...

563 | A simple sequentially rejective multiple test procedure
- Holm
- 1979

Citation Context: ...the resulting final ordering and to have a confidence level of α, we need this Bonferroni correction [7]. When K is large, Bonferroni correction may be too conservative and we can use Holm correction [18] instead to have higher confidence levels for the one-sided statistical tests. We use graph theory to represent the result of the tests. The algorithm is given in Fig. 1. The graph has K vertices corr...
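The step-down Holm procedure referred to here can be sketched in a few lines; the function name and p-values below are illustrative, not from the paper:

```python
# Holm's step-down correction (a sketch): sort the m p-values ascending and
# compare the i-th smallest (0-based) against alpha / (m - i); once one
# hypothesis fails, every later (larger) p-value fails too.
def holm_reject(p_values, alpha=0.05):
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # step-down: stop at the first non-rejection
    return reject

# 0.01 <= 0.05/3 is rejected; 0.03 <= 0.05/2 fails, so the rest are kept
print(holm_reject([0.01, 0.04, 0.03]))  # [True, False, False]
```

Compared with plain Bonferroni (every p-value against alpha/m), the denominator shrinks as hypotheses are rejected, which is what gives Holm its higher power.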

528 | Approximate statistical tests for comparing supervised classification learning algorithms
- Dietterich
- 1998

Citation Context: ...sed to compare the means of populations which these validation error values are sampled from, i.e., their expected error. Tests in the literature are pairwise and compare the means of two populations [1], [2]. Generally, these tests are two-sided and, in our case of expected error comparison, check whether two supervised learning algorithms yield the same expected error. If the test accepts, we concl...

413 | Practical Nonparametric Statistics
- Conover
- 1980

Citation Context: ...e levels to each hypothesis, and in that case, we may reject hypotheses which have high significance individually. In addition to the parametric tests we discussed, there are also nonparametric tests [10]; for example, Kruskal-Wallis’ test is the nonparametric version of Anova. In contrast to parametric tests, nonparametric tests do not assume a particular population probability distribution. Contingen...
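The rank-based idea behind Kruskal-Wallis’ test can be sketched directly; tie handling is omitted for brevity and the function name is ours:

```python
# Kruskal-Wallis H statistic (sketch, no tie correction): pool all values,
# rank them, and compare per-group rank sums. Under H0 (all K populations
# have the same distribution), H is approximately chi^2 with K-1 df.
def kruskal_wallis_H(groups):
    pooled = sorted((v, gi) for gi, g in enumerate(groups) for v in g)
    rank_sum = [0.0] * len(groups)
    for rank, (_, gi) in enumerate(pooled, start=1):
        rank_sum[gi] += rank
    N = len(pooled)
    return 12.0 / (N * (N + 1)) * sum(
        rank_sum[gi] ** 2 / len(groups[gi]) for gi in range(len(groups))
    ) - 3 * (N + 1)

# two fully separated groups of three values each
print(round(kruskal_wallis_H([[1, 2, 3], [4, 5, 6]]), 3))  # 3.857
```

Because only ranks enter the statistic, no Gaussian assumption on the validation errors is needed, which is exactly the contrast with Anova drawn in the passage above.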

324 | The case against accuracy estimation for comparing classifiers
- Provost, Fawcett, et al.
- 1998

Citation Context: ...ositive and false positive rates. In applications where the class distributions become skewed or when misclassification losses are not equal, accuracy-based comparisons may break down. Provost et al. [14] suggest using ROC analysis in classifier comparisons and, to this aim, propose the incremental ROC convex hull method (ROCCH) [15], which allows clear visual comparisons and sensitivity analyses. ROC...

253 | Robust Classification for Imprecise Environments
- Provost, Fawcett

Citation Context: ...not equal, accuracy-based comparisons may break down. Provost et al. [14] suggest using ROC analysis in classifier comparisons and, to this aim, propose the incremental ROC convex hull method (ROCCH) [15], which allows clear visual comparisons and sensitivity analyses. ROCCH selects the classifiers that are potentially optimal; therefore, only these classifiers must be kept for further comparisons. RO...
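The selection idea behind ROCCH (not the incremental algorithm itself) is to keep only classifiers whose (FPR, TPR) point lies on the upper convex hull between (0, 0) and (1, 1); a minimal sketch with illustrative points:

```python
# Keep the potentially optimal classifiers: those on the upper convex hull
# of their (false positive rate, true positive rate) points. This is only
# the hull-selection step; the paper's ROCCH method builds it incrementally.
def roc_hull_points(points):
    pts = sorted(set(points) | {(0.0, 0.0), (1.0, 1.0)})
    hull = []
    for p in pts:
        while len(hull) >= 2:
            (x1, y1), (x2, y2) = hull[-2], hull[-1]
            # pop hull[-1] if it does not lie strictly above segment hull[-2]->p
            if (x2 - x1) * (p[1] - y1) >= (p[0] - x1) * (y2 - y1):
                hull.pop()
            else:
                break
        hull.append(p)
    return [p for p in hull if p in set(points)]

# (0.5, 0.5) lies below the segment from (0.1, 0.8) to (1, 1), so it is dropped
print(roc_hull_points([(0.1, 0.8), (0.5, 0.5)]))  # [(0.1, 0.8)]
```

A classifier below the hull is dominated for every class/cost distribution, which is why only hull points need to be kept for further comparison.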

201 | Introduction to Machine Learning
- Alpaydin
- 2004

Citation Context: ...imes quite accurate. 2. NMC is the Nearest Mean Classifier, which keeps the mean vector for each class and assigns an instance to the class whose mean has the smallest Euclidean distance to the instance [2]. It has c·d parameters. This corresponds to assuming that classes are Gaussian distributed with a shared covariance matrix whose diagonals are equal and whose off-diagonals are 0. 3. LGC [20] is the ...
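The nearest mean classifier described here is simple enough to sketch directly; the data and function names are illustrative:

```python
# Nearest Mean Classifier sketch: one d-dimensional mean per class
# (c*d parameters); predict the class whose mean is nearest in squared
# Euclidean distance to the instance.
def nmc_fit(X, y):
    groups = {}
    for xi, yi in zip(X, y):
        groups.setdefault(yi, []).append(xi)
    return {c: [sum(col) / len(rows) for col in zip(*rows)]
            for c, rows in groups.items()}

def nmc_predict(means, x):
    return min(means, key=lambda c: sum((m - v) ** 2
                                        for m, v in zip(means[c], x)))

means = nmc_fit([[0, 0], [1, 0], [5, 5], [6, 4]], ["a", "a", "b", "b"])
print(nmc_predict(means, [0.4, 0.2]))  # "a"
```

Minimizing Euclidean distance to the class means is exactly the Bayes rule under the stated assumption of Gaussian classes with a shared, isotropic covariance.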

145 | Multiple Comparison Procedures
- Hochberg, Tamhane
- 1987

Citation Context: ...o not check for any of the subsets between mi and mj, and assume all to have equal means. 2.5 Related Work: Comparing statistics from multiple populations is called multiple comparison procedures [7], [8], and failure to adjust the statistical properties of multiple comparisons may lead to attribute selection errors, overfitting, and oversearching [9]. Several solutions were proposed to overcome thes...

128 | Simultaneous Statistical Inference
- Miller
- 1981

Citation Context: ...e equality of the means of subsets of populations. One such test is the Newman-Keuls test. There are also multiple range tests due to Duncan and Tukey, but the Newman-Keuls test is favored over them ([6], p. 87). In our case of comparing expected error, the multiple range test is used to find subsets of algorithms with the same expected error. For example, given algorithms 1, 2, 3, 4, 5, a range test...

101 | The Analysis of Contingency Tables
- Everitt
- 1977

Citation Context: ...n be used in cases where learning and/or validation is so costly that they can only be done once, assuming that the internal variability of the supervised learning algorithms is small. McNemar’s test [11] is such a pairwise test, having lower type I error and reasonable power [1]. Let n01 denote the number of instances misclassified by the first classifier but not by the second and n10 denote the numbe...
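With n01 and n10 defined as above, McNemar's statistic with the usual continuity correction is a one-liner; the counts here are hypothetical, and 3.84 is the chi-square critical value with 1 degree of freedom at the 0.05 level:

```python
# McNemar's test sketch: n01 / n10 count instances misclassified by only
# the first / only the second classifier; the statistic is compared against
# the chi^2 distribution with 1 degree of freedom.
def mcnemar_statistic(n01, n10):
    return (abs(n01 - n10) - 1) ** 2 / (n01 + n10)

stat = mcnemar_statistic(n01=20, n10=5)
print(round(stat, 2))   # (15 - 1)^2 / 25 = 7.84
print(stat > 3.84)      # True: reject equal error rates at the 0.05 level
```

Note that instances both classifiers get right (or both get wrong) never enter the statistic; only the discordant pairs carry information about which classifier is better.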

75 | Introduction to Probability and Statistics for Engineers and Scientists
- Ross
- 1987

Citation Context: ...2 cv t test, but is two-sided and, because it uses the squares of the differences, it cannot be used to define a one-sided test. 2.4 Multiple Populations. 2.4.1 Anova Test: Analysis of variance (Anova) [5] tests whether K samples are drawn from populations with the same mean, and can be used to test whether K...

62 | Discrete Mathematics and Its Applications
- Rosen
- 1995

Citation Context: ...t expected error. This calculates the “best”; if we want to find an ordering, we iterate steps 1 and 2 above, removing the best node and its incident edges at each iteration to get a topological sort [19]. Fig. 2 shows a sample execution of the MultiTest algorithm on four (K = 4) algorithms numbered 1 to 4. They are sorted in decreasing order of prior preference...
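The iterate-and-remove step described above amounts to a topological sort; a loose sketch, with the edge semantics simplified from the paper's construction and the inputs purely illustrative:

```python
# MultiTest ordering sketch: given directed "u beats v" edges from one-sided
# pairwise tests, repeatedly take the best remaining node (one no remaining
# edge points at, ties broken by prior preference), then remove it and its
# incident edges -- a topological sort. Edge semantics simplified here.
def order_algorithms(nodes, beats, preference):
    remaining = set(nodes)
    edges = set(beats)  # (winner, loser) pairs
    ordering = []
    while remaining:
        beaten = {loser for (winner, loser) in edges if winner in remaining}
        candidates = [n for n in remaining if n not in beaten]
        best = min(candidates, key=preference.index)  # most preferred first
        ordering.append(best)
        remaining.discard(best)
        edges = {(w, l) for (w, l) in edges if best not in (w, l)}
    return ordering

pref = [1, 2, 3, 4]                 # prior preference order, 1 most preferred
beats = [(3, 1), (3, 2)]            # 3 tested significantly better than 1, 2
print(order_algorithms([1, 2, 3, 4], beats, pref))  # [3, 1, 2, 4]
```

Here algorithm 3 comes first despite its lower prior preference because the tests say it beats 1 and 2; among the rest, prior preference decides, matching the two criteria combined by MultiTest.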

49 | Combined 5 × 2 cv F test for comparing supervised classification algorithms
- Alpaydin
- 1999

Citation Context: ...e I error than the k-fold t test. We can derive a one-sided test and accept H0: μ1 ≤ μ2 if t′ ∈ (−∞, t_(α,5)). This is the one-sided pairwise test we use in the rest of the paper. The combined 5 × 2 cv F test [4] is an improved version of the 5 × 2 cv t test, but is two-sided and, because it uses the squares of the differences, it cannot be used to define a one-sided test. 2.4 Multiple Populations. 2.4.1 Anova Te...
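The 5 × 2 cv t statistic can be computed from the ten fold-wise error differences; in this sketch, p[i][j] is a hypothetical difference in error on fold j of replication i:

```python
import math

# Dietterich's 5x2cv t statistic (sketch): p[i][j] is the difference in
# error of the two algorithms on fold j of replication i. The statistic is
# the first difference over the averaged per-replication variances, and has
# 5 degrees of freedom under H0 (t_(0.05,5) is about 2.015, one-sided).
def t_5x2cv(p):
    s2 = []
    for p1, p2 in p:
        mean = (p1 + p2) / 2
        s2.append((p1 - mean) ** 2 + (p2 - mean) ** 2)
    return p[0][0] / math.sqrt(sum(s2) / 5)

# identical replications with fold differences 0.02 and 0.04 give t = sqrt(2)
print(round(t_5x2cv([(0.02, 0.04)] * 5), 3))  # 1.414
```

Because the sign of p[0][0] is preserved, the statistic supports the one-sided comparison MultiTest needs, unlike the squared-difference F variant discussed above.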

45 | Design and Analysis of Experiments
- Dean, Voss
- 1999

Citation Context: ...we do not check for any of the subsets between mi and mj, and assume all to have equal means. 2.5 Related Work: Comparing statistics from multiple populations is called multiple comparison procedures [7], [8], and failure to adjust the statistical properties of multiple comparisons may lead to attribute selection errors, overfitting, and oversearching [9]. Several solutions were proposed to overcome...

31 | Choosing between two learning algorithms based on calibrated tests
- Bouckaert

Citation Context: ...where the degrees of freedom are multiplied with an appropriate adjustment. Dietterich has shown that the 5 × 2 cv t test has low power and the k-fold cv t test has high type I error [1]. Bouckaert [13] claims that the reuse of the same data causes the effective degrees of freedom to be lower than theoretically expected and calibrates the effective degrees of freedom empirically. On synthetic proble...

10 | Computers and the Theory of Statistics
- Efron
- 1979

Citation Context: ...arameters, e.g., mean) and increases the variance of estimates made; with more replications, the folds overlap too much and the independence assumption of folds may no longer be tenable. In bootstrap [3], we randomly draw instances with replacement from a data set where each drawn sample...
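Bootstrap sampling as described is a single line of resampling with replacement; the data and seed below are illustrative:

```python
import random

# Bootstrap sketch: draw n instances with replacement from an n-instance
# data set; on average about 1 - 1/e (roughly 63.2%) of the distinct
# instances appear in each bootstrap sample.
def bootstrap_sample(data, rng):
    return [rng.choice(data) for _ in range(len(data))]

rng = random.Random(0)
data = list(range(1000))
sample = bootstrap_sample(data, rng)
print(len(sample))                            # 1000
print(0.55 < len(set(sample)) / 1000 < 0.70)  # True
```

The instances left out of a sample (about 36.8%) can serve as that replication's validation set, which is the usual way bootstrap replications feed an error-comparison test.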

5 | A statistical technique for comparing the accuracies of several classifiers
- Looney
- 1988

Citation Context: ...he second but not by the first. We accept the null hypothesis that both classifiers have the same mean with confidence level α if (|n01 − n10| − 1)² / (n01 + n10) < χ²_(α,1) (13). Similarly, Looney’s test [12] uses contingency table analysis and checks for the equality of K means, as in Anova. Since it checks whether multiple populations have the same mean, it cannot be used for finding the population with...

2 | A Multinomial Selection Procedure for Evaluating Pattern Recognition Algorithms
- Alsing, Bauer, et al.
- 2002

Citation Context: ...having precise prior class and cost distributions. The most important limitation of ROCCH is that it is only applicable to binary-class problems. In the multinomial selection problem (MSP) procedure [16], for each test data point of class j, we compare the class j posterior probabilities of each classifier and select the one with the maximum posterior. The best classifier is the one that has the maximum ...

1 | Tuning Model Complexity Using Cross-Validation for Supervised Learning
- Yildiz
- 2005

Citation Context: ...s on expected error and our prior preferences. We apply it to classification by making use of the 5 × 2 cv pairwise t test. It is applicable to regression and other loss functions by using a suitable test [23]. MultiTest is always able to find an algorithm as the “best” one (or order the algorithms in the general case), combining the expected error and prior preferences. Though Anova and Newman-Keuls or the...