## Methodological Note On Comparing Classifiers: Pitfalls to Avoid and a Recommended Approach (1996)

### BibTeX

```bibtex
@MISC{Salzberg96methodologicalnote,
  author = {Steven L. Salzberg and Usama Fayyad},
  title  = {Methodological Note On Comparing Classifiers: Pitfalls to Avoid and a Recommended Approach},
  year   = {1996}
}
```

### Abstract

An important component of many data mining projects is finding a good classification algorithm, a process that requires very careful experimental design. If the design is not done carefully, comparative studies of classification and other types of algorithms can easily yield statistically invalid conclusions. This is especially true when data mining techniques are used to analyze very large databases, which inevitably contain some statistically unlikely data. This paper describes several phenomena that can, if ignored, invalidate an experimental comparison. These phenomena and the conclusions that follow apply not only to classification but to computational experiments in almost any aspect of data mining. The paper also discusses why comparative analysis is more important in evaluating some types of algorithms than others, and offers suggestions for avoiding the pitfalls that afflict many experimental studies.

### Citations

753 | UCI Repository of Machine Learning Databases - Murphy, Aha - 1994

Citation context: ...opment and it represents an important step in the maturation of the field. One indication of this maturation is the creation and maintenance of the UC Irvine repository of machine learning databases (Murphy, 1995), which now contains over 100 datasets that have appeared in published work. This repository makes it very easy for machine learning researchers to compare new algorithms to previous work. The data m...

476 | Parallel Networks that Learn to Pronounce English Text - Rosenberg, Sejnowski - 1987

459 | Very simple classification rules perform well on most commonly used datasets - Holte - 1993

Citation context: ...her datasets. The only way this might be valid would be if the UCI repository were known to represent a larger population of classification problems. In fact, though, as argued persuasively by Holte (Holte, 1993) and others, the UCI repository is a very limited sample of problems, many of which are quite easy for a classifier. (Many of them may represent the same concept class, for example many might be almo...

282 | Bayesian Model Selection in Social Research (with discussion) - Raftery - 1995

151 | Experimental Designs - Cochran, Cox - 1950

115 | The Analysis of Contingency Tables - Everitt - 1977

Citation context: ...an reject the null hypothesis with high confidence. (Note: the computation here uses the binomial distribution, which is exact. Another, nearly identical form of this test is known as McNemar’s test (Everitt, 1977), which uses the χ² distribution. The statistic used for the McNemar test is (|s − f| − 1)²/(s + f), which is simpler to compute.) If we make this into a 2-sided test, we must double the probability...
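The two tests quoted in this excerpt are straightforward to compute directly. The sketch below, with illustrative names of our own choosing, takes s and f as the counts of disagreement trials won by each of the two classifiers and returns both the exact one-sided binomial probability and the continuity-corrected McNemar statistic:

```python
# Sketch of the two tests mentioned in the excerpt, assuming s and f count
# the trials on which classifier A (resp. B) alone was correct.  Function
# names are illustrative, not from the paper.
from math import comb

def exact_binomial_p(s: int, f: int) -> float:
    """One-sided sign test: P(X >= s) for X ~ Binomial(s + f, 0.5)."""
    n = s + f
    return sum(comb(n, k) for k in range(s, n + 1)) / 2 ** n

def mcnemar_statistic(s: int, f: int) -> float:
    """Continuity-corrected McNemar statistic (|s - f| - 1)^2 / (s + f),
    referred to a chi-squared distribution with 1 degree of freedom."""
    return (abs(s - f) - 1) ** 2 / (s + f)

# Example: A wins 15 of the 20 trials on which the two classifiers disagree.
p_one_sided = exact_binomial_p(15, 5)
p_two_sided = min(1.0, 2 * p_one_sided)   # doubling, as the excerpt notes
chi2 = mcnemar_statistic(15, 5)
```

For the two-sided version one simply doubles the one-sided probability, exactly as the excerpt describes.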

103 | Generalising from Case Studies: A Case Study - Aha - 1992

101 | Symbolic and Neural Learning Algorithms: An Experimental Comparison - Shavlik, Mooney, et al. - 1991

61 | On the connection between in-sample testing and generalization error - Wolpert - 1992

Citation context: ...e made for data mining: no single technique is likely to work best on all databases. Recent theoretical work has shown that, with certain assumptions, no classifier is always better than another one (Wolpert, 1992). However, experimental science is concerned with data that occurs in the real world, and it is not clear that these theoretical limitations are relevant. Comparative studies typically include at lea...

59 | Machine learning as an experimental science - Kibler, Langley - 1988

Citation context: ...hms and designing the experiments. Here I focus on design of experiments, which has been the subject of little concern in the machine learning community until recently (with some exceptions, such as (Kibler and Langley, 1988) and (Cohen and Jensen, 1997)). Included in the comparative study category are papers that neither introduce a new algorithm nor improve an old one; instead, they consider one or more known algorithm...

52 | A Study of Experimental Evaluations of Neural Network Learning Algorithms: Current Research Practice. Technical Report 19/94, Fakultät für Informatik, Universität - Prechelt

45 | Multi-Interval Discretization of Continuous-Valued Attributes for Classification Learning - Fayyad, Irani - 1993

Citation context: ...rom one study to the next even when the same basic dataset is used. For example, numeric values are sometimes converted to a discrete set of intervals, especially when using decision tree algorithms (Fayyad and Irani, 1993). Whenever tuning takes place, every adjustment should really be considered a separate experiment. For example, if one attempts 10 different combinations of parameters, then significance levels (p-va...
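The adjustment this excerpt alludes to can be made concrete with the Bonferroni rule, shown below as a rough illustration (the function name and the choice of Bonferroni specifically are ours, not necessarily the paper's recommendation):

```python
# Illustrative sketch, not from the paper: if tuning tries k parameter
# combinations, each is a separate experiment, and the Bonferroni rule
# divides the desired significance level among them.
def bonferroni_threshold(alpha: float, k: int) -> float:
    """Per-experiment p-value threshold after k tuning attempts."""
    return alpha / k

# With 10 combinations of parameters, a nominal 0.05 level tightens to
# roughly 0.005 per attempt.
threshold = bonferroni_threshold(0.05, 10)
```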

37 | A hybrid nearest-neighbor and nearest-hyperrectangle algorithm - Wettschereck - 1994

35 | Statistical evaluation of neural network experiments: Minimum requirements and current practice - Flexer - 1996

21 | Explaining - Cohen - 2000

Citation context: ...ts. Here I focus on design of experiments, which has been the subject of little concern in the machine learning community until recently (with some exceptions, such as (Kibler and Langley, 1988) and (Cohen and Jensen, 1997)). Included in the comparative study category are papers that neither introduce a new algorithm nor improve an old one; instead, they consider one or more known algorithms and conduct experiments on ...

20 | Data Mining as an Industry - Denton - 1985

15 | Knowledge discovery through induction with randomization testing - Jensen - 1991

Citation context: ...cs. A good general reference for experimental design is Cochran and Cox (1957), and descriptions of ANOVA and experimental design can be found in introductory texts such as Hildebrand (1986). Jensen (Jensen, 1991, Jensen, 1995) discusses a framework for experimental comparison of classifiers and addresses significance testing, and Cohen and Jensen (1997) discuss some specific ways to remove optimistic statist...

13 | Which method learns most from the data - Feelders, Verkooijen - 1995

Citation context: ...ong time; it is known as the multiplicity effect. At least two recent papers have focused their attention nicely on how classification researchers might address this effect (Gascuel and Caraux, 1992, Feelders and Verkooijen, 1995). In particular, let α be the probability that if no differences exist among our algorithms, we will make at least one mistake; i.e., we will find at least one significant difference. Thus α is the p...
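The α described in this excerpt is the familywise error rate. A minimal sketch, assuming k independent comparisons (the Šidák inversion shown for recovering a per-comparison level is our own illustration, not the paper's prescription):

```python
# Sketch of the multiplicity effect in the excerpt: alpha is the chance of
# at least one spurious "significant" difference across k independent
# comparisons when no real differences exist.  Illustrative only.
def familywise_alpha(per_comparison_alpha: float, k: int) -> float:
    """P(at least one false positive) over k independent tests."""
    return 1.0 - (1.0 - per_comparison_alpha) ** k

def sidak_per_comparison(alpha: float, k: int) -> float:
    """Per-comparison level that keeps the familywise rate at alpha."""
    return 1.0 - (1.0 - alpha) ** (1.0 / k)

# Ten pairwise comparisons, each at the usual 0.05 level:
fw = familywise_alpha(0.05, 10)        # about 0.40, far above 0.05
per = sidak_per_comparison(0.05, 10)   # about 0.0051
```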

8 | Statistical significance in inductive learning - Gascuel, Caraux - 1992

Citation context: ... this problem for a very long time; it is known as the multiplicity effect. At least two recent papers have focused their attention nicely on how classification researchers might address this effect (Gascuel and Caraux, 1992, Feelders and Verkooijen, 1995). In particular, let α be the probability that if no differences exist among our algorithms, we will make at least one mistake; i.e., we will find at least one signific...

7 | Statistical tests for comparing supervised learning algorithms (Technical Report) - Dietterich - 1996

Citation context: ...n comparative machine learning studies. (One of the authors of the study cited above has written recently that the paired t-test has “a high probability of Type I error ... and should never be used” (Dietterich, 1996).) It is worth noting here that even statisticians have difficulty agreeing on the correct framework for hypothesis testing in complex experimental designs. For example, the whole framework of using ...

7 | Statistical thinking for behavioral scientists - Hildebrand - 1986

3 | Labeling space: A tool for thinking about significance testing in knowledge discovery. Office of Technology Assessment - Jensen - 1995

Citation context: ...eral reference for experimental design is Cochran and Cox (1957), and descriptions of ANOVA and experimental design can be found in introductory texts such as Hildebrand (1986). Jensen (Jensen, 1991, Jensen, 1995) discusses a framework for experimental comparison of classifiers and addresses significance testing, and Cohen and Jensen (1997) discuss some specific ways to remove optimistic statistical bias from...