## Multiple comparisons in induction algorithms (2000)


### Download Links

- ftp.cs.umass.edu
- eksl-www.cs.umass.edu
- kdl.cs.umass.edu
- DBLP

### Other Repositories/Bibliography

Venue: Machine Learning

Citations: 73 (10 self)

### BibTeX

```bibtex
@ARTICLE{Jensen00multiplecomparisons,
  author  = {David D. Jensen and Paul R. Cohen},
  title   = {Multiple comparisons in induction algorithms},
  journal = {Machine Learning},
  year    = {2000},
  pages   = {309--338}
}
```



### Abstract

A single mechanism is responsible for three pathologies of induction algorithms: attribute selection errors, overfitting, and oversearching. In each pathology, induction algorithms compare multiple items based on scores from an evaluation function and select the item with the maximum score. We call this a multiple comparison procedure (MCP). We analyze the statistical properties of MCPs and show how failure to adjust for these properties leads to the pathologies. We also discuss approaches that can control pathological behavior, including Bonferroni adjustment, randomization testing, and cross-validation.
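The selection step the abstract describes is easy to simulate (an illustrative sketch, not code from the paper; the function name and noise distribution are my own choices): score n candidate items with pure noise and keep the maximum, as an MCP does.

```python
import random

def mcp_max_score(n_items, n_trials=10_000, seed=0):
    """Simulate a multiple comparison procedure (MCP): each trial
    scores n items with independent standard-normal noise and selects
    the maximum score. Returns the mean selected score over trials."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_trials):
        total += max(rng.gauss(0.0, 1.0) for _ in range(n_items))
    return total / n_trials

# Every individual score has expectation 0, yet the expectation of
# the *selected* (maximum) score rises as more items are compared.
biases = {n: mcp_max_score(n) for n in (1, 10, 100)}
```

Under these assumptions the upward bias of the selected score grows with the number of comparisons, which is the mechanism the paper links to attribute selection error, overfitting, and oversearching.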

### Citations

3912 | Classification and regression trees
- Breiman, Friedman, Olshen, Stone
- 1984

Citation Context: ... some algorithms explicitly consider both complexity and accuracy when evaluating model components (Iba, Wogulis & Langley, 1988). Cost-complexity pruning, a technique employed in the cart algorithm (Breiman et al., 1984), attempts to find a near-optimal complexity for a given tree through cross-validation. Several more formal treatments consider model complexity as a way to avoid overfitting. One such treatment, ...

3355 | Simplifying decision trees
- Quinlan
- 1987

Citation Context: ...endices. 4.1. Overfitting: Errors in hypothesis tests. Errors in adding components to a model, usually called overfitting, are probably the best known pathology of induction algorithms (Einhorn, 1972; Quinlan, 1987; Quinlan & Rivest, 1989; Mingers, 1989a; Weiss & Kulikowski, 1991; White & Liu, 1995; Oates & Jensen, 1997). In empirical studies, induction algorithms often add spurious components to models. These c...

752 | A study of cross-validation and bootstrap for accuracy estimation and model selection
- Kohavi
- 1995

Citation Context: ... other for hypothesis tests and parameter estimates for the resulting items. 7.2. Cross-validation. Cross-validation is a more sophisticated method for obtaining scores based on disjoint data samples (Kohavi, 1995; Cohen, 1995; Weiss & Kulikowski, 1991). Cross-validation divides a sample S, with N instances, into k disjoint sets S_i, each of which contains N/k instances. Then, for 1 ≤ i ≤ k, an MCP selects ma...
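The partition this excerpt describes can be sketched as follows (a hypothetical helper with names of my own choosing; it assumes N is a multiple of k for the fold sizes to come out even):

```python
def kfold_partition(sample, k):
    """Divide a sample of N instances into k disjoint sets S_i,
    each holding roughly N/k instances."""
    return [sample[i::k] for i in range(k)]

def kfold_splits(sample, k):
    """For each fold i, pair the union of the remaining folds with the
    held-out fold S_i, so scores can be computed on data disjoint from
    the data used to select the maximum-scoring item."""
    folds = kfold_partition(sample, k)
    for i in range(k):
        rest = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield rest, folds[i]
```

Each instance lands in exactly one fold, so every fold serves once as an evaluation set that is disjoint from the data the MCP searched.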

624 | Learnability and the Vapnik-Chervonenkis dimension - Blumer, Ehrenfeucht, et al. - 1989 |

609 | Neural networks and the bias/variance dilemma - Geman, Bienenstock, et al. - 1992 |

372 | Empirical Methods for Artificial Intelligence
- Cohen
- 1995

Citation Context: ...othesis tests and parameter estimates for the resulting items. 7.2. Cross-validation. Cross-validation is a more sophisticated method for obtaining scores based on disjoint data samples (Kohavi, 1995; Cohen, 1995; Weiss & Kulikowski, 1991). Cross-validation divides a sample S, with N instances, into k disjoint sets S_i, each of which contains N/k instances. Then, for 1 ≤ i ≤ k, an MCP selects maximum-scoring ...

293 | Inferring decision trees using the minimum description length principle - Quinlan, Rivest - 1989 |

255 | Computer Systems That Learn: Classification and Prediction Methods from Statistics - Weiss, Kulikowski - 1991 |

200 | Improved Use of Continuous Attributes in C4.5
- Quinlan
- 1996

Citation Context: ... pathology is sometimes called attribute selection error. Attribute selection errors, particularly in tree-building systems, have been reported for more than a decade (Quinlan, 1986; Quinlan, 1988; Quinlan, 1996; Mingers, 1989b; Fayyad & Irani, 1992; Liu & White, 1994). Such errors are harmful because the resulting models have consistently lower accuracy on new data than other models considered and rejected b...

174 | Bias plus variance decomposition for zero-one loss functions - Kohavi, Wolpert - 1996 |

167 | An Empirical Comparison of Selection Measures for Decision Tree Induction
- Mingers
- 1989

Citation Context: ...pothesis tests. Errors in adding components to a model, usually called overfitting, are probably the best known pathology of induction algorithms (Einhorn, 1972; Quinlan, 1987; Quinlan & Rivest, 1989; Mingers, 1989a; Weiss & Kulikowski, 1991; White & Liu, 1995; Oates & Jensen, 1997). In empirical studies, induction algorithms often add spurious components to models. These components do not improve accuracy, and ...

166 | An empirical comparison of pruning methods for decision tree induction
- Mingers
- 1989

Citation Context: ...pothesis tests. Errors in adding components to a model, usually called overfitting, are probably the best known pathology of induction algorithms (Einhorn, 1972; Quinlan, 1987; Quinlan & Rivest, 1989; Mingers, 1989a; Weiss & Kulikowski, 1991; White & Liu, 1995; Oates & Jensen, 1997). In empirical studies, induction algorithms often add spurious components to models. These components do not improve accuracy, and ...

157 | An Exploratory Technique for Investigating Large Quantities of Categorical Data
- Kass
- 1980

Citation Context: ...ormer course, correctly noting the effect of multiple comparisons on empirical evaluation of learning algorithms. Only a few induction algorithms explicitly compensate for multiple comparisons. Chaid (Kass, 1980; Kass, 1975), Firm (based on work by Hawkins & Kass (1982)), and tba (Jensen & Schmill, 1997) use Bonferroni adjustment to compensate for multiple comparisons during tree construction. Induce (Gaines...
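The Bonferroni adjustment mentioned in this excerpt is simple to state in code (a generic sketch with hypothetical function names, not the Chaid, Firm, or tba implementations):

```python
def bonferroni_alpha(alpha, n_comparisons):
    """Bonferroni adjustment: when an MCP makes n comparisons, test
    each one at alpha / n so that the probability of any spurious
    selection stays at most alpha."""
    return alpha / n_comparisons

def familywise_error(per_test_alpha, n_comparisons):
    """Exact familywise error rate for n independent null comparisons,
    each tested at per_test_alpha: 1 - (1 - alpha)^n."""
    return 1.0 - (1.0 - per_test_alpha) ** n_comparisons
```

With n = 10 comparisons at a familywise level of 0.05, each individual test runs at 0.005; without the adjustment, the chance of at least one spurious pass under independence is roughly 0.40.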

156 | On Comparing Classifiers: Pitfalls to Avoid and a Recommended Approach
- Salzberg
- 1997

Citation Context: ...1)). Much of this literature is concerned with experimental design, rather than the design of induction algorithms. Some work in machine learning (Gascuel & Caraux, 1992; Feelders & Verkooijen, 1996; Salzberg, 1997) also pursues this former course, correctly noting the effect of multiple comparisons on empirical evaluation of learning algorithms. Only a few induction algorithms explicitly compensate for multiple...

146 | A Conservation Law for Generalization Performance
- Schaffer
- 1994

Citation Context: ...odel over another whose appropriateness is domain specific. This view has been extended to more extreme forms, referred to as a “law of generalization performance” or a “no free lunch (NFL) theorem” (Schaffer, 1994; Wolpert, 1992; Wolpert, 1994). This work holds that any gain in accuracy obtained by avoiding overfitting (or by any other bias) in one domain will necessarily be offset by reduced accuracy in other...

119 | Overfitting avoidance as bias
- Schaffer
- 1993

Citation Context: ...on algorithm. Thus, Pearl’s insights, the VC dimension, and the MDL principle all point toward multiple comparisons as an important factor in overfitting. 8.3. Overfitting avoidance as bias. Schaffer (Schaffer, 1993) characterizes overfitting avoidance as a learning bias — that is, a method of preferring one model over another whose appropriateness is domain specific. This view has been extended to more extreme ...

113 | Computer-Intensive Methods for Testing Hypotheses: An Introduction. Wiley-Interscience
- Noreen
- 1989

Citation Context: ... the method is computationally intensive (typically, k = 10) and its results can still be highly variable (Kohavi, 1995). 7.3. Randomization. Randomization (Cohen, 1995; Edgington, 1995; Jensen, 1992; Noreen, 1989) can be used to construct an empirical sampling distribution. Each iteration of randomization creates a sample S′_i that is consistent with the null hypothesis. The MCP used to obtain the actual sco...
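The randomization scheme in this excerpt can be sketched generically (hypothetical names throughout; shuffling class labels is one common way to generate samples consistent with the null hypothesis of no association):

```python
import random

def randomization_pvalue(score_fn, data, labels, iterations=1000, seed=0):
    """Randomization test: each iteration shuffles the labels to create
    a sample consistent with the null hypothesis, rescores it with the
    same procedure, and the actual score is then compared against this
    empirical sampling distribution of null scores."""
    rng = random.Random(seed)
    actual = score_fn(data, labels)
    shuffled = list(labels)
    at_least_as_high = 0
    for _ in range(iterations):
        rng.shuffle(shuffled)
        if score_fn(data, shuffled) >= actual:
            at_least_as_high += 1
    # +1 correction so the p-value is never exactly zero.
    return (at_least_as_high + 1) / (iterations + 1)
```

Because the null distribution is built empirically from the same scoring procedure the MCP uses, no parametric assumptions about the score distribution are needed.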

96 | The attribute selection problem in decision tree generation - Fayyad, Irani - 1992 |

85 | Oversearching and layered search in empirical learning - Quinlan, Cameron-Jones - 1995 |

66 | The effects of training set size on decision tree complexity - Oates, Jensen - 1997 |

60 | On the connection between in-sample testing and generalization error. Complex Systems 6:47–94
- Wolpert
- 1992

Citation Context: ...er whose appropriateness is domain specific. This view has been extended to more extreme forms, referred to as a "law of generalization performance" or a "no free lunch (NFL) theorem" (Schaffer, 1994; Wolpert, 1992; Wolpert, 1994). This work holds that any gain in accuracy obtained by avoiding overfitting (or by any other bias) in one domain will necessarily be offset by reduced accuracy in other domains. Thus, ...

52 | Lookahead and pathology in decision tree induction - Murthy, Salzberg - 1995 |

40 | For every generalization action, is there really an equal and opposite reaction? Analysis of the conservation law for generalization performance - Rao, Gordon, et al. - 1995 |

38 | An ounce of knowledge is worth a ton of data: Quantitative studies of the trade-off between expertise and data based on statistically well-founded empirical induction
- Gaines
- 1989

Citation Context: ..., 1980; Kass, 1975), Firm (based on work by Hawkins & Kass (1982)), and tba (Jensen & Schmill, 1997) use Bonferroni adjustment to compensate for multiple comparisons during tree construction. Induce (Gaines, 1989) uses a Bonferroni adjustment to compensate for comparing multiple rules. Irt (Jensen, 1991; Jensen, 1992) uses randomization tests to compensate for comparing multiple classification rules. Cart (Br...

30 | On the connection between the complexity and credibility of inferred models - Pearl - 1978 |

30 | Trading off simplicity and coverage in incremental concept learning - Iba, Wogulis, et al. - 1988 |

27 | Concept simplification and prediction accuracy - Fisher, Schlimmer - 1988 |

26 | Overfitting and undercomputing in machine learning
- Dietterich
- 1995

Citation Context: ...emely large spaces of models. Paradoxically, these algorithms produce models that are often less accurate on new data than models produced by algorithms that search only a fraction of the same space (Dietterich, 1995). This pathology, termed oversearching, is harmful because the resulting models have lower accuracy, and because constructing such models uses more computational resources. Algorithms that suffer from...

23 | Decision trees and multi-valued attributes
- Quinlan
- 1988

Citation Context: ...a samples. This pathology is sometimes called attribute selection error. Attribute selection errors, particularly in tree-building systems, have been reported for more than a decade (Quinlan, 1986; Quinlan, 1988; Quinlan, 1996; Mingers, 1989b; Fayyad & Irani, 1992; Liu & White, 1994). Such errors are harmful because the resulting models have consistently lower accuracy on new data than other models considered...

19 | The importance of attribute selection measures in decision tree induction - Liu, White - 1994 |

17 | Simultaneous Statistical Inference (2nd edition)
- Miller
- 1981

Citation Context: ...e effects of MCPs, but equation 4 only holds if the scores X_i are mutually independent and identically distributed. Related adjustments exist for specific distributions and correlational structures (Miller, 1981; Hand & Taylor, 1987; Cohen, 1995). However, the score distributions and correlation must still be known in order to correctly adjust for the effects of MCPs. Figure 6 illustrates how varying degrees...

16 | Adjusting for multiple comparisons in decision tree pruning
- Jensen, Schmill
- 1997

Citation Context: ... overfitting, are probably the best known pathology of induction algorithms (Einhorn, 1972; Quinlan, 1987; Quinlan & Rivest, 1989; Mingers, 1989a; Weiss & Kulikowski, 1991; White & Liu, 1995; Oates & Jensen, 1997). In empirical studies, induction algorithms often add spurious components to models. These components do not improve accuracy, and even reduce it, when models are tested on new data samples. Overfi...

15 | Induction with randomization testing: decision-oriented analysis of large data sets
- Jensen
- 1992

Citation Context: ...1987). Finally, overfitted models can have lower accuracy on new data than models that are not overfitted. This effect has been demonstrated with a variety of domains and systems (e.g., Quinlan, 1987; Jensen, 1992). Overfitting occurs when a multiple comparison procedure is applied to model components. An algorithm generates a set of n components C = {c_1, c_2, ..., c_n}, calculates a score x_i for each c...

15 | Knowledge discovery through induction with randomization testing
- Jensen
- 1991

Citation Context: ...l, 1997) use Bonferroni adjustment to compensate for multiple comparisons during tree construction. Induce (Gaines, 1989) uses a Bonferroni adjustment to compensate for comparing multiple rules. Irt (Jensen, 1991; Jensen, 1992) uses randomization tests to compensate for comparing multiple classification rules. Cart (Breiman, Friedman, Olshen & Stone, 1984) implicitly adjusts for multiple comparisons using cro...

12 | Multivariate Analysis of Variance and Repeated Measures - Hand, Taylor - 1987 |

12 | Trading off simplicity and coverage in incremental concept learning - Iba, Wogulis, et al. - 1988 |

9 | Randomization Tests, 3rd Edition
- Edgington
- 1995

Citation Context: ...partition of the data. However, the method is computationally intensive (typically, k = 10) and its results can still be highly variable (Kohavi, 1995). 7.3. Randomization. Randomization (Cohen, 1995; Edgington, 1995; Jensen, 1992; Noreen, 1989) can be used to construct an empirical sampling distribution. Each iteration of randomization creates a sample S′_i that is consistent with the null hypothesis. The MCP u...

9 | Significance testing in automatic interaction detection
- Kass
- 1975

Citation Context: ..., correctly noting the effect of multiple comparisons on empirical evaluation of learning algorithms. Only a few induction algorithms explicitly compensate for multiple comparisons. Chaid (Kass, 1980; Kass, 1975), Firm (based on work by Hawkins & Kass (1982)), and tba (Jensen & Schmill, 1997) use Bonferroni adjustment to compensate for multiple comparisons during tree construction. Induce (Gaines, 1989) uses...

9 | Overfitting Avoidance as Bias - Schaffer - 1993 |

9 | A conservation law for generalization performance - Schaffer - 1994 |

8 | Statistical significance in inductive learning - Gascuel, Caraux - 1992 |

6 | On the Statistical Comparison of Inductive Learning Methods - Feelders, Verkooijen - 1996 |

4 | Alchemy in the behavioral sciences
- Einhorn
- 1972

Citation Context: ... in several appendices. 4.1. Overfitting: Errors in hypothesis tests. Errors in adding components to a model, usually called overfitting, are probably the best known pathology of induction algorithms (Einhorn, 1972; Quinlan, 1987; Quinlan & Rivest, 1989; Mingers, 1989a; Weiss & Kulikowski, 1991; White & Liu, 1995; Oates & Jensen, 1997). In empirical studies, induction algorithms often add spurious components to ...

2 | Measuring concept change. Training Issues - Brodley, Rissland - 1993 |

2 | Automatic interaction detection - Hawkins, Kass - 1982 |

2 | A First Course in Probability (2nd edition)
- Ross
- 1984

Citation Context: ..., x_2, ..., x_n), Pr(X_i > x) ≤ Pr(X_max > x). Integrating both sides, ∫_0^∞ Pr(X_i > x) dx ≤ ∫_0^∞ Pr(X_max > x) dx. (3) A well-known theorem of probability states that ∫_0^∞ Pr(X > x) dx = E(X) (Ross, 1984). So, E(X_i) ≤ E(X_max). If, for one or more samples, x_i < x_max, then E(X_i) < E(X_max). As before, this effect can be demonstrated empirically. Based on the distributions shown in Figure 2, we can ca...
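The inequality E(X_i) ≤ E(X_max) in this excerpt can be checked numerically (an illustrative sketch; I use Uniform(0, 1) scores rather than the distributions in the paper's Figure 2, and the function name is my own):

```python
import random

def mean_component_vs_max(n, trials=20_000, seed=1):
    """Empirical check of E(X_i) <= E(X_max): per trial, draw n i.i.d.
    Uniform(0, 1) scores, record one fixed component and the maximum,
    then average each across trials."""
    rng = random.Random(seed)
    total_i = total_max = 0.0
    for _ in range(trials):
        xs = [rng.random() for _ in range(n)]
        total_i += xs[0]        # any fixed component X_i
        total_max += max(xs)    # the MCP's selected score X_max
    return total_i / trials, total_max / trials
```

For Uniform(0, 1) the exact values are E(X_i) = 1/2 and E(X_max) = n/(n + 1), so the gap between a single score and the selected maximum widens as the MCP compares more items.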

2 | Searching for Structure (Alias, AID-III); An Approach to Analysis of Substantial Bodies of Micro-Data and Documentation for a Computer Program (Successor to the Automatic - Sonquist, Baker - 1971 |

2 | Superstitious learning and induction - White - 1995 |

2 | A comment on Einhorn’s “Alchemy in the behavioral sciences”
- Morgan, Andrews
- 1973

Citation Context: ...earchers to reject statistical hypothesis tests entirely. For example, some early tree-building algorithms such as aid completely dispense with significance tests. According to the program’s authors (Morgan & Andrews, 1973; Sonquist, Baker & Morgan, 1971), aid’s multiple comparisons render statistical significance tests useless. Similarly, Quinlan (Quinlan, 1987) rejects conventional significance tests on empirical gro...

1 | A comment on Einhorn's "Alchemy in the behavioral sciences" - Morgan - 1973 |