## Occam's Two Razors: The Sharp and the Blunt (1998)

Venue: Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining

Citations: 28 (3 self)

### BibTeX

@INPROCEEDINGS{Domingos98occamstwo,
  author    = {Pedro Domingos},
  title     = {Occam's Two Razors: The Sharp and the Blunt},
  booktitle = {Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining},
  year      = {1998},
  pages     = {37--43},
  publisher = {AAAI Press}
}

### Abstract

Occam's razor has been the subject of much controversy. This paper argues that this is partly because it has been interpreted in two quite different ways, the first of which (simplicity is a goal in itself) is essentially correct, while the second (simplicity leads to greater accuracy) is not. The paper reviews the large variety of theoretical arguments and empirical evidence for and against the "second razor," and concludes that the balance is strongly against it. In particular, it builds on the case of (Schaffer, 1993) and (Webb, 1996) by considering additional theoretical arguments and recent empirical evidence that the second razor fails in most domains. A version of the first razor more appropriate to KDD is proposed, and we argue that continuing to apply the second razor risks causing significant opportunities to be missed. 1 Occam's Two Razors William of Occam's famous razor states that "Nunquam ponenda est pluralitas sin necesitate," which, approximately translated, means "En...

### Citations

9530 | Statistical Learning Theory
- Vapnik
- 1998
Citation Context: ...ys. The model structure class = sign(sin ax), with a single parameter, has an infinite VC dimension, because it can discriminate an arbitrarily large, arbitrarily labeled set of points on the x axis (Vapnik 1995, p. 78). Overfitting Is Due to Multiple Testing According to conventional wisdom, overfitting is caused by overly complex models, and Occam's razor combats it by introducing a preference for simpler ...
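The shattering claim in this snippet can be checked numerically. The sketch below (the three sample points and the parameter grid are arbitrary choices, not from the paper) sweeps the single parameter a and verifies that sign(sin ax) realizes every labeling of three points:

```python
import numpy as np

# Three points with rationally independent coordinates, so the phases
# (a*x1, a*x2, a*x3) mod 2*pi sweep the 3-torus densely as a grows.
x = np.array([1.0, np.sqrt(2.0), np.sqrt(3.0)])

# Sweep the single parameter a and record every +/-1 labeling that
# sign(sin(a*x)) produces on the three points.
a = np.arange(0.1, 1000.0, 0.01)
patterns = set(map(tuple, np.sign(np.sin(np.outer(a, x))).astype(int)))

print(len(patterns))  # 8: all 2^3 labelings occur, so the three points are shattered
```

The same sweep with more rationally independent points realizes all labelings of those points too, which is the sense in which this one-parameter family has unbounded VC dimension.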

8938 |
Elements of Information Theory
- Cover, Thomas
- 1991
Citation Context: ...al principle. The closely-related minimum message length (MML) principle (Wallace & Boulton 1968) is derived by taking the logarithm of Bayes' theorem and noting that, according to information theory (Cover & Thomas 1991), logarithms of probabilities can be seen as (minus) the lengths of the most efficient codes for the corresponding events. This has led some researchers to believe that a trade-off between error and ...
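The identity invoked here, that optimal code lengths are minus log-probabilities, can be illustrated on a toy distribution (the probabilities are made up for illustration):

```python
import math

# Optimal (Shannon) code lengths are minus log2 of the probabilities:
# more probable events get shorter codes.
probs = {"a": 0.5, "b": 0.25, "c": 0.25}
code_lengths = {e: -math.log2(p) for e, p in probs.items()}
print(code_lengths)  # {'a': 1.0, 'b': 2.0, 'c': 2.0}

# The expected code length is exactly the entropy of the distribution.
entropy = sum(p * length for p, length in zip(probs.values(), code_lengths.values()))
print(entropy)  # 1.5
```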

5116 |
Neural Networks for Pattern Recognition
- Bishop
- 1995
Citation Context: ...ity (by making the results of induction consistent with previous knowledge). Weak constraints are often sufficient ((Abu-Mostafa 1989; Donoho & Rendell 1996; Pazzani, Mani, & Shankle 1997); see also (Bishop 1995), Section 8.7). If we accept the fact that the most accurate models will not always be simple or easily understandable, we should allow an explicit trade-off between the two. Systems that first induc...

1111 |
Bayesian Theory
- Bernardo, Smith
- 1994
Citation Context: ... found in the statistical and pattern recognition literature. While the details vary, they typically take the form of an approximation to the optimal prediction procedure of Bayesian model averaging (Bernardo & Smith 1994; Chickering & Heckerman 1997) that results in evaluating candidate models according to a sum of two terms: an error or likelihood term, and a term penalizing the complexity of the model. Criteria of ...
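A minimal sketch of such a two-term criterion (the data, noise level, and the use of BIC for Gaussian errors are illustrative assumptions, not taken from the paper): fit polynomials of increasing degree and score each by an error term plus a complexity penalty.

```python
import numpy as np

rng = np.random.default_rng(0)

# Noisy samples of a quadratic (a made-up example).
n = 200
x = np.linspace(-1.0, 1.0, n)
y = 1.0 + 2.0 * x - 3.0 * x**2 + rng.normal(0.0, 0.3, n)

def bic(degree):
    # Two-term score: an error (likelihood) term plus a complexity
    # penalty; this is BIC under Gaussian errors.
    coeffs = np.polyfit(x, y, degree)
    rss = np.sum((np.polyval(coeffs, x) - y) ** 2)
    k = degree + 1  # number of fitted parameters
    return n * np.log(rss / n) + k * np.log(n)

scores = {d: bic(d) for d in range(1, 7)}
best = min(scores, key=scores.get)
print(best)  # the penalized score typically bottoms out at the true degree, 2
```

The underfit line loses badly on the error term, while higher degrees improve the fit too little to repay the extra log(n) per parameter.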

627 | On the optimality of the simple Bayesian classifier under zero-one loss. Machine Learning 29:103–130
- Domingos, Pazzani
- 1997
Citation Context: ... 1994). However, as we shall see below, more recent results call even this conclusion into question. Other Low-Variance Algorithms More generally, several pieces of recent work (e.g., (Friedman 1996; Domingos & Pazzani 1997)) have suggested that simple learners like the naive Bayesian classifier or the perceptron will often do better than more complex ones because, while having a higher systematic error component (the b...

461 | Very simple classification rules perform well on most commonly used datasets - Holte - 1993 |

316 |
An information measure for classification
- Wallace, Boulton
- 1968
Citation Context: ... penalizing the complexity of the model. Criteria of this type include AIC (Akaike 1978), BIC (Schwarz 1978), and many others. Similar criteria with an information-theoretic interpretation, like MML (Wallace & Boulton 1968) and MDL (Rissanen 1978) are discussed below. Consider BIC, the first criterion to be explicitly proposed as an approximation to Bayesian model averaging. Leaving aside the fact that BIC involves a s...

296 |
Inferring decision trees using the minimum description length principle
- Quinlan, Rivest
- 1989
Citation Context: ...ad to suboptimal choices. The Information-Theoretic Argument The minimum description length (MDL) principle (Rissanen 1978) is perhaps the form in which the second razor is most often applied (e.g., (Quinlan & Rivest 1989)). According to this principle, the "best" model is the one which minimizes the total number of bits needed to encode the model and the data. The MDL principle is appealing because it reduces two app...
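The MDL trade-off described here can be made concrete with a toy two-part code (the sequence and coding scheme are illustrative assumptions): a richer model costs extra bits to state, but can pay for itself by shortening the encoding of the data.

```python
import math

# A binary sequence of length 100 containing 5 ones (made-up data).
n, ones = 100, 5

def entropy_bits(p):
    # Expected bits per symbol under the optimal code for Bernoulli(p).
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

# Model A: assume p = 0.5. The model is free to state, but every
# symbol then costs a full bit.
cost_a = 0.0 + n * 1.0

# Model B: first transmit an estimate of p (about log2(n+1) bits to
# pick one of the n+1 possible counts of ones), then encode the data
# with the optimal code for p = 0.05.
p_hat = ones / n
cost_b = math.log2(n + 1) + n * entropy_bits(p_hat)

print(round(cost_a), round(cost_b))  # 100 35: the richer model minimizes total bits
```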

283 | Bagging, boosting, and C4.5
- Quinlan
- 1996
Citation Context: ...were obtained by introducing an ad hoc coefficient to reduce the penalty paid by complex decision trees. Finally, the success of multiple-model approaches in almost all commonly-used datasets (e.g., (Quinlan 1996)) shows that large error reductions can systematically result from sharply increased complexity. In particular, Rao and Potts (1997) show how bagging builds accurate frontiers from CART trees that ap...
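The point about multiple models can be illustrated with a small simulation (the 1-nearest-neighbor base learner and the synthetic data are illustrative choices, not from Quinlan's paper): averaging many bootstrap-trained models sharply increases complexity yet reduces test error.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample(m):
    # Noisy samples of a smooth target (a made-up benchmark).
    xs = rng.uniform(0.0, 1.0, m)
    return xs, np.sin(2 * np.pi * xs) + rng.normal(0.0, 0.3, m)

def one_nn(x, y):
    # 1-nearest-neighbor regression: low bias, high variance.
    return lambda q: y[np.abs(q[:, None] - x[None, :]).argmin(axis=1)]

def mse(predict, xq, yq):
    return float(np.mean((predict(xq) - yq) ** 2))

x, y = sample(200)
xq, yq = sample(2000)
single = mse(one_nn(x, y), xq, yq)

# Bagging: average 50 learners fit on bootstrap resamples. The
# ensemble is far more complex than any one model, but averaging
# cancels much of the variance.
learners = []
for _ in range(50):
    idx = rng.integers(0, len(x), len(x))
    learners.append(one_nn(x[idx], y[idx]))

def bagged(q):
    return np.mean([f(q) for f in learners], axis=0)

bagged_err = mse(bagged, xq, yq)
print(round(single, 3), round(bagged_err, 3))  # the bagged ensemble scores lower
```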

226 |
Quantifying inductive bias: AI learning algorithms and valiant’s learning framework
- Haussler
- 1988
Citation Context: ...tain many more models than lower-order ones, and thus contain many more low-likelihood models along with the "best" one(s). (In precise terms, higher-order model structures have a higher VC dimension (Haussler 1988); or, considering finite-precision numbers, they literally contain more models.) For example, the model space defined by ax^2 + bx + c contains many more models than the one defined by ax + b. Thus, i...

200 | On bias, variance, 0/1 - loss, and the curse-of-dimensionality. Data Mining and Knowledge Discovery
- Friedman
- 1997
Citation Context: ... better (Elomaa 1994). However, as we shall see below, more recent results call even this conclusion into question. Other Low-Variance Algorithms More generally, several pieces of recent work (e.g., (Friedman 1996; Domingos & Pazzani 1997)) have suggested that simple learners like the naive Bayesian classifier or the perceptron will often do better than more complex ones because, while having a higher systemat...

191 | The need for biases in learning generalizations - Mitchell - 1980 |

180 |
Modeling by the shortest data description, Automatica 14
- Rissanen
- 1978
Citation Context: ... model. Criteria of this type include AIC (Akaike 1978), BIC (Schwarz 1978), and many others. Similar criteria with an information-theoretic interpretation, like MML (Wallace & Boulton 1968) and MDL (Rissanen 1978) are discussed below. Consider BIC, the first criterion to be explicitly proposed as an approximation to Bayesian model averaging. Leaving aside the fact that BIC involves a sequence of approximation...

167 |
An empirical comparison of pruning methods for decision tree induction
- Mingers
- 1989
Citation Context: ... simple empirical argument for the second razor might be stated as "Pruning works." Indeed, pruning often leads to models that are both simpler and more accurate than the corresponding unpruned ones (Mingers 1989). However, it can also lead to lower accuracy (Schaffer 1993). It is easy to think of simple problems where pruning can only hurt accuracy (e.g., applying a decision tree algorithm like C4.5 to learn...

152 | A conservation law for generalization performance - Schaffer - 1994 |

122 | Overfitting avoidance as bias
- Schaffer
- 1993
Citation Context: ...s the large variety of theoretical arguments and empirical evidence for and against the "second razor," and concludes that the balance is strongly against it. In particular, it builds on the case of (Schaffer, 1993) and (Webb, 1996) by considering additional theoretical arguments and recent empirical evidence that the second razor fails in most domains. A version of the first razor more appropriate to KDD is pr...

114 | Theory refinement combining analytical and empirical methods
- Ourston, Mooney
- 1994
Citation Context: ...e for simpler models, but for restricting search. Suitably constrained, decision-tree or rule induction algorithms can be as stable as simpler ones, and more accurate. Theory revision systems (e.g., (Ourston & Mooney 1994)) are an example of this: they can produce accurate theories that are quite complex with comparatively little search, by making incremental changes to an initial theory that is already complex. Physi...

89 |
Learning from hints in neural networks
- Abu-Mostafa
- 1990
Citation Context: ...curacy (by reducing the search needed to find an accurate model) and comprehensibility (by making the results of induction consistent with previous knowledge). Weak constraints are often sufficient ((Abu-Mostafa 1989; Donoho & Rendell 1996; Pazzani, Mani, & Shankle 1997); see also (Bishop 1995), Section 8.7). If we accept the fact that the most accurate models will not always be simple or easily understandable, w...

88 | Oversearching and layered search in empirical learning - Quinlan, Cameron-Jones - 1995 |

71 | Extracting comprehensible models from trained neural networks
- Craven
- 1996
Citation Context: ... we should allow an explicit trade-off between the two. Systems that first induce the most accurate model they can, and then extract from it a more comprehensible model of variable complexity (e.g., (Craven 1996; Domingos 1997a)) seem a promising avenue. Conclusion Occam's razor can be interpreted in two ways: as favoring the simpler of two models with the same generalization error because simplicity is a go...

59 | Exploring the decision forest: An empirical investigation of Occam’s razor in decision tree induction - Murphy, Pazzani - 1994 |

57 | Knowledge acquisition from examples via multiple models
- Domingos
- 1997
Citation Context: ...low an explicit trade-off between the two. Systems that first induce the most accurate model they can, and then extract from it a more comprehensible model of variable complexity (e.g., (Craven 1996; Domingos 1997a)) seem a promising avenue. Conclusion Occam's razor can be interpreted in two ways: as favoring the simpler of two models with the same generalization error because simplicity is a goal in itself, o...

56 | Further experimental evidence against the utility of Occam's razor
- Webb
- 1996
Citation Context: ...f theoretical arguments and empirical evidence for and against the "second razor," and concludes that the balance is strongly against it. In particular, it builds on the case of (Schaffer, 1993) and (Webb, 1996) by considering additional theoretical arguments and recent empirical evidence that the second razor fails in most domains. A version of the first razor more appropriate to KDD is proposed, and we ar...

52 |
RL4: A tool for knowledge-based induction
- Clearwater, Provost
- 1990
Citation Context: ... are already familiar to the domain experts, and missing the second-order variations that are often where the payoff of data mining lies. Systems that allow incorporation of domain constraints (e.g., (Clearwater & Provost 1990; Clark & Matwin 1993; Lee, Buchanan, & Aronis 1998)) are an alternative to blind reliance on simplicity. Incorporating such constraints can simultaneously improve accuracy (by reducing the search nee...

52 | Lookahead and pathology in decision tree induction,in - Murthy, Salzberg - 1995 |

Using qualitative models to guide inductive learning
- Clark, Matwin
- 1993
Citation Context: ...e domain experts, and missing the second-order variations that are often where the payoff of data mining lies. Systems that allow incorporation of domain constraints (e.g., (Clearwater & Provost 1990; Clark & Matwin 1993; Lee, Buchanan, & Aronis 1998)) are an alternative to blind reliance on simplicity. Incorporating such constraints can simultaneously improve accuracy (by reducing the search needed to find an accura...

45 |
On Finding the Most Probable Model
- Cheeseman
- 1990
Citation Context: ... for the corresponding events. This has led some researchers to believe that a trade-off between error and complexity is "a direct consequence of Bayes' theorem, requiring no additional assumptions" (Cheeseman 1990). However, this belief is founded on a confusion between assigning the shortest codes to the most probable hypotheses and a priori considering that the syntactically simplest models in the representa...

43 | A New Metric-based Approach to Model Selection - Schuurmans - 1997 |

41 | Beyond concise and colorful: Learning Intelligible Rules - Pazzani, Mani, et al. - 1997 |

40 |
Estimating the Dimension of a Model.” Annals of Statistics 6:461–464
- Schwarz
- 1978
Citation Context: ...s in evaluating candidate models according to a sum of two terms: an error or likelihood term, and a term penalizing the complexity of the model. Criteria of this type include AIC (Akaike 1978), BIC (Schwarz 1978), and many others. Similar criteria with an information-theoretic interpretation, like MML (Wallace & Boulton 1968) and MDL (Rissanen 1978) are discussed below. Consider BIC, the first criterion to b...

39 |
A Bayesian analysis of the minimum AIC procedure
- Akaike
- 1978
Citation Context: ...n 1997) that results in evaluating candidate models according to a sum of two terms: an error or likelihood term, and a term penalizing the complexity of the model. Criteria of this type include AIC (Akaike 1978), BIC (Schwarz 1978), and many others. Similar criteria with an information-theoretic interpretation, like MML (Wallace & Boulton 1968) and MDL (Rissanen 1978) are discussed below. Consider BIC, the ...

35 | New Measurements Highlight the Importance of Redundant Knowledge - Gams - 1989 |

33 | Why Does Bagging Work? A Bayesian Account and its Implications
- Domingos
- 1997
Citation Context: ...low an explicit trade-off between the two. Systems that first induce the most accurate model they can, and then extract from it a more comprehensible model of variable complexity (e.g., (Craven 1996; Domingos 1997a)) seem a promising avenue. Conclusion Occam's razor can be interpreted in two ways: as favoring the simpler of two models with the same generalization error because simplicity is a goal in itself, o...

32 | On the connection between the complexity and credibility of inferred models - Pearl - 1978 |

30 |
Efficient approximations for the marginal likelihood of Bayesian networks with hidden variables
- Chickering, Heckerman
- 1997
Citation Context: ...al and pattern recognition literature. While the details vary, they typically take the form of an approximation to the optimal prediction procedure of Bayesian model averaging (Bernardo & Smith 1994; Chickering & Heckerman 1997) that results in evaluating candidate models according to a sum of two terms: an error or likelihood term, and a term penalizing the complexity of the model. Criteria of this type include AIC (Akaike...

27 | Concept simplification and prediction accuracy - Fisher, Schlimmer - 1988 |

26 | Learning Prototypical Concept Descriptions - Datta, Kibler - 1995 |

21 | Explaining - Cohen - 2000 |

21 | Lessons in neural network training: overfitting may be harder than expected - Lawrence, Giles, et al. - 1997 |

20 | Knowledge-based learning in exploratory science: Learning rules to predict rodent carcinogenicity - Lee, Buchanan, et al. - 1998 |

16 | A process-oriented heuristic for model selection
- Domingos
- 1998
Citation Context: ...tempting 100 simple ones. Overfitting is thus best combated not by the second razor, but by taking this multiple testing phenomenon into account when scoring candidate models (Jensen & Schmill 1997; Domingos 1998). Bias-Variance Schuurmans et al. (1997) have shown that complexity-penalty methods assume a particular bias-variance profile, and that if the true profile does not correspond to the postulated one sy...
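The multiple-testing account of overfitting sketched here is easy to simulate (the numbers are arbitrary): when many candidate models are scored on the same data, the best-looking one appears accurate purely by chance.

```python
import numpy as np

rng = np.random.default_rng(0)

# Labels carry no signal: every model's true accuracy is 0.5.
n = 100
train = rng.integers(0, 2, n)
test = rng.integers(0, 2, n)

# "Try" 1000 random models and keep the one that scores best on the
# training labels, mimicking a search over many candidate models.
models = rng.integers(0, 2, (1000, n))
train_acc = (models == train).mean(axis=1)
best = models[train_acc.argmax()]

# The winner looks well above chance on the data used to select it,
# but is back near 0.5 on fresh labels: the overfitting came from
# testing many models, not from any single model's complexity.
print(train_acc.max())
print((best == test).mean())
```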

16 | Adjusting for multiple comparisons in decision tree pruning
- Jensen, Schmill
- 1997
Citation Context: ... of overfitting than attempting 100 simple ones. Overfitting is thus best combated not by the second razor, but by taking this multiple testing phenomenon into account when scoring candidate models (Jensen & Schmill 1997; Domingos 1998). Bias-Variance Schuurmans et al. (1997) have shown that complexity-penalty methods assume a particular bias-variance profile, and that if the true profile does not correspond to the po...

15 | Characterizing the generalization performance of model selection strategies - Schuurmans, Ungar - 1997 |

12 |
Occam: Studies and Selections
- Tornay
- 1938
Citation Context: ...'s Two Razors William of Occam's famous razor states that "Nunquam ponenda est pluralitas sin necesitate," which, approximately translated, means "Entities should not be multiplied beyond necessity" (Tornay 1938). It was born in the late Middle Ages as a criticism of scholastic philosophy, whose theories grew ever more elaborate without any corresponding improvement in predictive power. In the intervening ce...

11 |
Constructive induction using fragmentary knowledge
- Donoho
- 1996
Citation Context: ...g the search needed to find an accurate model) and comprehensibility (by making the results of induction consistent with previous knowledge). Weak constraints are often sufficient ((Abu-Mostafa 1989; Donoho & Rendell 1996; Pazzani, Mani, & Shankle 1997); see also (Bishop 1995), Section 8.7). If we accept the fact that the most accurate models will not always be simple or easily understandable, we should allow an expli...

11 | Induction of condensed determinations
- Langley
- 1996
Citation Context: ...use (as well as cheaper for computers to store and manipulate). Thus the first razor is justified. However, simplicity and comprehensibility are not always equivalent. For example, a decision table (Langley 1996) may be larger than a similarly accurate decision tree, but more easily understood because all lines in the table use the same attributes. Induced models are also more comprehensible if they are cons...

10 | Unifying instance-based and rule-based induction. Machine Learning 24:141–168 - Domingos - 1996 |

9 | In defense of C4.5: Notes on learning one-level decision trees
- Elomaa
- 1994
Citation Context: ...lways achieve the Bayes rate (lowest error possible). At most, these experiments suggest that the advantage of going to more complex models is small; they do not imply that simpler models are better (Elomaa 1994). However, as we shall see below, more recent results call even this conclusion into question. Other Low-Variance Algorithms More generally, several pieces of recent work (e.g., (Friedman 1996; Domin...

9 | Decision tree grafting - Webb - 1997 |

8 | Learning redundant rules in noisy domains - Cestnik, Bratko - 1988 |