## Pruning decision trees and lists (2000)

Citations: 13 (0 self)

### BibTeX

```bibtex
@TECHREPORT{Frank00pruningdecision,
  author      = {Eibe Frank},
  title       = {Pruning decision trees and lists},
  institution = {},
  year        = {2000}
}
```

### Citations

4934 | C4.5: Programs for Machine Learning
- Quinlan
- 1993

Citation Context: ...em potentially more powerful predictors. However, they are also harder to interpret and computationally more expensive to generate. Standard learning algorithms for decision trees, for example, C4.5 (Quinlan, 1992) and CART (Breiman et al., 1984), generate a tree structure by splitting the training data into smaller and smaller subsets in a recursive top-down fashion. Starting with all the training data at the...

3909 | Classification and Regression Trees
- Breiman, Friedman, et al.
- 1984

Citation Context: ...ful predictors. However, they are also harder to interpret and computationally more expensive to generate. Standard learning algorithms for decision trees, for example, C4.5 (Quinlan, 1992) and CART (Breiman et al., 1984), generate a tree structure by splitting the training data into smaller and smaller subsets in a recursive top-down fashion. Starting with all the training data at the root node, at each node they ch...
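The recursive top-down induction that both contexts describe can be sketched as follows. This is a hedged toy illustration, not C4.5 or CART: it greedily splits on the attribute with the highest information gain and recurses until a node is pure or no attributes remain.

```python
# Illustrative sketch of recursive top-down decision tree induction
# (not C4.5 or CART): split on the attribute with the highest
# information gain, then recurse on each subset.
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    n = len(labels)
    split = {}
    for row, y in zip(rows, labels):
        split.setdefault(row[attr], []).append(y)
    remainder = sum(len(part) / n * entropy(part) for part in split.values())
    return entropy(labels) - remainder

def grow_tree(rows, labels, attrs):
    # Stop when the node is pure or no attributes remain.
    if len(set(labels)) == 1 or not attrs:
        return Counter(labels).most_common(1)[0][0]  # leaf: majority class
    best = max(attrs, key=lambda a: info_gain(rows, labels, a))
    tree = {"attr": best, "children": {}}
    for value in {row[best] for row in rows}:
        sub = [(r, y) for r, y in zip(rows, labels) if r[best] == value]
        sub_rows, sub_labels = zip(*sub)
        tree["children"][value] = grow_tree(
            list(sub_rows), list(sub_labels), [a for a in attrs if a != best])
    return tree

def classify(tree, row):
    while isinstance(tree, dict):
        tree = tree["children"][row[tree["attr"]]]
    return tree

# Tiny hypothetical dataset: class is "A" iff x == 0 and z == 1.
rows = [{"x": 0, "z": 1}, {"x": 0, "z": 0}, {"x": 1, "z": 1}, {"x": 1, "z": 0}]
labels = ["A", "B", "B", "B"]
tree = grow_tree(rows, labels, ["x", "z"])
print(classify(tree, {"x": 0, "z": 1}))  # -> A
```

The unpruned tree grown this way fits the training data exactly; the pruning methods surveyed in the thesis exist precisely to cut such trees back.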

3354 | Induction of Decision Trees
- Quinlan
- 1986

Citation Context: ...s "incremental reduced-error pruning" (Fürnkranz & Widmer, 1994; Fürnkranz, 1997), which leads to a combination of the two main paradigms for learning rule sets: obtaining rules from decision trees (Quinlan, 1987b) and separate-and-conquer rule learning (Fürnkranz, 1999). The particular example in Figure 5.1 is a two-class learning problem. In this case it is sufficient to learn a set of rules describing only o...

2868 | UCI Repository of Machine Learning Databases
- Merz, Murphy
- 1996

Citation Context: ...directions for future work. 1.4 Experimental Methodology: All methods investigated in this thesis are evaluated empirically on benchmark problems from the UCI repository of machine learning datasets (Blake et al., 1998). The same 27 datasets are used throughout. They comprise all the commonly-used practical datasets in the repository, excluding ones with more than 1000 instances due to the sheer number of experimen...

2534 | An Introduction to the Bootstrap
- Efron, Tibshirani
- 1993

Citation Context: ...proposed in the literature that are either modifications of existing algorithms or based on similar ideas. Crawford (1989) uses cost-complexity pruning in conjunction with the .632 bootstrap (Efron & Tibshirani, 1993) for error estimation, substituting it for the standard cross-validation procedure. However, Weiss and Indurkhya (1994a,b) demonstrate that cross-validation is almost unbiased and close to optimal in...

1694 | A Theory of the Learnable
- Valiant
- 1984

1631 | Experiments with a new boosting algorithm
- Freund, Schapire
- 1996

Citation Context: ...all the induced knowledge because they rely on the distance metric to make predictions. Another strand of research that is closely related to separate-and-conquer rule learning is boosting (Freund & Schapire, 1996; Schapire et al., 1997). Boosting produces a weighted combination of a series of so-called "weak" classifiers by iteratively re-weighting the training data according to the performance of the most rec...

1203 | Categorical Data Analysis
- Agresti
- 1990

Citation Context: ...most other classification paradigms, for example, instance-based learning (Aha et al., 1991), neural networks (Rumelhart & McClelland, 1986), Bayesian networks (Jordan, 1999), and logistic regression (Agresti, 1990), they embody an explicit representation of all the knowledge that has been induced from the training data. Given a standard decision tree or list, a user can determine manually how a particular pred...

1160 | Modeling by shortest data description
- Rissanen
- 1978

Citation Context: ...more accurate trees. However, the trees are also larger. Minimum description length principle: Several authors have proposed pruning methods based on the minimum description length (MDL) principle (Rissanen, 1978). These methods derive from the idea that a successful inducer will produce a classifier that compresses the data, and exploit the fact that the complexity of a model, as well as the complexity of a d...

1055 | Instance-based learning algorithms
- Aha, Kibler, et al.
- 1991

Citation Context: ...d Lists Decision trees (Quinlan, 1986b) and lists (Rivest, 1987) are two closely related types of classifier. In contrast to most other classification paradigms, for example, instance-based learning (Aha et al., 1991), neural networks (Rumelhart & McClelland, 1986), Bayesian networks (Jordan, 1999), and logistic regression (Agresti, 1990), they embody an explicit representation of all the knowledge that has been ...

976 | Fast effective rule induction
- Cohen
- 1995

Citation Context: ...ree's nodes. Two thirds of the data were used to grow the initial unpruned tree and the remaining third was set aside for pruning, the standard procedure for pruning a classifier using a hold-out set (Cohen, 1995; Fürnkranz, 1997; Oates & Jensen, 1999). Figure 3.4 shows the unpruned decision tree. The number of instances in the pruning data that are misclassified by the tree's individual nodes are given in pa...
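The hold-out procedure this context refers to can be sketched in a few lines. This is a hedged illustration of reduced-error pruning, not the exact algorithm of any cited system: the tree is a nested dict, and a subtree is replaced by a majority-class leaf whenever that does not increase the error on the pruning data.

```python
# Sketch of reduced-error pruning with a hold-out set: replace a
# subtree by a majority-class leaf if that is at least as accurate
# on the pruning data. Tree representation and data are hypothetical.
from collections import Counter

def classify(tree, row):
    while isinstance(tree, dict):
        tree = tree["children"][row[tree["attr"]]]
    return tree

def errors(tree, data):
    return sum(classify(tree, row) != y for row, y in data)

def majority(data):
    return Counter(y for _, y in data).most_common(1)[0][0]

def prune(tree, prune_data):
    if not isinstance(tree, dict) or not prune_data:
        return tree
    # Prune the children bottom-up first.
    for value, child in tree["children"].items():
        subset = [(r, y) for r, y in prune_data if r[tree["attr"]] == value]
        tree["children"][value] = prune(child, subset)
    leaf = majority(prune_data)
    # Keep the subtree only if it beats the leaf on the hold-out data.
    if errors(leaf, prune_data) <= errors(tree, prune_data):
        return leaf
    return tree

# Hypothetical unpruned tree that overfits: the split on "z" is noise.
tree = {"attr": "x", "children": {
    0: "A",
    1: {"attr": "z", "children": {0: "B", 1: "A"}},
}}
# The hold-out data says x=1 is simply class B, so the z-subtree goes.
hold_out = [({"x": 1, "z": 0}, "B"), ({"x": 1, "z": 1}, "B"),
            ({"x": 0, "z": 0}, "A")]
pruned = prune(tree, hold_out)
print(pruned["children"][1])  # -> B
```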

752 | A study of cross-validation and bootstrap for accuracy estimation and model selection
- Kohavi
- 1995

Citation Context: ...ed for testing exactly once. Usually the parameter k is set to ten. It has been found empirically that this choice produces the most reliable estimates of the classifier's true performance on average (Kohavi, 1995b), and there is also a theoretical result that supports this finding (Kearns, 1996). The variance of the estimate can be further reduced by taking the average of a repeated number of cross-validation r...
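The procedure described here, k-fold cross-validation with k = 10 and repeated runs to reduce variance, can be sketched from scratch. The classifier is a stand-in majority-class predictor and the dataset is made up; only the estimation scheme itself is the point.

```python
# Minimal sketch of repeated k-fold cross-validation: shuffle, split
# into k disjoint folds, test on each fold once, average the error,
# and repeat with different shuffles to reduce variance.
from collections import Counter
import random

def k_fold_error(data, k, train_and_test, seed=0):
    rng = random.Random(seed)
    data = data[:]
    rng.shuffle(data)
    folds = [data[i::k] for i in range(k)]  # k disjoint test folds
    total_errors = 0
    for i in range(k):
        test = folds[i]
        train = [x for j, f in enumerate(folds) if j != i for x in f]
        total_errors += train_and_test(train, test)
    return total_errors / len(data)

def majority_classifier(train, test):
    # Stand-in learner: predict the most frequent training class.
    prediction = Counter(y for _, y in train).most_common(1)[0][0]
    return sum(y != prediction for _, y in test)

# Hypothetical dataset: 70 instances of class A, 30 of class B.
data = [((i,), "A" if i < 70 else "B") for i in range(100)]
# Average several cross-validation runs, as the cited context suggests.
runs = [k_fold_error(data, 10, majority_classifier, seed=s) for s in range(5)]
print(sum(runs) / len(runs))  # the minority-class rate, 0.3
```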

747 | The CN2 induction algorithm
- Clark, Niblett
- 1989

Citation Context: ...the error on the hold-out set, IDCL uses a heuristic that includes a penalty based on the rule set's size. 5.5.4 Significance testing in rule learners: In addition to the parametric test used in CN2 (Clark & Niblett, 1989), permutation tests have been proposed as a means of making statistically reliable decisions in rule learners. Gaines (1989) appears to be the first to propose the use of a permutation test in an ind...

721 | Boosting the Margin: A New Explanation for the Effectiveness of Voting Methods, The Annals of Statistics
- Bartlett, Freund, et al.

Citation Context: ...knowledge because they rely on the distance metric to make predictions. Another strand of research that is closely related to separate-and-conquer rule learning is boosting (Freund & Schapire, 1996; Schapire et al., 1997). Boosting produces a weighted combination of a series of so-called "weak" classifiers by iteratively re-weighting the training data according to the performance of the most recently generated we...
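The re-weighting loop described in this context can be sketched in AdaBoost style. This is a hedged illustration, not the exact algorithm of the cited papers: each round fits a weak "decision stump" to the weighted data and up-weights the instances it misclassifies, then the stumps vote with weights.

```python
# AdaBoost-style sketch of boosting: weak decision stumps are fit to
# weighted data, and misclassified instances gain weight each round.
# Labels are +1/-1; the data below are hypothetical.
import math

def stump(data, weights):
    # Weak learner: best single-feature threshold classifier.
    best = None
    for f in range(len(data[0][0])):
        for t in sorted({x[f] for x, _ in data}):
            for sign in (1, -1):
                err = sum(w for (x, y), w in zip(data, weights)
                          if (1 if sign * (x[f] - t) >= 0 else -1) != y)
                if best is None or err < best[0]:
                    best = (err, f, t, sign)
    return best

def boost(data, rounds=5):
    n = len(data)
    weights = [1 / n] * n
    ensemble = []
    for _ in range(rounds):
        err, f, t, sign = stump(data, weights)
        err = max(err, 1e-10)  # avoid division by zero on a perfect stump
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, f, t, sign))
        # Re-weight: misclassified instances become more important.
        for i, (x, y) in enumerate(data):
            pred = 1 if sign * (x[f] - t) >= 0 else -1
            weights[i] *= math.exp(-alpha * y * pred)
        z = sum(weights)
        weights = [w / z for w in weights]
    return ensemble

def predict(ensemble, x):
    score = sum(a * (1 if s * (x[f] - t) >= 0 else -1)
                for a, f, t, s in ensemble)
    return 1 if score >= 0 else -1

data = [((0.0,), -1), ((1.0,), -1), ((2.0,), 1), ((3.0,), 1)]
ensemble = boost(data)
print([predict(ensemble, x) for x, _ in data])  # -> [-1, -1, 1, 1]
```

As the thesis context notes, such a weighted ensemble is an accurate predictor, but the induced knowledge is no longer explicit the way a single tree or rule list is.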

698 | Improved Boosting Algorithms using Confidence-rated Predictions
- Schapire, Singer
- 1999

Citation Context: ...ed by any of the rules learned so far and a weight of zero otherwise. Using a boosting algorithm for weak learners that can abstain from making a prediction when an instance is not covered by a rule (Schapire & Singer, 1998), it is possible to learn very accurate rule sets for two-class problems (Cohen & Singer, 1999). However, the boosted classifier does not make all the knowledge explicit because a substantial part of...

611 | Learning in Graphical Models
- Jordan, editor
- 1999

Citation Context: ...ated types of classifier. In contrast to most other classification paradigms, for example, instance-based learning (Aha et al., 1991), neural networks (Rumelhart & McClelland, 1986), Bayesian networks (Jordan, 1999), and logistic regression (Agresti, 1990), they embody an explicit representation of all the knowledge that has been induced from the training data. Given a standard decision tree or list, a user can...

528 | Approximate statistical tests for comparing supervised classification learning algorithms
- Dietterich
- 1998

Citation Context: ...validation. There exists a heuristic procedure based on repeated cross-validation that attempts to test whether two methods perform differently with respect to all possible data samples from a domain (Dietterich, 1998). However, it is based on an ad hoc modification of the standard t-test and does not appear to have been widely adopted in the machine learning community. 1.5 Historical Remarks: Most of this thesis wa...

438 | Very simple classification rules perform well on most commonly used datasets
- Holte
- 1993

Citation Context: ...mains are equally likely to occur. This is obviously not correct in the real world: practical domains usually exhibit a certain degree of smoothness and the underlying relationships are quite simple (Holte, 1993). This means that the "overfitting avoidance bias" implemented by pruning strategies often improves performance in practical applications (Schaffer, 1993). However, there are domains where pruning me...

375 | Learning Decision Lists
- Rivest
- 1987

Citation Context: ...ess of a pruning mechanism depends on its ability to distinguish noisy instances from predictive patterns in the training data. 1.2 Decision Trees and Lists: Decision trees (Quinlan, 1986b) and lists (Rivest, 1987) are two closely related types of classifier. In contrast to most other classification paradigms, for example, instance-based learning (Aha et al., 1991), neural networks (Rumelhart & McClelland, 1986)...

324 | Rule induction with CN2: Some recent improvements
- Clark, Boswell
- 1991

293 | Inferring decision trees using the Minimum Description Length principle
- Quinlan, Rivest
- 1989

Citation Context: ...t includes two pruning methods: error-based pruning in Revision 8 of C4.5 (Quinlan, 1996), and ITI's pruning procedure (Utgoff et al., 1997) that is based on the minimum description length principle (Quinlan & Rivest, 1989). They found that C4.5 generally performed best. They also performed an experiment in which they tuned C4.5's pruning parameters using nine-fold cross-validation. Unfortunately they compare these res...

202 | Boolean feature discovery in empirical learning
- Pagallo, Haussler
- 1990

Citation Context: ...ter 5 Pruning decision lists: Decision trees are accurate classifiers that are easy to understand. However, in some domains their comprehensibility suffers from a problem known as subtree replication (Pagallo & Haussler, 1990). When subtree replication occurs, identical subtrees can be found at several different places in the same tree structure. Figure 5.1 shows an example domain where subtree replication is inevitable. ...

200 | Improved Use of Continuous Attributes in C4.5
- Quinlan
- 1996

190 | Generating Accurate Rule Sets without Global Optimization
- Frank, Witten
- 1998

Citation Context: ...on the material for Chapter 3 was made in April, May, and June 1999. Chapter 4 is based on work presented at the Fifteenth International Conference on Machine Learning in Madison, Wisconsin (Frank & Witten, 1998b). Chapter 5 is also based on research presented at the same conference (Frank & Witten, 1998a). However, the material has been revised and significantly extended. During the course of my PhD studies,...

173 | Bias plus variance decomposition for zero-one loss functions
- Kohavi, Wolpert
- 1996

Citation Context: ...l possible domains. For every domain where learning algorithm A performs better than algorithm B, there is another domain where it performs worse. This result is known as the "no free lunch theorem" (Wolpert, 1996), or the "conservation law for generalization performance" (Schaffer, 1994). It implies that pruned and unpruned classifiers perform equally well averaged across all possible domains. The no free lunch...

170 | A Nearest Hyperrectangle Learning Method
- Salzberg
- 1991

Citation Context: ...a decision tree into a rule set. One strand of research considers rule learning as a generalization of instance-based learning (Aha et al., 1991) where an instance can be expanded to a hyperrectangle (Salzberg, 1991; Wettschereck & Dietterich, 1995; Domingos, 1996). A hyperrectangle is the geometric representation of a rule in instance space. Whereas separate-and-conquer rule learning proceeds in a "top-down" fa...

167 | An Empirical Comparison of Selection Measures for Decision Tree Induction
- Mingers
- 1989

Citation Context: ...tances from D with value Vi for a) return (Figure 4.1: Generic pre-pruning algorithm) ...is chosen. This splitting criterion can be the attribute's information gain, but other criteria are also possible (Mingers, 1988). The selected attribute is then used to split the set of instances, and the algorithm recurses. If no significant attributes are found in the first step, the splitting process stops and the subtree is ...
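The pre-pruning idea in this context, stop splitting when no attribute shows a significant association with the class, can be sketched with a chi-squared statistic compared against a fixed critical value. This is a hedged illustration with made-up data, not the exact test of any cited system.

```python
# Sketch of a pre-pruning significance check: compute the chi-squared
# statistic for the attribute/class contingency table and compare it
# against the 5% critical value before allowing a split.
from collections import Counter

def chi_squared(rows, labels, attr):
    n = len(labels)
    cls = Counter(labels)
    by_val = Counter(row[attr] for row in rows)
    observed = Counter((row[attr], y) for row, y in zip(rows, labels))
    stat = 0.0
    for v in by_val:
        for c in cls:
            expected = by_val[v] * cls[c] / n  # under independence
            stat += (observed[(v, c)] - expected) ** 2 / expected
    return stat

# Hypothetical 2x2 table, 1 degree of freedom: critical value 3.84
# at the 5% level. Attribute x is strongly associated with the class.
rows = [{"x": 0}] * 20 + [{"x": 1}] * 20
labels = ["A"] * 18 + ["B"] * 2 + ["B"] * 18 + ["A"] * 2
significant = chi_squared(rows, labels, "x") > 3.84
print(significant)  # -> True: splitting on x is allowed
```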

167 | Domain-specific keyphrase extraction
- Frank, Paynter, et al.
- 1999

Citation Context: ...e material presented in this thesis. I conducted research on using model trees for classification (Frank et al., 1998), on applying machine learning to automatic domain-specific keyphrase extraction (Frank et al., 1999), on adapting the naive Bayes learning technique to regression problems (Frank et al., in press), and on making better use of global discretization methods for numeric attributes (Frank & Witten, 199...

166 | An empirical comparison of pruning methods for decision tree induction
- Mingers
- 1989

158 | Resampling-Based Multiple Testing: Examples and Methods for P-Value Adjustments
- Westfall, Young
- 1993

Citation Context: ...performed, the more likely it becomes that a significant association is observed by chance alone. Consequently the significance level must be adjusted in a situation with multiple significance tests (Westfall & Young, 1993; Jensen & Schmill, 1997). Chapter 3 presents a technique that can be used to automatically find the significance level that maximizes the classifier's predictive accuracy on future data. Significance...
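The multiple-testing issue this context raises can be shown with the simplest adjustment, Bonferroni. Westfall and Young's resampling-based adjustment is more powerful, but the logic is the same: the more tests are performed, the stricter each individual test must be.

```python
# Bonferroni sketch of significance-level adjustment: with m tests,
# compare each p-value against alpha / m so the family-wise chance
# of a false positive stays below alpha. The p-values are made up.
def bonferroni(p_values, alpha=0.05):
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

p_values = [0.001, 0.02, 0.04, 0.3]
# Unadjusted, three of the four would look significant at alpha=0.05;
# after adjustment (threshold 0.0125) only the first survives.
print(bonferroni(p_values))  # -> [True, False, False, False]
```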

157 | An Exploratory Technique for Investigating Large Quantities of Categorical Data
- Kass
- 1980

Citation Context: ...mple, the parametric chi-squared test is used to decide when to stop splitting the training data into increasingly smaller subsets. The same technique is also used by the decision tree inducer CHAID (Kass, 1980). These pre-pruning techniques will be discussed in more detail in Chapter 4. Jensen et al. (1997) apply critical value pruning in conjunction with the chi-squared distribution. However, instead of...

156 | On Comparing Classifiers: Pitfalls to Avoid and a Recommended Approach
- Salzberg
- 1997

Citation Context: ...ifferently across all possible data samples that could potentially be drawn from the domain. It cannot test for this because all cross-validation estimates involved are based on the same set of data (Salzberg, 1997). It is solely a way of inferring the potential outcome of a complete cross-validation. There exists a heuristic procedure based on repeated cross-validation that attempts to test whether two methods...

149 | Generating production rules from decision trees
- Quinlan
- 1987

Citation Context: ...s "incremental reduced-error pruning" (Fürnkranz & Widmer, 1994; Fürnkranz, 1997), which leads to a combination of the two main paradigms for learning rule sets: obtaining rules from decision trees (Quinlan, 1987b) and separate-and-conquer rule learning (Fürnkranz, 1999). The particular example in Figure 5.1 is a two-class learning problem. In this case it is sufficient to learn a set of rules describing only o...

146 | Automatic Construction of Decision Trees from Data: A Multi-Disciplinary Survey
- Murthy
- 1998

Citation Context: ...the most extensively researched areas in machine learning. Several surveys of induction methods for decision trees have been published (Safavian & Landgrebe, 1991; Kalles, 1995; Breslow & Aha, 1997b; Murthy, 1998), and these all discuss different pruning strategies. In addition, empirical comparisons of a variety of different pruning methods have been conducted. This section first discusses the most popular pruni...

146 | A Conservation Law for Generalization Performance
- Schaffer
- 1994

Citation Context: ...etter than algorithm B, there is another domain where it performs worse. This result is known as the "no free lunch theorem" (Wolpert, 1996), or the "conservation law for generalization performance" (Schaffer, 1994). It implies that pruned and unpruned classifiers perform equally well averaged across all possible domains. The no free lunch theorem is based on two assumptions. The first assumption is that the te...

143 | Concept learning and the problem of small disjuncts
- Holte, Acker, et al.
- 1989

Citation Context: ...fies. It seems clear that small disjuncts are more error prone than large ones, simply because they enjoy less support from the training data. A number of authors have observed this fact empirically (Holte et al., 1989; Ting, 1994; Ali & Pazzani, 1995; Weiss, 1995; Van den Bosch et al., 1997; Weiss & Hirsh, 1998). In the following we explain why this is the case and why pruning algorithms are a remedy for disjuncts...

136 | Separate-and-Conquer Rule Learning
- Fürnkranz
- 1999

Citation Context: ...er, 1994; Fürnkranz, 1997), which leads to a combination of the two main paradigms for learning rule sets: obtaining rules from decision trees (Quinlan, 1987b) and separate-and-conquer rule learning (Fürnkranz, 1999). The particular example in Figure 5.1 is a two-class learning problem. In this case it is sufficient to learn a set of rules describing only one of the classes, for example, class A. A test instance...

123 | The lack of a priori distinctions between learning algorithms
- Wolpert
- 1996

Citation Context: ...l possible domains. For every domain where learning algorithm A performs better than algorithm B, there is another domain where it performs worse. This result is known as the "no free lunch theorem" (Wolpert, 1996), or the "conservation law for generalization performance" (Schaffer, 1994). It implies that pruned and unpruned classifiers perform equally well averaged across all possible domains. The no free lunch...

122 | Incremental reduced error pruning
- Fürnkranz, Widmer
- 1994

Citation Context: ...(Figure 5.1: An example of subtree replication) ...operations performed by a fast pruning algorithm known as "incremental reduced-error pruning" (Fürnkranz & Widmer, 1994; Fürnkranz, 1997), which leads to a combination of the two main paradigms for learning rule sets: obtaining rules from decision trees (Quinlan, 1987b) and separate-and-conquer rule learning (Fürnkran...

120 | Decision tree induction based on efficient tree restructuring
- Utgoff, Berkman, et al.
- 1997

Citation Context: ...e also published an empirical comparison of "tree-simplification procedures" that includes two pruning methods: error-based pruning in Revision 8 of C4.5 (Quinlan, 1996), and ITI's pruning procedure (Utgoff et al., 1997) that is based on the minimum description length principle (Quinlan & Rivest, 1989). They found that C4.5 generally performed best. They also performed an experiment in which they tuned C4.5's prunin...

119 | Multivariate decision trees
- Brodley, Utgoff
- 1995

Citation Context: ...nvolved in the decision at each node. "Multivariate" decision trees can test for higher-order relationships that involve more than one attribute, for example, linear combinations of attribute values (Brodley & Utgoff, 1995). This makes them potentially more powerful predictors. However, they are also harder to interpret and computationally more expensive to generate. Standard learning algorithms for decision trees, for...

119 | Overfitting avoidance as bias
- Schaffer
- 1993

Citation Context: ...d the underlying relationships are quite simple (Holte, 1993). This means that the "overfitting avoidance bias" implemented by pruning strategies often improves performance in practical applications (Schaffer, 1993). However, there are domains where pruning methods are likely to decrease the accuracy of a classifier, even in the IID setting (Schaffer, 1993). For example, according to the discussion from Section...

114 | PRISM: an algorithm for inducing modular rules
- Cendrowska
- 1987

Citation Context: ...wo classes and the closed-world assumption cannot be applied. There are two ways of learning rules in the general multi-class setting. The first approach generates a rule set separately for each class (Cendrowska, 1987). This has the disadvantage that rule sets for different classes can "overlap" in instance...

111 | Learning by Being Told and Learning from Examples: An Experimental Comparison of the Two Methods of Knowledge Acquisition in the Context of Developing an Expert System for Soybean Disease Diagnosis, Int. J. of Policy Analysis and Info
- Michalski, Chilausky
- 1980

Citation Context: ...ng algorithms for classification problems have many practical applications. Consider, for example, one of the first fielded applications of classification learning: the diagnosis of soybean diseases (Michalski & Chilausky, 1980). In this application, the individual instances are soybean plants that are described by a set of attributes. Most of the attributes correspond to symptoms of various soybean diseases and their value...

107 | Wrappers for Performance Enhancement and Oblivious Decision Graphs
- Kohavi
- 1995

Citation Context: ...ed for testing exactly once. Usually the parameter k is set to ten. It has been found empirically that this choice produces the most reliable estimates of the classifier's true performance on average (Kohavi, 1995b), and there is also a theoretical result that supports this finding (Kearns, 1996). The variance of the estimate can be further reduced by taking the average of a repeated number of cross-validation r...

104 | ASSISTANT 86: A knowledge-elicitation tool for sophisticated users
- Cestnik, Kononenko, et al.
- 1987

103 | Error-based and entropy-based discretization of continuous features
- Kohavi, Sahami
- 1996

Citation Context: ..., with respect to the classification error, it appears to be completely useless. This property of error-based criteria has first been observed in the closely related problem of global discretization (Kohavi & Sahami, 1996), where numeric attributes are discretized into intervals prior to induction. The experimental results invite the question of when pre-pruning should be used instead of post-pruning in practical appl...

103 | A Survey of Decision Tree Classifier Methodology
- Safavian, Landgrebe
- 1991

Citation Context: ...3.6 Related work: Pruning methods for decision trees are one of the most extensively researched areas in machine learning. Several surveys of induction methods for decision trees have been published (Safavian & Landgrebe, 1991; Kalles, 1995; Breslow & Aha, 1997b; Murthy, 1998), and these all discuss different pruning strategies. In addition, empirical comparisons of a variety of different pruning methods have been conducte...

101 | Improvements on cross-validation: the .632+ bootstrap method
- Efron, Tibshirani
- 1997

95 | Bias, Variance and Arcing Classifiers
- Breiman
- 1996

Citation Context: ...ut increasing the bias. There are several ways to define the bias plus variance decomposition for classification learning in exact mathematical terms (Dietterich & Kong, 1995; Kohavi & Wolpert, 1996; Breiman, 1996; Tibshirani, 1996; Friedman, 1997). However, it has yet to be determined which of these definitions is the most appropriate. 2.5 Conclusions: This chapter explains why pruning methods are an important...