## Top-Down Induction of Decision Trees Classifiers - A Survey (2005)

Venue: | IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews |

Citations: | 19 - 3 self |

### BibTeX

@ARTICLE{Rokach05topdown,
  author  = {Lior Rokach and Oded Maimon},
  title   = {Top-Down Induction of Decision Trees Classifiers - A Survey},
  journal = {IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews},
  volume  = {35},
  number  = {4},
  year    = {2005},
  pages   = {476--487}
}

### Abstract

Decision trees are one of the most popular approaches for representing classifiers. Researchers from various disciplines, such as statistics, machine learning, pattern recognition, and data mining, have considered the problem of growing a decision tree from available data. This paper presents an updated survey of current methods for constructing decision tree classifiers in a top-down manner. It suggests a unified algorithmic framework for presenting these algorithms and describes the various splitting criteria and pruning methodologies.

Index Terms: Classification, decision trees, pruning methods, splitting criteria.

### Citations

5438 | C4.5: Programs for Machine Learning - Quinlan - 1993 |

4457 | Classification and Regression Trees - Breiman, Friedman, et al. - 1984 |
Citation Context: ...yperplanes, each orthogonal to one of the axes. Naturally, decision-makers prefer less complex decision trees, since they may be considered more comprehensible. Furthermore, according to Breiman et al. [5], the tree complexity has a crucial effect on its accuracy. The tree complexity is explicitly controlled by the stopping criteria used and the pruning method employed. Usually, the tree com... |

4178 | Pattern Classification and Scene Analysis - Duda, Hart - 1973 |
Citation Context: ...of the multivariate splitting criteria are based on a linear combination of the input attributes. Finding the best linear combination can be performed using greedy search [5], [29], linear programming [30], [31], linear discriminant analysis [18], [30], [32]-[35], and other methods [36]-[38]. VII. STOPPING CRITERIA The growing phase continues until a stopping criterion is triggered. The following conditions are... |

3644 | Induction of Decision Trees - Quinlan - 1986 |
Citation Context: ...subdivides the training set into smaller subsets until no split achieves a sufficient value of the splitting measure or a stopping criterion is satisfied. There are various top-down decision tree inducers, such as ID3 [11], C4.5 [12], and CART [5]. Some consist of two conceptual phases: growing and pruning (C4.5 and CART). Other inducers perform only the growing phase. V. UNIVARIATE SPLITTING CRITERIA A. Overview In mo... |
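The shared growing phase of these inducers can be sketched in Python. This is a minimal illustration of the top-down framework, not any specific algorithm: the function names, the dict-based node representation, and the use of information gain as the plugged-in criterion are our own choices; real inducers add attribute typing, pruning, and richer stopping rules.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    """Information gain of splitting on `attr` (one of many possible criteria)."""
    n = len(labels)
    remainder = 0.0
    for v in set(r[attr] for r in rows):
        sub = [y for r, y in zip(rows, labels) if r[attr] == v]
        remainder += len(sub) / n * entropy(sub)
    return entropy(labels) - remainder

def grow_tree(rows, labels, attributes, criterion=info_gain, min_gain=1e-9):
    """Top-down growing phase: choose the best split, partition, recurse.
    Stops when labels are pure, attributes run out, or no split helps."""
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or not attributes:
        return majority  # leaf: majority class
    gain, best = max((criterion(rows, labels, a), a) for a in attributes)
    if gain <= min_gain:  # stopping criterion: no split gains enough
        return majority
    node = {"attr": best, "children": {}}
    for v in set(r[best] for r in rows):  # one branch per attribute value
        idx = [i for i, r in enumerate(rows) if r[best] == v]
        node["children"][v] = grow_tree(
            [rows[i] for i in idx], [labels[i] for i in idx],
            [a for a in attributes if a != best], criterion, min_gain)
    return node
```

Swapping `criterion` for a Gini- or gain-ratio-based function yields CART-like or C4.5-like behavior within the same skeleton, which is the point of the survey's unified framework.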

498 | Multivariate adaptive regression splines - Friedman - 1991 |

474 | Very Simple Classification Rules Perform Well on Most Commonly Used Datasets - Holte - 1993 |

326 | Complexity in Statistical Inquiry - Rissanen - 1989 |
Citation Context: ...ee and the number of internal nodes is smaller than the number of leaves, OPT-2 is usually more efficient than OPT in terms of computational complexity. H. Minimum Description Length Pruning Rissanen [42], Quinlan and Rivest [43], and Mehta et al. [44] used the minimum description length (MDL) for evaluating the generalized accuracy of a node. This method measures the size of a decision tree by means o... |

307 | Inferring Decision Trees Using the Minimum Description Length Principle, Information and Computation, 80:227-248 - Quinlan, Rivest - 1989 |
Citation Context: ...rnal nodes is smaller than the number of leaves, OPT-2 is usually more efficient than OPT in terms of computational complexity. H. Minimum Description Length Pruning Rissanen [42], Quinlan and Rivest [43], and Mehta et al. [44] used the minimum description length (MDL) for evaluating the generalized accuracy of a node. This method measures the size of a decision tree by means of the number of bits requ... |

274 | SPRINT: A Scalable Parallel Classifier for Data Mining - Shafer, Mehta |
Citation Context: ...largest dataset that can be processed, because it uses a data structure that scales with the dataset size, and this data structure is required to be resident in main memory at all times. Shafer et al. [63] have presented a similar solution called SPRINT. This algorithm induces decision trees relatively quickly and removes all of the memory restrictions from decision tree induction. SPRINT scales any im... |

268 | A System for Induction of Oblique Decision Trees - Murthy, Kasif, et al. - 1994 |
Citation Context: ...ivariate criteria. Most of the multivariate splitting criteria are based on a linear combination of the input attributes. Finding the best linear combination can be performed using greedy search [5], [29], linear programming [30], [31], linear discriminant analysis [18], [30], [32]-[35], and other methods [36]-[38]. VII. STOPPING CRITERIA The growing phase continues until a stopping criterion is triggered. The... |

206 | SLIQ: A Fast Scalable Classifier for Data Mining - Mehta, Agrawal, et al. - 1996 |
Citation Context: ...ssification performance, meaning that the classification accuracy of the combined decision trees is not as good as the accuracy of a single decision tree induced from the entire dataset. Mehta et al. [62] have proposed SLIQ, an algorithm that does not require loading the entire dataset into main memory; instead, it uses secondary memory (disk), i.e., a certain instance is not necessarily resident in... |

204 | An Exploratory Technique for Investigating Large Quantities of Categorical Data - Kass - 1980 |
Citation Context: ...for node. D. CHAID Researchers in applied statistics have developed, starting from the early seventies, several procedures for generating decision trees, such as AID [49], MAID [50], THAID [51], and CHAID [52]. Chi-squared automatic interaction detection (CHAID) was originally designed to handle nominal attributes only. For each input attribute, CHAID finds the pair of values that is least significantly ... |

204 | Boolean Feature Discovery in Empirical Learning - Pagallo, Haussler - 1990 |
Citation Context: ...hat is that other classifiers can compactly describe a classifier that would be very challenging to represent using a decision tree. A simple illustration of this phenomenon is the replication problem [53] of decision trees. Since most decision trees divide the instance space into mutually exclusive regions to represent a concept, in some cases Fig. 3. Illustration of decision tree with replication. Fi... |

193 | Constructing Optimal Binary Decision Trees is NP-Complete - Hyafil, Rivest - 1976 |
Citation Context: ...mal decision tree from given data is considered to be a hard task. Hancock et al. [7] have shown that finding a minimal decision tree consistent with the training set is NP-hard. Hyafil and Rivest [8] have shown that constructing a minimal binary tree with respect to the expected number of tests required for classifying an unseen instance is NP-complete. Even finding the minimal equivalent decisi... |

186 | Shih, Y.: A Comparison of Prediction Accuracy, Complexity and Training Time of Thirty-Three Old and New Classification Algorithms - Loh - 1999 |
Citation Context: ...ed in this table. Nevertheless, most of these algorithms are variations of the algorithmic framework presented above. A profound comparison of the above algorithms and many others has been conducted in [72]. 484 IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS—PART C: APPLICATIONS AND REVIEWS, VOL. 35, NO. 4, NOVEMBER 2005 TABLE I ADDITIONAL DECISION TREES INDUCERS XI. ADVANTAGES AND DISADVANTAGES OF ... |

176 | An Empirical Comparison of Pruning Methods for Decision Tree Induction - Mingers - 1989 |
Citation Context: ...thods reported in the literature. Wallace and Patrick [45] proposed a minimum message length (MML) pruning method. Kearns and Mansour [46] provide a theoretically justified pruning algorithm. Mingers [26] proposed critical value pruning (CVP). This method prunes an internal node if its splitting criterion is not greater than a certain threshold, making it similar to a stopping criterion. Howev... |
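A sketch of how such a critical-value rule might be applied bottom-up. The node layout, the field names, and the keep-if-any-descendant-survives rule are our illustrative assumptions, not Mingers' exact formulation:

```python
def critical_value_prune(node, threshold):
    """Prune subtrees whose split scores never exceed the critical value.
    Internal nodes are dicts {"score", "majority", "children"}; leaves are
    plain class labels. Pruning proceeds bottom-up: a node survives if its
    own score clears the threshold, or if any child is still an internal
    node (meaning some descendant's score cleared it)."""
    if not isinstance(node, dict):
        return node  # already a leaf
    node["children"] = [critical_value_prune(c, threshold)
                        for c in node["children"]]
    survives = (node["score"] > threshold
                or any(isinstance(c, dict) for c in node["children"]))
    return node if survives else node["majority"]
```

With a high threshold the whole tree collapses to its root's majority label; lowering the threshold retains more structure, which is why CVP is usually run over a range of critical values.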

170 | Incremental Induction of Decision Trees - Utgoff - 1986 |
Citation Context: ...tree inducers require rebuilding the tree from scratch to reflect new data that has become available. Several researchers have addressed the issue of updating decision trees incrementally. Utgoff [68], [69] presents several methods for updating decision trees incrementally. An extension to the CART algorithm that is capable of inducing incrementally is described in Crawford [70]. XIII. CONCLUSION... |

164 | Automatic Construction of Decision Trees from Data: A MultiDisciplinary Survey - Murthy - 1998 |

127 | Decision tree induction based on efficient tree restructuring - Utgoff, Berkman, et al. - 1997 |

123 | Multivariate decision trees - Brodley, Utgoff - 1995 |

108 | Unknown Attribute Values in Induction - Quinlan - 1989 |
Citation Context: ...its values are missing) as well as classification (a new instance that misses certain values). This problem has been addressed by several researchers, such as Friedman [18], Breiman et al. [5], and Quinlan [48]. Friedman [18] suggests handling missing values in the training set in the following way. Let indicate the subset of instances in whose values are missing. When calculating the splitting criteria usi... |
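The excerpt is truncated, so the following only sketches the commonly described training-time approach: evaluate the splitting criterion on the instances whose value for the attribute is known, then down-weight the result by the known fraction (the correction C4.5 applies). The function name and `None`-as-missing convention are ours:

```python
def gain_with_missing(values, labels, criterion):
    """Evaluate a splitting criterion while ignoring instances whose value
    for the attribute is missing (None), then scale the result by the
    fraction of instances that do have a value."""
    known = [(v, y) for v, y in zip(values, labels) if v is not None]
    if not known:
        return 0.0  # nothing to evaluate: the attribute is entirely missing
    frac = len(known) / len(values)
    kv, ky = zip(*known)
    return frac * criterion(list(kv), list(ky))
```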

107 | V.: RainForest – A Framework for Fast Decision Tree Construction of Large Datasets - Gehrke, Ramakrishnan, et al. - 1998 |

105 | A further comparison of splitting rules for decision-tree induction - Buntine, Niblett - 1992 |

105 | BOAT - Optimistic Decision Tree Construction - Gehrke, Ganti, et al. - 1999 |
Citation Context: ...this requirement is considered modest and reasonable. Other decision tree inducers for large datasets can be found in the works of Alsabti et al. [65], Freitas and Lavington [66], and Gehrke et al. [67]. C. Incremental Induction Most of the decision tree inducers require rebuilding the tree from scratch to reflect new data that has become available. Several researchers have addressed the issue o... |

103 | Learning Boolean Concepts in the Presence of Many Irrelevant Features, Artificial Intelligence, 69(1-2) - Almuallim, Dietterich - 1994 |
Citation Context: ...es in which all nodes at the same level test the same attribute. Despite this restriction, oblivious decision trees are found to be effective as a feature selection procedure. Almuallim and Dietterich [54], as well as Schlimmer [55], have proposed forward feature selection procedures that construct oblivious decision trees, whereas Langley and Sage [56] suggested backward selection using the same means. ... |

100 | The Attribute Selection Problem in Decision Tree Generation - Fayyad, Irani - 1992 |
Citation Context: ...ttribute is binary, the Gini and twoing criteria are equivalent. For multiclass problems, the twoing criterion prefers attributes with evenly divided splits. J. Orthogonality Criterion Fayyad and Irani [17] have presented the orthogonality (ORT) criterion. This binary criterion is defined as (13) where is the angle between two distribution vectors and of the target attribute on the bags and , respectivel... |
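Equation (13) lost its symbols in extraction; as the surrounding text says, ORT compares the class-distribution vectors of the two branches of a binary split. A sketch under the common reading ORT = 1 − cos θ between those vectors (function and argument names are ours):

```python
import math

def ort(left_labels, right_labels, classes):
    """Orthogonality criterion sketch: 1 - cos(theta) between the class
    distribution vectors of the two branches of a binary split. Higher
    values mean the branches separate the classes more cleanly."""
    def dist(labels):
        n = len(labels) or 1
        return [sum(1 for y in labels if y == c) / n for c in classes]
    p, q = dist(left_labels), dist(right_labels)
    dot = sum(a * b for a, b in zip(p, q))
    norm = (math.sqrt(sum(a * a for a in p))
            * math.sqrt(sum(b * b for b in q)))
    return 1.0 - dot / norm if norm else 0.0
```

A split that sends each class entirely to one branch makes the two vectors orthogonal (ORT = 1); identical branch distributions give ORT = 0.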

96 | A Comparative Analysis of Methods for Pruning Decision Trees - Esposito, Malerba, et al. - 1997 |
Citation Context: ...uned if at least one of its children does not fulfill the pruning criterion. J. Comparison of Pruning Methods Several studies aim to compare the performance of different pruning techniques [6], [26], [47]. The results indicate that some methods (such as cost-complexity pruning and reduced-error pruning) tend to over-prune, i.e., create smaller but less accurate decision trees. Other methods (like err... |

94 | Split Selection Methods for Classification Trees - Loh, Shih - 1997 |
Citation Context: ...ivariate Splitting Criteria Comparative studies of the splitting criteria described above, and others, have been conducted by several researchers during the last thirty years, such as [5], [17], [24]-[28], [71], [73]. Most of these comparisons are based on empirical results, although there are some theoretical conclusions. Most researchers point out that in most cases the choice of split... |

89 | Coding Decision Trees - Wallace, Patrick - 1991 |
Citation Context: ...plitting cost of an internal node is calculated based on the cost aggregation of its children. I. Other Pruning Methods There are other pruning methods reported in the literature. Wallace and Patrick [45] proposed a minimum message length (MML) pruning method. Kearns and Mansour [46] provide a theoretically justified pruning algorithm. Mingers [26] proposed critical value pruning (CVP). This metho... |

76 | A Recursive Partitioning Decision Rule for Nonparametric Classification - Friedman - 1977 |
Citation Context: ...espectively. Fayyad and Irani [17] showed that this criterion performs better than the information gain and the Gini index for specific problem constellations. K. Kolmogorov-Smirnov Criterion Friedman [18] and Rounds [19] have suggested a binary criterion that uses the Kolmogorov-Smirnov distance: assuming a binary target attribute, the preferred split is the one that maximizes the distance between the two class-conditional cumulative distributions of the attribute (14). Utgoff and Clouse [20] s... |
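Equation (14) is garbled in the excerpt; the idea is to score a numeric attribute by the largest gap between its two class-conditional empirical CDFs. A sketch under that reading (the function name and the exhaustive threshold scan are our simplifications):

```python
def ks_split_value(values, labels, pos_class):
    """Kolmogorov-Smirnov splitting sketch for a binary target: return the
    threshold where the two class-conditional empirical CDFs of the
    attribute are farthest apart, together with that distance."""
    pos = sorted(v for v, y in zip(values, labels) if y == pos_class)
    neg = sorted(v for v, y in zip(values, labels) if y != pos_class)
    if not pos or not neg:
        return None, 0.0  # only one class present: no split to score
    best_d, best_t = 0.0, None
    for t in sorted(set(values)):  # candidate thresholds: observed values
        f_pos = sum(1 for v in pos if v <= t) / len(pos)
        f_neg = sum(1 for v in neg if v <= t) / len(neg)
        if abs(f_pos - f_neg) > best_d:
            best_d, best_t = abs(f_pos - f_neg), t
    return best_t, best_d
```

A distance of 1.0 means some threshold separates the two classes perfectly.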

76 | Mining Very Large Databases with Parallel Processing - Freitas, Lavington - 1998 |
Citation Context: ...input relation. However, this requirement is considered modest and reasonable. Other decision tree inducers for large datasets can be found in the works of Alsabti et al. [65], Freitas and Lavington [66], and Gehrke et al. [67]. C. Incremental Induction Most of the decision tree inducers require rebuilding the tree from scratch to reflect new data that has become available. Several researchers ha... |

69 | Public: A decision tree classifier that integrates building and pruning - Rastogi, Shim - 1998 |

68 | Efficiently Inducing Determinations: A Complete and Systematic Search Algorithm that Uses Optimal Pruning - Schlimmer - 1993 |
Citation Context: ...he same level test the same attribute. Despite this restriction, oblivious decision trees are found to be effective as a feature selection procedure. Almuallim and Dietterich [54], as well as Schlimmer [55], have proposed forward feature selection procedures that construct oblivious decision trees, whereas Langley and Sage [56] suggested backward selection using the same means. Kohavi and Sommer [57] hav... |

66 | An Iterative Growing and Pruning Algorithm for Classification Tree Design - Gelfand, Ravishankar, et al. - 1991 |

64 | On the Accuracy of Meta-Learning for Scalable Data Mining, Intelligent Information Systems, 8 - Chan, Stolfo - 1997 |
Citation Context: ...nduced is bounded by the memory size. Fifield [60] suggests a parallel implementation of the ID3 algorithm. However, like Catlett, it assumes that the whole dataset can fit in main memory. Chan and Stolfo [61] suggest partitioning the dataset into several disjoint datasets, such that each one is loaded separately into memory and used to induce a decision tree. The decision trees are then combined t... |

59 | Tree-Structured Classification via Generalized Discriminant Analysis, J. Am. Stat. Assoc. - Loh, Vanichsetakul - 1988 |

59 | Perceptron Trees: A Case Study in Hybrid Concept Representations - Utgoff - 1989 |
Citation Context: ...the input attributes. Finding the best linear combination can be performed using greedy search [5], [29], linear programming [30], [31], linear discriminant analysis [18], [30], [32]-[35], and other methods [36]-[38]. VII. STOPPING CRITERIA The growing phase continues until a stopping criterion is triggered. The following conditions are common stopping rules. All instances in the training set belong to a sing... |

58 | Towards Tractable Algebras for Bags - Grumbach, Milo |
Citation Context: ...on of the deterministic case when a supervisor classifies a tuple using a function. This paper uses the common notation of bag algebra to present projection and selection of tuples (see, for instance, [4]). Originally, the machine learning community introduced the problem of concept learning. To learn a concept is to infer its general definition from a set of examples. This definition may be eithe... |

56 | Trading Accuracy for Simplicity in Decision Trees - Bohanec, Bratko - 1994 |
Citation Context: ...ds can improve the generalization performance of a decision tree, especially in noisy domains. Another key motivation for pruning is "trading accuracy for simplicity," as presented by Bratko and Bohanec [39]. When the goal is to produce a sufficiently accurate compact concept description, pruning is highly useful. Within this process, the initial decision tree is seen as a completely accurate one. Thus, t... |

38 | Extensions to the cart algorithm - Crawford - 1989 |

37 | Oblivious Decision Trees and Abstract Cases - Langley, Sage - 1994 |
Citation Context: ...ature selection procedure. Almuallim and Dietterich [54], as well as Schlimmer [55], have proposed forward feature selection procedures that construct oblivious decision trees, whereas Langley and Sage [56] suggested backward selection using the same means. Kohavi and Sommer [57] have shown that oblivious decision trees can be converted to a decision table. Recently, Last et al. [58] have suggested a n... |

36 | A Fast, Bottom-Up Decision Tree Pruning Algorithm with Near-Optimal Generalization - Kearns, Mansour - 1998 |
Citation Context: ...its children. I. Other Pruning Methods There are other pruning methods reported in the literature. Wallace and Patrick [45] proposed a minimum message length (MML) pruning method. Kearns and Mansour [46] provide a theoretically justified pruning algorithm. Mingers [26] proposed critical value pruning (CVP). This method prunes an internal node if its splitting criterion is not greater than a certa... |

33 | Applications of Information Theory to Psychology - Attneave - 1959 |
Citation Context: ...fined as Gini(y, S) = 1 − Σ_{c_j ∈ dom(y)} (|σ_{y=c_j} S| / |S|)² (6). Consequently, the evaluation criterion for selecting the attribute a_i is defined as GiniGain(a_i, S) = Gini(y, S) − Σ_{v_{i,j} ∈ dom(a_i)} (|σ_{a_i=v_{i,j}} S| / |S|) · Gini(y, σ_{a_i=v_{i,j}} S) (7). D. Likelihood Ratio Chi-Squared Statistics The likelihood ratio is defined as [14]: G²(a_i, S) = 2 · ln 2 · |S| · InformationGain(a_i, S) (8). This ratio is useful for measuring the statistical significance of the information gain criterion. The null hypothesis is that the input attribute and the target attribute are co... |
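The Gini index, Gini gain, and likelihood-ratio statistic referenced as (6)-(8) in this excerpt can be restated as code. This is a direct sketch of those standard formulas; variable names are ours:

```python
import math
from collections import Counter

def gini(labels):
    """Gini index of a label sample, eq. (6)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_gain(values, labels):
    """Gini gain of splitting on an attribute, eq. (7): parent impurity
    minus the size-weighted impurity of the induced partition."""
    n = len(labels)
    weighted = 0.0
    for v in set(values):
        sub = [y for x, y in zip(values, labels) if x == v]
        weighted += len(sub) / n * gini(sub)
    return gini(labels) - weighted

def likelihood_ratio(information_gain, n):
    """Likelihood-ratio chi-squared statistic G^2, eq. (8):
    2 * ln(2) * |S| * InformationGain (the ln 2 converts bits to nats)."""
    return 2.0 * math.log(2) * n * information_gain
```

Under the null hypothesis of independence between attribute and target, G² is approximately chi-squared distributed, which is what makes the statistical significance test possible.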

32 | An Exact Probability Metric for Decision Tree Splitting and Stopping - Martin - 1997 |
Citation Context: ...plitting Criteria Additional univariate splitting criteria can be found in the literature, such as the permutation statistic [21], the mean posterior improvement [22], and the hypergeometric distribution measure [23]. M. Comparison of Univariate Splitting Criteria Comparative studies of the splitting criteria described above, and others, have been conducted by several researchers during the last thirty years, suc... |

27 | Simplifying Decision Trees, Int - Quinlan - 1987 |
Citation Context: ...set. On the other hand, if the given dataset is not large enough, they propose to use a cross-validation methodology, despite the computational complexity implications. C. Reduced-Error Pruning Quinlan [6] has suggested a simple procedure for pruning decision trees, known as reduced-error pruning. While traversing the internal nodes from the bottom to the top, the procedure checks, for each internal... |
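A sketch of the procedure the excerpt describes: traverse bottom-up and replace an internal node with a leaf whenever that does not increase error on a held-out pruning set. The node layout and the routing of pruning examples down the tree are our illustrative choices:

```python
def rep_prune(node, rows, labels):
    """Reduced-error pruning sketch. `rows`/`labels` are the pruning-set
    examples that reach this node. Internal nodes are dicts
    {"attr", "majority", "children"}; leaves are plain class labels."""
    if not isinstance(node, dict) or not rows:
        return node  # leaf, or no pruning data reaches here
    # bottom-up: first prune each child on the examples routed to it
    for v in list(node["children"]):
        idx = [i for i, r in enumerate(rows) if r[node["attr"]] == v]
        node["children"][v] = rep_prune(node["children"][v],
                                        [rows[i] for i in idx],
                                        [labels[i] for i in idx])
    def predict(t, row):
        while isinstance(t, dict):
            t = t["children"].get(row[t["attr"]], t["majority"])
        return t
    subtree_err = sum(predict(node, r) != y for r, y in zip(rows, labels))
    leaf_err = sum(node["majority"] != y for y in labels)
    # replace by the majority leaf when that is no worse on the pruning set
    return node["majority"] if leaf_err <= subtree_err else node
```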

26 | Multicategory Discrimination via Linear Programming - Bennett, Mangasarian - 1994 |
Citation Context: ...multivariate splitting criteria are based on a linear combination of the input attributes. Finding the best linear combination can be performed using greedy search [5], [29], linear programming [30], [31], linear discriminant analysis [18], [30], [32]-[35], and other methods [36]-[38]. VII. STOPPING CRITERIA The growing phase continues until a stopping criterion is triggered. The following conditions are commo... |

25 | Lower Bounds on Learning Decision Lists and Trees - Hancock, Jiang, et al. - 1996 |
Citation Context: ...be also defined, for instance: minimizing the number of nodes or minimizing the average depth. Induction of an optimal decision tree from given data is considered to be a hard task. Hancock et al. [7] have shown that finding a minimal decision tree consistent with the training set is NP-hard. Hyafil and Rivest [8] have shown that constructing a minimal binary tree with respect to the expected nu... |

25 | Decision Trees and Multi-Valued Attributes - Quinlan - 1988 |
Citation Context: ...tributes. Then, taking into consideration only attributes that have performed at least as well as the average information gain, the attribute that has obtained the best gain ratio is selected. Quinlan [15] has shown that the gain ratio tends to outperform the simple information gain criterion, both in terms of accuracy and in terms of classifier complexity. G. Distance Measure Lopez de Mantaras [... |
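A sketch of the two-stage selection heuristic the excerpt describes: only attributes that reach at least the average information gain are considered, and among those the best gain ratio wins. The function names and the dict-of-columns input are our own conventions:

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(values, labels):
    n = len(labels)
    remainder = 0.0
    for v in set(values):
        sub = [y for x, y in zip(values, labels) if x == v]
        remainder += len(sub) / n * entropy(sub)
    return entropy(labels) - remainder

def select_by_gain_ratio(columns, labels):
    """Pick the attribute with the best gain ratio among those whose
    information gain is at least the average gain over all attributes."""
    gains = {a: info_gain(vals, labels) for a, vals in columns.items()}
    avg = sum(gains.values()) / len(gains)
    def ratio(a):
        split_info = entropy(columns[a])  # penalizes many-valued attributes
        return gains[a] / split_info if split_info else 0.0
    eligible = [a for a in gains if gains[a] >= avg]
    return max(eligible, key=ratio)
```

In the test below, both attributes have identical information gain, but the four-valued attribute pays a larger split-information penalty, so the binary attribute is chosen: the behavior the gain ratio is designed for.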

24 | Myopic policies in sequential classification - Ben-Bassat - 1978 |