## Feature Minimization within Decision Trees (1996)

Venue: Computational Optimization and Applications

Citations: 14 (2 self)

### BibTeX

@ARTICLE{Bredensteiner96featureminimization,
  author  = {Erin J. Bredensteiner and Kristin P. Bennett},
  title   = {Feature Minimization within Decision Trees},
  journal = {Computational Optimization and Applications},
  year    = {1996},
  volume  = {10},
  pages   = {10--111}
}

### Abstract

Decision trees for classification can be constructed using mathematical programming. Within decision tree algorithms, the feature minimization problem is to construct accurate decisions using as few features or attributes within each decision as possible. Feature minimization is an important aspect of data mining since it helps identify which attributes are important and helps produce accurate and interpretable decision trees. In feature minimization with bounded accuracy, we minimize the number of features subject to a given misclassification error tolerance. This problem can be formulated as a parametric bilinear program and is shown to be NP-complete. A parametric Frank-Wolfe method is used to solve the bilinear subproblems. The resulting minimization algorithm produces more compact, accurate, and interpretable trees. This procedure can be applied to many different error functions. Formulations and results for two error functions are given. One method, FM RLP-P, dramatically reduced the number of features of one dataset from 147 to 2 while maintaining an 83.6% testing accuracy. Computational results compare favorably with the standard univariate decision tree method, C4.5, as well as with linear programming methods of tree construction.

Key Words: data mining, machine learning, feature minimization, decision trees, bilinear programming.

Knowledge Discovery and Data Mining Group, Department of Mathematical Sciences, Rensselaer Polytechnic Institute, Troy, NY 12180. Email: bredee@rpi.edu, bennek@rpi.edu. Telephone: (518) 276-6899. FAX: (518) 276-4824. This material is based on research supported by National Science Foundation Grant 949427.
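The bounded-accuracy problem described in the abstract is NP-complete, and the paper attacks it with a parametric bilinear program rather than enumeration. Purely as an illustration of the objective itself (not the paper's algorithm), the sketch below brute-forces the smallest feature subset whose classifier stays within an error tolerance; the nearest-centroid wrapper, the toy data, and all names are hypothetical stand-ins.

```python
from itertools import combinations

def error_rate(X, y, feats):
    """Fraction misclassified by a nearest-centroid rule restricted to `feats`."""
    classes = sorted(set(y))
    centroids = {}
    for c in classes:
        pts = [x for x, label in zip(X, y) if label == c]
        centroids[c] = [sum(p[f] for p in pts) / len(pts) for f in feats]
    errors = 0
    for x, label in zip(X, y):
        pred = min(classes, key=lambda c: sum(
            (x[f] - m) ** 2 for f, m in zip(feats, centroids[c])))
        errors += (pred != label)
    return errors / len(y)

def min_features_bounded_accuracy(X, y, tol):
    """Smallest feature subset whose error is within `tol` (exponential search)."""
    n = len(X[0])
    for k in range(1, n + 1):
        for feats in combinations(range(n), k):
            if error_rate(X, y, feats) <= tol:
                return feats
    return tuple(range(n))

# Toy data: only feature 0 separates the two classes.
X = [(0.1, 5.0, 2.0), (0.2, 1.0, 8.0), (0.3, 7.0, 3.0),
     (0.9, 6.0, 2.5), (0.8, 1.5, 7.0), (1.0, 7.5, 3.5)]
y = [0, 0, 0, 1, 1, 1]
print(min_features_bounded_accuracy(X, y, tol=0.0))  # → (0,)
```

The exponential subset search is exactly what the NP-completeness result warns against on real data; the bilinear-programming formulation is the paper's way of trading exhaustive search for a tractable local method.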

### Citations

10921 | Computers and Intractability: A Guide to the Theory of NP-Completeness
- Garey, Johnson
Citation Context ...polynomial time whether x'y > 0 for at least K vectors x in X and if at most a given number of the elements y_i, i = 1, ..., n, are nonzero. To show that the above problem is NP-complete, the Open Hemisphere problem of [11] can be easily transformed into a single instance of the bounded accuracy with limited features problem. The Open Hemisphere problem is the problem of determining if there is a vector y such that x...

4934 | C4.5: Programs for Machine Learning
- Quinlan
- 1993
Citation Context ...avoid over-parameterization and the resulting trees are more readily interpretable provided the number of decisions is not excessive. Examples of univariate decision tree algorithms are C4.5 and CART [24, 6]. Reducing the number of features at each decision allows the inclusion of all of the benefits of multivariate decisions while maintaining the simplicity of univariate decisions. The goal of this pape...

3909 | Classification and Regression Trees
- Breiman, Friedman, et al.
- 1984
Citation Context ...avoid over-parameterization and the resulting trees are more readily interpretable provided the number of decisions is not excessive. Examples of univariate decision tree algorithms are C4.5 and CART [24, 6]. Reducing the number of features at each decision allows the inclusion of all of the benefits of multivariate decisions while maintaining the simplicity of univariate decisions. The goal of this pape...

2171 | Support-vector networks
- Cortes, Vapnik
- 1995
Citation Context ... 2.2 Feature Minimization Applied to RLP-P: The following Perturbed Robust Linear Program (RLP-P) [3] is a linear programming modification of the Generalized Optimal Plane problem of Cortes and Vapnik [8]. This method is constructed to reduce a weighted average of the sum of the distances from the misclassified points to the separating plane and to decrease the classification error. min over (w, γ, u, v, s) of (1...

1031 | Wrappers for feature subset selection
- Kohavi, John
- 1997
Citation Context ...es (Forward Selection and Backward Elimination) are very popular in statistics [14]. They use statistical measures to determine which features to add or delete. The wrapper methods of Kohavi and John [15] provide a less greedy search of the feature space. But none of these approaches change the underlying discrimination algorithms. In Section 2, we will discuss the background and formulation of our fe...

741 | UCI Repository of Machine Learning Databases, http://www.ics.uci.edu/~mlearn/MLRepository.html
- Murphy, Aha
- 1992
Citation Context ...Star/Galaxy Database and the Database Marketing data set are available via anonymous file transfer protocol (ftp) from the University of California Irvine UCI Repository of Machine Learning Databases [21]. Database Marketing: The Database Marketing data set is divided into a training portion and a testing portion. The training set contains information on 1979 customers. The testing set has 1491 custome...

721 | Cross-Validatory Choice and Assessment of Statistical Predictions (with Discussion)
- Stone
- 1974
Citation Context ...achieve better results. In RLP-P and FM RLP-P, we let the perturbation parameter equal .02. Better solutions may result with different choices of this parameter. To estimate generalization or accuracy on future data, 10-fold cross validation [26] was used to evaluate the testing set accuracies. The original data set is split into ten equal parts. Nine of these are used for training and the remaining one is saved for testing. This process is r...
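The 10-fold procedure described in this context can be sketched in a few lines; the function name and shuffling seed are illustrative, not from the paper.

```python
import random

def ten_fold_splits(n_samples, k=10, seed=0):
    """Yield (train, test) index lists: each sample is held out exactly once."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]       # k nearly equal parts
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

splits = list(ten_fold_splits(20))
print(len(splits))          # → 10
print(len(splits[0][1]))    # → 2 held-out samples per fold when n = 20
```

Averaging the test-set accuracy over the ten folds gives the cross-validated estimate the paper reports.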

180 | Analysis of hidden units in a layered network trained to classify sonar targets
- Gorman, Sejnowski
- 1988
Citation Context ...ly linearly separable. These two data sets are generated from a large set of star and galaxy images collected by Odewahn [22] at the University of Minnesota. Sonar, Mines vs. Rocks: The Sonar data set [13] contains sixty real-valued attributes between 0.0 and 1.0 used to define 208 mines and rocks. Attributes are obtained by bouncing sonar signals off a metal cylinder (or rock) at various angles and ris...

172 | Mathematical Programs with Equilibrium Constraints, Cambridge University Press
- Luo, Pang, et al.
- 1996
Citation Context ...nction. The step function is removed from problem (5) using properties found in [18] and [19]. The details are contained in the appendix. The resulting linear program (6) with equilibrium constraints [17] is equivalent to the original problem (5). min over (w+, w-, γ, u, v, r) of (1/m)e'u + (1/k)e'v + λe'r subject to u + A(w+ - w-) - eγ - e ≥ 0, v - B(w+ - w-) + eγ - e ≥ 0, (w+ + w-)'(e - r) = 0, 0 ≤ r ≤ e, u ≥ 0, v ≥ 0, w+ ≥ ...
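In the equilibrium-constrained program quoted here, the constraint (w+ + w-)'(e - r) = 0 together with 0 ≤ r ≤ e forces r_i = 1 exactly where feature i carries nonzero weight, so the e'r term in the objective counts the features used. A small hypothetical check of that reading (the weight values are made up):

```python
def minimal_feasible_r(w_plus, w_minus):
    """Smallest r in [0, 1]^n satisfying (w+ + w-)'(e - r) = 0."""
    return [1.0 if wp + wm > 0 else 0.0 for wp, wm in zip(w_plus, w_minus)]

w_plus  = [0.5, 0.0, 0.0]   # positive parts of the weight vector
w_minus = [0.0, 1.2, 0.0]   # negative parts (both kept nonnegative)
r = minimal_feasible_r(w_plus, w_minus)
print(r)       # → [1.0, 1.0, 0.0]
print(sum(r))  # → 2.0, the number of features with nonzero weight

# The complementarity constraint holds: (w+ + w-)'(e - r) = 0.
assert sum((wp + wm) * (1 - ri)
           for wp, wm, ri in zip(w_plus, w_minus, r)) == 0.0
```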

166 | An algorithm for quadratic programming
- Frank, Wolfe
- 1956
Citation Context ...s given in the following two subsections. 4.1 Bilinear Subproblems: The parametric bilinear programming formulation (8) is an uncoupled bilinear program. It has been shown that a Frank-Wolfe algorithm [10] applied to an uncoupled bilinear program will converge to a global solution or a stationary point [5]. Applying this Frank-Wolfe algorithm to problem (8) we obtain the following algorithm: Algorithm ...
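For an uncoupled bilinear program the objective is linear in each block of variables, so each Frank-Wolfe step is a linear program in one block with the other fixed, and the method reduces to alternating linear minimizations. The sketch below is a hypothetical minimal instance with unit-simplex feasible sets (so each linear subproblem is solved in closed form at a vertex); the paper's subproblems are the LPs derived from its tree formulations, not this toy.

```python
def frank_wolfe_bilinear(Q, iters=100):
    """Alternate linear minimizations of x'Qy over two unit simplices.

    Terminates at a vertex pair that is a stationary point (possibly a
    global minimum) of the uncoupled bilinear program.
    """
    n, m = len(Q), len(Q[0])
    x, y = [1.0 / n] * n, [1.0 / m] * m   # start at the simplex centers
    for _ in range(iters):
        # Fix y: minimize (Qy)'x over the simplex -> vertex at the min coordinate.
        cx = [sum(Q[i][j] * y[j] for j in range(m)) for i in range(n)]
        i_star = min(range(n), key=cx.__getitem__)
        x_new = [float(i == i_star) for i in range(n)]
        # Fix x: minimize (Q'x)'y over the simplex.
        cy = [sum(Q[i][j] * x_new[i] for i in range(n)) for j in range(m)]
        j_star = min(range(m), key=cy.__getitem__)
        y_new = [float(j == j_star) for j in range(m)]
        if x_new == x and y_new == y:
            break                          # stationary: neither block improves
        x, y = x_new, y_new
    value = sum(x[i] * Q[i][j] * y[j] for i in range(n) for j in range(m))
    return x, y, value

x, y, value = frank_wolfe_bilinear([[1.0, 2.0], [0.0, 3.0]])
print(value)  # → 0.0 (the smallest entry of Q, reached at a vertex pair)
```

On this instance the alternating scheme reaches the global minimum; in general, as the context notes, only a stationary point is guaranteed.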

119 | Multivariate decision trees
- Brodley, Utgoff
- 1995
Citation Context ...ons using as few features as possible. By minimizing the number of features used at each decision, understandability of the resulting tree is increased and the number of data evaluations is decreased [7]. Feature minimization is not necessary in univariate decision tree algorithms in which each decision in the tree is based on a single feature or attribute. Note that in this paper we use the term fea...

60 | Feature selection via mathematical programming
- Bradley, Mangasarian, et al.
Citation Context ...ting mathematical program constructs the discriminant and minimizes the number of features used simultaneously. Other mathematical programming approaches to feature minimization have been proposed in [23, 20]. The problem formulations and methods of solution are different from our feature minimization methods. All of these recent mathematical programming approaches contrast with popular feature selection...

52 | Feature Subset Selection Using the Wrapper Method: Overfitting and Dynamic Search Space Topology
- Kohavi, Sommerfield
- 1995
Citation Context ...minimizes some measure of the classification error. We propose directly changing the underlying discrimination algorithm. Some common approaches to feature minimization are based on greedy heuristics [16, 7]. Sequential Backward Elimination (SBE) and Sequential Forward Elimination (SFE) [7] involve searching the feature space for features that do not contribute (SBE) or contribute (SFE) to the quality of...
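As a point of comparison for the greedy heuristics mentioned in this context, Sequential Backward Elimination can be sketched as follows. The nearest-centroid error measure and the toy data are hypothetical; SBE itself only requires some error function to re-evaluate after each candidate deletion.

```python
def centroid_error(X, y, feats):
    """Fraction misclassified by a nearest-centroid rule on features `feats`."""
    classes = sorted(set(y))
    cent = {c: [sum(x[f] for x, l in zip(X, y) if l == c) / y.count(c)
                for f in feats] for c in classes}
    wrong = sum(min(classes, key=lambda c: sum(
        (x[f] - m) ** 2 for f, m in zip(feats, cent[c]))) != l
        for x, l in zip(X, y))
    return wrong / len(y)

def sequential_backward_elimination(X, y, error_fn):
    """Greedily drop the feature whose removal does not worsen the error."""
    feats = list(range(len(X[0])))
    err = error_fn(X, y, feats)
    while len(feats) > 1:
        trials = [(error_fn(X, y, [f for f in feats if f != g]), g) for g in feats]
        best_err, worst_feat = min(trials)
        if best_err > err:
            break                      # every remaining feature contributes
        feats.remove(worst_feat)
        err = best_err
    return feats, err

# Toy data: only feature 0 separates the two classes.
X = [(0.1, 5.0, 2.0), (0.2, 1.0, 8.0), (0.3, 7.0, 3.0),
     (0.9, 6.0, 2.5), (0.8, 1.5, 7.0), (1.0, 7.5, 3.5)]
y = [0, 0, 0, 1, 1, 1]
print(sequential_backward_elimination(X, y, centroid_error))  # → ([0], 0.0)
```

Each pass refits the wrapper once per remaining feature, which is the greediness the paper's bilinear formulation is designed to avoid.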

45 | Applied Multivariate Data Analysis Volume II: Categorical and Multivariate Methods
- Jobson
- 1992
Citation Context ...move is determined by finding the best discriminant for each possible attribute. Similar stepwise discrimination procedures (Forward Selection and Backward Elimination) are very popular in statistics [14]. They use statistical measures to determine which features to add or delete. The wrapper methods of Kohavi and John [15] provide a less greedy search of the feature space. But none of these approache...

43 | Automated star/galaxy discrimination with neural networks
- Odewahn, Stockwell, et al.
- 1992
Citation Context ...a galaxy and is described by 14 numeric attributes. The bright data set is nearly linearly separable. These two data sets are generated from a large set of star and galaxy images collected by Odewahn [22] at the University of Minnesota. Sonar, Mines vs. Rocks: The Sonar data set [13] contains sixty real-valued attributes between 0.0 and 1.0 used to define 208 mines and rocks. Attributes are obtained by...

39 | Decision tree construction via linear programming
- Bennett
- 1992
Citation Context ...t S is denoted by arg min_{x in S} f(x). For a vector x in R^n, x+ will denote the vector in R^n with components (x+)_i := max{x_i, 0}, i = 1, ..., n. The step function x_* will denote the vector in [0, 1]^n with components (x_*)_i := 0 if x_i ≤ 0 and (x_*)_i := 1 if x_i > 0, i = 1, ..., n. 2 Feature Minimization: At each decision we are interested in finding a linear function that separates the tw...
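The plus function and step function defined in this context are componentwise and straightforward to state in code; the function names are illustrative. The step function applied to the magnitudes of a weight vector counts its nonzero components, which is how feature usage is measured.

```python
def plus(x):
    """(x+)_i := max{x_i, 0}."""
    return [max(xi, 0.0) for xi in x]

def step(x):
    """(x_*)_i := 0 if x_i <= 0, and 1 if x_i > 0."""
    return [1.0 if xi > 0 else 0.0 for xi in x]

w = [-2.0, 0.0, 3.5]
print(plus(w))   # → [0.0, 0.0, 3.5]
print(step(w))   # → [0.0, 0.0, 1.0]
print(sum(step([abs(wi) for wi in w])))  # → 2.0, features with nonzero weight
```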

39 | Misclassification minimization
- Mangasarian
- 1994
Citation Context ...[3]. We will refer to it as the Perturbed Robust Linear Program (RLP-P). Our feature minimization method could also be applied to algorithms that minimize the number of points misclassified such as [2, 18] or to other successful linear programming approaches [12, 25], but we leave these extensions for future work. 2.1 Feature Minimization Applied to RLP: The following robust linear programming problem, ...

36 |
Neural network training via linear programming
- Bennett, Mangasarian
- 1992
Citation Context ...rst error function minimizes the average magnitude of misclassified points within each class. The underlying problem without feature minimization is a linear program. This robust linear program (RLP) [4] has been used for decision tree construction [1]. RLP combined with the greedy sequential backward elimination method for feature minimization, a simplified version of SBE, forms the basis of a breas...

35 | Bilinear separation of two sets in n-space
- Bennett, Mangasarian
- 1992
Citation Context ...of a parametric bilinear program. We then prove in Section 3 that our feature minimization problem is NP-complete. In Section 4, we propose an algorithm based on the Frank-Wolfe method discussed in [5] for solving the parametric bilinear programming problem. Section 5 contains a computational comparison of our feature minimization method to C4.5 and two linear programming approaches to decision tre...

30 | Improved linear programming models for discriminant analysis, Decision Sciences
- Glover
- 1990
Citation Context ...am (RLP-P). Our feature minimization method could also be applied to algorithms that minimize the number of points misclassified such as [2, 18] or to other successful linear programming approaches [12, 25], but we leave these extensions for future work. 2.1 Feature Minimization Applied to RLP: The following robust linear programming problem, RLP [4], minimizes a weighted average of the sum of the distan...

27 | Machine learning via polyhedral concave minimization
- Mangasarian
- 1995
Citation Context ...ting mathematical program constructs the discriminant and minimizes the number of features used simultaneously. Other mathematical programming approaches to feature minimization have been proposed in [23, 20]. The problem formulations and methods of solution are different from our feature minimization methods. All of these recent mathematical programming approaches contrast with popular feature selection...

23 | A polynomial time algorithm for the construction and training of a class of multilayer perceptrons, Neural Networks 6
- Roy, Kim, et al.
- 1993
Citation Context ...am (RLP-P). Our feature minimization method could also be applied to algorithms that minimize the number of points misclassified such as [2, 18] or to other successful linear programming approaches [12, 25], but we leave these extensions for future work. 2.1 Feature Minimization Applied to RLP: The following robust linear programming problem, RLP [4], minimizes a weighted average of the sum of the distan...

19 | Geometry in learning
- Bennett, Bredensteiner
- 1998
Citation Context ...ication of the first. In addition to decreasing the average magnitude of misclassified points, it also decreases the maximum classification error. This problem can also be written as a linear program [3]. We will refer to it as the Perturbed Robust Linear Program (RLP-P). Our feature minimization method could also be applied to algorithms that minimize the number of points misclassified such as [2,...

17 | A parametric optimization method for machine learning
- Bennett, Bredensteiner
- 1997
Citation Context ...[3]. We will refer to it as the Perturbed Robust Linear Program (RLP-P). Our feature minimization method could also be applied to algorithms that minimize the number of points misclassified such as [2, 18] or to other successful linear programming approaches [12, 25], but we leave these extensions for future work. 2.1 Feature Minimization Applied to RLP: The following robust linear programming problem, ...

13 | Mathematical programming in machine learning
- Mangasarian
- 1996
Citation Context ...here λ > 0 is constant. The first issue we will confront in the above problem is the elimination of the step function. The step function is removed from problem (5) using properties found in [18] and [19]. The details are contained in the appendix. The resulting linear program (6) with equilibrium constraints [17] is equivalent to the original problem (5). min over (w+, w-, γ, u, v, r) of (1/m)e'u + (1/k)e'v + λe'r subj...

9 | Image analysis and machine learning applied to Breast cancer diagnosis and prognosis
- Wolberg, Street, et al.
- 1995
Citation Context ...n tree construction [1]. RLP combined with the greedy sequential backward elimination method for feature minimization, a simplified version of SBE, forms the basis of a breast cancer diagnosis system [28, 27]. The second error function is a slight modification of the first. In addition to decreasing the average magnitude of misclassified points, it also decreases the maximum classification error. This pro...

2 | Breast cancer diagnosis and prognosis via linear-programming-based machine learning
- 1994
Citation Context ...n tree construction [1]. RLP combined with the greedy sequential backward elimination method for feature minimization, a simplified version of SBE, forms the basis of a breast cancer diagnosis system [28, 27]. The second error function is a slight modification of the first. In addition to decreasing the average magnitude of misclassified points, it also decreases the maximum classification error. This pro...