## Irrelevant Features and the Subset Selection Problem (1994)

Venue: | MACHINE LEARNING: PROCEEDINGS OF THE ELEVENTH INTERNATIONAL |

Citations: | 613 - 23 self |

### BibTeX

@INPROCEEDINGS{John94irrelevantfeatures,

author = {George H. John and Ron Kohavi and Karl Pfleger},

title = {Irrelevant Features and the Subset Selection Problem},

booktitle = {MACHINE LEARNING: PROCEEDINGS OF THE ELEVENTH INTERNATIONAL},

year = {1994},

pages = {121--129},

publisher = {Morgan Kaufmann}

}

### Abstract

We address the problem of finding a subset of features that allows a supervised induction algorithm to induce small high-accuracy concepts. We examine notions of relevance and irrelevance, and show that the definitions used in the machine learning literature do not adequately partition the features into useful categories of relevance. We present definitions for irrelevance and for two degrees of relevance. These definitions improve our understanding of the behavior of previous subset selection algorithms, and help define the subset of features that should be sought. The features selected should depend not only on the features and the target concept, but also on the induction algorithm. We describe a method for feature subset selection using cross-validation that is applicable to any induction algorithm, and discuss experiments conducted with ID3 and C4.5 on artificial and real datasets.
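The cross-validation-based selection method mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the search strategy (greedy backward elimination), the stand-in induction algorithm `table_majority_learner` (used here instead of ID3 or C4.5), the fold assignment by index, and the toy dataset are all assumptions chosen to keep the example self-contained. What it does show is the core idea that a candidate feature subset is scored by running the induction algorithm itself under n-fold cross-validation.

```python
import itertools
from collections import Counter, defaultdict


def table_majority_learner(train_rows, train_labels):
    """Hypothetical stand-in induction algorithm: majority label per feature vector."""
    table = defaultdict(Counter)
    for row, label in zip(train_rows, train_labels):
        table[row][label] += 1
    default = Counter(train_labels).most_common(1)[0][0]

    def predict(row):
        votes = table.get(row)  # .get() does not create missing keys
        return votes.most_common(1)[0][0] if votes else default

    return predict


def cv_accuracy(rows, labels, subset, learner, n_folds):
    """Estimate accuracy of `learner` restricted to `subset` by n-fold cross-validation."""
    projected = [tuple(row[f] for f in subset) for row in rows]
    correct = 0
    for fold in range(n_folds):
        # Assumption: folds are assigned by index modulo n_folds (no shuffling).
        train = [i for i in range(len(rows)) if i % n_folds != fold]
        test = [i for i in range(len(rows)) if i % n_folds == fold]
        predict = learner([projected[i] for i in train], [labels[i] for i in train])
        correct += sum(predict(projected[i]) == labels[i] for i in test)
    return correct / len(rows)


def backward_eliminate(rows, labels, learner, n_folds=3):
    """Greedily drop features while the cross-validated accuracy does not degrade."""
    selected = list(range(len(rows[0])))
    best = cv_accuracy(rows, labels, selected, learner, n_folds)
    while selected:
        scored = [
            (cv_accuracy(rows, labels, [g for g in selected if g != f], learner, n_folds), f)
            for f in selected
        ]
        acc, drop = max(scored)  # best accuracy; ties go to the highest feature index
        if acc < best:
            break  # every possible removal hurts accuracy, so stop
        selected.remove(drop)
        best = acc
    return selected, best


def toy_and_dataset(copies=3):
    """Toy data (an assumption): 4 Boolean features, label = x0 AND x1; x2, x3 irrelevant."""
    rows = [combo for _ in range(copies) for combo in itertools.product([0, 1], repeat=4)]
    labels = [row[0] & row[1] for row in rows]
    return rows, labels
```

On this toy data, `backward_eliminate(*toy_and_dataset(), table_majority_learner)` returns `([0, 1], 1.0)`: the two irrelevant features are dropped and the two features that define the concept are kept. Allowing removals that leave accuracy unchanged is a design choice that favors smaller subsets; a stricter margin would retain more features.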

### Citations

5184 |
C4.5: Programs for Machine Learning
- Quinlan
- 1993
Citation Context ...he feature "correlated" matches the class label 75% of the time. The left subtree is the correct decision tree, which is correctly induced if the "correlated" feature is removed from the data. C4.5 (Quinlan 1992) and CART (Breiman et al. 1984) induce similar trees with the "correlated" feature at the root. Such a split causes all these induction algorithms to generate trees that are less accurate than if thi... |

4112 |
Classification and Regression Trees
- Breiman, Friedman, et al.
- 1984
Citation Context ... matches the class label 75% of the time. The left subtree is the correct decision tree, which is correctly induced if the "correlated" feature is removed from the data. C4.5 (Quinlan 1992) and CART (Breiman et al. 1984) induce similar trees with the "correlated" feature at the root. Such a split causes all these induction algorithms to generate trees that are less accurate than if this feature is completely removed... |

3483 | Induction of Decision Trees
- Quinlan
- 1986
Citation Context ...uced concepts which depend on irrelevant features, or in some cases even relevant features that hurt the overall accuracy. Figure 1 shows such a choice of a non-optimal split at the root made by ID3 (Quinlan 1986). The Boolean target concept is (A0 ∧ A1) ∨ (B0 ∧ B1). The feature named "irrelevant" is uniformly random, and the feature "correlated" matches the class label 75% of the time. The left subtree is the co... |

752 |
UCI repository of machine learning data bases
- Murphy, Aha
- 1994
Citation Context ...igh variance, we call this deterministic variant RelieveD. In our experiments, features with relevancy rankings below 0 were removed. The real-world datasets were taken from the UC-Irvine repository (Murphy & Aha 1994) and from Quinlan (1992). Figures 5 and 6 summarize our results. We give details for those datasets that had the largest differences either in accuracy or tree size. Artificial datasets CorrAL This ... |

681 | Learning Quickly When Irrelevant Attributes Abound. Machine Learning 2(4):285-318 - Littlestone - 1988 |

667 |
Pattern Recognition: A Statistical Approach
- Devijver, Kittler
- 1982
Citation Context ...l one. 5 RELATED WORK Researchers in statistics (Boyce, Farhi, & Weischedel 1974; Narendra & Fukunaga 1977; Draper & Smith 1981; Miller 1990; Neter, Wasserman, & Kutner 1990) and pattern recognition (Devijver & Kittler 1982; Ben-Bassat 1982) have investigated the feature subset selection problem for decades, but most work has concentrated on subset selection using linear regression. Sequential backward elimination, some... |

599 |
Applied Regression Analysis
- Draper, Smith
- 1981
Citation Context ... forward versions is that the backward version starts with all features and the forward version starts with no features. The algorithms are straightforward and are described in many statistics books (Draper & Smith 1981; Neter, Wasserman, & Kutner 1990) under the names backward stepwise elimination and forward stepwise selection. One only has to be careful to set the degradation and improvement margins so that cycle... |

458 | Very simple classification rules perform well on most commonly used datasets
- Holte
- 1993
Citation Context ... improvement of prediction accuracy over C4.5 is that C4.5 does quite well on most of the datasets tested here, leaving little room for improvement. This seems to be in line with Holte's claims (Holte 1993). Harder datasets might show more significant improvement. Indeed the wrapper model produced the most significant improvement for the two datasets (parity5+5 and CorrAL) on which C4.5 performed the w... |

378 |
A practical approach to feature selection
- Kira, Rendell
- 1992
Citation Context ...will select all strongly relevant features, none of the irrelevant ones, and a smallest subset of the weakly relevant features that are sufficient to determine the concept. Algorithms such as Relief (Kira & Rendell 1992a; 1992b; Kononenko 1994) (see Section 3.1) attempt to efficiently approximate the set of relevant features. 3 FEATURE SUBSET SELECTION There are a number of different approaches to subset selection. ... |

370 |
Computer Systems that Learn
- Weiss, Kulikowski
- 1991
Citation Context ...iven a subset of features, we want to estimate the accuracy of the induced structure using only the given features. We propose evaluating the subset using n-fold cross validation (Breiman et al. 1984; Weiss & Kulikowski 1991). The training data is split into n approximately equally sized partitions. The induction algorithm is then run n times, each time using n - 1 partitions as the training set and the other partit... |

319 | Applied linear statistical models - Neter, Wasserman, et al. - 1990 |

314 | Estimating attributes: analysis and extension of Relief
- Kononenko
- 1994
Citation Context ...vant features, none of the irrelevant ones, and a smallest subset of the weakly relevant features that are sufficient to determine the concept. Algorithms such as Relief (Kira & Rendell 1992a; 1992b; Kononenko 1994) (see Section 3.1) attempt to efficiently approximate the set of relevant features. 3 FEATURE SUBSET SELECTION There are a number of different approaches to subset selection. In this section, we clai... |

272 |
The feature selection problem: Traditional methods and a new algorithm
- Kira, Rendell
Citation Context ...will select all strongly relevant features, none of the irrelevant ones, and a smallest subset of the weakly relevant features that are sufficient to determine the concept. Algorithms such as Relief (Kira & Rendell 1992a; 1992b; Kononenko 1994) (see Section 3.1) attempt to efficiently approximate the set of relevant features. 3 FEATURE SUBSET SELECTION There are a number of different approaches to subset selection. ... |

258 |
Stochastic complexity and Modelling
- Rissanen
- 1986
Citation Context ...ng values to a set of features, and the task is to induce a hypothesis that accurately predicts the label of novel instances. Following Occam's razor (Blumer et al. 1987), minimum description length (Rissanen 1986), and minimum message length (Wallace & Freeman 1987), one usually attempts to find structures that correctly classify a large subset of the training set, and yet are not so complex that they begin t... |

248 |
Subset Selection in Regression
- Miller
- 1990
Citation Context ...dundant features. Thus the best feature subset is not always the minimal one. 5 RELATED WORK Researchers in statistics (Boyce, Farhi, & Weischedel 1974; Narendra & Fukunaga 1977; Draper & Smith 1981; Miller 1990; Neter, Wasserman, & Kutner 1990) and pattern recognition (Devijver & Kittler 1982; Ben-Bassat 1982) have investigated the feature subset selection problem for decades, but most work has concentrated... |

220 |
Some comments on Cp
- Mallows
- 1973
Citation Context ...any measures have been suggested to evaluate the subset selection (as opposed to cross validation), such as adjusted mean squared error, adjusted multiple correlation coefficient, and the Cp statistic (Mallows 1973). In Mucciardi & Gose (1971), seven different techniques for subset selection were empirically compared for a nine-class electrocardiographic problem. The search for the best subset can be improved by... |

218 | Learning with many irrelevant features
- Almuallim, Dietterich
Citation Context ...y definition. In Example 1, feature X1 is strongly relevant; features X2 and X4 are weakly relevant; and X3 and X5 are irrelevant. Figure 2 shows our view of relevance. Algorithms such as FOCUS (Almuallim & Dietterich 1991) (see Section 3.1) find a minimal set of features that are sufficient to determine the concept. Given enough data, these algorithms will select all strongly relevant features, none of the irrelevant ... |

206 | Training a 3-node neural network is NP-complete
- Blum, Rivest
- 1988
Citation Context ...fit the data. Ideally, the induction algorithm should use only the subset of features that leads to the best performance. Since induction of minimal structures is NP-hard in many cases (Hancock 1989; Blum & Rivest 1992), algorithms usually conduct a heuristic search in the [decision tree diagram] Figure 1: An example where ... |

203 | Models of Incremental concept formation - Gennari, Langley, et al. - 1989 |

203 |
Boolean feature discovery in empirical learning
- Pagallo, Haussler
- 1986
Citation Context ...ure 1) shows that common algorithms such as ID3, C4.5, and CART, fail to ignore features which, if ignored, would improve accuracy. Feature subset selection is also useful for constructive induction (Pagallo & Haussler 1990) where features can be constructed and tested using the wrapper model to determine if they improve performance. Finally, in real world applications, features may have an associated cost (i.e., when t... |

195 | Classification and regression trees - Breiman, Friedman, et al. - 1984 |

194 |
A branch and bound algorithm for feature subset selection
- Narendra, Fukunaga
- 1977
Citation Context ...induce a hypothesis which makes use of these redundant features. Thus the best feature subset is not always the minimal one. 5 RELATED WORK Researchers in statistics (Boyce, Farhi, & Weischedel 1974; Narendra & Fukunaga 1977; Draper & Smith 1981; Miller 1990; Neter, Wasserman, & Kutner 1990) and pattern recognition (Devijver & Kittler 1982; Ben-Bassat 1982) have investigated the feature subset selection problem for decad... |

193 | Greedy Attribute Selection - Caruana, Freitag - 1994 |

191 |
Estimation and inference by compact coding
- Wallace, Freeman
- 1987
Citation Context ... is to induce a hypothesis that accurately predicts the label of novel instances. Following Occam's razor (Blumer et al. 1987), minimum description length (Rissanen 1986), and minimum message length (Wallace & Freeman 1987), one usually attempts to find structures that correctly classify a large subset of the training set, and yet are not so complex that they begin to overfit the data. Ideally, the induction algorithm ... |

177 | The MONK's Problems - A Performance Comparison of Different Learning Algorithms - Thrun, Bala, et al. - 1991 |

147 | Prototype and Feature Selection by Sampling and Random Mutation Hill Climbing Algorithms - Skalak - 1994 |

131 | Efficient algorithms for minimizing cross validation error - Moore, Lee - 1994 |

130 |
Some comments on
- Mallows
- 1973
Citation Context ... measures have been suggested to evaluate the subset selection (as opposed to cross validation), such as adjusted mean squared error, adjusted multiple correlation coefficient, and the Cp statistic (Mallows 1973). In Mucciardi & Gose (1971), seven different techniques for subset selection were empirically compared for a nine-class electrocardiographic problem. The search for the best subset can be improved b... |

107 |
On automatic feature selection
- Siedlecki, Sklansky
- 1988
Citation Context ...lus l-take away r." Branch and bound algorithms were introduced by Narendra & Fukunaga (1977). Finally, more recent papers attempt to use AI techniques, such as beam search and bidirectional search (Siedlecki & Sklansky 1988), best first search (Xu, Yan, & Chang 1989), and genetic algorithms (Vafai & De Jong 1992). Many measures have been suggested to evaluate the subset selection (as opposed to cross validation), such a... |

95 | Using decision trees to improve case-based learning - Cardie - 1993 |

66 | Decision Trees and Diagrams - Moret - 1982 |

65 | Efficiently inducing determinations: A complete and systematic search algorithm that uses optimal pruning - Schlimmer - 1993 |

52 | Efficient pruning methods for separate-and-conquer rule learning systems - Cohen - 1993 |

49 | Feature selection using rough sets theory - Modrzejewski - 1993 |

46 | On the Effectiveness of Receptors in Recognition Systems - Marill, Green - 1963 |

44 | Genetic algorithms as a tool for feature selection in machine learning - Vafaie, Jong - 1992 |

38 | Oblivious Decision Trees and Abstract Cases - Langley, Sage - 1994 |

36 | A comparison of seven techniques for choosing subsets of pattern recognition - Mucciardi, Gose - 1971 |

24 |
Use of distance measures, information measures and error bounds on feature evaluation
- Ben-Bassat
- 1987
Citation Context ...earchers in statistics (Boyce, Farhi, & Weischedel 1974; Narendra & Fukunaga 1977; Draper & Smith 1981; Miller 1990; Neter, Wasserman, & Kutner 1990) and pattern recognition (Devijver & Kittler 1982; Ben-Bassat 1982) have investigated the feature subset selection problem for decades, but most work has concentrated on subset selection using linear regression. Sequential backward elimination, sometimes called sequ... |

23 |
Very simple classification rules perform well on most commonly used datasets
- Holte
- 1993
Citation Context ... improvement of prediction accuracy over C4.5 is that C4.5 does quite well on most of the datasets tested here, leaving little room for improvement. This seems to be in line with Holte's claims (Holte 1993). Harder datasets might show more significant improvement. Indeed the wrapper model produced the most significant improvement for the two datasets (parity5+5 and CorrAL) on which C4.5 performed the wors... |

19 | Irrelevance Reasoning in Knowledge Based Systems - Levy - 1993 |

11 | Optimal subset selection - Boyce, Farhi - 1974 |

9 | Models of incremental concept formation. Artificial Intelligence 40:11-61 - Gennari, Langley, et al. - 1989 |

4 |
On the difficulty of finding small consistent decision trees
- Hancock
- 1989
Citation Context ... begin to overfit the data. Ideally, the induction algorithm should use only the subset of features that leads to the best performance. Since induction of minimal structures is NP-hard in many cases (Hancock 1989; Blum & Rivest 1992), algorithms usually conduct a heuristic search in the [decision tree diagram] Figure ... |

4 |
On the difficulty of finding small consistent decision trees
- Hancock
- 1989
Citation Context ...y begin to overfit the data. Ideally, the induction algorithm should use only the subset of features that leads to the best performance. Since induction of minimal structures is NP-hard in many cases (Hancock 1989; Blum & Rivest 1992), algorithms usually conduct a heuristic search in the [author block: Ron Kohavi, Computer Science Dept., Stanford University, Stanford, CA 94305, ronnyk@CS.Stanford.EDU] ... |

3 | The use of knowledge in analogy and induction - Russell - 1989 |

2 | Preliminary steps toward the automation of induction - Russell - 1986 |