## Correlation-based feature selection for discrete and numeric class machine learning (2000)

### Download Links

- [www.cs.waikato.ac.nz]
- DBLP

### Other Repositories/Bibliography

Citations: 161 (1 self)

### BibTeX

@INPROCEEDINGS{Hall00correlation-basedfeature,
  author    = {Mark A. Hall},
  title     = {Correlation-based feature selection for discrete and numeric class machine learning},
  booktitle = {},
  year      = {2000},
  pages     = {359--366},
  publisher = {Morgan Kaufmann}
}

### Abstract

Algorithms for feature selection fall into two broad categories: wrappers use the learning algorithm itself to evaluate the usefulness of features, while filters evaluate features according to heuristics based on general characteristics of the data. For application to large databases, filters have proven to be more practical than wrappers because they are much faster. However, most existing filter algorithms only work with discrete classification problems. This paper describes a fast, correlation-based filter algorithm that can be applied to continuous and discrete problems. Experiments using the new method as a preprocessing step for naive Bayes, instance-based learning, decision trees, locally weighted regression, and model trees show it to be an effective feature selector: it reduces the dimensionality of the data by more than sixty percent in most cases without negatively affecting accuracy. Also, decision and model trees built from the pre-processed data are often significantly smaller.
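The correlation-based filter (CFS) described in the abstract scores a candidate subset with a merit heuristic drawn from test theory (Ghiselli, cited below): merit grows with the average feature-class correlation and shrinks with the average feature-feature correlation, penalising redundancy. A minimal sketch, with function and argument names of my own choosing:

```python
from math import sqrt

def cfs_merit(r_cf, r_ff, k):
    """Merit_S = (k * r_cf) / sqrt(k + k*(k-1)*r_ff), where r_cf is the
    mean feature-class correlation and r_ff the mean feature-feature
    inter-correlation of the k features in subset S."""
    return (k * r_cf) / sqrt(k + k * (k - 1) * r_ff)

# Five relevant but redundant features (r_ff = 0.5) score below five
# equally relevant, non-redundant ones: redundancy inflates the denominator.
print(cfs_merit(0.6, 0.5, 5))  # ~0.775
print(cfs_merit(0.6, 0.1, 5))  # ~1.134
```

A search procedure (the paper's Section 3.2 discusses searching the subset space) would evaluate this merit over candidate subsets and keep the best one found.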

### Citations

5378 | C4.5 Programs for machine learning - Quinlan - 1993 |

3059 | UCI Repository of machine learning databases - Blake, Merz - 1998 |

1124 | Wrappers for feature subset selection - Kohavi, John - 1997 |
Citation Context: ...ed representation of the target concept. Algorithms that perform feature selection as a preprocessing step prior to learning can generally be placed into one of two broad categories. Wrapper methods (Kohavi & John, 1997) employ—as a subroutine—a statistical re-sampling technique (such as cross validation) using the actual target learning algorithm to estimate the accuracy of feature subsets. This approach has proved... |

631 | Irrelevant features and the subset selection problem - John, Kohavi, et al. - 1994 |

494 | Locally Weighted Learning - Atkeson, Moore, et al. - 1995 |

386 | A practical approach to feature selection - Kira, Rendell - 1992 |
Citation Context: ...Koller & Sahami, 1996) eliminates features whose information content is subsumed by some number of the remaining features. Still other methods attempt to rank features according to a relevancy score (Kira & Rendell, 1992). Filters have proven to be much faster than wrappers and hence can be applied to large data sets containing many features. Because they are more general they can be used with any learner, unlike the... |

382 | Toward optimal feature selection - Koller, Sahami - 1996 |
Citation Context: ...Some look for consistency in the data—that is, they note when every combination of values for a feature subset is associated with a single class label (Almuallim & Dietterich, 1991). Another method (Koller & Sahami, 1996) eliminates features whose information content is subsumed by some number of the remaining features. Still other methods attempt to rank features according to a relevancy score (Kira & Rendell, 1992)... |

322 | Estimating attributes: Analysis and extension of RELIEF - Kononenko - 1994 |
Citation Context: ...the same value for instances from the same class. Relief was originally defined for two-class problems (Kira & Rendell, 1992) and was later extended (ReliefF) to handle noise and multiclass data sets (Kononenko, 1994). ReliefF smoothes the influence of noise in the data by averaging the contribution of k nearest neighbours from the same and opposite class of each sampled instance instead of the single nearest nei... |

305 | Inferring Decision Trees Using the Minimum Description Length Principle - Quinlan, Rivest - 1989 |

248 | Smoothing Methods in Statistics - SIMONOFF - 1996 |

221 | Learning with many irrelevant features - Almuallim, Dietterich - 1991 |
Citation Context: ...ining data when selecting a subset of features. Some look for consistency in the data—that is, they note when every combination of values for a feature subset is associated with a single class label (Almuallim & Dietterich, 1991). Another method (Koller & Sahami, 1996) eliminates features whose information content is subsumed by some number of the remaining features. Still other methods attempt to rank features according to... |

168 | Correlation-based Feature Selection for Machine Learning - Hall - 1999 |
Citation Context: ...independently, CFS cannot identify strongly interacting attributes such as in a parity problem. However, it has been shown that it can identify useful attributes under moderate levels of interaction (Hall, 1998). 3.2 Searching the Feature Subset Space: The purpose of feature selection is to decide which of the initial (possibly large) number of features to include in the final subset and which to ignore. If... |

112 | Wrappers for performance enhancement and oblivious decision graphs - Kohavi - 1995 |

92 | Induction of model trees for predicting continuous classes - Wang, Witten - 1997 |

81 | On biases in estimating multi-valued attributes - Kononenko - 1995 |

42 | An empirical investigation of brute force to choose features, smoothers, and function approximators - Moore, Hill, et al. - 1992 |

21 | Technical note: Some properties of splitting criteria - Breiman - 1996 |

21 | Numerical Recipes - Press, Flannery, et al. - 1989 |
Citation Context: ...ic features using the technique of Fayyad and Irani (1993) and then uses symmetrical uncertainty (a modified information gain measure) to estimate the degree of association between discrete features (Press et al., 1988): SU = 2.0 × [H(X) + H(Y) − H(X, Y)] / [H(X) + H(Y)] (2). Symmetrical uncertainty is used (rather than gain ratio) because it is a symmetric measure and can therefore be used to measure feature-featu... |
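The symmetrical uncertainty measure in equation (2), SU = 2 × [H(X) + H(Y) − H(X, Y)] / [H(X) + H(Y)], follows directly from entropy estimates over discrete features. A minimal sketch, assuming discrete-valued inputs (function names are my own):

```python
import numpy as np
from collections import Counter

def entropy(values):
    """Shannon entropy (in bits) of a discrete sequence."""
    counts = np.array(list(Counter(values).values()), dtype=float)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def symmetrical_uncertainty(x, y):
    """SU = 2 * [H(X) + H(Y) - H(X,Y)] / [H(X) + H(Y)], ranging over [0, 1]."""
    hx, hy = entropy(x), entropy(y)
    hxy = entropy(list(zip(x, y)))  # joint entropy H(X, Y)
    if hx + hy == 0:
        return 0.0  # both variables constant: no association to measure
    return 2.0 * (hx + hy - hxy) / (hx + hy)

# Identical features are fully dependent, so SU = 1
x = [0, 0, 1, 1, 2, 2]
print(symmetrical_uncertainty(x, x))  # 1.0
```

Because SU is symmetric in X and Y (unlike gain ratio), the same function serves for both feature-class and feature-feature associations, as the context above notes.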

17 | Multi-interval discretisation of continuous-valued attributes for classification learning - Fayyad, Irani - 1993 |

17 | Theory of psychological measurement - Ghiselli - 1964 |

17 | An adaptation of Relief for attribute estimation in regression - Robnik-Šikonja, Kononenko - 1997 |
Citation Context: ...wrapper feature selection can be applied to regression problems with relative ease, few filter algorithms handle continuous class data. The only exception is RReliefF (Regressional Relief) (Robnik-Šikonja & Kononenko, 1997), which is an extension of Kira and Rendell's (1992) Relief algorithm for classification problems. The Relief algorithms are quite different to the algorithm described in this paper in that they scor... |

15 | Feature selection via the discovery of simple classification rules - Holmes, Nevill-Manning - 1995 |

8 | Numeric prediction using instance-based learning with encoding length selection - Kilpatrick, Cameron-Jones - 1998 |

3 | Naive Bayes for regression. Working Paper 98/15 - Frank, Trigg, et al. - 1998 |

2 | Efficient algorithms for minimising cross validation error - Lee, S - 1994 |

1 | Efficient algorithms for identifying relevant features - Almuallim, Dietterich - 1992 |

1 | Naive Bayes for regression. Machine Learning (in press) - Frank, Trigg, et al. |