Irrelevant Features and the Subset Selection Problem
 MACHINE LEARNING: PROCEEDINGS OF THE ELEVENTH INTERNATIONAL
, 1994
Cited by 721 (26 self)
We address the problem of finding a subset of features that allows a supervised induction algorithm to induce small high-accuracy concepts. We examine notions of relevance and irrelevance, and show that the definitions used in the machine learning literature do not adequately partition the features into useful categories of relevance. We present definitions for irrelevance and for two degrees of relevance. These definitions improve our understanding of the behavior of previous subset selection algorithms, and help define the subset of features that should be sought. The features selected should depend not only on the features and the target concept, but also on the induction algorithm. We describe a method for feature subset selection using cross-validation that is applicable to any induction algorithm, and discuss experiments conducted with ID3 and C4.5 on artificial and real datasets.
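The idea of scoring feature subsets by cross-validating the induction algorithm itself can be sketched roughly as follows. This is a minimal illustration, not the paper's procedure: the greedy forward search, the stand-in learner `table_learner`, and the names `cv_accuracy` and `forward_select` are all assumptions made for the example.

```python
from collections import Counter, defaultdict
from itertools import product

def table_learner(train, features):
    """Hypothetical stand-in inducer: majority label per projected feature tuple."""
    table = defaultdict(Counter)
    default = Counter(label for _, label in train).most_common(1)[0][0]
    for x, label in train:
        table[tuple(x[f] for f in features)][label] += 1

    def predict(x):
        key = tuple(x[f] for f in features)
        if key in table:
            return table[key].most_common(1)[0][0]
        return default  # unseen combination: fall back to the majority label
    return predict

def cv_accuracy(data, features, learner, k=5):
    """k-fold cross-validated accuracy of `learner` restricted to `features`."""
    folds = [data[i::k] for i in range(k)]
    correct = 0
    for i in range(k):
        train = [row for j, fold in enumerate(folds) if j != i for row in fold]
        predict = learner(train, features)
        correct += sum(predict(x) == y for x, y in folds[i])
    return correct / len(data)

def forward_select(data, n_features, learner, k=5):
    """Greedy forward search: add the feature that most improves CV accuracy."""
    selected = []
    best = cv_accuracy(data, selected, learner, k)
    while True:
        scored = [(cv_accuracy(data, selected + [f], learner, k), f)
                  for f in range(n_features) if f not in selected]
        if not scored:
            break
        score, f = max(scored)
        if score <= best:  # stop when no candidate strictly improves
            break
        selected.append(f)
        best = score
    return selected, best

# Toy data: the label copies feature 0; features 1 and 2 are irrelevant.
toy = [(x, x[0]) for x in product([0, 1], repeat=3)] * 5
selected, score = forward_select(toy, n_features=3, learner=table_learner)
```

Because the learner is passed in as a parameter, the same search can wrap any induction algorithm, which is the point the abstract makes about the selected subset depending on the inducer.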
Automatic Construction of Decision Trees from Data: A Multi-Disciplinary Survey
 Data Mining and Knowledge Discovery
, 1997
Cited by 209 (1 self)
Decision trees have proved to be valuable tools for the description, classification and generalization of data. Work on constructing decision trees from data exists in multiple disciplines such as statistics, pattern recognition, decision theory, signal processing, machine learning and artificial neural networks. Researchers in these disciplines, sometimes working on quite different problems, identified similar issues and heuristics for decision tree construction. This paper surveys existing work on decision tree construction, attempting to identify the important issues involved, directions the work has taken and the current state of the art.
Keywords: classification, tree-structured classifiers, data compaction
1. Introduction
Advances in data collection methods, storage and processing technology are providing a unique challenge and opportunity for automated data exploration techniques. Enormous amounts of data are being collected daily from major scientific projects e.g., Human Genome...
The Power of Decision Tables
 Proceedings of the European Conference on Machine Learning
, 1995
Cited by 145 (5 self)
We evaluate the power of decision tables as a hypothesis space for supervised learning algorithms. Decision tables are one of the simplest hypothesis spaces possible, and usually they are easy to understand. Experimental results show that on artificial and real-world domains containing only discrete features, IDTM, an algorithm inducing decision tables, can sometimes outperform state-of-the-art algorithms such as C4.5. Surprisingly, performance is quite good on some datasets with continuous features, indicating that many datasets used in machine learning either do not require these features, or that these features have few values. We also describe an incremental method for performing cross-validation that is applicable to incremental learning algorithms including IDTM. Using incremental cross-validation, it is possible to cross-validate a given dataset and IDTM in time that is linear in the number of instances, the number of features, and the number of label values. The time for incre...
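One way to picture incremental cross-validation for a table-based learner is a leave-one-out pass that deletes an instance from the table, classifies it, and re-inserts it, instead of rebuilding the table each time. The sketch below is an assumption-laden illustration of that idea, not IDTM itself: the function name `loo_accuracy_decision_table`, the majority-label fallback for empty cells, and the tie-breaking are all choices made for the example.

```python
from collections import Counter, defaultdict

def loo_accuracy_decision_table(data, features):
    """Leave-one-out accuracy of a majority-vote decision table.

    The table is built once; each test instance is removed from its cell,
    classified, and put back, so per-instance work is proportional to the
    number of features plus the number of label values.
    """
    table = defaultdict(Counter)  # projected feature tuple -> label counts
    totals = Counter()            # global label counts, used as a fallback
    for x, y in data:
        table[tuple(x[f] for f in features)][y] += 1
        totals[y] += 1

    correct = 0
    for x, y in data:
        key = tuple(x[f] for f in features)
        table[key][y] -= 1        # delete the held-out instance
        totals[y] -= 1
        cell = table[key]
        source = cell if sum(cell.values()) > 0 else totals
        prediction = max(source, key=source.get)
        correct += prediction == y
        table[key][y] += 1        # re-insert it
        totals[y] += 1
    return correct / len(data)

# Toy data: one binary feature that determines the label.
toy = [((0,), 0)] * 3 + [((1,), 1)] * 3
with_feature = loo_accuracy_decision_table(toy, [0])
without_feature = loo_accuracy_decision_table(toy, [])
```

On the toy data the table that uses the feature classifies every held-out instance correctly, while the empty feature set always predicts the other (now majority) label.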
Feature Subset Selection by Bayesian networks: a comparison with genetic and sequential algorithms
Cited by 53 (15 self)
In this paper we compare FSS-EBNA, a randomized, population-based and evolutionary algorithm, with two genetic and two other sequential search approaches on the well-known Feature Subset Selection (FSS) problem. In FSS-EBNA, the FSS problem, stated as a search problem, uses the EBNA (Estimation of Bayesian Network Algorithm) search engine, an algorithm within the EDA (Estimation of Distribution Algorithm) approach. The EDA paradigm was born from the roots of the GA community in order to explicitly discover the relationships among the features of the problem and not disrupt them by genetic recombination operators. The EDA paradigm avoids the use of recombination operators and it guarantees the evolution of the population of solutions and the discovery of these relationships by the factorization of the probability distribution of best individuals in each generation of the search. In EBNA, this factorization is carried out by a Bayesian network induced by a chea...
Useful Feature Subsets and Rough Set Reducts
, 1994
Cited by 31 (4 self)
In supervised classification learning, one attempts to induce a classifier that correctly predicts the label of novel instances. We demonstrate that by choosing a useful subset of features for the indiscernibility relation, an induction algorithm based on a simple decision table can have high prediction accuracy on artificial and real-world datasets. We show that useful feature subsets are not necessarily maximal independent sets (relative reducts) with respect to the label, and that, in practical situations, using a subset of the relative core features may lead to superior performance.
1 Introduction
In supervised classification learning, one is given a training set containing labelled instances (examples). Each labelled instance contains a list of feature values (attribute values) and a discrete label value. The induction task is to build a classifier that will correctly predict the label of novel instances. Common classifiers are decision trees, neural networks, and nearest-neighbor...
A Biometrics Invited Paper. The analysis and selection of variables in linear regression
 Biometrics
, 1976
Cited by 11 (0 self)
Materialization Optimizations for Feature Selection Workloads
Cited by 6 (3 self)
There is an arms race in the data management industry to support analytics, in which one critical step is feature selection, the process of selecting a feature set that will be used to build a statistical model. Analytics is one of the biggest topics in data management, and feature selection is widely regarded as the most critical step of analytics; thus, we argue that managing the feature selection process is a pressing data management challenge. We study this challenge by describing a feature-selection language and a supporting prototype system that builds on top of current industrial, R-integration layers. From our interactions with analysts, we learned that feature selection is an interactive, human-in-the-loop process, which means that feature selection workloads are rife with reuse opportunities. Thus, we study how to materialize portions of this computation using not only classical database materialization optimizations but also methods that have not previously been used in database optimization, including structural decomposition methods (like QR factorization) and warm-start. These new methods have no analog in traditional SQL systems, but they may be interesting for array and scientific database applications. On a diverse set of data sets and programs, we find that traditional database-style approaches that ignore these new opportunities are more than two orders of magnitude slower than an optimal plan in this new tradeoff space across multiple R-backends. Furthermore, we show that it is possible to build a simple cost-based optimizer to automatically select a near-optimal execution plan for feature selection.
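The reuse opportunity behind warm-starting can be illustrated with a toy experiment: when an analyst adds one feature to a model that has already been fitted, starting the solver from the previous weights can take fewer iterations than starting from zero. The sketch below is only an illustration under assumed details (plain gradient descent on mean squared error, a hypothetical helper `fit_least_squares`, and a tiny hand-made dataset); it does not reproduce the system described in the abstract.

```python
def fit_least_squares(rows, targets, features, init=None,
                      lr=0.5, tol=1e-6, max_steps=10_000):
    """Gradient descent on mean squared error over the chosen feature columns.

    `init` optionally maps feature index -> starting weight (the warm start).
    Returns (weights as a dict, number of update steps performed).
    """
    init = init or {}
    w = [init.get(f, 0.0) for f in features]
    n = len(rows)
    for step in range(max_steps):
        residuals = [sum(wi * row[f] for wi, f in zip(w, features)) - t
                     for row, t in zip(rows, targets)]
        grad = [2.0 / n * sum(r * row[f] for r, row in zip(residuals, rows))
                for f in features]
        if sum(g * g for g in grad) ** 0.5 < tol:  # converged
            return dict(zip(features, w)), step
        w = [wi - lr * g for wi, g in zip(w, grad)]
    return dict(zip(features, w)), max_steps

# Toy data with two orthogonal feature columns; the true weights are 5.0 and 0.1.
rows = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (1.0, -1.0)]
targets = [5.0 * a + 0.1 * b for a, b in rows]

w_small, _ = fit_least_squares(rows, targets, [0])              # earlier model
w_cold, cold_steps = fit_least_squares(rows, targets, [0, 1])   # from scratch
w_warm, warm_steps = fit_least_squares(rows, targets, [0, 1], init=w_small)
```

Both runs reach the same weights; the warm-started run begins with the dominant weight already in place and so needs fewer updates on this data. How large the saving is in practice depends on the solver and on how much consecutive feature sets overlap.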
Bayesian Networks for Feature Subset Selection
, 2000
We present FSS-EBNA, a new randomized, population-based and evolutionary algorithm which deals with the well-known FSS problem in Data Mining applications. In FSS-EBNA, the FSS problem, stated as a search problem, uses the EBNA (Estimation of Bayesian Network Algorithm) search engine, an algorithm within the EDA (Estimation of Distribution Algorithm) approach. The EDA paradigm was born from the roots of the GA community in order to explicitly discover the relationships among the features of the problem and not disrupt them by genetic recombination operators. The EDA paradigm avoids the use of recombination operators and it guarantees the evolution of the population of solutions and the discovery of these relationships by the factorization of the probability distribution of best individuals in each generation of the search. In EBNA, this factorization is carried out by a Bayesian network induced by a cheap local search mechanism. Promising results on a set of real domains are achieved...
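The general EDA loop (sample a population from a probability model, keep the best individuals, re-estimate the model from them, with no crossover or mutation) can be sketched with a much simpler model than EBNA's. The example below swaps in independent per-feature probabilities, in the style of a univariate marginal distribution algorithm, in place of a learned Bayesian network, so it is not FSS-EBNA. The function name `umda_feature_search`, the population sizes, the probability clamping, and the toy scoring function are all assumptions for illustration.

```python
import random

def umda_feature_search(n_features, score, pop_size=30, n_select=10,
                        generations=20, seed=0):
    """Univariate EDA over feature masks (tuples of 0/1 inclusion flags).

    `score` maps a mask to a number to maximize, for example a
    cross-validated accuracy of some learner on the selected features.
    """
    rng = random.Random(seed)
    probs = [0.5] * n_features  # independent inclusion probabilities
    best_mask, best_score = None, float("-inf")
    for _ in range(generations):
        # Sample a new population from the current probability model.
        population = [tuple(int(rng.random() < p) for p in probs)
                      for _ in range(pop_size)]
        elite = sorted(population, key=score, reverse=True)[:n_select]
        top_score = score(elite[0])
        if top_score > best_score:
            best_mask, best_score = elite[0], top_score
        # Re-estimate the model from the selected individuals only;
        # clamping keeps every feature reachable in later generations.
        probs = [min(0.95, max(0.05, sum(m[i] for m in elite) / n_select))
                 for i in range(n_features)]
    return best_mask, best_score

def toy_score(mask):
    """Hypothetical objective: features 0 and 2 help, every other feature hurts."""
    return 2 * mask[0] + 2 * mask[2] - sum(mask)

mask, value = umda_feature_search(6, toy_score)
```

A univariate model cannot represent dependencies between features, which is exactly what the Bayesian-network factorization described above is meant to capture; the sketch only shows the sample, select, and re-estimate cycle.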