## Learning Boolean Concepts in the Presence of Many Irrelevant Features (1994)

### Download Links

- ftp.cs.orst.edu
- web.engr.oregonstate.edu
- DBLP

### Other Repositories/Bibliography

Venue: Artificial Intelligence

Citations: 94 (0 self)

### BibTeX

```bibtex
@ARTICLE{Almuallim94learningboolean,
  author  = {Hussein Almuallim and Thomas G. Dietterich},
  title   = {Learning Boolean Concepts in the Presence of Many Irrelevant Features},
  journal = {Artificial Intelligence},
  year    = {1994},
  volume  = {69},
  pages   = {279--305}
}
```

### Abstract

In many domains, an appropriate inductive bias is the MIN-FEATURES bias, which prefers consistent hypotheses definable over as few features as possible. This paper defines and studies this bias in Boolean domains. First, it is shown that any learning algorithm implementing the MIN-FEATURES bias requires $\Theta\!\left(\frac{1}{\epsilon} \ln \frac{1}{\delta} + \frac{1}{\epsilon}\left[2^p + p \ln n\right]\right)$ training examples to guarantee PAC-learning a concept having p relevant features out of n available features. This bound is only logarithmic in the number of irrelevant features. For implementing the MIN-FEATURES bias, the paper presents five algorithms that identify a subset of features sufficient to construct a hypothesis consistent with the training examples. FOCUS-1 is a straightforward algorithm that returns a minimal and sufficient subset of features in quasi-polynomial time. FOCUS-2 performs the same task as FOCUS-1 but is empirically shown to be substantially faster. Finally, the Simple-Greedy, Mutual-Information-G...
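The core idea behind FOCUS-1 as described above can be sketched in a few lines: enumerate feature subsets in order of increasing size and return the first subset on which no two training examples agree yet carry different labels. The sketch below is an illustrative reconstruction from the abstract, not the authors' implementation; the data layout and function names are assumptions.

```python
from itertools import combinations

def sufficient(subset, examples):
    """A feature subset is sufficient if no two examples agree on
    every feature in the subset but disagree on the class label."""
    seen = {}
    for features, label in examples:
        key = tuple(features[i] for i in subset)
        if key in seen and seen[key] != label:
            return False
        seen[key] = label
    return True

def focus(examples, n_features):
    """FOCUS-1-style exhaustive search: try subsets in order of
    increasing cardinality and return the first sufficient one,
    which is therefore a smallest sufficient subset."""
    for size in range(n_features + 1):
        for subset in combinations(range(n_features), size):
            if sufficient(subset, examples):
                return subset
    return None  # unreachable when the training data are consistent

# Toy data: label = x0 XOR x2; x1 is irrelevant.
data = [((a, b, c), a ^ c) for a in (0, 1) for b in (0, 1) for c in (0, 1)]
print(focus(data, 3))  # -> (0, 2)
```

The exhaustive enumeration is what makes the algorithm exact but expensive: with p relevant features out of n, it examines on the order of $n^p$ subsets before succeeding.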

### Citations

10922 | Computers and Intractability: A Guide to the Theory of NP-Completeness
- Garey, Johnson
- 1979

Citation Context: ...form a consistent hypothesis (e.g., $\bar{x}_1 \bar{x}_3 \vee (\bar{x}_3 \oplus x_4)$), and that all subsets of features of cardinality less than 3 are insufficient. The minimum set cover problem is known to be NP-hard [10]. At first glance, this may appear to mean that it is not possible to implement the MIN-FEATURES bias in polynomial time (unless P=NP). However, one should be careful before drawing such a conclusion...

3354 | Induction of Decision Trees
- Quinlan
- 1986

Citation Context: ...two steps: First, we identify the smallest subset of features that is sufficient to construct a hypothesis consistent with the given training examples. Then, we apply some learning procedure (e.g., ID3 [23]) that focuses on just those chosen features. For identification of the smallest sufficient subset of features we describe two algorithms. The first algorithm, FOCUS-1, is a straightforward algorithm...

672 | Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm
- Littlestone
- 1988

Citation Context: ...different diseases from the medical records of a large number of patients. These records usually contain more information than is actually required for describing each disease. Another example (given in [16]) involves pattern recognition tasks in which feature detectors automatically extract a large number of features for the learner's consideration, not knowing which might prove useful. The task of sele...

552 | Generalization as Search
- Mitchell
- 1982

Citation Context: ...The MIN-FEATURES bias can be stated simply. Given a training sample S of some unknown target concept c, let V be the set of all possible hypotheses consistent with S. (V is sometimes called the version space [17].) Let H be the subset of V whose elements have the fewest relevant features. The MIN-FEATURES bias chooses its guess, h, from H arbitrarily. Note that the MIN-FEATURES bias is "incomplete" in the sens...

191 | A general lower bound on the number of examples needed for learning
- Ehrenfeucht, Haussler, et al.
- 1989

Citation Context: ...that is shattered by C. Blumer et al. show that the number of examples needed for learning any class of concepts strongly depends on the VC-dimension of the class [6]. Specifically, Ehrenfeucht et al. [9] prove the following: Theorem 2. Let C be a class of concepts and $0 < \epsilon, \delta < 1$. Then, any algorithm that learns every concept in C with respect to $\epsilon, \delta$ and any probability distribution must use...

183 | A Branch and Bound Algorithm for Feature Subset Selection
- Narendra, Fukunaga
- 1977

167 | Learning in the presence of malicious errors
- Kearns, Li
- 1993

Citation Context: ...s noise-free training data. In practice, however, training examples are often subject to various kinds of noise affecting the values of the features and/or the classification of the training examples [13, 15]. A direct way to deal with noise is to modify the given algorithms by relaxing the requirement of covering all the conflicts generated from the training data. That is, we search for a small set of fe...

151 | A greedy heuristic for the set covering problem
- Chvátal
- 1979

Citation Context: ...groups. We explain how this can be done in the appendix. 6.2 The Simple-Greedy (SG) Algorithm. The Simple-Greedy algorithm is based on the well-known greedy heuristic for the minimum set cover problem [8]. Starting with the set of all conflicts, the algorithm repeatedly chooses the feature that covers the largest number of conflicts that are not yet covered. The conflicts that are covered by this featu...
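The greedy procedure described in this excerpt maps directly onto the classic set-cover heuristic. The sketch below is a reconstruction under stated assumptions, not the paper's code: a conflict is a pair of oppositely labelled examples, a feature covers a conflict when the two examples disagree on that feature, and the loop repeatedly takes the feature covering the most uncovered conflicts.

```python
from itertools import combinations

def simple_greedy(examples, n_features):
    """Greedy set-cover heuristic for feature selection (sketch).

    A conflict is a pair of examples with different labels; feature f
    covers a conflict if the two examples disagree on f. Assumes the
    training data are consistent, so every conflict is coverable.
    """
    conflicts = {(i, j)
                 for (i, (xi, yi)), (j, (xj, yj))
                 in combinations(enumerate(examples), 2)
                 if yi != yj}
    chosen = []
    while conflicts:
        # Pick the feature covering the most not-yet-covered conflicts.
        best = max(range(n_features),
                   key=lambda f: sum(examples[i][0][f] != examples[j][0][f]
                                     for i, j in conflicts))
        chosen.append(best)
        # Keep only the conflicts this feature does not cover.
        conflicts = {(i, j) for i, j in conflicts
                     if examples[i][0][best] == examples[j][0][best]}
    return chosen

# Toy data: label = x0 XOR x2; x1 is irrelevant.
data = [((a, b, c), a ^ c) for a in (0, 1) for b in (0, 1) for c in (0, 1)]
print(simple_greedy(data, 3))
```

Unlike FOCUS-1's exhaustive search, this runs in polynomial time but, as with set cover generally, only guarantees a logarithmic-factor approximation of the smallest sufficient subset.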

41 | Boolean feature discovery
- Pagallo, Haussler
- 1990

Citation Context: ...example, ID3 [23] has a bias in favor of small decision trees, and small trees would seem to test only a subset of the input features. In the experimental part of this work, we compare ID3 and FRINGE [21] to our algorithms. The experiments demonstrate that ID3 and FRINGE do not provide good approximations to the MIN-FEATURES bias: these algorithms often produce hypotheses as output that are much more...

34 | A Comparison of Seven Techniques for Choosing Subsets of Pattern Recognition Properties
- Mucciardi, Gose
- 1971

Citation Context: ...features are highly correlated) is based on performing a principal components analysis to find a reduced set of new uncorrelated features defined by combining the original features using the eigenvectors [18, 19]. To our knowledge, the problem of finding the smallest subset of Boolean features that is sufficient to construct a consistent hypothesis, which is the topic of this paper, has not been addressed...

32 | Efficient algorithms for identifying relevant features
- Almuallim, Dietterich
- 1991

17 | Myths and Legends in Learning Classification Rules
- Buntine
- 1990

Citation Context: ...Often, it is difficult even to state the bias in any simple way. Consequently, it is difficult to tell in advance whether the bias is appropriate for a new learning problem. Recently, a few authors [5, 24] have advocated a different procedure: (i) adopt a bias over some space of hypotheses (or, equivalently, select a prior probability distribution over the space), (ii) select a scheme for representing...

11 | Optimum feature selection by zero-one integer programming
- Ichino, Sklansky
- 1984

Citation Context: ...[14] shows methods for selecting a small subset of features that optimizes the expected error of the nearest neighbor classifier. Similar work has addressed feature selection for the Box classifier [11], the linear classifier [12] and the Bayes classifier [22]. Other work (aimed at removing feature redundancy when features are highly correlated) is based on performing a principal components analysis...

7 | Feature Selection for Linear Classifier
- Ichino, Sklansky
- 1984

Citation Context: ...selecting a small subset of features that optimizes the expected error of the nearest neighbor classifier. Similar work has addressed feature selection for the Box classifier [11], the linear classifier [12] and the Bayes classifier [22]. Other work (aimed at removing feature redundancy when features are highly correlated) is based on performing a principal components analysis to find a reduced set of ne...

5 | Concept Coverage and its application to two learning tasks
- Almuallim
- 1992

5 | A mathematical theory of generalization: part
- Wolpert
- 1990

Citation Context: ...Often, it is difficult even to state the bias in any simple way. Consequently, it is difficult to tell in advance whether the bias is appropriate for a new learning problem. Recently, a few authors [5, 24] have advocated a different procedure: (i) adopt a bias over some space of hypotheses (or, equivalently, select a prior probability distribution over the space), (ii) select a scheme for representing...

4 | Exploiting Symmetry Properties in the Evaluation of Inductive Learning Algorithms: An Empirical Domain-Independent Comparative Study
- Almuallim
- 1991

4 | On feature selection
- Queiros, Gelsema
- 1984

Citation Context: ...features that optimizes the expected error of the nearest neighbor classifier. Similar work has addressed feature selection for the Box classifier [11], the linear classifier [12] and the Bayes classifier [22]. Other work (aimed at removing feature redundancy when features are highly correlated) is based on performing a principal components analysis to find a reduced set of new uncorrelated features define...

2 | Computational problems of feature selection pertaining to large data sets
- Kittler
- 1980

Citation Context: ..."feature selection" or "dimensionality reduction." However, most feature selection criteria in pattern recognition are defined with respect to a specific classifier or group of classifiers. For example, [14] shows methods for selecting a small subset of features that optimizes the expected error of the nearest neighbor classifier. Similar work has addressed feature selection for the Box classifier [11]...

1 | Learning from Good and Bad Data (Kluwer)
- Laird
- 1988

Citation Context: ...s noise-free training data. In practice, however, training examples are often subject to various kinds of noise affecting the values of the features and/or the classification of the training examples [13, 15]. A direct way to deal with noise is to modify the given algorithms by relaxing the requirement of covering all the conflicts generated from the training data. That is, we search for a small set of fe...

1 | Computational Complexity and VLSI Implementation of an Optimal Feature Selection Strategy
- Morgera
- 1986

Citation Context: ...features are highly correlated) is based on performing a principal components analysis to find a reduced set of new uncorrelated features defined by combining the original features using the eigenvectors [18, 19]. To our knowledge, the problem of finding the smallest subset of Boolean features that is sufficient to construct a consistent hypothesis, which is the topic of this paper, has not been addressed...