Pruning irrelevant features from oblivious decision trees (1994)
Venue: Proceedings of the AAAI Fall Symposium on Relevance, pp. 145-148
Citations: 2 (0 self)
BibTeX
@INPROCEEDINGS{Langley94pruningirrelevant,
author = {Pat Langley},
title = {Pruning irrelevant features from oblivious decision trees},
booktitle = {Proceedings of the AAAI Fall Symposium on Relevance},
pages = {145--148},
year = {1994},
publisher = {AAAI Press}
}
Abstract
In this paper, we examine an approach to feature selection designed to handle domains that involve both irrelevant and interacting features. We review the reasons this situation poses challenges to both nearest neighbor and decision-tree methods, then describe a new algorithm, OBLIVION, that carries out greedy pruning of oblivious decision trees. We summarize the results of experiments with artificial domains, which show that OBLIVION's sample complexity grows slowly with the number of irrelevant features, and with natural domains, which suggest that few existing data sets contain many irrelevant features. In closing, we consider other work on feature selection and outline directions for future research.

Nature of the Problem

One of the central problems in machine induction involves discriminating between features that are relevant to the target concept and ones that are irrelevant. Presumably, many real-world learning tasks contain large numbers of irrelevant terms, and for such tasks one would prefer to use algorithms that scale well along this dimension. More specifically, one would like the number of training instances needed to reach a given level of accuracy (the sample complexity) to grow slowly with increasing numbers of irrelevant features.

We define relevance in the context of such an induction task. Given a set of classified training instances for some target concept, the goal is to improve classification accuracy on a set of novel test instances. One way to improve accuracy involves identifying the features relevant to the target concept. Following John,

Some previous experimental studies have examined the effect of irrelevant features on learning. For example,

In response to this problem, Almuallim and Dietterich (1990) developed Focus, an algorithm that directly searches for minimal combinations of attributes that perfectly discriminate among the classes. This method begins by looking at each feature in isolation, then turns to pairs of features, triples, and so forth, halting as soon as it finds a combination that generates pure partitions of the training set (i.e., partitions in which no cell contains instances with different classes). Their scheme then passes the reduced set of features to ID3, which constructs a decision tree from the simplified training data. Comparative studies with ID3 and with Pagallo and Haussler's (1990) FRINGE showed that, for a given number of training cases on randomly selected Boolean target concepts, Focus was almost unaffected by the introduction of irrelevant attributes, whereas the accuracy of ID3 and FRINGE degraded significantly. Schlimmer (1993) has described a similar method that also starts with individual attributes and searches the space of attribute combinations, continuing until it finds a partition of the training set with pure classes.

Both of these algorithms address the problem of attribute interaction in the presence of irrelevant features by directly examining combinations of features. At least for noise-free data, this approach has the advantage of guaranteeing identification of minimal relevant feature sets, in contrast to the greedy approach used by C4.5 and its relatives. However, the price is greatly increased computational cost. Almuallim and Dietterich showed that Focus's time complexity is quasi-polynomial in the number of attributes, which they acknowledged is impractical for target concepts that involve many features.
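To make this style of search concrete, the following minimal Python sketch shows the kind of breadth-first subset search that Focus-like methods perform; it is not the original implementation. It assumes noise-free data, with instances encoded as dictionaries from attribute names to discrete values, and the function and variable names are ours.

from itertools import combinations

def minimal_consistent_subset(instances, labels, attributes):
    # Examine single attributes, then pairs, triples, and so on,
    # stopping at the first subset whose projection yields pure
    # partitions of the training set (no cell mixes class labels).
    for size in range(1, len(attributes) + 1):
        for subset in combinations(attributes, size):
            cells = {}
            for inst, label in zip(instances, labels):
                key = tuple(inst[a] for a in subset)
                cells.setdefault(key, set()).add(label)
            if all(len(classes) == 1 for classes in cells.values()):
                return list(subset)
    return list(attributes)  # no smaller consistent subset exists

The reduced attribute set would then be handed to an induction algorithm such as ID3, as the text describes; the loop over subset sizes is what makes the cost grow so quickly with the number of relevant features.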
Schlimmer introduced techniques for pruning the search tree without losing completeness, but even with these savings, he had to limit the length of the feature combinations considered (and thus the complexity of learnable target concepts) to keep the search within bounds. Thus, there remains a need for more practical algorithms that can handle domains with both complex feature interactions and irrelevant attributes.

Pruning of Oblivious Decision Trees

Our research goal was to develop an algorithm that handled both irrelevant features and attribute interactions without resorting to expensive, enumerative search. Our response draws on the realization that both Almuallim and Dietterich's and Schlimmer's approaches construct oblivious decision trees, in which all nodes at the same level test the same attribute. For example, a three-level oblivious tree might test attribute X at the top node, attribute Y in all nodes at the second level, and attribute Z in all nodes at the lowest level. This framework does not limit one's representational coverage; for every possible decision tree there exists an equivalent oblivious tree, though the former may have fewer nodes than the latter.

Although the above algorithms use forward selection (i.e., top-down search) to construct oblivious decision trees, this is not the only possible approach. Almuallim and Dietterich's Focus and Schlimmer's method require combinatorial search to handle attribute interactions precisely because they operate in this direction. However, experience with C4.5 and its relatives suggests that much of their power lies not in their forward selection scheme but in their use of pruning to eliminate unnecessary attributes. This suggests an alternative approach in which one starts with a full oblivious decision tree that includes all attributes, then uses pruning or backward elimination to remove features that do not aid classification accuracy. This scheme's advantage lies in the fact that accuracy decreases substantially when one removes a single relevant attribute, even if it interacts with other features, whereas accuracy remains unaffected when one prunes an irrelevant or redundant feature. This means that one can use greedy search rather than enumerative methods.

We have developed an algorithm, called OBLIVION, that instantiates this idea. The method begins with a full oblivious tree that incorporates all potentially relevant attributes and estimates this tree's accuracy on the entire training set, using a conservative technique like n-way cross-validation. OBLIVION then removes each attribute in turn, estimates the accuracy of the resulting tree in each case, and selects the most accurate. If this best tree makes no more errors than the initial one, OBLIVION replaces the initial tree with the best one and continues the process. On each step, the algorithm tentatively prunes each of the remaining features, selects the best, and generates a new tree with one fewer attribute. This continues until the accuracy of the best pruned tree is less than the accuracy of the current one. Unlike Focus and Schlimmer's method, OBLIVION's time complexity is polynomial in the number of features, growing with the square of this factor.
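The greedy pruning loop just described can be sketched in a few lines of Python. The sketch below again assumes instances are dictionaries from attribute names to values, and it estimates accuracy by leave-one-out cross-validation, predicting each case from the majority class of the other training cases that share its values on the retained attributes; the function names and these encoding choices are ours, not the paper's.

from collections import Counter

def loo_accuracy(instances, labels, attrs):
    # Leave-one-out accuracy of the oblivious tree defined by attrs:
    # each case is predicted by the majority class of the other cases
    # that fall into the same leaf (i.e., share its values on attrs).
    correct = 0
    for i, (inst, label) in enumerate(zip(instances, labels)):
        key = tuple(inst[a] for a in attrs)
        votes = Counter(
            l for j, (other, l) in enumerate(zip(instances, labels))
            if j != i and tuple(other[a] for a in attrs) == key)
        if votes and votes.most_common(1)[0][0] == label:
            correct += 1
    return correct / len(instances)

def oblivion_prune(instances, labels, attributes):
    # Greedy backward elimination: repeatedly drop the attribute whose
    # removal leaves the most accurate tree, stopping as soon as the
    # best pruned tree is less accurate than the current one.
    current = list(attributes)
    current_acc = loo_accuracy(instances, labels, current)
    while current:
        best_acc, best_drop = max(
            (loo_accuracy(instances, labels,
                          [a for a in current if a != drop]), drop)
            for drop in current)
        if best_acc < current_acc:
            break
        current = [a for a in current if a != best_drop]
        current_acc = best_acc
    return current

Each pass evaluates the removal of every remaining attribute, so the number of accuracy estimates grows with the square of the number of features, consistent with the complexity claim above.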
There remain a few problematic details, such as determining the order of the retained attributes. However, one need not assign an order at all, since every order should produce equivalent behavior. Instead, one can view an oblivious decision tree as a set of disjoint rules, each using the same attributes in its conditions. Because pruning can produce impure partitions of the training set, each rule specifies a distribution of class values. When an instance matches a rule's conditions, the rule simply predicts the most likely class. But sparse training data raises another issue: making predictions when a test case fails to match any rule perfectly. In this situation, we assume that one finds the best matching rules, sums the class probability distributions for each one, and predicts the most likely class.

In fact, this scheme is equivalent to using the simple nearest neighbor algorithm, but with some attributes ignored during the distance calculations. Given a test instance, this technique retrieves all those training cases that are nearest to it in the reduced space. If many features have been pruned, it becomes likely that a perfect match will occur, so that the distance is zero. Pruning also makes it probable that many training cases, though different in the original space, will appear identical in the reduced space. Given a tie, we assume that nearest neighbor takes the majority vote, which produces the same effect as predicting the most frequent class associated with an abstract rule. If no perfect matches exist, the method takes the majority vote of the nearest stored cases (which can correspond to multiple rules), giving the same result as the probabilistic scheme above. This insight into the relation between oblivious decision trees and nearest-neighbor algorithms was an unexpected benefit of our work.

Experimental Results with OBLIVION

We have carried out two types of experiments to evaluate OBLIVION's learning ability in comparison with nearest neighbor and decision-tree methods. In the first, we presented the three algorithms with artificial data, which let us explicitly vary the number of irrelevant Boolean features and observe the resulting degra-