## Extracting Comprehensible Models from Trained Neural Networks (1996)


Citations: 71 (3 self)

### BibTeX

@TECHREPORT{Craven96extractingcomprehensible,
    author = {W. Craven},
    title = {Extracting Comprehensible Models from Trained Neural Networks},
    institution = {},
    year = {1996}
}

### Abstract

To Mom, Dad, and Susan, for their support and encouragement.

### Citations

7334 | Probabilistic Reasoning in Intelligent Systems - Pearl - 1988

Citation Context: ... the Trepan algorithm to handle such models. For example, to make queries to a Bayesian-network instance model, Trepan would have to be integrated with an algorithm for inference in Bayesian networks (Pearl, 1988) to ensure that it sampled properly from the distribution. Better Theoretical Guarantees: Chapter 5 argued that the scalability of rule-extraction methods, and Trepan in particular, should be analyzed ...

5220 | C4.5: Programs for Machine Learning - Quinlan - 1993

Citation Context: ..., learning algorithms differ considerably in how they represent induced hypotheses. For example, there are learning algorithms that represent their hypotheses as decision trees (Breiman et al., 1984; Quinlan, 1993), decision lists (Rivest, 1987; Clark & Niblett, 1989), inference rules (Michalski, 1983; Quinlan, 1993), neural networks (Rumelhart et al., 1986), hidden Markov models (Rabiner, 1989), Bayesian netw...

4503 | A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition - Rabiner - 1989

Citation Context: ...n et al., 1984; Quinlan, 1993), decision lists (Rivest, 1987; Clark & Niblett, 1989), inference rules (Michalski, 1983; Quinlan, 1993), neural networks (Rumelhart et al., 1986), hidden Markov models (Rabiner, 1989), Bayesian networks (Heckerman, 1995), and stored lists of examples (Stanfill & Waltz, 1986; Aha et al., 1991). The various representations for expressing learned hypotheses differ greatly in how rea...

4155 | Classification and Regression Trees - Breiman, Friedman, et al. - 1984

Citation Context: ...entational flexibility, learning algorithms differ considerably in how they represent induced hypotheses. For example, there are learning algorithms that represent their hypotheses as decision trees (Breiman et al., 1984; Quinlan, 1993), decision lists (Rivest, 1987; Clark & Niblett, 1989), inference rules (Michalski, 1983; Quinlan, 1993), neural networks (Rumelhart et al., 1986), hidden Markov models (Rabiner, 1989)...

3509 | Induction of decision trees - Quinlan - 1986

Citation Context: ...4.5 (Quinlan, 1993), which arose in the artificial intelligence community, and CART (Breiman et al., 1984), which was developed in the statistics community. C4.5 is the successor to the ID3 algorithm (Quinlan, 1986). These two algorithms, and numerous variants of them, are similar in their overall structure, but differ somewhat in details. Here I focus mainly on C4.5, since it is used extensively in the experiment...

3406 | Fuzzy sets - Zadeh - 1965

Citation Context: ...d for the task. Hayashi's extraction method is quite similar to the local search procedures described above. The principal difference is that the literals in the rules can represent fuzzy conditions (Zadeh, 1965). A fuzzy condition is one which has graded, as opposed to Boolean, degrees of satisfaction. A local method developed by Towell and Shavlik (1993) searches not for conjunctive rules, but instead for ...

2874 | Learning Internal Representations by Error Propagation. In Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1 - Rumelhart, Hinton, et al. - 1986

Citation Context: ...sent their hypotheses as decision trees (Breiman et al., 1984; Quinlan, 1993), decision lists (Rivest, 1987; Clark & Niblett, 1989), inference rules (Michalski, 1983; Quinlan, 1993), neural networks (Rumelhart et al., 1986), hidden Markov models (Rabiner, 1989), Bayesian networks (Heckerman, 1995), and stored lists of examples (Stanfill & Waltz, 1986; Aha et al., 1991). The various representations for expressing learne...

2421 | A decision-theoretic generalization of on-line learning and an application to boosting - Freund, Schapire - 1995

Citation Context: ... function - but such a tree may be extremely large in some cases. 5.2.2 Weak Learning: Recently, decision-tree learning has been analyzed in the context of the weak learning framework (Schapire, 1990; Freund & Schapire, 1995). A weak learner is one that is able to predict just slightly more accurately than random guessing. It is known that if we have a weak-learning method for some problem, then we can con...

2364 | Density Estimation for Statistics and Data Analysis - Silverman - 1986

Citation Context: ... unsupervised algorithms varies greatly from method to method. For example, there are unsupervised methods that account for their training data by estimating probability distribution functions (e.g., Silverman, 1986), constructing hierarchical categorizations (e.g., Fisher, 1987), and reducing the data to a lower-dimensional space that accounts for most of its variance (e.g., Jolliffe, 1986). Reinforcement learn...
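The density-estimation idea this context refers to can be sketched in a few lines. This is an illustrative stand-in for a Gaussian kernel density estimate in the spirit of Silverman (1986), not code from the thesis; the bandwidth `h` is chosen arbitrarily.

```python
import math

def kde(x, samples, h=0.5):
    """Average of Gaussian kernels centered on the training samples."""
    norm = 1.0 / (len(samples) * h * math.sqrt(2 * math.pi))
    return norm * sum(math.exp(-0.5 * ((x - s) / h) ** 2) for s in samples)

# Estimate the density at 0.0 from three one-dimensional samples.
density = kde(0.0, [-1.0, 0.0, 1.0])
```

The estimate inherits smoothness from the kernel: with symmetric data, the estimated density is symmetric as well.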

2171 | Principal Component Analysis - Jolliffe - 1986

Citation Context: ...ion functions (e.g., Silverman, 1986), constructing hierarchical categorizations (e.g., Fisher, 1987), and reducing the data to a lower-dimensional space that accounts for most of its variance (e.g., Jolliffe, 1986). Reinforcement learning involves a task that lies in between supervised and unsupervised learning. A reinforcement learner operates in a dynamic environment, and it may take actions that influence t...

1732 | A theory of the learnable - Valiant - 1984

1714 | Experiments with a new boosting algorithm - Freund, Schapire - 1996

Citation Context: ... a hypothesis-boosting method. Finally, other groups have used hypothesis-boosting methods to improve the performance of classifiers in applied settings (Drucker et al., 1993; Drucker & Cortes, 1996; Freund & Schapire, 1996; Quinlan, 1996). In most of these efforts, however, the "weak" learning algorithms being boosted were fairly strong learners, such as C4.5 or multi-layer neural networks, and therefore the hypotheses...

1365 | Learning from Delayed Rewards - Watkins - 1989

Citation Context: ...rithms. In some cases, neural networks have a more appropriate restricted hypothesis space bias than other learning algorithms. For example, the Q-learning method for reinforcement learning problems (Watkins, 1989) requires that the learner represent hypotheses as continuous-valued functions, and it requires that these hypotheses be updated after each training example. Few if any symbolic learning algorithms ar...
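The Q-learning update the context refers to can be illustrated in tabular form. The thesis's point is that neural networks serve as the continuous-valued function approximator, so treat this as a toy sketch of the Watkins (1989) update rule with hypothetical states and actions, not the thesis's setup.

```python
from collections import defaultdict

def q_update(Q, s, a, reward, s_next, actions, alpha=0.1, gamma=0.9):
    """Move Q(s, a) toward reward + gamma * max_a' Q(s', a')."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (reward + gamma * best_next - Q[(s, a)])
    return Q[(s, a)]

Q = defaultdict(float)           # all Q-values start at 0.0
actions = ["left", "right"]
# One update after receiving a reward of 1.0 in a hypothetical transition.
q_update(Q, "s0", "right", 1.0, "s1", actions)
```

Note that the hypothesis is updated after each training example, exactly the property the context says favors incremental learners such as networks.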

1210 | Modeling by shortest data description - Rissanen - 1978

Citation Context: ...t weight-sharing was motivated by the desire for better predictive accuracy, it is explored here as a means for facilitating rule extraction. In the spirit of the minimum-description-length principle (Rissanen, 1978), soft weight-sharing uses a cost function that penalizes network complexity. Thus, during training the network tries to find an optimal trade-off between data-misfit (i.e., the error rate on the trai...

1101 | Instance-based learning algorithms - Aha, Kibler, et al. - 1991

Citation Context: ...alski, 1983; Quinlan, 1993), neural networks (Rumelhart et al., 1986), hidden Markov models (Rabiner, 1989), Bayesian networks (Heckerman, 1995), and stored lists of examples (Stanfill & Waltz, 1986; Aha et al., 1991). The various representations for expressing learned hypotheses differ greatly in how readily they can be inspected and understood by humans. Hypothesis languages that have a logic-like syntax (e.g., ...

867 | Learning logical definitions from relations - Quinlan - 1990

Citation Context: ...ew things, analytical learning concentrates on systems that learn to improve the efficiency of what they already know how to do. There are also supervised learning algorithms that induce relations (Quinlan, 1990; Muggleton & De Raedt, 1994), as opposed to functions, from examples. ... consists only of the ~x part; it does not include the y value. The goal in unsupervised learning is to build a model that accoun...

777 | The CN2 induction algorithm - Clark, Niblett - 1989

Citation Context: ...ow they represent induced hypotheses. For example, there are learning algorithms that represent their hypotheses as decision trees (Breiman et al., 1984; Quinlan, 1993), decision lists (Rivest, 1987; Clark & Niblett, 1989), inference rules (Michalski, 1983; Quinlan, 1993), neural networks (Rumelhart et al., 1986), hidden Markov models (Rabiner, 1989), Bayesian networks (Heckerman, 1995), and stored lists of examples (...

771 | Cross-Validatory Choice and Assessment of Statistical Predictions - Stone - 1974

Citation Context: ... a holdout set. Unless the size of the available data set is quite large, or unless the nature of the data somehow precludes it, a preferred method for accuracy estimation is to use cross validation (Stone, 1974). In k-fold cross validation, the available data is partitioned into k separate sets of approximately equal size. The cross-validation procedure involves k iterations in which the learning method is ...
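The k-fold partitioning described in this context can be sketched as follows. The index-splitting helper below is an illustrative stand-in (names are hypothetical), not code from the thesis.

```python
def k_fold_splits(n_examples, k):
    """Yield (train_indices, test_indices) for each of the k folds,
    so each example is held out for testing exactly once."""
    indices = list(range(n_examples))
    fold_size = n_examples // k
    for i in range(k):
        start = i * fold_size
        # The last fold absorbs any remainder, keeping fold sizes
        # "approximately equal" as the context says.
        end = start + fold_size if i < k - 1 else n_examples
        test = indices[start:end]
        train = indices[:start] + indices[end:]
        yield train, test

splits = list(k_fold_splits(10, 3))
```

In each of the k iterations, the learner is trained on the `train` indices and its accuracy is estimated on the held-out `test` fold; the k estimates are then averaged.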

687 | A Theory and Methodology of Inductive Learning - Michalski - 1983

Citation Context: ... example, there are learning algorithms that represent their hypotheses as decision trees (Breiman et al., 1984; Quinlan, 1993), decision lists (Rivest, 1987; Clark & Niblett, 1989), inference rules (Michalski, 1983; Quinlan, 1993), neural networks (Rumelhart et al., 1986), hidden Markov models (Rabiner, 1989), Bayesian networks (Heckerman, 1995), and stored lists of examples (Stanfill & Waltz, 1986; Aha et al.,...

685 | The strength of weak learnability - Schapire - 1990

Citation Context: ...sents the target function - but such a tree may be extremely large in some cases. 5.2.2 Weak Learning: Recently, decision-tree learning has been analyzed in the context of the weak learning framework (Schapire, 1990; Freund & Schapire, 1995). A weak learner is one that is able to predict just slightly more accurately than random guessing. It is known that if we have a weak-learning method for some...

681 | The cascade-correlation learning architecture - Fahlman, Lebiere - 1989

Citation Context: ... not able to form hypotheses themselves, but instead must be used in conjunction with ordinary learning methods. Several constructive neural-network approaches have been previously developed (Ash, 1989; Fahlman & Lebiere, 1989; Frean, 1990). Similarly, there are several algorithms that simplify learned networks by removing weights or hidden units (Le Cun et al., 1989; Mozer & Smolensky, 1988). Unlike BBP, however, these me...

664 | Knowledge Acquisition via Incremental Conceptual Clustering, Machine Learning 2:139–172 - Fisher - 1987

Citation Context: ...example, there are unsupervised methods that account for their training data by estimating probability distribution functions (e.g., Silverman, 1986), constructing hierarchical categorizations (e.g., Fisher, 1987), and reducing the data to a lower-dimensional space that accounts for most of its variance (e.g., Jolliffe, 1986). Reinforcement learning involves a task that lies in between supervised and unsuperv...

661 | Queries and concept-learning - Angluin - 1988

Citation Context: ...epan are described in detail below. 3.2.1 Membership Queries and the Oracle: The generality of Trepan derives from the fact that its interaction with the network consists solely of membership queries (Angluin, 1988). A membership query is a question to an oracle that consists of an instance from the learner's instance space. Given a membership query, the oracle returns the class label for the instance. Recall t...
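The membership-query setup described here is simple to sketch: the trained network plays the role of the oracle, labeling any instance it is asked about. The `network` below is a hypothetical stand-in for a trained model, not the thesis's code.

```python
def make_oracle(network):
    """Wrap a trained model as a membership-query oracle."""
    def oracle(instance):
        # A membership query: submit an instance from the learner's
        # instance space, get back the class label the network assigns.
        return network(instance)
    return oracle

# Toy "trained network": a fixed threshold on the first feature.
network = lambda x: 1 if x[0] > 0.5 else 0
oracle = make_oracle(network)
```

Because the extraction algorithm only ever calls `oracle(instance)`, it needs no access to the model's internals, which is the source of the generality the context describes.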

634 | Learnability and the Vapnik-Chervonenkis dimension - Blumer, Ehrenfeucht, et al. - 1989

Citation Context: ... that the BBP algorithm is an efficient PAC algorithm for the class Pk using the hypothesis class Hr. While there are already known algorithms for PAC learning the general class of perceptrons (e.g., Blumer et al., 1989), an atypical aspect of the BBP algorithm is that when learning target functions which are in the class of sparse perceptrons, it produces hypotheses that are relatively sparse themselves. In particu...

615 | Irrelevant features and the subset selection problem - John, Kohavi, et al. - 1994

Citation Context: ...an sometimes be improved by using an explicit feature-selection method in conjunction with an ordinary learning algorithm (Almuallim & Dietterich, 1991; Kira & Rendell, 1992b; Caruana & Freitag, 1994; John et al., 1994; Moore & Lee, 1994; Skalak, 1994). In the experiments below, we use one such method - Relief (Kira & Rendell, 1992a; 1992b) - with C4.5 as the learning algorithm. Relief is a filter-type method f...

610 | An introduction to computational learning theory - Kearns, Vazirani - 1994

Citation Context: ... Trepan treats the rule-extraction task as a learning problem, this section discusses Trepan in the context of analytical frameworks that have developed in the field of computational learning theory (Kearns & Vazirani, 1994). Specifically, I discuss results for learning decision trees that have been proven for the PAC-learning, weak-learning, and agnostic PAC-learning models. 5.2.1 PAC Learning: The notion of probably ap...

548 | Active learning with statistical models - Cohn, Ghahramani, et al. - 1996

Citation Context: ...nt, thereby exerting some control over the states (i.e., training examples) it experiences. Practical algorithms for active learning have also been developed for regression tasks (e.g., MacKay, 1992; Cohn et al., 1996) and classification tasks (e.g., Baum & Lang, 1991; Cohn et al., 1994). The classification algorithms, however, are for learning in neural networks, and thus they do not produce comprehensible hypoth...

540 | Perceptrons: An Introduction to Computational Geometry - Minsky, Papert - 1969

Citation Context: ...fers to the constraints that a learning algorithm places on the hypotheses that it is able to construct. For example, the hypothesis space of a perceptron is limited to linear discriminant functions (Minsky & Papert, 1969). The preference bias of a learning algorithm refers to the preference ordering it places on the models that are within its hypothesis space. For example, most learning algorithms initially try to fi...

492 | Toward memory-based reasoning - Stanfill, Waltz - 1986

Citation Context: ...), inference rules (Michalski, 1983; Quinlan, 1993), neural networks (Rumelhart et al., 1986), hidden Markov models (Rabiner, 1989), Bayesian networks (Heckerman, 1995), and stored lists of examples (Stanfill & Waltz, 1986; Aha et al., 1991). The various representations for expressing learned hypotheses differ greatly in how readily they can be inspected and understood by humans. Hypothesis languages that have a logic-like ...

479 | Parallel networks that learn to pronounce English text - Sejnowski, Rosenberg - 1987

Citation Context: ...ng protein-coding regions in E. coli DNA sequences (Craven & Shavlik, 1993b), diagnosing the presence of heart disease in patients (Detrano et al., 1989), mapping English text into its pronunciation (Sejnowski & Rosenberg, 1987), recognizing promoters in E. coli DNA sequences (Towell et al., 1990), diagnosing faults in local telephone loops (Provost & Danyluk, 1995), and predicting the party affiliation of members of the U....

467 | Inductive logic programming: Theory and methods - Muggleton, De Raedt - 1994

Citation Context: ...lytical learning concentrates on systems that learn to improve the efficiency of what they already know how to do. There are also supervised learning algorithms that induce relations (Quinlan, 1990; Muggleton & De Raedt, 1994), as opposed to functions, from examples. ... consists only of the ~x part; it does not include the y value. The goal in unsupervised learning is to build a model that accounts for regularities in the ...

435 | Optimal brain damage - LeCun, Denker, et al. - 1990

432 | Boosting a weak learning algorithm by majority - Freund - 1995

Citation Context: ... weak learning algorithm for a class F, there are general mechanisms for boosting the weak learner into a strong learner for F. Several such boosting algorithms have been developed (Schapire, 1990; Freund, 1990; Freund, 1992), and one called AdaBoost (Freund & Schapire, 1995) is the basis of the BBP algorithm. Figure 27 shows the version of AdaBoost on which our BBP algorithm is based. We assume here that t...
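The reweighting step at the heart of the AdaBoost variant the context mentions (Figure 27 of the thesis) can be sketched as follows. This is an illustrative reconstruction of the standard Freund & Schapire (1995) update, not the thesis's BBP code; the example weights and error rate are hypothetical.

```python
def adaboost_reweight(weights, correct, epsilon):
    """Downweight correctly classified examples by epsilon / (1 - epsilon),
    then renormalize, so the next weak learner concentrates on the
    examples the current one got wrong."""
    beta = epsilon / (1.0 - epsilon)
    new_w = [w * (beta if ok else 1.0) for w, ok in zip(weights, correct)]
    total = sum(new_w)
    return [w / total for w in new_w]

# Four equally weighted examples; the weak learner errs on the last one
# (weighted error epsilon = 0.25).
w = adaboost_reweight([0.25] * 4, [True, True, True, False], 0.25)
```

After the update, the misclassified example carries half the total weight, which is why even a learner only slightly better than random can be boosted into a strong one over many rounds.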

429 | Improved generalization with active learning - Cohn, Atlas, et al. - 1994

Citation Context: ...les) it experiences. Practical algorithms for active learning have also been developed for regression tasks (e.g., MacKay, 1992; Cohn et al., 1996) and classification tasks (e.g., Baum & Lang, 1991; Cohn et al., 1994). The classification algorithms, however, are for learning in neural networks, and thus they do not produce comprehensible hypotheses like Trepan. Catlett (1992) and others (Musick et al., 1993) have ...

421 | A Practical Bayesian Framework for Backpropagation Networks - MacKay - 1992

Citation Context: ... its environment, thereby exerting some control over the states (i.e., training examples) it experiences. Practical algorithms for active learning have also been developed for regression tasks (e.g., MacKay, 1992; Cohn et al., 1996) and classification tasks (e.g., Baum & Lang, 1991; Cohn et al., 1994). The classification algorithms, however, are for learning in neural networks and thus they do not produce co...

380 | Decision theoretic generalizations of the PAC model for neural net and other learning applications - Haussler - 1992

Citation Context: ...tract a decision tree of arbitrarily good fidelity may not be especially interesting if such a tree might be too complex to understand. In this section, I consider the model of agnostic PAC learning (Haussler, 1992; Kearns et al., 1992). In several respects the agnostic PAC model is more appropriate for analyzing the rule-extraction task than are the ordinary PAC and weak-learning models. Agnostic PAC learni...

380 | A practical approach to feature selection - Kira, Rendell - 1992

Citation Context: ...hart et al., 1986) trained using a conjugate-gradient method (Kramer & Sangiovanni-Vincentelli, 1989); 2. decision trees induced using C4.5 (Quinlan, 1993); 3. the Relief feature-selection algorithm (Kira & Rendell, 1992a; 1992b) used in conjunction with C4.5; 4. the Balanced version of the Winnow algorithm (Littlestone, 1989; 1995). The reasons for selecting each of these algorithms are as follows. Multi-layer netwo...

380 | Learning decision lists - Rivest - 1987

Citation Context: ...siderably in how they represent induced hypotheses. For example, there are learning algorithms that represent their hypotheses as decision trees (Breiman et al., 1984; Quinlan, 1993), decision lists (Rivest, 1987; Clark & Niblett, 1989), inference rules (Michalski, 1983; Quinlan, 1993), neural networks (Rumelhart et al., 1986), hidden Markov models (Rabiner, 1989), Bayesian networks (Heckerman, 1995), and sto...

370 | Fuzzy ARTMAP: A neural network architecture for incremental supervised learning of analog multidimensional maps - Carpenter, Grossberg, et al. - 1992

Citation Context: ...ghtforward process of translating each local function directly into a rule. Finally, Tan (1994) has presented a rule-extraction method for a complex neural-network architecture called the Fuzzy ARTMAP (Carpenter et al., 1992). As is the case for networks with local basis functions, rule extraction in this context involves directly translating parts of the network architecture into rules. One novel aspect of this approach...

344 | Connectionist learning procedures - Hinton - 1989

Citation Context: ... term that reflects a prior distribution over the values that the parameters can take (Rumelhart et al., 1995). An appropriate cost function for classification problems is the cross-entropy function (Hinton, 1989):

C = −∑_i ∑_j [ t_j ln(a_j) + (1 − t_j) ln(1 − a_j) ]

Here i ranges over the examples in the training set, j ranges over the output units of the network, t_j is the target value for the jth outp...
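The cost function quoted in this context can be transcribed directly. This sketch assumes each output activation lies strictly in (0, 1), as a sigmoidal unit produces; the example targets and activations are made up for illustration.

```python
import math

def cross_entropy(targets, activations):
    """C = -sum_i sum_j [t_j ln(a_j) + (1 - t_j) ln(1 - a_j)],
    where i ranges over training examples and j over output units."""
    cost = 0.0
    for t_row, a_row in zip(targets, activations):   # i: training examples
        for t, a in zip(t_row, a_row):               # j: output units
            cost -= t * math.log(a) + (1.0 - t) * math.log(1.0 - a)
    return cost

# One example, two output units: targets (1, 0), activations (0.9, 0.1).
c = cross_entropy([[1.0, 0.0]], [[0.9, 0.1]])
```

The cost is zero only when every activation matches its target exactly, and grows without bound as an activation approaches the wrong extreme, which is what makes it a suitable training criterion for classification.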

337 | Estimating Continuous Distributions in Bayesian Classifiers - John, Langley - 1995

Citation Context: ...y estimation procedure has the property of consistency, meaning that as the size of the training set tends to infinity, the estimate of the density function converges to the true function (Devroye, 1983; John & Langley, 1995). Since the value of σ is inversely proportional to the available data, however, the method produces smooth, near-Gaussian estimates when training data is scarce. This method of modeling the underly...

333 | From Data Mining to Knowledge Discovery in Databases: An Overview - Fayyad, Piatetsky-Shapiro, et al. - 1996

Citation Context: ...d to a better understanding of the problem domain. Inductive learning with a focus on comprehensibility is a central activity in the growing field of knowledge discovery in databases and data mining (Fayyad et al., 1996). Of course, it is often the case that a learning method is applied in a given domain for both of these purposes: to construct a system that can perform a useful task, and to get a better understandi...

328 | Distributed representations, simple recurrent networks, and grammatical structure - Elman - 1991

Citation Context: ...ed methods, which describe individual hidden units, several authors have investigated methods aimed at characterizing sets of hidden units (Sanger, 1989; Hanson & Burr, 1990; Dennis & Phillips, 1991; Elman, 1991). These methods are similar in spirit to the FSA-extraction algorithms in that they examine the space of hidden-unit activations, and try to identify regions in this space that are associated wit...

317 | A Tutorial on Learning Bayesian Networks - Heckerman - 1995

Citation Context: ...ision lists (Rivest, 1987; Clark & Niblett, 1989), inference rules (Michalski, 1983; Quinlan, 1993), neural networks (Rumelhart et al., 1986), hidden Markov models (Rabiner, 1989), Bayesian networks (Heckerman, 1995), and stored lists of examples (Stanfill & Waltz, 1986; Aha et al., 1991). The various representations for expressing learned hypotheses differ greatly in how readily they can be inspected and unders...

313 | Syskill & Webert: Identifying interesting web sites - Pazzani, Muramatsu, et al. - 1996

Citation Context: ...earned model to a large body of data, one might want to translate a learned hypothesis (or part of one) into a standard query language so that it can be used to query a database (Fayyad et al., 1996; Pazzani et al., 1996). In addition to predictive accuracy, comprehensibility, and representational flexibility, there are other criteria that are often important considerations when evaluating a learning algorithm or the...

288 | Improving elevator performance using reinforcement learning - Crites, Barto - 1996

Citation Context: ...der to use the data for query instances. Similarly, for some reinforcement learning tasks there exists a good computational model of the problem domain that could be used to generate instances (e.g., Crites & Barto, 1996). This situation is discussed further in Chapter 5. The second distribution-based approach to querying - which is investigated in depth in the following chapter - is to construct a model of the underly...

287 | Regularization algorithms for learning that are equivalent to multilayer networks - Poggio, Girosi - 1990

Citation Context: ...s in the network employ. I shall first discuss methods for networks that use sigmoidal transfer functions, and then cover methods for networks that employ local basis functions (Moody & Darken, 1988; Poggio & Girosi, 1990) instead of sigmoids. There are a number of local rule-extraction methods for networks that use sigmoidal activation transfer for their hidden and output units. In these methods, the assumption is ma...

284 | Mathematical Statistics and Data Analysis - Rice - 1988

Citation Context: ...f the marginal distributions for any feature are significantly different. Since each feature presents an opportunity to spuriously reject the null hypothesis, Trepan uses the Bonferroni correction (Rice, 1995) to adjust the significance level of the overall test downward for the individual tests. Note that if a node has very little data on which to base its model, then it is unlikely that the null hypothe...
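The Bonferroni correction this context refers to amounts to dividing the overall significance level by the number of individual tests, bounding the chance of any spurious rejection. A minimal sketch, with an arbitrary example of 10 per-feature tests at an overall level of 0.05:

```python
def bonferroni_alpha(alpha, m_tests):
    """Per-test significance level so that the probability of at least
    one spurious rejection across m tests stays at most alpha."""
    return alpha / m_tests

# Each of 10 per-feature tests is run at the adjusted, stricter level.
adjusted = bonferroni_alpha(0.05, 10)
```

As the context notes, with very little data at a node the null hypothesis is rarely rejected at this stricter per-test level.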

283 | Bagging, boosting, and C4.5 - Quinlan - 1996

Citation Context: ...thod. Finally, other groups have used hypothesis-boosting methods to improve the performance of classifiers in applied settings (Drucker et al., 1993; Drucker & Cortes, 1996; Freund & Schapire, 1996; Quinlan, 1996). In most of these efforts, however, the "weak" learning algorithms being boosted were fairly strong learners, such as C4.5 or multi-layer neural networks, and therefore the hypotheses produced were ...

275 | The feature selection problem: Traditional methods and a new algorithm - Kira, Rendell - 1992

Citation Context: ...hart et al., 1986) trained using a conjugate-gradient method (Kramer & Sangiovanni-Vincentelli, 1989); 2. decision trees induced using C4.5 (Quinlan, 1993); 3. the Relief feature-selection algorithm (Kira & Rendell, 1992a; 1992b) used in conjunction with C4.5; 4. the Balanced version of the Winnow algorithm (Littlestone, 1989; 1995). The reasons for selecting each of these algorithms are as follows. Multi-layer netwo...