## Computational Intelligence Methods for Rule-Based Data Understanding (2004)

Venue: Proceedings of the IEEE

Citations: 23 (3 self)

### BibTeX

@ARTICLE{Duch04computationalintelligence,

author = {Wlodzislaw Duch and Rudy Setiono and Jacek M. Zurada},

title = {Computational Intelligence Methods for Rule-Based Data Understanding},

journal = {Proceedings of the IEEE},

year = {2004},

volume = {92},

number = {5},

pages = {771--805}

}

### Abstract

This paper is focused on the extraction and use of logical rules for data understanding. All aspects of rule generation, optimization, and application are described, including the problem of finding good symbolic descriptors for continuous data, tradeoffs between accuracy and simplicity at the rule-extraction stage, and tradeoffs between rejection and error level at the rule optimization stage. Stability of rule-based description, calculation of probabilities from rules, and other related issues are also discussed. Major approaches to extraction of logical rules based on neural networks, decision trees, machine learning, and statistical methods are introduced. Optimization and application issues for sets of logical rules are described. Applications of such methods to benchmark and real-life problems are reported and illustrated with simple logical rules for many datasets. Challenges and new directions for research are outlined.

### Citations

4828 |
Neural Networks for Pattern Recognition
- Bishop
- 1995
Citation Context ...all training data samples. The second term, scaled by , is used frequently in the weight pruning or in the Bayesian regularization method [107], [108] to improve generalization of the MLP networks. A naive interpretation of why such regularization works is based on the observation that small weights and thresholds mean that only the linear part of ... |

3909 |
Classification and Regression Trees
- Breiman, Friedman, et al.
- 1984
Citation Context ... as simulated annealing, randomization, regression, linear programming, and neural-inspired algorithms. C. Methods Using Random Perturbation. CART [49] splits each node in the decision tree to maximize the purity of the resulting subsets. A node with patterns that belong only to one class has the highest purity. Nodes with patterns from several clas... |
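The purity-driven node splitting described in this snippet is easy to illustrate. The sketch below (Python; function names are my own, and this is a simplified illustration rather than CART's actual implementation) scores a binary split on a numeric feature by weighted Gini impurity, which is zero exactly when both resulting subsets are pure:

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity: 0 for a pure node, larger for mixed-class nodes."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def split_impurity(feature, labels, threshold):
    """Weighted Gini impurity of splitting a numeric feature at a threshold."""
    left = [y for x, y in zip(feature, labels) if x <= threshold]
    right = [y for x, y in zip(feature, labels) if x > threshold]
    n = len(labels)
    return (len(left) / n) * gini_impurity(left) + \
           (len(right) / n) * gini_impurity(right)
```

A tree-growing loop would pick the threshold minimizing this score at each node.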

3354 | Induction of Decision Trees
- Quinlan
- 1986
Citation Context ...simplest decision rules [32]. This algorithm searches for the best feature that has a set of values or a range of values for which vectors from a single class dominate, and presents it as a rule. ID3 [41] and its successors, C4.5 [42] and C5.0, are currently the most widely used algorithms for generating decision trees. Given a dataset , a decision tree is generated recursively as follows. 1) If conta... |
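The attribute selection behind ID3-style tree growing is an entropy-reduction criterion. A minimal sketch (Python; names are my own, not Quinlan's code) of the information gain of partitioning examples by a discrete attribute:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a class-label list, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels):
    """Entropy reduction from partitioning the examples by attribute value."""
    n = len(labels)
    remainder = 0.0
    for v in set(values):
        subset = [y for x, y in zip(values, labels) if x == v]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder
```

ID3 recursively picks the attribute with the largest gain for each node.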

3077 | Fuzzy sets
- Zadeh
- 1965
Citation Context ...rapezoidal membership functions are similar approximations to the soft trapezoid functions obtained from combinations of two sigmoidal transfer functions (see next section). The fuzzy set theory [13]–[16] gives only a formal definition of membership function and relation. Fig. 2. Shapes of decision borders for: (a) g... |
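The soft trapezoid mentioned here can be written as a difference of two sigmoids. A small sketch (Python; the slope parameter and function names are assumptions for illustration):

```python
import math

def sigmoid(x, slope=5.0):
    """Logistic sigmoid with an adjustable slope."""
    return 1.0 / (1.0 + math.exp(-slope * x))

def soft_trapezoid(x, a, b, slope=5.0):
    """Soft trapezoidal membership: high on [a, b], smoothly falling outside."""
    return sigmoid(x - a, slope) - sigmoid(x - b, slope)
```

As the slope grows, the membership approaches a crisp interval [a, b], which is how fuzzy and crisp rules connect in this framework.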

2046 |
The Elements of Statistical Learning
- Hastie, Tibshirani, et al.
- 2001
Citation Context ...pings from the input to the output space. Discovery of class structures, interesting association patterns, sequences, or causal relationships has never been an explicit goal in designing such methods [1], [2]. Predictive nonparametric classification and approximation methods frequently achieve high accuracy using a large number of numerical parameters in a way that is incomprehensible to humans. This... |

1232 |
Pattern Recognition with Fuzzy Objective Function Algorithms
- Bezdek
- 1981
Citation Context ... of relational fuzzy rules, see [151]. The process of generating relational fuzzy rule-based data explanations consists of the following two steps. 1) Initial rule generation. Fuzzy c-means clustering [152] is first performed to find similar data groups (clusters). The locations and shapes of the initial membership functions are estimated using subsets of data whose membership to the corresponding clust... |
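Fuzzy c-means alternates between updating memberships and membership-weighted cluster centers. A minimal one-dimensional sketch (Python; the deterministic initialization and parameter defaults are my own simplifications, not the cited algorithm's exact form):

```python
def fuzzy_c_means(points, c=2, m=2.0, iters=50):
    """1-D fuzzy c-means: returns cluster centers and the membership matrix."""
    srt = sorted(points)
    # deterministic initialization: evenly spaced points from the sorted data
    centers = [srt[i * (len(srt) - 1) // (c - 1)] for i in range(c)]
    n = len(points)
    u = []
    for _ in range(iters):
        # membership of each point in each cluster, inversely tied to distance
        u = [[1.0 / sum((max(abs(x - centers[j]), 1e-12) /
                         max(abs(x - centers[k]), 1e-12)) ** (2.0 / (m - 1.0))
                        for k in range(c))
              for j in range(c)]
             for x in points]
        # centers are membership-weighted means
        centers = [sum(u[i][j] ** m * points[i] for i in range(n)) /
                   sum(u[i][j] ** m for i in range(n))
                   for j in range(c)]
    return centers, u
```

The resulting memberships (each row sums to one) are what the rule-generation step turns into initial membership functions.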

1113 |
Pattern recognition and neural networks
- Ripley
- 1996
Citation Context ...ationally more costly to find, but certainly for some data, they may be simple and accurate. D. Diabetes. The “Pima Indian Diabetes” dataset [118] is also frequently used as benchmark data [89], [134]–[136]. All patients were females, at least 21 years old, of Pima Indian heritage. Seven hundred sixty-eight cases have been collected, 500 (65.1%) healthy and 268 (34.9%) with diabetes. Eight attributes de... |

976 | Fast effective rule induction
- Cohen
- 1995
Citation Context ... number of examples covered by the rule, is the number of class examples, and is the number of classes. Generated rules are either ordered (have to be used in a specified sequence) or unordered. RIPPER [54] creates conjunctive rules covering examples from a given class, adding new features in a similar way as decision trees, selecting the best subsets or splits on the basis of information gain heuristic... |

881 |
Exploratory Data Analysis
- Tukey
- 1977
Citation Context ...ndustrial applications. Visualization forms a basis of the exploratory data analysis (EDA) that tries to uncover underlying data structure, detect outliers and anomalies, and find important variables [7], [8]. Experts are able to understand the data simply by inspecting such visual representations. Visualization of neural networks outputs and activities of hidden layer allows to understand better map... |

747 | The CN2 induction algorithm
- Clark, Niblett
- 1989
Citation Context ...ted cases (for online learning). The AQ15 program and several other algorithms were used in a multistrategy approach to data mining [52], combining ML, database, and knowledge-based technologies. CN2 [53] is an example of a covering algorithm combining features of AQ with decision tree learning. A search for good rules proceeds in a general-to-specific order, adding new conjunctive conditions or remov... |

670 | A theory and methodology of inductive learning - Michalski - 1983 |

552 |
Generalization as Search
- Mitchell
- 1982
Citation Context ...accurate rules than those generated by a C4.5 decision tree in the rules mode. Version spaces (VS) is an algorithm that also belongs to the family of covering algorithms [24], [55]. The VS algorithm works with symbolic inputs, formulating hypotheses about the data in the form of conjunctive rules. Such a hypothesis space may be ordered according to how general or specific the h... |

539 |
The meaning and use of the area under a receiver operating characteristic (ROC) curve
- Hanley, McNeil
- 1982
Citation Context ...e point (in Fig. 6 points for two classifiers, A and B, are shown). The area under the line connecting (0,0) with point plus the line connecting with (1,1) is known as the area under the ROC curve (AUC) [115]. For a crisp rule classifier AUC . Thus, different combinations of sensitivity and specificity give the same AUC as long as the sum is constant. Maximization of AUC is equivalent to the minimal cost ... |
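The geometry described here implies that a single-point ROC curve has an AUC equal to the mean of sensitivity and specificity, which is why equal sums give equal AUC. A quick numerical check (Python; the function name is hypothetical):

```python
def crisp_auc(sensitivity, specificity):
    """AUC of a one-point ROC: area under the segments (0,0)-(FPR,TPR)-(1,1)."""
    fpr = 1.0 - specificity
    # triangle under the first segment plus trapezoid under the second;
    # algebraically this reduces to (sensitivity + specificity) / 2
    return 0.5 * fpr * sensitivity + 0.5 * (1.0 - fpr) * (1.0 + sensitivity)
```

For example, sensitivity 0.8 with specificity 0.9 and sensitivity 0.9 with specificity 0.8 both give AUC 0.85.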

438 | Very simple classification rules perform well on most commonly used datasets
- Holte
- 1993
Citation Context ...n using a subset of all data only (the subset covered by a decision tree node or a given network node). Global methods may be based on: 1) searching intervals that contain vectors from a single class [32]; 2) entropy or information theory to determine intervals that have low entropy or high mutual information with class labels [33]; 3) univariate decision trees to find the best splits for a single fea... |

408 | Supervised and unsupervised discretization of continuous features
- Dougherty, Kohavi, et al.
- 1995
Citation Context ...limiting the applicability of this method for finding linguistic variables (as shown in Fig. 7 later). Methods that are useful for creating linguistic variables draw inspiration from different fields [31]. Global discretization methods are independent of any rule-extraction method, treating each attribute separately. Local discretization methods are usually embedded in rule-extraction algorithms, perf... |

399 | A practical Bayesian framework for backpropagation networks
- MacKay
- 1992
Citation Context ...ss for all training data samples. The second term, scaled by , is used frequently in the weight pruning or in the Bayesian regularization method [107], [108] to improve generalization of the MLP networks. A naive interpretation of why such regularization works is based on the observation that small weights and thresholds mean that only the linear p... |

365 |
Computer Systems that Learn
- Weiss, Kulikowski
- 1995
Citation Context ...), and information about menopause, which breast and breast quadrant were affected, and whether radiation treatment has been applied. These data have been analyzed using a number of algorithms (e.g., [126]–[131]). The results are rather poor, as should be expected, since the data attributes are not very informative. Several ML methods gave predictions that are worse than those of the majority classifie... |

345 |
Principles of Data Mining
- Hand, Mannila, et al.
- 2001
Citation Context ...asses for each feature. Histograms should be smoothed, for example, by assuming that each feature value is really a Gaussian or a triangular fuzzy number (kernel smoothing techniques are discussed in [30]). Unfortunately, histograms for different classes frequently overlap strongly, limiting the applicability of this method for finding linguistic variables (as shown in Fig. 7 later). Methods that are ... |

317 | The MultiPurpose Incremental Learning System AQ15 and its Testing Applications to Three Medical Domains - Michalski, Mozetic, et al. - 1986 |

234 |
A survey and critique of techniques for extracting rules from trained artificial neural networks
- Andrews, Diederich, et al.
- 1995
Citation Context ... knowledge-based expert systems is a great challenge for computational intelligence. Reasoning with logical rules is more acceptable to human users than the recommendations given by black box systems [4], because such reasoning is comprehensible, provides explanations, and may be validated by human inspection. It also increases confidence in the system, and may help to discover important relationship... |

210 | Robust linear programming discrimination of two linearly inseparable sets
- Bennett, Mangasarian
- 1992
Citation Context ...large number of rules will usually lead to poor generalization, and the insight into the knowledge hidden in the data will be lost. C. Wisconsin Breast Cancer Data. The Wisconsin breast cancer dataset [132] is one of the favorite benchmark datasets for testing classifiers (Table 5). Properties of cancer cells were collected for 699 cases, with 458 be... |

203 |
Introduction to artificial neural systems
- Zurada
- 1992
Citation Context ...constructing the LN using training data directly (constructive, or C-MLP2LN algorithm) is faster and usually more accurate. Since interpretation of the activation of the MLP network nodes is not easy [106], a smooth transition from MLP to a logical type of network performing similar functions is advocated. This transition is achieved during network training by the following. 1) Increasing gradually the... |

196 | Extracting refined rules from knowledge-based neural networks
- Towell, Shavlik
- 1993
Citation Context ...on of contributions of inputs that are not specified in rule antecedents. As shown by Sethi and Yoo [73], the number of search nodes is then reduced to . In the Subset algorithm of Towell and Shavlik [74], inputs with largest weights are analyzed first, and if they are sufficient to activate the hidden node of the network irrespective of the values on other inputs, a new rule is recorded. Combinations... |

166 |
Neural Network Learning and Expert Systems
- Gallant
- 1993
Citation Context ...literals that were present in the training set. Due to these restrictions, their method sometimes creates a rule that is too general. This drawback has been removed in the method developed by Gallant [63]. The difficulty comes from the inputs that are not specified in the rule provided as a candidate by the search procedure. Gallant takes all possible values for these inputs, and although rules genera... |

150 | Knowledge-based artificial neural networks
- Towell, Shavlik
- 1994
Citation Context ...ent inputs; therefore, it is easier to analyze. If symbolic knowledge is used to specify initial weights, as it is done in the knowledge-based artificial neural networks (KBANN) of Towell and Shavlik [76], weights are clustered before and after training. The search process is further simplified if the prototype weight templates (corresponding to symbolic rules) are used for comparison with the weight ... |

135 |
Foundations of Neuro-Fuzzy Systems
- Nauck, Klawonn, et al.
- 1997
Citation Context ...ay seem that neurofuzzy systems should have advantages in application to rule extraction, since crisp rules are just a special case of fuzzy rules. Many neurofuzzy systems have been constructed [23], [100]–[103]. However, there is a danger of overparametrization of such systems, leading to difficulty of finding optimal solutions even with the help of evolutionary algorithms or other global optimization... |

126 | Functional equivalence between radial basis function networks and fuzzy inference systems
- Jang, Sun
- 1993
Citation Context ...prehensible models of the data may be approximated by rule-based systems in the same way. Neural networks based on separable localized activation functions are equivalent to fuzzy logic systems [14], [92]. Each node has a direct interpretation in terms of fuzzy rules, which eliminates the need for a search process. Gaussian functions are used for inserting and extracting knowledge into the radial basi... |

126 | An empirical comparison of pattern recognition, neural nets, and machine learning classification methods
- Weiss
- 1989
Citation Context ...re conservative, assigning healthy persons (all errors are very close to the decision border) to one of the hypothyroid problem groups. Rules of similar quality have been found by Weiss and Kapouleas [139] using a heuristic version of the predictive value maximization (PVM) method and using the CART decision tree. The differences among PVM, CART, and C-MLP2LN for this dataset are rather ... |

123 |
Inductive Logic Programming: Techniques and Applications
- Lavrač, Džeroski
- 1997
Citation Context ...e logic programming (ILP) is a subfield of ML concerned with inducing first-order predicate calculus logic rules (FOL rules) from data (examples and additional knowledge) expressed as Prolog programs [56]. Objects classified by FOL rules may have a relational, nested structure that cannot be expressed by an attribute-value vector. This is useful for sequential data (such as those in natural language a... |

119 | Multivariate decision trees - Brodley, Utgoff - 1995 |

117 | A general framework for adaptive processing of data structures
- Frasconi, Gori, et al.
- 1998
Citation Context .... It could be of great importance to formulate clear challenges in these fields and provide more data for empirical tests. Algorithms that treat objects of complex structure already exist [25], [26], [155]. However, there is a balance between generality and efficiency of algorithms for analysis of complex objects, and much more work in this direction is needed to find an optimal balance. Going beyond p... |

111 |
Foundations of Neural Networks, Fuzzy Systems, and Knowledge Engineering
- Kasabov
- 1996
Citation Context ...de more flexible decision borders. Flexibility of the fuzzy approach depends on the choice of membership functions. Fuzzy logic classifiers frequently use a few membership functions per input feature [13]–[15]. Triangular membership functions provide oval decision borders, similar to those provided by Gaussian functions (see Fig. 2). In fact, each fuzzy rule may be represented as a node of a network t... |

104 | ASSISTANT 86: A knowledge-elicitation tool for sophisticated users - Cestnik, Kononenko, et al. - 1987 |

103 | Error-based and entropy-based discretization of continuous features
- Kohavi, Sahami
- 1996
Citation Context ...entropy or information theory to determine intervals that have low entropy or high mutual information with class labels [33]; 3) univariate decision trees to find the best splits for a single feature [34]; 4) latent variable models [35]; or 5) Chi-square statistics to merge intervals [36]. A discretization algorithm based on a separability criterion [37], described below, creates a small number of int... |

87 | Extracting tree-structured representations of trained networks
- Craven, Shavlik
- 1996
Citation Context ...n some training data. The network is used as an “oracle,” providing as many training examples as needed. This approach has been used quite successfully by Craven and Shavlik in their TREPAN algorithm [91], combining decision trees with neural networks. Decision trees are induced on the training data, plus the new data obtained by perturbing the training data. The additional training data are classifie... |

83 | A system for induction of oblique decision trees - Murthy, Kasif, et al. - 1994 |

80 |
Rough Sets: Theoretical Aspects of Reasoning About Data
- Pawlak
- 1992
(Show Context)
Citation Context ...degree of membership . Since multivariate mappings are difficult to interpret, understanding the data using rules may be regarded as an attempt to discretize the mapping in some way. Rough set theory =-=[19]-=-, [20] can also be used to derive crisp logic propositional rules. In this theory, for two-class problems the lower approximation of the data is defined as a set of vectors, or a region of the feature... |

74 |
The truth will come to light: Directions and challenges in extracting the knowledge embedded within trained artificial neural networks
- Tickle, Andrews, et al.
- 1998
Citation Context ...g data are built in, and wide-margin classification provided by neural networks leads to more robust logical rules. Neural rule-extraction algorithms may be compared using six aspects (as proposed in [61], and extended in [11]): 1) the “expressive power” of the extracted rules (types of rules extracted); 2) the “quality” of the extracted rules (accuracy, fidelity comparing to the underlying network, c... |

74 |
Rule learning by searching on adapted nets
- Fu
- 1991
Citation Context ... possible combinations of input features. Rules corresponding to the whole network are combined from rules for each network node. Local methods for extraction of conjunctive rules were proposed by Fu [68]–[71] and by Setiono and Liu [57]–[59], [72]. As with the global methods, the depth of search for good rules is restricted. The weights may be used to limit the search tree by providing the evaluation... |

73 | Using Sampling and Queries to Extract Rules from Trained Neural Networks
- Craven, Shavlik
- 1994
Citation Context ...approach [82], described in some detail below, such a mapping is incorporated in the learning scheme. Rule extraction as learning (REAL) is a rather general technique introduced by Craven and Shavlik [83] for incremental generation of new rules (both conjunctive and M-of-N rules). If a new example is not classified correctly by the existing set of rules, a new rule based on the misclassified example is ... |

69 | Pattern recognition via linear programming: Theory and application to medical diagnosis. Large-scale numerical optimization - Mangasarian, Setiono, et al. - 1990 |

67 |
On the quasi-minimal solution of the general covering problem
- Michalski
- 1969
Citation Context ... methods is not available at the time of this writing, but selected algorithms were presented in a textbook [24]. For over 30 years, Michalski has been working on the family of AQ covering algorithms [51], creating more than 20 major versions. In AQ concept description, rules for assigning cases to a class are built starting from a “seed” example selected for each class. A set of most general rules th... |

66 |
Feature space mapping as a universal adaptive system
- Duch, Diercksen
- 1995
Citation Context ...le transfer functions (i.e., calculate products of functions, one for each feature) in a single hidden layer, are capable of creating the same decision borders as crisp, fuzzy, or rough set rule sets [23]. Propositional logic has in itself a limited expressive power and may be used only in domains where attribute-value language (i.e., vector feature space description) is sufficient to express knowledg... |

61 | Induction of logic programs: Foil and related systems
- Quinlan, Cameron-Jones
- 1995
Citation Context ...ic [24], [25]. Since even the full first-order logic is computationally difficult to implement, various restrictions have been proposed to make the process of rule discovery computationally effective [26]. III. LINGUISTIC VARIABLES. Logical rules (as do other attempts to verbalize knowledge) require symbolic inputs, called linguistic variables. This implies that the input data have to be quantized, i.e... |

55 | OC1: A Randomized Induction of Oblique Decision Trees
- Murthy, Kasif, et al.
- 1993
Citation Context ...s, ’s are the coefficients that determine the orientation of the hyperplane, and is a threshold. The values of and are fine-tuned by perturbing their values to decrease the impurity of the split. OC1 [46] combines deterministic hill climbing with randomization to find the best multivariate node split. It first finds the best axis-parallel split at a node, then looks for a better split by searching for... |

49 | A new methodology of extraction, optimization and application of crisp and fuzzy logical rules
- Duch, Adamczak, et al.
- 2001
Citation Context ... rules may differ, they always partition the whole feature space into some subspaces. A general form of a crisp rule is IF THEN Class (1). The authors have included excerpts from their earlier paper [11] and [151], as they felt that this inclusion would be beneficial to the readers who are not necessarily specialists in computational intelligence techniques, and that doing so would enhance the tutori... |

47 | Neural Networks in Designing Fuzzy Systems for Real World Applications (Fuzzy Sets and Systems) - Halgamuge, Glesner - 1994 |

45 | Learning and Soft Computing - Kecman - 2001 |

44 | Large data sets lead to overly complex models: an explanation and a solution
- Oates, Jensen
- 1998
(Show Context)
Citation Context ...gers. When the number of parameters is of the order of the number of data vectors, predictive models may easily overfit the data. In some cases, even an abundance of data will not prevent overfitting =-=[3]-=-. Many irrelevant attributes may contribute to the final solution. Combining predictive models with a priori knowledge about the problem is usually difficult. Therefore, the use of the black-box model... |

43 |
Multi-Interval Discretization of Continuous-Valued Attributes for Classification Learning
- Fayyad, Irani
- 1993
Citation Context ...ed on: 1) searching intervals that contain vectors from a single class [32]; 2) entropy or information theory to determine intervals that have low entropy or high mutual information with class labels [33]; 3) univariate decision trees to find the best splits for a single feature [34]; 4) latent variable models [35]; or 5) Chi-square statistics to merge intervals [36]. A discretization algorithm based ... |
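The entropy-based discretization named as method 2 in this list can be sketched by scanning candidate thresholds for the one with the lowest weighted class entropy. A minimal illustration (Python; names are my own, and this is a single-cut sketch, not the full multi-interval algorithm of [33]):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a class-label list, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_entropy_cut(values, labels):
    """Midpoint threshold between adjacent values minimizing weighted entropy."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best_t, best_e = None, float("inf")
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no valid cut between equal feature values
        t = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        e = len(left) / n * entropy(left) + len(right) / n * entropy(right)
        if e < best_e:
            best_t, best_e = t, e
    return best_t, best_e
```

Applied recursively with a stopping criterion, such cuts yield the low-entropy intervals used as linguistic variables.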