## Using mutual information for selecting features in supervised neural net learning (1994)

Venue: IEEE Transactions on Neural Networks

Citations: 226 (1 self)

### BibTeX

```bibtex
@ARTICLE{Battiti94usingmutual,
  author  = {Roberto Battiti},
  title   = {Using mutual information for selecting features in supervised neural net learning},
  journal = {IEEE Transactions on Neural Networks},
  year    = {1994},
  volume  = {5},
  pages   = {537--550}
}
```

### Abstract

This paper investigates the application of the mutual information criterion to evaluate a set of candidate features and to select an informative subset to be used as input data for a neural network classifier. Because the mutual information measures arbitrary dependencies between random variables, it is suitable for assessing the “information content” of features in complex classification tasks, where methods based on linear relations (like the correlation) are prone to mistakes. The fact that the mutual information is independent of the coordinates chosen permits a robust estimation. Nonetheless, the use of the mutual information for tasks characterized by high input dimensionality requires suitable approximations because of the prohibitive demands on computation and samples. An algorithm is proposed that is based on a “greedy” selection of the features and that takes both the mutual information with respect to the output class and with respect to the already-selected features into account. Finally, the results of a series of experiments are discussed.

Index Terms: Feature extraction, neural network pruning, dimensionality reduction, mutual information, supervised learning.
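The greedy scheme described in the abstract (later widely known as MIFS) can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes discrete-valued features, uses a simple plug-in (count-based) estimator of mutual information, and the redundancy weight `beta` is a hypothetical parameter choice.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Plug-in estimate of I(X; Y) in bits for two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

def mifs(features, labels, k, beta=0.5):
    """Greedy selection: repeatedly add the feature maximizing
    I(f; C) - beta * sum over already-selected s of I(f; s)."""
    candidates = set(features)
    selected = []
    while candidates and len(selected) < k:
        def score(name):
            relevance = mutual_information(features[name], labels)
            redundancy = sum(mutual_information(features[name], features[s])
                             for s in selected)
            return relevance - beta * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

# Hypothetical toy data: the class is a 4-way label built from two latent bits.
f1 = [0, 0, 0, 0, 1, 1, 1, 1]      # clean copy of the first bit
f2 = [1, 0, 0, 0, 1, 1, 1, 1]      # noisy duplicate of f1 (mostly redundant)
f3 = [1, 0, 1, 1, 0, 0, 1, 1]      # noisy copy of the second bit (complementary)
labels = [0, 0, 1, 1, 2, 2, 3, 3]

print(mifs({'f1': f1, 'f2': f2, 'f3': f3}, labels, k=2))  # -> ['f1', 'f3']
```

On this toy data the near-duplicate `f2` is passed over in favour of the complementary `f3`, which is exactly the behaviour the redundancy term is designed to produce.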

### Citations

7283 | A mathematical theory of communications - Shannon - 1948 |

Citation Context: ...be detected as soon as possible in the development process because in this case the only remedy is that of adding more features or considering more informative ones. Shannon’s information theory (see [20]) provides a suitable formalism for quantifying the above concepts. If the probabilities for the different classes are P(c), c = 1, ..., Nc, the initial uncertainty in the output class is measured by... |
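The quantity this excerpt is about to define is the entropy of the class prior, H(C) = -Σ_c P(c) log P(c). A quick numeric illustration (in bits, with made-up class probabilities):

```python
import math

def class_entropy(priors):
    """Initial uncertainty H(C) = -sum over c of P(c) * log2 P(c), in bits."""
    return -sum(p * math.log2(p) for p in priors if p > 0)

# Four equally likely classes: maximal uncertainty, log2(4) bits.
print(class_entropy([0.25, 0.25, 0.25, 0.25]))  # -> 2.0
# A skewed prior leaves less to learn (about 1.36 bits).
print(class_entropy([0.7, 0.1, 0.1, 0.1]))
```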

3794 | The Self-organizing Map - Kohonen - 1990 |

395 | On the approximate realization of continuous mappings by neural networks - Funahashi - 1989 |

263 | Independent coordinates for strange attractors from mutual information, Phys - Fraser, Swinney - 1986 |

Citation Context: ...[9] considers the implications for statistical decision-making, a field closely related to pattern recognition and classification; [6] uses the mutual information to find the optimal time delay to construct a multidimensional phase portrait of a dynamical system, with implicati... For example, in one dimension, one has: H(F) = -∫ P(f) log P(f) df (3). Note that the entropies of continuous systems depend on... |

206 | Boolean feature discovery in empirical learning - Pagallo, Haussler - 1990 |

Citation Context: ...networks. In the machine learning literature the entropy and mutual information concepts are used, for example, in [17] and [15] to introduce relevant features for learning Boolean formulas with a tree representation... |

191 | Analysis of hidden units in a layered network trained to classify sonar targets - Gorman, Sejnowski - 1988 |

Citation Context: ...between sonar returns bounced off a metal cylinder and those bounced off a roughly cylindrical rock. The data set has been used in [8], where a multilayer neural network is trained for the classification. To simulate a real classification task, we extracted 1000 patterns with equal proba... |

146 | Generalization by weight elimination with application to forecasting - Weigend, Rumelhart, et al. - 1991 |

Citation Context: ...functionality of the classifier (some examples from the literature have been cited in Section II). The present approach is different from pruning methods acting during the learning phase (e.g., [16], [21]) because the dimensionality reduction is executed before learning starts. The main advantage is that irrelevant features are eliminated from the beginning and that a fast informative feedback about t... |

71 | Mutual information functions versus correlation functions - Li - 1990 |

Citation Context: ...ndent, for complex probability densities the concept of linear dependence is not a very useful one. A detailed investigation of the advantages of the MI versus the correlation is contained in [5] and [12]. III. SELECTING FEATURES WITH THE MUTUAL INFORMATION. In the development of a classifier one often is confronted with practical constraints on the hardware and on the time that is allotted to the task... |
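The contrast drawn in this excerpt can be made concrete: for y = x² with x symmetric about zero, the linear correlation vanishes even though y is a deterministic function of x, while the mutual information does not. A minimal check (plug-in MI estimator over discrete values, for illustration only):

```python
import math
from collections import Counter

def pearson_corr(xs, ys):
    """Sample Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def mutual_information(xs, ys):
    """Plug-in estimate of I(X; Y) in bits for two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

xs = [-2, -1, 0, 1, 2]
ys = [x * x for x in xs]   # fully determined by x, but not linearly

print(pearson_corr(xs, ys))        # -> 0.0: correlation sees no linear relation
print(mutual_information(xs, ys))  # > 0 (about 1.52 bits): MI detects the dependence
```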

47 | How to generate ordered maps by maximizing the mutual information between input and output signals - Linsker - 1989 |

19 | Minimum Class Entropy: A Maximum Information Approach To Layered Networks - Bichsel, Seitz - 1989 |

13 | Discovering viewpoint-invariant relationships that characterize objects - Zemel, Hinton - 1992 |

Citation Context: ...e minimization of the conditional class entropy is the basis of a learning algorithm that builds a multilayer network, and in [23] for the case of unsupervised learning. (Footnote: About the notation: for simplicity we indicate the different probability densities with the same P() function. Its meaning is easily derived from the variable contained. For example, P(c) is the value of the density function for the...) |
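The conditional class entropy mentioned here, H(C|F) = -Σ P(f, c) log P(c|f), measures the uncertainty about the class that remains once a feature is known. A minimal plug-in estimate from discrete samples (an illustrative sketch, not code from any of the cited papers):

```python
import math
from collections import Counter

def conditional_class_entropy(fs, cs):
    """H(C|F) in bits from paired samples, using P(c|f) = count(f,c)/count(f)."""
    n = len(fs)
    pf, pfc = Counter(fs), Counter(zip(fs, cs))
    # sum over (f, c) of P(f, c) * log2(1 / P(c|f))
    return sum((c / n) * math.log2(pf[f] / c) for (f, _cl), c in pfc.items())

print(conditional_class_entropy([0, 0, 1, 1], [0, 0, 1, 1]))  # -> 0.0 (feature determines class)
print(conditional_class_entropy([0, 1, 0, 1], [0, 0, 1, 1]))  # -> 1.0 (feature tells nothing)
```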

8 | Using the Karhunen–Loève transformation in the back-propagation training algorithm - Malki, Moghaddamjoo - 1991 |

Citation Context: ...ng similar results in test cases, is applied before learning starts and therefore does not depend on the learning process. Other techniques are based on linear transformations of the input vector. In [14] the Karhunen–Loève transformation is applied so that the transformed coordinates can be arranged in order of their “significance,” considering first the components corresponding to the major eigenve... |

8 | Learning internal representations by error propagation - Rumelhart, Hinton, et al. - 1987 |

Citation Context: ...he Mutual Information. An operating classifier (consider for example a multilayer perceptron trained to classify patterns from a set of different classes with the backpropagation algorithm described in [19]) can be considered as a system that reduces the initial uncertainty, to be defined precisely later, by “consuming” the information contained in the input vector. In the ideal case the final uncertain... |

4 | Skeletonization: A Technique for Trimming the Fat from a Neural Network via Relevance Assessment - Mozer, Smolensky - 1988 |

Citation Context: ...e Principal Component Analysis (whose result does not depend on a) does not help in choosing the most appropriate feature. Example 2: This example (the “rule-plus-exception” problem) is derived from [16] and used in [10]. The classification problem on an input space with four binary variables is defined... |

1 | A simple procedure for pruning back-propagation trained neural networks - Kamin - 1990 |

1 | Learning Boolean formulae using decision trees - etti-Spaccamela, Protasi - 1990 |

Citation Context: ...rature the entropy and mutual information concepts are used, for example, in [17] and [15] to introduce relevant features for learning Boolean formulas with a tree representation. In this case and, in general,... |

1 | Determining the relevant parameters for the classification on a multi-layer perceptron: Application to radar data - Pernot, Vallet - 1991 |

Citation Context: ...is applied so that the transformed coordinates can be arranged in order of their “significance,” considering first the components corresponding to the major eigenvectors of the correlation matrix. In [18] different feature evaluation methods are compared. In particular the method based on principal component analysis (PCA) evaluates the features according to the projection of the largest eigenvector o... |