## Network Information Criterion - Determining the Number of Hidden Units for an Artificial Neural Network Model (1994)

Venue: IEEE Transactions on Neural Networks

Citations: 148 (8 self)

### BibTeX

```bibtex
@ARTICLE{Murata94networkinformation,
  author  = {Noboru Murata and Shuji Yoshizawa and Shun-ichi Amari},
  title   = {Network Information Criterion - Determining the Number of Hidden Units for an Artificial Neural Network Model},
  journal = {IEEE Transactions on Neural Networks},
  year    = {1994},
  volume  = {5},
  number  = {6},
  pages   = {865--872}
}
```

### Abstract

The problem of model selection, or determination of the number of hidden units, can be approached statistically by generalizing Akaike's information criterion (AIC) to be applicable to unfaithful (i.e., unrealizable) models with general loss criteria, including regularization terms. The relation between the training error and the generalization error is studied in terms of the number of training examples and the complexity of a network, which reduces to the number of parameters in the ordinary statistical theory of the AIC. This relation leads to a new Network Information Criterion (NIC), which is useful for selecting the optimal network model based on a given training set.
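The NIC described in the abstract has the shape of a generalized AIC: the mean training loss plus a complexity penalty tr(Q⁻¹G)/n, where Q is the Hessian of the expected loss, G is the covariance of the per-example loss gradients, and n is the number of training examples. When the model is faithful, Q = G and the penalty reduces to the AIC's m/n, with m the number of parameters. A minimal NumPy sketch for a linear-Gaussian model (the function name `nic` and the data-generating setup are illustrative, not from the paper):

```python
import numpy as np

def nic(X, y, theta):
    """NIC for the squared loss l = 0.5 * (y - X @ theta)^2:
    mean training loss + tr(Q^{-1} G) / n, where Q is the empirical Hessian
    of the loss and G the empirical covariance of per-example gradients."""
    n = len(y)
    r = y - X @ theta                 # residuals at the fitted parameters
    loss = 0.5 * np.mean(r ** 2)
    grads = -r[:, None] * X           # per-example loss gradients, shape (n, m)
    G = grads.T @ grads / n           # gradient covariance (theta is the optimum)
    Q = X.T @ X / n                   # Hessian of the squared loss
    penalty = np.trace(np.linalg.solve(Q, G)) / n
    return loss + penalty

rng = np.random.default_rng(0)
n, m = 2000, 3
X = rng.normal(size=(n, m))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=n)  # unit noise variance
theta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
print(nic(X, y, theta_hat))
```

For this faithful model with unit noise variance, tr(Q⁻¹G) ≈ m, so the penalty term is close to the AIC's m/n; model selection proceeds by computing the NIC for each candidate network and keeping the minimizer.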

### Citations

2723 |
Learning internal representations by error propagation
- Rumelhart, Hinton, et al.
- 1986
Citation Context: ...examples observed from the actual system (Widrow [1], Amari [2], White [3], etc.). For instance, the back-propagation method is used for learning of multi-layered perceptrons with sigmoidal functions [4]. An important but difficult problem is to determine the optimal number of parameters. In other words, we wish to determine the number of hidden units needed to mimic the system by using only input-ou...

1840 |
A new look at the statistical model identification
- Akaike
- 1974
Citation Context: ...his problem, we need to consider the relation among the complexity of a model, the performance for the training data and the number of examples, for example using Akaike's Information Criterion (AIC) [5] and the Minimum Description Length (MDL) [6]. There has been some research intending to apply these principles (e.g. Fogel [7], Moody [8], Wahba [9], Wada and Kawato [10]). The present paper is a de...

249 |
Stochastic complexity and modeling
- Rissanen
- 1986
Citation Context: ...among the complexity of a model, the performance for the training data and the number of examples, for example using Akaike's Information Criterion (AIC) [5] and the Minimum Description Length (MDL) [6]. There has been some research intending to apply these principles (e.g. Fogel [7], Moody [8], Wahba [9], Wada and Kawato [10]). The present paper is a detailed version of a short note by Murata, Yos...

169 |
The Effective Number of Parameters: An Analysis of generalization and regularization in nonlinear learning systems
- Moody
- 1992
Citation Context: ...ples, for example using Akaike's Information Criterion (AIC) [5] and the Minimum Description Length (MDL) [6]. There has been some research intending to apply these principles (e.g. Fogel [7], Moody [8], Wahba [9], Wada and Kawato [10]). The present paper is a detailed version of a short note by Murata, Yoshizawa and Amari [11], giving a most general solution to this problem. The present paper treat...

144 |
Learning in artificial neural networks: A statistical perspective, Neural Computation
- White
- 1989

105 |
A theory of adaptive pattern classifiers
- Amari
- 1967
Citation Context: ...nt method which eventually minimizes a certain loss function. Learning is carried out based on a training set which consists of a number of examples observed from the actual system (Widrow [1], Amari [2], White [3], etc.). For instance, the back-propagation method is used for learning of multi-layered perceptrons with sigmoidal functions [4]. An important but difficult problem is to determine the opt...

43 |
Statistical theory of learning curves under entropic loss criterion
- Amari, Murata
- 1993
Citation Context: ...These evaluations elucidate the relation between the training error and the generalization error in terms of the complexity of a network and the number of training examples (See also Amari and Murata [12] and Murata et al. [13]). In section 4, based on this relation, we propose the Network Information Criterion (NIC), which reduces to the AIC in an ordinary statistical setting. The criterion leads to ...

14 |
Distribution of information statistics and validity criteria of models
- Takeuchi
- 1976
Citation Context: ...e U is common to all the models within a hierarchical structure. Therefore, it is not effective to apply this type of criterion to non-hierarchical models. This fact was pointed out by Takeuchi (1976) [15] and is known to specialists of the AIC but is still not well known by those who apply the AIC. Recently, Hagiwara et al. [16] have cast doubt on the validity of applying the AIC method to multi-layered p...

12 |
Learning curves, model selection, and complexity of neural networks
- Murata, Yoshizawa, et al.
- 1993
Citation Context: ...date the relation between the training error and the generalization error in terms of the complexity of a network and the number of training examples (See also Amari and Murata [12] and Murata et al. [13]). In section 4, based on this relation, we propose the Network Information Criterion (NIC), which reduces to the AIC in an ordinary statistical setting. The criterion leads to the effective number m*...

9 |
Three topics in ill-posed problems
- Wahba
- 1987

7 |
A criterion for determining the number of parameters in an artificial neural network model
- Murata, Yoshizawa, et al.
- 1991
Citation Context: ...some research intending to apply these principles (e.g. Fogel [7], Moody [8], Wahba [9], Wada and Kawato [10]). The present paper is a detailed version of a short note by Murata, Yoshizawa and Amari [11], giving a most general solution to this problem. The present paper treats a family of stochastic neural networks of the feed-forward type, which means that a network does not have any recurrent conne...

3 |
A Statistical Theory of Adaptation
- Widrow
- 1963
Citation Context: ...dient descent method which eventually minimizes a certain loss function. Learning is carried out based on a training set which consists of a number of examples observed from the actual system (Widrow [1], Amari [2], White [3], etc.). For instance, the back-propagation method is used for learning of multi-layered perceptrons with sigmoidal functions [4]. An important but difficult problem is to determ...

3 |
Nonuniqueness of connecting weights and AIC in multi-layered neural networks
- Hagiwara, Toda
- 1993
Citation Context: ... to non-hierarchical models. This fact was pointed out by Takeuchi (1976) [15] and is known to specialists of the AIC but is still not well known by those who apply the AIC. Recently, Hagiwara et al. [16] have cast doubt on the validity of applying the AIC method to multi-layered perceptrons. They pointed out that there are some critical values of parameters where multi-layered perceptrons are reduced to ...

2 |
An information criterion for optimal neural network selection
- Fogel
- 1991
Citation Context: ...ber of examples, for example using Akaike's Information Criterion (AIC) [5] and the Minimum Description Length (MDL) [6]. There has been some research intending to apply these principles (e.g. Fogel [7], Moody [8], Wahba [9], Wada and Kawato [10]). The present paper is a detailed version of a short note by Murata, Yoshizawa and Amari [11], giving a most general solution to this problem. The present ...

2 |
Estimation of generalization capability by combination of new information criterion and cross validation
- Wada, Kawato
- 1991
Citation Context: ...e training data, and the number of examples, such as the AIC [5] and the MDL [6]. There has been some research intending to apply these principles (e.g. Fogel [7], Moody [8], Wahba [9], Wada and Kawato [10]). The present paper is a detailed version of a short note by Murata, Yoshizawa and Amari [11], giving a most general solution to this problem. The present paper treats a hierarchy of stochastic neura...


1 |
A Statistical Asymptotic Study on Learning
- Murata
- 1992
Citation Context: ...he proof is given by Amari [2]. It should be noted that Q* can be written as Q* = ∇∇D(q*; p(θ*)). Moreover it can be shown that the distribution of θ̃ approaches a normal distribution as ε → 0 [14]. Lemma 2: The distribution p̃(θ̃) approaches the normal distribution N(θ*, (ε/2) Q*⁻¹ G*) (6) as t → ∞ and ε → 0. A brief proof of this lemma is given in appendix A. Roughly speaking, this lemm...
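The asymptotic covariance (ε/2)Q⁻¹G quoted in this context can be checked on a one-dimensional quadratic loss, where both the exact stationary variance of the stochastic-gradient iterates and the lemma's small-ε approximation have closed forms. A sketch under those assumptions (the function names are illustrative):

```python
# Check of the stationary variance of stochastic gradient descent (SGD)
# against the lemma's approximation Var(theta) ~ (eps/2) * Q^{-1} * G.
# Setup: expected-loss Hessian Q = q, gradient-noise variance G = g at the
# optimum, update theta <- theta - eps * (q * theta + xi) with Var(xi) = g.

def stationary_variance_exact(eps, q, g):
    # Stationary variance v solves v = (1 - eps*q)^2 * v + eps^2 * g.
    return eps ** 2 * g / (1.0 - (1.0 - eps * q) ** 2)

def stationary_variance_lemma(eps, q, g):
    # Lemma's small-eps approximation: (eps/2) * Q^{-1} * G.
    return 0.5 * eps * g / q

q, g = 2.0, 3.0
for eps in (0.1, 0.01, 0.001):
    v_exact = stationary_variance_exact(eps, q, g)
    v_lemma = stationary_variance_lemma(eps, q, g)
    print(eps, v_exact, v_lemma)  # the two converge as eps -> 0
```

The exact variance is εg / (q(2 − εq)), so the ratio to the lemma's value is 2/(2 − εq), which tends to 1 as ε → 0, matching the lemma's limit t → ∞, ε → 0.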
