## Subspace information criterion for model selection (2001)

### Download Links

- [sugiyama-www.cs.titech.ac.jp]
- [ogawa-www.cs.titech.ac.jp]
- [ftp.cs.titech.ac.jp]
- DBLP

### Other Repositories/Bibliography

Venue: Neural Computation

Citations: 47 (30 self)

### BibTeX

```bibtex
@ARTICLE{Sugiyama01subspaceinformation,
  author  = {Masashi Sugiyama and Hidemitsu Ogawa},
  title   = {Subspace information criterion for model selection},
  journal = {Neural Computation},
  year    = {2001},
  volume  = {13},
  pages   = {2001}
}
```

### Abstract

The problem of model selection is of considerable importance for acquiring higher levels of generalization capability in supervised learning. In this paper, we propose a new criterion for model selection called the subspace information criterion (SIC), which is a generalization of Mallows' C_L. It is assumed that the learning target function belongs to a specified functional Hilbert space and that the generalization error is defined as the squared Hilbert-space norm of the difference between the learning result function and the target function. SIC gives an unbiased estimate of the generalization error so defined. SIC assumes the availability of an unbiased estimate of the target function and of the noise covariance matrix, which are generally unknown. A practical method of calculating SIC for least-mean-squares learning is provided under the assumption that the dimension of the Hilbert space is less than the number of training examples. Finally, computer simulations on two examples show that SIC works well even when the number of training examples is small.
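The abstract's link to Mallows' C_L can be made concrete for linear (least-mean-squares) learning: for a linear smoother ŷ = Hy with known noise variance σ², the score C_L = ‖y − Hy‖² − nσ² + 2σ² tr(H) is an unbiased estimate of the predictive training error E‖Hy − z‖². The sketch below is illustrative only (polynomial candidate models, Gaussian noise, all names invented here); it demonstrates C_L, the criterion that SIC generalizes, not SIC itself, which instead estimates a Hilbert-space norm.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression problem: noisy samples of a smooth target on [-1, 1].
n = 50
sigma = 0.1                        # noise standard deviation (assumed known)
x = np.linspace(-1.0, 1.0, n)
z = np.sin(np.pi * x)              # noise-free target values at the sample points
y = z + sigma * rng.normal(size=n)

def cl_score(y, H, sigma):
    """Mallows' C_L: an unbiased estimate of the predictive training error
    E ||H y - z||^2 for a linear learning method y_hat = H y."""
    resid = y - H @ y
    return resid @ resid - len(y) * sigma**2 + 2 * sigma**2 * np.trace(H)

# Candidate models: polynomial subspaces of increasing degree,
# fitted by least mean squares (hat matrix H = Phi Phi^+).
scores = {}
for degree in range(1, 10):
    Phi = np.vander(x, degree + 1)
    H = Phi @ np.linalg.pinv(Phi)
    scores[degree] = cl_score(y, H, sigma)

best_degree = min(scores, key=scores.get)
```

Model selection then amounts to picking the candidate with the smallest score, exactly the role SIC plays for its Hilbert-space generalization error.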

### Citations

9810 | The Nature of Statistical Learning Theory - Vapnik - 1995 |

5290 | Neural networks for pattern recognition - Bishop - 1995 |
Citation context: ...subspace S required for obtaining an approximation at a certain level of precision grows exponentially with the dimension L of the input space D, a concept referred to as the curse of dimensionality (Bishop, 1995). This phenomenon generally results in large computational complexity, so that learning procedures are infeasible to compute in real time. However, thanks to good properties of the reproducing kernel...

3228 | An introduction to the bootstrap - Efron, Tibshirani - 1993 |
Citation context: ...1, 1993; Noda et al., 1996; Fujikoshi & Satoh, 1997; Satoh et al., 1997; Hurvich et al., 1998; Simonoff, 1998; McQuarrie & Tsai, 1998). The other approach is to use the bootstrap method (Efron, 1979; Efron & Tibshirani, 1993) for numerically evaluating the bias when the expected log-likelihood is estimated by the log-likelihood. The idea of the bootstrap bias correction is first introduced by Wong (1983) and Efron (1986)...

2689 | Estimating the dimension of a model - Schwarz - 1978 |

2344 | A new look at the statistical model identification - Akaike - 1974 |
Citation context: ...ly evaluate the error for unknown input points. In contrast, model selection methods explicitly evaluating the generalization error have been studied from various standpoints: information statistics (Akaike, 1974; Takeuchi, 1976; Konishi & Kitagawa, 1996), Bayesian statistics (Schwarz, 1978; Akaike, 1980; MacKay, 1992), stochastic complexity (Rissanen, 1978, 1987, 1996; Yamanishi, 1998), and structural risk m...

1239 | Modeling by Shortest Data Description - Rissanen - 1978 |

1041 | Bootstrap methods: Another look at the jackknife - Efron - 1979 |
Citation context: ...ai, 1989, 1991, 1993; Noda et al., 1996; Fujikoshi & Satoh, 1997; Satoh et al., 1997; Hurvich et al., 1998; Simonoff, 1998; McQuarrie & Tsai, 1998). The other approach is to use the bootstrap method (Efron, 1979; Efron & Tibshirani, 1993) for numerically evaluating the bias when the expected log-likelihood is estimated by the log-likelihood. The idea of the bootstrap bias correction is first introduced by Wo...

832 | Theory of reproducing kernels - Aronszajn - 1950 |

829 | Learning Representations by Back-Propagating Error - Rumelhart, Hinton, et al. - 1986 |

695 | Networks for Approximation and Learning - Poggio, Girosi - 1990 |

689 | Handbook of Mathematical Functions with Formulas, Graphs and Mathematical Tables - Abramowitz, Stegun - 1964 |
Citation context: ...{P_p}_{p=0}^{n} (Eq. 60), with the inner product ⟨f, g⟩ = ∫_{−1}^{1} f(x) g(x) dx (Eq. 61). The reproducing kernel of S_n is expressed by using the Christoffel-Darboux formula (see e.g. Szegö, 1939; Abramowitz & Stegun, 1964; Freud, 1966) as

K_{S_n}(x, x') = (n+1) / (2(x − x')) · [P_{n+1}(x) P_n(x') − P_n(x) P_{n+1}(x')]   if x ≠ x',
K_{S_n}(x, x)  = (n+1)² / (2(1 − x²)) · [P_n(x)² − 2x P_n(x) P_{n+1}(x) + P_{n+1}(x)²],   (62)

where P_n(x) is the Legendre polynomial...
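The quoted Christoffel-Darboux kernel can be checked numerically. The sketch below (illustrative, not from the paper; all function names are invented here) compares the closed-form expression for the x ≠ x' branch against the equivalent sum over the orthonormal Legendre basis, which is the defining expansion of the reproducing kernel of the degree-n polynomial subspace of L2[−1, 1].

```python
import numpy as np
from numpy.polynomial import legendre

def P(k, x):
    """Legendre polynomial P_k evaluated at x (scalar or array)."""
    c = np.zeros(k + 1)
    c[k] = 1.0
    return legendre.legval(x, c)

def kernel_cd(n, x, xp):
    """Reproducing kernel of span{P_0, ..., P_n} in L2[-1, 1],
    via the Christoffel-Darboux closed form (x != xp branch)."""
    return (n + 1) / (2.0 * (x - xp)) * (
        P(n + 1, x) * P(n, xp) - P(n, x) * P(n + 1, xp))

def kernel_sum(n, x, xp):
    """The same kernel as a sum over the orthonormal Legendre basis:
    sum_k (2k+1)/2 * P_k(x) * P_k(xp)."""
    return sum((2 * k + 1) / 2.0 * P(k, x) * P(k, xp) for k in range(n + 1))
```

The reproducing property ⟨f, K(·, x')⟩ = f(x') for every polynomial f of degree at most n can then be verified with Gauss-Legendre quadrature, which integrates the product exactly.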

643 | Neural networks and the bias/variance dilemma - Geman, Bienenstock, et al. - 1992 |
Citation context: ...of f̂_θ. The unbiased learning result f̂_u and the learning operator X_u are used for this purpose. The generalization error of f̂_θ is decomposed into the bias and variance (see e.g. Takemura, 1991; Geman et al., 1992; Efron & Tibshirani, 1993):

E_ε ‖f̂_θ − f‖² = ‖E_ε f̂_θ − f‖² + E_ε ‖f̂_θ − E_ε f̂_θ‖².   (10)

It follows from Eqs. (4) and (3) that Eq. (10) yields E_ε ‖f̂_θ − f‖² = ‖X_θ z − f‖² + E_ε ‖X_θ ε‖² = ‖X_θ z − f‖²...
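The bias/variance decomposition in Eq. (10) can be illustrated with a small Monte Carlo experiment. The sketch below is a finite-dimensional stand-in for the Hilbert-space setting (a matrix A plays the sampling operator, its pseudoinverse the learning operator X; all names are invented here).

```python
import numpy as np

rng = np.random.default_rng(1)

# Finite-dimensional illustration of the decomposition
#   E ||f_hat - f||^2 = ||E f_hat - f||^2 + E ||f_hat - E f_hat||^2
# for a linear learning result f_hat = X (z + eps).
n, p = 30, 3
A = rng.normal(size=(n, p))        # toy sampling operator
X = np.linalg.pinv(A)              # LMS-style learning operator
f = rng.normal(size=p)             # target
z = A @ f                          # noise-free sample values
sigma = 0.2                        # noise standard deviation

draws = np.array([X @ (z + sigma * rng.normal(size=n)) for _ in range(20000)])
total_err = np.mean(np.sum((draws - f) ** 2, axis=1))   # E ||f_hat - f||^2
bias2 = np.sum((X @ z - f) ** 2)                        # ||E f_hat - f||^2
variance = np.mean(np.sum((draws - draws.mean(axis=0)) ** 2, axis=1))
```

Here the bias term uses E_ε f̂_θ = X z, exactly as in the quoted passage, and the two pieces sum to the total error up to Monte Carlo noise.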

574 | Bayesian interpolation - MacKay - 1992 |

508 | Bootstrap methods and their application - Davison, Hinkley - 1997 |

484 | Smoothing noisy data with spline functions - Craven, Wahba - 1979 |
Citation context: ...of arbitrary linear regression models. It is called C_L or the unbiased risk estimate (Wahba, 1990). C_L may require a good estimate of the noise variance. In contrast, the generalized cross-validation (Craven & Wahba, 1979; Wahba, 1990), which is an extension of the traditional cross-validation (Mosteller & Wallace, 1963; Allen, 1974; Stone, 1974; Wahba, 1990), is the criterion for finding the model minimizing the pred...

401 | Theory of Optimal Experiments - Fedorov - 1972 |

263 | Regression and time series model selection in small samples - Hurvich, Tsai - 1989 |

248 | Generalized Inverses: Theory and Applications - Ben-Israel, Greville - 1974 |
Citation context: ...ing any h ∈ H_1 as (Schatten, 1970) (f ⊗ g) h = ⟨h, g⟩ f. An operator X is called the Moore-Penrose generalized inverse of an operator A if X satisfies the following four conditions (see Albert, 1972; Ben-Israel & Greville, 1974): AXA = A, XAX = X, (AX)* = AX, and (XA)* = XA. The Moore-Penrose generalized inverse is unique and denoted A†. Proposition 1 (Ogawa, 1992...

213 | An equivalence between sparse approximation and support vector machines - Girosi - 1998 |

178 | A Resource-Allocating Network for Function Interpolation - Platt - 1991 |

160 | Network information criterion – determining the number of hidden units for an artificial neural network - Murata, Yoshizawa, et al. - 1994 |

154 | The relationship between variable selection and data augmentation and a method for prediction - Allen - 1974 |
Citation context: ...stimate of the noise variance. In contrast, the generalized cross-validation (Craven & Wahba, 1979; Wahba, 1990), which is an extension of the traditional cross-validation (Mosteller & Wallace, 1963; Allen, 1974; Stone, 1974; Wahba, 1990), is the criterion for finding the model minimizing the predictive training error without knowledge of the noise variance. Li (1986) showed the asymptotic optimality of C_L a...

131 | Regression and the Moore-Penrose Pseudoinverse - Albert - 1972 |

122 | Smoothing parameter selection in nonparametric regression using an improved Akaike information criterion - Hurvich, Simonoff - 1998 |
Citation context: ...kelihood for each model. This type of modification can be found in many articles (e.g. Sugiura, 1978; Hurvich & Tsai, 1989, 1991, 1993; Noda et al., 1996; Fujikoshi & Satoh, 1997; Satoh et al., 1997; Hurvich et al., 1998; Simonoff, 1998; McQuarrie & Tsai, 1998). The other approach is to use the bootstrap method (Efron, 1979; Efron & Tibshirani, 1993) for numerically evaluating the bias when the expected log-likelihoo...

111 | A theory of adaptive pattern classifiers - Amari - 1967 |
Citation context: ...own input points can be estimated. This ability is called the generalization capability. So far, many supervised learning methods have been developed, including the stochastic gradient descent method (Amari, 1967), the back-propagation algorithm (Rumelhart et al., 1986a, 1986b), regularization learning (Tikhonov & Arsenin, 1977; Poggio & Girosi, 1990), Bayesian inference (Savage, 1954; MacKay, 1992), projecti...

94 | How biased is the apparent error rate of a prediction rule - Efron - 1986 |

78 | Norm Ideals of Completely Continuous Operators - Schatten - 1960 |

54 | Orthogonal polynomials - Freud - 1971 |
Citation context: ...The reproducing kernel of S_n is expressed by using the Christoffel-Darboux formula (see e.g. Szegö, 1939; Abramowitz & Stegun, 1964; Freud, 1966) as

K_{S_n}(x, x') = (n+1) / (2(x − x')) · [P_{n+1}(x) P_n(x') − P_n(x) P_{n+1}(x')]   if x ≠ x',
K_{S_n}(x, x)  = (n+1)² / (2(1 − x²)) · [P_n(x)² − 2x P_n(x) P_{n+1}(x) + P_{n+1}(x)²],   (62)

where P_n(x) is the Legendre polynomial...

51 | The kernel function and conformal mapping - Bergman - 1970 |
Citation context: ..._{m=1}^{M}. In the LMS learning case, a model θ refers to a subspace S of H. Here, we assume that S has the reproducing kernel (see Aronszajn, 1950; Bergman, 1970; Wahba, 1990; Saitoh, 1988, 1997). Let K_S(x, x') be the reproducing kernel of S, and D be the domain of functions in S. Then K_S(x, x') satisfies the following conditions: for any fixed x' in D...

41 | Likelihood and Bayes procedure - Akaike - 1980 |
Citation context: ...ly evaluating the generalization error have been studied from various standpoints: information statistics (Akaike, 1974; Takeuchi, 1976; Konishi & Kitagawa, 1996), Bayesian statistics (Schwarz, 1978; Akaike, 1980; MacKay, 1992), stochastic complexity (Rissanen, 1978, 1987, 1996; Yamanishi, 1998), and structural risk minimization (Vapnik, 1995; Cherkassky et al., 1999). Particularly, information-statistics-bas...

39 | Model complexity control for regression using VC generalization bounds - Cherkassky - 1999 |
Citation context: ...Kitagawa, 1996), Bayesian statistics (Schwarz, 1978; Akaike, 1980; MacKay, 1992), stochastic complexity (Rissanen, 1978, 1987, 1996; Yamanishi, 1998), and structural risk minimization (Vapnik, 1995; Cherkassky et al., 1999). Particularly, information-statistics-based methods have been extensively studied. Akaike's information criterion (AIC) (Akaike, 1974) is one of ...

23 | Neural Networks Learning, Generalization and Over Learning - Ogawa - 1992 |

20 | A corrected Akaike information criterion for vector autoregressive model selection - Hurvich, Tsai - 1993 |

15 | Selection of variables for fitting equations to data - Gorman, Toman - 1966 |
Citation context: ...els of generalization capability. The problem of model selection has been studied mainly in the field of statistics. Mallows (1964) proposed C_P for the selection of subset-regression models (see also Gorman & Toman, 1966; Mallows, 1973). C_P gives an unbiased estimate of the predictive training error, i.e., the error between estimated and true values at sample points contained in the training set. Mallows (1973) exten...

15 | Projection filter regularization of ill-conditioned problem - Ogawa - 1987 |

15 | Distribution of information statistics and validity criteria of models - Takeuchi - 1978 |

14 | Bootstrapping log likelihood and EIC, an extension of AIC - Ishiguro, Sakamoto, et al. - 1997 |

13 | Parametric Projection Filter for Image and Signal Restoration - Oja, Ogawa - 1986 |

11 | Modified AIC and C_p in Multivariate Linear Regression - Fujikoshi, Satoh - 1997 |
Citation context: ...xact unbiased estimate of the expected log-likelihood for each model. This type of modification can be found in many articles (e.g. Sugiura, 1978; Hurvich & Tsai, 1989, 1991, 1993; Noda et al., 1996; Fujikoshi & Satoh, 1997; Satoh et al., 1997; Hurvich et al., 1998; Simonoff, 1998; McQuarrie & Tsai, 1998). The other approach is to use the bootstrap method (Efron, 1979; Efron & Tibshirani, 1993) for numerically evaluatin...

10 | Bias of the corrected AIC criterion for underfitted regression and time series models - Hurvich, Tsai - 1991 |

7 | On the selection of statistical models by AIC - Takeuchi - 1983 |

5 | Modern mathematical statistics - Takemura - 1991 |

4 | Proceedings of the first US/Japan conference on the frontiers of statistical modeling: An informational approach - Bozdogan - 1994 |
Citation context: ...criterion (AIC) (Akaike, 1974) is one of the most eminent methods of this type. Many successful applications of AIC to real-world problems have been reported (e.g. Bozdogan, 1994; Akaike & Kitagawa, 1994, 1995; Kitagawa & Gersch, 1996). AIC assumes that models are faithful. Takeuchi (1976) extended AIC to be applicable to unfaithful models. This criterion is called Takeuch...

3 | Nonuniqueness of connecting weights and AIC in multi-layered neural networks - Hagiwara, Toda - 1993 |

2 | A bootstrap variant of AIC for state space model selection - Cavanaugh - 1997 |
Citation context: ...the bootstrap bias correction is first introduced by Wong (1983) and Efron (1986), and then it is formalized as a model selection criterion by Ishiguro et al. (1997) (see also Davison & Hinkley, 1992; Cavanaugh & Shumway, 1997; Shibata, 1997). In the neural network community, AIC has been extended in a different direction. Murata et al. (1994) generalized the loss function of TIC and proposed the network information crite...

2 | Modified AIC and C_p in multivariate linear regression - Fujikoshi - 1997 |

2 | Neural network information criterion for the optimal number of hidden units - Onoda - 1995 |

2 | Estimation of generalization capability by combination of new information criterion and cross validation - Wada, Kawato - 1991 |

1 | The practice of time series analysis I. Tokyo: Asakura Syoten - Akaike - 1994 |
Citation context: ...criterion (AIC) (Akaike, 1974) is one of the most eminent methods of this type. Many successful applications of AIC to real-world problems have been reported (e.g. Bozdogan, 1994; Akaike & Kitagawa, 1994, 1995; Kitagawa & Gersch, 1996). AIC assumes that models are faithful. Takeuchi (1976) extended AIC to be applicable to unfaithful models. This criterion is called Takeuchi's modification of AIC (...
