## Bayesian Approach for Neural Networks - Review and Case Studies (2001)

Venue: Neural Networks

Citations: 21 (10 self)

### BibTeX

@ARTICLE{Lampinen01bayesianapproach,
  author  = {Jouko Lampinen and Aki Vehtari},
  title   = {Bayesian Approach for Neural Networks - Review and Case Studies},
  journal = {Neural Networks},
  year    = {2001},
  volume  = {14},
  number  = {3},
  pages   = {257--274}
}

### Abstract

We give a short review on the Bayesian approach for neural network learning and demonstrate the advantages of the approach in three real applications. We discuss the Bayesian approach with emphasis on the role of prior knowledge in Bayesian models and in classical error minimization approaches. The generalization capability of a statistical model, classical or Bayesian, is ultimately based on the prior assumptions. The Bayesian approach permits propagation of uncertainty in quantities which are unknown to other assumptions in the model, which may be more generally valid or easier to guess in the problem. The case problems studied in this paper include a regression, a classification, and an inverse problem. In the most thoroughly analyzed regression problem, the best models were those with less restrictive priors. This emphasizes the major advantage of the Bayesian approach, that we are not forced to guess attributes that are unknown, such as the number of degrees of freedom in the model, non-linearity of the model with respect to each input variable, or the exact form for the distribution of the model residuals.

### Citations

5246 |
Neural Networks for Pattern Recognition
- Bishop
- 1995
Citation Context: ...ardo and Smith, 1994; Gelman et al., 1995). For neural networks the Bayesian approach was pioneered in (Buntine and Weigend, 1991; MacKay, 1992; Neal, 1992) and reviewed in (MacKay, 1995; Neal, 1996; Bishop, 1995). With neural networks the main difficulty in model building is controlling the complexity of the model. It is well known that the optimal number of degrees of freedom in the model depends on the num...

4311 |
Classification and Regression Trees
- Breiman, Friedman, et al.
- 1984
Citation Context: ...KNN LOOCV, K-nearest-neighbor classification, where K is chosen by leave-one-out cross-validation, and CART, Classification And Regression Tree (Breiman et al., 1984). CV error estimates are collected in Table 6. The differences are not very significant, partly due to having only 8-fold cross-validation, but mostly because the different images had very different ...

1435 | Bayesian Data Analysis
- Gelman, Carlin, et al.
- 1995
Citation Context: ...ger tailed distributions), and integrating over the posterior distribution of ν in predictions. Some advice on the design of the hierarchical prior structures and robust noise models can be found in (Gelman et al., 1995). A typical attribute that is difficult to guess in advance in complex statistical models is the correct number of degrees of freedom, as it depends on the number of the training samples, distributio...

1358 |
Statistical decision theory and Bayesian analysis (2nd ed.)
- Berger
- 1985
Citation Context: ...by constructing the posterior conditional probabilities for the unobserved variables of interest, given the observed data sample and prior assumptions. Good references for Bayesian data analysis are (Berger, 1985; Bernardo and Smith, 1994; Gelman et al., 1995). For neural networks the Bayesian approach was pioneered in (Buntine and Weigend, 1991; MacKay, 1992; Neal, 1992) and reviewed in (MacKay, 1995; Neal, ...

1336 |
Monte Carlo sampling methods using Markov chains and their applications
- Hastings
- 1970
Citation Context: ...results in roughly a (truncated) exponential prior on ν (Geweke, 1993; Spiegelhalter et al., 1996). Another simple way to sample ν, without discretization, is by the Metropolis-Hastings algorithm (Hastings, 1970), which in our experiments gave equal results but slightly slower convergence. ... 3.2 Priors for the Model Parameter...
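The Metropolis-Hastings step mentioned in this excerpt is simple to sketch. The following is a minimal, hypothetical random-walk sampler for a one-dimensional parameter such as the degrees of freedom ν; the truncated-exponential target and its scale are illustrative assumptions, not the prior actually used in the paper:

```python
import math
import random

def metropolis_hastings(log_target, x0, n_samples, step=0.5, seed=0):
    """Random-walk Metropolis-Hastings for a 1-D target density.

    log_target: unnormalized log-density of the target.
    Returns a list of n_samples draws using symmetric Gaussian proposals.
    """
    rng = random.Random(seed)
    x = x0
    samples = []
    for _ in range(n_samples):
        prop = x + rng.gauss(0.0, step)
        # Accept with probability min(1, p(prop)/p(x)); the proposal is
        # symmetric, so the Hastings correction cancels.
        if math.log(rng.random()) < log_target(prop) - log_target(x):
            x = prop
        samples.append(x)
    return samples

# Example target: a truncated exponential prior on nu,
# p(nu) proportional to exp(-nu/10) for nu > 1 (hypothetical scale).
def log_prior(nu):
    return -nu / 10.0 if nu > 1.0 else -math.inf

draws = metropolis_hastings(log_prior, x0=5.0, n_samples=20000, step=2.0)
burned = draws[5000:]
print(sum(burned) / len(burned))  # Monte Carlo estimate of the mean (exactly 1 + 10 = 11 for this target)
```

With a symmetric proposal only the ratio of target densities enters the acceptance test; proposals below the truncation point get log-density -inf and are always rejected, which enforces the constraint ν > 1.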

1135 |
Bayesian Theory
- Bernardo, Smith
- 1994
Citation Context: ...g the posterior conditional probabilities for the unobserved variables of interest, given the observed data sample and prior assumptions. Good references for Bayesian data analysis are (Berger, 1985; Bernardo and Smith, 1994; Gelman et al., 1995). For neural networks the Bayesian approach was pioneered in (Buntine and Weigend, 1991; MacKay, 1992; Neal, 1992) and reviewed in (MacKay, 1995; Neal, 1996; Bishop, 1995). With ...

1115 | Bayes factors
- Kass, Raftery
- 1995
Citation Context: ...d in comparing the probabilities of the models, hence the term evidence of the model. (A widely used Bayesian model choice method between two models is based on Bayes factors, p(D|H1)/p(D|H2), see (Kass and Raftery, 1995).) The more common notation of Bayes' formula, with H dropped, more easily causes misinterpreting the denominator P(D) as some kind of probability of obtaining data D in the studied problem (or prio...
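A Bayes factor is just a ratio of marginal likelihoods. As a self-contained toy illustration (a coin-flip problem, not an example from the paper), compare a fixed-bias model against a model that integrates the bias over a Beta prior, where both marginal likelihoods have closed forms:

```python
import math

def log_marglik_fixed(heads, tails, theta=0.5):
    # p(D | H1): Bernoulli likelihood at a fixed theta (the combinatorial
    # factor is omitted in both models, so it cancels in the Bayes factor)
    return heads * math.log(theta) + tails * math.log(1.0 - theta)

def log_marglik_beta(heads, tails, a=1.0, b=1.0):
    # p(D | H2): likelihood integrated over theta ~ Beta(a, b),
    # computed via the Beta-function normalizing constants
    return (math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
            + math.lgamma(heads + a) + math.lgamma(tails + b)
            - math.lgamma(heads + tails + a + b))

heads, tails = 9, 1
bf = math.exp(log_marglik_beta(heads, tails) - log_marglik_fixed(heads, tails))
print(bf)  # 1024/110, about 9.31: the data favour the free-bias model
```

Here p(D|H2) = 9! 1! / 11! = 1/110 and p(D|H1) = 0.5^10 = 1/1024, so the Bayes factor is 1024/110; values well above 1 favour the model with the free parameter.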

866 | An introduction to variational methods for graphical models
- Jordan, Ghahramani, et al.
- 1998
Citation Context: ...which aims to approximate the posterior distribution by minimizing the Kullback-Leibler divergence between the true posterior and a parametric approximating distribution, variational approximations (Jordan et al., 1998) for approximating the integration by a tractable problem, and the mean field approach (Winther, 1998), where the problem is simplified by neglecting certain dependencies between the random variables. It...

642 | Bayesian Learning for Neural Networks
- Neal
- 1996
Citation Context: ...networks. We concentrate on MLP networks and Markov Chain Monte Carlo methods for computing the integrations, following the approach introduced in (Neal, 1992). A detailed treatment can be found in (Neal, 1996), which also describes the use of the Flexible Bayesian Modeling (FBM) software package that was the main tool used in the case problems reviewed in this paper. The result of Bayesian modeling is ...

569 | Approximate statistical tests for comparing supervised classification learning algorithms
- Dietterich
- 1998
Citation Context: ...This method exhibits a somewhat elevated probability of type I error (to suggest a difference when no difference exists) and low type II error (to miss a difference when it exists), as analyzed in (Dietterich, 1998). The reference methods were an early-stopped committee (MLP ESC) and a Gaussian Process model (Neal, 1999), which is a non-parametric regression method with priors imposed directly on the correlation...

560 |
Theory of Probability
- Jeffreys
- 1961
Citation Context: ...ion 3. A lot of work has been done to find "non-informative" priors that could be used to specify complete lack of knowledge of a parameter value. Some approaches are uniform priors, Jeffreys' prior (Jeffreys, 1961), and reference priors (Berger and Bernardo, 1992). See (Kass and Wasserman, 1996) for a review and (Yang and Berger, 1997) for a large catalog of different "non-informative" priors for various stati...

426 | A practical Bayesian framework for backpropagation networks
- MacKay
- 1992
Citation Context: ...s. Good references for Bayesian data analysis are (Berger, 1985; Bernardo and Smith, 1994; Gelman et al., 1995). For neural networks the Bayesian approach was pioneered in (Buntine and Weigend, 1991; MacKay, 1992; Neal, 1992) and reviewed in (MacKay, 1995; Neal, 1996; Bishop, 1995). With neural networks the main difficulty in model building is controlling the complexity of the model. It is well known that the...

409 | Neural network ensembles, cross validation, and active learning
- Krogh, Vedelsby
- 1995
Citation Context: ...is used to train the model. These limitations can easily be alleviated by using a committee of early stopping MLPs, with different partitioning of the data into training and stopping sets for each MLP (Krogh and Vedelsby, 1995). When used with caution, an early stopping committee is a good baseline method for MLPs. The target function and data are shown in Fig. 4. The modeling test was repeated 100 times with different realiza...

254 | No free lunch theorems for search
- Wolpert, Macready
- 1995
Citation Context: ...tter than random. In other words, if we do not assume anything a priori, the learning algorithm cannot learn anything from the training data that would generalize to the off-training-set samples. In (Wolpert and Macready, 1995; Wolpert, 1996b) the cross-validation (CV) method for model selection was analyzed in more depth and it was shown that the NFL theorem applies to CV also. The basic result in the papers is that with...

181 |
The selection of prior distributions by formal rules
- Kass, Wasserman
- 1996
Citation Context: ...could be used to specify complete lack of knowledge of a parameter value. Some approaches are uniform priors, Jeffreys' prior (Jeffreys, 1961), and reference priors (Berger and Bernardo, 1992). See (Kass and Wasserman, 1996) for a review and (Yang and Berger, 1997) for a large catalog of different "non-informative" priors for various statistical models. Among Bayesians the use of "non-informative" priors is often referr...

154 | Probable Networks and Plausible Predictions - A Review of Practical Bayesian Methods for Supervised Neural Networks

- MacKay

- 1995
Citation Context: ...is are (Berger, 1985; Bernardo and Smith, 1994; Gelman et al., 1995). For neural networks the Bayesian approach was pioneered in (Buntine and Weigend, 1991; MacKay, 1992; Neal, 1992) and reviewed in (MacKay, 1995; Neal, 1996; Bishop, 1995). With neural networks the main difficulty in model building is controlling the complexity of the model. It is well known that the optimal number of degrees of freedom in th...

144 | Evaluation of Gaussian Processes and Other Methods for Non-Linear Regression
- Rasmussen
- 1996
Citation Context: ...the posterior distribution. Note that the results are in general not very sensitive to the choices made at the hyperprior level, as discussed in section 2.1 and confirmed in many studies (see, e.g., Rasmussen, 1996). However, this should be checked in serious analysis, especially if the form of the prior needs to be compromised for reasons of computational convenience. In the framework used in this study (see ...

131 | The lack of a priori distinction between learning algorithms
- Wolpert
- 1996
Citation Context: ...the necessary link between the training samples and the not-yet-measured future samples. Recently, some important no-free-lunch (NFL) theorems have been proven that help to understand this issue. In (Wolpert, 1996a,b) Wolpert shows that if the class of approximating functions is not limited, any learning algorithm (i.e., procedure for choosing the approximating function) can as readily perform worse or better...

127 |
Bayesian back-propagation
- Buntine, Weigend
- 1991
Citation Context: ...sample and prior assumptions. Good references for Bayesian data analysis are (Berger, 1985; Bernardo and Smith, 1994; Gelman et al., 1995). For neural networks the Bayesian approach was pioneered in (Buntine and Weigend, 1991; MacKay, 1992; Neal, 1992) and reviewed in (MacKay, 1995; Neal, 1996; Bishop, 1995). With neural networks the main difficulty in model building is controlling the complexity of the model. It is well ...

125 |
Monte Carlo statistical methods. Springer Texts in Statistics
- Robert, Casella
- 2004
Citation Context: ...an learning for MLPs (Neal, 1996). A good introduction to basic MCMC methods and many applications in statistical data analysis can be found in (Gilks et al., 1996) and a more theoretical treatment in (Robert and Casella, 1999). In MCMC the complex integrals in the marginalization are approximated by drawing samples from the joint ...

114 |
The predictive sample reuse method with applications
- Geisser
- 1975
Citation Context: ...odel depends on these guesses, which in practical applications makes it necessary to carefully validate the models, using, e.g., Bayesian posterior analysis (Gelman et al., 1995) or cross-validation (Geisser, 1975; Gelfand, 1996; Vehtari and Lampinen, 2000). This also implies that in practice the Bayesian approach is often more sensitive to the prior assumptions than more classical methods. This is discussed i...

87 | Bayesian treatment of the independent student-t linear model
- Geweke
- 1993
Citation Context: ...tion noise model with unknown degrees of freedom. Thus similar treatment results whether we assume normal residuals with different variances, or a common longer-tailed t-distribution residual model (Geweke, 1993). The latter is preferable, as it leads to simpler noise models, and will be discussed in more detail below. In heteroscedastic problems, the noise variance can be functionally dependent on some expl...

75 | Assessing convergence of Markov Chain Monte Carlo algorithms
- Brooks, Roberts, et al.
- 1998
Citation Context: ...workstation. For convergence diagnostics we used visual inspection of trends and the potential scale reduction method (Gelman, 1996). Alternative convergence diagnostics have been reviewed, e.g., in (Brooks and Roberts, 1999; Robert and Casella, 1999). See (Vehtari et al., 2000) for discussion on the choice of the starting values and the number of chains. Choosing the initial values with early stopping can be used to redu...

67 |
Regression and classification using Gaussian process priors
- Neal
- 1998
Citation Context: ...nce exists) and low type II error (to miss a difference when it exists), as analyzed in (Dietterich, 1998). The reference methods were an early-stopped committee (MLP ESC) and a Gaussian Process model (Neal, 1999), which is a non-parametric regression method with priors imposed directly on the correlation function of the resulting approximation. The GP approach is a very viable alternative to MLP models, at...

62 |
Model determination using sampling-based methods
- Gelfand
- 1996
Citation Context: ...these guesses, which in practical applications makes it necessary to carefully validate the models, using, e.g., Bayesian posterior analysis (Gelman et al., 1995) or cross-validation (Geisser, 1975; Gelfand, 1996; Vehtari and Lampinen, 2000). This also implies that in practice the Bayesian approach is often more sensitive to the prior assumptions than more classical methods. This is discussed in more detail i...

60 |
Inference and monitoring convergence
- Gelman
- 1996
Citation Context: ...of samples in the chain, which may require several hours of CPU time on a standard workstation. For convergence diagnostics we used visual inspection of trends and the potential scale reduction method (Gelman, 1996). Alternative convergence diagnostics have been reviewed, e.g., in (Brooks and Roberts, 1999; Robert and Casella, 1999). See (Vehtari et al., 2000) for discussion on the choice of the starting values ...
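The potential scale reduction diagnostic referenced here compares between-chain and within-chain variance. A sketch of the basic statistic (without the split-chain or rank-normalization refinements of later variants) might look like:

```python
import math
import random

def potential_scale_reduction(chains):
    """Basic Gelman-Rubin R-hat for m chains of equal length n.

    B is the between-chain variance of the chain means, W the average
    within-chain variance; values near 1 suggest the chains have mixed.
    """
    m, n = len(chains), len(chains[0])
    means = [sum(c) / n for c in chains]
    grand = sum(means) / m
    B = n / (m - 1) * sum((mu - grand) ** 2 for mu in means)
    W = sum(sum((x - mu) ** 2 for x in c) / (n - 1)
            for c, mu in zip(chains, means)) / m
    var_plus = (n - 1) / n * W + B / n  # pooled variance estimate
    return math.sqrt(var_plus / W)

rng = random.Random(0)
# Four well-mixed chains sampling the same N(0, 1) target...
mixed = [[rng.gauss(0.0, 1.0) for _ in range(500)] for _ in range(4)]
# ...versus two chains stuck in different regions (means 0 and 3).
stuck = [[rng.gauss(mu, 1.0) for _ in range(500)] for mu in (0.0, 3.0)]
print(potential_scale_reduction(mixed))  # close to 1
print(potential_scale_reduction(stuck))  # well above 1
```

When chains have not mixed, the between-chain term B inflates the pooled variance estimate relative to W, so R-hat rises well above 1; the usual practice is to run several chains from dispersed starting points and continue until R-hat is near 1 for all monitored quantities.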

55 | Bayesian non-linear modelling for the prediction competition
- MacKay
- 1994
Citation Context: ...or to include different options in the models and integrate over them, which is the "correct" Bayesian approach. Earlier published studies with conclusions comparable to those of this paper include (MacKay, 1994; Neal, 1996; Thodberg, 1996; Rasmussen, 1996; Vivarelli and Williams, 1997; Husmeier et al., 1998; Neal, 1998; Penny and Roberts, 1999). Cited studies include both regression and classification cases...

43 | Assessing relevance determination methods using DELVE
- Neal
- 1998
Citation Context: ...not comparable to the ARD coefficients of the hidden layer weights. Note that adding input-to-output weights makes the model less identifiable and may slow down the convergence of MCMC considerably (Neal, 1998). In the following simple example we demonstrate how the non-linearity of the input has the largest effect on the relevance score of the ARD, instead of the predictive or causal importance. The targe...

40 | Curvature-driven smoothing: a learning algorithm for feedforward networks
- Bishop
- 1993
Citation Context: ...hich is a widely used regularization method in, e.g., inverse problems, functions with large derivatives of chosen order are penalized. With an MLP model, minimizing the curvature (second derivative) (Bishop, 1993) or training the derivatives to given target values (Lampinen and Selonen, 1997) leads to a rather complex treatment, as the partial derivatives of the non-linear models depend on all the other inputs an...

40 | Bayesian training of backpropagation networks by the Hybrid Monte Carlo method
- Neal
- 1992
Citation Context: ...nces for Bayesian data analysis are (Berger, 1985; Bernardo and Smith, 1994; Gelman et al., 1995). For neural networks the Bayesian approach was pioneered in (Buntine and Weigend, 1991; MacKay, 1992; Neal, 1992) and reviewed in (MacKay, 1995; Neal, 1996; Bishop, 1995). With neural networks the main difficulty in model building is controlling the complexity of the model. It is well known that the optimal num...

37 | A review of bayesian neural networks with an application to near infrared spectroscopy
- Thodberg
- 1996
Citation Context: ...ptions in the models and integrate over them, which is the "correct" Bayesian approach. Earlier published studies with conclusions comparable to those of this paper include (MacKay, 1994; Neal, 1996; Thodberg, 1996; Rasmussen, 1996; Vivarelli and Williams, 1997; Husmeier et al., 1998; Neal, 1998; Penny and Roberts, 1999). Cited studies include both regression and classification cases. In classification, the lik...

35 | Sequential Monte Carlo methods to train neural network models - Freitas, Niranjan, et al. - 2000 |

31 |
On the development of reference priors
- Berger, Bernardo
- 1992
Citation Context: ...ind "non-informative" priors that could be used to specify complete lack of knowledge of a parameter value. Some approaches are uniform priors, Jeffreys' prior (Jeffreys, 1961), and reference priors (Berger and Bernardo, 1992). See (Kass and Wasserman, 1996) for a review and (Yang and Berger, 1997) for a large catalog of different "non-informative" priors for various statistical models. Among Bayesians the use of "non-inf...

23 | Ensemble Learning in Bayesian Neural Networks
- Barber, Bishop
- 1998
Citation Context: ...s for approximating the integrations in neural network models include, e.g., Markov Chain Monte Carlo techniques for numerical integration, discussed in more detail in section 3.4, ensemble learning (Barber and Bishop, 1998), which aims to approximate the posterior distribution by minimizing the Kullback-Leibler divergence between the true posterior and a parametric approximating distribution, variational approximations ...

23 |
Electrical impedance tomography with basis constraints. Inverse Problems 13(2)
- Vauhkonen, Kaipio, et al.
- 1997
Citation Context: ...verse problem in EIT, estimating the conductivity distribution from the surface potentials, is known to be severely ill-posed, thus some regularization methods must be used to obtain feasible results (Vauhkonen et al., 1997). Fig. 6 shows a simulated example of the EIT problem. The volumes bounded by the circles in the image represent gas bubbles floating in liquid. The conductance of the gas is much lower than that of th...

21 |
Regression with input-dependent noise: A Bayesian treatment
- Bishop, Qazaz
- 1997
Citation Context: ...atory variables, typically on some subset of the model inputs, so that the model for the noise variance might be (σ²)_n = F(x_n; θ_noise) + ε (14), ε ~ Inv-gamma(σ₀, ν₀) (15), with fixed σ₀ and ν₀. See (Bishop and Qazaz, 1997) for an example of an input-dependent noise model, where a separate MLP model is used to estimate the dependence of the noise variance on the inputs. Often in practical problems the Gaussian residual model...

20 | Bayesian neural networks for classification: How useful is the evidence framework? Neural Networks
- Roberts, Penny
- 1998
Citation Context: ...blished studies with conclusions comparable to those of this paper include (MacKay, 1994; Neal, 1996; Thodberg, 1996; Rasmussen, 1996; Vivarelli and Williams, 1997; Husmeier et al., 1998; Neal, 1998; Penny and Roberts, 1999). Cited studies include both regression and classification cases. In classification, the likelihood model (usually) contains no hyperparameters, whereas in regression problems the noise model is a cr...

11 | Using Bayesian neural network to solve the inverse problemin electrical impedance tomography
- Lampinen, Vehtari, et al.
- 1999
Citation Context: ...In this section we report results on using Bayesian MLPs for solving the ill-posed inverse problem in electrical impedance tomography (EIT). The full report of the proposed approach is presented in (Lampinen et al., 1999). The aim in EIT is to recover the internal structure of an object from surface measurements. A number of electrodes are attached to the surface of the object and current patterns are injected throug...

10 | Using Bayesian neural networks to classify forest scenes - Vehtari, Heikkonen, et al. - 1998 |

8 |
Information about hyperparameters in hierarchical models
- Goel, DeGroot
- 1981
Citation Context: ...ation that is needed to be able to learn a generalizing model (Lemm, 1999). By using "non-informative" priors, the fixed, or guessed, choices can be moved to higher levels of hierarchical models. In (Goel and DeGroot, 1981) it was shown that in hierarchical models the training data contains less information about hyperparameters which are higher in the hierarchy, so that the prior and posterior for the hyperparameters bec...

8 | Using background knowledge in multilayer perceptron learning
- Lampinen, Selonen
- 1997
Citation Context: ...ems, functions with large derivatives of chosen order are penalized. With an MLP model, minimizing the curvature (second derivative) (Bishop, 1993) or training the derivatives to given target values (Lampinen and Selonen, 1997) leads to a rather complex treatment, as the partial derivatives of the non-linear models depend on all the other inputs and weights. A convenient, commonly used prior distribution is the Gaussian, which in l...

8 |
Bayesian Field Theory
- Lemm
- 1999
Citation Context: ...e NFL theorems, this requires that the hypothesis space is already so constrained that it contains the sufficient amount of prior information that is needed to be able to learn a generalizing model (Lemm, 1999). By using "non-informative" priors, the fixed, or guessed, choices can be moved to higher levels of hierarchical models. In (Goel and DeGroot, 1981) it was shown that in hierarchical models the trai...

7 | Prior information and generalized questions
- Lemm
- 1996
Citation Context: ...however, that the role of prior knowledge is equally important in any other approach, including Maximum Likelihood. Basically, all generalization is based on the prior knowledge, as discussed in (Lemm, 1996, 1999); the training samples provide information only at those points, and the prior knowledge provides the necessary link between the training samples and the not-yet-measured future samples. Recent...

7 | Using bayesian neural networks to classify segmented images
- Vivarelli, Williams
- 1997
Citation Context: ...e over them, which is the "correct" Bayesian approach. Earlier published studies with conclusions comparable to those of this paper include (MacKay, 1994; Neal, 1996; Thodberg, 1996; Rasmussen, 1996; Vivarelli and Williams, 1997; Husmeier et al., 1998; Neal, 1998; Penny and Roberts, 1999). Cited studies include both regression and classification cases. In classification, the likelihood model (usually) contains no hyperparame...

6 | Bayesian neural networks with correlating residuals - Vehtari, Lampinen - 1999 |

6 |
Bayesian back-propagation. Complex systems
- Buntine, Weigend
- 1991
Citation Context: ...sample and prior assumptions. Good references for Bayesian data analysis are (Berger, 1985; Bernardo and Smith, 1994; Gelman et al., 1995). For neural networks the Bayesian approach was pioneered in (Buntine and Weigend, 1991; MacKay, 1992; Neal, 1992) and reviewed in (MacKay, 1995; Neal, 1996; Bishop, 1995). With neural networks the main difficulty in model building is controlling the complexity of the model. It is well ...

5 | Bayesian Mean Field Algorithms for Neural Networks and Gaussian Processes
- Winther
- 1998
Citation Context: ...ween the true posterior and a parametric approximating distribution, variational approximations (Jordan et al., 1998) for approximating the integration by a tractable problem, and the mean field approach (Winther, 1998), where the problem is simplified by neglecting certain dependencies between the random variables. It is worth noticing that also in full hierarchical Bayesian models there are large amounts of fixe...

4 | On MCMC sampling in Bayesian MLP neural networks - Vehtari, Särkkä, et al. - 2000 |

3 | Empirical evaluation of Bayesian sampling for neural classifiers
- Husmeier, Penny, et al.
- 1998
Citation Context: ...rect" Bayesian approach. Earlier published studies with conclusions comparable to those of this paper include (MacKay, 1994; Neal, 1996; Thodberg, 1996; Rasmussen, 1996; Vivarelli and Williams, 1997; Husmeier et al., 1998; Neal, 1998; Penny and Roberts, 1999). Cited studies include both regression and classification cases. In classification, the likelihood model (usually) contains no hyperparameters, whereas in regres...

2 | Bayesian training of backpropagation networks by the hybrid Monte Carlo method - Neal - 1992 |