## Bayesian Model Assessment and Comparison Using Cross-Validation Predictive Densities (2002)

Venue: Neural Computation

Citations: 27 (11 self)

### BibTeX

@ARTICLE{Vehtari02bayesianmodel,
  author = {Aki Vehtari and Jouko Lampinen},
  title = {Bayesian Model Assessment and Comparison Using Cross-Validation Predictive Densities},
  journal = {Neural Computation},
  year = {2002},
  volume = {14},
  pages = {2439--2468}
}

### Abstract

In this work, we discuss practical methods for the assessment, comparison, and selection of complex hierarchical Bayesian models. A natural way to assess the goodness of the model is to estimate its future predictive capability by estimating expected utilities. Instead of just making a point estimate, it is important to obtain the distribution of the expected utility estimate, as it describes the uncertainty in the estimate. The distributions of the expected utility estimates can also be used to compare models, for example, by computing the probability of one model having a better expected utility than some other model. We propose an approach using cross-validation predictive densities to obtain expected utility estimates and Bayesian bootstrap to obtain samples from their distributions. We also discuss the probabilistic assumptions made and properties of two practical cross-validation methods, importance sampling and k-fold cross-validation. As illustrative examples, we use MLP neural networks and Gaussian processes (GP) with Markov chain Monte Carlo sampling in one toy problem and two challenging real-world problems.
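The comparison step described in the abstract can be illustrated with a minimal Python sketch. It assumes that per-observation cross-validation utilities (for example, log predictive densities) have already been computed for two models on the same n observations. The function names, the synthetic utilities, and the number of draws are hypothetical and not taken from the paper. Pairing the utilities per observation is an assumption of this sketch, and it ignores the Monte Carlo error in each utility.

```python
import numpy as np


def bb_expected_utility(u, n_draws=4000, rng=None):
    """Bayesian bootstrap draws of the expected utility (mean of u).

    Each draw weights the n observed utilities with one sample from a
    Dirichlet(1, ..., 1) distribution; unobserved values get zero weight.
    """
    rng = np.random.default_rng(rng)
    u = np.asarray(u, dtype=float)
    g = rng.dirichlet(np.ones(u.size), size=n_draws)  # shape (n_draws, n)
    return g @ u  # one weighted mean per draw


def prob_a_better(u_a, u_b, n_draws=4000, rng=None):
    """Estimated probability that model A has the higher expected utility.

    Uses a paired comparison: the Bayesian bootstrap is applied to the
    per-observation differences u_a - u_b (an assumption of this sketch).
    """
    diff = np.asarray(u_a, dtype=float) - np.asarray(u_b, dtype=float)
    draws = bb_expected_utility(diff, n_draws=n_draws, rng=rng)
    return float(np.mean(draws > 0.0))


# Synthetic per-observation utilities, for illustration only.
data_rng = np.random.default_rng(0)
u_b = data_rng.normal(-1.2, 0.5, size=200)
u_a = u_b + data_rng.normal(0.3, 0.2, size=200)

draws_a = bb_expected_utility(u_a, rng=1)
lo, hi = np.quantile(draws_a, [0.05, 0.95])  # 90% interval for model A
p = prob_a_better(u_a, u_b, rng=2)
print(f"E[u_A] 90% interval: [{lo:.3f}, {hi:.3f}], P(A better than B) = {p:.3f}")
```

The quantiles of the draws describe the uncertainty in the expected utility estimate, and the fraction of positive difference draws gives the comparison probability.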

### Citations

3926 | Classification and Regression Trees - Breiman, Friedman, et al. - 1984 |

3737 | Stochastic Relaxation, Gibbs Distributions, and the Bayesian Restoration of Images - Geman, Geman - 1984 |

1250 | Bayesian Data Analysis - Gelman, Carlin, et al. - 1995 |
Citation Context: ...ion to the distribution of random variable. Having samples of $z_1,\ldots,z_n$ of a random variable $Z$, it is assumed that posterior probabilities for the $z_i$ have Dirichlet distribution $\mathrm{Di}(1,\ldots,1)$ (see, e.g., Gelman, Carlin, Stern, & Rubin, 1995, Appendix A) and values of $Z$ that are not observed have zero posterior probability. Sampling from the Dirichlet distribution gives BB samples from the distribution of the distribution of $Z$ and thus s...

1041 | Bayesian Theory - Bernardo, Smith - 2000 |

986 | Bayes factors - Kass, Raftery - 1995 |

908 | Monte Carlo Statistical Methods - Robert, Casella - 1999 |
Citation Context: ...ng the correct answer. On the other hand, the same problem applies to any variance or convergence diagnostics method based on finite samples of any indirect Monte Carlo method (see, e.g., Neal, 1993; Robert & Casella, 1999). Even in simple models like the Bayesian linear model, leaving one very influential data point out may change the posterior so much that the variance of the weights is very large or infinite (Perugg...

830 | Reversible jump Markov chain Monte Carlo computation and Bayesian model determination - Green - 1995 |
Citation Context: ... is useful only when selecting between a few models. If we have many model candidates, for example if doing variable selection, we can use some other methods like the variable dimension MCMC methods (Green, 1995; Carlin & Chib, 1995; Stephens, 2000) for model selection and still use the expected utilities for final model assessment. This approach is discussed in more detail by Vehtari (2001, chap. 4). We hav...

723 | Cross-validatory choice and assessment of statistical predictions - Stone - 1974 |
Citation Context: ... assessed in order to find out whether it is useful in a given problem. The cross-validation methods for model assessment and comparison have been proposed by several authors: for early accounts see (Stone, 1974; Geisser, 1975) and for more recent review see (Gelfand, Dey, & Chang, 1992; Shao, 1993). The cross-validation predictive density dates at least to (Geisser & Eddy, 1979) and review of cross-validati...

607 | Bayesian Learning for Neural Networks - Neal - 1996 |
Citation Context: ...by Gelfand et al. (1992), Gelfand (1996), and Draper (1995, 1996). 4 Illustrative examples As illustrative examples, we use MLP networks and Gaussian processes with Markov Chain Monte Carlo sampling (Neal, 1996, 1997, 1999; Lampinen & Vehtari, 2001) in one toy problem: MacKay’s robot arm, and two real world problems: concrete quality estimat...

563 | Probabilistic inference using Markov chain Monte Carlo methods - Neal - 1993 |
Citation Context: ...ling is giving the correct answer. On the other hand, the same problem applies to any variance or convergence diagnostics method based on finite samples of any indirect Monte Carlo method (see, e.g., Neal, 1993; Robert & Casella, 1999). Even in simple models like the Bayesian linear model, leaving one very influential data point out may change the posterior so much that the variance of the weights is very l...

531 | Statistical Test for Comparing Supervised Classification Learning Algorithms - Dietterich - 1998 |

527 | Theory of Probability - Jeffreys - 1961 |
Citation Context: ... postulate (parsimony principle), it is useful to start from simpler models and then test if more complex model would give significantly better predictions. See discussion of simplicity postulate in (Jeffreys, 1961). Although possible overestimation of the variability due to training sets being slightly different (see the previous section) makes these comparisons slightly conservative, the error is small and in...

401 | Monte Carlo Strategies in Scientific Computing - Liu - 2001 |
Citation Context: ...ective sample sizes based on an approximation of the variance of importance weights computed as $m_{\mathrm{eff}}^{(i)} = 1 / \sum_{j=1}^{m} (w_j^{(i)})^2$ (16), where $w_j^{(i)}$ are normalized weights (Kong, Liu, & Wong, 1994; Liu, 2001, chap. 2.5.3). We propose to examine the distribution of the effective sample sizes by checking the minimum and some quantiles and by plotting $m_{\mathrm{eff}}^{(i)}$ in increasing order (see examples in section 4...

400 | A practical Bayesian framework for backpropagation networks - MacKay - 1992 |
Citation Context: ...’s robot arm In this section we illustrate some basic issues of the expected utilities computed by using the cross-validation predictive densities. A very simple “robot arm” toy problem (first used by MacKay, 1992) was selected, so that the complexity of the problem would not hide the main points that we want to illustrate. The task is to learn the mapping from joint angles to position for an imaginary robot a...

320 | Bayesian inference in econometric models using Monte Carlo integration - Geweke - 1989 |
Citation Context: ...imate it with the Monte Carlo method $E(h(\theta)) \approx \frac{\sum_{l=1}^{L} h(\dot{\theta}_l)\, w(\dot{\theta}_l)}{\sum_{l=1}^{L} w(\dot{\theta}_l)}$ (13), where the factors $w(\dot{\theta}_l) = f(\dot{\theta}_l)/g(\dot{\theta}_l)$ are called importance ratios or importance weights. See (Geweke, 1989) for the conditions of the convergence of the importance sampling estimates. The quality of the importance sampling estimates depends heavily on the variability of the importance sampling weights, wh...

224 | The jackknife and the bootstrap for general stationary observations - Künsch - 1989 |
Citation Context: ... near independent MCMC samples (estimated by autocorrelations (Neal, 1993, chap. 6; Chen et al., 2000, chap. 3)). However, if MCMC samples were highly dependent, we could use dependent weights in BB (Künsch, 1989, 1994). 2.4 Model comparison with expected utilities The distributions of the expected utility estimates can be used for comparing different models. Difference of the expected utilities of two models...

185 | Sequential imputations and Bayesian missing data problems - Kong, Liu, et al. - 1994 |
Citation Context: ... heuristic measure of effective sample sizes based on an approximation of the variance of importance weights computed as $m_{\mathrm{eff}}^{(i)} = 1 / \sum_{j=1}^{m} (w_j^{(i)})^2$ (16), where $w_j^{(i)}$ are normalized weights (Kong, Liu, & Wong, 1994; Liu, 2001, chap. 2.5.3). We propose to examine the distribution of the effective sample sizes by checking the minimum and some quantiles and by plotting $m_{\mathrm{eff}}^{(i)}$ in increasing order (see examples i...

182 | Bayesian model choice: asymptotics and exact calculations - Gelfand, Dey - 1994 |

171 | Posterior predictive assessment of model fitness via realized discrepancies (with discussion). Statistica Sinica - Gelman, Meng, et al. - 1996 |

156 | Rational decisions - Good - 1952 |
Citation Context: ... and two real world problems (section 4). 1.1 Expected utilities In prediction and decision problems, it is natural to assess the predictive ability of the model by estimating the expected utilities (Good, 1952; Bernardo & Smith, 1994). Utility measures the relative values of consequences. By using application specific utilities, the expected...

149 | Bayesian model choice via Markov Chain Monte Carlo methods - Carlin, Chib - 1995 |
Citation Context: ...ly when selecting between a few models. If we have many model candidates, for example if doing variable selection, we can use some other methods like the variable dimension MCMC methods (Green, 1995; Carlin & Chib, 1995; Stephens, 2000) for model selection and still use the expected utilities for final model assessment. This approach is discussed in more detail by Vehtari (2001, chap. 4). We have tried to follow the...

147 | Introduction to Monte Carlo methods - MacKay - 1996 |
Citation Context: ...e (Peruggia, 1997). Moreover, even if leave-one-out posteriors are similar to the full posterior, importance sampling in high dimensions suffers from large variation in importance weights (see, e.g., MacKay, 1998). Flexible nonlinear models like MLP have usually a high number of parameters and a large number of degrees of freedom (all data points may be influential). We demonstrate in section 4.1 a simple cas...

143 | Inference for the generalization error - Nadeau, Bengio - 2003 |
Citation Context: ...tion we briefly discuss assumptions made on future data distribution in the approach described in this work and in related approaches (see, e.g., Rasmussen et al., 1996; Neal, 1998; Dietterich, 1998; Nadeau & Bengio, 2000, and references therein), where the goal is to compare (not assess) the performance of methods (algorithms) instead of the single models conditioned on the given training data. Assume that the traini...

137 | The intrinsic Bayes factor for model selection and prediction - Berger, Pericchi - 1996 |

134 | Approximate Bayesian inference with the weighted likelihood bootstrap - Newton, Raftery - 1994 |
Citation Context: ...Peruggia, 1997), but if analytical solutions are inapplicable, we have to estimate this from the weights obtained. It is customary to examine the distribution of weights with various plots (see, e.g., Newton & Raftery, 1994; Gelman et al., 1995, chap. 10; Peruggia, 1997). We prefer plotting the cumulative normalized weights (see examples in section 4.1). As we get n such plots...

124 | Monte Carlo implementation of Gaussian process models for Bayesian regression and classification - Neal - 1997 |

123 | Monte Carlo Methods in Bayesian Computation - Chen, Shao, et al. - 2000 |

123 | The lack of a priori distinctions between learning algorithms - Wolpert - 1996 |
Citation Context: ...ifference between the distributions � and � + . If we do not assume anything about the distribution � + we cannot predict the behavior of the model in a new domain as stated by no-free-lunch theorem (Wolpert, 1996a,b). Even if the distributions � and � + have only few dimensions, it is very hard to quantify differences and estimate their effect on expected utilities. If the applications are similar (e.g., pape...

123 | Linear model selection by cross-validation - Shao - 1993 |
Citation Context: ...ion methods for model assessment and comparison have been proposed by several authors: for early accounts see (Stone, 1974; Geisser, 1975) and for more recent review see (Gelfand, Dey, & Chang, 1992; Shao, 1993). The cross-validation predictive density dates at least to (Geisser & Eddy, 1979) and review of cross-validation and other predictive densities appears in (Gelfand & Dey, 1994; Gelfand, 1996). Berna...

113 | Bayesianly justifiable and relevant frequency calculations for the applied statistician - Rubin - 1984 |

103 | The predictive sample reuse method with applications. J Am Stat Assoc - Geisser - 1975 |
Citation Context: ...order to find out whether it is useful in a given problem. The cross-validation methods for model assessment and comparison have been proposed by several authors: for early accounts see (Stone, 1974; Geisser, 1975) and for more recent review see (Gelfand, Dey, & Chang, 1992; Shao, 1993). The cross-validation predictive density dates at least to (Geisser & Eddy, 1979) and review of cross-validation and other pr...

98 | Fractional Bayes Factors for Model Comparisons - O’Hagan - 1995 |

93 | Model determination using predictive distributions with implementation via sampling-based methods - Gelfand, Dey, et al. - 1992 |
Citation Context: ...en problem. The cross-validation methods for model assessment and comparison have been proposed by several authors: for early accounts see (Stone, 1974; Geisser, 1975) and for more recent review see (Gelfand, Dey, & Chang, 1992; Shao, 1993). The cross-validation predictive density dates at least to (Geisser & Eddy, 1979) and review of cross-validation and other predictive densities appears in (Gelfand & Dey, 1994; Gelfand, ...

92 | Introduction to radial basis function networks - Orr - 1996 |
Citation Context: ...$y \mid x^{(i)}, D^{(\setminus i)}, M) = \int p(y \mid x^{(i)}, \theta, D^{(\setminus i)}, M)\, p(\theta \mid D^{(\setminus i)}, M)\, d\theta$ (10). For simple models, the LOO-CV-predictive densities may be computed quickly using analytical solutions (see, e.g., Shao, 1993; Orr, 1996; Peruggia, 1997), but models that are more complex usually require a full model fitting for each of the n predictive densities. When using the Monte Carlo methods it means that we have to sample from...

88 | The Bayesian bootstrap - Rubin - 1981 |
Citation Context: ...may fail. In addition, the above approximation ignores the uncertainty in the estimates of $u_i$’s due to Monte Carlo error. We propose a quick and generic approach based on the Bayesian bootstrap (BB) (Rubin, 1981), which can handle variability due to Monte Carlo integration, bias correction estimation, and the approximation of the future data distribution, as well as arbitrary summary quantities and gives goo...

87 | Blind deconvolution via sequential imputations - Liu, Chen - 1995 |

80 | A predictive approach to model selection - Geisser, Eddy - 1979 |
Citation Context: ...veral authors: for early accounts see (Stone, 1974; Geisser, 1975) and for more recent review see (Gelfand, Dey, & Chang, 1992; Shao, 1993). The cross-validation predictive density dates at least to (Geisser & Eddy, 1979) and review of cross-validation and other predictive densities appears in (Gelfand & Dey, 1994; Gelfand, 1996). Bernardo and Smith (1994, chap. 6) also discuss briefly how cross-validation approximat...

71 | Expected information as expected utility - Bernardo - 1979 |
Citation Context: ...information-theoretic Kullback-Leibler (KL) divergence between the model and the unknown distribution of the data, and equivalently, it corresponds to maximization of the expected information gained (Bernardo, 1979). An application specific utility may measure the expected benefit or cost. For simplicity we use term utility also for costs, although better word would be risk. Also, instead of negating cost, we r...

63 | Regression and classification using Gaussian process priors (with discussion) - Neal - 1999 |

57 | Model determination using sampling-based methods - Gelfand - 1996 |
Citation Context: ...ang, 1992; Shao, 1993). The cross-validation predictive density dates at least to (Geisser & Eddy, 1979) and review of cross-validation and other predictive densities appears in (Gelfand & Dey, 1994; Gelfand, 1996). Bernardo and Smith (1994, chap. 6) also discuss briefly how cross-validation approximates the formal Bayes procedure of computing the expected utilities. We synthesize and extend the previous work ...

57 | Inference and monitoring convergence - Gelman - 1996 |
Citation Context: ...one with the FBM software and Matlab-code partly derived from the FBM and Netlab toolbox. For convergence diagnostics, we used a visual inspection of trends, the potential scale reduction method (Gelman, 1996) and the Kolmogorov-Smirnov test (Robert & Casella, 1999). Importance weights for MLP and GP were computed as described in (Vehtari, 2001, Ch 3.2.2). 4.1 Toy problem: MacKay’s robot arm In this secti...

54 | A comparative study of ordinary cross-validation, v-fold cross-validation and the repeated learning-testing methods - Burman - 1989 |

41 | Assessing relevance determination methods using delve - Neal - 1998 |

36 | Markov Chain Monte Carlo methods for computing Bayes factors: A comparative review - Han, Carlin - 2001 |
Citation Context: ... difficult to compute (Kass & Raftery, 1995). However, it may be possible to estimate unnormalized prior predictive likelihoods for large number of models relatively fast (see, e.g., Ntzoufras, 1999; Han & Carlin, 2001), so that prior predictive approach may be used to aid model selection as discussed by Vehtari (2001, chap. 4). 3.2 Posterior predictive densities Posterior predictive densities are naturally used fo...

29 | A rank statistics approach to the consistency of a general bootstrap - Mason, Newton - 1992 |
Citation Context: ...i,b were the probability that $Z = z_i$; that is, we calculate $\dot{\phi}_b = \sum_{i=1}^{n} g_{i,b} z_i$. The distribution of the values of $\dot{\phi}_b$, $b = 1,\ldots,B$, is the BB distribution of the mean $E[Z]$. See (Lo, 1987; Weng, 1989; Mason & Newton, 1992) for some important properties of the BB. The assumption that all possible distinct values of $Z$ have been observed is usually wrong, but with moderate n and not very thick tailed distributions, infer...

28 | Bayesian deviance, the effective number of parameters, and the comparison of arbitrarily complex models. Unpublished manuscript - Spiegelhalter, Best, et al. - 1998 |
Citation Context: ...be remembered that: “Selecting a single model is always complex procedure involving background knowledge and other factors as the robustness of inferences to alternative models with similar support” (Spiegelhalter, Best, & Carlin, 1998, p. 3). 3 Relations to other predictive approaches In this section, we discuss the relations of the cross-validation predictive densities to prior predictive densities and Bayes factors (section 3.1)...

27 | Posterior Bayes factors (with discussion) - Aitkin - 1991 |
Citation Context: ...known to underestimate the generalization error of flexible models (see also examples in section 4). Comparison of the joint posterior predictive densities leads to the posterior Bayes factor (PoBF) (Aitkin, 1991). The posterior predictive densities should generally not be used either for assessing model performance, except as an estimate of the upper (or lower if smaller value is better) limit for the expect...

27 | A large sample study of the Bayesian bootstrap - Lo - 1987 |
Citation Context: ... the mean of $Z$ as if $g_{i,b}$ were the probability that $Z = z_i$; that is, we calculate $\dot{\phi}_b = \sum_{i=1}^{n} g_{i,b} z_i$. The distribution of the values of $\dot{\phi}_b$, $b = 1,\ldots,B$, is the BB distribution of the mean $E[Z]$. See (Lo, 1987; Weng, 1989; Mason & Newton, 1992) for some important properties of the BB. The assumption that all possible distinct values of $Z$ have been observed is usually wrong, but with moderate n and not very...

19 | A cross-validatory method for dependent data - Burman, Chow, et al. - 1994 |

19 | Discussion on 'On Bayesian analysis of mixtures with an unknown number of components' - Stephens - 1997 |