## Bayesian Deviance, the Effective Number of Parameters, and the Comparison of Arbitrarily Complex Models (1998)

Citations: 28 - 7 self

### BibTeX

@TECHREPORT{Spiegelhalter98bayesiandeviance,
    author = {David J. Spiegelhalter and Nicola G. Best and Bradley P. Carlin},
    title = {Bayesian Deviance, the Effective Number of Parameters, and the Comparison of Arbitrarily Complex Models},
    institution = {},
    year = {1998}
}

### Abstract

We consider the problem of comparing complex hierarchical models in which the number of parameters is not clearly defined. We follow Dempster in examining the posterior distribution of the log-likelihood under each model, from which we derive measures of fit and complexity (the effective number of parameters). These may be combined into a Deviance Information Criterion (DIC), which is shown to have an approximate decision-theoretic justification. Analytic and asymptotic identities reveal the measure of complexity to be a generalisation of a wide range of previous suggestions, with particular reference to the neural network literature. The contributions of individual observations to fit and complexity can give rise to a diagnostic plot of deviance residuals against leverages. The procedure is illustrated in a number of examples, and throughout it is emphasised that the required quantities are trivial to compute in a Markov chain Monte Carlo analysis, and require no analytic work for new...
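The abstract notes that the required quantities are trivial to compute from MCMC output. A minimal sketch of that computation follows, using the standard definitions $p_D = \bar{D} - D(\bar{\theta})$ and $\mathrm{DIC} = \bar{D} + p_D$. Everything concrete here is an assumption made for illustration and is not taken from the report: the Normal model with known variance, the flat prior, the hypothetical function names `deviance` and `dic_from_samples`, the choice of zero for the standardising term, and the use of direct draws from the known posterior in place of an MCMC sampler.

```python
import numpy as np


def deviance(y, mu, sigma=1.0):
    """Return -2 * log-likelihood of y under Normal(mu, sigma^2).

    Assumption: the standardising term is taken to be zero.
    """
    n = y.size
    return n * np.log(2 * np.pi * sigma**2) + np.sum((y - mu) ** 2) / sigma**2


def dic_from_samples(y, mu_samples, sigma=1.0):
    """Compute fit, complexity and DIC from posterior draws of mu (hypothetical helper)."""
    # Posterior mean of the deviance (measure of fit).
    d_bar = np.mean([deviance(y, mu, sigma) for mu in mu_samples])
    # Deviance evaluated at the posterior mean of the parameter.
    d_at_mean = deviance(y, np.mean(mu_samples), sigma)
    # Effective number of parameters (measure of complexity).
    p_d = d_bar - d_at_mean
    return {"d_bar": d_bar, "d_at_mean": d_at_mean, "p_d": p_d, "dic": d_bar + p_d}


rng = np.random.default_rng(0)
y = rng.normal(loc=2.0, scale=1.0, size=50)

# With a flat prior and known sigma = 1 the posterior is Normal(mean(y), 1/n),
# so draws are taken directly here instead of running a sampler.
mu_samples = rng.normal(loc=y.mean(), scale=1.0 / np.sqrt(y.size), size=20000)

result = dic_from_samples(y, mu_samples)
print(result)
```

In this toy setting there is one free parameter, so `result["p_d"]` should come out close to 1, up to Monte Carlo error. With output from a real sampler, only `mu_samples` and the `deviance` function would need to change.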

### Citations

2321 | Estimating the dimension of a model - Schwarz - 1978
Citation Context ... actually true, nor do we wish to formulate a decision problem of strict model choice. For a non-hierarchical model with p parameters and n observations, the Bayes (or Schwarz) information criterion (Schwarz, 1978) given by $\mathrm{BIC} = -2 \log p(y \mid \hat{\theta}) + p \log n$ has been widely promoted, but its implementation for hierarchical models has been controversial. This is due to the uncertainty concerning the proper value... |

1329 | Generalized Additive Models - Hastie, Tibshirani - 1990
Citation Context ...ation of the effective number of parameters with the trace of the 'hat' matrix is a standard result in linear modelling, and extends to the general class of smoothing and generalised additive models (Hastie and Tibshirani, 1990)[Sec 3.5], and is also the conclusion of Hodges and Sargent (1998) in the context of general linear models. The advantage of using the deviance formulation for specifying $p_D$ is that all matrix manip... |

1142 | Spatial interaction and the statistical analysis of lattice systems - Besag - 1974
Citation Context ...dels 1 and 2, $\gamma_i$ are exchangeable random effects with a Normal prior distribution having zero mean and precision $\tau_\gamma$, and $\phi_i$ are spatial random effects with a conditional autoregressive prior (Besag, 1974) given by $\phi_i \mid \phi_{\setminus i} \sim \mathrm{Normal}\left( \frac{1}{n_i} \sum_{j \in A_i} \phi_j,\ \frac{1}{n_i \tau_\phi} \right)$. Table columns: Model, $\bar{D}$, $D(\bar{\theta})$, $p_D$, DIC. 1 pooled: 381.7, 380.7, 1.0, 382.7; 2 cov: 248.7, 238.6, 2.1, 242.8; 3 saturated: 56.0, 3.1, 52.9, 108... |

1116 | Pattern Recognition and Neural Networks - Ripley - 1996
Citation Context ...tions, these methods clearly cannot be directly applied (Gelfand and Dey, 1994). The most ambitious attempts to tackle this problem appear in the neural network literature (Moody, 1992; MacKay, 1995; Ripley, 1996). In the next section we follow Dempster (1974) (recently reprinted as Dempster (1997b)) in basing comparisons on the posterior distributions of the deviance (-2 log-likelihood + some standardising f... |

1042 | Bayesian Theory - Bernardo, Smith - 1994 |

523 | Bayesian interpolation - MacKay - 1992 |

443 | On Bayesian analysis of mixtures with an unknown number of components (with discussion) - Richardson, Green - 1998 |

433 | Bayesian inference in statistical analysis - Box, Tiao - 1992
Citation Context ...aging (Draper, 1995). Selecting a single model is a complex procedure involving background knowledge and other factors such as the robustness of inferences to alternative models with similar support (Box and Tiao, 1973): model choice may be unnecessary in the first place and is certainly very difficult to formalise. We rather view DIC as a method for screening alternative formulations in order to produce a list of ... |

316 | Random-effects models for longitudinal data - Laird, Ware - 1982 |

289 | Approximate inference in generalized linear mixed models - Breslow, Clayton - 1993
Citation Context ...tial distribution of lip cancer in Scotland. To illustrate the practical application of our suggestion, we analyse data on the rates of lip cancer in 56 counties in Scotland (Clayton and Kaldor, 1987; Breslow and Clayton, 1993). The data include observed ($y_i$) and expected ($E_i$) numbers of cases for each county $i$ (where the expected counts are based on the age- and sex-standardised national rate applied to the population... |

283 | Bayes and Empirical Bayes Methods for Data Analysis - Carlin, Louis - 1996 |

183 | Bayesian Model Choice: Asymptotics and Exact Calculations - Gelfand, Dey - 1994
Citation Context ... [...g, University of Minnesota, Minneapolis, MN 55455-0392, USA: e-mail brad@muskie.biostat.umn.edu] generally outnumber observations, these methods clearly cannot be directly applied (Gelfand and Dey, 1994). The most ambitious attempts to tackle this problem appear in the neural network literature (Moody, 1992; MacKay, 1995; Ripley, 1996). In the next section we follow Dempster (1974) (recently reprint... |

171 | Statistical theory: the prequential approach - Dawid - 1984
Citation Context ... and postdictive criteria that assess assumptions conditional on the observed data. Predictive criteria: The basis for such criteria can be thought of as a sequential series of predictive statements (Dawid, 1984) which, if a full probability model is being assessed, becomes the marginal likelihood $p(y) = \int p(y \mid \phi)\, p(\phi)\, d\phi$. The resulting Bayes factors (Kass and Raftery, 1995) may be used to obtain posterior p... |

169 | The Effective Number of Parameters: An Analysis of generalization and regularization in nonlinear learning systems - Moody - 1992
Citation Context ...generally outnumber observations, these methods clearly cannot be directly applied (Gelfand and Dey, 1994). The most ambitious attempts to tackle this problem appear in the neural network literature (Moody, 1992; MacKay, 1995; Ripley, 1996). In the next section we follow Dempster (1974) (recently reprinted as Dempster (1997b)) in basing comparisons on the posterior distributions of the deviance (-2 log-likel... |

149 | Network information criterion—determining the number of hidden units for an artificial neural network model. Neural Networks - Murata, Yoshizawa, et al. - 1994 |

145 | Generalized Linear Models (2nd Edition) - McCullagh, Nelder - 1989
Citation Context ...identifiability in the above parameterisations. However, this does not influence the model comparison which is based only on the fitted $\theta_i$'s. For this Poisson model we adopt the classical deviance (McCullagh and Nelder, 1989)[p 34] $D_S(\theta) = 2 \sum_i \left[ y_i \log \frac{y_i}{e^{\theta_i} E_i} - (y_i - e^{\theta_i} E_i) \right]$ obtained by taking $-2 \log f(y) = -2 \sum_i \log p(y_i \mid \theta_i = \log \frac{y_i}{E_i}) = 208.0$ as the standardising factor. For e... |

140 | Probable networks and plausible predictions – a review of practical Bayesian models for supervised neural networks - MacKay - 1995
Citation Context ...number observations, these methods clearly cannot be directly applied (Gelfand and Dey, 1994). The most ambitious attempts to tackle this problem appear in the neural network literature (Moody, 1992; MacKay, 1995; Ripley, 1996). In the next section we follow Dempster (1974) (recently reprinted as Dempster (1997b)) in basing comparisons on the posterior distributions of the deviance (-2 log-likelihood + some s... |

102 | Analysis of multivariate Probit models - Chib, Greenberg - 1998 |

89 | Bayes factors and model uncertainty - Kass, Raftery - 1995
Citation Context ...ther suggestions (Aitkin, 1991). Bayesian model comparison using Schwarz's information criterion as a Bayes factor approximation also requires specification of the number of parameters in each model (Kass and Raftery, 1995). Unfortunately, in complex hierarchical models, in which parameters [MRC Biostatistics Unit, Institute of Public Health, Robinson Way, Cambridge CB2 2SR, UK: e-mail david.spiegelhalter@mrc-bsu.cam.ac...] |

71 | Expected information as expected utility - Bernardo - 1979
Citation Context ...sume the 'true' model is $p(Y_{rep} \mid \theta)$, and the loss in using an estimate $\tilde{\theta}$ is given by $L(\theta, \tilde{\theta}) = E_{Y_{rep} \mid \theta}[-2 \log p(Y_{rep} \mid \tilde{\theta})]$, the predicted loss using a proper logarithmic scoring rule (Bernardo, 1979). Denote $-2 \log p(Y_{rep} \mid \tilde{\theta})$ by $D_{rep}(\tilde{\theta})$. Then following the approach of Ripley (1996)[p33], this loss can be broken down into $L(\theta, \tilde{\theta}) = E_{Y_{rep} \mid \theta}[D_{rep}(\tilde{\theta}) - D_{rep}(\theta)]$ + E Yr... |

64 | Bayes estimates for the linear model (with discussion) - Lindley, Smith - 1972 |

61 | Model choice: a minimum posterior predictive loss approach - Gelfand, Ghosh - 1998 |

61 | Predictive model selection - Laud, Ibrahim - 1995 |

56 | Statistical theory and methodology in science and engineering, 2nd Ed - Brownlee - 1965 |

44 | Probability Forecasting - Dawid - 1986
Citation Context ...erior Bayes factors (PDF) with AIC and BIC (Aitkin, 1997). We feel happier with our use of the log-likelihood in that the log probability ordinate is a proper scoring rule for evaluating predictions (Dawid, 1986), in contrast to the ordinate itself. In addition, the resulting penalty for complexity in the PDF appears insufficient. Laud and Ibrahim (1995) and Gelfand and Ghosh (1998) suggest minimising a pred... |

29 | A likelihood-based method for analysing longitudinal binary responses - Fitzmaurice, Laird - 1993 |

28 | The direct use of likelihood for significance testing - Dempster - 1997 |

27 | Posterior Bayes factors (with discussion) - Aitkin - 1991
Citation Context ...complexity is accompanied by better fit, models are compared by trading these two quantities off using likelihood ratio tests, Akaike's information criterion, or one of a number of other suggestions (Aitkin, 1991). Bayesian model comparison using Schwarz's information criterion as a Bayes factor approximation also requires specification of the number of parameters in each model (Kass and Raftery, 1995). Unfor... |

20 | Counting degrees of freedom in hierarchical and other richly parameterised models - Hodges, Sargent - 1998 |

20 | The Schwarz criterion and related methods for normal linear models - Pauler - 1998 |

16 | Empirical Bayes estimates of age-standardised relative risks for use in disease mapping - Clayton, Kaldor - 1987
Citation Context ...A running example: the spatial distribution of lip cancer in Scotland. To illustrate the practical application of our suggestion, we analyse data on the rates of lip cancer in 56 counties in Scotland (Clayton and Kaldor, 1987; Breslow and Clayton, 1993). The data include observed ($y_i$) and expected ($E_i$) numbers of cases for each county $i$ (where the expected counts are based on the age- and sex-standardised national rat... |

13 | Scale mixtures of normality - Andrews, Mallows - 1974
Citation Context ... $\log\left(\frac{\tau}{d\pi}\right) - 2 \log \Gamma\left(\frac{d+1}{2}\right) + 2 \log \Gamma\left(\frac{d}{2}\right) \}$. A well-known alternative to direct fitting of many symmetric but nonnormal error distributions is through scale mixtures of normals (Andrews and Mallows, 1974). From p.210 of Carlin and Louis (1996), we have the alternate $t_d$ formulation Model 5: $y_i \sim \mathrm{Normal}\left(\mu_i, \frac{1}{w_i \tau}\right)$; $w_i \sim \frac{1}{d} \chi^2_d = \mathrm{Gamma}\left(\frac{d}{2}, \frac{d}{2}\right)$; and correspondin... |

12 | Bayesian analysis of realistically complex models - Best, Spiegelhalter, et al. - 1996 |

12 | Markov Chain Monte Carlo Methods in Practice - Gilks, Richardson, et al. - 1996
Citation Context ...w models. 1 Introduction. The development of Markov chain Monte Carlo (MCMC) has made it possible to fit increasingly large classes of models with the aim of exploring real-world complexities of data (Gilks et al., 1996). Being able to fit such models naturally leads to the wish to compare alternative formulations with the aim of identifying a class of succinct plausible models: for example, we might ask whether we ... |

7 | The calibration of P-values, posterior Bayes factors and the AIC from the posterior distribution of the likelihood - Aitkin - 1997
Citation Context ...tes of their predictive ability on a replicate dataset. Aitkin (1991) suggested using the posterior mean of the likelihood, and contrasts the resulting posterior Bayes factors (PDF) with AIC and BIC (Aitkin, 1997). We feel happier with our use of the log-likelihood in that the log probability ordinate is a proper scoring rule for evaluating predictions (Dawid, 1986), in contrast to the ordinate itself. In add... |

5 | Hierarchical generalised linear models (with discussion) - Lee, Nelder - 1996 |

5 | BUGS: Bayesian inference Using - Spiegelhalter, Thomas, et al. - 1996
Citation Context ...$E_i - (y_i - e^{\theta_i} E_i)$ obtained by taking $-2 \log f(y) = -2 \sum_i \log p(y_i \mid \theta_i = \log \frac{y_i}{E_i}) = 208.0$ as the standardising factor. For each model we ran an MCMC sampler in BUGS (Spiegelhalter et al., 1996a) for 5000 iterations following a burn-in period of 1000 iterations. As suggested by Dempster (1974), Figure 1 shows a kernel-density smoothed plot of the resulting posterior distributions of the de... |

5 | Evaluation of highly complex modeling procedures with Binomial and Poisson data. Unpublished manuscript - Ye, Wong - 1997 |

3 | Beta-binomial Anova for proportions - Crowder - 1978 |

3 | Assessment and propagation of model uncertainty - Draper - 1995
Citation Context ...-fit, and the other a penalty for increasing model complexity. We need to emphasise that we do not recommend that DIC be used as a strict criterion for model choice or as a basis for model averaging (Draper, 1995). Selecting a single model is a complex procedure involving background knowledge and other factors such as the robustness of inferences to alternative models with similar support (Box and Tiao, 1973)... |

3 | A Bayesian model selection criterion - Raghunathan - 1988 |

3 | Bayesian graphical modelling applied to random effects in meta-analysis - Smith, Spiegelhalter, et al. - 1995 |

1 | Commentary on the paper by Murray Aitkin, and on discussion by Mervyn Stone - Dempster - 1997 |

1 | BUGS Examples Volume 1, Version 0.5, (version ii). MRC Biostatistics Unit, Cambridge - Spiegelhalter, Thomas, et al. - 1996 |