## Bayesian Data Analysis for Data Mining (2002)

Venue: | Handbook of Data Mining |

Citations: | 1 - 0 self |

### BibTeX

@INPROCEEDINGS{Madigan02bayesiandata,

author = {David Madigan and Greg Ridgeway},

title = {Bayesian Data Analysis for Data Mining},

booktitle = {Handbook of Data Mining},

year = {2002},

pages = {103--132},

publisher = {MIT Press}

}

### Abstract

Introduction: The Bayesian approach to data analysis computes conditional probability distributions of quantities of interest (such as future observables) given the observed data. Bayesian analyses usually begin with a full probability model, a joint probability distribution for all the observable and unobservable quantities under study, and then use Bayes' theorem (Bayes, 1763) to compute the requisite conditional probability distributions (called posterior distributions). The theorem itself is innocuous enough. In its simplest form, if Q denotes a quantity of interest and D denotes data, the theorem states: P(Q | D) = P(D | Q) × P(Q) / P(D). This theorem prescribes the basis for statistical learning in the probabilistic framework. With p(Q) regarded as a probabilistic statement of prior knowledge about Q before obtaining the data D, p(Q | D) becomes a revised probabilistic statement of our knowledge about Q in the light of the data (Bernardo and Smith, 1994, p. 2). The marginal likelihood of the data, p(D), serves as normalizing constant...
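The abstract's formula is Bayes' theorem, P(Q | D) = P(D | Q) × P(Q) / P(D). A minimal numeric sketch of this update (the coin-bias setup and all numbers are hypothetical, not from the chapter):

```python
# Bayes' theorem for a discrete quantity of interest Q:
#   P(Q = q | D) = P(D | Q = q) * P(Q = q) / P(D),
# where P(D), the marginal likelihood, is the sum of the numerator over q.
# Hypothetical setup: a coin's bias Q is either 0.5 (fair) or 0.8 (loaded),
# and the data D are 8 heads in 10 flips.
from math import comb

priors = {0.5: 0.9, 0.8: 0.1}                  # prior P(Q)
heads, flips = 8, 10

def likelihood(q):
    """Binomial P(D | Q = q) for `heads` heads in `flips` flips."""
    return comb(flips, heads) * q**heads * (1 - q)**(flips - heads)

unnormalized = {q: likelihood(q) * p for q, p in priors.items()}
marginal = sum(unnormalized.values())          # P(D)
posterior = {q: u / marginal for q, u in unnormalized.items()}
# The data shift belief toward the loaded coin: posterior[0.8] rises to about 0.43.
```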

### Citations

7556 |
Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference
- Pearl
- 1988
Citation Context: ...types of edges (directed, undirected, or both) lead to different classes of probabilistic models. In what follows we will only consider acyclic directed models, also known as Bayesian networks (see, for example, Pearl, 1988). Spiegelhalter and Lauritzen (1990) presented a Bayesian analysis of acyclic directed graphical Markov models and this topic continues to attract research attention. Here we sketch the basic framewo... |

4204 |
Pattern Classification and Scene Analysis
- Duda, Hart
- 1973
Citation Context: ...important sub-class of graphical models that scale well and often yield surprisingly good predictive performance (see Hand and Yu, 2001, or Lewis, 1998). The classical Naive Bayes model (for example, Duda and Hart, 1973) imposes a conditional independence constraint, namely that the predictor variables, say, x1, ..., xk, are conditionally independent given the response variable y. Figure 10 shows a graphical Markov mo... |
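The conditional independence constraint described in this excerpt is what makes the model tractable: P(x1, ..., xk | y) factors into a product of one-dimensional terms. A minimal sketch with categorical features (the weather-style data and the add-one smoothing are illustrative assumptions, not from the chapter):

```python
from collections import Counter, defaultdict

def train(rows, labels):
    """Estimate P(y) and each P(xj | y) by counting."""
    prior = Counter(labels)
    cond = defaultdict(Counter)                 # cond[(j, y)][value] -> count
    for x, y in zip(rows, labels):
        for j, v in enumerate(x):
            cond[(j, y)][v] += 1
    return prior, cond

def predict(prior, cond, x):
    """Return argmax_y P(y) * prod_j P(xj | y), with add-one smoothing
    (the +2 denominator assumes two possible values per feature in this toy)."""
    n = sum(prior.values())
    scores = {}
    for y, c in prior.items():
        score = c / n
        for j, v in enumerate(x):
            counts = cond[(j, y)]
            score *= (counts[v] + 1) / (sum(counts.values()) + 2)
        scores[y] = score
    return max(scores, key=scores.get)

rows = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "mild"), ("rainy", "hot")]
labels = ["out", "out", "in", "in"]
prior, cond = train(rows, labels)
```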

4100 |
Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images
- Geman, Geman
- 1984
Citation Context: ...is that it results in an irreducible and aperiodic chain. The second advantage is that there is no need to compute the normalization constant of f(θ | x) since it cancels out in (8). The Gibbs sampler (Geman and Geman, 1984) is a special case of the Metropolis-Hastings algorithm and is especially popular. If θ is a multidimensional parameter, the Gibbs sampler sequentially updates each of the components of θ from the ... |
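The excerpt's point, that the Gibbs sampler updates one component of θ at a time from its full conditional, can be sketched on a standard toy target: a bivariate normal with correlation ρ, whose full conditionals are univariate normals. (The target and all settings here are illustrative, not from the chapter.)

```python
import math
import random

def gibbs_bivariate_normal(rho=0.8, n_draws=5000, seed=1):
    """Alternate draws from the two full conditionals of a standard
    bivariate normal with correlation rho:
        theta1 | theta2 ~ N(rho * theta2, 1 - rho^2)
        theta2 | theta1 ~ N(rho * theta1, 1 - rho^2)"""
    rng = random.Random(seed)
    sd = math.sqrt(1 - rho**2)
    t1 = t2 = 0.0
    draws = []
    for _ in range(n_draws):
        t1 = rng.gauss(rho * t2, sd)   # update component 1 given component 2
        t2 = rng.gauss(rho * t1, sd)   # update component 2 given component 1
        draws.append((t1, t2))
    return draws

draws = gibbs_bivariate_normal()   # sample correlation should land near 0.8
```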

1546 | Bayesian Data Analysis - Gelman, Carlin, et al. - 1997 |

1385 |
Monte Carlo Sampling Methods Using Markov Chains and Their Applications
- Hastings
- 1970
Citation Context: ...chain with a few basic strategies. However, there is still a bit of art involved in creating an efficient chain and assessing the chain's convergence. Figure 4 shows the Metropolis-Hastings algorithm (Hastings, 1970), a very general MCMC algorithm. Assume that we have a single draw θ from f(θ | x) and a proposal density for a new draw, q(θ* | θ). If we follow step 2 of the MCMC algorithm then the distribution of θ... |
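The algorithm sketched in this excerpt fits in a few lines. Below is a random-walk variant (a common special case, assumed here, in which the proposal is symmetric so the q terms cancel in the acceptance ratio); note that the target f only needs to be known up to its normalizing constant:

```python
import math
import random

def metropolis_hastings(log_f, theta0, n_draws=5000, step=1.0, seed=0):
    """Random-walk Metropolis-Hastings: propose theta* = theta + N(0, step^2),
    accept with probability min(1, f(theta*) / f(theta))."""
    rng = random.Random(seed)
    theta = theta0
    draws = []
    for _ in range(n_draws):
        proposal = theta + rng.gauss(0.0, step)
        log_accept = log_f(proposal) - log_f(theta)   # symmetric q cancels
        if math.log(rng.random()) < log_accept:
            theta = proposal                          # accept the move
        draws.append(theta)                           # else repeat current draw
    return draws

# Target: standard normal density, specified only up to a constant.
draws = metropolis_hastings(lambda t: -0.5 * t * t, theta0=0.0)
```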

1194 |
Bayesian Theory
- Bernardo, Smith
- 2000
Citation Context: ...p(Q) regarded as a probabilistic statement of prior knowledge about Q before obtaining the data D, p(Q | D) becomes a revised probabilistic statement of our knowledge about Q in the light of the data (Bernardo and Smith, 1994, p. 2). The marginal likelihood of the data, p(D), serves as normalizing constant. Computing is the big issue confronting a data miner working in the Bayesian framework. The computations required by B... |

1148 | A Bayesian method for the induction of probabilistic networks from data - Cooper, Herskovits - 1992 |

1017 | The Elements of Statistical Learning - Hastie, Tibshirani, et al. - 2009 |

962 | Learning Bayesian networks: The combination of knowledge and statistical data - Heckerman, Geiger, et al. - 1995 |

951 | Reversible jump Markov chain Monte Carlo computation and Bayesian model determination, Biometrika 82
- Green
- 1995
Citation Context: ...such as web-page visits over time or gene expression data), can detect clusters within clusters or overlapping clusters, and can make formal inference about the number of clusters. Reversible jump MCMC (Green, 1995), an MCMC algorithm that incorporates jumps between spaces of varying dimension, provides a flexible framework for Bayesian analysis of model-based clustering. Richardson and Green (1997) is a key re... |

653 | Bayesian learning for neural networks - Neal - 1996 |

650 | Markov Chain Monte Carlo in practice - Gilks, Richardson, et al. - 1996 |

519 |
Bayesian classification (AutoClass): theory and results
- Cheeseman, Stutz
- 1996
Citation Context: ...a key reference. See also Fraley and Raftery (1998) and the MCLUST software available from Raftery's website: http://www.stat.washington.edu/raftery/Research/Mclust/mclust.html, the AutoClass system (Cheeseman and Stutz, 1996), and the SNOB system (Wallace and Dowe, 2000). Cadez and Smyth (1999) and Cadez et al. (2000) present an EM algorithm for model-based clustering and describe several applications. ... |

497 | On Bayesian analysis of mixtures with an unknown number of components (with discussion) - Richardson, Green - 1997 |

460 |
An essay towards solving a problem in the doctrine of chances
- Bayes
- 1763
Citation Context: ...the observed data. Bayesian analyses usually begin with a full probability model, a joint probability distribution for all the observable and unobservable quantities under study, and then use Bayes' theorem (Bayes, 1763) to compute the requisite conditional probability distributions (called posterior distributions). The theorem itself is innocuous enough. In its simplest form, if Q denotes a quantity of interest an... |

389 | Naive (bayes) at forty: The independence assumption in information retrieval - Lewis - 1998 |

263 | Bayesian Graphical Models for Discrete Data - Madigan, York - 1995 |

230 | Bayesian Model Averaging for Linear Regression Models - Raftery, Madigan, et al. - 1997 |

223 | Bayesian Model Choice: Asymptotics and Exact Calculations - Gelfand, Dey - 1994 |

208 | Sequential updating of conditional probabilities on directed graphical structures - Spiegelhalter, Lauritzen - 1990 |

206 | Prediction with Gaussian processes: From linear regression to linear prediction and beyond - Williams - 1999 |

188 | Bayesian measures of model complexity and fit (with discussion) - Spiegelhalter, Best, et al. - 2002 |

177 | Bayesian model choice via Markov chain Monte Carlo methods - Carlin, Chib - 1995 |

164 |
The History of Statistics: The Measurement of Uncertainty before 1900
- Stigler
- 1986
Citation Context: ...for centuries and variations across sub-populations continue to attract research attention. In 1781 the illustrious French scientist Pierre Simon Laplace presented a Bayesian analysis of the sex ratio (Stigler, 1990). He used data concerning 493,472 Parisian births between 1745 and 1770. Let n = 493,472 and y1, ..., yn denote the sex associated with each of the births, yi = 1 if the ith birth is female and 0 if the ith birth ... |
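Laplace's analysis has a closed form in modern notation: with a uniform Beta(1, 1) prior on p = P(female birth) and y female births out of n, the posterior is Beta(y + 1, n - y + 1). A short sketch (the female count 241,945 is the standard historical figure; only n = 493,472 appears in the excerpt):

```python
def beta_posterior_mean(y, n, a=1.0, b=1.0):
    """Posterior mean of p under a Beta(a, b) prior with y successes in
    n Binomial trials: (y + a) / (n + a + b)."""
    return (y + a) / (n + a + b)

n = 493472    # Parisian births, 1745-1770 (from the excerpt)
y = 241945    # female births (historical figure, not in the excerpt)

mean = beta_posterior_mean(y, n)
# The posterior concentrates just below 1/2: Laplace's famous conclusion
# that a female birth is slightly less probable than a male birth.
```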

149 | Independence properties of directed markov fields - Lauritzen, Dawid, et al. - 1990 |

134 | Marginal likelihood from the Metropolis-Hastings output - Chib, Jeliazkov - 2001 |

132 | Assessment and propagation of model uncertainty (with discussion) - Draper - 1995 |

127 | Bayesian Statistical Modeling - Congdon - 2001 |

110 |
Bayesian model averaging: a tutorial (with discussion)
- Hoeting, Madigan, et al.
- 1999
Citation Context: ...that one of the candidate models actually generated the data, but empirical evidence suggests that Bayesian model averaging usually provides better predictions than any single model (see, for example, Hoeting et al., 1999), sometimes substantially better. Predictive distributions from Bayesian model averaging usually have bigger variances, more faithfully reflecting the real predictive uncertainty. Draper (1995) provid... |
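The larger predictive variance mentioned in this excerpt follows from the mixture form of the model-averaged predictive distribution: its variance is the average within-model variance plus the spread of the model-specific means. A toy sketch (all numbers hypothetical):

```python
# P(M | D): posterior model probabilities; per-model predictive means/variances.
post_model = {"M1": 0.6, "M2": 0.3, "M3": 0.1}
pred_mean = {"M1": 2.0, "M2": 2.5, "M3": 4.0}
pred_var = {"M1": 1.0, "M2": 1.2, "M3": 0.8}

# Mixture mean: E[y* | D] = sum_M P(M | D) * E[y* | M, D]
bma_mean = sum(post_model[m] * pred_mean[m] for m in post_model)

# Mixture variance = average within-model variance + between-model spread,
# so it can only exceed (or match) the average single-model variance.
bma_var = sum(post_model[m] * (pred_var[m] + (pred_mean[m] - bma_mean) ** 2)
              for m in post_model)
```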

109 | Bayesian parameter estimation via variational methods - Jaakkola, Jordan - 2000 |

89 |
Applications of a general propagation algorithm for probabilistic expert systems
- Dawid
- 1992
Citation Context: ...these probabilities are specified, the calculation of specific conditional probabilities such as Pr(B | A) can proceed via a series of local calculations without storing the full joint distribution (Dawid, 1992). To facilitate Bayesian learning for the five parameters, Spiegelhalter and Lauritzen (1990) and Cooper and Herskovits (1992) make two key assumptions that greatly simplify subsequent analysis. First... |

86 |
Pattern Classification and Scene Analysis
- Duda, Hart
- 1973
Citation Context: ...important sub-class of graphical models that scale well and often yield surprisingly good predictive performance (see Hand and Yu, 2001, or Lewis, 1998). The classical Naive Bayes model (for example, Duda and Hart, 1973) imposes a conditional independence constraint, namely that the predictor variables, say, x1, ..., xk, are conditionally independent given the response variable y. Figure 10 shows a graphical Mark... |

74 | Bayes point machines
- Herbrich, Graepel, et al.
Citation Context: ...orithmic methods such as neural networks and support vector machines yield natural Bayesian analogues. See, for example, the Gaussian process models of Williams (1998), or the Bayes point machines of Herbrich et al. (2001). In fact, Herbrich et al. present Bayes point machines as an algorithmic approximation to Bayesian inference for kernel-based predictive models. Combining multiple complex hierarchical models such a... |

72 | Bayesian Survival Analysis - Ibrahim, Chen, et al. - 2001 |

66 | A sequential particle filter method for static models - Chopin |

63 | Bayesian CART model search (with discussion) - Chipman, George, et al. - 1998 |

62 | Empirical bayes screening for multi-item associations - DuMouchel, Pregibon - 2001 |

62 | Applied Bayesian Forecasting and Time Series Analysis - Pole, West, et al. - 1994 |

56 |
Idiot's Bayes: not so stupid after all?
- Hand, Yu
- 2001
Citation Context: ...y simply by considering conditional distributions. Naive Bayes models represent an important sub-class of graphical models that scale well and often yield surprisingly good predictive performance (see Hand and Yu, 2001, or Lewis, 1998). The classical Naive Bayes model (for example, Duda and Hart, 1973) imposes a conditional independence constraint, namely that the predictor variables, say, x1, ..., xk, are conditiona... |

48 | Bayesian data mining in large frequency tables, with an application to the FDA Spontaneous Reporting System (with discussion) - DuMouchel |

42 | Accounting for model uncertainty in survival analysis improves predictive performance (with discussion) - Raftery, Madigan, et al. - 1996 |

36 | Bayes and Empirical Bayes Methods for Data Analysis, Second Edition. Chapman and Hall/CRC: Boca Raton - Carlin, Louis - 2000 |

35 | Statistical modeling: The two cultures (with discussion) - Breiman |

35 | Bayesian model averaging in proportional hazard models: Assessing the risk of stroke - Volinsky - 1997 |

34 | Bayesian treed models - Chipman, George, et al. |

34 | MML clustering of multi-state, Poisson, von Mises circular and Gaussian distributions
- Wallace, Dowe
- 2000
Citation Context: ...Fraley and Raftery (1998) and the MCLUST software available from Raftery's website: http://www.stat.washington.edu/raftery/Research/Mclust/mclust.html, the AutoClass system (Cheeseman and Stutz, 1996), and the SNOB system (Wallace and Dowe, 2000). Cadez and Smyth (1999) and Cadez et al. (2000) present an EM algorithm for model-based clustering and describe several applications. ... Section 5.1 described the BUGS software... |

32 | Bayesian estimation of hidden Markov chains: a stochastic implementation - Robert, Celeux, et al. - 1993 |

30 | A method for simultaneous variable selection and outlier identification in linear regression - Hoeting, Raftery, et al. - 1996 |

30 | Bayesian Forecasting and Dynamic Models, second edition - West, Harrison - 2008 |

29 | Eliciting prior information to enhance the predictive performance of Bayesian graphical models - Madigan, Garvin, et al. - 1995 |