## Population Markov Chain Monte Carlo (2003)

### Download Links

- [ite.gmu.edu]
- [www.cs.bham.ac.uk]
- [www.ics.uci.edu]
- DBLP

### Other Repositories/Bibliography

Venue: Machine Learning

Citations: 12 (2 self)

### BibTeX

```bibtex
@ARTICLE{Laskey03populationmarkov,
  author  = {Kathryn Blackmond Laskey and James Myers},
  title   = {Population Markov Chain Monte Carlo},
  journal = {Machine Learning},
  volume  = {50},
  number  = {1--2},
  year    = {2003},
  pages   = {175--196}
}
```

### Abstract

Stochastic search algorithms inspired by physical and biological systems are applied to the problem of learning directed graphical probability models in the presence of missing observations and hidden variables. For this class of problems, deterministic search algorithms tend to halt at local optima, requiring random restarts to obtain solutions of acceptable quality. We compare three stochastic search algorithms: a Metropolis-Hastings sampler (MHS), an evolutionary algorithm (EA), and a new hybrid algorithm called Population Markov Chain Monte Carlo, or popMCMC. PopMCMC uses statistical information from a population of MHSs to inform the proposal distributions of the individual samplers in the population. Experimental results show that popMCMC and EAs learn more efficiently than MHSs that exchange no information. Populations of MCMC samplers exhibit more diversity than populations evolving under EAs that do not satisfy the physics-inspired local reversibility conditions.

Key words: Markov chain Monte Carlo, Metropolis-Hastings algorithm, graphical probabilistic models, Bayesian networks, Bayesian learning, evolutionary algorithms
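The population-based sampling idea described in the abstract can be sketched in a few lines. The toy below is an illustration, not the paper's algorithm: the target distribution, the exchange rate `beta`, and the simplified acceptance rule (which treats the proposal as symmetric rather than applying the full Hastings correction) are all assumptions. Each of several Metropolis-Hastings chains occasionally proposes the current state of another chain, so information about good states percolates through the population.

```python
import random
from collections import Counter

# Toy target: unnormalized probabilities over states 0..4.
WEIGHTS = [1.0, 2.0, 8.0, 2.0, 1.0]

def step(x, population, beta=0.2, rng=random):
    """One simplified popMCMC-style step for a single chain.

    With prob. 1-beta propose a uniform random state; with prob. beta
    propose the current state of another chain (information exchange).
    The Metropolis ratio below treats the proposal as symmetric, an
    illustrative simplification of the full Hastings correction.
    """
    if rng.random() < beta and population:
        y = rng.choice(population)
    else:
        y = rng.randrange(len(WEIGHTS))
    accept = min(1.0, WEIGHTS[y] / WEIGHTS[x])
    return y if rng.random() < accept else x

def run(n_chains=5, n_iters=4000, seed=0):
    """Run a population of chains and tally the visited states."""
    rng = random.Random(seed)
    chains = [rng.randrange(len(WEIGHTS)) for _ in range(n_chains)]
    counts = Counter()
    for _ in range(n_iters):
        snapshot = list(chains)  # states other chains may copy
        chains = [step(x, snapshot, rng=rng) for x in chains]
        counts.update(chains)
    return counts

counts = run()
```

With enough iterations the empirical frequencies concentrate on the high-weight state, matching the target's mode.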

### Citations

7493 | Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference - Pearl - 1988

Citation context: "...standard EA and a population of independent Metropolis-Hastings samplers (MHS). As an experimental testbed, we have chosen the problem of learning directed graphical models, or Bayesian Networks (BNs) [Pearl, 1988; Jensen, 1994]. We consider problems with missing observations and hidden variables, because most current approaches to such problems, which rely on local deterministic search, are acknowledged to be..."

4055 | Stochastic Relaxation, Gibbs Distributions, and the Bayesian Restoration of Images - Geman, Geman - 1984

Citation context: "...models of physical systems that seek a state of minimal free energy. Markov Chain Monte Carlo algorithms have also been used more recently in statistical inference and artificial intelligence [1], [2], [3]. Statistical physicists model physical systems in terms of their macrostates and microstates. A macrostate is a system's observable components, such as temperature and pressure. A microstate is the n..."

2526 | Equation of state calculations by fast computing machines - Metropolis, Rosenbluth, et al. - 1953

2168 | An Introduction to Probability Theory and Its Applications - Feller - 1971

Citation context: "...detailed balance or local reversibility (Gilks, et al. 1996; Neal, 1993): T(y^(n) | y^(c)) p(G^(c), x_mis^(c) | x_obs) = T(y^(c) | y^(n)) p(G^(n), x_mis^(n) | x_obs) (14). This implies [Feller, 1968] that the distribution p(G, x_mis | x_obs) is a stationary distribution for the chain. For our problem, the configuration space is finite. Thus, if all configurations are reachable from all other conf..."

1609 | Statistical Analysis with Missing Data - Little, Rubin - 1987

1508 | Bayesian Data Analysis - Gelman, Carlin, et al. - 1995

Citation context: "...s not destroy ergodicity or worsen the convergence rate. Some authors have suggested using a population of MCMC samplers to assess the variability in results from different runs of the sampler [e.g., Gelman, et al., 1995]. Multiple runs can be used to develop tests of convergence that compare within-sampler and between-sampler variation in the solution [Gelman and Rubin, 1992]. It has been suggested that performance..."

1368 | Monte Carlo sampling methods using Markov chains and their applications - Hastings - 1970

Citation context: "...not compare their sampler with a standard evolutionary algorithm. 3.2 Metropolis-Hastings Sampler The first algorithm we considered was a Metropolis-Hastings sampler (MHS) [Metropolis, et al., 1953; Hastings, 1970]. For purposes of comparison with the other algorithms, we ran a population of independent samplers in parallel, but these samplers did not exchange information with each other. A MHS is a Markov cha..."
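The Metropolis-Hastings construction quoted above can be checked numerically. The sketch below is illustrative: the three-state target `p` and asymmetric proposal `q` are invented numbers, not from the paper. It builds the full MH transition matrix, including the Hastings correction for the asymmetric proposal, and verifies detailed balance.

```python
# Build the Metropolis-Hastings transition matrix for a 3-state target
# and check detailed balance numerically. p and q are illustrative.
p = [0.2, 0.5, 0.3]                      # target distribution
q = [[0.0, 0.5, 0.5],                    # asymmetric proposal q[x][y]
     [0.3, 0.0, 0.7],
     [0.6, 0.4, 0.0]]

n = len(p)
T = [[0.0] * n for _ in range(n)]
for x in range(n):
    for y in range(n):
        if x == y:
            continue
        # Hastings acceptance ratio for an asymmetric proposal.
        a = min(1.0, (p[y] * q[y][x]) / (p[x] * q[x][y]))
        T[x][y] = q[x][y] * a
    T[x][x] = 1.0 - sum(T[x])            # rejected mass stays put

# Detailed balance: p(x) T(x,y) == p(y) T(y,x) for all state pairs.
balanced = all(abs(p[x] * T[x][y] - p[y] * T[y][x]) < 1e-12
               for x in range(n) for y in range(n))
```

Detailed balance in turn implies stationarity: summing p(x) T(x,y) over x recovers p(y), which is the property the paper relies on.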

1349 | Local computations with probabilities on graphical structures and their application to expert systems (with discussion) - Lauritzen, Spiegelhalter - 1988

1176 | Bayes Factors - Kass, Raftery - 1995

Citation context: "...between-sampler variation in the solution [Gelman and Rubin, 1992]. It has been suggested that performance might be improved by exchanging information among multiple samplers running in parallel [e.g., Kass and Raftery, 1995; Geyer, 1991]. This approach is subject to the same difficulty as any adaptive sampler: how to use the information in a way that ensures that convergence to the target stationary distribution and de..."

1171 | Bayesian Theory - Bernardo, Smith - 1994

1165 | Graphical Models - Lauritzen - 1996

1140 | A Bayesian Method for the Induction of Probabilistic Networks from Data - Cooper, Herskovits - 1992

Citation context: "...tion (1). In Bayesian learning, a prior distribution is defined over graph structures and local distributions, and the cases are used to infer a posterior distribution. The most common approach [e.g., Cooper and Herskovits, 1992] is to assign a prior probability q(G) to each graph and independent Dirichlet distributions g(θ_i1, ..., θ_ik | G) for each of the conditional distributions θ_ijc..."

981 | An Introduction to Bayesian Networks - Jensen - 1996

941 | Evolutionary Algorithms in Theory and Practice - Bäck - 1996

Citation context: "...from the sampler. In examining ways to incorporate information exchange, it seems natural to consider evolutionary algorithms (EAs), a class of stochastic algorithms modeled after biological systems [Bäck, 1996; Fogel, 1991; Schwefel, 1995; Holland, 1995]. In an EA, a population of simulated solutions evolves according to a Darwinian process of survival of the fittest..."

926 | An Analysis of the Behavior of a Class of Genetic Adaptive Systems - DeJong - 1975

Citation context: "...search spaces an EA may converge prematurely to a sub-optimal mode, leaving modes with better solutions unexplored. The EA community has tried many approaches to alleviate this problem, such as niching [DeJong 1975], speciation [Spears 1994] and adaptive mutation [Kitano 1990]. To date, there is no agreed-upon approach, but there are many promising prospects. We conjectured that the Hastings correction in (18)..."

903 | A Tutorial on Learning With Bayesian Networks - Heckerman - 1995

817 | Inference from Iterative Simulation Using Multiple Sequences - Gelman, Rubin - 1992

Citation context: "...s from different runs of the sampler [e.g., Gelman, et al., 1995]. Multiple runs can be used to develop tests of convergence that compare within-sampler and between-sampler variation in the solution [Gelman and Rubin, 1992]. It has been suggested that performance might be improved by exchanging information among multiple samplers running in parallel [e.g., Kass and Raftery, 1995; Geyer, 1991]. This approach is subject..."

639 | Markov Chain Monte Carlo in Practice - Gilks, Richardson, et al. - 1996

Citation context: "...as models of physical systems that seek a state of minimal free energy. Markov Chain Monte Carlo algorithms have also been used more recently in statistical inference and artificial intelligence [1], [2], [3]. Statistical physicists model physical systems in terms of their macrostates and microstates. A macrostate is a system's observable components, such as temperature and pressure. A microstate is..."

594 | Probabilistic inference using Markov chain Monte Carlo methods - Neal - 1993

Citation context: "...T(y^(n) | y^(c)) = R(y^(n) | y^(c)) A(y^(n) | y^(c)) for y^(n) ≠ y^(c) (13). It is straightforward to verify that this transition distribution satisfies a condition known as detailed balance or local reversibility (Gilks, et al. 1996; Neal, 1993): T(y^(n) | y^(c)) p(G^(c), x_mis^(c) | x_obs) = T(y^(c) | y^(n)) p(G^(n), x_mis^(n) | x_obs) (14). This implies [Feller, 1968] that the distribution p(G, x_mis | x_obs) is a stat..."

558 | Uniform Crossover in Genetic Algorithms - Syswerda - 1989

Citation context: "...the first parent. This process continues μ times until the next generation is populated. The crossover operator we used for both missing data and structure is called parameterized uniform crossover [Syswerda 1989], [DeJong and Spears 1990]. Parameterized uniform crossover selects a subset of the genes at random and exchanges the values of the genes between the parents. Figure 5 illustrates the crossover opera..."
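Parameterized uniform crossover as described in this context is easy to sketch. The function below is a generic illustration over fixed-length gene lists; the function name and the default swap rate are assumptions for the example, not details from the paper.

```python
import random

def uniform_crossover(parent_a, parent_b, p_swap=0.5, rng=random):
    """Parameterized uniform crossover: each gene position is swapped
    between the two parents independently with probability p_swap."""
    child_a, child_b = list(parent_a), list(parent_b)
    for i in range(len(child_a)):
        if rng.random() < p_swap:
            # Exchange the gene values at this position.
            child_a[i], child_b[i] = child_b[i], child_a[i]
    return child_a, child_b
```

At `p_swap=0` the children are copies of the parents; at `p_swap=1` they are fully exchanged; intermediate values tune how much genetic material crosses over, which is what "parameterized" refers to.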

507 | Evolution and Optimum Seeking - Schwefel - 1995

Citation context: "...ining ways to incorporate information exchange, it seems natural to consider evolutionary algorithms (EAs), a class of stochastic algorithms modeled after biological systems [Bäck, 1996; Fogel, 1991; Schwefel, 1995; Holland, 1995]. In an EA, a population of simulated solutions evolves according to a Darwinian process of survival of the fittest. Information exchange between..."

451 | Maximum likelihood estimation from incomplete data via the EM algorithm (with discussion) - Dempster, Laird, et al. - 1977

330 | A tutorial on learning Bayesian networks - Heckerman - 1995

Citation context: "...for the parameters θ_ikc [e.g., Cooper and Herskovits, 1992]. If stronger prior information is available, it can be incorporated by specifying a non-uniform member of the natural conjugate family [Heckerman and Geiger, 1995]. Another convenient feature of the above family of prior distributions is the existence of a local decomposition of the marginal likelihood, or the probability of the observations conditional only o..."

328 | Designing neural networks using genetic algorithms with graph generation system - Kitano - 1990

Citation context: "...de, leaving modes with better solutions unexplored. The EA community has tried many approaches to alleviate this problem, such as niching [DeJong 1975], speciation [Spears 1994] and adaptive mutation [Kitano 1990]. To date, there is no agreed-upon approach, but there are many promising prospects. We conjectured that the Hastings correction in (18) might be useful in mitigating the problem of genetic drift, be..."

258 | Bayesian graphical models for discrete data - Madigan, York - 1995

247 | The ALARM monitoring system: A case study with two probabilistic inference techniques for belief networks - Beinlich, Suermondt, et al. - 1989

245 | Learning Bayesian networks with local structure - Friedman, Goldszmidt - 1996

230 | The EM algorithm for graphical association models with missing data - Lauritzen - 1995

Citation context: "...ms with missing observations and hidden variables, because most current approaches to such problems, which rely on local deterministic search, are acknowledged to be prone to halting at local optima [Lauritzen, 1995; Friedman, 1998a,b]. The stochastic nature of our algorithms allows movement away from local optima, while information exchange allows building blocks of good solutions to percolate within a populati..."

223 | The Bayesian Structural EM Algorithm - Friedman - 1998

Citation context: "...mplexity to the learning problem. A version of the EM algorithm has been applied to the problem of learning BNs with missing observations under the assumption that observations are missing at random [Friedman, 1998a,b]. EM and its variants are local hill-climbing searches, and thus can become trapped at local optima, especially for certain patterns of missing observations. Stochastic search is attractive as a w..."

215 | Markov chain Monte Carlo maximum likelihood - Geyer - 1991

Citation context: "...in the solution [Gelman and Rubin, 1992]. It has been suggested that performance might be improved by exchanging information among multiple samplers running in parallel [e.g., Kass and Raftery, 1995; Geyer, 1991]. This approach is subject to the same difficulty as any adaptive sampler: how to use the information in a way that ensures that convergence to the target stationary distribution and desirable asymp..."

189 | Statistical theory: the prequential approach - Dawid - 1984

140 | Learning Bayesian Networks is NP-Hard - Chickering, Geiger, et al. - 1994

Citation context: "...and Heckerman, et al., [10] have developed equations for this distribution, referred to as the Bayesian Dirichlet score. For complete data the Bayesian Dirichlet is closed, but the problem is NP-hard [11]. Fortunately, researchers have developed algorithms for finding "good" networks using greedy deterministic search methods over networks. The problem becomes far more complex with incomplete data beca..."

131 | Learning Belief Networks in the Presence of Missing Values and Hidden Variables - Friedman - 1997

Citation context: "...mplexity to the learning problem. A version of the EM algorithm has been applied to the problem of learning BNs with missing observations under the assumption that observations are missing at random [Friedman, 1998a,b]. EM and its variants are local hill-climbing searches, and thus can become trapped at local optima, especially for certain patterns of missing observations. Stochastic search is attractive as a w..."

129 | Assessment and propagation of model uncertainty (with discussion) - Draper - 1995

79 | Structure learning of Bayesian networks by genetic algorithms: a performance analysis of control parameters - Larranaga, Yurramendi, et al. - 1996

77 | An Introduction to Probability Theory and Its Applications, Third ed. - Feller - 1968

Citation context: "...persists forever once reached. If p(s) is a stationary distribution, then (5) can be rewritten as p(s) = Σ_s' p(s') T(s | s') (3). If a Markov chain satisfies certain regularity conditions [4], then it converges to a unique stationary distribution. We can construct a Markov chain with a specified Boltzmann distribution as its stationary distribution by ensuring that the transition probabili..."
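The stationarity condition in this context, p(s) = Σ_s' p(s') T(s | s'), together with the regularity conditions mentioned, implies convergence to a unique stationary distribution from any start. A minimal numeric illustration (the 2-state transition matrix is invented for the example):

```python
# Iterate mu_{n+1}(y) = sum_x mu_n(x) T(x, y) for a regular 2-state chain.
# From any starting distribution it converges to the unique stationary
# distribution, which for this T is pi = (0.75, 0.25) since pi = pi T.
T = [[0.9, 0.1],
     [0.3, 0.7]]
mu = [1.0, 0.0]  # arbitrary starting distribution
for _ in range(200):
    mu = [sum(mu[x] * T[x][y] for x in range(2)) for y in range(2)]
```

After 200 iterations `mu` agrees with the fixed point to well beyond floating-point display precision, since the error shrinks geometrically (here by a factor of 0.6 per step).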

71 | Principles of Quantum Mechanics - Shankar - 1981

70 | Adaptive Markov chain Monte Carlo through regeneration - Gilks, Roberts, et al. - 1998

69 | An analysis of the interacting roles of population size and crossover in genetic algorithms (Parallel Problem Solving from Nature) - De Jong, Spears - 1991

Citation context: "...This process continues μ times until the next generation is populated. The crossover operator we used for both missing data and structure is called parameterized uniform crossover [Syswerda 1989], [DeJong and Spears 1990]. Parameterized uniform crossover selects a subset of the genes at random and exchanges the values of the genes between the parents. Figure 5 illustrates the crossover operation on graph structures..."

67 | Equation of state calculations by fast computing machines - Metropolis, Rosenbluth, et al. - 1953

Citation context: "...Boltzmann distribution). There are several common ways to construct a sampler that satisfies detailed balance. We applied one of the most common sampling approaches, known as Metropolis-Hastings sampling [5], [6]. The Metropolis-Hastings algorithm samples from a joint distribution by repeatedly generating random changes to the variables and then accepting or rejecting the changes in a way that preserves..."

64 | Bayes factors and choice criteria for linear models - Smith, Spiegelhalter - 1980

57 | A Markov chain framework for the simple genetic algorithm - Davis, Principe - 1993

Citation context: "...to genetic reproduction. Asymptotic behavior of EAs is also typically analyzed using Markov chains, but unlike MCMC, characterizing the stationary distribution of an EA can be difficult [DeJong, 1975; Davis and Principe, 1993]. This paper describes a modification of an MCMC sampler to incorporate information exchange among solutions in a population of Metropolis-Hastings samplers. The individual samplers are adaptive, usi..."

55 | Simple subpopulation schemes - Spears - 1994

Citation context: "...erge prematurely to a sub-optimal mode, leaving modes with better solutions unexplored. The EA community has tried many approaches to alleviate this problem, such as niching [DeJong 1975], speciation [Spears 1994] and adaptive mutation [Kitano 1990]. To date, there is no agreed-upon approach, but there are many promising prospects. We conjectured that the Hastings correction in (18) might be useful in mitigat..."

49 | Bayesian Model Averaging: A - Hoeting, Madigan, et al. - 1999

Citation context: "...there is evidence that at least in some applications averaging multiple models can improve predictive performance over use of the single best model found by the search process [Madigan and Raftery, 1994; Hoeting, et al., 1996]. We compared predictive performance of the single best model against predictive performance of a prediction constructed by averaging the predictions of all models in the population. Because the stat..."

49 | Asymptotic model selection for directed networks with hidden variables - Geiger, Heckerman, et al. - 1996

Citation context: "...data problem use a combination of the deterministic greedy search and parameter approximation approaches such as the Expectation-Maximization algorithm [12], [13], [14], [15] and Laplace approximation [16]. Because the parameter space is multi-modal and complex, see [13], [15], [17], [18], these algorithms "get stuck" on the nearest local maximum, requiring multiple random restarts. Our approach to solv..."

43 | The Structure of Scientific Revolutions, 3rd edition - Kuhn

35 | Calculus of Variations: with applications to physics and engineering - Weinstock - 1974

29 | Bayesian Inference, in: Kendall's Advanced Theory of Statistics 2A - O'Hagan - 1994

23 | Learning Bayesian networks from incomplete data with stochastic search algorithms - Myers, Laskey, et al. - 1999

Citation context: "..."get stuck" on the nearest local maximum, requiring multiple random restarts. Our approach to solving this problem is to use stochastic search algorithms such as evolutionary algorithms and MCMC [19], [20], [21]. These algorithms avoid the problem of "getting stuck" on the nearest local maximum by introducing stochastic "jumps" over the parameter space. In both families of algorithms the jumps are bias..."

22 | Graphical models and exponential families - Geiger, Meek - 1998

Citation context: "...approximation approaches such as the Expectation-Maximization algorithm [12], [13], [14], [15] and Laplace approximation [16]. Because the parameter space is multi-modal and complex, see [13], [15], [17], [18], these algorithms "get stuck" on the nearest local maximum, requiring multiple random restarts. Our approach to solving this problem is to use stochastic search algorithms such as evolutionary a..."