## Adaptively scaling the Metropolis algorithm using expected squared jumped distance (2003)

Citations: 16 (0 self)

### BibTeX

```bibtex
@techreport{Pasarica03adaptivelyscaling,
  author      = {Cristian Pasarica and Andrew Gelman},
  title       = {Adaptively scaling the Metropolis algorithm using expected squared jumped distance},
  institution = {},
  year        = {2003}
}
```

### Abstract

Using existing theory on efficient jumping rules and on adaptive MCMC, we construct and demonstrate the effectiveness of a workable scheme for improving the efficiency of Metropolis algorithms. A good choice of the proposal distribution is crucial for the rapid convergence of the Metropolis algorithm. In this paper, given a family of parametric Markovian kernels, we develop an algorithm for optimizing the kernel by maximizing the expected squared jumped distance, an objective function that characterizes the Markov chain under its d-dimensional stationary distribution. The algorithm uses the information accumulated by a single path and adapts the choice of the parametric kernel in the direction of the local maximum of the objective function using multiple importance sampling techniques. We follow a two-stage approach: a series of adaptive optimization steps followed by an MCMC run with fixed kernel. It is not necessary for the adaptation itself to converge. Using several examples, we demonstrate the effectiveness of our method, even for cases in which the Metropolis transition kernel is initialized at very poor values.
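
The two-stage approach described in the abstract can be illustrated with a deliberately simplified sketch: run a random-walk Metropolis chain for each candidate scale, estimate the expected squared jumped distance (ESJD) empirically, and keep the scale that maximizes it before the final fixed-kernel run. This is not the paper's algorithm (which adapts along a single path using multiple importance sampling); the grid of candidate scales and the one-dimensional standard-normal target are assumptions made here for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

def metropolis_esjd(log_target, x0, scale, n_steps):
    """Random-walk Metropolis; returns the chain and its empirical ESJD."""
    x = x0
    chain = np.empty(n_steps + 1)
    chain[0] = x
    for t in range(n_steps):
        prop = x + scale * rng.standard_normal()
        # Metropolis accept/reject for a symmetric proposal
        if np.log(rng.random()) < log_target(prop) - log_target(x):
            x = prop
        chain[t + 1] = x
    esjd = np.mean(np.diff(chain) ** 2)  # empirical mean of |θ_{t+1} − θ_t|²
    return chain, esjd

log_target = lambda x: -0.5 * x * x  # standard normal target, up to a constant

# Stage 1: adaptive phase -- pick the scale with the largest empirical ESJD.
scales = [0.1, 0.5, 1.0, 2.4, 10.0]
esjds = {s: metropolis_esjd(log_target, 0.0, s, 20_000)[1] for s in scales}
best_scale = max(esjds, key=esjds.get)

# Stage 2: a longer MCMC run with the kernel held fixed at the chosen scale.
final_chain, _ = metropolis_esjd(log_target, 0.0, best_scale, 50_000)
```

For a one-dimensional normal target, the ESJD-maximizing scale should land near the theoretical optimum 2.4/√d = 2.4 cited later in this page's contexts.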

### Citations

2526 | Equation of state calculations by fast computing machines - Metropolis, Rosenbluth, et al. - 1953 |

1368 | Monte Carlo sampling methods using Markov chains and their applications - Hastings - 1970 |

1021 | Monte Carlo Statistical Methods - Robert, Casella - 1999 |

859 | Numerical Recipes in C - Press, et al. - 1992 |
Citation Context: ...m. We return to this issue and discuss other choices of objective function in Section 2.4. We optimize the empirical estimator (3) using a numerical optimization algorithm such as Brent’s (see, e.g., Press et al., 2002) as we further discuss in Section 2.6. In Section 4 we discuss the computation time needed for the optimization. 2.3 Iterative optimization of the jumping kernel If the starting point is not in the n... |

629 | Markov Chains and Stochastic Stability - MEYN, TWEEDIE - 1993 |

431 | Stochastic volatility: likelihood inference and comparison with ARCH models - Kim, Shephard, et al. - 1998 |

401 | Stochastic Approximation Algorithms and Applications - Kushner, Yin - 1997 |
Citation Context: ...ping kernel Jγ(·, ·) after a fixed number of steps, and illustrates it with several examples. We also compare our procedure with the Robbins-Monro stochastic optimization algorithm (see, for example, Kushner and Yin, 2003). We describe our algorithm in Section 2 in general and in Section 3 discuss implementation with Gaussian kernels. Section 4 includes several examples, and we conclude with discussion and open proble... |

365 | Bayesian inference in econometric models using monte carlo integration - Geweke - 1989 |

362 | Algorithms for Minimization without Derivatives - Brent - 1973 |
Citation Context: ...ses, we apply our method with two starting values of (0.01, 50) × 2.4/√d for d = 25. We use an optimization procedure that is a combination of golden search and successive parabolic interpolation (see Brent, 1973) on the interval [0.01, 100]. Insert Figure 3 here (“Extreme starting points”) 4.2 Correlated normal target distribution We next illustrate adaptive scaling for a target distribution with unknown co... |
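
The golden search mentioned in this context can be sketched in a few lines. The bracket [0.01, 100] is taken from the quoted passage; the unimodal objective below is a stand-in for the paper's empirical ESJD estimator, assumed here purely for demonstration.

```python
import numpy as np

def golden_section_max(f, a, b, tol=1e-6):
    """Maximize a unimodal function f on [a, b] by golden-section search."""
    gr = (np.sqrt(5.0) - 1.0) / 2.0  # inverse golden ratio ≈ 0.618
    c = b - gr * (b - a)
    d = a + gr * (b - a)
    while b - a > tol:
        if f(c) > f(d):      # maximum lies in [a, d]
            b, d = d, c
            c = b - gr * (b - a)
        else:                # maximum lies in [c, b]
            a, c = c, d
            d = a + gr * (b - a)
    return 0.5 * (a + b)

# Stand-in unimodal objective peaked at scale 2.4, on the paper's interval.
opt = golden_section_max(lambda s: -(np.log(s) - np.log(2.4)) ** 2, 0.01, 100.0)
```

In practice, SciPy's `scipy.optimize.minimize_scalar(..., method='bounded')` implements Brent's combination of golden search with successive parabolic interpolation, which is the variant the quoted passage describes.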

189 | Sampling Techniques, third edition - Cochran - 1977 |
Citation Context: ...xample constraining the acceptance probability to the range [0, 1], and has a lower variance than the mean estimator if the correlation between the numerator and denominator is sufficiently high (see Cochran, 1977). Other choices for the empirical estimator include the mean estimator hT and estimators that use control variates that sum to 1 to correct for bias (see, for example, the regression and difference e... |

169 | Weak convergence and optimal scaling of random walk Metropolis algorithms - Roberts, Gelman, et al. - 1997 |
Citation Context: ...distribution the optimal scaling of the jumping kernel is cd = 2.4/√d (Gelman, Roberts, and Gilks, 1996). Another approach is to coerce the acceptance probability to a preset value (e.g., 23%; see Roberts, Gelman, and Gilks, 1997) with covariance kernel set by matching moments; these can be difficult to apply due to the complicated form of the target distribution, which makes the optimal acceptance probability value or analytic mo... |

140 | Efficient Metropolis jumping rules - Gelman, Roberts, et al. - 1996 |
Citation Context: ...03). These criteria are usually defined based on theoretical optimality results; for example, for a d-dimensional normal target distribution the optimal scaling of the jumping kernel is cd = 2.4/√d (Gelman, Roberts, and Gilks, 1996). Another approach is to coerce the acceptance probability to a preset value (e.g., 23%; see Roberts, Gelman, and Gilks, 1997) with covariance kernel set by matching moments; these can be difficult t... |

138 | Spatial statistics and Bayesian computation - Besag, Green - 1993 |
Citation Context: ...posed transition kernels {Jγ}γ∈Γ, where Γ is some finite-dimensional domain, in order to explore the target distribution π. Measures of efficiency in low-dimensional Markov chains are not unique (see Besag and Green, 1993, Gelman, Roberts, and Gilks, 1996, and Andrieu and Robert, 2001). We shall maximize the expected squared jumped distance (ESJD): ESJD(γ) ≜ E_Jγ[|θ_{t+1} − θ_t|²] = 2(1 − ρ₁) · var_π(θ_t), for a one-dime... |
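
The identity quoted in this context, ESJD = 2(1 − ρ₁) · var(θ_t), holds for any stationary chain and is easy to check numerically. The AR(1) chain below is an assumed stand-in for a stationary Markov chain, not the paper's Metropolis sampler.

```python
import numpy as np

rng = np.random.default_rng(1)

# Stationary AR(1) chain with lag-1 autocorrelation phi and unit variance.
phi, n = 0.6, 200_000
x = np.empty(n)
x[0] = rng.standard_normal()
for t in range(1, n):
    x[t] = phi * x[t - 1] + np.sqrt(1.0 - phi**2) * rng.standard_normal()

esjd = np.mean(np.diff(x) ** 2)            # empirical E[(θ_{t+1} − θ_t)²]
rho1 = np.corrcoef(x[:-1], x[1:])[0, 1]    # lag-1 autocorrelation ρ₁
identity = 2.0 * (1.0 - rho1) * np.var(x)  # 2(1 − ρ₁) · var(θ_t)
```

The two quantities agree up to boundary terms, which is why maximizing ESJD is equivalent to minimizing the first-order autocorrelation of the chain.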

105 | An adaptive Metropolis algorithm - Haario, Saksman, et al. - 2001 |
Citation Context: ...ce simulations from the target distribution: the Markovian property or time-homogeneity of the transition kernel is lost, and ergodicity can be proved only under some very restrictive conditions (see Haario, Saksman, and Tamminen, 2001, Holden, 1998, and Atchadé and Rosenthal, 2003). Adaptive methods that preserve the Markovian properties using regeneration have the challenge of estimation of regeneration times, which is difficult... |

97 | Optimal scaling for various Metropolis-Hastings algorithms - Roberts, Rosenthal - 2001 |

78 | On adaptive Markov chain Monte Carlo algorithms - Atchadé, Rosenthal - 2005 |
Citation Context: ...matching some criteria under the invariant distribution (e.g., Haario, Saksman, and Tamminen, 1999, Laskey and Myers, 2003, Andrieu and Robert, 2001, and Atchadé and Rosenthal, 2003). These criteria are usually defined based on theoretical optimality results; for example, for a d-dimensional normal target distribution the optimal scaling of the jumping kernel is cd = 2.4/√d (Ge... |

73 | Bayesian data analysis, second edition - Gelman, Carlin, et al. - 2003 |

70 | Adaptive Markov chain Monte Carlo through regeneration - Gilks, Roberts, et al. - 1998 |

67 | On the convergence of Monte Carlo maximum likelihood calculations - Geyer - 1994 |
Citation Context: ...on compact sets with probability 1. Proof: see Appendix. The convergence of the maximizer of hT to the maximizer of h is attained under the additional conditions of Geyer (1994). Theorem (Geyer, 1994, Theorem 4). Assume that (γT)T and γ∗ are the unique maximizers of (hT)T and h, respectively, and that they are contained in a compact set. If there exists a sequence εT → 0 such that hT(γT|γ0) ≥ sup... |

62 | Weighted average importance sampling and defensive mixture distributions - Hesterberg - 1995 |
Citation Context: ...4. In our algorithm, the “pilot data” used to estimate h will come from a series of different jumping kernels. The function h can be estimated using the method of multiple importance sampling (see Hesterberg, 1995), yielding the following algorithm based on adaptively updating the jumping kernel after steps T1, T1 + T2, T1 + T2 + T3, . . . . For k = 1, 2, 3, . . ., 1. Run the Metropolis algorithm for Tk steps... |

37 | Controlled MCMC for optimal sampling - Andrieu, Robert - 2001 |
Citation Context: ...matching some criteria under the invariant distribution (e.g., Haario, Saksman, and Tamminen, 1999, Laskey and Myers, 2003, Andrieu and Robert, 2001, and Atchadé and Rosenthal, 2003). These criteria are usually defined based on theoretical optimality results; for example, for a d-dimensional normal target distribution the optimal scaling of the ju... |

37 | Adaptive proposal distribution for random walk Metropolis algorithm - Haario, Saksman, et al. - 1999 |
Citation Context: ...matching some criteria under the invariant distribution (e.g., Haario, Saksman, and Tamminen, 1999, Laskey and Myers, 2003, Andrieu and Robert, 2001, and Atchadé and Rosenthal, 2003). These criteria are usually defined based on theoretical optimality results; for example, for a d-dimensional normal... |

37 | Coupling and ergodicity of adaptive MCMC - Roberts, Rosenthal - 2007 |

34 | Slice sampling (with discussion) - Neal - 2003 |
Citation Context: ...cal weighting of posterior simulations (as in Haario et al., 1999). We also anticipate that these methods can be generalized to optimize over more general MCMC algorithms, for example slice sampling (Neal, 2003) and Langevin algorithms, which involve a translation parameter as well as a scale for the jumping kernel and can achieve higher efficiencies than symmetric Metropolis algorithms (see Roberts and Rosen... |

30 | Some adaptive Monte Carlo methods for Bayesian inference - Tierney, Mira - 1999 |

25 | On Markov chain Monte Carlo acceleration - Gelfand, Sahu - 1994 |

23 | Estimation and optimization of functions - Geyer - 1996 |

19 | Identification of regeneration times in MCMC simulation, with application to adaptive schemes - Brockwell, Kadane - 2002 |

17 | Ordering and improving the performance of Monte Carlo Markov chains - Mira - 2001 |

11 | DRAM: Efficient adaptive MCMC - Haario, Laine, et al. - 2006 |

8 | The R Project for Statistical Computing - R Project - 2006 |

7 | Adaptive chains - Holden - 1998 |

7 | On the robustness of optimal scaling for random walk Metropolis algorithms - Bédard |

5 | Bayesian analysis of serial dilution assays - Gelman, Chew, et al. |

1 | Constrained Monte Carlo maximum likelihood for dependent data (with discussion) - Geyer, Thompson - 1992 |

1 | Efficient Metropolis jumping rules - Gelman, Roberts, et al. - 1996 |
Citation Context: ...√d (Roberts, Gelman, and Gilks, 1997). These results are based on the asymptotic limit of infinite-dimensional iid target distributions only, but in practice can be applied to dimensions as low as 5 (Gelman, Roberts, and Gilks, 1996). Extensions of these results appear in Roberts and Rosenthal (2001). Another approach is to coerce the acceptance probability to a preset value (e.g., 44% for a one-dimensional target). This can be dif... |

1 | Bayesian analysis of serial dilution assays. Biometrics, to appear - Gelman, Carlin, et al. - 2003 |

1 | Optimally combining importance sampling techniques for Monte Carlo rendering - Veach, Guibas - 1995 |