## Shotgun stochastic search for “large p” regression (2007)

Venue: Journal of the American Statistical Association

Citations: 17 (3 self)

### BibTeX

@ARTICLE{Hans07shotgunstochastic,
    author = {Chris Hans and Adrian Dobra and Mike West},
    title = {Shotgun stochastic search for “large p” regression},
    journal = {Journal of the American Statistical Association},
    year = {2007}
}

### Abstract

Model search in regression with very large numbers of candidate predictors raises challenges for both model specification and computation, and standard approaches such as Markov chain Monte Carlo (MCMC) and stepwise methods are often infeasible or ineffective. We describe a novel shotgun stochastic search (SSS) approach that explores “interesting” regions of the resulting, very high-dimensional model spaces to quickly identify regions of high posterior probability over models. We describe algorithmic and modeling aspects, priors over the model space that induce sparsity and parsimony over and above the traditional dimension penalization implicit in Bayesian and likelihood analyses, and parallel computation using cluster computers. We discuss an example from gene expression cancer genomics, comparisons with MCMC and other methods, and theoretical and simulation-based aspects of performance characteristics in large-scale regression model search. We also provide software implementing the methods.
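To make the idea of a stochastic search over model space concrete, here is a minimal sketch of a neighborhood-based search. It is not the authors' software: the neighborhood definition (add, delete, or swap one variable), the sampling rule, and the names `neighbors`, `sample_by_score`, `stochastic_search`, `toy_log_score`, and `TRUE_VARS` are all assumptions made for illustration, and the toy score stands in for a real log posterior.

```python
import math
import random

def neighbors(model, p):
    """All models reachable by adding, deleting, or swapping one variable."""
    inside = sorted(model)
    outside = [j for j in range(p) if j not in model]
    add = [model | {j} for j in outside]
    delete = [model - {j} for j in inside]
    swap = [(model - {i}) | {j} for i in inside for j in outside]
    return add + delete + swap

def sample_by_score(models, log_scores, rng):
    """Draw one model with probability proportional to exp(log_score)."""
    top = max(log_scores)  # subtract the max before exponentiating
    weights = [math.exp(s - top) for s in log_scores]
    return rng.choices(models, weights=weights, k=1)[0]

def stochastic_search(log_score, p, start, n_iter, rng, keep=10):
    """Score every neighbor at each step, record all scores, move at random."""
    current = frozenset(start)
    visited = {current: log_score(current)}
    for _ in range(n_iter):
        cands = neighbors(current, p)
        # Each neighbor is scored independently, so this loop could be
        # distributed across workers; here it runs serially.
        scores = [log_score(m) for m in cands]
        visited.update(zip(cands, scores))
        current = sample_by_score(cands, scores, rng)
    ranked = sorted(visited.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:keep]

# Hypothetical toy score: variables 1 and 4 are "true", every variable costs 1.
TRUE_VARS = frozenset({1, 4})

def toy_log_score(model):
    return 3.0 * len(model & TRUE_VARS) - 1.0 * len(model)

top_models = stochastic_search(toy_log_score, p=8, start=[], n_iter=50,
                               rng=random.Random(0))
```

The design point worth noting is that every evaluated neighbor is kept in `visited`, not only the models the search actually moves to, so the output is a ranked list of high-scoring models and not a single trajectory.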

### Citations

824 | Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika 82
- Green
- 1995
Citation Context ...lex patterns of collinearity that are typical with many variables. MCMC algorithms designed to explore the posterior distribution over regression model spaces (e.g., George and McCulloch, 1993, 1997; Green, 1995; Madigan and York, 1995; Geweke, 1996; Raftery et al., 1997; Brown et al., 1998b) rely on Gibbs sampling (Gelfand and Smith, 1990) or Metropolis-Hastings algorithms, but are increasingly ineffective ... |

740 | Sampling-based approaches to calculating marginal densities
- Gelfand, Smith
- 1990
Citation Context ...istribution over regression model spaces (e.g., George and McCulloch, 1993, 1997; Green, 1995; Madigan and York, 1995; Geweke, 1996; Raftery et al., 1997; Brown et al., 1998b) rely on Gibbs sampling (Gelfand and Smith, 1990) or Metropolis-Hastings algorithms, but are increasingly ineffective due to slow convergence in higher dimensions. Outside of the regression model context, MCMC approaches have been used for model sp... |

330 | Variable selection via Gibbs sampling
- George, McCulloch
- 1993
Citation Context ...l space with the increasingly complex patterns of collinearity that are typical with many variables. MCMC algorithms designed to explore the posterior distribution over regression model spaces (e.g., George and McCulloch, 1993, 1997; Green, 1995; Madigan and York, 1995; Geweke, 1996; Raftery et al., 1997; Brown et al., 1998b) rely on Gibbs sampling (Gelfand and Smith, 1990) or Metropolis-Hastings algorithms, but are increa... |

328 | Exploration, normalization and summaries of high density oligonucleotide array probe level data
- Irizarry, Hobbs, et al.
- 2003
Citation Context ...ive) and n1 = 48 high risk (high node positive cases). Gene expression data is available on Affymetrix HU95aV2 oligonucleotide microarrays, which were processed using the current standard RMA method (Irizarry et al., 2003a,b), to generate summary estimates of expression levels of each gene in each sample. This primary RMA data was then further screened and normalized, and we selected a total of 4,512 genes showing evi... |

256 | Bayesian Model Selection in Social Research
- Raftery
- 1995
Citation Context ...ity, p(γ|y) ∝ p(y|γ)p(γ), is evaluated for each model generated in SSS. BIC can be viewed as an approximation to the marginal likelihood of a given model, p(y|γ), under a reference prior distribution (Raftery, 1995) and so could be used in similar fashion. Other scores such as R² and AIC can be used, but the user would have to decide how to use these scores to move from model to model across iterations, i.e. h... |

226 | Bayesian graphical models for discrete data
- Madigan, York
- 1995
Citation Context ...of collinearity that are typical with many variables. MCMC algorithms designed to explore the posterior distribution over regression model spaces (e.g., George and McCulloch, 1993, 1997; Green, 1995; Madigan and York, 1995; Geweke, 1996; Raftery et al., 1997; Brown et al., 1998b) rely on Gibbs sampling (Gelfand and Smith, 1990) or Metropolis-Hastings algorithms, but are increasingly ineffective due to slow convergence ... |

218 | Summaries of Affymetrix GeneChip probe level data - Irizarry, Bolstad, et al. - 2003 |

192 | Predicting the Clinical Status of Human Breast Cancer by Using Gene Expression Profiles
- West, Blanchette, et al.
- 2001
Citation Context ... derived gene expression profiles to aid in prognosis – in this case, improved prediction of low versus high risk based on genomic information could feed into decisions about postsurgical treatments (West et al., 2001; Huang et al., 2002; Nevins et al., 2003; Huang et al., 2003; Pittman et al., 2004). Prediction of lymph node status based on gene expression profiles is a challenging problem, due to the complex het... |

186 | Bayesian model averaging for linear regression models
- Raftery, Madigan, et al.
- 1997
Citation Context ...many variables. MCMC algorithms designed to explore the posterior distribution over regression model spaces (e.g., George and McCulloch, 1993, 1997; Green, 1995; Madigan and York, 1995; Geweke, 1996; Raftery et al., 1997; Brown et al., 1998b) rely on Gibbs sampling (Gelfand and Smith, 1990) or Metropolis-Hastings algorithms, but are increasingly ineffective due to slow convergence in higher dimensions. Outside of the... |

136 | Nonparametric regression using Bayesian variable selection - Smith, Kohn - 1996 |

132 | Sparse graphical models for exploring gene expression data
- Dobra, Hans, et al.
- 2004
Citation Context ...when δ > 2, so we typically set δ = 3. To find p(γ|y), we first compute the marginal likelihood p(y|γ) = ∫ p(y|θ, γ)p(θ|γ) dθ, which has a closed-form solution under the foregoing formulation (see Dobra et al. 2004). Then, by Bayes’ theorem, the posterior probability of any model is p(γ|y) ∝ p(y|γ)p(γ). 3.2 Binary Regression In the case of independent binary outcomes, y_i, consider the logistic regression p(y... |

108 | Regression by Leaps and Bounds
- Furnival, Wilson
- 1974
Citation Context ...ises modeling and computational challenges as the number of candidate predictor variables increases. Standard methods including stepwise methods, leaps-and-bounds and Markov chain Monte Carlo (MCMC) (Furnival and Wilson, 1974; Clyde and George, 2004) can often quickly find “good” models when the number of predictors is relatively small. Stepwise methods are infeasible in higher dimensional problems, are prone to entrapmen... |

91 | The analysis and selection of variables in linear regression
- Hocking
- 1976
Citation Context ...), often can quickly identify “good” models when the number of predictors is relatively small. In higher-dimensional problems, stepwise methods are prone to entrapment in local maxima of model space (Hocking 1976), and often do not provide an adequate representation of the model space with the increasingly complex patterns of collinearity that are typical with many variables. MCMC algorithms designed to explo... |

81 | Gene expression predictors of breast cancer outcomes. Lancet
- Huang, SH, et al.
Citation Context ...his case, improved prediction of low versus high risk based on genomic information could feed into decisions about postsurgical treatments (West et al., 2001; Huang et al., 2002; Nevins et al., 2003; Huang et al., 2003; Pittman et al., 2004). Prediction of lymph node status based on gene expression profiles is a challenging problem, due to the complex heterogeneity of the disease in terms of genetic/genomic and env... |

75 | Bayesian Model Selection
- Raftery
- 1995
Citation Context ...for each model generated in SSS. The Bayesian information criterion (BIC) can be viewed as an approximation to the marginal likelihood of a given model, p(y|γ), under a reference prior distribution (Raftery 1995) and so could be used in similar fashion. Other scores, such as R² and the Akaike information criterion (AIC), can be used, but the user would have to decide how to normalize the scores into a proba... |
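As a rough illustration of how a BIC value could stand in for the marginal likelihood when scoring models, the sketch below turns BIC values into approximate posterior model probabilities via p(γ|y) ∝ exp(−BIC/2) p(γ). It assumes the convention BIC = −2 log L̂ + k log n (lower is better); the function name `approx_posterior_probs` and the numbers are hypothetical, and the normalization runs only over the models supplied, not the full model space.

```python
import math

def approx_posterior_probs(bics, priors):
    """Approximate p(model | y) from BIC values and prior model probabilities.

    exp(-BIC / 2) is used in place of the marginal likelihood p(y | model);
    the result is normalized over the supplied models only.
    """
    log_w = [-0.5 * b + math.log(pr) for b, pr in zip(bics, priors)]
    top = max(log_w)  # subtract the max for numerical stability
    w = [math.exp(v - top) for v in log_w]
    total = sum(w)
    return [v / total for v in w]

# Hypothetical BIC values for three candidate models with equal prior weight.
probs = approx_posterior_probs([210.4, 207.9, 215.0], [1 / 3, 1 / 3, 1 / 3])
```

With equal priors, the ratio of two probabilities depends only on the BIC difference, so the second model here is favored over the first by a factor of exp(1.25).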

66 | Computing Bayes factors by combining simulation and asymptotic approximations
- DiCiccio, Kass, et al.
- 1995
Citation Context ... values of the regression coefficients. The marginal likelihood, p(y|γ), is not available in closed form but can be approximated via the Laplace approximation $\hat{p}(y|\gamma) = (2\pi)^{p/2} |\hat{\Sigma}|^{1/2} h(\hat{\beta}|\gamma)$ (DiCiccio et al., 1997), where $h(\beta|\gamma) = p(y|\beta,\gamma)\,p(\beta|\gamma)$ and $\hat{\Sigma} = \left(-\frac{\partial^2 \log h(\hat{\beta}|\gamma)}{\partial\hat{\theta}_i\,\partial\hat{\theta}_j}\right)^{-1}$. We find $\hat{\beta} = \arg\max_{\beta} p(y|\beta,\gamma)\,p(\beta|\gamma)$, the maximum a posteriori estimate of β, via Newton’s method. 3.3 Prior ov... |
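The Laplace approximation quoted above can be sketched for the simplest case of a single predictor (p = 1), where the determinant reduces to a scalar variance. This is an illustration under stated assumptions, not the paper's implementation: the N(0, τ²) prior on β, the absence of an intercept, the demo data, and all function names are hypothetical.

```python
import math

def log_h(beta, x, y, tau):
    """log h(beta) = log p(y | beta) + log p(beta), one-predictor logistic model."""
    loglik = sum(yi * xi * beta - math.log1p(math.exp(xi * beta))
                 for xi, yi in zip(x, y))
    logprior = -0.5 * math.log(2 * math.pi * tau ** 2) - beta ** 2 / (2 * tau ** 2)
    return loglik + logprior

def grad_and_curvature(beta, x, y, tau):
    """First derivative and negative second derivative of log h at beta."""
    probs = [1.0 / (1.0 + math.exp(-xi * beta)) for xi in x]
    grad = sum(xi * (yi - pi) for xi, yi, pi in zip(x, y, probs)) - beta / tau ** 2
    curv = sum(xi ** 2 * pi * (1 - pi) for xi, pi in zip(x, probs)) + 1 / tau ** 2
    return grad, curv

def map_estimate(x, y, tau, tol=1e-10, max_iter=50):
    """Newton's method for the maximum a posteriori estimate of beta."""
    beta = 0.0
    for _ in range(max_iter):
        grad, curv = grad_and_curvature(beta, x, y, tau)
        step = grad / curv
        beta += step
        if abs(step) < tol:
            break
    return beta

def laplace_marginal(x, y, tau):
    """(2 pi)^(1/2) * Sigma_hat^(1/2) * h(beta_hat), the p = 1 Laplace formula."""
    beta_hat = map_estimate(x, y, tau)
    _, curv = grad_and_curvature(beta_hat, x, y, tau)
    sigma2 = 1.0 / curv  # Sigma_hat is the inverse of the negative Hessian
    return math.sqrt(2 * math.pi * sigma2) * math.exp(log_h(beta_hat, x, y, tau))

# Hypothetical demo data: ten observations, prior standard deviation tau = 2.
x_demo = [-2.0, -1.5, -1.0, -0.5, -0.25, 0.25, 0.5, 1.0, 1.5, 2.0]
y_demo = [0, 0, 1, 0, 0, 1, 0, 1, 1, 1]
approx = laplace_marginal(x_demo, y_demo, tau=2.0)
```

In one dimension the result can be checked against direct numerical integration of h(β); for larger models that check is unavailable, which is the situation where a closed-form approximation of this kind becomes useful.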

63 | Variable Selection and Model Comparison in Regression
- Geweke
- 1996
Citation Context ... typical with many variables. MCMC algorithms designed to explore the posterior distribution over regression model spaces (e.g., George and McCulloch, 1993, 1997; Green, 1995; Madigan and York, 1995; Geweke, 1996; Raftery et al., 1997; Brown et al., 1998b) rely on Gibbs sampling (Gelfand and Smith, 1990) or Metropolis-Hastings algorithms, but are increasingly ineffective due to slow convergence in higher dime... |

50 | Experiments in stochastic computation for high-dimensional graphical models - Jones, Carvalho, et al. |

34 | Bayesian variable selection in clustering high-dimensional data - Tadesse, Sha, et al. - 2005 |

26 | Parameter priors for directed acyclic graphical models and the characterization of several probability distributions. Annals of Statistics 30 - Geiger, Heckerman - 2002 |

25 | Integrated modeling of clinical and gene expression information for personalized prediction of disease outcomes
- Pittman, Huang, et al.
- 2004
Citation Context ...rediction of low versus high risk based on genomic information could feed into decisions about postsurgical treatments (West et al., 2001; Huang et al., 2002; Nevins et al., 2003; Huang et al., 2003; Pittman et al., 2004). Prediction of lymph node status based on gene expression profiles is a challenging problem, due to the complex heterogeneity of the disease in terms of genetic/genomic and environmental factors, an... |

18 | Reversible-Jump Markov Chain Monte Carlo Computation and Bayesian Model Determination
- Green
- 1995
Citation Context ...plex patterns of collinearity that are typical with many variables. MCMC algorithms designed to explore the posterior distribution over regression model spaces (e.g., George and McCulloch 1993, 1997; Green 1995; Madigan and York 1995; Geweke 1996; Raftery, Madigan, and Hoeting 1997; Brown, Vannucci, and Fearn 1998b) rely on Gibbs sampling (Gelfand and Smith 1990) or on the Metropolis–Hastings algorithm but ... |

12 | Bayesian wavelength selection in multicomponent analysis
- Brown, Vannucci, et al.
- 1998
Citation Context ...lgorithms designed to explore the posterior distribution over regression model spaces (e.g., George and McCulloch, 1993, 1997; Green, 1995; Madigan and York, 1995; Geweke, 1996; Raftery et al., 1997; Brown et al., 1998b) rely on Gibbs sampling (Gelfand and Smith, 1990) or Metropolis-Hastings algorithms, but are increasingly ineffective due to slow convergence in higher dimensions. Outside of the regression model co... |

10 | Gene expression profiling and genetic markers in glioblastoma survival. Cancer Research 65
- Rich, Jones, et al.
- 2005
Citation Context ... both dimension and the subtlety of predictive relationships in the context of noise and complex patterns of collinearity. Two recent examples in cancer genomics studies, one using linear regression (Rich et al. 2005) and one using logistic regression (Dressman et al. 2006), have illustrated this in connection with both predictive utility and variable selection/identification in challenging contexts. We note that... |

8 | Gene expression profiles of multiple breast cancer phenotypes and response to neoadjuvant chemotherapy. Clin Cancer Res 2006, 12(3 Pt
- Dressman, Hans, et al.
Citation Context ...onships in the context of noise and complex patterns of collinearity. Two recent examples in cancer genomics studies, one using linear regression (Rich et al. 2005) and one using logistic regression (Dressman et al. 2006), have illustrated this in connection with both predictive utility and variable selection/identification in challenging contexts. We note that applications outside of regression are possible as well.... |

7 | Gene expression profiling for prediction of clinical characteristics of breast cancer
- Huang
- 2002
Citation Context ...ssion profiles to aid in prognosis – in this case, improved prediction of low versus high risk based on genomic information could feed into decisions about postsurgical treatments (West et al., 2001; Huang et al., 2002; Nevins et al., 2003; Huang et al., 2003; Pittman et al., 2004). Prediction of lymph node status based on gene expression profiles is a challenging problem, due to the complex heterogeneity of the di... |

7 | Towards Integrated Clinico-Genomic Models for Personalized Medicine: Combining Gene Expression Signatures and Clinical Factors in Breast Cancer Outcomes Prediction. Human Molecular Genetics
- Nevins, Huang, et al.
- 2003
Citation Context ...d in prognosis – in this case, improved prediction of low versus high risk based on genomic information could feed into decisions about postsurgical treatments (West et al., 2001; Huang et al., 2002; Nevins et al., 2003; Huang et al., 2003; Pittman et al., 2004). Prediction of lymph node status based on gene expression profiles is a challenging problem, due to the complex heterogeneity of the disease in terms of gen... |