## Bayesian inference and optimal design in the sparse linear model

Venue: Workshop on Artificial Intelligence and Statistics

Citations: 59 (12 self)

### BibTeX

@INPROCEEDINGS{Seeger_bayesianinference,
  author = {Matthias W. Seeger and Martin Wainwright},
  title = {Bayesian inference and optimal design in the sparse linear model},
  booktitle = {Workshop on Artificial Intelligence and Statistics},
  year = {2007}
}

### Abstract

The linear model with sparsity-favouring prior on the coefficients has important applications in many different domains. In machine learning, most methods to date search for maximum a posteriori sparse solutions and neglect to represent posterior uncertainties. In this paper, we address problems of Bayesian optimal design (or experiment planning), for which accurate estimates of uncertainty are essential. To this end, we employ expectation propagation approximate inference for the linear model with Laplace prior, giving new insight into numerical stability properties and proposing a robust algorithm. We also show how to estimate model hyperparameters by empirical Bayesian maximisation of the marginal likelihood, and propose ideas in order to scale up the method to very large underdetermined problems. We demonstrate the versatility of our framework on the application of gene regulatory network identification from micro-array expression data, where both the Laplace prior and the active experimental design approach are shown to result in significant improvements. We also address the problem of sparse coding of natural images, and show how our framework can be used for compressive sensing tasks. Part of this work appeared in Seeger et al. (2007b). The gene network identification application appears in Steinke et al. (2007).

### Citations

4703 | Matrix Analysis - Horn, Johnson - 1985

Citation Context: ...ings. We begin with the simpler non-degenerate representation: Σ^{-1} = X^T X + Π = L L^T, γ := L^{-1}(b^{(0)} + b), where Π := diag(π) here and elsewhere. L ∈ R^{n×n} is the lower-triangular Cholesky factor (Horn and Johnson, 1985). Recall that b^{(0)} = X^T u. Note that h = L^{-T} γ. The marginal Q(a_i) = N(h_i, σ² ρ_i) is determined as h_i = v^T γ, ρ_i = ‖v‖², where v = L^{-1} δ_i. Here, δ_i is the Dirac unit vector with 1 a...
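The Cholesky-based marginal computation in this excerpt can be checked numerically. Below is a minimal sketch (the function name `marginals` and the small random problem are illustrative, with the σ² factor left out):

```python
import numpy as np

def marginals(X, pi, b0, b):
    """Marginal means h_i and (scaled) variances rho_i of Q(a_i),
    via the Cholesky factor L of Sigma^{-1} = X^T X + diag(pi)."""
    prec = X.T @ X + np.diag(pi)             # Sigma^{-1} = L L^T
    L = np.linalg.cholesky(prec)             # lower-triangular factor
    gamma = np.linalg.solve(L, b0 + b)       # gamma = L^{-1} (b^(0) + b)
    h = np.linalg.solve(L.T, gamma)          # h = L^{-T} gamma
    V = np.linalg.solve(L, np.eye(len(pi)))  # column i is v = L^{-1} delta_i
    rho = (V ** 2).sum(axis=0)               # rho_i = ||v||^2 = Sigma_ii
    return h, rho

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 5))
pi = np.full(5, 0.5)
b0 = X.T @ rng.standard_normal(20)
h, rho = marginals(X, pi, b0, np.zeros(5))
Sigma = np.linalg.inv(X.T @ X + np.diag(pi))  # direct check
```

As a sanity check, `h` should equal Σ(b⁽⁰⁾ + b) and `rho` the diagonal of Σ.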

3725 | Convex Optimization - Boyd, Vandenberghe - 2004

Citation Context: ...is noted in Palmer et al. (2006). For the Laplace (2), we have that log t_i(a_i) = −τ̃ √(a_i²) + log(τ̃/2), which is convex in a_i². A global tight lower bound is obtained using Legendre-Fenchel duality (Boyd and Vandenberghe, 2002), resulting in e^{−τ̃|a_i|} = sup_{π_i>0} N^U(a_i | 0, σ² π_i^{-1}) e^{−(τ̃²/2) π_i^{-1}}. We can plug in the r.h.s. for t_i(a_i), then integrate out a in order to obtain a lower bound on log P(u). The outcome is quite s...
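The Fenchel-duality bound quoted here can be verified numerically. The sketch below uses my own parametrisation, exp(−τ|a|) = sup_{π>0} exp(−π a²/(2σ²) − (τ²σ²/2)/π), obtained by maximising the exponent over π (attained at π* = τσ²/|a|), so the constants may differ from the paper's τ̃:

```python
import numpy as np

tau, sigma, a = 1.3, 0.7, 0.9
pi = np.linspace(1e-3, 50.0, 200_000)   # dense grid over pi > 0
# Gaussian-form lower bound on the Laplace site, one value per pi
bound = np.exp(-pi * a**2 / (2 * sigma**2) - (tau**2 * sigma**2 / 2) / pi)
sup_bound = bound.max()                 # supremum over the grid
target = np.exp(-tau * abs(a))          # the Laplace site value
```

Every grid value stays below the site value, and the supremum recovers it, which is the "global tight lower bound" property the excerpt refers to.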

1876 | Regression shrinkage and selection via the lasso - Tibshirani - 1996

Citation Context: ...statistical notion of sparsity, with what some maximum a posteriori (MAP) treatments of the sparse linear model are aiming to do. In the latter approach, which is very prominent in machine learning (Tibshirani, 1996; Chen et al., 1999; Peeters and Westra, 2004), the mode â of the posterior P(a|X, u) is found through convex optimisation (recall that the log posterior is concave), and â is treated as posterior est...
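The MAP (mode-finding) route described in this excerpt is the lasso; a minimal coordinate-descent sketch (the function name and the small synthetic problem are illustrative, not the paper's implementation):

```python
import numpy as np

def lasso_cd(X, u, lam, n_sweeps=200):
    """MAP estimate under a Laplace prior: minimise
    0.5*||u - X a||^2 + lam*||a||_1 by cyclic coordinate descent."""
    a = np.zeros(X.shape[1])
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_sweeps):
        for j in range(X.shape[1]):
            r = u - X @ a + X[:, j] * a[j]        # residual without coord j
            z = X[:, j] @ r
            # soft-threshold: exact minimiser in coordinate j
            a[j] = np.sign(z) * max(abs(z) - lam, 0.0) / col_sq[j]
    return a

rng = np.random.default_rng(3)
X = rng.standard_normal((40, 10))
a_true = np.zeros(10); a_true[[1, 7]] = [2.0, -1.5]  # sparse ground truth
u = X @ a_true + 0.01 * rng.standard_normal(40)
a_hat = lasso_cd(X, u, lam=0.5)
```

The soft-threshold update drives most coordinates exactly to zero, which is the sparsity of â the excerpt goes on to discuss.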

1759 | Compressed sensing - Donoho - 2006 |

1685 | Atomic decomposition by basis pursuit - Chen, Donoho, et al. - 2001

Citation Context: ...n of sparsity, with what some maximum a posteriori (MAP) treatments of the sparse linear model are aiming to do. In the latter approach, which is very prominent in machine learning (Tibshirani, 1996; Chen et al., 1999; Peeters and Westra, 2004), the mode â of the posterior P(a|X, u) is found through convex optimisation (recall that the log posterior is concave), and â is treated as posterior estimate of a. â has t...

1532 | Iterative Methods for Sparse Linear Systems - Saad - 2003

Citation Context: ...N(h_i, σ² ρ_i). Recall that Σ^{-1} = X^T X + Π and h = Σ(b^{(0)} + b). The quadratic criterion q(v) := δ_i^T v − (1/2) v^T (X^T X + Π) v can be minimised using the linear conjugate gradients (LCG) algorithm (Saad, 1996), requiring a MVM with X^T X + Π per iteration, thus MVMs with X, X^T, and O(n). At the minimum, we have v_* = Σ δ_i and q(v_*) = ρ_i/2, whence h_i = v_*^T (b^{(0)} + b). We can also start from the degener...
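The LCG computation in this excerpt, one MVM with XᵀX + Π per iteration, can be sketched as follows; `cg_solve` is a textbook conjugate-gradient routine, not the paper's code, and the stationary point of q coincides with the solution of (XᵀX + Π)v = δᵢ:

```python
import numpy as np

def cg_solve(mvm, b, max_iter=500, tol=1e-12):
    """Linear conjugate gradients for A x = b, with A accessed only
    through the matrix-vector multiplication mvm(v) = A v."""
    x = np.zeros_like(b)
    r = b - mvm(x)
    p = r.copy()
    rs = r @ r
    for _ in range(max_iter):
        Ap = mvm(p)
        alpha = rs / (p @ Ap)
        x = x + alpha * p
        r = r - alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

rng = np.random.default_rng(1)
X = rng.standard_normal((30, 8))
pi = np.full(8, 0.3)
mvm = lambda v: X.T @ (X @ v) + pi * v   # one MVM with X^T X + diag(pi)
i = 2
delta = np.zeros(8); delta[i] = 1.0
v_star = cg_solve(mvm, delta)            # v_* = Sigma delta_i
rho_i = v_star[i]                        # (scaled) marginal variance Sigma_ii
```

Note that Σ is never formed explicitly; only matrix-vector products with X and Xᵀ are needed, which is what makes the approach scale.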

1497 | Fundamentals of Speech Recognition - Rabiner, Juang - 1993

Citation Context: ...plementing the proposal of Lewicki and Olshausen (1999) is not publicly available. We can draw an analogy between the difference of learning X in OF and EP to current practices in speech recognition (Rabiner and Juang, 2003). Given a trained system, the recognition (or decoding) is done by searching for the most likely sequence, in what is called Viterbi decoding. However, training the system should be done by expectati...

1327 | Robust uncertainty principles: exact signal reconstruction from highly incomplete frequency information,”IEEE - Candès, Romberg, et al. - 2006 |

835 | An Introduction to Variational Methods for Graphical Models - Jordan, Ghahramani, et al. - 1998

Citation Context: ...trained choice for Q is the true posterior). The variational characterisation is also known as mean field lower bound, because it is the defining feature of (structured) mean field approximations (Jordan et al., 1997). Once appropriate factorisation assumptions are placed on Q, the feasible set can be written analytically in terms of factors from these families, and the right hand side of (12) and its gradient...

599 | Sparse coding with an overcomplete basis set: a strategy employed by V1? Vision Research - Olshausen, Field - 1997

Citation Context: ...tivation for developing a good inference approximation here. 2.3 Coding of Natural Images. A second application of the sparse linear model is concerned with linear coding of natural images (Olshausen and Field, 1997; Lewicki and Olshausen, 1999), with the aim of understanding properties of visual neurons in the brain. Before we describe the setup, it is important to point out what our motivation is here, since i...

568 | Probabilistic inference using Markov Chain Monte Carlo methods - Neal - 1993

Citation Context: ...While variational approximations are fairly established in machine learning, the dominant methods for approximating Bayesian inference in statistics are Markov chain Monte Carlo (MCMC) simulations (Neal, 1993; Gilks et al., 1996). In these techniques, a Markov chain over latent variables of interest (and possibly additional auxiliary ones) is simulated, whose stationary distribution is the desired posteri...

559 | Sparse Bayesian learning and the Relevance Vector Machine - Tipping |

521 | Bayesian interpolation - MacKay - 1992

Citation Context: ...) on a_i, where π_i is a scale parameter, then maximising the marginal likelihood P(u|π) w.r.t. π. Here, π_i can be given a heavy-tailed hyperprior. The Occam's razor effect embedded in empirical Bayes (MacKay, 1992) leads to π_i becoming large for irrelevant components a_i: a model with few relevant components is simpler than one with many, and if both describe the data well, the former is preferred under ARD. AR...

443 | Graphical Models, Exponential Families, and Variational Inference - Wainwright, Jordan - 2008 |

369 | Optimally sparse representation in general (nonorthogonal) dictionaries via ℓ1 minimization - Donoho, Elad - 2003 |

360 | Theory of Optimal Experiments - Fedorov - 1972 |

328 | Marginal likelihood from the Gibbs output - Chib - 1995 |

327 | Information-based objective functions for active data selection - MacKay - 1992

Citation Context: ...can be computed for each candidate u_* without doing the corresponding experiment. Optimal design is well developed in classical and Bayesian statistics (Fedorov, 1972; Chaloner and Verdinelli, 1995; MacKay, 1991), and access to this methodology is a key motivation for developing a good inference approximation here. 2.3 Coding of Natural Images. A second application of the sparse linear model is con...

320 | Query by committee - Seung, Opper, et al. - 1992

Citation Context: ...motivated in Section 2. The topic is well-researched in classical and Bayesian statistics (Fedorov, 1972; Chaloner and Verdinelli, 1995). A variant is known in machine learning as active learning (Seung et al., 1992). We follow MacKay (1991) here, whose setting is closest to ours. In the sparse linear model, a typical design problem can be formulated as follows. Given a set of candidate points x_*, at which of th...

314 | Expectation propagation for approximate Bayesian inference. In Uncertainty in artificial intelligence: proceedings of the seventeenth conference - Minka - 2001 |

282 | Derivative-free adaptive rejection sampling for Gibbs sampling - Gilks - 1992

Citation Context: ...y to use by non-experts, we concentrate on log-concave Laplace sparsity priors in the sequel. The importance of log-concavity has been recognised in statistics and Markov chain sampling (Pratt, 1981; Gilks and Wild, 1992; Park and Casella, 2005; Lovász and Vempala, 2003; Paninski, 2005), but has not received much attention so far in work on variational approximate inference. Our decision to prefer the Laplace sparsit...

265 | A family of algorithms for approximate Bayesian inference - Minka - 2001

Citation Context: ...ity priors which are not Gaussian, Bayesian inference in general is not analytically tractable anymore and has to be approximated. In this paper, we employ the expectation propagation (EP) algorithm (Minka, 2001b; Opper and Winther, 2000) for approximate Bayesian inference in the sparse linear model. Our motivation runs contrary to most machine learning applications of the sparse linear model considered so f...

215 | Construction of a genetic toggle switch in Escherichia coli - Gardner, Cantor, et al. |

191 | A Variational Bayesian Framework for Graphical Models - Attias - 2000 |

177 | Bayesian experimental design: A review - Chaloner, Verdinelli - 1995

Citation Context: ...of information" is sought which can be computed for each candidate u_* without doing the corresponding experiment. Optimal design is well developed in classical and Bayesian statistics (Fedorov, 1972; Chaloner and Verdinelli, 1995; MacKay, 1991), and access to this methodology is a key motivation for developing a good inference approximation here. 2.3 Coding of Natural Images. A second application of the sparse linea...

148 | The entire regularization path for the support vector machine - Hastie, Rosset, et al. |

126 | Fast Sparse Gaussian Process Methods: The Informative Vector Machine - Lawrence, Seeger, et al. - 2002

Citation Context: ...re remarkable, given that no such problems occur in several other EP applications, for example Gaussian process classification (GPC) with probit or logit noise (Minka, 2001b; Opper and Winther, 2000; Lawrence et al., 2003), where less careful implementations still work fine, and even approximate Gaussian quadrature can be used. Several early attempts of ours led to complete failure of the algorithm on realistic data (...

111 | Probabilistic framework for the adaptation and comparison of image codes - Lewicki, Olshausen - 1999

Citation Context: ...ook by empirical Bayesian marginal likelihood maximisation. Since current hypotheses about the development of early visual neurons in the brain are equivalent to a Bayesian sparse linear model setup (Lewicki and Olshausen, 1999), our method is useful to test and further refine these. There has been a lot of recent interest in signal processing in the problem of compressive sensing (Candès et al., 2006; Donoho, 2006; Ji and ...

110 | Propagation algorithms for Variational Bayesian learning - Ghahramani, Beal - 2001

Citation Context: ...ich conclusions can be drawn, is subject to future work. A comparison between approximate inference techniques would be incomplete without including variational mean field Bayes (VMFB) (Attias, 2000; Ghahramani and Beal, 2001), maybe the most well-known variational technique at the moment. It is also simply known as "variational Bayes" (see www.variational-bayes.org), although we understand this term as encompassing other...

97 | The Bayesian lasso - Park, Casella - 2008

Citation Context: ...ortant), mainly because Bayesian experimental design is fundamentally driven by such uncertainty representations. While Bayesian inference can also be performed using Markov chain Monte Carlo (MCMC) (Park and Casella, 2005), our approach is much more efficient, especially in the context of sequential design, and can be applied to large-scale problems of interest in machine learning. Moreover, experimental design requir...

84 | Reverse Engineering Gene Networks: Integrating Genetic Perturbations with Dynamical Modeling - Tegner, Yeung, et al. - 2003 |

81 | Adaptive sparseness for supervised learning - Figueiredo - 2003 |

72 | A variational method for learning sparse and overcomplete representations - Girolami |

71 | O.: Gaussian Processes for Classification: Mean-Field Algorithms - Opper, Winther |

58 | Sharp thresholds for noisy and high-dimensional recovery of sparsity using ℓ1-constrained quadratic programming (lasso) - Wainwright

Citation Context: ...formulations of sparse estimation have been established, showing that in certain regimes they perfectly reconstruct very sparse signals in a minimax sense (Donoho and Elad, 2003; Candès et al., 2006; Wainwright, 2006). On the other hand, MAP as an approximation to Bayesian inference is fairly poor in this case. As noted in Section 3, a direct Laplace approximation is not well-defined for the sparse linear model. ...

57 | Variational Methods for Inference and Estimation in Graphical Models - Jaakkola - 1997

Citation Context: ...ound on the log marginal likelihood log P(u) works by lower-bounding the sites t_i(a_i) by terms of Gaussian form. A powerful way of obtaining global lower bounds of simple form is exploiting convexity (Jaakkola, 1997). We can apply this approach to the sparse linear model with Laplace prior, which results in a method proposed by Girolami (2001). The general idea in the context of non-Gaussian linear models is not...

55 | Untangling the wires: A strategy to trace functional interactions in signaling and gene networks - Kholodenko, Kiyatkin, et al. - 2002 |

53 | Bayesian Gaussian process models: PAC-Bayesian generalisation error bounds and sparse approximations. Doctoral dissertation - Seeger - 2003 |

42 | On deriving the inverse of a sum of matrices - Henderson, Searle - 1981

Citation Context: ...rate representation), we require that π_i ≥ κ at all times, where κ > 0 is a small constant (we use κ = 10^{-8} presently). This constraint is enforced in all EP updates. We can use the Woodbury formula (Henderson and Searle, 1981) in order to write Σ = (X^T X + Π)^{-1} = Π^{-1} − Π^{-1} X^T (I + X Π^{-1} X^T)^{-1} X Π^{-1}. We represent this via the lower-triangular Cholesky factor L in L L^T = I + X Π^{-1} X^T. Furthermore, let γ := ...
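The Woodbury rewrite in this excerpt, which needs only an m×m factorisation when X is m×n with m ≪ n, can be checked numerically (the dimensions and random data here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 6, 12                          # underdetermined: m < n
X = rng.standard_normal((m, n))
pi = 0.1 + rng.random(n)              # well above the kappa floor in the excerpt
Pi_inv = np.diag(1.0 / pi)

# Direct inverse of the n x n precision matrix
direct = np.linalg.inv(X.T @ X + np.diag(pi))

# Woodbury: Sigma = Pi^{-1} - Pi^{-1} X^T (I + X Pi^{-1} X^T)^{-1} X Pi^{-1}
inner = np.eye(m) + X @ Pi_inv @ X.T  # only this m x m matrix is inverted
woodbury = Pi_inv - Pi_inv @ X.T @ np.linalg.inv(inner) @ X @ Pi_inv
```

Both routes give the same Σ, but the Woodbury form works with the m×m Cholesky factor L mentioned in the excerpt instead of an n×n one.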

42 | Assessing Approximate Inference for Binary Gaussian Process Classification - Kuss, Rasmussen |

41 | Predictive automatic relevance determination by expectation propagation - Qi, Minka, et al. |

41 | Analysis of sparse bayesian learning - Faul, Tipping |

37 | Variational EM algorithms for non-Gaussian latent variable models - Palmer, Wipf, et al. - 2006

Citation Context: ...been applied to the sparse linear model by Tipping (2001), where the method was called sparse Bayesian learning (SBL). The derivation there makes use of scale mixture decompositions (Gneiting, 1997; Palmer et al., 2006) for the non-Gaussian prior sites. ...all others could be set to zero. ARD works by placing a prior N(a_i|0, σ² π_i^{-1})... Namely, many univariate symmetric distributions can be represented in the form P(a_i) = ...

36 | Mean field approaches to independent component analysis - Hojen-Sorensen, Winther, et al. - 2002 |

33 | Bayesian Learning for Neural Networks. Number 118 - Neal - 1996 |

31 | Compressed sensing and bayesian experimental design - Seeger, Nickisch - 2008

Citation Context: ...interest in our work here: identification of gene networks, and sparse coding of natural images, and we give remarks about applications to compressive sensing, which are subject to work in progress (Seeger and Nickisch, 2008). The importance of optimal design and hyperparameter estimation are motivated using these examples. 2.1 The Role of Sparsity Priors. In order to obtain flexible inference methods, it often makes sens...

31 | The Inverse Gaussian Distribution: Theory, Methodology, and Applications - Chhikara, Folks - 1989 |

29 | Matrix analysis. Cambridge university press - Horn, Johnson - 1990 |

27 | Bayesian Inference, volume 2B of Kendall’s Advanced Theory of Statistics - O’Hagan - 1994 |

27 | A Bayesian regression approach to the inference of regulatory networks from gene expression data - Rogers, Girolami - 2005 |