## Variational learning and bits-back coding: an information-theoretic view to Bayesian learning

Venue: IEEE Transactions on Neural Networks

Citations: 17 (7 self)

### BibTeX

@ARTICLE{Honkela_variationallearning,
  author  = {Antti Honkela and Harri Valpola},
  title   = {Variational learning and bits-back coding: an information-theoretic view to Bayesian learning},
  journal = {IEEE Transactions on Neural Networks},
  year    = {2004}
}

### Abstract

The bits-back coding, first introduced by Wallace in 1990 and later by Hinton and van Camp in 1993, provides an interesting link between Bayesian learning and information-theoretic minimum-description-length (MDL) learning approaches. Bits-back coding allows interpreting the cost function used in the variational Bayesian method called ensemble learning as a code length, in addition to the Bayesian view of it as the misfit of the posterior approximation and a lower bound on model evidence. Combining these two viewpoints provides interesting insights into the learning process and the functions of different parts of the model. In this paper, the problem of variational Bayesian learning of hierarchical latent variable models is used to demonstrate the benefits of the two views. The code-length interpretation provides new views on many parts of the problem, such as model comparison and pruning, and helps explain many phenomena occurring in learning.

Index Terms: Bits-back coding, ensemble learning, hierarchical latent variable models, minimum description length, variational Bayesian learning.

### Citations

9231 citations – Cover, Thomas, *Elements of Information Theory* (1990)
Context: ...uld be the optimal ...? The code length in (6) can be written as (4)–(7), where ... is the Kullback–Leibler divergence between the coding distribution and the posterior distribution of the parameters [20]. In the other term, ... is the model evidence which is independent of the values of ... and also independent of .... Thus, the code length can be minimized by minimizing the Kullback–Leibler divergence. This c...
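The decomposition quoted in this context, code length equals a Kullback–Leibler divergence to the posterior minus the log evidence, can be checked numerically. A minimal sketch with a hypothetical three-valued parameter and made-up probabilities:

```python
import math

# Toy discrete model (all numbers hypothetical): prior over three
# parameter values and a likelihood for one observed data set x.
prior = [0.5, 0.3, 0.2]
lik = [0.1, 0.4, 0.7]          # p(x | theta_i)

evidence = sum(p * l for p, l in zip(prior, lik))           # p(x)
posterior = [p * l / evidence for p, l in zip(prior, lik)]  # p(theta | x)

def code_length(q):
    """Expected bits-back code length for coding distribution q."""
    return sum(qi * math.log2(qi / (pi * li))
               for qi, pi, li in zip(q, prior, lik) if qi > 0)

def kl(q, p):
    """Kullback-Leibler divergence D(q || p) in bits."""
    return sum(qi * math.log2(qi / pi) for qi, pi in zip(q, p) if qi > 0)

# Decomposition: L(q) = D_KL(q || posterior) - log2 p(x)
q = [0.4, 0.4, 0.2]            # an arbitrary coding distribution
assert abs(code_length(q) - (kl(q, posterior) - math.log2(evidence))) < 1e-9

# The minimum is attained at q = posterior, where L = -log2 p(x).
assert abs(code_length(posterior) + math.log2(evidence)) < 1e-9
```

Since the evidence term does not depend on q, minimizing the code length is equivalent to minimizing the Kullback–Leibler divergence, exactly as the snippet states.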

7146 citations – Shannon, *A mathematical theory of communication* (1948)
Context: ...thing. The MDL principle leaves open many technical questions on what is a valid code but as it is not required to construct an actual code but rather to evaluate its length, Shannon's coding theorem [13] can be used to obtain a lower bound for the code length. Shannon's theorem states that data following a discrete distribution cannot, on average, be coded using less than (1)–(2) bits/sample. Here de...
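Shannon's lower bound cited here is easy to verify on a small discrete source. A sketch with hypothetical dyadic probabilities, for which an optimal prefix code meets the entropy bound exactly:

```python
import math

# A discrete source (hypothetical probabilities, chosen dyadic).
p = [0.5, 0.25, 0.125, 0.125]

# Shannon lower bound: H(p) = -sum p_i log2 p_i bits/sample.
entropy = -sum(pi * math.log2(pi) for pi in p)

# An optimal prefix code for a dyadic distribution assigns each
# symbol a codeword of length -log2(p_i): here 1, 2, 3, 3 bits.
lengths = [1, 2, 3, 3]
avg_len = sum(pi * li for pi, li in zip(p, lengths))

assert abs(entropy - 1.75) < 1e-9
assert abs(avg_len - entropy) < 1e-9  # bound met exactly for dyadic p
```

For non-dyadic distributions the average length of any uniquely decodable code strictly exceeds the entropy, which is why the theorem gives a lower bound rather than an exact cost.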

1697 citations – Hyvärinen, Karhunen, et al., *Independent Component Analysis* (2001)
Context: ...tic point of view. Spikes are components whose energy is concentrated to a single observation with values at all other time instants being very close to zero. Spike-like signals maximize the kurtosis [50] of the signal as well as many contrast functions. If the dimensionality of the data is high compared to the number of samples, spikes can be found relatively easily and thus, algorithms attempting to...
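The claim that spike-like signals maximize kurtosis can be illustrated directly. A sketch (hypothetical signals) comparing the excess kurtosis of a spike against a smooth oscillation:

```python
import math

def excess_kurtosis(x):
    """Sample excess kurtosis: E[(x-mu)^4]/var^2 - 3 (0 for a Gaussian)."""
    n = len(x)
    mu = sum(x) / n
    var = sum((v - mu) ** 2 for v in x) / n
    m4 = sum((v - mu) ** 4 for v in x) / n
    return m4 / var ** 2 - 3.0

# A spike: energy concentrated in one observation, the rest zero.
spike = [10.0] + [0.0] * 99

# A smoothly varying signal for comparison.
smooth = [math.sin(2 * math.pi * k / 100) for k in range(100)]

assert excess_kurtosis(spike) > 10    # strongly super-Gaussian
assert excess_kurtosis(smooth) < 0    # sub-Gaussian
```

As the snippet notes, this is why kurtosis-maximizing ICA algorithms can latch onto spike artifacts when the sample size is small relative to the data dimensionality.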

1246 citations – Rissanen, *Modeling by shortest data description* (1978)
Context: ...there have been many specific algorithms based on the idea of finding the most compact representation for the data, some of which are more closely and some more distantly related to Bayesian learning [6]–[9]. An introduction to different approaches to minimum-encoding inference can be found in [10]–[12]. The fundamental idea behind minimum-encoding learning was distilled by Rissanen as the MDL princi...

1171 citations – Bernardo, Smith, *Bayesian Theory* (1994)
Context: ...aluate the posterior probabilities of different models. Different approximation techniques retain these properties to different extents. For a thorough introduction to Bayesian statistics, see, e.g., [2], [3]. Another interesting view to learning is provided by information theory and involves finding a model that can be used to encode the data in a compact manner. The idea of using compact coding for...

869 citations – Jordan, Ghahramani, et al., *An Introduction to Variational Methods for Graphical Models* (1999)
Context: ...is can be done for instance by decoupling variables from the model with so called variational transformations until the learning problem of the remaining structure can be solved in a tractable manner [24]. This sequential approach is very flexible but it is difficult to develop a general theory for it. An alternative approach is to fix a single structure for the approximation and then find the optimal...

805 citations – Neal, Hinton, *A view of the EM algorithm that justifies incremental, sparse, and other variants* (1998)
Context: ...thods can be seen as specific examples of using the above methodology. The expectation-maximization (EM) algorithm, for instance, can be viewed as a specific method to minimize a cost function of (8) [21]. In this case, the set of unknown variables consists of some unobserved data ... and the standard model parameters .... The approximation is chosen to be of the form ... with ... being restricted to a delta distrib...
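The EM view described in this context, an exact posterior over the unobserved data combined with delta-distribution (point) estimates of the parameters, can be sketched for a toy two-component Gaussian mixture. All settings below (component means, sample sizes, iteration count) are hypothetical:

```python
import math
import random

# Minimal EM for a two-component 1-D Gaussian mixture with unit
# variances and equal weights, as coordinate descent on the free energy:
# E-step sets q(z) to the exact posterior over latent labels,
# M-step uses point (delta-distribution) estimates of the means.
random.seed(0)
data = [random.gauss(-2, 1) for _ in range(200)] + \
       [random.gauss(3, 1) for _ in range(200)]

mu = [-1.0, 1.0]                       # initial point estimates
for _ in range(50):
    # E-step: responsibilities q(z_i = k) proportional to exp(-(x-mu_k)^2/2)
    resp = []
    for x in data:
        w = [math.exp(-(x - m) ** 2 / 2) for m in mu]
        s = w[0] + w[1]
        resp.append((w[0] / s, w[1] / s))
    # M-step: each mean moves to its responsibility-weighted average
    for k in range(2):
        num = sum(r[k] * x for r, x in zip(resp, data))
        den = sum(r[k] for r in resp)
        mu[k] = num / den

mu.sort()
assert abs(mu[0] + 2) < 0.5 and abs(mu[1] - 3) < 0.5
```

Replacing the delta distributions over the means with full distributions would turn this EM sketch into the ensemble-learning scheme discussed in the paper.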

655 citations – Jordan, *Graphical models* (2004)
Context: ...y a node ... is directly dependent only on the variables represented by the immediate parents of ..., i.e., the distribution of the values of ... is perfectly determined once the values of its parents are known [44], [45]. Let us assume that we are using a fully factorial posterior approximation (14). The case of a general (not necessarily fully) factorial approximation is similar with joint densities of groups o...

582 citations – MacKay, *Bayesian interpolation* (1991)
Context: ...ined models. Thus, it is reasonable to approximate the marginalization by using only the single best model. Whichever of the alternatives is chosen, the key quantity to evaluate is the model evidence [27], [28]. The ensemble learning procedure of minimizing the Kullback–Leibler divergence between an approximate posterior and the exact can also be applied to approximate the exact inference. Performing...

541 citations – Rissanen, *Stochastic complexity* (1987)
Context: ...e have been many specific algorithms based on the idea of finding the most compact representation for the data, some of which are more closely and some more distantly related to Bayesian learning [6]–[9]. An introduction to different approaches to minimum-encoding inference can be found in [10]–[12]. The fundamental idea behind minimum-encoding learning was distilled by Rissanen as the MDL principle...

427 citations – Solomonoff, *A formal theory of inductive inference* (1964)
Context: ...eory and involves finding a model that can be used to encode the data in a compact manner. The idea of using compact coding for inductive inference was first proposed by Solomonoff in the early 1960s [4]. His approach was based on universal Turing machines which limits its usefulness in practice. In 1968, Wallace and Boulton proposed the first learning algorithm based on minimum encoding and more cla...

323 citations – Wallace, Boulton, *An Information Measure for Classification* (1968)
Context: ...n universal Turing machines which limits its usefulness in practice. In 1968, Wallace and Boulton proposed the first learning algorithm based on minimum encoding and more classical statistical models [5]. Their approach, later known as MML inference, was interpreted as a tractable approximation to exact Bayesian inference. Since then there have been many specific algorithms based on the idea of findi...

276 citations – Frey, *Graphical models for machine learning and digital communication* (1998)
Context: ...IONAL METHODS In 1990, Wallace presented an interesting new coding scheme to be used together with the MDL/MML learning principle [14]. The same idea has been later developed by numerous authors [15]–[19]. The name of bits-back coding is due to Hinton and van Camp [15], [16]. Wallace's scheme is based on the idea of using a code with redundant codewords and selecting the codeword according to some aux...

250 citations – Parisi, Shankar, *Statistical field theory* (1988)

238 citations – Attias, *Independent factor analysis* (1999)
Context: ...g problems ranging from learning multilayer neural networks [29] to learning hidden Markov models [30]. It has recently become very popular in the field of linear independent component analysis (ICA) [31]–[35]. The approach also provides suitable regularization for severely ill-posed nonlinear problems and has been successfully applied to nonlinear ICA [36]–[38] as well as nonlinear and switching stat...

198 citations – Wallace, Freeman, *Estimation and inference by compact coding* (1987)

178 citations – Cox, *Probability, Frequency and Reasonable Expectation* (1946)
Context: ...o the same underlying learning methodology. Bayesian statistics can be derived from Cox's axioms stating that the subject should perform rationally with respect to the information he has of the world [1]. The beliefs of the subject... Manuscript received March 15, 2003; revised October 20, 2003. This work was supported by the European Commission project BLISS, and the Finnish Center of Excellence Progra...

148 citations – Ghahramani, Hinton, *Variational learning for switching state-space models* (1998)
Context: ...lso provides suitable regularization for severely ill-posed nonlinear problems and has been successfully applied to nonlinear ICA [36]–[38] as well as nonlinear and switching state-space models [39], [40], to name a few examples. It also allows modeling of variance simultaneously with the mean in a way that would be impossible for methods based on conventional point estimates [41], [42]. In order to d...

132 citations – Hinton, van Camp, *Keeping neural networks simple by minimizing the description length of the weights* (1993)
Context: ...ARIATIONAL METHODS In 1990, Wallace presented an interesting new coding scheme to be used together with the MDL/MML learning principle [14]. The same idea has been later developed by numerous authors [15]–[19]. The name of bits-back coding is due to Hinton and van Camp [15], [16]. Wallace's scheme is based on the idea of using a code with redundant codewords and selecting the codeword according to som...

115 citations – Hinton, Zemel, *Autoencoders, minimum description length and Helmholtz free energy* (1993)
Context: ...me to be used together with the MDL/MML learning principle [14]. The same idea has been later developed by numerous authors [15]–[19]. The name of bits-back coding is due to Hinton and van Camp [15], [16]. Wallace's scheme is based on the idea of using a code with redundant codewords and selecting the codeword according to some auxiliary information. The receiver can then later recover the auxiliary i...

89 citations – Valpola, Karhunen, *An unsupervised ensemble learning method for nonlinear dynamic state-space models*

84 citations – MacKay, *Ensemble learning for hidden Markov models* (1997)
Context: ...MODELS The ensemble learning approach presented above has been used for a variety of different modeling problems ranging from learning multilayer neural networks [29] to learning hidden Markov models [30]. It has recently become very popular in the field of linear independent component analysis (ICA) [31]–[35]. The approach also provides suitable regularization for severely ill-posed nonlinear problem...

67 citations – Vigário, Säre, et al., *Independent Component Approach to the Analysis of EEG* (2000)
Context: ...thods that can be used to suppress them [48], [49]. To illustrate the formation of bumps with real data, we performed some experiments using biomedical magnetoencephalogram (MEG) measurements used in [51]. The MEG data consists of signals originating from brain activity measured with an array of magnetic sensors. The data has 122 channels corresponding to magnetic fields measured in two directions in...

63 citations – Lappalainen, Miskin, *Ensemble learning* (2000)
Context: ...by considering the minimization of the Kullback–Leibler divergence between the approximation and the true posterior. The resulting Bayesian-learning algorithm is often called ensemble learning [17], [26]. The posterior approximation in ensemble learning is usually chosen to be a product of independent distributions for some easily separable sets of parameters. In an EM-like situation with some unobse...

59 citations – Lappalainen, Honkela, *Bayesian nonlinear independent component analysis by multi-layer perceptrons* (2000)
Context: ...linear independent component analysis (ICA) [31]–[35]. The approach also provides suitable regularization for severely ill-posed nonlinear problems and has been successfully applied to nonlinear ICA [36]–[38] as well as nonlinear and switching state-space models [39], [40], to name a few examples. It also allows modeling of variance simultaneously with the mean in a way that would be impossible for m...

50 citations – MacKay, *Developments in probabilistic modelling with neural networks – ensemble learning* (1995)
Context: ...arning by considering the minimization of the Kullback–Leibler divergence between the approximation and the true posterior. The resulting Bayesian-learning algorithm is often called ensemble learning [17], [26]. The posterior approximation in ensemble learning is usually chosen to be a product of independent distributions for some easily separable sets of parameters. In an EM-like situation with some...

47 citations – Opper, Saad (eds.), *Advanced Mean Field Methods: Theory and Practice* (2001)
Context: ...tion to approximate the true posterior. These kinds of approaches have been used for some time under the name of variational methods or especially in statistical mechanics as mean field methods [22]–[25]. Variational methods are used to decrease the number of posterior dependencies in too complex models. This can be done for instance by decoupling variables from the model with so called variational t...

46 citations – Lappalainen, *Ensemble learning for independent component analysis* (1999)

37 citations – Barber, Bishop, *Ensemble learning for multi-layer networks* (1998)
Context: ...IV. BUILDING BLOCKS FOR HIERARCHICAL MODELS The ensemble learning approach presented above has been used for a variety of different modeling problems ranging from learning multilayer neural networks [29] to learning hidden Markov models [30]. It has recently become very popular in the field of linear independent component analysis (ICA) [31]–[35]. The approach also provides suitable regularization fo...

33 citations – Valpola, Harva, et al., *Hierarchical models of variance sources* (2004)
Context: ...e-space models [39], [40], to name a few examples. It also allows modeling of variance simultaneously with the mean in a way that would be impossible for methods based on conventional point estimates [41], [42]. In order to demonstrate the above principles in practice, we use a collection of "building blocks" called Bayes Blocks presented in [43]. These blocks allow easy definition and learning of man...

29 citations – Valpola, Raiko, et al., *Building blocks for hierarchical latent variable models* (2001)
Context: ...sible for methods based on conventional point estimates [41], [42]. In order to demonstrate the above principles in practice, we use a collection of "building blocks" called Bayes Blocks presented in [43]. These blocks allow easy definition and learning of many linear and nonlinear hierarchical latent variable model structures. Most probabilistic models can be represented as graphical models. A graphi...

25 citations – Valpola, Östman, et al., *Nonlinear independent factor analysis by hierarchical models* (2003)
Context: ...ar independent component analysis (ICA) [31]–[35]. The approach also provides suitable regularization for severely ill-posed nonlinear problems and has been successfully applied to nonlinear ICA [36]–[38] as well as nonlinear and switching state-space models [39], [40], to name a few examples. It also allows modeling of variance simultaneously with the mean in a way that would be impossible for method...

23 citations – Wallace, *Classification by Minimum-Message-Length Inference* (in S.G. Akl et al., eds.) (1990)
Context: ...the MAP estimate for the parameters. III. BITS-BACK CODING AND VARIATIONAL METHODS In 1990, Wallace presented an interesting new coding scheme to be used together with the MDL/MML learning principle [14]. The same idea has been later developed by numerous authors [15]–[19]. The name of bits-back coding is due to Hinton and van Camp [15], [16]. Wallace's scheme is based on the idea of using a code wit...

21 citations – Rustagi, *Variational methods in statistics* (1976)
Context: ...oximation to approximate the true posterior. These kinds of approaches have been used for some time under the name of variational methods or especially in statistical mechanics as mean field methods [22]–[25]. Variational methods are used to decrease the number of posterior dependencies in too complex models. This can be done for instance by decoupling variables from the model with so called variatio...

20 citations – Gelman, JB, et al., *Bayesian Data Analysis* (2003)
Context: ...e the posterior probabilities of different models. Different approximation techniques retain these properties to different extents. For a thorough introduction to Bayesian statistics, see, e.g., [2], [3]. Another interesting view to learning is provided by information theory and involves finding a model that can be used to encode the data in a compact manner. The idea of using compact coding for indu...

16 citations – Valpola, Oja, et al., *Nonlinear blind source separation by variational Bayesian learning*

16 citations – Honkela, Valpola, et al., *Accelerating cyclic update algorithms for parameter estimation by pattern searches*
Context: ...sented in (15). This is usually not the most efficient way to perform the minimization, however, and the convergence of the learning process can be accelerated with a simple procedure as presented in [46]. This speedup was used in all the experiments. E. Model Selection and Pruning The above procedure works well for a single fixed model structure. However, as the flexibility of the building-block fram...

12 citations – Penny, Everson, et al., *...model order selection and dynamic source models*
Context: ...blems ranging from learning multilayer neural networks [29] to learning hidden Markov models [30]. It has recently become very popular in the field of linear independent component analysis (ICA) [31]–[35]. The approach also provides suitable regularization for severely ill-posed nonlinear problems and has been successfully applied to nonlinear ICA [36]–[38] as well as nonlinear and switching state-spa...

11 citations – Cowell, *Advanced Inference in Bayesian Networks* (in Learning in Graphical Models) (1998)
Context: ...de ... is directly dependent only on the variables represented by the immediate parents of ..., i.e., the distribution of the values of ... is perfectly determined once the values of its parents are known [44], [45]. Let us assume that we are using a fully factorial posterior approximation (14). The case of a general (not necessarily fully) factorial approximation is similar with joint densities of groups of vari...

6 citations – Frey, Hinton, *Efficient stochastic source coding and an application to a Bayesian network source model* (1997)

6 citations – Attias, *...graphical models and variational methods* (in Independent Component Analysis: Principles and Practice)

5 citations – MacKay, *Information Theory, Inference, and Learning Algorithms* (2003)
Context: ...odels. Thus, it is reasonable to approximate the marginalization by using only the single best model. Whichever of the alternatives is chosen, the key quantity to evaluate is the model evidence [27], [28]. The ensemble learning procedure of minimizing the Kullback–Leibler divergence between an approximate posterior and the exact can also be applied to approximate the exact inference. Performing the mi...

5 citations – Miskin, MacKay, *Ensemble learning for blind source separation* (in Independent Component Analysis: Principles and Practice) (2001)

5 citations – Hyvärinen, Särelä, et al., *Spikes and bumps: artifacts generated by independent component analysis with insufficient sample size* (1999)
Context: ...experiments shown in the figure differ only in their use of the evidence nodes. B. Spikes in ICA Using standard ICA with unsuitable data set can cause two related types of artifacts: spikes and bumps [47]–[49]. We shall next discuss how the conditions in which they occur differ and how this behavior can be easily understood from an information-theoretic point of view. Spikes are components whose energy...

5 citations – Särelä, Vigário, *Overlearning problem in high-order ICA: analysis and solutions* (2003)
Context: ...ble to attain a shorter description for the original data. Bumps can be best avoided by using a temporal model suitable for the data but there are also other methods that can be used to suppress them [48], [49]. To illustrate the formation of bumps with real data, we performed some experiments using biomedical magnetoencephalogram (MEG) measurements used in [51]. The MEG data consists of signals origi...

2 citations – Oliver, Hand, *Introduction to Minimum Encoding Inference* (1994)
Context: ...tation for the data, some of which are more closely and some more distantly related to Bayesian learning [6]–[9]. An introduction to different approaches to minimum-encoding inference can be found in [10]–[12]. The fundamental idea behind minimum-encoding learning was distilled by Rissanen as the MDL principle [6]: choose the model that gives the shortest description of data. The description here mean...

2 citations – *Hierarchical models of variance sources* (2004)
Context: ...e models [39], [40], to name a few examples. It also allows modeling of variance simultaneously with the mean in a way that would be impossible for methods based on conventional point estimates [41], [42]. In order to demonstrate the above principles in practice, we use a collection of "building blocks" called Bayes Blocks presented in [43]. These blocks allow easy definition and learning of many line...

2 citations – *A Bayesian Approach to Overlearning in ICA: A Comparison Study* (2003)
Context: ...iments shown in the figure differ only in their use of the evidence nodes. B. Spikes in ICA Using standard ICA with unsuitable data set can cause two related types of artifacts: spikes and bumps [47]–[49]. We shall next discuss how the conditions in which they occur differ and how this behavior can be easily understood from an information-theoretic point of view. Spikes are components whose energy is c...