## A Unifying Review of Linear Gaussian Models (1999)

### Download Links

- [www.gatsby.ucl.ac.uk]
- [www.cs.nyu.edu]
- [www.cs.toronto.edu]
- [www.iro.umontreal.ca]
- [psych.stanford.edu]
- [authors.library.caltech.edu]
- [ftp.cs.toronto.edu]
- [www.stat.columbia.edu]
- DBLP

### Other Repositories/Bibliography

Citations: 263 (17 self)

### BibTeX

@MISC{Roweis99aunifying,
  author = {Sam Roweis and Zoubin Ghahramani},
  title = {A Unifying Review of Linear Gaussian Models},
  year = {1999}
}

### Abstract

Factor analysis, principal component analysis, mixtures of Gaussian clusters, vector quantization, Kalman filter models, and hidden Markov models can all be unified as variations of unsupervised learning under a single basic generative model. This is achieved by collecting together disparate observations and derivations made by many previous authors and introducing a new way of linking discrete and continuous state models using a simple nonlinearity. Through the use of other nonlinearities, we show how independent component analysis is also a variation of the same basic generative model. We show that factor analysis and mixtures of Gaussians can be implemented in autoencoder neural networks and learned using squared error plus the same regularization term. We introduce a new model for static data, known as sensible principal component analysis, as well as a novel concept of spatially adaptive observation noise. We also review some of the literature involving global and local mixtures of the basic models and provide pseudocode for inference and learning for all the basic models.
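
The "single basic generative model" the abstract refers to is the linear dynamical system x_{t+1} = A x_t + w_t, y_t = C x_t + v_t with Gaussian noise w and v. As a concrete illustration, here is a minimal numpy sketch of sampling from such a model (the dimensions and parameter values are invented for the example, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: k-dimensional state, p-dimensional observation, T steps.
k, p, T = 2, 3, 100
A = 0.9 * np.eye(k)                  # state transition matrix
C = rng.standard_normal((p, k))      # observation (loading) matrix
Q = 0.1 * np.eye(k)                  # state noise covariance
R = 0.2 * np.eye(p)                  # observation noise covariance

x = np.zeros(k)
states, observations = [], []
for _ in range(T):
    x = A @ x + rng.multivariate_normal(np.zeros(k), Q)   # x_{t+1} = A x_t + w_t
    y = C @ x + rng.multivariate_normal(np.zeros(p), R)   # y_t = C x_t + v_t
    states.append(x)
    observations.append(y)

states, observations = np.array(states), np.array(observations)
```

Setting A = 0 makes the states independent across time and recovers the static models (factor analysis, PCA); pushing the state through a winner-take-all discretization instead yields the discrete-state family (mixtures, HMMs) that the abstract unifies.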

### Citations

8084 | Maximum likelihood from incomplete data via the EM algorithm
- Dempster, Laird, et al.
- 1977
(Show Context)
Citation Context ...(1995) which also covers learning in this context. The basis of all the learning algorithms presented by these authors is the powerful EM or expectation-maximization algorithm (Baum and Petrie, 1966; Dempster et al., 1977). The objective of the EM algorithm is to maximize the likelihood of the observed data (6) in the presence of hidden variables. Let us denote the observed data by Y = {y_1 ... y_τ}, the hidden va... |

7048 |
Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference
- Pearl
- 1988
(Show Context)
Citation Context ... vector x′ = Λ^{-1/2} Eᵀ x with A′ = (Λ^{-1/2} Eᵀ) A (E Λ^{1/2}) and C′ = C (E Λ^{1/2}) such that the new covariance of x′ is the identity matrix: Q′ = I. ... (or belief) networks, and other formalisms (Pearl, 1988; Lauritzen and Spiegelhalter, 1988; Whittaker, 1990; Smyth et al., 1997). A graphical model is a representation of the dependency structure between variables in a multivariate probability distributio... |
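
The transformation quoted in this excerpt rests on the eigendecomposition Q = E Λ Eᵀ of the state noise covariance: multiplying by Λ^{-1/2} Eᵀ whitens the noise so its covariance becomes the identity. A small numpy sketch of that step (the covariance here is an arbitrary made-up example):

```python
import numpy as np

rng = np.random.default_rng(1)

# An arbitrary symmetric positive-definite noise covariance Q.
M = rng.standard_normal((3, 3))
Q = M @ M.T + 3 * np.eye(3)

# Eigendecomposition Q = E diag(lam) E^T.
lam, E = np.linalg.eigh(Q)

# Whitening map W = Lambda^{-1/2} E^T: the transformed noise has covariance I.
W = np.diag(lam ** -0.5) @ E.T
Q_new = W @ Q @ W.T

assert np.allclose(Q_new, np.eye(3))
```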

4823 | Neural Networks for Pattern Recognition - Bishop - 1995 |

3919 |
Pattern Classification and Scene Analysis
- Duda, Hart
- 1973
(Show Context)
Citation Context ...sists of finding the cluster means (columns of C), the covariance R, and the mixing coefficients π_j. This is easily done with EM and corresponds exactly to maximum likelihood competitive learning (Duda and Hart, 1973; Nowlan, 1991) except that all the clusters share the same covariance. Later we introduce extensions to the model which remove this restriction. As in the continuous state case, we can consider the l... |
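
The learning step this excerpt describes, EM for a mixture of Gaussians in which every cluster shares one covariance R, can be sketched in a few lines (toy data and initialization are invented for illustration; this is not the paper's pseudocode):

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy data: two clusters centered at (-3, -3) and (3, 3).
Y = np.vstack([rng.normal(-3, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
n, p = Y.shape
k = 2

mu = np.array([[-1.0, -1.0], [1.0, 1.0]])  # initial cluster means (columns of C)
pi = np.full(k, 1.0 / k)                   # mixing coefficients pi_j
R = np.eye(p)                              # one covariance shared by all clusters

for _ in range(20):
    # E step: responsibilities r[i, j] proportional to pi_j N(y_i | mu_j, R).
    d = Y[:, None, :] - mu[None, :, :]
    logp = -0.5 * np.einsum('njp,pq,njq->nj', d, np.linalg.inv(R), d) + np.log(pi)
    r = np.exp(logp - logp.max(axis=1, keepdims=True))
    r /= r.sum(axis=1, keepdims=True)
    # M step: re-estimate means, mixing weights, and the shared covariance.
    nk = r.sum(axis=0)
    mu = (r.T @ Y) / nk[:, None]
    pi = nk / n
    d = Y[:, None, :] - mu[None, :, :]
    R = np.einsum('nj,njp,njq->pq', r, d, d) / n
```

Shrinking the shared covariance toward zero hardens the E step into winner-take-all assignments, recovering vector quantization / k-means, in line with the limits discussed in the paper.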

3719 |
Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images
- Geman, Geman
- 1984
(Show Context)
Citation Context ...we can compute the posterior distribution of w given y, which is the product of a Gaussian prior and a non-Gaussian likelihood. Again, this may not be easy and we may wish to resort to Gibbs sampling (Geman and Geman, 1984) or other Markov chain Monte Carlo methods (Neal, 1993). Another option is to employ a deterministic trick recently used by Bishop et al (1997) in the context of the Generative Topographic Map (GTM),... |

2108 |
A new approach to linear filtering and prediction problems
- Kalman
- 1960
(Show Context)
Citation Context ... the desired distributions (8) or (9). Filtering and smoothing have been extensively studied for continuous state models in the signal processing community, starting with the seminal works of Kalman (Kalman, 1960; Kalman and Bucy, 1961) and Rauch (Rauch, 1963; Rauch et al., 1965) although this literature is often not well known in the machine learning community. For the discrete state models much of the liter... |
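
The Kalman filter referenced here computes the filtered state estimates by alternating a predict step through the dynamics with an update step against each observation. A minimal sketch of one cycle for the model x_{t+1} = A x_t + w, y_t = C x_t + v (standard textbook form, not code from the paper):

```python
import numpy as np

def kalman_step(x, V, y, A, C, Q, R):
    """One predict/update cycle of the Kalman filter.

    x, V : filtered state mean and covariance at time t-1
    y    : observation at time t
    """
    # Predict: propagate mean and covariance through the dynamics.
    x_pred = A @ x
    V_pred = A @ V @ A.T + Q
    # Update: correct with the new observation via the Kalman gain.
    S = C @ V_pred @ C.T + R              # innovation covariance
    K = V_pred @ C.T @ np.linalg.inv(S)   # Kalman gain
    x_new = x_pred + K @ (y - C @ x_pred)
    V_new = (np.eye(len(x)) - K @ C) @ V_pred
    return x_new, V_new
```

Running the recursion forward over a sequence, then sweeping backward with the RTS smoother, yields the smoothed estimates that the E step of EM for linear dynamical systems requires.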

1356 |
Independent component analysis, a new concept? Signal Processing 36
- Comon
- 1994
(Show Context)
Citation Context ...re most easily separated. We will focus on a modified, but by now classic, version due to Bell and Sejnowski (1995) and Baram and Roth (1994) of the original independent component analysis algorithm (Comon, 1994). Although Bell and Sejnowski derived it from an information-maximization perspective, this modified algorithm can also be obtained by defining a particular prior distribution over the components of... |

1284 |
Local Computations with Probabilities on Graphical Structures and Their Application to Expert Systems (with Discussion
- Lauritzen, Spiegelhalter
- 1988
(Show Context)
Citation Context ... Λ^{-1/2} Eᵀ x with A′ = (Λ^{-1/2} Eᵀ) A (E Λ^{1/2}) and C′ = C (E Λ^{1/2}) such that the new covariance of x′ is the identity matrix: Q′ = I. ... (or belief) networks, and other formalisms (Pearl, 1988; Lauritzen and Spiegelhalter, 1988; Whittaker, 1990; Smyth et al., 1997). A graphical model is a representation of the dependency structure between variables in a multivariate probability distribution. Each node corresponds to a rando... |

1163 |
Error Bounds for Convolutional Codes and an Asymptotically Optimum Decoding Algorithm
- Viterbi
- 1967
(Show Context)
Citation Context ...ete state models much of the literature stems from the work of Baum and colleagues (Baum and Petrie, 1966; Baum and Eagon, 1967; Baum et al., 1970; Baum, 1972) on hidden Markov models and of Viterbi (Viterbi, 1967) and others on optimal decoding. The recent book by Elliott and colleagues (1995) contains a thorough mathematical treatment of filtering and smoothing for many general models. 4.2 Learning (System Ide... |
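
Viterbi decoding, mentioned in this excerpt as the discrete-state counterpart of optimal state estimation, finds the single most likely hidden state path by dynamic programming over log probabilities. A compact sketch (the interface is invented for illustration):

```python
import numpy as np

def viterbi(log_pi, log_T, log_B):
    """Most likely state sequence for an HMM.

    log_pi : (k,)      log initial state probabilities
    log_T  : (k, k)    log transition probabilities, log_T[i, j] = log P(j | i)
    log_B  : (tau, k)  log observation likelihoods per time step and state
    """
    tau, k = log_B.shape
    delta = log_pi + log_B[0]            # best log score ending in each state
    back = np.zeros((tau, k), dtype=int)
    for t in range(1, tau):
        scores = delta[:, None] + log_T  # scores[i, j]: come from i, land in j
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_B[t]
    # Backtrack from the best final state.
    path = [int(delta.argmax())]
    for t in range(tau - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```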

1122 |
Statistical Analysis with Missing Data
- Little, Rubin
- 1987
(Show Context)
Citation Context ...r. For example, in this framework it is obvious how to deal properly with missing data in solving both the learning and inference problems. This topic has been well understood for many static models (Little and Rubin, 1987; Tresp et al., 1994; Ghahramani and Jordan, 1994) but is typically not well addressed in the linear dynamical systems literature. As another example, it is easy to design and work with models having ... |

1090 | Self-organized Formation of Topologically Correct Feature Maps - Kohonen - 1982 |

1070 | An Information-Maximization Approach to Blind Separation and Blind Deconvolution - Bell, Sejnowski - 1995 |

842 | Least squares quantization in pcm - Lloyd - 1982 |

834 | A tutorial on hidden Markov models
- Rabiner, Juang
- 1989
(Show Context)
Citation Context ...n), the sequence of maximum a posteriori states is exactly the single most likely state trajectory. So the regular Kalman filter and RTS smoothing recursions suffice. It is possible (see for example (Rabiner and Juang, 1986)) to learn the discrete state model parameters based on the results of the Viterbi decoding instead of the forward-backward smoothing --- in other words to maximize the joint likelihood of the observ... |

772 |
A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains
- Baum, Petrie, et al.
- 1970
(Show Context)
Citation Context ...s often not well known in the machine learning community. For the discrete state models much of the literature stems from the work of Baum and colleagues (Baum and Petrie, 1966; Baum and Eagon, 1967; Baum et al., 1970; Baum, 1972) on hidden Markov models and of Viterbi (Viterbi, 1967) and others on optimal decoding. The recent book by Elliott and colleagues (1995) contains a thorough mathematical treatment of filt... |

764 | A view of the EM algorithm that justifies incremental sparse and other variants
- Neal, Hinton
- 1998
(Show Context)
Citation Context ...Y | θ), then some readers may notice that the lower bound F(Q, θ) ≤ L(θ) is the negative of a quantity known in statistical physics as the free energy: the expected energy under Q minus the entropy of Q (Neal and Hinton, 1993). The EM algorithm alternates between maximizing F with respect to the distribution Q and the parameters θ, respectively, holding the other fixed. Starting from some initial parameters θ_0: E step: ... |
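
Written out, the bound in this excerpt is the standard EM free-energy identity: for any distribution Q over the hidden variables X (notation adapted to match the excerpt),

```latex
\mathcal{F}(Q, \theta)
  = \mathbb{E}_{Q(X)}\!\left[\log p(X, Y \mid \theta)\right]
    - \mathbb{E}_{Q(X)}\!\left[\log Q(X)\right]
  = \mathcal{L}(\theta) - \mathrm{KL}\!\left(Q(X) \,\|\, p(X \mid Y, \theta)\right)
  \le \mathcal{L}(\theta) = \log p(Y \mid \theta)
```

The E step closes the KL gap by setting Q to the posterior p(X | Y, θ); the M step raises the expected complete-data log-likelihood term in θ, so each iteration cannot decrease L(θ).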

724 | Hierarchical mixtures of experts and the EM algorithm - Jordan, Jacobs - 1994 |

562 | Probabilistic inference using Markov chain Monte Carlo methods
- Neal
- 1993
(Show Context)
Citation Context ...he product of a Gaussian prior and a non-Gaussian likelihood. Again, this may not be easy and we may wish to resort to Gibbs sampling (Geman and Geman, 1984) or other Markov chain Monte Carlo methods (Neal, 1993). Another option is to employ a deterministic trick recently used by Bishop et al (1997) in the context of the Generative Topographic Map (GTM), which is a probabilistic version of Kohonen's (1982) s... |

512 |
An inequality and associated maximization technique in statistical estimation of probabilistic functions of a Markov process
- Baum
- 1972
(Show Context)
Citation Context ...own in the machine learning community. For the discrete state models much of the literature stems from the work of Baum and colleagues (Baum and Petrie, 1966; Baum and Eagon, 1967; Baum et al., 1970; Baum, 1972) on hidden Markov models and of Viterbi (Viterbi, 1967) and others on optimal decoding. The recent book by Elliott and colleagues (1995) contains a thorough mathematical treatment of filtering and sm... |

510 | A New Learning Algorithm for Blind Signal Separation
- Amari, Cichocki, et al.
- 1996
(Show Context)
Citation Context ...ticular prior distribution over the components of the vector x_t of sources and then deriving a gradient learning rule that maximizes the likelihood of the data y_t in the limit of zero output noise (Amari et al., 1996; Pearlmutter and Parra, 1997; MacKay, 1996). The algorithm, originally derived for unordered data, has also been extended to modeling time series (Pearlmutter and Parra, 1997). We now show that the g... |
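
The gradient rule this excerpt refers to is, in its natural-gradient form (Amari et al., 1996), ΔW ∝ (I − E[φ(u)uᵀ])W with u = Wy and φ a nonlinearity matched to the assumed source prior. A toy sketch with a tanh nonlinearity for super-Gaussian sources (the mixing matrix, step size, and iteration count are illustrative choices, not values from the paper):

```python
import numpy as np

rng = np.random.default_rng(3)

# Two super-Gaussian (Laplacian) sources, linearly mixed.
S = rng.laplace(size=(2, 5000))
A_mix = np.array([[1.0, 0.6], [0.4, 1.0]])   # hypothetical mixing matrix
Y = A_mix @ S
n = Y.shape[1]

W = np.eye(2)   # unmixing matrix to be learned
eta = 0.01
for _ in range(400):
    U = W @ Y                 # current source estimates u = W y
    phi = np.tanh(U)          # score nonlinearity for a super-Gaussian prior
    # Natural-gradient update: dW = eta * (I - E[phi(u) u^T]) W
    W += eta * (np.eye(2) - (phi @ U.T) / n) @ W

# If separation succeeds, W @ A_mix approaches a scaled permutation matrix.
P = W @ A_mix
```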

441 | Graphical models in applied multivariate statistics - Whittaker - 1990 |

397 | Mixtures of probabilistic principal component analysers
- Tipping, Bishop
- 1999
(Show Context)
Citation Context ...ents of y the model could not be easily corrected since v has spherical covariance (R = εI). The SPCA model is very similar to the independently proposed probabilistic principal component analysis (Tipping and Bishop, 1997). If we go even further and take the limit R = lim_{ε→0} εI (while keeping the diagonal elements of Q finite⁹) then we obtain the standard principal component analysis or PCA model. The direction... |

374 |
Theory and practice of recursive identification
- Ljung, Soderstrom
- 1983
(Show Context)
Citation Context ...rea of study in control theory known as system identification which investigates learning in continuous state models. For linear Gaussian models there are several approaches to system identification (Ljung and Soderstrom, 1983), but to clarify the relationship between these models and the others we review in this paper, we focus on system identification methods based on the EM algorithm, described below. The EM algorithm f... |

361 |
Statistical Inference for Probabilistic Functions of Finite State Markov Chains
- Baum, Petrie
- 1966
(Show Context)
Citation Context ...auch et al., 1965) although this literature is often not well known in the machine learning community. For the discrete state models much of the literature stems from the work of Baum and colleagues (Baum and Petrie, 1966; Baum and Eagon, 1967; Baum et al., 1970; Baum, 1972) on hidden Markov models and of Viterbi (Viterbi, 1967) and others on optimal decoding. The recent book by Elliott and colleagues (1995) contains ... |

331 | New results in linear filtering and prediction theory
- Kalman, Bucy
- 1961
(Show Context)
Citation Context ...istributions (8) or (9). Filtering and smoothing have been extensively studied for continuous state models in the signal processing community, starting with the seminal works of Kalman (Kalman, 1960; Kalman and Bucy, 1961) and Rauch (Rauch, 1963; Rauch et al., 1965) although this literature is often not well known in the machine learning community. For the discrete state models much of the literature stems from the wo... |

280 | GTM: The generative topographic mapping - Bishop, Svensén, et al. - 1998 |

247 | Exponentiated gradient versus gradient descent for linear predictors
- KIVINEN, WARMUTH
- 1997
(Show Context)
Citation Context ...ed approach to learning. Finally, recent work in on-line learning has shown that it is possible to derive a family of EM-like algorithms with faster convergence rates than the standard EM algorithm (Kivinen and Warmuth, 1997; Bauer et al., 1997). 5.3 SPCA and PCA If instead of restricting R to be merely diagonal, we require it to be a multiple of the identity matrix (in other words the covariance ellipsoid of v is spheri... |

222 |
An approach to time series smoothing and forecasting using the EM algorithm
- Shumway, Stoffer
- 1982
(Show Context)
Citation Context ... V̂_{t,t-1} ← V_t^t J′_{t-1} + J_t (V̂_{t+1,t} - A V_t^t) J′_{t-1} if t < τ; end; return x̂_t, V̂_t, V̂_{t,t-1} for all t. A.3.2 Learning The EM algorithm for learning a linear dynamical system is (Shumway and Stoffer, 1982; Ghahramani and Hinton, 1996a), assuming for simplicity that we only have a single sequence of observations: LDSLearn(Y, k, ε): initialize A, C, Q, R, x_1^0, V_1^0; set α ← Σ_t y_t y_tᵀ; while change... |

219 |
Adaptive Filtering Prediction and Control
- Goodwin, Sin
- 1984
(Show Context)
Citation Context ...e all the matrices are full-rank, the problem of inferring the state from a sequence of τ consecutive observations is well-defined as long as k ≤ τp (a notion related to observability in systems theory (Goodwin and Sin, 1984)). For this reason, in dynamic models it is sometimes useful to use state spaces of larger dimension than the observations, k > p, in which case the state provides a compact representation of a seque... |

214 |
An Inequality with Applications to Statistical Estimation for Probabilistic Functions of a Markov Process and to a Model for Ecology
- Baum, Eagon
- 1967
(Show Context)
Citation Context ...ough this literature is often not well known in the machine learning community. For the discrete state models much of the literature stems from the work of Baum and colleagues (Baum and Petrie, 1966; Baum and Eagon, 1967; Baum et al., 1970; Baum, 1972) on hidden Markov models and of Viterbi (Viterbi, 1967) and others on optimal decoding. The recent book by Elliott and colleagues (1995) contains a thorough mathematica... |

184 | Supervised learning from incomplete data via an EM approach
- Ghahramani, Jordan
- 1994
(Show Context)
Citation Context ...ous how to deal properly with missing data in solving both the learning and inference problems. This topic has been well understood for many static models (Little and Rubin, 1987; Tresp et al., 1994; Ghahramani and Jordan, 1994) but is typically not well addressed in the linear dynamical systems literature. As another example, it is easy to design and work with models having a mixed continuous and discrete state vector (suc... |

178 |
Neural networks and principal component analysis: Learning from examples without local minima
- Baldi, Hornik
- 1989
(Show Context)
Citation Context ...work interpretations and regularization Since early in the modern history of neural networks it was realized that principal component analysis could be implemented using a linear autoencoder network (Baldi and Hornik, 1989). The data is fed both as the input and target of the network, and the network parameters are learned using the squared error cost function. In this section we show how factor analysis and mixture of... |

167 | Probabilistic independence networks for hidden markov probability models
- Smyth, Heckerman, et al.
- 1997
(Show Context)
Citation Context ...= C (E Λ^{1/2}) such that the new covariance of x′ is the identity matrix: Q′ = I. ... (or belief) networks, and other formalisms (Pearl, 1988; Lauritzen and Spiegelhalter, 1988; Whittaker, 1990; Smyth et al., 1997). A graphical model is a representation of the dependency structure between variables in a multivariate probability distribution. Each node corresponds to a random variable, and the absence of an arc... |

156 | Parameter estimation for linear dynamical systems
- Ghahramani, Hinton
- 1996
(Show Context)
Citation Context ...filtering. In this paper we unify many of the disparate observations made by previous authors (Rubin and Thayer, 1982; Delyon, 1993; Digalakis et al., 1993; Hinton et al., 1995; Elliott et al., 1995; Ghahramani and Hinton, 1996a, 1996b, 1997; Hinton and Ghahramani, 1997) and present a review of all these algorithms as instances of a single basic generative model. This unified view allows us to show some interesting relation... |

147 |
A Rapidly Convergent Descent Method for Minimization
- Fletcher, Powell
- 1963
(Show Context)
Citation Context ...sis has been criticised as being quite slow (Rubin and Thayer, 1982). Indeed, the standard method for fitting a factor analysis model (Jöreskog, 1967) is based on a quasi-Newton optimization algorithm (Fletcher and Powell, 1963) which has been found empirically to converge faster than EM. We present the EM algorithm here, not because it is the most efficient way of fitting a factor analysis model, but because we wish to emphas... |

146 | Modeling the manifolds of images of handwritten digits - Hinton, Dayan, et al. - 1997 |

146 |
Turbulence and the dynamics of coherent structures
- Sirovich
- 1987
(Show Context)
Citation Context ...housands) number of dimensions and want to extract only a few (tens) principal components we cannot naively try to diagonalize the sample covariance of our data. Techniques like the snap-shot method (Sirovich, 1987) attempt to address this but still require the diagonalization of an N by N matrix where N is the number of data points. The EM algorithm approach solves all of these problems, requiring no explicit ... |
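
The EM approach to PCA alluded to here (Roweis, 1998) alternates two least-squares solves and never forms or diagonalizes the full sample covariance. A hedged sketch (dimensions and data are invented; the two updates follow the E and M steps of EM for PCA in the zero-noise limit):

```python
import numpy as np

rng = np.random.default_rng(4)

# Data with most variance along the first three coordinate axes: p dims, n points.
p, n, k = 20, 500, 3
Y = rng.standard_normal((p, n)) * np.r_[10.0, 5.0, 3.0, np.ones(p - 3)][:, None]
Y -= Y.mean(axis=1, keepdims=True)

C = rng.standard_normal((p, k))   # initial loading matrix
for _ in range(100):
    # E step: X = (C^T C)^{-1} C^T Y   (project data onto the current subspace)
    X = np.linalg.solve(C.T @ C, C.T @ Y)
    # M step: C = Y X^T (X X^T)^{-1}   (least-squares re-fit of the loadings)
    C = Y @ X.T @ np.linalg.inv(X @ X.T)

# The columns of C span the leading principal subspace (not orthonormalized).
subspace, _ = np.linalg.qr(C)
```

Each iteration costs O(npk), so nothing larger than a k-by-k system is ever inverted, which is the advantage over diagonalizing a p-by-p or N-by-N matrix that the excerpt describes.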

138 | Clustering Sequences with Hidden Markov Models
- Smyth
- 1997
(Show Context)
Citation Context ... 1991; Ghahramani and Hinton, 1996b) and references therein); mixtures of factor analyzers (Ghahramani and Hinton, 1997) and of pancakes (PCA) (Hinton et al., 1995); mixtures of hidden Markov models (Smyth, 1997). A mixture of m of our constrained mixtures of Gaussians each with k clusters gives a mixture model with mk components in which there are only m possible covariance matrices. This "tied covariance" ... |

132 |
An introduction to latent variable models
- Everitt
- 1984
(Show Context)
Citation Context ...) then we obtain the standard principal component analysis or PCA model. The directions 8 The correction k(k-1)/2 comes in because of the degeneracy in ordering the factors; see for example (Everitt, 1984). 9 Since isotropic scaling of the data space is arbitrary we could just as easily take the limit as the diagonal elements of Q became infinite while holding R finite or take both limits at once. The... |

126 |
Maximum likelihood estimates of linear dynamic systems
- Rauch, Tung, et al.
- 1965
(Show Context)
Citation Context ...g have been extensively studied for continuous state models in the signal processing community, starting with the seminal works of Kalman (Kalman, 1960; Kalman and Bucy, 1961) and Rauch (Rauch, 1963; Rauch et al., 1965) although this literature is often not well known in the machine learning community. For the discrete state models much of the literature stems from the work of Baum and colleagues (Baum and Petrie, ... |

120 | Generative models for discovering sparse distributed representations
- Hinton, Ghahramani
- 1997
(Show Context)
Citation Context ...he disparate observations made by previous authors (Rubin and Thayer, 1982; Delyon, 1993; Digalakis et al., 1993; Hinton et al., 1995; Elliott et al., 1995; Ghahramani and Hinton, 1996a, 1996b, 1997; Hinton and Ghahramani, 1997) and present a review of all these algorithms as instances of a single basic generative model. This unified view allows us to show some interesting relations between previously disparate algorithms. |

110 | Autoencoders, minimum description length, and Helmholtz free energy
- Hinton, Zemel
- 1994
(Show Context)
Citation Context ...albeit with a different cost function. To understand how a probabilistic model can be learned using an autoencoder it is very useful to make a recognition/generation decomposition of the autoencoder (Hinton and Zemel, 1994; Hinton 15 In the limit of zero noise, R = 0, the EM updates derived in this manner degenerate to C ← C and R ← R. Since this does not decrease the likelihood, it does not contradict the convergence ... |

108 | Maximum Likelihood and Covariant Algorithms for Independent Component Analysis - MacKay - 1996 |

98 | EM algorithms for PCA and SPCA - Roweis - 1998 |

96 |
EM algorithms for ML factor analysis
- Rubin, Thayer
- 1982
(Show Context)
Citation Context ...re closely related and Digalakis et al. (1993) relate the forward-backward algorithm for HMMs to Kalman filtering. In this paper we unify many of the disparate observations made by previous authors (Rubin and Thayer, 1982; Delyon, 1993; Digalakis et al., 1993; Hinton et al., 1995; Elliott et al., 1995; Ghahramani and Hinton, 1996a, 1996b, 1997; Hinton and Ghahramani, 1997) and present a review of all these algorithms ... |

90 | Hidden Markov models: Estimation and control - Elliott, Aggoun, et al. - 1995 |

85 | ML Estimation of a Stochastic Linear System with the EM Algorithm and Its Application to Speech Recognition
- Digalakis, Rohlicek, et al.
- 1993
(Show Context)
Citation Context .... (1993) relate the forward-backward algorithm for HMMs to Kalman filtering. In this paper we unify many of the disparate observations made by previous authors (Rubin and Thayer, 1982; Delyon, 1993; Digalakis et al., 1993; Hinton et al., 1995; Elliott et al., 1995; Ghahramani and Hinton, 1996a, 1996b, 1997; Hinton and Ghahramani, 1997) and present a review of all these algorithms as instances of a single basic generat... |

76 |
Maximum likelihood competitive learning
- Nowlan
- 1990
(Show Context)
Citation Context ...cluster means (columns of C), the covariance R, and the mixing coefficients π_j. This is easily done with EM and corresponds exactly to maximum likelihood competitive learning (Duda and Hart, 1973; Nowlan, 1991) except that all the clusters share the same covariance. Later we introduce extensions to the model which remove this restriction. As in the continuous state case, we can consider the limit as the ob... |

74 | Maximum likelihood blind source separation a context-sensitive generalization
- Pearlmutter, Parra
- 1997
(Show Context)
Citation Context ...bution over the components of the vector x_t of sources and then deriving a gradient learning rule that maximizes the likelihood of the data y_t in the limit of zero output noise (Amari et al., 1996; Pearlmutter and Parra, 1997; MacKay, 1996). The algorithm, originally derived for unordered data, has also been extended to modeling time series (Pearlmutter and Parra, 1997). We now show that the generative model underlying IC... |