## Unsupervised learning (2004)


Venue: Advanced Lectures on Machine Learning

Citations: 19 (0 self)

### BibTeX

```bibtex
@INPROCEEDINGS{Ghahramani04unsupervisedlearning,
  author    = {Zoubin Ghahramani},
  title     = {Unsupervised learning},
  booktitle = {Advanced Lectures on Machine Learning},
  year      = {2004},
  pages     = {72--112},
  publisher = {Springer-Verlag}
}
```


### Abstract

We give a tutorial and overview of the field of unsupervised learning from the perspective of statistical modelling. Unsupervised learning can be motivated from information-theoretic and Bayesian principles. We briefly review basic models in unsupervised learning, including factor analysis, PCA, mixtures of Gaussians, ICA, hidden Markov models, state-space models, and many variants and extensions. We derive the EM algorithm and give an overview of fundamental concepts in graphical models and inference algorithms on graphs. This is followed by a quick tour of approximate Bayesian inference, including Markov chain Monte Carlo (MCMC), the Laplace approximation, BIC, variational approximations, and expectation propagation (EP). The aim of this chapter is to provide a high-level view of the field. Along the way, many state-of-the-art ideas and future directions are also reviewed.

### Citations

8089 | Maximum likelihood from incomplete data via the EM algorithm
- Dempster, Laird, et al.
- 1977
Citation Context: ...e carried out for each i, where the distributions are q^(t)_{x_i}(x_i). Clearly q_θ(θ) and q_{x_i}(x_i) are coupled, so we iterate these equations until convergence. Recalling the EM algorithm (Section 3 and [14, 63]) we note the similarity between EM and the iterative algorithm in (62) and (63). This procedure is called the Variational Bayesian EM Algorithm and generalises the usual EM algorithm; see also [5] an...
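The alternating updates of q_θ and q_{x_i} described in this context mirror the E and M steps of ordinary EM. As a much simpler concrete illustration (not the chapter's derivation), here is a minimal EM sketch for the means of a 1-D two-component Gaussian mixture with unit variances and equal mixing weights; the data and initial values are invented for illustration.

```python
import math, random

random.seed(0)
# Synthetic 1-D data from two well-separated Gaussians.
data = [random.gauss(-2, 1) for _ in range(200)] + \
       [random.gauss(3, 1) for _ in range(200)]

mu = [-1.0, 1.0]  # initial guesses for the two component means
for _ in range(50):
    # E step: posterior responsibility of each component for each point
    # (equal weights and unit variances, so only the means matter).
    resp = []
    for x in data:
        w = [math.exp(-0.5 * (x - m) ** 2) for m in mu]
        s = sum(w)
        resp.append([wi / s for wi in w])
    # M step: each mean becomes a responsibility-weighted average.
    for k in range(2):
        den = sum(r[k] for r in resp)
        mu[k] = sum(r[k] * x for r, x in zip(resp, data)) / den
print(sorted(mu))  # close to the true means -2 and 3
```

Each pass monotonically increases the likelihood, just as the coupled variational updates increase their lower bound.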

7052 | Probabilistic Reasoning in Intelligent Systems
- Pearl
- 1988
Citation Context: ...Y. These types of phenomena are related to “explaining away”, which refers to the fact that if there are multiple potential causes for some effect, observing one explains away the need for the others [64]. Intractability can thus occur if we have a model with discrete hidden variables which can take on exponentially many combinations. Intractability can also occur with continuous hidden variables if t...

1688 | A Global Geometric Framework for Nonlinear Dimensionality Reduction
- Tenenbaum, de Silva, et al.
Citation Context: ...sionality reduction models, including generative topographic mappings (GTM) [11] (a probabilistic alternative to Kohonen maps), multi-dimensional scaling (MDS) [72, 45], principal curves [30], Isomap [76], and locally linear embedding (LLE) [69]. Hidden Markov models also have their limitations. Even though they can model nonlinear dynamics by discretising the hidden state space, an HMM with K hidden ...

1614 | Nonlinear dimensionality reduction by locally linear embedding
- Roweis, Saul
- 2000
Citation Context: ...erative topographic mappings (GTM) [11] (a probabilistic alternative to Kohonen maps), multi-dimensional scaling (MDS) [72, 45], principal curves [30], Isomap [76], and locally linear embedding (LLE) [69]. Hidden Markov models also have their limitations. Even though they can model nonlinear dynamics by discretising the hidden state space, an HMM with K hidden states can only capture log_2 K bits of i...

1304 | Near Shannon limit error-correcting coding and decoding: Turbo-codes
- Berrou, Glavieux, et al.
- 1993
Citation Context: ... called loopy belief propagation and has been analysed by several researchers [81, 82]. Interest in loopy belief propagation arose out of its impressive performance in decoding error correcting codes [21, 9, 50, 49]. Although the beliefs are not guaranteed to be correct on loopy graphs, interesting connections can be made to approximate inference procedures inspired by statistical physics known as the Bethe and ...

1284 | Local Computations with Probabilities on Graphical Structures and Their Application to Expert Systems (with Discussion)
- Lauritzen, Spiegelhalter
- 1988
Citation Context: ...n of factor graph propagation “loopy belief propagation”. 8.4 Junction tree algorithm: For multiply-connected graphs, the standard exact inference algorithms are based on the notion of a junction tree [46]. The basic idea of the junction tree algorithm is to group variables so as to convert the multiply-connected graph into a singly-connected undirected graph (tree) over sets of variables, and do infere...

1157 | Information Theory, Inference, and Learning Algorithms
- MacKay
- 2003
Citation Context: ...e new data. This is an important link between machine learning, statistics, and information theory. An excellent text which elaborates on these relationships and many of the topics in this chapter is [48]. 1.3 Bayes rule: Bayes rule, P(y|x) = P(x|y)P(y)/P(x), which follows from the equality P(x, y) = P(x)P(y|x) = P(y)P(x|y), can be used to motivate a coherent statistical framework for machine l...
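The Bayes rule quoted in this context can be exercised directly. Below is a minimal Python sketch for a discrete hypothesis y and a single observed x; the prior and likelihood numbers are invented purely for illustration.

```python
# Bayes rule: P(y|x) = P(x|y) P(y) / P(x), with P(x) = sum_y P(x|y) P(y).
def posterior(prior, likelihood):
    """prior: {y: P(y)}; likelihood: {y: P(x|y)} for the observed x."""
    evidence = sum(likelihood[y] * prior[y] for y in prior)  # P(x)
    return {y: likelihood[y] * prior[y] / evidence for y in prior}

# Hypothetical two-class example (numbers are illustrative only).
prior = {"y0": 0.7, "y1": 0.3}
lik = {"y0": 0.1, "y1": 0.8}   # P(x | y) for the one observed x
post = posterior(prior, lik)
print(post)  # posterior mass shifts toward y1, which explains x better
```

Because the evidence P(x) is just a normaliser, the posterior always sums to one regardless of the prior chosen.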

1138 | Spatial interaction and the statistical analysis of lattice systems
- Besag
- 1974
Citation Context: ...obability distributions which can be written as a normalised product of non-negative functions over the variables in the maximal cliques of the graph (this is known as the Hammersley-Clifford Theorem [10]). In the example in Figure 1, this implies that the probability distribution over (A, B, C, D, E) can be written as: P(A, B, C, D, E) = c g1(A, C) g2(B, C, D) g3(C, D, E) (28). Here, c is the const...
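The factorisation in this context, equation (28), can be made concrete for binary variables. The sketch below uses arbitrary non-negative potentials (chosen only for illustration) and shows how the normalising constant c is computed by summing the unnormalised product over all assignments.

```python
# P(A,B,C,D,E) = c * g1(A,C) * g2(B,C,D) * g3(B..E) over binary variables,
# as in (28).  The potentials are arbitrary non-negative functions.
from itertools import product

g1 = lambda a, c: 1.0 + a * c
g2 = lambda b, c, d: 2.0 if b == d else 0.5
g3 = lambda c, d, e: 1.0 + c + d + e

# Normalising constant: sum the unnormalised product over all 2^5 states.
Z = sum(g1(a, c) * g2(b, c, d) * g3(c, d, e)
        for a, b, c, d, e in product([0, 1], repeat=5))
c_norm = 1.0 / Z   # the constant c in (28)

def p(a, b, c, d, e):
    return c_norm * g1(a, c) * g2(b, c, d) * g3(c, d, e)

total = sum(p(*x) for x in product([0, 1], repeat=5))
print(total)  # 1.0 up to floating point
```

For 5 binary variables this brute-force sum is cheap; the exponential cost of computing Z in general is exactly the intractability the surrounding contexts discuss.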

1131 | CONDENSATION conditional density propagation for visual tracking
- Isard, Blake
- 1998
Citation Context: ...ussians), and the M step is nonlinear regression, rather than linear regression [25]. There are many methods of dealing with inference in non-linear SSMs, including methods such as particle filtering [29, 27, 40, 43, 35, 15], linearisation [2], the unscented filter [39, 80], the EP algorithm [52], and embedded HMMs [62]. Non-linear models are also important if we are to consider generalising simple dimensionality redu...

1060 | Novel approach to nonlinear/non-Gaussian Bayesian state estimation. IEE Proceedings F
- Gordon, Salmond, et al.
- 1993
Citation Context: ...ussians), and the M step is nonlinear regression, rather than linear regression [25]. There are many methods of dealing with inference in non-linear SSMs, including methods such as particle filtering [29, 27, 40, 43, 35, 15], linearisation [2], the unscented filter [39, 80], the EP algorithm [52], and embedded HMMs [62]. Non-linear models are also important if we are to consider generalising simple dimensionality redu...

890 | Sequential Monte Carlo Methods in Practice
- Doucet, Freitas, et al.
- 2001
Citation Context: ...ussians), and the M step is nonlinear regression, rather than linear regression [25]. There are many methods of dealing with inference in non-linear SSMs, including methods such as particle filtering [29, 27, 40, 43, 35, 15], linearisation [2], the unscented filter [39, 80], the EP algorithm [52], and embedded HMMs [62]. Non-linear models are also important if we are to consider generalising simple dimensionality redu...

889 | Low-density parity-check codes
- Gallager
- 1962
Citation Context: ... called loopy belief propagation and has been analysed by several researchers [81, 82]. Interest in loopy belief propagation arose out of its impressive performance in decoding error correcting codes [21, 9, 50, 49]. Although the beliefs are not guaranteed to be correct on loopy graphs, interesting connections can be made to approximate inference procedures inspired by statistical physics known as the Bethe and ...

854 | A tutorial on learning with Bayesian networks
- Heckerman
- 1995
Citation Context: ...tion 8 we described exact algorithms for inferring the value of variables in a graph with known parameters and structure. If the parameters and structure are unknown they can be learned from the data [31]. The learning problem can be divided into learning the graph parameters for a known structure, and learning the model structure (i.e. which edges should be present or absent). We focus here on dire...

831 | An introduction to variational methods for graphical models
- Jordan, Ghahramani, et al.
- 1999
Citation Context: ...pproximations: Variational methods can be used to derive a family of lower bounds on the marginal likelihood and to perform approximate Bayesian inference over the parameters of a probabilistic model [38, 83, 79]. Variational methods provide an alternative to the asymptotic and sampling-based approximations described above; they tend to be more accurate than the asymptotic approximations like BIC and faster t...

764 | A view of the EM algorithm that justifies incremental sparse and other variants
- Neal, Hinton
- 1998
Citation Context: ...e carried out for each i, where the distributions are q^(t)_{x_i}(x_i). Clearly q_θ(θ) and q_{x_i}(x_i) are coupled, so we iterate these equations until convergence. Recalling the EM algorithm (Section 3 and [14, 63]) we note the similarity between EM and the iterative algorithm in (62) and (63). This procedure is called the Variational Bayesian EM Algorithm and generalises the usual EM algorithm; see also [5] an...

706 | Optimal Filtering
- Anderson, Moore
Citation Context: ...gression, rather than linear regression [25]. There are many methods of dealing with inference in non-linear SSMs, including methods such as particle filtering [29, 27, 40, 43, 35, 15], linearisation [2], the unscented filter [39, 80], the EP algorithm [52], and embedded HMMs [62]. Non-linear models are also important if we are to consider generalising simple dimensionality reduction models such a...

580 | The computational complexity of probabilistic inference using Bayesian belief networks
- Cooper
- 1990
Citation Context: ...integrals which exploit the structure of the graph to get the solution efficiently for certain graph structures (namely trees and related graphs). For general graphs the problem is fundamentally hard [13]. 8.1 Elimination: The simplest algorithm conceptually is variable elimination. It is easiest to explain with an example. Consider computing P(A = a|D = d) in the directed graph in Figure 1. T...
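The variable elimination mentioned in this context can be sketched on a tiny example. Below is a minimal Python sketch on a chain A → B → C with binary variables, computing P(C) = Σ_A Σ_B P(A) P(B|A) P(C|B) by summing out A first and then B, so the full joint table is never built; all CPT numbers are invented for illustration.

```python
# Chain A -> B -> C, binary variables; CPT numbers are illustrative only.
pA = {0: 0.6, 1: 0.4}
pB_given_A = {(0, 0): 0.9, (1, 0): 0.1, (0, 1): 0.2, (1, 1): 0.8}  # key (b, a)
pC_given_B = {(0, 0): 0.7, (1, 0): 0.3, (0, 1): 0.4, (1, 1): 0.6}  # key (c, b)

# Eliminate A: intermediate factor m1(b) = sum_a P(a) P(b|a)
m1 = {b: sum(pA[a] * pB_given_A[(b, a)] for a in (0, 1)) for b in (0, 1)}
# Eliminate B: P(c) = sum_b m1(b) P(c|b)
pC = {c: sum(m1[b] * pC_given_B[(c, b)] for b in (0, 1)) for c in (0, 1)}
print(pC)
```

Each elimination step only ever touches a factor over two variables, which is the efficiency gain the exact algorithms exploit on trees.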

564 | Dynamic Bayesian networks: representation, inference and learning
- Murphy
Citation Context: ...alleviated by restricting the interactions between the hidden variables at one time step and at the next time step. A generalisation of these ideas is the notion of a dynamical Bayesian network (DBN) [56]. A relatively old but still quite powerful class of models for binary data is the Boltzmann machine (BM) [1]. This is a simple model inspired from Ising models in statistical physics. A BM is a multi...

562 | Probabilistic inference using Markov chain Monte Carlo methods
- Neal
- 1993
Citation Context: ...ibution, as can be seen by multiplying (45) by the parameter prior and normalising. Inference can be achieved via approximate inference methods such as Markov chain Monte Carlo methods (Section 11.3, [59]) like Gibbs sampling, and variational approximations (Section 11.4, [6]). 9.2 Learning graph structure: There are two basic components to learning the structure of a graph from data: scoring and searc...

538 | Hierarchical Dirichlet processes
- Teh, Jordan, et al.
- 2006
Citation Context: ...d infinite mixture models to hidden Markov models with infinitely many states [7]. Infinite models based on Dirichlet processes have also been generalised to be hierarchical in several different ways [61, 75]. Bayesian inference in nonparametric models is one of the most active areas of research in unsupervised learning, and there still remain many open problems. As we have seen, the field of unsupervised...

514 | Good error correcting codes based on very sparse matrices
- MacKay
- 1999
Citation Context: ... called loopy belief propagation and has been analysed by several researchers [81, 82]. Interest in loopy belief propagation arose out of its impressive performance in decoding error correcting codes [21, 9, 50, 49]. Although the beliefs are not guaranteed to be correct on loopy graphs, interesting connections can be made to approximate inference procedures inspired by statistical physics known as the Bethe and ...

490 | Semi-supervised learning using Gaussian fields and harmonic functions
- Zhu, Ghahramani, et al.
- 2003
Citation Context: ...this problem attempt to infer a manifold, graph structure, or tree structure from the unlabelled data and use spread in this structure to determine how labels will generalise to new unlabelled points [74, 85, 8, 42]. Another area of great interest which we did not have the space to cover is nonparametric models. The basic assumption of parametric statistical models is that the model is defined using a finite nu...

489 | Factorial Hidden Markov Models
- Ghahramani, Jordan
- 1998
Citation Context: ...dden Markov models can be seen as an extension of finite mixture models to model time series data, it is possible to extend infinite mixture models to hidden Markov models with infinitely many states [7]. Infinite models based on Dirichlet processes have also been generalised to be hierarchical in several different ways [61, 75]. Bayesian inference in nonparametric models is one of the most active ar...

476 | Probabilistic principal component analysis
- Tipping, Bishop
- 1999
Citation Context: ...ifications to FA. First, the noise is assumed to be isotropic, in other words each element of ε has equal variance: Ψ = σ²I, where I is a D×D identity matrix. This model is called probabilistic PCA [67, 78]. Second, if we take the limit of σ → 0 in probabilistic PCA, we obtain standard PCA (which also goes by the names Karhunen-Loève expansion, and singular value decomposition; SVD). Given a data set wi...

431 | A learning algorithm for Boltzmann machines
- Ackley, Hinton, et al.
- 1985
Citation Context: ...step. A generalisation of these ideas is the notion of a dynamical Bayesian network (DBN) [56]. A relatively old but still quite powerful class of models for binary data is the Boltzmann machine (BM) [1]. This is a simple model inspired from Ising models in statistical physics. A BM is a multivariate model for capturing correlations and higher order statistics in vectors of binary data. Consider data...

427 | Graphical models, exponential families, and variational inference
- Wainwright, Jordan
- 2008
Citation Context: ...pproximations: Variational methods can be used to derive a family of lower bounds on the marginal likelihood and to perform approximate Bayesian inference over the parameters of a probabilistic model [38, 83, 79]. Variational methods provide an alternative to the asymptotic and sampling-based approximations described above; they tend to be more accurate than the asymptotic approximations like BIC and faster t...

426 | New Extension of the Kalman Filter to Nonlinear Systems
- Julier, Uhlmann
- 1997
Citation Context: ...ar regression [25]. There are many methods of dealing with inference in non-linear SSMs, including methods such as particle filtering [29, 27, 40, 43, 35, 15], linearisation [2], the unscented filter [39, 80], the EP algorithm [52], and embedded HMMs [62]. Non-linear models are also important if we are to consider generalising simple dimensionality reduction models such as PCA and FA. These models are ...

416 | Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems. The Annals of Statistics
- Antoniak
- 1974
Citation Context: ...the model. For this reason nonparametric models are also sometimes called infinite models. An important example of this is infinite mixture models, more formally known as Dirichlet process mixtures [3, 18]. These correspond to mixture models (Section 2.4) where the number of components is assumed to be infinite. Inference can be done in these models using MCMC methods [17, 60, 65], variational methods ...
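The clustering prior of a Dirichlet-process mixture like those in this context can be sampled with the Chinese restaurant process: customer n joins an existing table with probability proportional to its size, or starts a new table with probability proportional to a concentration parameter α. A small sketch (α and n are chosen only for illustration):

```python
import random

def crp(n, alpha, rng=random.Random(0)):
    """Sample table sizes for n customers from a CRP with concentration alpha."""
    tables = []                       # current table sizes
    for i in range(n):
        r = rng.uniform(0, i + alpha) # total weight: i seated + alpha for new
        acc = 0.0
        for t, size in enumerate(tables):
            acc += size
            if r < acc:
                tables[t] += 1        # join existing table t
                break
        else:
            tables.append(1)          # open a new table
    return tables

sizes = crp(100, alpha=1.0)
print(len(sizes), sizes)  # number of occupied tables grows roughly as alpha * log n
```

This exchangeable seating scheme is what makes the "infinitely many components" prior tractable: only the occupied components are ever represented.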

397 | Bayesian density estimation and inference using mixtures
- Escobar, West
- 1995
Citation Context: ...Dirichlet process mixtures [3, 18]. These correspond to mixture models (Section 2.4) where the number of components is assumed to be infinite. Inference can be done in these models using MCMC methods [17, 60, 65], variational methods [12], or the EP algorithm [54]. Just as hidden Markov models can be seen as an extension of finite mixture models to model time series data, it is possible to extend infinite mix...

397 | Mixtures of probabilistic principal component analysers
- Tipping, Bishop
- 1999
Citation Context: ...odel. This motivates us to seek to describe and study learning in much more flexible models. A simple combination of two of the ideas we have described for iid data is the mixture of factor analysers [23, 34, 77]. This model performs simultaneous clustering and dimensionality reduction on the data, by assuming that the covariance in each Gaussian cluster can be modelled by an FA model. Thus, it becomes possib...

397 | Generalized belief propagation
- Yedidia, Freeman, et al.
- 2001
Citation Context: ...are not guaranteed to be correct on loopy graphs, interesting connections can be made to approximate inference procedures inspired by statistical physics known as the Bethe and Kikuchi free energies [84]. 8.3 Factor graph propagation: In belief propagation, there is an asymmetry between the messages a child sends its parents and the messages a parent sends its children. Propagation in singly-connected...

395 | Monte Carlo filter and smoother for non-Gaussian nonlinear state space models
- Kitagawa
- 1996

372 | Markov chain sampling methods for Dirichlet process mixture models
- Neal
- 1998
Citation Context: ...Dirichlet process mixtures [3, 18]. These correspond to mixture models (Section 2.4) where the number of components is assumed to be infinite. Inference can be done in these models using MCMC methods [17, 60, 65], variational methods [12], or the EP algorithm [54]. Just as hidden Markov models can be seen as an extension of finite mixture models to model time series data, it is possible to extend infinite mix...

328 | Multidimensional scaling by optimizing goodness of fit to a non-metric hypothesis
- Kruskal
- 1964
Citation Context: ...nteresting and important nonlinear dimensionality reduction models, including generative topographic mappings (GTM) [11] (a probabilistic alternative to Kohonen maps), multi-dimensional scaling (MDS) [72, 45], principal curves [30], Isomap [76], and locally linear embedding (LLE) [69]. Hidden Markov models also have their limitations. Even though they can model nonlinear dynamics by discretising the hidde...

327 | Complexity of finding embeddings in a k-tree
- Arnborg, Corneil, et al.
- 1987
Citation Context: ...nction tree algorithm would have to store and manipulate tables of size 2^K. Moreover, finding the optimal triangulation to get the most efficient junction tree for a particular graph is NP-complete [4, 44]. 8.5 Cutset conditioning: In certain graphs the simplest inference algorithm is cutset conditioning, which is related to the idea of “reasoning by assumptions”. The basic idea is very straightforward: ...

311 | Regularization theory and neural-network architectures, Neural Computation
- Girosi, Jones, et al.
- 1995
Citation Context: ...complex models will generally have higher maxima of the likelihood. In order to avoid problems with overfitting, frequentist procedures often maximise a penalised or regularised log likelihood (e.g. [26]). If the penalty or regularisation term is interpreted as a log prior, then maximising penalised likelihood appears identical to maximising a posterior. However, there are subtle issues that make a B...

309 | Turbo decoding as an instance of Pearl’s belief-propagation algorithm
- McEliece, MacKay, et al.
- 1998

308 | Expectation Propagation for Approximate Bayesian Inference
- Minka
- 2001
Citation Context: ...re many methods of dealing with inference in non-linear SSMs, including methods such as particle filtering [29, 27, 40, 43, 35, 15], linearisation [2], the unscented filter [39, 80], the EP algorithm [52], and embedded HMMs [62]. Non-linear models are also important if we are to consider generalising simple dimensionality reduction models such as PCA and FA. These models are limited in that they ca...

280 | GTM: The generative topographic mapping
- Bishop, Svensén, et al.
- 1998
Citation Context: ...he data to capture the correlations between the observed variables. There are many interesting and important nonlinear dimensionality reduction models, including generative topographic mappings (GTM) [11] (a probabilistic alternative to Kohonen maps), multi-dimensional scaling (MDS) [72, 45], principal curves [30], Isomap [76], and locally linear embedding (LLE) [69]. Hidden Markov models also have th...

265 | A family of algorithms for approximate Bayesian inference
- Minka
- 2001
Citation Context: ...lgorithm simultaneously computes an approximation to the marginal likelihood and to the parameter posterior by maximising a lower bound. 11.5 Expectation propagation (EP): Expectation propagation (EP; [52, 53]) is another powerful method for approximate Bayesian inference. Consider a Bayesian inference problem in which you are given iid data D = {x^(1), ..., x^(N)} assumed to have come from a model p(x|...

263 | A unifying review of linear Gaussian models, Neural Computation 11
- Roweis, Ghahramani
- 1999
Citation Context: ...one can readily derive that p(y|θ) = ∫ p(x|θ) p(y|x, θ) dx = N(0, ΛΛᵀ + Ψ) (12), where N(µ, Σ) refers to a multivariate Gaussian density with mean µ and covariance matrix Σ. For more details refer to [68]. Factor analysis is an interesting model for several reasons. If the data is very high dimensional (D is large) then even a simple model like the full-covariance multivariate Gaussian will have too m...

237 | The analysis of proximities: Multidimensional scaling with an unknown distance function. Psychometrika 27
- Shepard
- 1962
Citation Context: ...nteresting and important nonlinear dimensionality reduction models, including generative topographic mappings (GTM) [11] (a probabilistic alternative to Kohonen maps), multi-dimensional scaling (MDS) [72, 45], principal curves [30], Isomap [76], and locally linear embedding (LLE) [69]. Hidden Markov models also have their limitations. Even though they can model nonlinear dynamics by discretising the hidde...

225 | The EM algorithm for mixtures of factor analyzers
- Ghahramani, Hinton
- 1996
Citation Context: ...odel. This motivates us to seek to describe and study learning in much more flexible models. A simple combination of two of the ideas we have described for iid data is the mixture of factor analysers [23, 34, 77]. This model performs simultaneous clustering and dimensionality reduction on the data, by assuming that the covariance in each Gaussian cluster can be modelled by an FA model. Thus, it becomes possib...

224 | The wake-sleep algorithm for unsupervised neural networks
- Hinton, Dayan, et al.
- 1995
Citation Context: ...with hidden variables in order to enrich the model class, without adding exponentially many parameters. These hidden variables can be organised into layers of a hierarchy as in the Helmholtz machine [33]. Other hierarchical models include recent generalisations of ICA designed to capture higher order statistics in images [41]. 6 Intractability: The problem with the models described in the previous sec...

222 | An approach to time series smoothing and forecasting using the EM algorithm
- Shumway, Stoffer
- 1982
Citation Context: ...re the factors are assumed to have linear-Gaussian dynamics over time. The parameters of this model are θ = (A, C, Q, R). To learn ML settings of these parameters one can make use of the EM algorithm [73]. The E step of the algorithm involves computing q(x_{1:T}) = p(x_{1:T}|y_{1:T}, θ), which is the posterior over hidden state sequences. In fact, this whole posterior does not have to be computed or represen...
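For the linear-Gaussian SSM in this context, the E step is computed by Kalman filtering and smoothing. As a hedged sketch (the filtering half only, in one dimension, with illustrative values for the A, C, Q, R symbols used above):

```python
def kalman_filter(ys, A=0.9, C=1.0, Q=0.1, R=0.5, mu0=0.0, P0=1.0):
    """Scalar Kalman filter: returns the filtered mean and variance per step."""
    mu, P = mu0, P0
    out = []
    for y in ys:
        # Predict: propagate the state estimate through the dynamics.
        mu_p = A * mu
        P_p = A * P * A + Q
        # Update: correct with observation y via the Kalman gain.
        S = C * P_p * C + R            # innovation variance
        K = P_p * C / S                # Kalman gain
        mu = mu_p + K * (y - C * mu_p)
        P = (1 - K * C) * P_p
        out.append((mu, P))
    return out

est = kalman_filter([0.5, 0.7, 0.2, -0.1])  # illustrative observations
print(est[-1])
```

A backward smoothing pass over these filtered estimates then yields the full posterior statistics the M step needs; matrix-valued A, C, Q, R follow the same recursions with matrix algebra.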

220 | The Bayesian Structural EM Algorithm
- Friedman
- 1998
Citation Context: ...ems are briefly reviewed in Section 11. 9.2.2 Search algorithms: Given a way of scoring models, one can search over the space of all possible valid graphical models for the one with the highest score [19]. The space of all possible graphs is very large (exponential in the number of variables) and for directed graphs it can be expensive to check whether a particular change to the graph will result in a...

206 | Partially labeled classification with Markov random walks
- Szummer, Jaakkola
- 2006
Citation Context: ...this problem attempt to infer a manifold, graph structure, or tree structure from the unlabelled data and use spread in this structure to determine how labels will generalise to new unlabelled points [74, 85, 8, 42]. Another area of great interest which we did not have the space to cover is nonparametric models. The basic assumption of parametric statistical models is that the model is defined using a finite nu...

202 | Being Bayesian about network structure
- Friedman, Koller
- 2000
Citation Context: ...Thus intelligent heuristics are needed to search the space efficiently [55]. An alternative to trying to find the most probable graph are methods that sample over the posterior distribution of graphs [20]. This has the advantage that it avoids the problem of overfitting which can occur for algorithms that select a single structure with highest score out of exponentially many. 10 Bayesian model compari...

185 | On the optimality of solutions of the max-product belief-propagation algorithm in arbitrary graphs
- Weiss, Freeman
Citation Context: ...d graphs there is a large body of research on its application to multiply connected graphs: the use of BP on such graphs is called loopy belief propagation and has been analysed by several researchers [81, 82]. Interest in loopy belief propagation arose out of its impressive performance in decoding error correcting codes [21, 9, 50, 49]. Although the beliefs are not guaranteed to be correct on loopy graphs...

181 | Connectionist learning of belief networks
- Neal
- 1992
Citation Context: ...ry in this table, but this is often not a natural way to parameterise the dependency between variables. Alternatives (for binary data) are the noisy-or or sigmoid parameterisation of the dependencies [58]. Whatever the specific parameterisation, let θ_i denote the parameters relating X_i to its parents, and let θ denote all the parameters in the model. Let m denote the model structure, which corresponds...