## Unsupervised learning of finite mixture models (2002)

### Download From

IEEE

### Download Links

- [www.cse.msu.edu]
- [dataclustering.cse.msu.edu]
- [www.lx.it.pt]
- DBLP

### Other Repositories/Bibliography

Venue: IEEE Transactions on Pattern Analysis and Machine Intelligence

Citations: 309 (21 self)

### BibTeX

```bibtex
@ARTICLE{Figueiredo02unsupervisedlearning,
  author  = {Mario A. T. Figueiredo and Anil K. Jain},
  title   = {Unsupervised learning of finite mixture models},
  journal = {IEEE Transactions on Pattern Analysis and Machine Intelligence},
  year    = {2002},
  volume  = {24},
  pages   = {381--396}
}
```

### Abstract

This paper proposes an unsupervised algorithm for learning a finite mixture model from multivariate data. The adjective "unsupervised" is justified by two properties of the algorithm: 1) it is capable of selecting the number of components and 2) unlike the standard expectation-maximization (EM) algorithm, it does not require careful initialization. The proposed method also avoids another drawback of EM for mixture fitting: the possibility of convergence toward a singular estimate at the boundary of the parameter space. The novelty of our approach is that we do not use a model selection criterion to choose one among a set of preestimated candidate models; instead, we seamlessly integrate estimation and model selection in a single algorithm. Our technique can be applied to any type of parametric mixture model for which it is possible to write an EM algorithm; in this paper, we illustrate it with experiments involving Gaussian mixtures. These experiments testify to the good performance of our approach.

Index Terms: Finite mixtures, unsupervised learning, model selection, minimum message length criterion, Bayesian methods, expectation-maximization algorithm, clustering.
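The abstract's central idea can be sketched in code. What follows is a hypothetical, heavily simplified 1-D illustration of the mechanism the abstract describes, not the paper's exact algorithm (which uses component-wise EM and the full MML criterion): start deliberately over-fitted and let a parameter-cost penalty in the weight update drive redundant components' weights to exactly zero. All names and the toy data are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def em_mml_gmm(y, k_init=10, max_iter=200, tol=1e-6):
    """Toy 1-D sketch: EM for a Gaussian mixture with an MML-style penalty
    on the mixing weights. Components whose effective count falls below
    N/2 (N = parameters per component) get weight zero and are removed,
    so model selection happens inside the fitting loop."""
    N = 2.0                                   # parameters per component: mean, variance
    mu = np.quantile(y, np.linspace(0.05, 0.95, k_init))
    var = np.full(k_init, np.var(y))
    w = np.full(k_init, 1.0 / k_init)
    prev_ll = -np.inf
    for _ in range(max_iter):
        # E-step: responsibilities r[i, m] = P(component m | y_i)
        dens = np.exp(-0.5 * (y[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        r = w * dens
        r /= r.sum(axis=1, keepdims=True)
        # M-step for the weights, with the MML "parameter cost" subtracted
        counts = r.sum(axis=0)
        w = np.maximum(counts - N / 2.0, 0.0)
        w /= w.sum()
        keep = w > 0
        if keep.sum() < len(w):               # annihilate unsupported components
            mu, var, w, r = mu[keep], var[keep], w[keep], r[:, keep]
            r /= r.sum(axis=1, keepdims=True)
            counts = r.sum(axis=0)
        # M-step for the surviving component parameters
        mu = (r * y[:, None]).sum(axis=0) / counts
        var = (r * (y[:, None] - mu) ** 2).sum(axis=0) / counts + 1e-6
        ll = np.log((w * np.exp(-0.5 * (y[:, None] - mu) ** 2 / var)
                     / np.sqrt(2 * np.pi * var)).sum(axis=1)).sum()
        if np.isfinite(prev_ll) and abs(ll - prev_ll) < tol * abs(prev_ll):
            break
        prev_ll = ll
    return w, mu, var

# Two well-separated Gaussians; deliberately over-initialized with 10 components.
y = np.concatenate([rng.normal(-5.0, 1.0, 400), rng.normal(5.0, 1.0, 400)])
w, mu, var = em_mml_gmm(y)
```

Because the penalty is applied every iteration, the effective number of components can only shrink during fitting, which is the "seamless integration of estimation and model selection" the abstract claims; the real algorithm also sweeps over a range of final k values and keeps the minimum-message-length model.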

### Citations

9359 | Elements of information theory - Cover, Thomas - 1991

Citation Context: ...generation model [49], [60], [61]. To formalize this idea, consider some data-set Y, known to have been generated according to p(Y|θ), which is to be encoded and transmitted. Following Shannon theory [15], the shortest code length (measured in bits, if base-2 logarithm is used, or in nats, if the natural logarithm is adopted [15]) for Y is ⌈−log p(Y|θ)⌉, where ⌈a⌉ denotes "the smallest integer no less tha...
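The Shannon code-length statement in this excerpt is easy to check numerically. A small illustration (the probability value is hypothetical):

```python
import math

# Shortest code length for data Y under a model assigning it probability p:
# ceil(-log2 p) bits, or -ln p nats, per the Shannon argument quoted above.
p = 1.0 / 1024.0                 # hypothetical model probability p(Y|theta)
bits = math.ceil(-math.log2(p))  # 10 bits
nats = -math.log(p)              # 10 * ln(2) nats, about 6.93
```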

4204 | Pattern Classification and Scene Analysis - Duda, Hart - 1973

Citation Context: ...at a fraction (≈0.1) of the computational cost. The ICL and LEC criteria yielded very bad results in this problem. Finally, Fig. 11 shows the best 2D projection (obtained using discriminant analysis [19]) of 800 points from each class, together with the projections of the mixtures that were fitted to each class-conditional density and of the corresponding decision regions. Fig. 11. Best 2D projection...

2831 |
Estimating the dimension of a model
- Schwarz
- 1978
Citation Context: ...used for mixtures include: Approximate Bayesian criteria, like the one in [50] (termed Laplace-empirical criterion, LEC, in [37]), and Schwarz's Bayesian inference criterion (BIC) [10], [17], [22], [53]. Approaches based on information/coding theory concepts, such as Rissanen's minimum description length (MDL) [49], which formally coincides with BIC, the minimum message length (MML) criterion [42]...

2365 |
Algorithms for Clustering Data
- Jain, Dubes
- 1988
Citation Context: ...machine learning) is currently widely acknowledged. In statistical pattern recognition, finite mixtures allow a formal (probabilistic model-based) approach to unsupervised learning (i.e., clustering) [28], [29], [35], [37], [57]. In fact, finite mixtures naturally model observations which are assumed to have been produced by one (randomly selected and unknown) of a set of alternative random sources. I...

1477 | Pattern Recognition with Fuzzy Objective Function Algorithms - Bezdek - 1981 |

1194 |
Bayesian Theory
- Bernardo, Smith
- 2000
Citation Context: ...also independent from the mixing probabilities, i.e., p(θ) = p(α1, ..., αk) ∏_{m=1}^{k} p(θm). For each factor p(θm) and for p(α1, ..., αk), we adopt the standard noninformative Jeffreys' prior (see, for example, [3]): p(θm) ∝ |I^{(1)}(θm)|^{1/2} (11) and p(α1, ..., αk) ∝ (α1 α2 ... αk)^{−1/2} (12), for 0 ≤ α1, α2, ..., αk ≤ 1 and α1 + α2 + ... + αk = 1. With these choices, and noticing that for a k-component mixture c = Nk + k, where N is...
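The Jeffreys-type prior on the mixing probabilities quoted in this context is simple enough to evaluate directly. A small sketch (the function name is invented; the prior is unnormalized, so only ratios of its values are meaningful):

```python
def jeffreys_mixing_prior(alphas):
    """Unnormalized Jeffreys-type prior on mixing probabilities,
    p(a_1, ..., a_k) proportional to (a_1 * a_2 * ... * a_k)^(-1/2)."""
    prod = 1.0
    for a in alphas:
        prod *= a
    return prod ** -0.5

# The prior mass grows as any weight approaches zero, which is what lets
# this kind of prior favor annihilating redundant components.
uniform = jeffreys_mixing_prior([0.5, 0.5])  # (0.25)^(-1/2) = 2.0
skewed = jeffreys_mixing_prior([0.9, 0.1])   # (0.09)^(-1/2), about 3.33
```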

1142 |
Finite Mixture Models
- McLachlan, Peel
- 2000
Citation Context: ...is currently widely acknowledged. In statistical pattern recognition, finite mixtures allow a formal (probabilistic model-based) approach to unsupervised learning (i.e., clustering) [28], [29], [35], [37], [57]. In fact, finite mixtures naturally model observations which are assumed to have been produced by one (randomly selected and unknown) of a set of alternative random sources. Inferring (the para...

1092 |
The EM Algorithm and Extensions
- McLachlan
- 1997
Citation Context: ...send e-mail to: tpami@computer.org, and reference IEEECS Log Number 112382. The standard method used to fit finite mixture models to observed data is the expectation-maximization (EM) algorithm [18], [36], [37], which converges to a maximum likelihood (ML) estimate of the mixture parameters. However, the EM algorithm for finite mixture fitting has several drawbacks: it is a local (greedy) method, thus...

837 | Neuro-Dynamic programming - Bertsekas, Tsitsiklis - 1996 |

811 | A view of the EM algorithm that justifies incremental, sparse, and other variants - Neal, Hinton - 1998 |

774 | Statistical Pattern Recognition: A Review - Jain, Duin, et al. - 2000 |

731 |
Statistical Analysis of Finite Mixture Distributions
- Titterington, Smith, et al.
- 1985
Citation Context: ...rently widely acknowledged. In statistical pattern recognition, finite mixtures allow a formal (probabilistic model-based) approach to unsupervised learning (i.e., clustering) [28], [29], [35], [37], [57]. In fact, finite mixtures naturally model observations which are assumed to have been produced by one (randomly selected and unknown) of a set of alternative random sources. Inferring (the parameters...

546 |
Stochastic Complexity
- Rissanen
- 1989
Citation Context: ..., LEC, in [37]), and Schwarz's Bayesian inference criterion (BIC) [10], [17], [22], [53]. Approaches based on information/coding theory concepts, such as Rissanen's minimum description length (MDL) [49], which formally coincides with BIC, the minimum message length (MML) criterion [42], [60], [61], Akaike's information criterion (AIC) [62], and the informational complexity criterion (ICOMP) [8]. Me...

520 |
Mixture models: inference and applications to clustering
- McLachlan, Basford
- 1988
Citation Context: ...ning) is currently widely acknowledged. In statistical pattern recognition, finite mixtures allow a formal (probabilistic model-based) approach to unsupervised learning (i.e., clustering) [28], [29], [35], [37], [57]. In fact, finite mixtures naturally model observations which are assumed to have been produced by one (randomly selected and unknown) of a set of alternative random sources. Inferring (th...

497 | On Bayesian analysis of mixtures with an unknown number of components (with discussion
- Richardson, Green
- 1997
Citation Context: ...or mixture inference: to implement model selection criteria (e.g., [2], [39], [51]); or, in a fully Bayesian way, to sample from the full a posteriori distribution with k considered unknown [40], [45], [48]. Despite their formal appeal, we think that MCMC-based techniques are still far too computationally demanding to be useful in pattern recognition applications. Resampling-based schemes [33] and cross-...

458 |
Maximum Likelihood Estimation from Incomplete Data Via the EM Algorithm
- Dempster, Laird, et al.
- 1977
Citation Context: ...please send e-mail to: tpami@computer.org, and reference IEEECS Log Number 112382. The standard method used to fit finite mixture models to observed data is the expectation-maximization (EM) algorithm [18], [36], [37], which converges to a maximum likelihood (ML) estimate of the mixture parameters. However, the EM algorithm for finite mixture fitting has several drawbacks: it is a local (greedy) method...

437 | Sphere Packings, Lattices and Groups - Conway, Sloane - 1988 |

425 | Mixtures of probabilistic principal component analyzers, Neural Computation 11
- Tipping, Bishop
- 1999
Citation Context: ...x. Since v may be of lower dimension than y, MFA are able to perform local dimensionality reduction. MFA are closely related to the mixtures of probabilistic principal component analyzers proposed in [56]. An EM algorithm for MFA was derived in [24]. The split and merge EM algorithm [59] was also applied to MFA, successfully overcoming most of the initialization sensitivity of EM. A recently proposed ...

357 | Model-based Gaussian and non-Gaussian clustering - Banfield, Raftery - 1993 |

322 | How Many Clusters? Which Clustering Method? Answers Via Model Based Cluster Analysis - Fraley, Raftery - 1998 |

319 | Monotone operators and the proximal point algorithm - Rockafellar - 1976 |

234 | The EM algorithm for mixtures of factor analyzers - Ghahramani, Hinton - 1996 |

206 | Pairwise data clustering by deterministic annealing
- Hofmann, Buhmann
- 1997
Citation Context: ...ocal maxima of the log-likelihood has been proposed [59]. Deterministic annealing (DA) has been used with success to avoid the initialization dependence of k-means type algorithms for hard clustering [27], [38], [52]. The resulting algorithm is similar to EM for Gaussian mixtures under the constraint of covariance matrices of the form T·I, where T is called the temperature and I is the identity matrix...

170 | The infinite Gaussian mixture model - Rasmussen - 2000 |

159 | Discriminant analysis by Gaussian mixtures - Hastie, Tibshirani - 1996 |

158 | Variational inference for Bayesian mixture of factor analysers
- Ghahramani, Beal
- 1999
Citation Context: ...uccessfully overcoming most of the initialization sensitivity of EM. A recently proposed variational Bayesian approach estimates the number of components and also the dimensionality of each component [23]. We tested the algorithm proposed here on the noisy shrinking spiral data. As described in [59], the goal is to extract a piece-wise linear approximation to a one-dimensional non-linear manifold from ...

156 | Modeling the manifolds of images of handwritten digits - Hinton, Dayan, et al. - 1997 |

156 |
Small sample size effect in statistical pattern recognition: recommendations for practitioners
- Raudys, Jain
- 1991
Citation Context: ...alent number of points assigned to the mth component). This is in accordance with known results on the relation between sample size, dimensionality, and error probability in supervised classification [46], [47]; namely, in learning quadratic discriminants, the training sample size needed to guarantee a given error probability grows (approximately) quadratically with the dimensionality of the feature s...

153 | On convergence properties of the em algorithm for gaussian mixtures
- Xu, Jordan
- 1996
Citation Context: ...5], [36], [37]. EM is an iterative procedure which finds local maxima of log p(Y|θ) or [log p(Y|θ) + log p(θ)]. For the case of Gaussian mixtures, the convergence behavior of EM is well studied [37], [63]. It was recently shown that EM belongs to a class of iterative methods called proximal point algorithms (PPA; for an introduction to PPA and a comprehensive set of references see [4], chapter 5) [13]...

125 | Practical Bayesian density estimation using mixtures of normals
- Roeder, Wasserman
- 1997
Citation Context: ...teria. 3.1.2 Stochastic and Resampling Methods. Markov chain Monte Carlo (MCMC) methods can be used in two different ways for mixture inference: to implement model selection criteria (e.g., [2], [39], [51]); or, in a fully Bayesian way, to sample from the full a posteriori distribution with k considered unknown [40], [45], [48]. Despite their formal appeal, we think that MCMC-based techniques are still fa...

118 |
Assessing a mixture model for clustering with the integrated completed likelihood
- Biernacki, Celeux, et al.
Citation Context: ...oximate weight of evidence (AWE) [1], the classification likelihood criterion (CLC) [7], the normalized entropy criterion (NEC) [6], [12], and the integrated classification likelihood (ICL) criterion [5]. A more detailed review of these methods is found in [37] (chapter 6), which also includes a comparative study where ICL and LEC are found to outperform the other criteria. 3.1.2 Stochastic and Resamp...

112 | SMEM Algorithm for Mixture Models - Ueda, Nakano, et al. |

109 | Minimum Message Length and Kolmogorov complexity - Wallace, Dowe - 1999 |

98 | Deterministic annealing EM algorithm - Ueda, Nakano - 1998 |

94 |
On bootstrapping the likelihood ratio test statistic for the number of components in a normal mixture
- McLachlan
- 1987
Citation Context: ...40], [45], [48]. Despite their formal appeal, we think that MCMC-based techniques are still far too computationally demanding to be useful in pattern recognition applications. Resampling-based schemes [33] and cross-validation approaches [54] have also been used to estimate the number of mixture components. In terms of computational load, these methods are closer to stochastic techniques than to determ...

81 | W.: Bayesian approaches to Gaussian mixture modeling - Roberts, Husmeier, et al. - 1998 |

70 | Structure Learning in Conditional Probability Models via an Entropic Prior and Parameter
- Brand
- 1999
Citation Context: ...s configurations where α1 equals either zero or one. For comparison, we also plot the corresponding Jeffreys prior (as given by (12), dashed line) and the minimum entropy prior p(α1) ∝ α1^{−α1} (1 − α1)^{−(1−α1)} [9], [64] (dotted line). Notice how these other priors also favor estimates where α1 equals either zero or one, though not as strongly. ...that has to be done is to remove the unnecessary ones. We have previ...

68 | Model selection for probabilistic clustering using cross-validated likelihood
- SMYTH
- 1998
Citation Context: ...appeal, we think that MCMC-based techniques are still far too computationally demanding to be useful in pattern recognition applications. Resampling-based schemes [33] and cross-validation approaches [54] have also been used to estimate the number of mixture components. In terms of computational load, these methods are closer to stochastic techniques than to deterministic ones. 3.2 The Drawbacks of EM...

67 | Régularisation d’inéquations variationnelles par approximations successives,” Revue Française d’Informatique et de Recherche Opérationnelle - Martinet - 1970 |

56 |
An Entropy Criterion for Assessing the Number of Clusters in a Mixture Model
- Celeux, Soromenho
- 1996
Citation Context: ..., which is also called the classification likelihood), such as the approximate weight of evidence (AWE) [1], the classification likelihood criterion (CLC) [7], the normalized entropy criterion (NEC) [6], [12], and the integrated classification likelihood (ICL) criterion [5]. A more detailed review of these methods is found in [37] (chapter 6), which also includes a comparative study where ICL and LEC are f...

46 |
On Dimensionality, Sample Size, Classification Error, and Complexity of Classification Algorithm in Pattern Recognition
- Raudys, Pikelis
- 1980
Citation Context: ...number of points assigned to the mth component). This is in accordance with known results on the relation between sample size, dimensionality, and error probability in supervised classification [46], [47]; namely, in learning quadratic discriminants, the training sample size needed to guarantee a given error probability grows (approximately) quadratically with the dimensionality of the feature space. ...

44 | Unsupervised Learning Using MML - Oliver, Baxter, et al. - 1996 |

39 | A Component-wise EM Algorithm for Mixtures - CELEUX, CHRÉTIEN, et al. |

36 | Testing for Mixtures: A Bayesian Entropic approach - Mengersen, Robert - 1996 |

32 | Linear flaw detection in woven textiles using model-based clustering
- Campbell, Fraley, et al.
- 1997
Citation Context: ...ia that have been used for mixtures include: Approximate Bayesian criteria, like the one in [50] (termed Laplace-empirical criterion, LEC, in [37]), and Schwarz's Bayesian inference criterion (BIC) [10], [17], [22], [53]. Approaches based on information/coding theory concepts, such as Rissanen's minimum description length (MDL) [49], which formally coincides with BIC, the minimum message length (M...

31 | Bayesian mixture modeling - Neal - 1992 |

31 |
Schwarz, Wallace, and Rissanen: Intertwining themes in theories of model order estimation
- Lanterman
- 2001
Citation Context: ...close to the optimal value. Conversely, with a coarse precision, Length(θ̂) is small, but Length(Y|θ̂) can be very far from optimal. There are several ways to formalize and solve this trade-off; see [32] for a comprehensive review and pointers to the literature. The fact that the data itself may also be real-valued does not cause any difficulty; simply truncate Y to some arbitrarily fine precision and ...

31 |
Choosing the Number of Component Clusters in the Mixture Model Using a New Informational Complexity Criterion of the Inverse-Fisher
- Bozdogan
- 1993
Citation Context: ...(MDL) [49], which formally coincides with BIC, the minimum message length (MML) criterion [42], [60], [61], Akaike's information criterion (AIC) [62], and the informational complexity criterion (ICOMP) [8]. Methods based on the complete likelihood (4), which is also called the classification likelihood, such as the approximate weight of evidence (AWE) [1], the classification likelihood criterion (C...

29 |
Maximum-likelihood training of probabilistic neural networks
- Streit, Luginbuhl
- 1994
Citation Context: ...nsity functions (pdf's). This fact makes them an excellent choice for representing complex class-conditional pdf's (i.e., likelihood functions) in (Bayesian) supervised learning scenarios [25], [26], [55], or priors for Bayesian parameter estimation [16]. Mixture models can also be used to perform feature selection [43]. M.A.T. Figueiredo is with the Institute of Telecommunications and the Departmen...

27 |
A View of the EM Algorithm that Justifies Incremental, Sparse, and Other Variants
- Neal, Hinton
- 1998
Citation Context: ...ces see [4], chapter 5) [13]. Seeing EM under this new light opens the door to several extensions and generalizations. An earlier related result, although without identifying EM as a PPA, appeared in [41]. The EM algorithm is based on the interpretation of Y as incomplete data. For finite mixtures, the missing part is a set of n labels Z = {z^(1), ..., z^(n)} associated with the n samples, indicatin...