Results 1–9 of 9
A View Of The Em Algorithm That Justifies Incremental, Sparse, And Other Variants
 Learning in Graphical Models
, 1998
Abstract

Cited by 766 (16 self)
The EM algorithm performs maximum likelihood estimation for data in which some variables are unobserved. We present a function that resembles negative free energy and show that the M step maximizes this function with respect to the model parameters and the E step maximizes it with respect to the distribution over the unobserved variables. From this perspective, it is easy to justify an incremental variant of the EM algorithm in which the distribution for only one of the unobserved variables is recalculated in each E step. This variant is shown empirically to give faster convergence in a mixture estimation problem. A variant of the algorithm that exploits sparse conditional distributions is also described, and a wide range of other variant algorithms are also seen to be possible.

1. Introduction

The Expectation-Maximization (EM) algorithm finds maximum likelihood parameter estimates in problems where some variables are unobserved. Special cases of the algorithm date back several dec...
A Comparison of New and Old Algorithms for A Mixture Estimation Problem
 Machine Learning
, 1995
Abstract

Cited by 34 (13 self)
We investigate the problem of estimating the proportion vector which maximizes the likelihood of a given sample for a mixture of given densities. We adapt a framework developed for supervised learning and give simple derivations for many of the standard iterative algorithms like gradient projection and EM. In this framework, the distance between the new and old proportion vectors is used as a penalty term. The square distance leads to the gradient projection update, and the relative entropy to a new update which we call the exponentiated gradient update (EG_η). Curiously, when a second-order Taylor expansion of the relative entropy is used, we arrive at an update EM_η which, for η = 1, gives the usual EM update. Experimentally, both the EM_η update and the EG_η update for η > 1 outperform the EM algorithm and its variants. We also prove a polynomial bound on the rate of convergence of the EG_η algorithm.

1. Introduction

The problem of maximum-likelihood (ML) estimation of a mixture of de...
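The two updates this abstract compares can be written side by side for a mixture of fixed densities. This is a minimal sketch, assuming a precomputed table `P[t][i]` of component densities p_i(x_t) (toy values, not from the paper); the EM update averages the posterior responsibilities, while the exponentiated gradient update multiplies each proportion by the exponential of the likelihood gradient and renormalises.

```python
import math

def em_step(w, P):
    """One EM update of mixture proportions w for fixed component
    densities. P[t][i] holds p_i(x_t), the density of component i at
    sample t."""
    n = len(P)
    new = [0.0] * len(w)
    for row in P:
        tot = sum(wi * pi for wi, pi in zip(w, row))
        for i, pi in enumerate(row):
            new[i] += w[i] * pi / tot   # posterior responsibility of i
    return [v / n for v in new]

def eg_step(w, P, eta=1.0):
    """One exponentiated-gradient (EG_eta) update: multiply each
    proportion by exp(eta * gradient of avg log-likelihood) and
    renormalise, so the iterate stays on the simplex."""
    n = len(P)
    g = [0.0] * len(w)
    for row in P:
        tot = sum(wi * pi for wi, pi in zip(w, row))
        for i, pi in enumerate(row):
            g[i] += pi / tot / n        # d/dw_i of avg log-likelihood
    unnorm = [wi * math.exp(eta * gi) for wi, gi in zip(w, g)]
    z = sum(unnorm)
    return [v / z for v in unnorm]

def avg_loglik(w, P):
    """Average log-likelihood of the sample under proportions w."""
    return sum(math.log(sum(wi * pi for wi, pi in zip(w, row)))
               for row in P) / len(P)
```

Both updates keep the proportions on the probability simplex; the relative-entropy penalty behind EG_η is what yields the multiplicative form.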
Batch and online parameter estimation of Gaussian mixtures based on the joint entropy
 In Neural Information Processing Systems
, 1998
Abstract

Cited by 15 (1 self)
We describe a new iterative method for parameter estimation of Gaussian mixtures. The new method is based on a framework developed by Kivinen and Warmuth for supervised online learning. In contrast to gradient descent and EM, which estimate the mixture’s covariance matrices, the proposed method estimates the inverses of the covariance matrices. Furthermore, the new parameter estimation procedure can be applied in both online and batch settings. We show experimentally that it is typically faster than EM, and usually requires about half as many iterations as EM. We also describe experiments with digit recognition that demonstrate the merits of the online version.
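The idea of estimating the inverse covariance (precision) rather than the covariance can be illustrated in one dimension. The sketch below is not the paper's actual update: it is a hypothetical exponentiated-gradient-style online rule, in the spirit of the Kivinen–Warmuth framework the abstract cites, whose multiplicative form keeps the precision positive without any projection step.

```python
import math
import random

def online_precision(xs, eta=0.01, lam=1.0, mu=0.0):
    """Online estimate of a 1-D Gaussian's precision lam = 1/variance.

    Illustrative multiplicative update (not the paper's method): lam is
    scaled by exp(eta * d/dlam log N(x; mu, 1/lam)), so it stays
    positive and drifts toward the stationary point 1/lam = E[(x-mu)^2].
    """
    for x in xs:
        grad = 0.5 * (1.0 / lam - (x - mu) ** 2)  # d/dlam of log-density
        lam *= math.exp(eta * grad)               # multiplicative step
    return lam
```

For data drawn from N(0, sd=2) the update hovers around the true precision 1/4; the multiplicative step is the 1-D analogue of why precision parameterisations pair naturally with this family of updates.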
Maximum likelihood estimation via the ECM algorithm: Computing the asymptotic variance
, 1994
Abstract

Cited by 9 (2 self)
Abstract: This paper provides detailed theory, algorithms, and illustrations for computing asymptotic variance-covariance matrices for maximum likelihood estimates using the ECM algorithm (Meng and Rubin (1993)). This Supplemented ECM (SECM) algorithm is developed as an extension of the Supplemented EM (SEM) algorithm (Meng and Rubin (1991a)). Explicit examples are given, including one that demonstrates that SECM, like SEM, has a powerful internal error-detecting system for the implementation of the parent ECM or of SECM itself.
Efficient Stochastic Source Coding and an Application to a Bayesian Network Source Model
 The Computer Journal
, 1997
Abstract

Cited by 6 (0 self)
In this paper, we introduce a new algorithm called "bits-back coding" that makes stochastic source ...
A New Parameter Estimation Method for Gaussian Mixtures
 in Advances in Neural Information Processing Systems
, 1998
Abstract

Cited by 5 (0 self)
We describe a new iterative method for parameter estimation of Gaussian mixtures. The new method is based on a framework developed by Kivinen and Warmuth for supervised online learning. In contrast to gradient descent and EM, which estimate the mixture's covariance matrices, the proposed method estimates the inverses of the covariance matrices. Furthermore, the new parameter estimation procedure can be applied in both online and batch settings. We show experimentally that it is typically faster than EM, and usually requires about half as many iterations as EM. We also describe experiments with digit recognition that demonstrate the merits of the online version when the source generating the data is non-stationary.

Keywords: Mixture of Gaussians, online learning, EM, convergence rate, digit recognition

1 Introduction

Mixture models, in particular mixtures of Gaussians, have been a popular tool for density estimation, clustering, and unsupervised learning with a wide range of appl...
Making Stochastic Source Coding Efficient By Recovering Information
Abstract
In this paper, we introduce a new algorithm called "bits-back coding" that makes stochastic source codes efficient. For a given one-to-many source code, we show that this algorithm can actually be more efficient than the algorithm that always picks the shortest codeword. Optimal efficiency is achieved when codewords are chosen according to the Boltzmann distribution based on the codeword lengths. After presenting a binary Bayesian network model that assigns exponentially many codewords to each symbol, we show how a tractable approximation to the Boltzmann distribution can be used for bits-back coding. It turns out that a commonly used technique for determining parameters, maximum likelihood estimation, actually minimizes the optimal bits-back coding cost. A tractable approximation to maximum likelihood estimation, incremental expectation maximization, minimizes the bits-back coding cost as well. We illustrate the performance of bits-back coding first on a toy problem and then using real data with a binary Bayesian network that produces 2^60 possible codewords for each symbol. For both tasks, the rate for bits-back coding is nearly one half of that obtained by picking the shortest codeword for each symbol.
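The accounting behind the abstract's claim can be checked numerically. When a codeword for one symbol is chosen with distribution q over lengths l_k, the sender pays the expected length but the receiver recovers the H(q) bits used to make the choice, so the net rate is E_q[l] - H(q); this is minimized by the Boltzmann distribution q_k ∝ 2^(-l_k), which can beat always picking the shortest codeword. The function names and the toy length list below are illustrative.

```python
import math

def bits_back_rate(lengths, q):
    """Net bits-back cost for one symbol: expected codeword length
    minus the entropy of the selection distribution q (the bits the
    receiver gets back)."""
    exp_len = sum(qk * lk for qk, lk in zip(q, lengths))
    entropy = -sum(qk * math.log2(qk) for qk in q if qk > 0)
    return exp_len - entropy

def boltzmann(lengths):
    """Optimal selection distribution: q_k proportional to 2^(-l_k)."""
    w = [2.0 ** -l for l in lengths]
    z = sum(w)
    return [v / z for v in w]

lengths = [3, 3, 4]                # toy codeword lengths for one symbol
q = boltzmann(lengths)
rate = bits_back_rate(lengths, q)  # -log2(2**-3 + 2**-3 + 2**-4), about 1.68 bits
```

Always picking the shortest codeword corresponds to a deterministic q and costs 3 bits here, while the Boltzmann choice nets about 1.68 bits, mirroring the near-halving of the rate reported for the Bayesian network model.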