## Batch and on-line parameter estimation of Gaussian mixtures based on the joint entropy (1998)

Venue: Advances in Neural Information Processing Systems

Citations: 15 (1 self)

### BibTeX

```bibtex
@INPROCEEDINGS{Singer98batchand,
  author    = {Yoram Singer and Manfred K. Warmuth},
  title     = {Batch and on-line parameter estimation of Gaussian mixtures based on the joint entropy},
  booktitle = {Advances in Neural Information Processing Systems},
  year      = {1998},
  pages     = {578--584},
  publisher = {MIT Press}
}
```

### Abstract

We describe a new iterative method for parameter estimation of Gaussian mixtures. The new method is based on a framework developed by Kivinen and Warmuth for supervised on-line learning. In contrast to gradient descent and EM, which estimate the mixture’s covariance matrices, the proposed method estimates the inverses of the covariance matrices. Furthermore, the new parameter estimation procedure can be applied in both on-line and batch settings. We show experimentally that it is typically faster than EM, usually requiring about half as many iterations. We also describe experiments with digit recognition that demonstrate the merits of the on-line version.
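The abstract positions the method against standard EM, which it shares an E-step with. As a point of reference only, here is a minimal sketch of one EM iteration for a one-dimensional Gaussian mixture (this is ordinary EM, not the paper's joint-entropy update; all names are illustrative):

```python
import math

def em_step(data, weights, means, variances):
    """One EM iteration for a one-dimensional Gaussian mixture.
    E-step: compute responsibilities; M-step: re-estimate parameters."""
    k = len(means)
    # E-step: posterior probability of each component for each point.
    resp = []
    for x in data:
        dens = [w * math.exp(-(x - m) ** 2 / (2.0 * v)) / math.sqrt(2.0 * math.pi * v)
                for w, m, v in zip(weights, means, variances)]
        total = sum(dens)
        resp.append([d / total for d in dens])
    # M-step: responsibility-weighted maximum-likelihood re-estimates.
    n_k = [sum(r[j] for r in resp) for j in range(k)]
    new_weights = [n / len(data) for n in n_k]
    new_means = [sum(r[j] * x for r, x in zip(resp, data)) / n_k[j] for j in range(k)]
    new_vars = [sum(r[j] * (x - new_means[j]) ** 2 for r, x in zip(resp, data)) / n_k[j]
                for j in range(k)]
    return new_weights, new_means, new_vars
```

Per the abstract, the paper's update keeps the same expectations but replaces the M-step, and for the covariances it works with the inverses (precisions) rather than the covariances themselves.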

### Citations

8564 | Elements of Information Theory
- Cover, Thomas
- 2003
Citation Context: ...nvex. Thus, the loss function may have multiple minima, making the problem of finding ... difficult. In order to sidestep this problem we use the log-sum inequality [5] to obtain an upper bound for the distance function; we denote this upper bound as ... |
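The log-sum inequality invoked in this snippet is a standard information-theoretic fact: for nonnegative numbers $a_1,\dots,a_n$ and $b_1,\dots,b_n$,

```latex
\sum_{i=1}^{n} a_i \log \frac{a_i}{b_i}
\;\ge\;
\left( \sum_{i=1}^{n} a_i \right)
\log \frac{\sum_{i=1}^{n} a_i}{\sum_{i=1}^{n} b_i},
```

with equality iff the ratios $a_i/b_i$ are all equal. Rearranged, it bounds a log of a sum by a sum of logs, which is how the snippet obtains a tractable upper bound on the distance function.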

8084 | Maximum likelihood from incomplete data via the EM algorithm
- Dempster, Laird, et al.
- 1977
Citation Context: ...d and hidden variables. For brevity we term the new iterative parameter estimation method the joint-entropy (JE) update. The JE update shares a common characteristic with the Expectation Maximization [6, 17] algorithm, as it first calculates the same expectations. However, it replaces the maximization step with a different update of the parameters. For instance, it updates the inverse of the covariance ma... |

4823 | Neural Networks for Pattern Recognition
- Bishop
- 1995
Citation Context: ...uction Mixture models, in particular mixtures of Gaussians, have been a popular tool for density estimation, clustering, and unsupervised learning with a wide range of applications (see for instance [7, 2] and the references therein). Mixture models are one of the most useful tools for handling incomplete data, in particular hidden variables. For Gaussian mixtures the hidden variables indicate for each... |

3919 | Pattern Classification and Scene Analysis
- Duda, Hart
- 1973
Citation Context: ...uction Mixture models, in particular mixtures of Gaussians, have been a popular tool for density estimation, clustering, and unsupervised learning with a wide range of applications (see for instance [7, 2] and the references therein). Mixture models are one of the most useful tools for handling incomplete data, in particular hidden variables. For Gaussian mixtures the hidden variables indicate for each... |

2168 | Support-Vector Networks
- Cortes, Vapnik
- 1995
Citation Context: ...US Postal Service (USPS) data set is a collection of digits collected from actual handwritten mailings. The problem of automatic digit recognition using this data set was studied by several researchers (see [4] and the references therein). This data set contains ... digits; each digit is represented by a ... pixel image whose pixels can take ... different values. Error rates of various classifiers for this data set are gi... |

625 | Statistical Analysis of Finite Mixture Distributions
- Titterington, Smith, et al.
- 1985
Citation Context: ...(c) To guarantee convergence of the on-line update one should use a diminishing learning rate (for further motivation and proof technique see [18]). ... 6 Experiments: We conducted numerous experiments with the new update. Due to the lack of space we describe here only a few. In the first experimen... |
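The diminishing-learning-rate condition mentioned in this snippet is the standard stochastic-approximation requirement (step sizes that shrink over time, e.g. proportional to 1/t). A minimal illustration on on-line mean estimation (illustrative only, not the paper's JE update):

```python
def online_mean(stream):
    """On-line estimation with a diminishing learning rate eta_t = 1/t.
    Shrinking step sizes are the standard condition for convergence
    of on-line (stochastic-approximation) updates."""
    est = 0.0
    for t, x in enumerate(stream, start=1):
        eta = 1.0 / t            # diminishing learning rate
        est += eta * (x - est)   # move the estimate toward the new sample
    return est
```

With eta_t = 1/t this recursion reproduces the running sample mean exactly; a constant learning rate would never settle, which is exactly the behavior one wants when tracking a time-varying source, as in the paper's on-line digit experiments.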

505 | Mixture densities, maximum likelihood and the EM algorithm
- Redner, Walker
- 1984
Citation Context: ...d and hidden variables. For brevity we term the new iterative parameter estimation method the joint-entropy (JE) update. The JE update shares a common characteristic with the Expectation Maximization [6, 17] algorithm, as it first calculates the same expectations. However, it replaces the maximization step with a different update of the parameters. For instance, it updates the inverse of the covariance ma... |

142 | On convergence properties of the EM algorithm for Gaussian mixtures - Xu, Jordan - 1996 |

134 | Additive versus exponentiated gradient updates for linear prediction
- Kivinen, Warmuth
- 1997
Citation Context: ...M algorithm. In this paper we describe a new technique for estimating the parameters of Gaussian mixtures. The new parameter estimation method is based on a framework developed by Kivinen and Warmuth [12] for supervised on-line learning. This framework was successfully used in a large number of supervised and unsupervised problems (see for instance [9, 8, 13, 1]). Our goal is to find a local minimum ... |
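The Kivinen and Warmuth framework cited here contrasts additive (gradient-descent) updates with multiplicative (exponentiated-gradient) updates derived from a relative-entropy regularizer. A minimal, illustrative sketch of the two update rules (function names are mine, not from the paper):

```python
import math

def gd_update(w, grad, eta):
    """Additive (gradient-descent) update: w <- w - eta * grad."""
    return [wi - eta * gi for wi, gi in zip(w, grad)]

def eg_update(w, grad, eta):
    """Multiplicative (exponentiated-gradient) update; the
    renormalization keeps w a probability vector, which suits
    parameters such as mixture coefficients."""
    unnorm = [wi * math.exp(-eta * gi) for wi, gi in zip(w, grad)]
    z = sum(unnorm)
    return [u / z for u in unnorm]
```

The multiplicative form is one reason such updates pair naturally with mixture parameters: the weights stay positive and normalized by construction.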

96 | Convergence results for the EM approach to mixture of experts architectures - Jordan, Xu - 1995 |

87 | Comparison of learning algorithms for handwritten digit recognition
- LeCun, Jackel, et al.
- 1995
Citation Context: ...mpared to other digit data sets. One reason for the relatively poor performance of all classifiers is a significant disparity in the shape of the digits constituting the training and test sets [14]. While the training set was cleaned by removing digits that were “chopped” by a segmentation algorithm, the test set was kept untouched. Thus, there are shapes that occur rather frequently in the tes... |

71 | Relative loss bounds for multidimensional regression problems
- Kivinen, Warmuth
- 2001
Citation Context: ...on a framework developed by Kivinen and Warmuth [12] for supervised on-line learning. This framework was successfully used in a large number of supervised and unsupervised problems (see for instance [9, 8, 13, 1]). Our goal is to find a local minimum of a loss function which, in our case, is the negative log-likelihood induced by a mixture of Gaussians. However, rather than minimizing the loss directly we add... |

53 | Update rules for parameter estimation in Bayesian networks
- Bauer, Koller, et al.
- 1997
Citation Context: ...on a framework developed by Kivinen and Warmuth [12] for supervised on-line learning. This framework was successfully used in a large number of supervised and unsupervised problems (see for instance [9, 8, 13, 1]). Our goal is to find a local minimum of a loss function which, in our case, is the negative log-likelihood induced by a mixture of Gaussians. However, rather than minimizing the loss directly we add... |

34 | A comparison of new and old algorithms for a mixture estimation problem, Machine Learning 27(1)
- Helmbold, Schapire, et al.
- 1997
Citation Context: ...on a framework developed by Kivinen and Warmuth [12] for supervised on-line learning. This framework was successfully used in a large number of supervised and unsupervised problems (see for instance [9, 8, 13, 1]). Our goal is to find a local minimum of a loss function which, in our case, is the negative log-likelihood induced by a mixture of Gaussians. However, rather than minimizing the loss directly we add... |

25 | An Iterative Procedure for Obtaining Maximum-Likelihood Estimates of the Parameters for a Mixture of Normal Distributions - Peters, Walker - 1978 |

10 | Continuous and discrete time nonlinear gradient descent: relative loss bounds and convergence - Jagota, Warmuth - 1998 |

10 | Recent extensions to the EM algorithm (with discussion) - Meng, Rubin - 1992 |

5 | Worst-case loss bounds for sigmoided neurons
- Helmbold, Kivinen, et al.
- 1995 |

1 | Neural-network and k-nearest neighbor classifiers
- Bromley, Sackinger
- 1991
Citation Context: ...mixture model in an on-line mode is ..., which is lower than the error rates of previously studied classifiers. Furthermore, it is actually better than human performance (in the sense defined in [3]). These results provide some empirical evidence that the JE update, when used in the on-line mode, is able to track and efficiently approximate the distribution of a time-varying source. Note, howeve... |