## Adaptive Overrelaxed Bound Optimization Methods (2003)

Venue: | In Proceedings of International Conference on Machine Learning, ICML |

Citations: | 30 - 0 self |

### BibTeX

@INPROCEEDINGS{Salakhutdinov03adaptiveoverrelaxed,

author = {Ruslan Salakhutdinov and Sam Roweis},

title = {Adaptive Overrelaxed Bound Optimization Methods},

booktitle = {In Proceedings of International Conference on Machine Learning, ICML},

year = {2003},

pages = {664--671}

}

### Abstract

We study a class of overrelaxed bound optimization algorithms, and their relationship to standard bound optimizers, such as Expectation-Maximization, Iterative Scaling, CCCP and Non-Negative Matrix Factorization. We provide a theoretical analysis of the convergence properties of these optimizers and identify analytic conditions under which they are expected to outperform the standard versions. Based on this analysis, we propose a novel, simple adaptive overrelaxed scheme for practical optimization and report empirical results on several synthetic and real-world data sets showing that these new adaptive methods exhibit superior performance (in certain cases by several orders of magnitude) compared to their traditional counterparts. Our "drop-in" extensions are simple to implement, apply to a wide variety of algorithms, almost always give a substantial speedup, and do not require any theoretical analysis of the underlying algorithm.
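The "drop-in" idea in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from this page: the toy problem (estimating the mixing weight of two fixed Gaussian densities with EM as the bound optimizer), the rule of multiplying the learning rate `eta` by a factor `alpha` after a successful step and falling back to the standard step with `eta = 1` otherwise, and the hypothetical names `make_toy_densities` and `adaptive_overrelaxed`.

```python
import math
import random


def make_toy_densities(n=500, true_pi=0.3, seed=0):
    """Hypothetical toy data: densities of n points under two fixed Gaussians."""
    rng = random.Random(seed)
    xs = [rng.gauss(0.0 if rng.random() < true_pi else 3.0, 1.0) for _ in range(n)]
    norm = 1.0 / math.sqrt(2.0 * math.pi)
    p1 = [norm * math.exp(-0.5 * x * x) for x in xs]
    p2 = [norm * math.exp(-0.5 * (x - 3.0) ** 2) for x in xs]
    return p1, p2


def log_likelihood(pi, p1, p2):
    """Objective L(pi) for the two-component mixture with fixed components."""
    return math.fsum(math.log(pi * a + (1.0 - pi) * b) for a, b in zip(p1, p2))


def em_step(pi, p1, p2):
    """Standard bound-optimizer update M(pi): mean responsibility of component 1."""
    resp = [pi * a / (pi * a + (1.0 - pi) * b) for a, b in zip(p1, p2)]
    return math.fsum(resp) / len(resp)


def adaptive_overrelaxed(pi, p1, p2, alpha=1.1, tol=1e-7, max_iter=10000):
    """Assumed adaptive scheme: overrelax with eta, grow eta on success,
    otherwise take the plain EM step and reset eta to 1."""
    eta = 1.0
    ll = log_likelihood(pi, p1, p2)
    history = [ll]
    for _ in range(max_iter):
        target = em_step(pi, p1, p2)
        if abs(target - pi) < tol:  # fixed point of M reached
            break
        cand = pi + eta * (target - pi)  # overrelaxed candidate
        valid = 0.0 < cand < 1.0
        cand_ll = log_likelihood(cand, p1, p2) if valid else -math.inf
        if cand_ll > ll:
            eta *= alpha  # step improved the objective: be bolder next time
        else:
            cand, eta = target, 1.0  # fall back to the standard update
            cand_ll = log_likelihood(cand, p1, p2)
        pi, ll = cand, cand_ll
        history.append(ll)
    return pi, history


if __name__ == "__main__":
    p1, p2 = make_toy_densities()
    pi_adaptive, hist_adaptive = adaptive_overrelaxed(0.5, p1, p2, alpha=1.1)
    pi_plain, hist_plain = adaptive_overrelaxed(0.5, p1, p2, alpha=1.0)
    print(pi_adaptive, len(hist_adaptive) - 1, pi_plain, len(hist_plain) - 1)
```

In this sketch, `alpha=1.0` keeps `eta` at 1 and so reproduces the plain EM iteration, which makes it easy to compare the two variants on the same data.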

### Citations

8919 | Maximum likelihood from incomplete data via the EM algorithm
- Dempster, Laird, et al.
- 1977
Citation Context ... Θ^{t+1}; Θ^{t+1}) ≥ G(Θ^{t+1}; Θ^t) ≥ G(Θ^t; Θ^t) = L(Θ^t). Many popular iterative algorithms are bound optimizers, including the EM algorithm for maximum likelihood learning in latent variable models [3], iterative scaling (IS) algorithms for parameter estimation in maximum entropy models [2] and the recent CCCP algorithm for minimizing the Bethe free energy in approximate inference problems [17]. Boun... |

2491 | Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data
- Lafferty
- 2001
Citation Context ...of lines, like begin-with-number, contains-http, etc. 3 We observe that AGIS outperforms GIS by several orders of magnitude. We have also obtained analogous results training Conditional Random Fields [8]. In our last experiment, we trained NMF and adaptive NMF on the data set of facial images to learn part-based representation of faces [9]. The data set consisted of m=2,429 facial images, each consis... |

1083 | Learning the parts of objects by non-negative matrix factorization
- Lee, Seung
- 1999
Citation Context ...ues of η will result in much faster convergence. 4.4. Non-Negative Matrix Factorization (NMF) Given non-negative matrix V, we are interested in finding non-negative matrices W and H, such that V ≈ WH [9]. Posed as an optimization problem, we are interested in maximizing a negative divergence L(Θ) = −D(V‖WH), subject to Θ = (W; H) ≥ 0 elementwise, where: L(Θ) = −Σ_ij [V_ij ln(V_ij / (WH)_ij) − V_ij + (WH)... |

1060 | The EM Algorithm and Extensions
- McLachlan, Krishnan
- 1997
Citation Context ...d expect the Hessian of the objective function ∂²L(Θ)/∂Θ² |_{Θ=Θ*} to be negative semidefinite, or negative definite, and therefore the eigenvalues of M′(Θ*) all lie in [0, 1] or [0, 1) respectively [11]. Exceptions to convergence of the bound optimizer occur if M′(Θ*) has eigenvalues that exceed unity. Lemma: If BO(η) iterates converge to Θ* and Φ(Θ) and M(Θ) are differentiable in the parameter... |

572 | Inducing features of random fields
- Della Pietra, Della Pietra, et al.
- 1997
Citation Context ...thms are bound optimizers, including the EM algorithm for maximum likelihood learning in latent variable models [3], iterative scaling (IS) algorithms for parameter estimation in maximum entropy models [2] and the recent CCCP algorithm for minimizing the Bethe free energy in approximate inference problems [17]. Bound optimization algorithms enjoy a strong guarantee; they never worsen the objective funct... |

462 | Maximum entropy Markov models for information extraction and segmentation
- McCallum, Freitag, et al.
- 2000
Citation Context ...10 mixture components. 2 Once again, AEM beats EM by a factor of over two, converging to the better likelihood. To present the comparison between GIS and AGIS, we trained Maximum Entropy Markov Model [10] on the Frequently Asked Questions (FAQ) data set. The data set consisted of 38 files belonging to 7 Usenet groups. Each file contains header, followed by a series of one or more question/answer pair... |

419 | Mixtures of probabilistic principal component analysers
- Tipping, Bishop
- 1999
Citation Context ...e randomly initialized). We observe that even for the real, structured data AEM is superior to EM. We also experimented with the Probabilistic Principal Component Analysis (PPCA) latent variable model [13, 15], which has continuous rather than discrete hidden variables. Here the concept of missing information is related to the ratios of the leading eigenvalues of the sample covariance, which corresponds to... |

232 | The EM algorithm for mixtures of factor analyzers
- Ghahramani, Hinton
- 1996
Citation Context ...e, which corresponds to the ellipticity of the distribution. We observe that even for "nice" data, AEM outperforms EM by almost a factor of four. Similar results are displayed in figure 1 for the MFA [5] model. As a confirmation to our analysis, in figure 3 we show the evolution of the adaptive learning rate and the optimal learning rate during fitting the means of the four mixture components i... |

108 | SMEM algorithm for mixture models
- Ueda, Nakano, et al.
- 2000
Citation Context ...ereas NMF converged in approximately 13,500 iterations to exactly the same value of the cost function, losing to ANMF by a factor of almost four. 2 This experiment is similar to the one described in [16]. 3 See [10] for the description of the model and the data set. 4 See [9] for the detailed description of the experiment. 6. Discussion & Future Work In this paper we have analyzed the convergence pro... |

100 | EM algorithms for PCA and SPCA
- Roweis
- 1997
Citation Context ...e randomly initialized). We observe that even for the real, structured data AEM is superior to EM. We also experimented with the Probabilistic Principal Component Analysis (PPCA) latent variable model [13, 15], which has continuous rather than discrete hidden variables. Here the concept of missing information is related to the ratios of the leading eigenvalues of the sample covariance, which corresponds to... |

81 | Aggregate and mixed-order Markov models for statistical language processing
- Saul, Pereira
- 1997
Citation Context ...give speedup over conventional bound optimizers by several orders of magnitude! 5.2. Real World Data Sets To compare AEM and EM, our first experiment consisted of training Aggregate Markov models AMM [14] on the ARPA North American Business News (NAB) corpus, kindly provided to us by Lawrence Saul. AMMs are class-based bigram models in which the mapping from words to classes is probabilistic. The task ... |

56 | Update Rules for Parameter Estimation in Bayesian Networks
- Bauer, Koller, et al.
- 1997
Citation Context ...for a mixture of given densities, and discovered that an EM(η) update rule can be viewed as a first order approximation to the exponentiated gradient EG(η) update. Following this, Bauer et al. (1997) [1] presented an analysis of EM(η) similar to [6] and derived the update rules for parameter estimation in discrete Bayesian networks. However, more general BO(η) methods have not been widely used for se... |

53 | Algorithms for maximum-likelihood logistic regression
- Minka
- 2001
Citation Context ...convergence rate matrix M′(Θ*) in eq. (1). To compare IS and adaptive IS algorithms, we applied both methods to the simple 2-class logistic regression model: p(y = ±1 | x, w) = 1/(1 + exp(−y wᵀx)) [12]. Feature ... [Figure residue: plot panels comparing EM and AEM, including a "2-Class AMM" panel; y-axis: Log-Likelihood + Const]... |

52 | Conjugate gradient acceleration of the EM algorithm
- Jamshidian, Jennrich
- 1993
Citation Context ... we would like to find the optimal learning rate in an adaptive fashion that is computationally inexpensive and valid everywhere. It is possible to perform a line search at each step to determine η [7]; however, this is quite expensive. We now describe a very simple adaptive overrelaxed bound optimization (ABO) algorithm that is guaranteed not to decrease the objective function at each iteration an... |

34 | A comparison of new and old algorithms for a mixture estimation problem
- Helmbold, Singer, et al.
- 1995
Citation Context ...η = 1 BO(η) algorithms become just regular bound optimizers. Several authors have studied a particular variant of this idea as applied to Expectation Maximization. In particular, Helmbold et al. (1995) [6] investigated the problem of estimating the component priors for a mixture of given densities, and discovered that an EM(η) update rule can be viewed as a first order approximation to the exponentiate... |

17 | On the Global and Componentwise Rates of Convergence of the EM algorithm. Linear Algebra and its Applications
- Meng, Rubin
- 1994
Citation Context ...x M′(Θ*). For multidimensional vector Θ, a measure of the actual observed convergence rate is the "global" rate, defined as: r = lim_{t→∞} ‖Θ^{t+1} − Θ*‖ / ‖Θ^t − Θ*‖ (2) with ‖·‖ being Euclidean norm [11]. It is also well-known that under some regularity conditions r = λ_max(M′), the largest eigenvalue of M′(Θ*). All of the eigenvalues of the convergence rate matrix M′(Θ*) lie in the interv... |

6 | The convex-concave computational procedure (CCCP)
- Yuille, Rangarajan
Citation Context ... models [3], iterative scaling (IS) algorithms for parameter estimation in maximum entropy models [2] and the recent CCCP algorithm for minimizing the Bethe free energy in approximate inference problems [17]. Bound optimization algorithms enjoy a strong guarantee; they never worsen the objective function. 2. Overrelaxed Bound Optimization: BO(η) To guarantee an increase in the objective function at each ... |

3 | GENIE gene finding data set
- UCSC and LBNL
Citation Context ...er two. Our second experiment consisted of training a fully connected HMM to model DNA sequences. For the training, we used publicly available "GENIE gene finding data set", provided by UCSC and LBNL [4], that contains 793 unrelated human genomic DNA sequences. We applied our AEM algorithm to 66 DNA sequences with length varying anywhere between 200 to 3000 genes per sequence. The number of states we... |