## Matrix exponentiated gradient updates for on-line learning and Bregman projections (2005)

Venue: Journal of Machine Learning Research

Citations: 47 (9 self)

### BibTeX

```bibtex
@ARTICLE{Tsuda05matrixexponentiated,
  author  = {Koji Tsuda and Gunnar Rätsch and Manfred K. Warmuth},
  title   = {Matrix exponentiated gradient updates for on-line learning and Bregman projections},
  journal = {Journal of Machine Learning Research},
  year    = {2005},
  volume  = {6},
  pages   = {995--1018}
}
```

### Abstract

We address the problem of learning a symmetric positive definite matrix. The central issue is to design parameter updates that preserve positive definiteness. Our updates are motivated by the von Neumann divergence. Rather than treating the most general case, we focus on two key applications that exemplify our methods: on-line learning with a simple square loss, and finding a symmetric positive definite matrix subject to symmetric linear constraints. The updates generalize the Exponentiated Gradient (EG) update and AdaBoost, respectively: the parameter is now a symmetric positive definite matrix of trace one instead of a probability vector (which in this context is a diagonal positive definite matrix with trace one). The generalized updates use matrix logarithms and exponentials to preserve positive definiteness. Most importantly, we show how the analysis of each algorithm generalizes to the non-diagonal case. We apply both new algorithms, called the Matrix Exponentiated Gradient (MEG) update and DefiniteBoost, to learn a kernel matrix from distance measurements.
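The trace-one constraint and the matrix log/exp machinery described in the abstract can be sketched numerically. The following is a minimal NumPy sketch of an MEG-style step for the square loss; the function names, learning rate, and eigendecomposition-based helpers are mine, standing in for proper matrix functions, and this is only an illustration of the idea, not the paper's exact algorithm.

```python
import numpy as np

def sym_logm(S):
    """Matrix logarithm of a symmetric positive definite matrix."""
    w, V = np.linalg.eigh(S)
    return V @ np.diag(np.log(w)) @ V.T

def sym_expm(S):
    """Matrix exponential of a symmetric matrix."""
    w, V = np.linalg.eigh(S)
    return V @ np.diag(np.exp(w)) @ V.T

def meg_step(W, X, y, eta=0.1):
    """One MEG-style step for the square loss L(W) = (y - tr(W X))^2.
    The exp/log keep W symmetric positive definite; the final division
    renormalizes the trace to one (a density matrix)."""
    grad = 2.0 * (np.trace(W @ X) - y) * X   # gradient of the square loss
    M = sym_expm(sym_logm(W) - eta * grad)
    return M / np.trace(M)

# toy usage: one step from the uniform density matrix W = I/d
rng = np.random.default_rng(0)
d = 4
A = rng.standard_normal((d, d))
X = (A + A.T) / 2                            # symmetric instance matrix
W = meg_step(np.eye(d) / d, X, y=0.5)
```

After the step, `W` still has trace one and strictly positive eigenvalues, which is exactly the closure property the abstract emphasizes.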

### Citations

3699 | Convex optimization - Boyd, Vandenberghe - 2004 |

2321 | A decision-theoretic generalization of on-line learning and an application to boosting - Freund, Schapire - 1997 |

Citation Context: ...tions to each region [1]. In the original papers only asymptotic convergence was shown. More recently a connection [4, 7] was made to the AdaBoost algorithm which has an improved convergence analysis [2, 9]. We generalize the latter algorithm and its analysis to symmetric positive definite matrices and call the new algorithm DefiniteBoost. As in the original setting, only approximate projections (Figure...

2036 | Online learning with kernels - Kivinen, Smola, et al. - 2001 |

Citation Context: ...arity between objects i and j. In the feature space, Kij corresponds to the inner product between objects i and j, and thus the Euclidean distance can be computed from the entries of the kernel matrix [10]. In some cases, the kernel matrix is not given explicitly, but only a set of distance measurements is available. The data are represented either as (i) quantitative distance values (e.g., the distanc...
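The snippet above refers to computing Euclidean distances from kernel entries; the identity behind this is d(i, j)² = Kii + Kjj − 2Kij. A minimal sketch (the helper name is mine):

```python
import numpy as np

def kernel_to_sq_distances(K):
    """Squared feature-space distances from a kernel (Gram) matrix:
    d(i, j)^2 = K[i, i] + K[j, j] - 2 * K[i, j]."""
    diag = np.diag(K)
    return diag[:, None] + diag[None, :] - 2 * K

# toy check with an explicit feature map, where K = Phi @ Phi.T
Phi = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]])
K = Phi @ Phi.T
D2 = kernel_to_sq_distances(Kdef) if False else kernel_to_sq_distances(K)
```

Here `D2[0, 2]` equals the ordinary squared Euclidean distance between rows 0 and 2 of `Phi`, namely 2.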

1834 | Robust statistics - Huber - 1981 |

Citation Context: ...ested by the experiments, our algorithm can be straightforwardly applied to learning a covariance matrix. It would also be interesting to use a robust loss Lt(W) for the purpose of ignoring outliers (Huber, 1981) and investigate possible applications of our learning algorithms to quantum statistical inference problems (Barndorff-Nielsen et al., 2003). Our method is designed for learning a positive definite p...

1366 | Quantum Computation and Quantum Information - Nielsen, Chuang - 2000 |

Citation Context: ...onentiated gradient from the Kullback-Leibler divergence. We use the von Neumann divergence (also called quantum relative entropy) for measuring the discrepancy between two positive definite matrices [8]. We derive a new Matrix Exponentiated Gradient update from this divergence (which is a Bregman divergence for positive definite matrices). Finally we prove relative loss bounds using the von Neumann ...

700 | Improved boosting algorithms using confidence-rated predictions - Schapire, Singer - 1999 |

Citation Context: ...linear inequality constraints. The new DefiniteBoost algorithm greedily chooses the most violated constraint and performs an approximated Bregman projection. In the diagonal case, we recover AdaBoost [9]. We also show how the convergence proof of AdaBoost generalizes to the non-diagonal case. 2 von Neumann Divergence or Quantum Relative Entropy If F is a real convex differentiable function on the par...

507 | Distance metric learning, with application to clustering with side information - Xing, Ng, et al. - 2004 |

260 | The relaxation method for finding the common point of convex sets and its application to the solution of problems in convex programming - Bregman - 1967 |

Citation Context: ...intersection of convex regions defined by the constraints. It is well known that the Bregman projection into the intersection of convex regions can be solved by sequential projections to each region (Bregman, 1967; Censor and Lent, 1981). In the original papers only asymptotic convergence was shown. More recently a connection (Kivinen and Warmuth, 1999; Lafferty, 1999) was made to the AdaBoost algorithm which ...

246 | Exponentiated gradient versus gradient descent for linear predictors - Kivinen, Warmuth - 1997 |

Citation Context: ...etric positive definite matrix that serves as a kernel (e.g. [14, 11, 13]). Learning is typically formulated as a parameter updating procedure to optimize a loss function. The gradient descent update [6] is one of the most commonly used algorithms, but it is not appropriate when the parameters form a positive definite matrix, because the updated parameter is not necessarily positive definite. Xing et...
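The point made in this context, that an additive gradient step can leave the positive definite cone while an exp/log update cannot, is easy to check numerically. A toy illustration (the matrices and step size are mine; this is not the paper's exact update):

```python
import numpy as np

def sym_logm(S):
    """Matrix logarithm of a symmetric positive definite matrix."""
    w, V = np.linalg.eigh(S)
    return V @ np.diag(np.log(w)) @ V.T

def sym_expm(S):
    """Matrix exponential of a symmetric matrix."""
    w, V = np.linalg.eigh(S)
    return V @ np.diag(np.exp(w)) @ V.T

W = np.diag([0.9, 0.1])            # positive definite, trace one
G = np.diag([0.0, 1.0])            # some symmetric gradient
eta = 0.5

# additive update: smallest eigenvalue becomes 0.1 - 0.5 < 0
W_gd = W - eta * G

# exp/log update: exp of a symmetric matrix is always positive definite
W_meg = sym_expm(sym_logm(W) - eta * G)
```

`W_gd` has a negative eigenvalue, while `W_meg` stays positive definite for any symmetric gradient and step size.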

107 | Mistake bounds and logarithmic linear-threshold learning algorithms. Unpublished doctoral dissertation - Littlestone - 1989 |

105 | Learning when irrelevant attributes abound - Littlestone - 1987 |

Citation Context: ...We dedicate this paper to Nick Littlestone who first proved relative loss bounds for an algorithm in the EG family, his well known Winnow algorithm for learning disjunctions (Littlestone, 1988, 1989). K.T. and G.R. gratefully acknowledge partial support from the PASCAL Network of Excellence (EU #506778). M.W. was supported by NSF grant CCR 9821087 and UC Discovery grant LSIT0210110. This w...

72 | Relative loss bounds for multidimensional regression problems - Kivinen, Warmuth - 2001 |

Citation Context: ...eness because the matrix exponential maps any symmetric matrix to a positive definite matrix. Bregman divergences play a central role in the motivation and the analysis of on-line learning algorithms [5]. A learning problem is essentially defined by a loss function, and a divergence that measures the discrepancy between parameters. More precisely, the updates are motivated by minimizing the sum of th...

59 | Boosting as entropy projection - Kivinen, Warmuth - 1999 |

Citation Context: ...projection into the intersection of convex regions can be solved by sequential projections to each region [1]. In the original papers only asymptotic convergence was shown. More recently a connection [4, 7] was made to the AdaBoost algorithm which has an improved convergence analysis [2, 9]. We generalize the latter algorithm and its analysis to symmetric positive definite matrices and call the new algo...

49 | Learning kernels from biological networks by maximizing entropy - Tsuda, Noble |

Citation Context: ...ctured parameters. More specifically, when learning a similarity or a distance function among objects, the parameters are defined as a symmetric positive definite matrix that serves as a kernel (e.g. [14, 11, 13]). Learning is typically formulated as a parameter updating procedure to optimize a loss function. The gradient descent update [6] is one of the most commonly used algorithms, but it is not appropriat...

42 | An iterative row-action method for interval convex programming - Censor, Lent - 1981 |

Citation Context: ...convex regions defined by the constraints. It is well known that the Bregman projection into the intersection of convex regions can be solved by sequential projections to each region (Bregman, 1967; Censor and Lent, 1981). In the original papers only asymptotic convergence was shown. More recently a connection (Kivinen and Warmuth, 1999; Lafferty, 1999) was made to the AdaBoost algorithm which has an improved converg...

41 | The EM algorithm for kernel matrix completion with auxiliary data - Tsuda, Akaho, et al. - 2004 |

Citation Context: ...t in which the distance examples are created from a known target kernel matrix. We used a 52 × 52 kernel matrix among gyrB proteins of bacteria (d = 52). This data contains three bacteria species (see [12] for details). Each distance example is created by randomly choosing one element of the target kernel. The initial parameter was set as W1 = I/d. When the comparison matrix U is set to the target matr...

41 | On-line learning of linear functions - Littlestone, Long, et al. - 1991 |

Citation Context: ...) operations (excluding the cost of identifying violated constraints). Similar bounds on the number of iterations for solving a system of linear equations with the EG algorithm were first proven in (Littlestone et al., 1992, Corollary 15). Observe that if (4.1) is not feasible, then one may continue finding ε-violated constraints and the primal objective can become unbounded, i.e. ∑t αt may become unbounded. 4.5 Relati...

39 | Additive models, boosting, and inference for generalized divergences - Lafferty - 1999 |

Citation Context: ...projection into the intersection of convex regions can be solved by sequential projections to each region [1]. In the original papers only asymptotic convergence was shown. More recently a connection [4, 7] was made to the AdaBoost algorithm which has an improved convergence analysis [2, 9]. We generalize the latter algorithm and its analysis to symmetric positive definite matrices and call the new algo...

29 | Distance Metric Learning with Kernels - Tsang, Kwok |

Citation Context: ...ctured parameters. More specifically, when learning a similarity or a distance function among objects, the parameters are defined as a symmetric positive definite matrix that serves as a kernel (e.g. [14, 11, 13]). Learning is typically formulated as a parameter updating procedure to optimize a loss function. The gradient descent update [6] is one of the most commonly used algorithms, but it is not appropriat...

20 | Robust boosting via convex optimization - Rätsch - 2001 |

Citation Context: ...n time O(d³ log d log(1/ε)/ε²) (excluding the cost of identifying violated constraints). Additionally we assert that a slightly more advanced adaptation of θ during the optimization (as was done by Rätsch, 2001; Rätsch and Warmuth, 2005, for the diagonal case) will yield the reduced time complexity of O(d³ log d/ε²). Rigorous proofs of these conjectures go beyond the scope of this paper. 5. Experiments on...

18 | Lower Bounds for the Helmholtz Function - Golden - 1965 |

Citation Context: ...such that 0 < a ≤ 2b/(2 + r²b) and any learning rate η = 2b/(2 + r²b), we have a(yt − tr(WtXt))² − b(yt − tr(UXt))² ≤ ∆(U, Wt) − ∆(U, Wt+1) (5). In the proof, we use the Golden-Thompson inequality [3], i.e., tr[exp(A + B)] ≤ tr[exp(A)exp(B)] for symmetric matrices A and B. We also needed to prove the following generalization of Jensen's inequality to matrices: exp(ρ1A + ρ2(I − A)) ≤ exp(ρ1)A + exp...
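The Golden-Thompson inequality tr[exp(A + B)] ≤ tr[exp(A)exp(B)] cited in this context can be sanity-checked numerically. A small sketch (the helper and the random test matrices are mine):

```python
import numpy as np

def sym_expm(S):
    """Matrix exponential of a symmetric matrix via eigendecomposition."""
    w, V = np.linalg.eigh(S)
    return V @ np.diag(np.exp(w)) @ V.T

# Golden-Thompson: tr[exp(A + B)] <= tr[exp(A) exp(B)] for symmetric A, B
rng = np.random.default_rng(1)
for _ in range(100):
    M, N = rng.standard_normal((2, 3, 3))
    A, B = (M + M.T) / 2, (N + N.T) / 2      # random symmetric matrices
    lhs = np.trace(sym_expm(A + B))
    rhs = np.trace(sym_expm(A) @ sym_expm(B))
    assert lhs <= rhs + 1e-9 * abs(rhs)      # holds for every sample
```

Note the two sides coincide exactly when A and B commute, which is why the diagonal (vector) case of the analysis needs no such inequality.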

17 | Maximizing the margin with boosting - Rätsch, Warmuth - 2002 |

Citation Context: ...en θ < θ* and also θ > θ*. This allows the design of a binary search procedure to approximate θ* in a few steps. Based on this idea we previously proposed a margin maximizing version of AdaBoost (Rätsch and Warmuth, 2002). For this algorithm we could show that after O(log d log(1/ε)/ε²) iterations the algorithm achieved an optimal solution within accuracy ε. We claim that the outlined binary search procedure can als...

15 | Finding the common point of convex sets by the method of successive projection - Brègman - 1965 |

Citation Context: ...intersection of convex regions defined by the constraints. It is well known that the Bregman projection into the intersection of convex regions can be solved by sequential projections to each region [1]. In the original papers only asymptotic convergence was shown. More recently a connection [4, 7] was made to the AdaBoost algorithm which has an improved convergence analysis [2, 9]. We generalize th...
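The sequential-projection scheme this context describes can be illustrated with the simplest Bregman divergence, the squared Euclidean norm, for which the Bregman projection is the ordinary orthogonal projection. A toy sketch with two lines as the convex sets (the particular lines and starting point are mine):

```python
import numpy as np

def proj_line(p, a, b):
    """Euclidean projection of p onto the convex set {x : <a, x> = b}."""
    return p - (a @ p - b) / (a @ a) * a

# two convex sets: the lines x + y = 1 and x - 2y = 0,
# intersecting at the single point (2/3, 1/3)
a1, b1 = np.array([1.0, 1.0]), 1.0
a2, b2 = np.array([1.0, -2.0]), 0.0

p = np.array([3.0, 0.0])
for _ in range(100):        # sequential projections, one set at a time
    p = proj_line(p, a1, b1)
    p = proj_line(p, a2, b2)
```

The iterates converge to the common point, but in general only asymptotically, which is the limitation of the original analyses that the context mentions.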

14 | Batch and on-line parameter estimation of Gaussian mixtures based on the joint entropy - Singer, Warmuth - 1998 |

1 | A comparison of new and old algorithms for a mixture estimation problem - Helmbold, Schapire, et al. - 1997 |

Citation Context: ...We now upper bound the second term using the inequality log(1 − p(1 − exp(q))) ≤ pq + q²/8, for 0 ≤ p ≤ 1 and q ∈ R (Helmbold et al., 1997): h ≤ (yt − tr(XtWt))²((2 + r²b)η² − 4bη + 2ab). It remains to show q = (2 + r²b)η² − 4bη + 2ab ≤ 0. We easily see that q is minimized for η = 2b/(2 + r²b) and that for this value of η we have q ≤...

1 | Efficient margin maximization with boosting (submitted) - Rätsch, Warmuth |