## Low-Rank Kernel Learning with Bregman Matrix Divergences

Citations: 21 (1 self)

### BibTeX

```bibtex
@MISC{Kulis_low-rankkernel,
  author = {Brian Kulis and Mátyás A. Sustik and Inderjit S. Dhillon},
  title  = {Low-Rank Kernel Learning with Bregman Matrix Divergences},
  year   = {}
}
```

### Abstract

In this paper, we study low-rank matrix nearness problems, with a focus on learning low-rank positive semidefinite (kernel) matrices for machine learning applications. We propose efficient algorithms that scale linearly in the number of data points and quadratically in the rank of the input matrix. Existing algorithms for learning kernel matrices often scale poorly, with running times that are cubic in the number of data points. We employ Bregman matrix divergences as the measures of nearness; these divergences are natural for learning low-rank kernels since they preserve rank as well as positive semidefiniteness. Special cases of our framework yield faster algorithms for various existing learning problems, and experimental results demonstrate that our algorithms can effectively learn both low-rank and full-rank kernel matrices.

### Citations

1512 | Quantum computation and quantum information - Nielsen, Chuang - 2000 |

Citation Context: ...ence is D_vN(X, Y) = tr(X log X − X log Y − X + Y), and we call it the von Neumann divergence. This divergence is also called quantum relative entropy, and is used in quantum information theory (Nielsen and Chuang, 2000). Another important matrix divergence arises by taking the Burg entropy of the eigenvalues, i.e., φ(X) = −∑_i log λ_i, or equivalently as φ(X) = −log det X. The resulting Bregman divergence over positi... |
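The two matrix divergences referenced in this excerpt can be restated cleanly in LaTeX. The von Neumann formula is quoted directly above; the LogDet formula completes the truncated Burg-entropy example using its standard closed form for n × n positive definite matrices (a well-known identity, not text from this page):

```latex
% von Neumann divergence, generated by phi(X) = tr(X log X - X):
D_{\mathrm{vN}}(X, Y) = \operatorname{tr}\bigl(X \log X - X \log Y - X + Y\bigr)

% LogDet (Burg) divergence, generated by phi(X) = -\log\det X:
D_{\ell d}(X, Y) = \operatorname{tr}(XY^{-1}) - \log\det(XY^{-1}) - n
```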

862 | Kernel Methods for Pattern Analysis - Shawe-Taylor, Cristianini - 2004 |

794 | A fast algorithm for particle simulations - Greengard, Rokhlin - 1987 |

583 | Learning the kernel matrix with semidefinite programming - Lanckriet, Cristianini, et al. |

575 | Applied Numerical Linear Algebra - Demmel - 1997 |

Citation Context: ...me needed by matrix multiplication. We devote this section to the development of this fast multiplication algorithm. Recall the algorithm used for the Cholesky factorization of an r × r matrix A (see Demmel, 1997, page 78):

    for j = 1 to r do
        l_jj = (a_jj − ∑_{k=1}^{j−1} l_jk^2)^{1/2}
        for i = j + 1 to r do
            l_ij = (a_ij − ∑_{k=1}^{j−1} l_ik l_jk) / l_jj
        end for
    end for

... |
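The quoted Cholesky pseudocode translates directly into code. A minimal sketch (the function name and list-of-lists matrix representation are our own, not the paper's):

```python
import math

def cholesky(A):
    """Cholesky factorization of an r x r symmetric positive definite
    matrix A, following the loop structure quoted from Demmel (1997).
    Returns the lower-triangular factor L with A = L L^T."""
    r = len(A)
    L = [[0.0] * r for _ in range(r)]
    for j in range(r):
        # diagonal entry: l_jj = (a_jj - sum_k l_jk^2)^(1/2)
        L[j][j] = math.sqrt(A[j][j] - sum(L[j][k] ** 2 for k in range(j)))
        for i in range(j + 1, r):
            # off-diagonal: l_ij = (a_ij - sum_k l_ik l_jk) / l_jj
            L[i][j] = (A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))) / L[j][j]
    return L
```

For the 2 × 2 matrix [[4, 2], [2, 3]] this yields L = [[2, 0], [1, √2]], and L Lᵀ reproduces the input.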

363 | Metric learning for large margin nearest neighbor classification - Weinberger, Blitzer, et al. - 2005 |

Citation Context: ...gainst other existing metric learning algorithms. Such methods include metric learning by collapsing classes (MCML) (Globerson and Roweis, 2005), large-margin nearest neighbor metric learning (LMNN) (Weinberger et al., 2005), and many others. In the experimental results section, we provide some results comparing our algorithms with these existing methods. 6.2 Generalizations 6.2.1 Slack Variables In many cases, especial... |

362 | Algorithms for Minimization without Derivatives - Brent - 1973 |

Citation Context: ...ik and Dhillon (2008). Using the approach from the previous section, f(α) can be computed in O(r²) time. One natural choice to find the root of f(α) is to apply Brent's general root finding method (Brent, 1973), which does not need the calculation of the derivative of f(α). We have built an even more efficient custom root finder that is optimized for this problem. We rarely need more than six evaluations p... |
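The excerpt contrasts Brent's method with the authors' custom root finder but gives no code for either. As a hedged illustration of derivative-free bracketing root finding (a regula-falsi step with a bisection safeguard, much simpler than Brent's actual method or the paper's custom solver):

```python
def find_root(f, a, b, tol=1e-12, max_iter=200):
    """Derivative-free root finding on a bracket [a, b] with f(a)f(b) < 0.

    Takes regula-falsi (secant-between-endpoints) steps, falling back to
    bisection whenever the interpolated point leaves the bracket.
    """
    fa, fb = f(a), f(b)
    if fa * fb > 0:
        raise ValueError("f(a) and f(b) must have opposite signs")
    m = 0.5 * (a + b)
    for _ in range(max_iter):
        # interpolated step; guard against a flat secant
        m = b - fb * (b - a) / (fb - fa) if fb != fa else 0.5 * (a + b)
        if not (min(a, b) < m < max(a, b)):
            m = 0.5 * (a + b)  # bisection safeguard
        fm = f(m)
        if abs(fm) < tol:
            return m
        if fa * fm < 0:     # root lies in [a, m]: shrink right endpoint
            b, fb = m, fm
        else:               # root lies in [m, b]: shrink left endpoint
            a, fa = m, fm
    return m
```

Like Brent's method, this needs only function evaluations and a sign-changing bracket, which matches the setting described in the excerpt where f(α) is cheap to evaluate but its derivative is not.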

309 | Estimation with quadratic loss - James, Stein - 1961 |

Citation Context: ...component analysis (Warmuth and Kuzmin, 2006). The LogDet divergence is called Stein's loss in the statistics literature, where it has been used as a measure of distance between covariance matrices (James and Stein, 1961). It has also been employed in the optimization community; the updates for the BFGS and DFP algorithms (Fletcher, 1991), both quasi-Newton algorithms, can be viewed as LogDet optimization programs. I... |

293 | The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming - Bregman - 1967 |

Citation Context: ...tor divergences. Let ϕ be a real-valued strictly convex function defined over a convex set S = dom(ϕ) ⊆ R^m such that ϕ is differentiable on the relative interior of S. The Bregman vector divergence (Bregman, 1967) with respect to ϕ is defined as D_ϕ(x, y) = ϕ(x) − ϕ(y) − (x − y)^T ∇ϕ(y). For example, if ϕ(x) = x^T x, then the resulting Bregman divergence is D_ϕ(x, y) = ‖x − y‖²₂. Another example is ϕ(x) = ∑_i ... |
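The definition quoted here is easy to check numerically. A minimal sketch (the function names are our own):

```python
def bregman_divergence(phi, grad_phi, x, y):
    """D_phi(x, y) = phi(x) - phi(y) - (x - y)^T grad_phi(y)."""
    g = grad_phi(y)
    return phi(x) - phi(y) - sum((xi - yi) * gi for xi, yi, gi in zip(x, y, g))

# phi(x) = x^T x generates the squared Euclidean distance
phi = lambda v: sum(vi * vi for vi in v)
grad_phi = lambda v: [2.0 * vi for vi in v]
```

With these choices, D_ϕ(x, y) equals ‖x − y‖²₂, matching the first example in the excerpt.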

248 | On kernel-target alignment - Cristianini, Shawe-Taylor, et al. - 2002 |

197 | Integrating constraints and metric learning in semi-supervised clustering - Bilenko, Basu, et al. - 2004 |

Citation Context: ...ins handwritten samples of the digits 3, 8, and 9. The raw data for each digit is 16-dimensional; this subset contains 317 digits and is a standard data set for semi-supervised clustering (e.g., see Bilenko et al., 2004). 2. GyrB: a protein data set: a 52 × 52 kernel matrix among bacteria proteins, containing three bacteria species. This matrix is identical to the one used to test the DefiniteBoost algorithm in Tsud... |

197 | Efficient SVM training using low-rank kernel representations - Fine, Scheinberg - 2001 |

166 | Information-theoretic metric learning - Davis, Kulis, et al. - 2007 |

Citation Context: ...0 = G_0 G_0^T. We can view the matrix B as a linear transformation applied to our input data vectors G_0 and therefore we can apply this linear transformation to new points. In particular, recent work (Davis et al., 2007) has shown that the learning algorithms considered in this paper can equivalently be viewed as learning a Mahalanobis distance function given constraints on the data, which is simply the Euclidean di... |

163 | Impact of Similarity Measures on Web-page Clustering - Strehl, Ghosh, et al. - 2002 |

Citation Context: ...easure, a standard technique for determining quality of clusters, which measures the amount of statistical information shared by the random variables representing the cluster and class distributions (Strehl et al., 2000). If C is the random variable denoting the cluster assignments of the points, and K is the random variable denoting the underlying class labels on the points, then the NMI measure is defined as: NMI = ... |
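The NMI definition is truncated in the excerpt. One common normalization divides the mutual information by the geometric mean of the two entropies; a sketch under that assumption (the paper's exact normalization may differ):

```python
import math
from collections import Counter

def nmi(clusters, classes):
    """Normalized mutual information between two labelings,
    NMI = I(C; K) / sqrt(H(C) H(K))  (one common normalization)."""
    n = len(clusters)
    pc = Counter(clusters)               # cluster-label counts
    pk = Counter(classes)                # class-label counts
    joint = Counter(zip(clusters, classes))
    # I(C; K) = sum_{c,k} p(c,k) log( p(c,k) / (p(c) p(k)) )
    mi = sum((nij / n) * math.log((n * nij) / (pc[c] * pk[k]))
             for (c, k), nij in joint.items())
    hc = -sum((m / n) * math.log(m / n) for m in pc.values())
    hk = -sum((m / n) * math.log(m / n) for m in pk.values())
    return mi / math.sqrt(hc * hk) if hc > 0 and hk > 0 else 0.0
```

A clustering that exactly matches the class labels (up to relabeling) scores 1, and a degenerate single-cluster assignment scores 0.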

155 | A hierarchical O(N log N) force calculation algorithm - Barnes, Hut |

145 | Metric learning by collapsing classes - Globerson, Roweis - 2005 |

Citation Context: ...n be viewed as Mahalanobis distance learning techniques, it is natural to compare against other existing metric learning algorithms. Such methods include metric learning by collapsing classes (MCML) (Globerson and Roweis, 2005), large-margin nearest neighbor metric learning (LMNN) (Weinberger et al., 2005), and many others. In the experimental results section, we provide some results comparing our algorithms with these exi... |

139 | Kernel kmeans, spectral clustering and normalized cuts - Dhillon, Guan, et al. - 2004 |

Citation Context: ...position. Empirically, the algorithm in Fine and Scheinberg (2001) outperforms other SVM training algorithms in terms of training time by several factors. In clustering, the kernel k-means algorithm (Dhillon et al., 2004) has a running time of O(n²) per iteration, which can be improved to O(nrc) time per iteration with a low-rank kernel representation, where c is the number of desired clusters. Low-rank kernel repr... |

136 | Some modified matrix eigenvalue problems - Golub - 1973 |

133 | Learning a distance metric from relative comparisons - Schultz, Joachims - 2004 |

123 | Learning a kernel matrix for nonlinear dimensionality reduction - Weinberger, Sha, et al. - 2004 |

Citation Context: ...nel learning using semidefinite programming. In Kwok and Tsang (2003), a formulation based on idealized kernels is presented to learn a kernel matrix when some labels are given. Another recent paper (Weinberger et al., 2004) considers learning a kernel matrix for nonlinear dimensionality reduction; like much of the research on learning a kernel matrix, they use semidefinite programming and the running time is at least c... |

73 | Adjustment of an inverse matrix corresponding to a change in one element of a given matrix - Sherman, Morrison - 1950 |

Citation Context: ...0 and a rank-one constraint matrix A = zz^T is: X_{t+1} = V_t((V_t^T X_t V_t)^{−1} − α(V_t^T zz^T V_t))^{−1} V_t^T, where the eigendecomposition of X_t is V_t Λ_t V_t^T. Recall the Sherman–Morrison inverse formula (Sherman and Morrison, 1949; Golub and Van Loan, 1996): (A + uv^T)^{−1} = A^{−1} − (A^{−1} uv^T A^{−1}) / (1 + v^T A^{−1} u). Applying this formula to the middle term of the projecti... |
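The Sherman–Morrison identity quoted above can be verified directly from a known inverse: given A⁻¹, it yields (A + uvᵀ)⁻¹ without re-inverting. A minimal pure-Python sketch (helper names are ours):

```python
def matvec(A, x):
    """Matrix-vector product for a list-of-lists matrix."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def sherman_morrison(Ainv, u, v):
    """(A + u v^T)^{-1} = A^{-1} - (A^{-1} u)(v^T A^{-1}) / (1 + v^T A^{-1} u)."""
    Au = matvec(Ainv, u)                          # column vector A^{-1} u
    vA = matvec(list(map(list, zip(*Ainv))), v)   # row vector v^T A^{-1}
    denom = 1.0 + sum(vi * ai for vi, ai in zip(v, Au))
    n = len(Ainv)
    return [[Ainv[i][j] - Au[i] * vA[j] / denom for j in range(n)]
            for i in range(n)]
```

For example, updating the 2 × 2 identity with u = v = e₁ gives A + uvᵀ = diag(2, 1), whose inverse diag(1/2, 1) the formula reproduces.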

69 | Fast image search for learned metrics - Jain, Kulis, et al. - 2008 |

66 | Predictive low-rank decomposition for kernel methods - Bach, Jordan - 2005 |

55 | Learning with idealized kernels - Kwok, Tsang - 2003 |

52 | A hierarchical O(NlogN) force calculation algorithm - Barnes, Hut - 1986 |

51 | Matrix exponentiated gradient updates for on-line learning and Bregman projection - Tsuda, Rätsch, et al. |

47 | Parallel optimization - Censor, Zenios - 1997 |

Citation Context: ...ion problem (5) presented above, but without the rank constraint (we will see how to handle the rank constraint later). To solve this problem, we use the method of Bregman projections (Bregman, 1967; Censor and Zenios, 1997). Suppose we wish to minimize D_φ(X, X_0) subject to linear equality and inequality constraints. The idea behind Bregman projections is to choose one constraint per iteration, and perform a projection ... |
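For the squared-Euclidean generating function, the Bregman projection onto a single linear equality constraint has a closed form, and cycling through the constraints gives the flavor of the method described in the excerpt. A hedged sketch on vectors (the paper works with matrix divergences; this is only the simplest Euclidean analogue, and the names are ours):

```python
def project_hyperplane(x, a, b):
    """Bregman projection of x onto {y : a^T y = b} for phi(y) = ||y||^2 / 2,
    i.e., the ordinary Euclidean projection onto the hyperplane."""
    ax = sum(ai * xi for ai, xi in zip(a, x))
    aa = sum(ai * ai for ai in a)
    step = (ax - b) / aa
    return [xi - step * ai for xi, ai in zip(x, a)]

def bregman_projections(x0, constraints, sweeps=50):
    """Cyclically project onto each equality constraint (a, b),
    one constraint per iteration, as in the method of Bregman projections."""
    x = list(x0)
    for _ in range(sweeps):
        for a, b in constraints:
            x = project_hyperplane(x, a, b)
    return x
```

With orthogonal constraints the iterate lands on the intersection after one sweep; in general, repeated sweeps converge to a point satisfying all the equality constraints.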

41 | A stable and efficient algorithm for the rank-one modification of the symmetric eigenproblem - Gu, Eisenstat - 1994 |

37 | Learning low-rank kernel matrices - Kulis, Sustik, et al. - 2006 |

33 | Computing the nearest correlation matrix—a problem from finance - Higham - 2002 |

32 | Large margin component analysis - Torresani, Lee |

21 | Fixing two weaknesses of the spectral method - Lang - 2006 |

Citation Context: ...e given an n-vertex graph, whose adjacency matrix is A. Let L = diag(Ae) − A be the Laplacian of A, where e is the vector of all ones. The semidefinite relaxation to the minimum balanced cut problem (Lang, 2005) results in the following SDP: min_X tr(LX) subject to diag(X) = e, tr(X ee^T) = 0, X ⪰ 0. Let L† denote the pseudoinverse of the Laplacian, and let V be an orthonormal basis for the range space of L... |

18 | Randomized PCA algorithms with regret bounds that are logarithmic in the dimension - Warmuth, Kuzmin |

Citation Context: ...m Tsuda et al. (2005); our analysis shows how to improve the running time of their algorithm by a factor of n, from O(n³) time per projection to O(n²). This projection is also used in online-PCA (Warmuth and Kuzmin, 2006), and we obtain a factor of n speedup for that problem as well. In terms of experimental results, a direct application of our techniques is in learning low-rank kernel matrices in the setting where w... |

17 | A new variational result for quasi-Newton formulae - Fletcher - 1991 |

Citation Context: ...re it has been used as a measure of distance between covariance matrices (James and Stein, 1961). It has also been employed in the optimization community; the updates for the BFGS and DFP algorithms (Fletcher, 1991), both quasi-Newton algorithms, can be viewed as LogDet optimization programs. In particular, the update to the approximation of the Hessian given in these algorithms is the result of a LogDet minimi... |

13 | Matrix nearness problems with Bregman divergence - Dhillon, Tropp |

10 | Fast low-rank semidefinite programming for embedding and clustering - Kulis, Surendran, et al. - 2007 |

2 | Scalable semidefinite programming using convex perturbations - Kulis, Sra, et al. - 2007 |

1 | On some modified root-finding problems. Working manuscript - Sustik, Dhillon - 2008 |