## Information-theoretic metric learning (2007)

Venue: NIPS 2006 Workshop on Learning to Compare Examples

Citations: 149 (13 self)

### BibTeX

```bibtex
@INPROCEEDINGS{Davis07information-theoreticmetric,
  author    = {Jason Davis and Brian Kulis and Suvrit Sra and Inderjit Dhillon},
  title     = {Information-theoretic metric learning},
  booktitle = {NIPS 2006 Workshop on Learning to Compare Examples},
  year      = {2007}
}
```

### Abstract

We formulate the metric learning problem as that of minimizing the differential relative entropy between two multivariate Gaussians under constraints on the Mahalanobis distance function. Via a surprising equivalence, we show that this problem can be solved as a low-rank kernel learning problem. Specifically, we minimize the Burg divergence of a low-rank kernel to an input kernel, subject to pairwise distance constraints. Our approach has several advantages over existing methods. First, we present a natural information-theoretic formulation of the problem. Second, the algorithm utilizes the methods developed by Kulis et al. [6], which do not involve any eigenvector computation; in particular, our method runs faster than most existing techniques. Third, the formulation offers insights into the connections between metric learning and kernel learning.
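The Burg (LogDet) divergence that the abstract refers to has the closed form D_ℓd(A, A₀) = tr(A A₀⁻¹) − log det(A A₀⁻¹) − d for d×d positive definite matrices. As a rough illustration (not the authors' code), it can be computed for 2×2 matrices with plain Python; the matrices below are made-up examples:

```python
import math

# Minimal 2x2 helpers (illustrative only; a real implementation would use numpy).
def det2(M):
    return M[0][0] * M[1][1] - M[0][1] * M[1][0]

def inv2(M):
    d = det2(M)
    return [[ M[1][1] / d, -M[0][1] / d],
            [-M[1][0] / d,  M[0][0] / d]]

def matmul2(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def burg_divergence(A, A0):
    """LogDet (Burg) divergence: tr(A A0^-1) - log det(A A0^-1) - d, here d = 2."""
    P = matmul2(A, inv2(A0))
    trace = P[0][0] + P[1][1]
    return trace - math.log(det2(P)) - 2

A  = [[2.0, 0.5], [0.5, 1.0]]   # hypothetical positive definite matrices
A0 = [[1.0, 0.0], [0.0, 1.0]]
print(burg_divergence(A0, A0))      # 0.0: the divergence of a matrix to itself vanishes
print(burg_divergence(A, A0) > 0)   # True: the divergence is nonnegative
```

Like other Bregman divergences, it is zero exactly when the two matrices coincide and positive otherwise, which is what makes it usable as a "distance to the input kernel" objective.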

### Citations

530 | Theory of point estimation
- Lehmann, Casella
- 1998
Citation Context: "...s & Stein, 1961). It can be shown that Stein’s loss is the unique scale-invariant loss function for which the uniform minimum variance unbiased estimator is also a minimum risk equivariant estimator (Lehmann & Casella, 2003). In the context of metric learning, the scale invariance implies that the divergence (4.1) remains invariant under any scaling of the feature space. The result can be further generalized to invarian..."

504 | Distance metric learning with application to clustering with side-information
- Xing, Ng, et al.
- 2003
Citation Context: "...s the number of distance constraints, and d is the dimensionality of the data. In particular, this method does not require costly eigenvalue computations, unlike many other metric learning algorithms [4, 10, 11]. 2 Problem Formulation Given a set of n points {x1, ..., xn} in ℝ^d, we seek a positive definite matrix A which parameterizes the Mahalanobis distance: dA(xi, xj) = (xi − xj)^T A (xi − xj). We assume th..."
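The Mahalanobis distance defined in this snippet, d_A(x_i, x_j) = (x_i − x_j)^T A (x_i − x_j), is straightforward to compute directly. A small sketch, with made-up points and an illustrative diagonal A (not values from the paper):

```python
def mahalanobis_sq(x, y, A):
    """Mahalanobis distance (x - y)^T A (x - y) for a positive definite matrix A."""
    v = [xi - yi for xi, yi in zip(x, y)]
    Av = [sum(A[i][j] * v[j] for j in range(len(v))) for i in range(len(v))]
    return sum(vi * avi for vi, avi in zip(v, Av))

x = [1.0, 2.0]
y = [3.0, 1.0]
A_identity = [[1.0, 0.0], [0.0, 1.0]]   # reduces to squared Euclidean distance
A_weighted = [[2.0, 0.0], [0.0, 0.5]]   # hypothetical learned weighting

print(mahalanobis_sq(x, y, A_identity))  # 5.0  ((-2)^2 + 1^2)
print(mahalanobis_sq(x, y, A_weighted))  # 8.5  (2*4 + 0.5*1)
```

Setting A to the identity recovers squared Euclidean distance; learning a non-identity A is what lets the metric stretch or shrink directions of the feature space.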

326 | Distance Metric Learning for Large Margin Nearest Neighbor Classification
- Weinberger, Blitzer, et al.
- 2006
Citation Context: "...ghly problem-specific and ultimately dictates the success or failure of the learning algorithm. To this end, there have been several recent approaches that attempt to learn distance functions, e.g., (Weinberger et al., 2005; Xing et al., 2002; Globerson & Roweis, 2005; Shalev-Shwartz et al., 2004). These methods work by exploiting distance information that is intrinsically available in many learning settings. For exampl..."

263 | Estimation with quadratic loss
- James, Stein
- 1956
Citation Context: "...KL(p(x; A0) ‖ p(x; A)) = ½ Dℓd(A0⁻¹, A⁻¹) = ½ Dℓd(A, A0), (4.2) where the second line follows from definition (4.1). The LogDet divergence is also known as Stein’s loss, having originated in the work of (James & Stein, 1961). It can be shown that Stein’s loss is the unique scale-invariant loss function for which the uniform minimum variance unbiased estimator is also a minimum risk equivariant estimator (Lehmann & Casel..."

256 | Parallel Optimization: Theory, Algorithms, and Applications
- Censor, Zenios
- 1997
Citation Context: "...ove. 4. Algorithm In this section, we first show that our information-theoretic objective (3.3) can be expressed as a particular type of Bregman divergence, which allows us to adapt Bregman’s method (Censor & Zenios, 1997) to solve the metric learning problem. We then show a surprising equivalence to a recently-proposed low-rank kernel learning problem (Kulis et al., 2006), allowing kernelization of the algorithm. 4.1..."
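Bregman's method cycles through the constraints, projecting the current solution onto one constraint at a time under the chosen divergence. Under the LogDet divergence, the projection onto a single equality constraint v^T A v = b (with v = x_i − x_j) has the rank-one closed form A' = A + β (A v)(A v)^T with p = v^T A v and β = (b − p)/p², which requires no eigendecomposition. The sketch below illustrates just this one projection step on made-up 2×2 data; it omits the slack variables and inequality constraints of the full algorithm:

```python
def matvec(A, v):
    return [sum(A[i][j] * v[j] for j in range(len(v))) for i in range(len(v))]

def project_logdet(A, v, b):
    """Bregman projection of A onto {A : v^T A v = b} under the LogDet divergence.
    Closed form: A' = A + beta * (A v)(A v)^T, beta = (b - p) / p^2, p = v^T A v."""
    Av = matvec(A, v)
    p = sum(vi * avi for vi, avi in zip(v, Av))
    beta = (b - p) / (p * p)
    return [[A[i][j] + beta * Av[i] * Av[j] for j in range(len(v))]
            for i in range(len(v))]

A = [[1.0, 0.0], [0.0, 1.0]]
v = [2.0, -1.0]                        # difference of a hypothetical point pair
A_new = project_logdet(A, v, b=1.0)    # force this pair's distance down to 1.0

Av = matvec(A_new, v)
print(sum(vi * avi for vi, avi in zip(v, Av)))  # ~1.0: constraint now satisfied
```

One can check that v^T A' v = p + β p² = b, so the projected matrix satisfies the constraint exactly; positive-definiteness is also preserved for b > 0, since A'⁻¹ = A⁻¹ + α vv^T for a suitable α.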

247 | Exponentiated gradient versus gradient descent for linear predictors
- KIVINEN, WARMUTH
- 1997
Citation Context: "...get distance, the algorithm uses its current Mahalanobis matrix to predict the distance between the given pair of points at each step. This formulation is standard in many online regression settings (Kivinen & Warmuth, 1997). More formally, assume the algorithm receives an instance (xt, yt, dt) at time step t, and predicts d̂t = dAt(xt, yt) using the current model At. The loss associated with this prediction is measure..."
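The online protocol in this snippet — receive (x_t, y_t, d_t), predict d̂_t with the current model A_t, incur a squared loss, update — can be sketched as a simple loop. This is a hypothetical plain-gradient version with made-up data, not the paper's algorithm: it omits the adaptive learning rate the authors use to keep A_t positive definite, and the triple in `stream` is invented for illustration.

```python
def predict(A, x, y):
    """Predicted Mahalanobis distance d_hat = (x - y)^T A (x - y)."""
    v = [xi - yi for xi, yi in zip(x, y)]
    Av = [sum(A[i][j] * v[j] for j in range(len(v))) for i in range(len(v))]
    return sum(vi * avi for vi, avi in zip(v, Av))

def update(A, x, y, d_true, eta=0.01):
    """One gradient step on the loss (d_hat - d_true)^2; its gradient in A
    is 2 * (d_hat - d_true) * v v^T with v = x - y."""
    v = [xi - yi for xi, yi in zip(x, y)]
    g = 2.0 * (predict(A, x, y) - d_true)
    return [[A[i][j] - eta * g * v[i] * v[j] for j in range(len(v))]
            for i in range(len(v))]

A = [[1.0, 0.0], [0.0, 1.0]]
stream = [([0.0, 0.0], [1.0, 1.0], 0.5)] * 50   # one repeated (x_t, y_t, d_t) triple
for x, y, d in stream:
    A = update(A, x, y, d)

print(abs(predict(A, [0.0, 0.0], [1.0, 1.0]) - 0.5))  # small: prediction approaches d_t
```

On this toy stream the prediction error contracts geometrically toward the target distance; the point of the paper's LogDet-based update, by contrast, is to get such convergence with provable regret bounds and guaranteed positive-definiteness.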

243 | Discriminant adaptive nearest neighbor classification
- Hastie, Tibshirani
- 1996
Citation Context: "...includes (Shalev-Shwartz et al., 2004) (online metric learning), Relevant Components Analysis (RCA) (Shental et al., 2002) (similar to discriminant analysis), locally-adaptive discriminative methods (Hastie & Tibshirani, 1996), and learning from relative comparisons (Schutz & Joachims, 2003). Non-Mahalanobis based metric learning methods have also been proposed, though these methods usually suffer from suboptimal performa..."

208 | Neighbourhood components analysis
- Goldberger, Roweis, et al.
- 2005
Citation Context: "...ve also been proposed, though these methods usually suffer from suboptimal performance, non-convexity, or computational complexity. Some example methods include neighborhood component analysis (NCA) (Goldberger et al., 2004) that learns a distance metric specifically for nearest-neighbor based classification; convolutional neural net based methods of (Chopra et al., 2005); and a general Riemannian metric learning method..."

183 | A probabilistic framework for semi-supervised clustering
- Basu, Bilenko, et al.
Citation Context: "...esults. Note that both MCML and LMNN are not amenable to optimization subject to pairwise distance constraints. Instead, we compare our method to the semi-supervised clustering algorithm HMRF-KMeans (Basu et al., 2004). We use a standard 2-fold cross-validation approach for evaluating semi-supervised clustering results. Distances are constrained to be either similar or dissimilar, based on class values, and are dr..."

130 | Metric learning by collapsing classes
- Globerson, Roweis
- 2006
Citation Context: "...s the success or failure of the learning algorithm. To this end, there have been several recent approaches that attempt to learn distance functions, e.g., (Weinberger et al., 2005; Xing et al., 2002; Globerson & Roweis, 2005; Shalev-Shwartz et al., 2004). These methods work by exploiting distance information that is intrinsically available in many learning settings. For example, in the problem of semi-supervised clusteri..."

123 | Learning a distance metric from relative comparisons
- Schultz, Joachims
- 2004
Citation Context: "...evant Components Analysis (RCA) (Shental et al., 2002) (similar to discriminant analysis), locally-adaptive discriminative methods (Hastie & Tibshirani, 1996), and learning from relative comparisons (Schutz & Joachims, 2003). Non-Mahalanobis based metric learning methods have also been proposed, though these methods usually suffer from suboptimal performance, non-convexity, or computational complexity. Some example meth..."

95 | Learning a similarity metric discriminatively, with application to face verification
- Chopra, Hadsell, et al.
- 2005
Citation Context: "...nclude neighborhood component analysis (NCA) (Goldberger et al., 2004) that learns a distance metric specifically for nearest-neighbor based classification; convolutional neural net based methods of (Chopra et al., 2005); and a general Riemannian metric learning method (Lebanon, 2006). 3. Problem Formulation Given a set of n points {x1, ..., xn} in ℝ^d, we seek a positive definite matrix A which parameterizes the (..."

78 | Adjustment learning and relevant component analysis
- Shental, Hertz, et al.
Citation Context: "...the authors present methods for learning Mahalanobis metrics, including (Shalev-Shwartz et al., 2004) (online metric learning), Relevant Components Analysis (RCA) (Shental et al., 2002) (similar to discriminant analysis), locally-adaptive discriminative methods (Hastie & Tibshirani, 1996), and learning from relative comparisons (Schutz & Joachims, 2003). Non-Mahalanobis based metri..."

60 | Kernel design using boosting
- Crammer, Keshet, et al.
- 2002
Citation Context: "...formulation considers distances between all pairs of similar and dissimilar points, whereas we consider only a fixed set of input pairwise constrained points. Other notable work includes the articles [2, 5, 7, 8]. Crammer et al. [2] apply boosting to kernel learning; for the connection of our method to kernel learning, see Section 3. Lanckriet et al. [7] study the problem of kernel learning via semidefinite progr..."

54 | Online and batch learning of pseudo-metrics
- Shalev-Shwartz, Singer, et al.
- 2004
Citation Context: "...fficiently, and the constraints are enforced incrementally. Furthermore, as discussed above, by including slacks on our constraints, we can accommodate “soft-margin” constraints. Shalev-Shwartz et al. [9] consider an online metric learning setting, where the interpoint constraints are similar to ours. They also provide a margin interpretation, similar to that of [10]. Their formulation considers dista..."

46 | Matrix exponentiated gradient updates for on-line learning and Bregman projection
- TSUDA, RÄTSCH, et al.
Citation Context: "...min_A Dℓd(A, At) + ηt(dt − d̂t)². To our knowledge, no relative loss bounds have been proven for the above problem. In fact, we know of no existing loss bounds for any LogDet-based online algorithms (Tsuda et al., 2005). We present below a novel algorithm with guaranteed bound on the regret. Our algorithm uses gradient descent, but it adapts the learning rate according to the input data to maintain positive-definit..."

31 | Improved error reporting for software that uses black box components
- Ha, Rossbach, et al.
- 2007
Citation Context: "...points, we compare it to existing state-of-the-art metric learning algorithms. We apply the algorithms to Clarify, a recently developed system that classifies software errors using machine learning (Ha et al., 2007). In this domain, we show that our algorithm effectively learns a metric for the problem of nearest neighbor software support. Furthermore, on standard UCI datasets, we show that our algorithm consis..."

26 | Metric learning for text documents
- Lebanon

22 | Differential entropic clustering of multivariate gaussians
- Davis, Dhillon
- 2007
Citation Context: "...elative entropy between two multivariate Gaussians can be expressed as the convex combination of a Mahalanobis distance between mean vectors and the LogDet divergence between the covariance matrices (Davis & Dhillon, 2006). Assuming the means of the Gaussians to be the same, we have KL(p(x; A0) ‖ p(x; A)) = ½ Dℓd(A0⁻¹, A⁻¹) = ½ Dℓd(A, A0), (4.2) where the second line follows from definition (4.1). The LogDet div..."

1 | Online linear regression using burg entropy
- Jain, Kulis, et al.
- 2007
Citation Context: "...et incurred by an online algorithm is to bound the loss incurred at each step in terms of the loss incurred by the optimal offline solution. Below, we state the result; the full proof can be found in (Jain et al., 2007). Lemma 2 (Loss at one step). at(d̂t − dt)² − bt(d*t − dt)² ≤ Dℓd(A*, At) − Dℓd(A*, At+1), where A* is the optimal offline solution, d*t = dA*(xt, yt), and at, bt are constants s.t. 0 ≤ at ..."

1 | Learning Low-rank Kernels
- Kulis, Sustik, et al.
- 2006
Citation Context: "...roach has several advantages over existing methods. First, we present a natural information-theoretic formulation for the problem. Second, the algorithm utilizes the methods developed by Kulis et al. [6], which do not involve any eigenvector computation; in particular, the running time of our method is faster than most existing techniques. Third, the formulation offers insights into connections betwe..."