## Distance metric learning for large margin nearest neighbor classification (2006)

Venue: NIPS

Citations: 391 (10 self)

### BibTeX

@INPROCEEDINGS{Weinberger06distancemetric,

author = {Kilian Q. Weinberger and John Blitzer and Lawrence K. Saul},

title = {Distance metric learning for large margin nearest neighbor classification},

booktitle = {NIPS},

year = {2006},

publisher = {MIT Press}

}

### Abstract

We show how to learn a Mahalanobis distance metric for k-nearest neighbor (kNN) classification by semidefinite programming. The metric is trained with the goal that the k-nearest neighbors always belong to the same class while examples from different classes are separated by a large margin. On seven data sets of varying size and difficulty, we find that metrics trained in this way lead to significant improvements in kNN classification, for example achieving a test error rate of 1.3% on the MNIST handwritten digits. As in support vector machines (SVMs), the learning problem reduces to a convex optimization based on the hinge loss. Unlike learning in SVMs, however, our framework requires no modification or extension for problems in multiway (as opposed to binary) classification.
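The objective described in the abstract can be illustrated with a small sketch. The code below is not the authors' implementation and does not solve the semidefinite program; it only evaluates, for a fixed linear map `L`, a cost of the form commonly associated with this method: a "pull" term over each input's target neighbors plus a hinge-loss "push" term for differently labeled inputs that come within one unit of margin. The function names, the weight `c`, the `targets` dictionary, and the toy data are all assumptions made for illustration.

```python
def mahalanobis_sq(L, a, b):
    """Squared distance ||L(a - b)||^2, i.e. (a-b)^T M (a-b) with M = L^T L."""
    diff = [ai - bi for ai, bi in zip(a, b)]
    proj = [sum(row[j] * diff[j] for j in range(len(diff))) for row in L]
    return sum(p * p for p in proj)


def lmnn_loss(L, X, y, targets, c=1.0):
    """Evaluate a large-margin kNN cost for a fixed linear map L.

    targets maps an index i to the indices of its target neighbors
    (same-label points that should end up closest to i).
    """
    pull, push = 0.0, 0.0
    for i, neighbors in targets.items():
        for j in neighbors:
            d_ij = mahalanobis_sq(L, X[i], X[j])
            pull += d_ij  # draw target neighbors closer
            for l in range(len(X)):
                if y[l] != y[i]:
                    # hinge: penalize impostors inside the unit margin
                    d_il = mahalanobis_sq(L, X[i], X[l])
                    push += max(0.0, 1.0 + d_ij - d_il)
    return pull + c * push


# Toy data (hypothetical): the classes differ only in the first coordinate,
# while the second coordinate is large-variance noise.
X = [[0.0, 0.0], [0.0, 2.0], [1.0, 0.0], [1.0, 2.0]]
y = [0, 0, 1, 1]
targets = {0: [1], 1: [0], 2: [3], 3: [2]}

identity = [[1.0, 0.0], [0.0, 1.0]]
rescaled = [[2.0, 0.0], [0.0, 0.5]]  # stretch the useful axis, shrink the noisy one

print(lmnn_loss(identity, X, y, targets))  # 32.0
print(lmnn_loss(rescaled, X, y, targets))  # 4.0
```

In this toy case the rescaled map removes every margin violation, so the cost drops from 32.0 to 4.0. A full treatment would optimize over the positive semidefinite matrix M = LᵀL rather than compare two hand-picked maps.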

### Citations

4347 | Convex optimization
- Boyd, Vandenberghe
- 2004
Citation Context ...s they may not be reproducible across different problems and applications. We can overcome these difficulties by reformulating the optimization of Eq. (13) as an instance of semidefinite programming (Boyd and Vandenberghe, 2004). A semidefinite program (SDP) is a linear program that incorporates an additional constraint on a symmetric matrix whose elements are linear in the unknown variables. This additional constraint requ...

3161 | Eigenfaces for recognition
- Turk, Pentland
- 1991
Citation Context ...&T face recognition data set contains 400 grayscale images of 40 individuals in 10 different poses. We downsampled the images from to 38 × 31 pixels and used PCA to obtain 30-dimensional eigenfaces [15]. Training and test sets were created by randomly sampling 7 images of each person for training and 3 images for testing. The task involved 40-way classification, essentially recognizing a face from a...

2901 | Normalized Cuts and Image Segmentation
- Shi, Malik
- 1997
Citation Context ...in learning these metrics specifically to maximize the margin of correct kNN classification. As a first step, we partition the training data into disjoint clusters using k-means, spectral clustering (Shi and Malik, 2000), or label information. (In our experience, the latter seems to work best.) We then learn a Mahalanobis distance metric for each cluster. While the training procedure couples the distance metrics in ...

2420 | Principal Component Analysis
- Jolliffe
- 1986
Citation Context ...er et al., 2001; Schölkopf et al., 1998; Tsang et al., 2005), though we do not discuss such formulations here. 2.2.1 PRINCIPAL COMPONENT ANALYSIS We briefly review principal component analysis (PCA) (Jolliffe, 1986) in the context of distance metric learning. Essentially, PCA computes the linear transformation $\vec{x}_i \to L\vec{x}_i$ that projects the training inputs $\{\vec{x}_i\}_{i=1}^{n}$ into a variance-maximizing subspace. The vari...

1401 | Shape matching and object recognition using shape context
- Belongie, Malik, et al.
- 2002
Citation Context ...s for pattern classification. Nevertheless, it often yields competitive results, and in certain domains, when cleverly combined with prior knowledge, it has significantly advanced the state-of-the-art [1, 14]. The kNN rule classifies each unlabeled example by the majority label among its k-nearest neighbors in the training set. Its performance thus depends crucially on the distance metric used to identify...

1207 | Nonlinear component analysis as a kernel eigenvalue problem
- Schölkopf, Smola, et al.
- 1998
Citation Context ... the way that they use labeled or unlabeled data to derive linear transformations of the input space. These methods can also be “kernelized” to work in a nonlinear feature space (Müller et al., 2001; Schölkopf et al., 1998; Tsang et al., 2005), though we do not discuss such formulations here. 2.2.1 PRINCIPAL COMPONENT ANALYSIS We briefly review principal component analysis (PCA) (Jolliffe, 1986) in the context of dista...

1142 | The use of multiple measurements in taxonomic problems
- Fisher
- 1936
Citation Context ...or simply by re-ordering the input coordinates in terms of their variance (as discussed further in section 6). 2.2.2 LINEAR DISCRIMINANT ANALYSIS We briefly review linear discriminant analysis (LDA) (Fisher, 1936) in the context of distance metric learning. Let $\Omega_c$ denote the set of indices of examples in the $c$th class (with $y_i = c$). Essentially, LDA computes the linear projection $\vec{x}_i \to L\vec{x}_i$ that maximizes the ...

1012 | Nearest neighbor pattern classification
- Cover, Hart
- 1967
Citation Context ...Unlike learning in SVMs, however, our framework requires no modification or extension for problems in multiway (as opposed to binary) classification. 1 Introduction The k-nearest neighbors (kNN) rule [3] is one of the oldest and simplest methods for pattern classification. Nevertheless, it often yields competitive results, and in certain domains, when cleverly combined with prior knowledge, it has si...

860 | Semidefinite programming
- Vandenberghe, Boyd
- 1996
Citation Context ...is only triggered by differently labeled examples that invade each other’s neighborhoods. Convex optimization We can reformulate the optimization of eq. (2) as an instance of semidefinite programming [16]. A semidefinite program (SDP) is a linear program with the additional constraint that a matrix whose elements are linear in the unknown variables is required to be positive semidefinite. SDPs are con...

657 | An algorithm for finding best matches in logarithmic expected time
- Friedman, Bentley, et al.
- 1977
Citation Context ... can be used for both faster training and testing of LMNN classifiers. 6.1 Review of Ball Trees Several authors have proposed tree-based data structures to speed up kNN search. Examples are kd-trees (Friedman et al., 1977), ball trees (Liu et al., 2005; Omohundro, 1987) and cover-trees (Beygelzimer et al., 2006). All these data structures exploit the same idea: to partition the input space data into hierarchically nes...

624 | Solving multiclass learning problems via error-correcting output codes
- Dietterich, Bakiri
- 1995
Citation Context ...iri report test error rates of 4.2% using nonlinear backpropagation networks with 26 output units (one per class) and 3.3% using nonlinear backpropagation networks with a 30-bit error correcting code [5]. LMNN with energy-based classification obtains a test error rate of 3.7%. Text categorization The 20-newsgroups data set consists of posted articles from 20 newsgroups, with roughly 1000 articles per...

600 | Learning the kernel matrix with semidefinite programming
- Lanckriet, Cristianini, et al.
- 2004

573 | Distance metric learning, with application to clustering with side-information
- Xing, Ng, et al.
- 2002
Citation Context ...es are separated by a large margin. Our goal for metric learning differs in a crucial way from those of previous approaches that minimize the pairwise distances between all similarly labeled examples [12, 13, 17]. This latter objective is far more difficult to achieve and does not leverage the full power of kNN classification, whose accuracy does not require that all similarly labeled inputs be tightly cluste...

484 | Learning with Kernels: Support Vector
- Schölkopf, Smola
- 2001
Citation Context ...nduced by this cost function are illustrated in Fig. 1 for an input with k = 3 target neighbors. Parallels with SVMs The competing terms in eq. (2) are analogous to those in the cost function for SVMs [11]. In both cost functions, one term penalizes the norm of the “parame... [Figure 1: BEFORE and AFTER panels showing an input $\vec{x}_i$, its target neighbors, the local neighborhood, and the margin; legend entries: similarly labeled, differently labeled]

444 | An introduction to kernel-based learning algorithms
- Müller, Mika, et al.

403 | On the algorithmic implementation of multiclass kernel-based vector machines
- Crammer, Singer
Citation Context ...nary) classification. Extensions of SVMs to multiclass problems typically involve combining the results of many binary classifiers, or they require additional machinery that is elegant but nontrivial [4]. In both cases the training time scales at least linearly in the number of classes. By contrast, our learning problem has no explicit dependence on the number of classes. 2 Model Let $\{(\vec{x}_i, y_i)\}_{i=\ldots}^{n}$

265 | Discriminant Adaptive Nearest Neighbor Classification
- Hastie, Tibshirani
- 1996
Citation Context ...use the same distance metric for face recognition as for gender identification, even if in both tasks, distances are computed between the same fixed-size images. In fact, as shown by many researchers [2, 6, 7, 8, 12, 13], kNN classification can be significantly improved by learning a distance metric from labeled examples. Even a simple (global) linear transformation of input features has been shown to yield much bett...

248 | Efficient Pattern Recognition Using a New Transformation Distance
- Simard, LeCun, et al.
- 1993
Citation Context ...s for pattern classification. Nevertheless, it often yields competitive results, and in certain domains, when cleverly combined with prior knowledge, it has significantly advanced the state-of-the-art [1, 14]. The kNN rule classifies each unlabeled example by the majority label among its k-nearest neighbors in the training set. Its performance thus depends crucially on the distance metric used to identify...

233 | Neighbourhood components analysis
- Goldberger, Roweis, et al.
- 2004
Citation Context ...use the same distance metric for face recognition as for gender identification, even if in both tasks, distances are computed between the same fixed-size images. In fact, as shown by many researchers [2, 6, 7, 8, 12, 13], kNN classification can be significantly improved by learning a distance metric from labeled examples. Even a simple (global) linear transformation of input features has been shown to yield much bett...

201 | Integrating constraints and metric learning in semi-supervised clustering
- Bilenko, Basu, et al.
- 2004
Citation Context ...stance metrics for kNN classification is at least a decade old (Hastie and Tibshirani, 1996). It has also been explored more recently in the context of metric learning for semi-supervised clustering (Bilenko et al., 2004). The novelty of our approach lies in learning these metrics specifically to maximize the margin of correct kNN classification. As a first step, we partition the training data into disjoint clusters ...

170 | Learning the discriminative power-invariance trade-off
- Varma, Ray
- 2007
Citation Context ...) to work in a nonlinear feature space, as opposed to the original input space. The idea of learning a kernel matrix has been explored in other contexts (Kwok and Tsang, 2003; Lanckriet et al., 2004; Varma and Ray, 2007), particularly large margin classification by support vector machines. This idea for LMNN has been investigated in detail by Torresani and Lee (2007). The “kernel trick” is used to map the inputs $\vec{x}_i$...

167 | Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering. http://www.cs.cmu.edu/~mccallum/bow
- McCallum
- 1996
Citation Context ...h roughly 1000 articles per newsgroup. We used the 18828-version of the data set, which has crosspostings removed and some headers stripped out. We tokenized the newsgroups using the rainbow package [10]. Each article was initially represented by the weighted word-counts of the 20,000 most common words. We then reduced the dimensionality by projecting the data onto its leading 200 principal component...

157 | Cover trees for nearest neighbor
- Beygelzimer, Kakade, et al.
- 2006
Citation Context ...up kNN search is to build a sophisticated tree-based data structure for storing training examples. Such a data structure can reduce the nearest neighbor test time complexity in practice to O(d log n) (Beygelzimer et al., 2006). This latter method works best for low dimensional data. Fig. 10 compares a baseline implementation of kNN search versus one based on ball trees (Liu et al., 2005; Omohundro, 1987). Note how the spe...

152 | Metric learning by collapsing classes
- Globerson, Roweis
- 2006
Citation Context ...l minimum can be efficiently computed. There have been other studies in distance metric learning based on eigenvalue problems (Shental et al., 2002; De Bie et al., 2003) and semidefinite programming (Globerson and Roweis, 2006; Shalev-Shwartz et al., 2004; Xing et al., 2002). These previous approaches, however, essentially attempt to learn distance metrics that cluster together all similarly labeled inputs, even those that...

124 | Learning a Mahalanobis metric from equivalence constraints
- Bar-Hillel, Hertz, et al.

123 | An elementary proof of the Johnson-Lindenstrauss lemma
- Dasgupta, Gupta
- 1999
Citation Context ...eline implementation of kNN search. Generally there are two major approaches to gain additional speed-ups. The first approach is to reduce the input dimensionality d. The Johnson-Lindenstrauss Lemma (Dasgupta and Gupta, 1999) states that n points can be mapped into a space of dimensionality $O(\log(n)/\varepsilon^2)$ such that the distances between any two points change only by a factor of $(1 \pm \varepsilon)$. Thus we can often reduce the dime...

98 | Comparison of learning algorithms for handwritten digit recognition
- LeCun, Jackel, et al.
- 1995
Citation Context ...SVMs using linear and RBF kernels; Fig. 2 reports the results of the better classifier. On MNIST, we used a non-homogeneous polynomial kernel of degree four, which gave us our best results. (See also [9].) 1 A great speedup can be achieved by solving an SDP that only monitors a fraction of the margin conditions, then using the resulting solution as a starting point for the actual SDP of interest. 2 A...

88 | An investigation of practical approximate nearest neighbor algorithms
- Liu, Moore, et al.
- 2005
Citation Context ...y in practice to O(d log n) (Beygelzimer et al., 2006). This latter method works best for low dimensional data. Fig. 10 compares a baseline implementation of kNN search versus one based on ball trees (Liu et al., 2005; Omohundro, 1987). Note how the speed-up from the ball trees is magnified by dimensionality reduction of the inputs. [Figure 10: axis label “Relative Speedup”; panel title truncated at “3-NN C...”]

87 | Efficient algorithms with neural network behavior
- Omohundro
- 1987
Citation Context ...(d log n) (Beygelzimer et al., 2006). This latter method works best for low dimensional data. Fig. 10 compares a baseline implementation of kNN search versus one based on ball trees (Liu et al., 2005; Omohundro, 1987). Note how the speed-up from the ball trees is magnified by dimensionality reduction of the inputs. [Figure 10: axis label “Relative Speedup”; panel title truncated at “3-NN Classification wit...”]

82 | Adjustment learning and relevant component analysis
- Shental, Hertz, et al.
- 2002
Citation Context ...use the same distance metric for face recognition as for gender identification, even if in both tasks, distances are computed between the same fixed-size images. In fact, as shown by many researchers [2, 6, 7, 8, 12, 13], kNN classification can be significantly improved by learning a distance metric from labeled examples. Even a simple (global) linear transformation of input features has been shown to yield much bett...

64 | Online and batch learning of pseudo-metrics
- Shalev-Shwartz, Singer, et al.
- 2004

56 | Learning with Idealized Kernel
- Kwok, Tsang
- 2003
Citation Context ...sing kernel methods (Schölkopf and Smola, 2002) to work in a nonlinear feature space, as opposed to the original input space. The idea of learning a kernel matrix has been explored in other contexts (Kwok and Tsang, 2003; Lanckriet et al., 2004; Varma and Ray, 2007), particularly large margin classification by support vector machines. This idea for LMNN has been investigated in detail by Torresani and Lee (2007). The...

51 | Fast solvers and efficient implementations for distance metric learning
- Weinberger, Saul
- 2008
Citation Context ... distance metric for kNN classification. The algorithm that we propose was described at a high level in earlier work (Weinberger et al., 2006) and later extended in terms of scalability and accuracy (Weinberger and Saul, 2008). Intuitively, the algorithm is based on the simple observation that the kNN decision rule will correctly classify an example if its k-nearest neighbors share the same label. The algorithm attempts t...

34 | Large margin component analysis
- Torresani, Lee
- 2007
Citation Context ... rectangular of size r × d, where r is the desired output dimensionality (presumed to be much smaller than the input dimensionality, d). The optimization in terms of L is not convex, but in practice (Torresani and Lee, 2007), it does not appear to suffer from very poor local minima. In the following section, we use and compare both these methods to build efficient tree data structures for LMNN classification. 6. Metric ...

15 | Kernel relevant component analysis for distance metric learning
- Tsang, Cheung, et al.
Citation Context ...abeled or unlabeled data to derive linear transformations of the input space. These methods can also be “kernelized” to work in a nonlinear feature space (Müller et al., 2001; Schölkopf et al., 1998; Tsang et al., 2005), though we do not discuss such formulations here. 2.2.1 PRINCIPAL COMPONENT ANALYSIS We briefly review principal component analysis (PCA) (Jolliffe, 1986) in the context of distance metric learning....

7 | Large margin nearest neighbor classifiers
- Domeniconi, Gunopulos, et al.

6 | Learning a similarity metric discriminatively, with application to face verification
- Chopra, Hadsell, et al.
- 2005