## Dimensionality reduction of multimodal labeled data by local Fisher discriminant analysis (2007)

Venue: Journal of Machine Learning Research

Citations: 50 (9 self)

### BibTeX

```bibtex
@article{Sugiyama07dimensionalityreduction,
  author  = {Masashi Sugiyama and Sam Roweis},
  title   = {Dimensionality reduction of multimodal labeled data by local Fisher discriminant analysis},
  journal = {Journal of Machine Learning Research},
  year    = {2007},
  volume  = {8},
  pages   = {1027--1061}
}
```

### Abstract

Reducing the dimensionality of data without losing intrinsic information is an important preprocessing step in high-dimensional data analysis. Fisher discriminant analysis (FDA) is a traditional technique for supervised dimensionality reduction, but it tends to give undesired results if samples in a class are multimodal. An unsupervised dimensionality reduction method called locality-preserving projection (LPP) can work well with multimodal data due to its locality preserving property. However, since LPP does not take the label information into account, it is not necessarily useful in supervised learning scenarios. In this paper, we propose a new linear supervised dimensionality reduction method called local Fisher discriminant analysis (LFDA), which effectively combines the ideas of FDA and LPP. LFDA has an analytic form of the embedding transformation and the solution can be easily computed just by solving a generalized eigenvalue problem. We demonstrate the practical usefulness and high scalability of the LFDA method in data visualization and classification tasks through extensive simulation studies. We also show that LFDA can be extended to non-linear dimensionality reduction scenarios by applying the kernel trick.
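
The abstract's claim that the LFDA solution follows from a single generalized eigenvalue problem can be sketched in code. This is a simplified illustration, not the paper's reference implementation: the function name `lfda` is ours, a fixed-width Gaussian affinity stands in for the paper's local-scaling heuristic, and a small ridge term is assumed to keep the within-class scatter invertible.

```python
import numpy as np
from scipy.linalg import eigh

def lfda(X, y, r=2, sigma=1.0):
    """Simplified LFDA sketch: X is n x d (rows are samples), y holds class
    labels; returns a d x r transformation matrix."""
    n, d = X.shape
    Ww = np.zeros((n, n))          # local within-class weights
    Wb = np.full((n, n), 1.0 / n)  # local between-class weights
    for c in np.unique(y):
        idx = np.where(y == c)[0]
        nc = len(idx)
        Xc = X[idx]
        sq = ((Xc[:, None, :] - Xc[None, :, :]) ** 2).sum(axis=-1)
        A = np.exp(-sq / (2.0 * sigma ** 2))  # affinity among same-class pairs
        Ww[np.ix_(idx, idx)] = A / nc
        Wb[np.ix_(idx, idx)] = A * (1.0 / n - 1.0 / nc)

    def scatter(W):
        # 0.5 * sum_ij W_ij (x_i - x_j)(x_i - x_j)^T  ==  X^T (D - W) X
        D = np.diag(W.sum(axis=1))
        return X.T @ (D - W) @ X

    Sw, Sb = scatter(Ww), scatter(Wb)
    # LFDA transform: top-r generalized eigenvectors of  Sb v = lambda Sw v
    evals, evecs = eigh(Sb, Sw + 1e-6 * np.eye(d))
    return evecs[:, np.argsort(evals)[::-1][:r]]

# usage: Z = X @ lfda(X, y, r=2) embeds the rows of X into 2 dimensions
```

Because the whole transformation comes out of one eigendecomposition, embeddings for every target dimensionality r are available from the same eigenvector matrix, which is the scalability point the abstract makes.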

### Citations

9021 | The Nature of Statistical Learning Theory
- Vapnik
- 1995
Citation Context: ...nd side yields K L̃^(b) K α̃ = λ̃ K L̃^(w) K α̃. (17) This implies that {x_i}_{i=1}^n appear only in terms of their inner products. Therefore, we can obtain a non-linear variant of LFDA by the kernel trick (Vapnik, 1998; Schölkopf et al., 1998), which is explained below. Let us consider a non-linear mapping φ(x) from R^d to a reproducing kernel Hilbert space H (Aronszajn, 1950). Let K(x, x′) be the reproducing kerne...

8142 | Maximum likelihood from incomplete data via the EM algorithm
- Dempster, Laird, et al.
- 1977
Citation Context: ...posed mixture discriminant analysis (MDA), which extends FDA to maximum likelihood estimation of Gaussian mixture distributions. A maximum likelihood solution is obtained by an EM-type algorithm (cf. Dempster et al., 1977). However, this is an iterative algorithm and gives only a local optimal solution. Therefore, the computation of MDA is rather slow and there is no guarantee that the global solution can be obtained...

3703 | Convex optimization
- Boyd, Vandenberghe
- 2004
Citation Context: ...in Figure 1(a). The data set 1’ includes a single outlier. 4.5 Remark on Rank Constraint The optimization problem of MCML (see Eq. 24) is not generally convex since the rank constraint is non-convex (Boyd and Vandenberghe, 2004). The non-convexity induced by the rank constraint seems to be a universal problem in dimensionality reduction. NCA eliminates the rank constraint by decomposing U into TT^⊤ (see Eqs. 21 and 22). Ho...

2871 | UCI repository of machine learning databases
- Blake, Merz
- 1998
Citation Context: ...ction, we numerically evaluate the performance of LFDA and existing methods. 5.1 Exploratory Data Analysis Here we use the Thyroid disease data set available from the UCI machine learning repository (Blake and Merz, 1998) and illustrate how LFDA can be used for exploratory data analysis. The original data consists of 5-dimensional input vector x of the following laboratory tests. 1. T3-resin uptake test. 2. Total Ser...

2663 | Introduction to statistical pattern recognition (2nd ed.)
- Fukunaga
- 1990
Citation Context: ... classification, etc. In this paper, we consider the supervised dimensionality reduction problem, that is, samples are accompanied with class labels. Fisher discriminant analysis (FDA) (Fisher, 1936; Fukunaga, 1990) is a popular method for linear supervised dimensionality reduction.¹ FDA seeks for an embedding transformation such ∗. An efficient MATLAB implementation of local Fisher discriminant analysis is av...

2044 | Online learning with kernels
- Kivinen, Smola, et al.
- 2004
Citation Context: ...eduction methods (e.g., Goldberger et al., 2005; Globerson and Roweis, 2006). Furthermore, LFDA can be naturally extended to nonlinear dimensionality reduction scenarios by applying the kernel trick (Schölkopf and Smola, 2002). The rest of this paper is organized as follows. In Section 2, we formulate the linear dimensionality reduction problem, briefly review FDA and LPP, and illustrate how they typically behave. In Sect...

1930 | Pattern Classification
- Duda, Hart, et al.
- 2001
Citation Context: ...p/˜sugi/software/LFDA/’. 1. FDA may refer to the classification method which first projects data samples onto a one-dimensional subspace and then classifies the samples by thresholding (Fisher, 1936; Duda et al., 2001). The one-dimensional embedding space used here is obtained as the maximizer of the so-called Fisher criterion. This Fisher criterion can be used for dimensionality reduction onto a space with dimens...

1870 | Some methods for classification and analysis of multivariate observations
- MacQueen
- 1967
Citation Context: ...ixture components (clusters) in each class as well as the initial location of cluster centers should be determined by users. For cluster centers, using standard techniques such as k-means clustering (MacQueen, 1967; Everitt et al., 2001) or learning vector quantization (Kohonen, 1989) are recommended. However, they are also iterative algorithms and have no guarantee that the global solution can be obtained. Fur...

1696 | A global geometric framework for nonlinear dimensionality reduction, Science 290 - Tenenbaum, Silva, et al. - 2000

1623 | Nonlinear dimensionality reduction by locally linear embedding, Science 290 - Roweis, Saul - 2000

1309 | Self-Organization and Associative Memory
- Kohonen
- 1989
Citation Context: ...ion of cluster centers should be determined by users. For cluster centers, using standard techniques such as k-means clustering (MacQueen, 1967; Everitt et al., 2001) or learning vector quantization (Kohonen, 1989) are recommended. However, they are also iterative algorithms and have no guarantee that the global solution can be obtained. Furthermore, there seems to be no systematic method for determining the n...

1282 | Spline Models for Observational Data
- Wahba
- 1990
Citation Context: ... to classification tasks, we often want to embed the data samples into spaces with several different dimensions—the best dimensionality is later chosen by, for example, cross-validation (Stone, 1974; Wahba, 1990). In such a scenario, NCA requires to optimize the transformation matrix individually for each dimensionality r of the embedding space. On the other hand, LFDA needs to compute the transformation mat...

1250 | On information and sufficiency
- Kullback, Leibler
- 1951
Citation Context: ...sses are mapped to other locations. In reality, however, any U may not be able to attain p_{i,j}(U) = p*_{i,j} exactly; instead the optimal approximation to p*_{i,j} under the Kullback-Leibler divergence (Kullback and Leibler, 1951) is obtained. This is formally defined as U_MCML ≡ argmin_{U ∈ R^{d×d}} Σ_{i,j=1}^n p*_{i,j} log(p*_{i,j} / p_{i,j}(U)) subject to U ∈ PSD(r), (24) where PSD(r) is the set of all positive semidefinite matrice...

1059 | Nonlinear component analysis as a kernel eigenvalue problem
- Schölkopf, Smola, et al.
- 1998
Citation Context: ... K L̃^(b) K α̃ = λ̃ K L̃^(w) K α̃. (17) This implies that {x_i}_{i=1}^n appear only in terms of their inner products. Therefore, we can obtain a non-linear variant of LFDA by the kernel trick (Vapnik, 1998; Schölkopf et al., 1998), which is explained below. Let us consider a non-linear mapping φ(x) from R^d to a reproducing kernel Hilbert space H (Aronszajn, 1950). Let K(x, x′) be the reproducing kernel of H. A typical choic...

991 | Spectral graph theory
- Chung
- 1997
Citation Context: ...e generalized eigenvalues γ_1 ≥ γ_2 ≥ ··· ≥ γ_d of the following generalized eigenvalue problem: XLX^⊤ψ = γXDX^⊤ψ, where L ≡ D − A. L is called the graph Laplacian matrix in spectral graph theory (Chung, 1997), where A is seen as the adjacency matrix of a graph. He and Niyogi (2004) showed that a solution of Eq. (4) is given by T_LPP = (ψ_d | ψ_{d−1} | ··· | ψ_{d−r+1}). 2.4 Typical Behavior of FDA and LPP Dimens...
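
The LPP eigenproblem quoted in this snippet can be sketched directly. A minimal illustration, not He and Niyogi's code: the helper name `lpp` and the caller-supplied affinity matrix A are our assumptions, and X here is n x d with rows as samples (so the snippet's XLX^⊤ becomes X^⊤LX).

```python
import numpy as np
from scipy.linalg import eigh

def lpp(X, A, r=2):
    """LPP sketch: A is a symmetric nonnegative affinity (adjacency) matrix."""
    D = np.diag(A.sum(axis=1))  # D_ii = sum_j A_ij
    L = D - A                   # graph Laplacian
    # generalized eigenproblem  X^T L X psi = gamma X^T D X psi
    gammas, psis = eigh(X.T @ L @ X, X.T @ D @ X)
    # LPP keeps the eigenvectors with the r smallest eigenvalues
    # (psi_d, ..., psi_{d-r+1} in the snippet's descending ordering)
    return psis[:, np.argsort(gammas)[:r]]
```

Keeping the smallest-eigenvalue directions is what makes nearby pairs in the original space stay close after the projection.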

971 | The Use of Multiple Measurements in Taxonomic Problems
- Fisher
- 1936
Citation Context: ...visualization, classification, etc. In this paper, we consider the supervised dimensionality reduction problem, that is, samples are accompanied with class labels. Fisher discriminant analysis (FDA) (Fisher, 1936; Fukunaga, 1990) is a popular method for linear supervised dimensionality reduction.¹ FDA seeks for an embedding transformation such ∗. An efficient MATLAB implementation of local Fisher discriminan...

786 | Theory of reproducing kernels
- Aronszajn
- 1950
Citation Context: ...linear variant of LFDA by the kernel trick (Vapnik, 1998; Schölkopf et al., 1998), which is explained below. Let us consider a non-linear mapping φ(x) from R^d to a reproducing kernel Hilbert space H (Aronszajn, 1950). Let K(x, x′) be the reproducing kernel of H. A typical choice of the kernel function would be the Gaussian kernel: K(x, x′) = exp(−‖x − x′‖² / (2σ²)), with σ > 0. For other choices, see, for e...
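
The Gaussian kernel quoted in this snippet is easy to evaluate for a whole sample set at once; a minimal NumPy sketch (the helper name `gaussian_kernel` is ours):

```python
import numpy as np

def gaussian_kernel(X, Xp, sigma=1.0):
    """Gram matrix of the Gaussian kernel K(x, x') = exp(-||x - x'||^2 / (2 sigma^2)),
    sigma > 0; X is n x d, Xp is m x d, result is n x m."""
    sq = ((X[:, None, :] - Xp[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq / (2.0 * sigma ** 2))
```

The resulting Gram matrix is all a kernelized method needs, which is why the snippet stresses that the samples appear only through inner products.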

723 | Cross-validatory choice and assessment of statistical predictions
- Stone
- 1974
Citation Context: ...ue is applied to classification tasks, we often want to embed the data samples into spaces with several different dimensions—the best dimensionality is later chosen by, for example, cross-validation (Stone, 1974; Wahba, 1990). In such a scenario, NCA requires to optimize the transformation matrix individually for each dimensionality r of the embedding space. On the other hand, LFDA needs to compute the trans...

473 | Cluster analysis
- Everitt, Landau, et al.
- 2001
Citation Context: ...s (clusters) in each class as well as the initial location of cluster centers should be determined by users. For cluster centers, using standard techniques such as k-means clustering (MacQueen, 1967; Everitt et al., 2001) or learning vector quantization (Kohonen, 1989) are recommended. However, they are also iterative algorithms and have no guarantee that the global solution can be obtained. Furthermore, there seems ...

387 | Text classification using string kernels - Lodhi, Saunders, et al.

343 | Reducing the dimensionality of data with neural networks, Science 313(5786) - Hinton, Salakhutdinov - 2006

333 | Distance Metric Learning for Large Margin Nearest Neighbor Classification
- Weinberger, Blitzer, et al.
- 2005
Citation Context: .... (12). Therefore, the solution T_LFDA is not unique—the range of the transformation H^⊤T^⊤_LFDA is uniquely determined, but the distance metric (Goldberger et al., 2005; Globerson and Roweis, 2006; Weinberger et al., 2006) in the embedding space can be arbitrary because of the arbitrariness of the matrix H. In practice, we propose determining the LFDA transformation matrix T_LFDA as follows. First, we rescale the gene...

304 | Regularized discriminant analysis
- Friedman
- 1989
Citation Context: ...nd we cannot directly solve the generalized eigenvalue problem (17). To cope with this problem, we propose regularizing K L̃^(w) K and solving the following generalized eigenvalue problem instead (cf. Friedman, 1989): K L̃^(b) K α̃ = λ̃ (K L̃^(w) K + εI_n) α̃, (18) where ε is a small constant. Let {α̃_k}_{k=1}^n be the generalized eigenvectors associated with the generalized eigenvalues λ̃_1 ≥ λ̃_2 ≥ ··· ≥ λ̃_n of Eq. (18)...

256 | Soft margins for AdaBoost - Rätsch, Onoda, et al. - 2001

255 | Convolution kernels for natural language
- Collins, Duffy
- 2001
Citation Context: ...this kernelized variant of LFDA kernel LFDA (KLFDA). Recently, kernel functions for non-vectorial structured data such as strings, trees, and graphs have been proposed (see, e.g., Lodhi et al., 2002; Duffy and Collins, 2002; Kashima and Koyanagi, 2002; Kondor and Lafferty, 2002; Kashima et al., 2003; Gärtner et al., 2003; Gärtner, 2003). Since KLFDA uses the samples only via the kernel function K(x, x′), it allows us t...

255 | Think globally, fit locally: unsupervised learning of low dimensional manifolds - Saul, Roweis

244 | Discriminant adaptive nearest neighbor classification
- Hastie, Tibshirani
- 1996
Citation Context: ...s the relation between the proposed LFDA and other methods. 4.1 Dimensionality Reduction Using Local Discriminant Information A discriminant adaptive nearest neighbor (DANN) classifier (Hastie and Tibshirani, 1996a) employs an adapted distance metric at each test point for classification. Based on a similar idea, they also proposed a global supervised dimensionality reduction method using local discriminant in...

219 | Generalized discriminant analysis using a kernel approach - Baudat, Anouar - 2000

211 | Neighbourhood Component Analysis
- Goldberger, Roweis, et al.
- 2005
Citation Context: ...bedding matrix and the solution can be easily computed just by solving a generalized eigenvalue problem. This is an advantage over recently proposed supervised dimensionality reduction methods (e.g., Goldberger et al., 2005; Globerson and Roweis, 2006). Furthermore, LFDA can be naturally extended to nonlinear dimensionality reduction scenarios by applying the kernel trick (Schölkopf and Smola, 2002). The rest of this pa...

211 | Locality preserving projections
- He, Niyogi
- 2004
Citation Context: ...ionality of multimodal data. In order to reduce the dimensionality of multimodal data appropriately, it is important to preserve the local structure of the data. Locality-preserving projection (LPP) (He and Niyogi, 2004) meets this requirement; LPP seeks for an embedding transformation such that nearby data pairs in the original space are kept close in the embedding space. Thus LPP can reduce the dimensionality of multimodal...

193 | Self-tuning spectral clustering - Zelnik-Manor, Perona - 2004

186 | Diffusion kernels on graphs and other discrete input
- Kondor, Lafferty
- 2002
Citation Context: ...Recently, kernel functions for non-vectorial structured data such as strings, trees, and graphs have been proposed (see, e.g., Lodhi et al., 2002; Duffy and Collins, 2002; Kashima and Koyanagi, 2002; Kondor and Lafferty, 2002; Kashima et al., 2003; Gärtner et al., 2003; Gärtner, 2003). Since KLFDA uses the samples only via the kernel function K(x, x′), it allows us to reduce the dimensionality of such non-vectorial data...

151 | Discriminant analysis by Gaussian mixtures
- Hastie, Tibshirani
- 1996
Citation Context: ...s the relation between the proposed LFDA and other methods. 4.1 Dimensionality Reduction Using Local Discriminant Information A discriminant adaptive nearest neighbor (DANN) classifier (Hastie and Tibshirani, 1996a) employs an adapted distance metric at each test point for classification. Based on a similar idea, they also proposed a global supervised dimensionality reduction method using local discriminant in...

144 | Marginalized kernels between labeled graphs
- Kashima, Tsuda, et al.
- 2003
Citation Context: ...for non-vectorial structured data such as strings, trees, and graphs have been proposed (see, e.g., Lodhi et al., 2002; Duffy and Collins, 2002; Kashima and Koyanagi, 2002; Kondor and Lafferty, 2002; Kashima et al., 2003; Gärtner et al., 2003; Gärtner, 2003). Since KLFDA uses the samples only via the kernel function K(x, x′), it allows us to reduce the dimensionality of such non-vectorial data. 4. Comparison with Re...

133 | Metric Learning by Collapsing Classes
- Globerson, Roweis
- 2005
Citation Context: ...lution can be easily computed just by solving a generalized eigenvalue problem. This is an advantage over recently proposed supervised dimensionality reduction methods (e.g., Goldberger et al., 2005; Globerson and Roweis, 2006). Furthermore, LFDA can be naturally extended to nonlinear dimensionality reduction scenarios by applying the kernel trick (Schölkopf and Smola, 2002). The rest of this paper is organized as follows...

126 | Regression and the Moore-Penrose Pseudoinverse
- Albert
- 1972
Citation Context: ...e positive semidefinite. Then the values of the Fisher criterion (3) for v_{i,j} and αv_{i,j} are expressed as tr(W^{−1}B) and tr(W_α^{−1}B_α), respectively. The standard matrix inversion lemma (e.g., Albert, 1972) yields W_α^{−1} = (W + βW^{(w)}_{i,j} v_{i,j}v_{i,j}^⊤)^{−1} = W^{−1} − W^{−1}v_{i,j}(W^{−1}v_{i,j})^⊤ / ((βW^{(w)}_{i,j})^{−1} + ⟨W^{−1}v_{i,j}, v_{i,j}⟩). If y_i = y_j, we have W^{(w)}_{i,j} > 0 and W^{(b)}_{i,...

114 | Kernels for structured data - Gärtner - 2008

111 | A Kernel View of the Dimensionality Reduction of Manifolds, ICML’04
- Ham, Lee, et al.
- 2004
Citation Context: ... element being D_{i,i} ≡ Σ_{j=1}^n A_{i,j}. 3. The matrix D in the constraint (4) is motivated by a geometric argument (Belkin and Niyogi, 2003). However, it is sometimes dropped for the sake of simplicity (Ham et al., 2004). Eq. (4) implies that LPP looks for a transformation matrix T such that nearby data pairs in the original space R^d are kept close in the embedding space. T...

72 | On the Goldstein-Levitin-Polyak gradient projection method
- Bertsekas
- 1976
Citation Context: ...ends to outperform FDA. 6. In our implementation of MCML, we used a constant step size for the gradient descent. The computation time could be improved if, for example, an Armijo-like step size rule (Bertsekas, 1976) is employed. Based on the above simulation results, we conclude that the proposed LFDA is a promising dimensionality reduction technique also in classificatio...

43 | Constructing descriptive and discriminative nonlinear features: Rayleigh coefficients in kernel feature spaces - Mika, Rätsch, et al. - 2003

37 | Euclidean embedding of co-occurrence data
- Globerson, Chechik, et al.
Citation Context: ...sification tasks, and objectively evaluate the effectiveness of LFDA. There are several measures for quantitatively evaluating separability of data samples in different classes (e.g., Fukunaga, 1990; Globerson et al., 2005). Here we use a simple one: misclassification rate by a one-nearest-neighbor classifier. As explained in Section 3.3, the LFDA criterion is invariant under linear transformations, while the misclassi...

22 | Tests of Significance - Henkel - 1979

7 | Comparison of multivariate discriminant techniques for clinical data - application to the thyroid functional state
- Coomans, Broeckaert, et al.
Citation Context: ...er injection of 200 micrograms of thyrotropin-releasing hormone as compared to the basal value. The task is to predict whether patients’ thyroids are euthyroidism, hypothyroidism, or hyperthyroidism (Coomans et al., 1983), that is, whether patients’ thyroids are normal, hypo-functioning, or hyper-functioning (Blake and Merz, 1998). The diagnosis (the class label) is based on a complete medical record, including anamn...

3 | Kernels for semi-structured data
- Kashima, Koyanagi
- 2002
Citation Context: ...f LFDA kernel LFDA (KLFDA). Recently, kernel functions for non-vectorial structured data such as strings, trees, and graphs have been proposed (see, e.g., Lodhi et al., 2002; Duffy and Collins, 2002; Kashima and Koyanagi, 2002; Kondor and Lafferty, 2002; Kashima et al., 2003; Gärtner et al., 2003; Gärtner, 2003). Since KLFDA uses the samples only via the kernel function K(x, x′), it allows us to reduce the dimensionality ...