## Dimensionality Reduction: A Comparative Review (2008)

Citations: 18 (0 self)

### BibTeX

```bibtex
@misc{Maaten08dimensionalityreduction,
  author = {L. J. P. van der Maaten and E. O. Postma and H. J. van den Herik},
  title  = {Dimensionality Reduction: A Comparative Review},
  year   = {2008}
}
```


### Abstract

In recent years, a variety of nonlinear dimensionality reduction techniques have been proposed, many of which rely on the evaluation of local properties of the data. The paper presents a review and systematic comparison of these techniques. Their performance is investigated on artificial and natural tasks. The results of the experiments reveal that the nonlinear techniques perform well on selected artificial tasks, but do not outperform traditional PCA on real-world tasks. The paper explains these results by identifying weaknesses of current nonlinear techniques, and suggests how their performance may be improved.

### Citations

9034 | Maximum likelihood from incomplete data via the EM algorithm
- Dempster, Laird, et al.
- 1977
Citation Context: ...ligning the local linear models in order to obtain the low-dimensional data representation using a variant of LLE. LLC first constructs a mixture of m factor analyzers (MoFA) using the EM algorithm [40, 50, 70]. Alternatively, a mixture of probabilistic PCA model (MoPPCA) could be employed [125]. The local linear models in the mixture are used to construct m data representations z_ij and their corresponding ...

2922 | Introduction to Statistical Pattern Recognition, 2nd Edition
- Fukunaga
- 1990
Citation Context: ...nality that corresponds to the intrinsic dimensionality of the data. The intrinsic dimensionality of data is the minimum number of parameters needed to account for the observed properties of the data [49]. Dimensionality reduction is important in many domains, since it mitigates the curse of dimensionality and other undesired properties of high-dimensional spaces [69]. As a result, dimensionality redu...

2038 | On the evolution of random graphs - Erdös, Rényi - 1960

1608 | A Note on Two Problems in Connexion with Graphs
- Dijkstra
- 1959
Citation Context: ...The shortest path between two points in the graph forms an estimate of the geodesic distance between these two points, and can easily be computed using Dijkstra’s or Floyd’s shortest-path algorithm [41, 47]. The geodesic distances between all datapoints in X are computed, thereby forming a pairwise geodesic distance matrix. The low-dimensional representations y_i of the datapoints x_i in the low-dimension...
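The geodesic-distance estimate described in this context can be sketched in a few lines. This is not the paper's implementation, just a minimal illustration: build a k-nearest-neighbour graph weighted by Euclidean distances, then run Floyd's all-pairs shortest-path algorithm on a toy 1-D dataset.

```python
import numpy as np

def geodesic_distances(X, k=2):
    """Estimate pairwise geodesic distances as shortest paths
    through a k-nearest-neighbour graph (Floyd's algorithm)."""
    n = len(X)
    D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)  # Euclidean distances
    G = np.full((n, n), np.inf)
    np.fill_diagonal(G, 0.0)
    for i in range(n):
        for j in np.argsort(D[i])[1:k + 1]:   # k nearest neighbours of point i
            G[i, j] = G[j, i] = D[i, j]       # symmetric graph edge
    for m in range(n):                        # Floyd's shortest-path update
        G = np.minimum(G, G[:, m:m + 1] + G[m:m + 1, :])
    return G

# Four collinear points: the graph path 0-1-2-3 yields the geodesic 0 -> 3
X = np.array([[0.0], [1.0], [2.0], [3.0]])
G = geodesic_distances(X, k=2)
```

On a curved manifold the graph distances would exceed the straight-line distances; here the data is collinear, so the two coincide and the result is easy to check by hand.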

1092 | The use of multiple measurements in taxonomic problems
- Fisher
- 1936
Citation Context: ...discussed in Section 2. Techniques for Independent Component Analysis [12] are not included in our review, because they were mainly designed for blind-source separation. Linear Discriminant Analysis [46], Generalized Discriminant Analysis [9], and Neighborhood Components Analysis [53, 106], and recently proposed metric learners [32, 8, 51, 137] are not included in the review, because of their supervi...

461 | Laplacian eigenmaps and spectral techniques for embedding and clustering
- Belkin, Niyogi
- 2001
Citation Context: ...ngs of the data manifold in the embedding [52]. 3.2.2 Laplacian Eigenmaps Similar to LLE, Laplacian Eigenmaps find a low-dimensional data representation by preserving local properties of the manifold [10]. In Laplacian Eigenmaps, the local properties are based on the pairwise distances between near neighbors. Laplacian Eigenmaps compute a low-dimensional representation of the data in which the distanc...

444 | Multidimensional Scaling
- Cox, Cox
- 2001
Citation Context: ...x_i and x_j, and the constant in front is added in order to simplify the gradient of the cost function. The minimization of the Sammon cost function is generally performed using a pseudo-Newton method [34]. Sammon mapping is mainly used for visualization purposes [88]. The main weakness of Sammon mapping is that it assigns a much higher weight to retaining a distance of, say, 10^-5 than to retaining a ...
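The Sammon cost described in this context weights each pairwise error by the inverse of the original distance, so small distances dominate. A minimal sketch (not the paper's code; the normalization constant is the sum of original pairwise distances, the standard convention):

```python
import numpy as np

def sammon_stress(D_high, D_low):
    """Sammon's cost over all pairs i < j: squared distance error,
    weighted by 1/d*_ij so small original distances dominate."""
    iu = np.triu_indices_from(D_high, k=1)   # upper triangle, i < j
    d_star, d = D_high[iu], D_low[iu]
    return np.sum((d_star - d) ** 2 / d_star) / np.sum(d_star)

# Pairwise distances of three points on a line
D = np.array([[0.0, 1.0, 2.0],
              [1.0, 0.0, 1.0],
              [2.0, 1.0, 0.0]])
perfect = sammon_stress(D, D)        # distances preserved exactly
shrunk = sammon_stress(D, 0.5 * D)   # all distances halved
```

Because each term is normalized by d*_ij, uniformly shrinking all distances by a factor c gives a stress of (1 - c)^2 regardless of the data, which makes the weighting easy to verify.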

435 | FastMap: A fast algorithm for indexing, data-mining and visualization of traditional and multimedia datasets
- Faloutsos, Lin
- 1995
Citation Context: ...Variable Models [80], principal curves [28], kernel maps [118], conformal eigenmaps [113], Geodesic Nullspace Analysis [24], Structure Preserving Embedding [114], variants of multidimensional scaling [3, 38, 45, 62, 92], techniques that (similarly to LLC and manifold charting) globally align a mixture of linear models [104, 109, 133], and linear variants of LLE [58, 74], Laplacian Eigenmaps [59], and LTSA [147]. Als...

390 | Algorithm 97: Shortest Path
- Floyd
Citation Context: ...The shortest path between two points in the graph forms an estimate of the geodesic distance between these two points, and can easily be computed using Dijkstra’s or Floyd’s shortest-path algorithm [41, 47]. The geodesic distances between all datapoints in X are computed, thereby forming a pairwise geodesic distance matrix. The low-dimensional representations y_i of the datapoints x_i in the low-dimension...

312 | An information maximization approach to blind separation and blind deconvolution
- Bell, Sejnowski
- 1995
Citation Context: ...ality reduction technique with clustering, as a result of which they do not fit in the dimensionality reduction framework that we discussed in Section 2. Techniques for Independent Component Analysis [12] are not included in our review, because they were mainly designed for blind-source separation. Linear Discriminant Analysis [46], Generalized Discriminant Analysis [9], and Neighborhood Components An...

303 | GTM: The generative topographic mapping
- Bishop, Svensen, et al.
- 1998
Citation Context: ...es all main techniques for (nonlinear) dimensionality reduction. However, it is not exhaustive. The comparative review does not include self-organizing maps [73] and their probabilistic extension GTM [19], because these techniques combine a dimensionality reduction technique with clustering, as a result of which they do not fit in the dimensionality reduction framework that we discussed in Section 2. ...

299 | Generalized cross-validation as a method for choosing a good ridge parameter, Technometrics 21(2)
- Golub, Wahba
- 1979
Citation Context: ...for this incapability is that kernel-based methods require the selection of a proper kernel function. In general, model selection in kernel methods is performed using some form of hold-out testing [54], leading to high computational costs. Alternative approaches to model selection for kernel methods are based on, e.g., maximizing the between-class margins or the data variance (as in MVU) using semi...

290 | The principle of minimized iterations in the solution of the matrix eigenvalue problem
- Arnoldi
- 1951
Citation Context: ...ver, for these techniques the n × n matrix is sparse, which is beneficial, because it lowers the computational complexity of the eigenanalysis. Eigenanalysis of a sparse matrix (using Arnoldi methods [5] or Jacobi-Davidson methods [48]) has computational complexity O(pn^2), where p is the ratio of nonzero elements in the sparse matrix to the total number of elements. The memory complexity is O(pn^2...
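Sparse eigenanalysis of this kind is available off the shelf. As an illustration (assuming SciPy, which this page does not mention), ARPACK's implicitly restarted Arnoldi/Lanczos routine extracts a few eigenpairs of a sparse graph Laplacian using only matrix-vector products:

```python
import numpy as np
from scipy.sparse import csr_matrix, diags
from scipy.sparse.linalg import eigsh  # ARPACK: implicitly restarted Arnoldi/Lanczos

# Adjacency matrix of a 4-cycle and its graph Laplacian L = D - A
A = csr_matrix(np.array([[0, 1, 0, 1],
                         [1, 0, 1, 0],
                         [0, 1, 0, 1],
                         [1, 0, 1, 0]], dtype=float))
L = diags(np.asarray(A.sum(axis=1)).ravel()) - A

# Extract one eigenpair without densifying L; the 4-cycle Laplacian
# has spectrum {0, 2, 2, 4}, so the largest eigenvalue is 4.
vals, vecs = eigsh(L, k=1, which='LM')
```

Spectral embedding techniques need the *smallest* Laplacian eigenvalues instead; with ARPACK that is usually done in shift-invert mode (`sigma=...`) rather than `which='SM'`, which converges poorly.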

237 | Generalized discriminant analysis using a kernel approach
- Baudat, Anouar
- 2000
Citation Context: ...Independent Component Analysis [12] are not included in our review, because they were mainly designed for blind-source separation. Linear Discriminant Analysis [46], Generalized Discriminant Analysis [9], and Neighborhood Components Analysis [53, 106], and recently proposed metric learners [32, 8, 51, 137] are not included in the review, because of their supervised nature. Furthermore, our comparativ...

234 | The EM algorithm for mixtures of factor analyzers
- Ghahramani, Hinton
- 1996
Citation Context: ...ligning the local linear models in order to obtain the low-dimensional data representation using a variant of LLE. LLC first constructs a mixture of m factor analyzers (MoFA) using the EM algorithm [40, 50, 70]. Alternatively, a mixture of probabilistic PCA model (MoPPCA) could be employed [125]. The local linear models in the mixture are used to construct m data representations z_ij and their corresponding ...

234 | Locality preserving projections
- He, Niyogi
Citation Context: ...sis of fMRI data [25]. In addition, variants of Laplacian Eigenmaps may be applied to supervised or semi-supervised learning problems [33, 11]. A linear variant of Laplacian Eigenmaps is presented in [59]. In spectral clustering, clustering is performed based on the sign of the coordinates obtained from Laplacian Eigenmaps [93, 116, 140]. 3.2.3 Hessian LLE Hessian LLE (HLLE) [42] is a variant of LLE t...

170 | Charting a manifold - Brand - 2002

168 | Curvilinear component analysis: A self-organizing neural network for nonlinear mapping of data sets
- Demartines, Herault
- 1997
Citation Context: ...ot on retaining the small pairwise distances, which are much more important to the geometry of the data. Several multidimensional scaling variants have been proposed that aim to address this weakness [3, 38, 81, 108, 62, 92, 129]. In this subsection, we discuss one such MDS variant called Sammon mapping [108]. Sammon mapping adapts the classical scaling cost function (see Equation 2) by weighting the contribution of each pair...

161 | Semi-supervised learning on Riemannian manifolds
- Belkin, Niyogi
- 2004
Citation Context: ...successfully applied to, e.g., face recognition [58] and the analysis of fMRI data [25]. In addition, variants of Laplacian Eigenmaps may be applied to supervised or semi-supervised learning problems [33, 11]. A linear variant of Laplacian Eigenmaps is presented in [59]. In spectral clustering, clustering is performed based on the sign of the coordinates obtained from Laplacian Eigenmaps [93, 116, 140]. 3...

149 | Global versus local methods in nonlinear dimensionality reduction
- de Silva, Tenenbaum
- 2003
Citation Context: ...mpose considerable demands on computational resources, as compared to PCA. Attempts to reduce the computational and/or memory complexities of nonlinear techniques have been proposed for, e.g., Isomap [37, 79], MVU [136, 139], and Kernel PCA [124]. 5.3 Out-of-sample Extension An important requirement for dimensionality reduction techniques is the ability to embed new high-dimensional datapoints into an exis...

135 | Introduction to Multivariate Analysis
- Chatfield, Collins
- 1980
Citation Context: ...e eigenvectors of the covariance matrix and the Gram matrix of the high-dimensional data: it can be shown that the eigenvectors u_i and v_i of the matrices X^T X and XX^T are related through sqrt(λ_i) v_i = X u_i [29]. The connection between PCA and classical scaling is described in more detail in, e.g., [143, 99]. PCA may also be viewed upon as a latent variable model called probabilistic PCA [103]. This model us...
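The stated relation between the eigenvectors of X^T X and XX^T is easy to verify numerically. A small sketch (illustrative only, with a randomly generated centred data matrix): if X^T X u = λ u, then v = X u / sqrt(λ) is a unit eigenvector of XX^T with the same eigenvalue.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 3))
X -= X.mean(axis=0)                      # centre the data

lam_c, U = np.linalg.eigh(X.T @ X)       # eigenvectors u_i of X^T X
i = np.argmax(lam_c)                     # pick the top eigenpair
u, lam = U[:, i], lam_c[i]

# sqrt(lam) * v = X u, so v = X u / sqrt(lam) is a unit vector ...
v = (X @ u) / np.sqrt(lam)
# ... and an eigenvector of the Gram matrix X X^T with eigenvalue lam
```

This identity is what lets classical scaling work from the Gram matrix alone, without ever forming the covariance matrix.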

116 | A kernel view of the dimensionality reduction of manifolds
- Ham, Lee, et al.
- 2004
Citation Context: ...aph defined on the data. Third, the spectral techniques Kernel PCA, Isomap, LLE, and Laplacian Eigenmaps can all be viewed upon as special cases of the more general problem of learning eigenfunctions [14, 57]. As a result, Isomap, LLE, and Laplacian Eigenmaps can be considered as special cases of Kernel PCA that use a specific kernel function κ. For instance, this relation is visible in the out-of-sampl...

111 | Non-linear dimensionality reduction
- DeMers, Cottrell
- 1993
Citation Context: ...ing have been reported on, e.g., gene data [44] and geospatial data [119]. 4.2 Multilayer Autoencoders Multilayer autoencoders are feed-forward neural networks with an odd number of hidden layers [39, 63] and shared weights between the top and bottom layers (although asymmetric network structures may be employed as well). The middle hidden layer has d nodes, and the input and the output layer have D n...
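The tied-weight autoencoder structure described in this context can be sketched minimally. The following is an illustrative *linear* autoencoder (not the paper's multilayer model): one hidden layer of d nodes, decoder weights tied to the transpose of the encoder weights, trained by plain gradient descent on the reconstruction error. The learning rate, iteration count, and synthetic rank-d data are arbitrary choices for the demo.

```python
import numpy as np

rng = np.random.default_rng(1)
D, d, n = 4, 2, 200
X = rng.standard_normal((n, d)) @ rng.standard_normal((d, D))  # data on a rank-d subspace

# Tied weights: encoder W (D x d), decoder W^T, middle layer of d nodes
W = np.linalg.qr(rng.standard_normal((D, d)))[0]   # random orthonormal init
for _ in range(2000):
    E = X @ W @ W.T - X                        # reconstruction error
    grad = (X.T @ E @ W + E.T @ X @ W) / n     # gradient of ||E||^2 (up to a factor 2)
    W -= 0.01 * grad

err = np.mean((X @ W @ W.T - X) ** 2)          # mean squared reconstruction error
```

With linear units and tied weights, gradient descent drives W toward the top-d principal subspace, which is why a linear autoencoder recovers (a rotation of) PCA; the nonlinear, multilayer case the survey discusses is what goes beyond PCA.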

109 | Jacobi-Davidson style QR and QZ algorithms for the reduction of matrix pencils
- Fokkema, Sleijpen, van der Vorst
- 1998
Citation Context: ...× n matrix is sparse, which is beneficial, because it lowers the computational complexity of the eigenanalysis. Eigenanalysis of a sparse matrix (using Arnoldi methods [5] or Jacobi-Davidson methods [48]) has computational complexity O(pn^2), where p is the ratio of nonzero elements in the sparse matrix to the total number of elements. The memory complexity is O(pn^2) as well. From the discussion o...

103 | The Numerical Treatment of Integral Equations
- Baker
- 1977
Citation Context: ...on is visible in the out-of-sample extensions of Isomap, LLE, and Laplacian Eigenmaps [17]. The out-of-sample extension for these techniques is performed by means of a so-called Nyström approximation [6, 99], which is known to be equivalent to the Kernel PCA projection (see 5.3 for more details). Laplacian Eigenmaps and Hessian LLE are also intimately related: they only differ in the type of differential...

103 | Out-of-sample extensions for LLE, Isomap, MDS, eigenmaps, and spectral clustering
- Bengio, Paiement, et al.
- 2003
Citation Context: ...can be considered as special cases of Kernel PCA that use a specific kernel function κ. For instance, this relation is visible in the out-of-sample extensions of Isomap, LLE, and Laplacian Eigenmaps [17]. The out-of-sample extension for these techniques is performed by means of a so-called Nyström approximation [6, 99], which is known to be equivalent to the Kernel PCA projection (see 5.3 for more de...
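The Nyström approximation mentioned here extends an eigenvector-based embedding to a new point x by weighting a training eigenvector with the kernel row of x: y(x) = (1/λ) Σ_j κ(x, x_j) v_j (one common scaling convention; the Gaussian kernel and its bandwidth below are assumptions for illustration). For a point already in the training set, the formula reproduces its eigenvector entry exactly, which gives a convenient sanity check:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((20, 3))

# Gaussian kernel matrix on the training data
sq = ((X[:, None] - X[None, :]) ** 2).sum(-1)
K = np.exp(-sq / 2.0)

lam_all, V = np.linalg.eigh(K)
lam, v = lam_all[-1], V[:, -1]          # top eigenpair of K

# Nystrom out-of-sample formula: y(x) = (1/lam) * sum_j kappa(x, x_j) v_j.
# Applied to training point x_0, it must return the eigenvector entry v[0].
k_new = np.exp(-((X - X[0]) ** 2).sum(-1) / 2.0)   # kernel row kappa(x_0, .)
y0 = k_new @ v / lam
```

The check is exact because for a training point the formula is just the eigenvector equation K v = λ v read off row by row; for genuinely new points it is an approximation.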

102 | Hessian eigenmaps: new locally linear embedding techniques for high-dimensional data - Donoho, Grimes - 2003

85 | Scaling learning algorithms towards AI
- Bengio, LeCun
- 2007
Citation Context: ...nts. Second, all sparse spectral dimensionality reduction techniques suffer from the curse of dimensionality of the embedded manifold, i.e., from the curse of the intrinsic dimensionality of the data [16, 136, 15], because the number of datapoints that is required to characterize a manifold properly grows exponentially with the intrinsic dimensionality of the manifold. The susceptibility to the curse of dimens...

75 | Super-resolution through neighbor embedding
- Chang, Yeung, et al.
Citation Context: ...In this formula, I is the n × n identity matrix. The popularity of LLE has led to the proposal of linear variants of the algorithm [58, 74], and to successful applications to, e.g., superresolution [27] and sound source localization [43]. However, there also exist experimental studies that report weak performance of LLE. In [86], LLE was reported to fail in the visualization of even simple synthetic...

66 | The Isomap algorithm and topological stability
- Balasubramanian, Schwartz
Citation Context: ...nsional space Y are computed by applying classical scaling (see 3.1.1) on the resulting pairwise geodesic distance matrix. An important weakness of the Isomap algorithm is its topological instability [7]. Isomap may construct erroneous connections in the neighborhood graph G. Such short-circuiting [82] can severely impair the performance of Isomap. Several approaches have been proposed to overcome ...

66 | Learning eigenfunctions links spectral embedding and kernel PCA. Neural Computation 16(10):2197–2219
- Bengio, Delalleau, et al.
- 2004
Citation Context: ...aph defined on the data. Third, the spectral techniques Kernel PCA, Isomap, LLE, and Laplacian Eigenmaps can all be viewed upon as special cases of the more general problem of learning eigenfunctions [14, 57]. As a result, Isomap, LLE, and Laplacian Eigenmaps can be considered as special cases of Kernel PCA that use a specific kernel function κ. For instance, this relation is visible in the out-of-sampl...

44 | Neighborhood preserving embedding
- He, Cai, et al.
- 2005
Citation Context: ...aph, and equal to the corresponding reconstruction weight otherwise. In this formula, I is the n × n identity matrix. The popularity of LLE has led to the proposal of linear variants of the algorithm [58, 74], and to successful applications to, e.g., superresolution [27] and sound source localization [43]. However, there also exist experimental studies that report weak performance of LLE. In [86], LLE was...

35 | Coloring of DT-MRI fiber traces using Laplacian eigenmaps
- Brun, Park, et al.
- 2003
Citation Context: ...constraint that can easily be cheated on. Despite these weaknesses, Laplacian Eigenmaps (and its variants) have been successfully applied to, e.g., face recognition [58] and the analysis of fMRI data [25]. In addition, variants of Laplacian Eigenmaps may be applied to supervised or semi-supervised learning problems [33, 11]. A linear variant of Laplacian Eigenmaps is presented in [59]. In spectral clu...

26 | Eigenvalues of the Laplacian of a graph - Anderson, Morley - 1985

26 | Kernel matrix completion by semidefinite programming
- Graepel
- 2002
Citation Context: ...putational costs. Alternative approaches to model selection for kernel methods are based on, e.g., maximizing the between-class margins or the data variance (as in MVU) using semidefinite programming [55, 77]. Despite these alternative approaches, the construction of a proper kernel remains an important obstacle for the successful application of Kernel PCA. In addition, depending on the selection of the k...

21 | Non-Local Manifold Tangent Learning
- Bengio, Monperrus
- 2005
Citation Context: ...nts. Second, all sparse spectral dimensionality reduction techniques suffer from the curse of dimensionality of the embedded manifold, i.e., from the curse of the intrinsic dimensionality of the data [16, 136, 15], because the number of datapoints that is required to characterize a manifold properly grows exponentially with the intrinsic dimensionality of the manifold. The susceptibility to the curse of dimens...

19 | Principal Curves for Nonlinear Feature Extraction and Classification
- Chang, Ghosh
- 1998
Citation Context: ...for adaptive neighborhood selection are presented in, e.g., [135, 89, 107]. Furthermore, sparse spectral techniques for dimensionality reduction are sensitive to the presence of outliers in the data [28]. In local techniques for dimensionality reduction, outliers are connected to their k nearest neighbors, even when they are very distant. As a consequence, outliers degrade the performance of local te...

19 | Gender classification of human faces
- Graf, Wichmann
- 2002
Citation Context: ...e literature. On selected datasets, nonlinear techniques for dimensionality reduction outperform linear techniques [94, 123], but nonlinear techniques perform poorly on various other natural datasets [56, 68, 67, 86]. In particular, our results establish three main weaknesses of the popular sparse spectral techniques for dimensionality reduction: (1) flaws in their objective functions, (2) numerical problems in t...

18 | Implementation of a primal-dual method for SDP on a shared memory parallel architecture
- Borchers, Young
Citation Context: ...eigendecomposition of Kernel PCA, MVU solves a semidefinite program (SDP) with nk constraints. Both the computational and the memory complexity of solving an SDP are cubic in the number of constraints [21]. Since there are nk constraints, the computational and memory complexity of the main part of MVU is O((nk)^3). Training an autoencoder using RBM training or backpropagation has a computational compl...

15 | Robust kernel Isomap
- Choi, Choi
Citation Context: ...impair the performance of Isomap. Several approaches have been proposed to overcome the problem of short-circuiting, e.g., by removing datapoints with large total flows in the shortest-path algorithm [31] or by removing nearest neighbors that violate local linearity of the neighborhood graph [111]. A second weakness is that Isomap may suffer from ‘holes’ in the manifold. This problem can be dealt with...

9 | Classification constrained dimensionality reduction
- Costa, Hero
- 2005
Citation Context: ...successfully applied to, e.g., face recognition [58] and the analysis of fMRI data [25]. In addition, variants of Laplacian Eigenmaps may be applied to supervised or semi-supervised learning problems [33, 11]. A linear variant of Laplacian Eigenmaps is presented in [59]. In spectral clustering, clustering is performed based on the sign of the coordinates obtained from Laplacian Eigenmaps [93, 116, 140]. 3...

8 | The use of genetic algorithms and neural networks to approximate missing data in database
- Abdella, Marwala
Citation Context: ...ir training may be tedious, although this weakness is (partially) addressed by recent advances in deep learning. Autoencoders have successfully been applied to problems such as missing data imputation [1] and HIV analysis [18]. [Figure 2: Schematic structure of an autoencoder.] ...

8 |
Stochastic proximity embedding
- Agrafiotis
Citation Context: ...ot on retaining the small pairwise distances, which are much more important to the geometry of the data. Several multidimensional scaling variants have been proposed that aim to address this weakness [3, 38, 81, 108, 62, 92, 129]. In this subsection, we discuss one such MDS variant called Sammon mapping [108]. Sammon mapping adapts the classical scaling cost function (see Equation 2) by weighting the contribution of each pair...

8 | The manifolds of spatial hearing
- Duraiswami, Raykar
- 2005
Citation Context: ...dentity matrix. The popularity of LLE has led to the proposal of linear variants of the algorithm [58, 74], and to successful applications to, e.g., superresolution [27] and sound source localization [43]. However, there also exist experimental studies that report weak performance of LLE. In [86], LLE was reported to fail in the visualization of even simple synthetic biomedical datasets. In [68], it i...

6 | From subspaces to submanifolds
- Brand
- 2004
Citation Context: ...est eigenvalues) may be explained by the difficulty of solving eigenproblems. Fourth, local properties of a manifold do not necessarily follow the global structure of the manifold (as noted in, e.g., [104, 24]) in the presence of noise around the manifold. In other words, sparse spectral techniques suffer from overfitting on the manifold. Moreover, sparse spectral techniques suffer from folding [23]. Fo...

5 | Geometric Methods for Feature Selection and Dimensional Reduction: A Guided Tour, in Data Mining and Knowledge Discovery Handbook: A Complete Guide for Practitioners and Researchers
- Burges
- 2005
Citation Context: ...near techniques cannot adequately handle complex nonlinear data. In the last decade, a large number of nonlinear techniques for dimensionality reduction have been proposed. See for an overview, e.g., [26, 110, 83, 131]. In contrast to the traditional linear techniques, the nonlinear techniques have the ability to deal with complex nonlinear data. In particular for real-world data, the nonlinear dimensionality reduct...

5 | Robust subspace mixture models using t-distributions
- de Ridder, Franc
- 2003
Citation Context: ...lignment of linear models (such as LLC), the sensitivity to the presence of outliers may be addressed by replacing the mixture of factor analyzers by a mixture of t-distributed subspaces (MoTS) model [36, 35]. The intuition behind the use of the MoTS model is that a t-distribution is less sensitive to outliers than a Gaussian (which tends to overestimate variances) because it is heavier-tailed. For autoen...

2 | Gram-Schmidt Orthogonalization
- Arfken
- 1985
Citation Context: ...ix Z_i is formed that contains (in the columns) all cross products of M up to the dth order (including a column with ones). The matrix Z_i is orthonormalized by applying Gram-Schmidt orthonormalization [2]. The estimation of the tangent Hessian H_i is now given by the transpose of the last d(d+1)/2 columns of the matrix Z_i. Using the Hessian estimators in local tangent coordinates, a matrix H is constru...
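Classical Gram-Schmidt orthonormalization, as used in the HLLE step described here, can be sketched directly (illustrative only; in practice a QR decomposition or modified Gram-Schmidt is numerically safer):

```python
import numpy as np

def gram_schmidt(Z):
    """Orthonormalise the columns of Z by classical Gram-Schmidt:
    subtract the projections onto earlier columns, then normalise."""
    Q = np.zeros_like(Z, dtype=float)
    for j in range(Z.shape[1]):
        q = Z[:, j].astype(float)
        for i in range(j):
            q = q - (Q[:, i] @ Z[:, j]) * Q[:, i]   # remove component along Q[:, i]
        Q[:, j] = q / np.linalg.norm(q)             # assumes columns are independent
    return Q

Z = np.array([[1.0, 1.0],
              [0.0, 1.0],
              [1.0, 0.0]])
Q = gram_schmidt(Z)   # columns of Q are orthonormal, spanning the same subspace
```

Classical Gram-Schmidt loses orthogonality for ill-conditioned inputs; `np.linalg.qr` computes the same orthonormal basis via Householder reflections and is the usual choice in numerical code.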

2 |
Autoencoder networks for HIV classification
- Betechuoh, Marwala, et al.
Citation Context: ...dious, although this weakness is (partially) addressed by recent advances in deep learning. Autoencoders have successfully been applied to problems such as missing data imputation [1] and HIV analysis [18]. [Figure 2: Schematic structure of an autoencoder.] 4.3 LLC Locally Linear...

1 | Model selection in kernel methods based on a spectral analysis of label information - Braun, Lange, et al. - 2006