## Dimensionality Reduction: A Comparative Review (2008)

Citations: 16 (0 self)

### BibTeX

@MISC{Maaten08dimensionalityreduction,
  author = {L.J.P. van der Maaten and E. O. Postma and H. J. van den Herik},
  title = {Dimensionality Reduction: A Comparative Review},
  year = {2008}
}

### Abstract

In recent years, a variety of nonlinear dimensionality reduction techniques have been proposed, many of which rely on the evaluation of local properties of the data. The paper presents a review and systematic comparison of these techniques. The performances of the techniques are investigated on artificial and natural tasks. The results of the experiments reveal that nonlinear techniques perform well on selected artificial tasks, but do not outperform the traditional PCA on real-world tasks. The paper explains these results by identifying weaknesses of current nonlinear techniques, and suggests how the performance of nonlinear dimensionality reduction techniques may be improved.

### Citations

8074 | Maximum likelihood from incomplete data via the EM algorithm
- Dempster, Laird, et al.
- 1977
Citation Context: ...aligning the local linear models in order to obtain the low-dimensional data representation using a variant of LLE. LLC first constructs a mixture of m factor analyzers (MoFA) using the EM algorithm [40, 50, 70]. Alternatively, a mixture of probabilistic PCA models (MoPPCA) could be employed [125]. The local linear models in the mixture are used to construct m data representations zij and their corresponding ...

3661 | Convex optimization
- Boyd, Vandenberghe
Citation Context: ...Convex Techniques for Dimensionality Reduction. Convex techniques for dimensionality reduction optimize an objective function that does not contain any local optima, i.e., the solution space is convex [22]. Most of the selected dimensionality reduction techniques fall in the class of convex techniques. In these techniques, the objective function usually has the form of a (generalized) Rayleigh quotient ...
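The excerpt above notes that many convex dimensionality reduction objectives are (generalized) Rayleigh quotients. A minimal sketch of that connection, using illustrative random matrices (A, B, and their sizes are assumptions, not from the paper): the maximizer of v'Av / v'Bv is the top eigenvector of the generalized eigenproblem Av = λBv.

```python
import numpy as np
from scipy.linalg import eigh

# Illustrative sketch: maximizing a generalized Rayleigh quotient
# max_v (v' A v) / (v' B v) by solving the generalized eigenproblem
# A v = lambda B v with a symmetric A and a positive-definite B.
rng = np.random.default_rng(0)
M = rng.standard_normal((5, 5))
A = M @ M.T                              # symmetric PSD "objective" matrix
B = np.eye(5) + 0.1 * np.ones((5, 5))    # symmetric PD "constraint" matrix

eigvals, eigvecs = eigh(A, B)            # eigenvalues in ascending order
v = eigvecs[:, -1]                       # maximizer of the Rayleigh quotient

# The quotient at the top eigenvector equals the largest eigenvalue.
quotient = (v @ A @ v) / (v @ B @ v)
assert np.isclose(quotient, eigvals[-1])
```

Spectral techniques such as PCA and Laplacian Eigenmaps differ mainly in which matrices play the roles of A and B in this quotient.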

2644 | Introduction to Statistical Pattern Recognition
- Fukunaga
- 1972
Citation Context: ...dimensionality that corresponds to the intrinsic dimensionality of the data. The intrinsic dimensionality of data is the minimum number of parameters needed to account for the observed properties of the data [49]. Dimensionality reduction is important in many domains, since it mitigates the curse of dimensionality and other undesired properties of high-dimensional spaces [69]. As a result, dimensionality reduction ...

2354 | Latent Dirichlet allocation
- Blei, Ng, et al.
- 2003
Citation Context: ...variants of LLE [58, 74], Laplacian Eigenmaps [59], and LTSA [147]. Also, our review does not cover latent variable models that are tailored to a specific type of data such as Latent Dirichlet Allocation [20]. B Details of the Artificial Datasets. In this appendix, we present the equations that we used to generate the five artificial datasets. Suppose we have two random numbers pi and qi that were sampled ...

1839 | Random graphs - Erdős, Rényi - 1954

1610 | Nonlinear dimensionality reduction by locally linear embedding - Roweis, Saul - 2000

1427 | A note on two problems in connexion with graphs
- Dijkstra
- 1959
Citation Context: ...The shortest path between two points in the graph forms an estimate of the geodesic distance between these two points, and can easily be computed using Dijkstra's or Floyd's shortest-path algorithm [41, 47]. The geodesic distances between all datapoints in X are computed, thereby forming a pairwise geodesic distance matrix. The low-dimensional representations yi of the datapoints xi in the low-dimensional ...
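The geodesic-distance step described above (the core of Isomap) can be sketched as follows. The dataset, the neighborhood size k, and the Euclidean edge weights are illustrative assumptions; shortest paths over a k-nearest-neighbor graph approximate geodesic distances along the manifold.

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path
from scipy.spatial.distance import squareform, pdist

# Sketch of geodesic distance estimation over a kNN graph (assumed
# parameters: k = 6 neighbors, Euclidean edge weights, toy random data).
rng = np.random.default_rng(1)
X = rng.standard_normal((50, 3))          # 50 points in 3-D
D = squareform(pdist(X))                  # pairwise Euclidean distances

k = 6
W = np.full_like(D, np.inf)               # inf marks "no edge"
idx = np.argsort(D, axis=1)[:, 1:k + 1]   # k nearest neighbors (skip self)
rows = np.repeat(np.arange(len(X)), k)
W[rows, idx.ravel()] = D[rows, idx.ravel()]

# Dijkstra's algorithm on the symmetrized kNN graph gives the pairwise
# geodesic distance matrix used by Isomap.
G = shortest_path(np.minimum(W, W.T), method="D", directed=False)
assert G.shape == (50, 50)
```

Floyd's algorithm (`method="FW"`) computes the same matrix but scales worse for sparse graphs, which is why Dijkstra is usually preferred here.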

1305 | Self-Organization and Associative Memory - Kohonen - 1984

1093 | On spectral clustering: Analysis and an algorithm - Ng, Jordan, et al. - 2002

963 | The Use of Multiple Measurement in Taxonomic Problems
- Fisher
- 1936
Citation Context: ...discussed in Section 2. Techniques for Independent Component Analysis [12] are not included in our review, because they were mainly designed for blind-source separation. Linear Discriminant Analysis [46], Generalized Discriminant Analysis [9], Neighborhood Components Analysis [53, 106], and recently proposed metric learners [32, 8, 51, 137] are not included in the review, because of their supervised ...

545 | Learning the kernel matrix with semidefinite programming
- Lanckriet, Cristianini, et al.
Citation Context: ...computational costs. Alternative approaches to model selection for kernel methods are based on, e.g., maximizing the between-class margins or the data variance (as in MVU) using semidefinite programming [55, 77]. Despite these alternative approaches, the construction of a proper kernel remains an important obstacle for the successful application of Kernel PCA. In addition, depending on the selection of the kernel ...

521 | Analysis of a complex of statistical variables into principal components
- Hotelling
- 1933
Citation Context: ...Kernel PCA, (4) Maximum Variance Unfolding, and (5) diffusion maps. The techniques are discussed separately in subsections 3.1.1 to 3.1.5. 3.1.1 PCA / Classical Scaling. Principal Components Analysis (PCA) [98, 65] is a linear technique for dimensionality reduction, which means that it performs dimensionality reduction by embedding the data into a linear subspace of lower dimensionality. Although there exist va...
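The PCA description above can be sketched in a few lines: center the data, eigendecompose the sample covariance matrix, and project onto the top-d eigenvectors. The toy data and target dimensionality d = 2 are illustrative assumptions.

```python
import numpy as np

# Minimal PCA sketch: embed data into the d-dimensional linear subspace
# spanned by the top eigenvectors of the sample covariance matrix.
rng = np.random.default_rng(2)
X = rng.standard_normal((200, 10)) @ rng.standard_normal((10, 10))

Xc = X - X.mean(axis=0)                    # center the data
C = Xc.T @ Xc / (len(X) - 1)               # sample covariance (10 x 10)
eigvals, eigvecs = np.linalg.eigh(C)       # eigenvalues in ascending order

d = 2
U = eigvecs[:, -d:][:, ::-1]               # top-d principal directions
Y = Xc @ U                                 # low-dimensional embedding

# The variance of each embedding axis equals its retained eigenvalue.
assert np.allclose(Y.var(axis=0, ddof=1), eigvals[::-1][:d])
```

Classical scaling produces the same embedding (up to rotation) from the Gram matrix rather than the covariance matrix, which is why the two are discussed together.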

507 | Training products of experts by minimizing contrastive divergence - Hinton

446 | A fast learning algorithm for deep belief nets
- Hinton, Osindero, et al.
Citation Context: ...function is usually employed). Multilayer autoencoders usually have a high number of connections. Therefore, backpropagation approaches converge slowly and are likely to get stuck in local minima. In [61], this drawback is overcome using a learning procedure that consists of three main stages. First, the recognition layers of the network (i.e., the layers from X to Y) are trained one-by-one using Restricted ...

423 | Laplacian eigenmaps and spectral techniques for embedding and clustering
- Belkin, Niyogi
- 2002
Citation Context: ...ngs of the data manifold in the embedding [52]. 3.2.2 Laplacian Eigenmaps. Similar to LLE, Laplacian Eigenmaps find a low-dimensional data representation by preserving local properties of the manifold [10]. In Laplacian Eigenmaps, the local properties are based on the pairwise distances between near neighbors. Laplacian Eigenmaps compute a low-dimensional representation of the data in which the distances ...
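The Laplacian Eigenmaps procedure summarized above can be sketched as follows. The binary kNN adjacency, the unnormalized Laplacian L = D − W, and the toy data are illustrative assumptions (the original also considers Gaussian edge weights and a generalized eigenproblem).

```python
import numpy as np
from scipy.linalg import eigh

# Hedged sketch of Laplacian Eigenmaps: the embedding is given by the
# eigenvectors of the graph Laplacian for the smallest nonzero eigenvalues,
# which keeps graph neighbors close in the low-dimensional representation.
rng = np.random.default_rng(3)
X = rng.standard_normal((60, 5))

k, d = 8, 2
D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
idx = np.argsort(D2, axis=1)[:, 1:k + 1]
W = np.zeros((len(X), len(X)))
W[np.repeat(np.arange(len(X)), k), idx.ravel()] = 1.0
W = np.maximum(W, W.T)                      # symmetrize the kNN graph

L = np.diag(W.sum(1)) - W                   # unnormalized graph Laplacian
eigvals, eigvecs = eigh(L)
Y = eigvecs[:, 1:d + 1]                     # skip the constant eigenvector

assert np.isclose(eigvals[0], 0.0, atol=1e-8)  # L annihilates constants
```

The smallest eigenvalue is always zero (the constant vector), so the embedding uses the next d eigenvectors.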

408 | FastMap: a fast algorithm for indexing, data-mining and visualization of traditional and multimedia datasets
- Faloutsos, Lin
- 1995
Citation Context: ...Variable Models [80], principal curves [28], kernel maps [118], conformal eigenmaps [113], Geodesic Nullspace Analysis [24], Structure Preserving Embedding [114], variants of multidimensional scaling [3, 38, 45, 62, 92], techniques that (similarly to LLC and manifold charting) globally align a mixture of linear models [104, 109, 133], and linear variants of LLE [58, 74], Laplacian Eigenmaps [59], and LTSA [147]. Also ...

404 | Multidimensional Scaling
- Cox, Cox
- 2000
Citation Context: ...xi and xj, and the constant in front is added in order to simplify the gradient of the cost function. The minimization of the Sammon cost function is generally performed using a pseudo-Newton method [34]. Sammon mapping is mainly used for visualization purposes [88]. The main weakness of Sammon mapping is that it assigns a much higher weight to retaining a distance of, say, 10^-5 than to retaining a ...
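The Sammon cost mentioned above is the classical-scaling cost reweighted by the inverse high-dimensional distance, so small pairwise distances dominate. A small sketch of the stress function (the toy distance matrix is an illustrative assumption):

```python
import numpy as np

# Sammon stress: sum over pairs of (d*_ij - d_ij)^2 / d*_ij, normalized by
# the sum of the high-dimensional distances d*_ij. The 1/d*_ij weight is
# exactly what makes small distances dominate the cost.
def sammon_stress(D_high, D_low):
    i, j = np.triu_indices_from(D_high, k=1)   # each pair once
    dh, dl = D_high[i, j], D_low[i, j]
    return ((dh - dl) ** 2 / dh).sum() / dh.sum()

# A perfect embedding has zero stress; distorting distances raises it.
D = np.array([[0.0, 1.0, 2.0],
              [1.0, 0.0, 1.5],
              [2.0, 1.5, 0.0]])
assert sammon_stress(D, D) == 0.0
assert sammon_stress(D, 2 * D) > 0.0
```

In an actual Sammon mapping the low-dimensional coordinates are optimized to minimize this stress, typically with the pseudo-Newton scheme cited in the excerpt.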

348 | Algorithm 97: Shortest path
- Floyd
- 1962
Citation Context: ...The shortest path between two points in the graph forms an estimate of the geodesic distance between these two points, and can easily be computed using Dijkstra's or Floyd's shortest-path algorithm [41, 47]. The geodesic distances between all datapoints in X are computed, thereby forming a pairwise geodesic distance matrix. The low-dimensional representations yi of the datapoints xi in the low-dimensional ...

337 | Reducing the dimensionality of data with neural networks
- Hinton, Salakhutdinov
Citation Context: ...ing have been reported on, e.g., gene data [44] and geospatial data [119]. 4.2 Multilayer Autoencoders. Multilayer autoencoders are feed-forward neural networks with an odd number of hidden layers [39, 63] and shared weights between the top and bottom layers (although asymmetric network structures may be employed as well). The middle hidden layer has d nodes, and the input and the output layer have D nodes ...
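The encode-to-d, decode-back-to-D structure described above can be illustrated with a deliberately tiny toy: a single linear hidden layer with tied weights, trained by plain gradient descent on squared reconstruction error. Everything here (linearity, tied weights, learning rate, data) is an illustrative assumption; real multilayer autoencoders stack nonlinear layers and use the pretraining scheme discussed elsewhere in the paper.

```python
import numpy as np

# Toy autoencoder sketch: encode X (n x D) to d dimensions with W, decode
# with W', and minimize the mean squared reconstruction error by gradient
# descent. With linear tied weights this converges toward the PCA subspace.
rng = np.random.default_rng(4)
X = rng.standard_normal((100, 8)) @ rng.standard_normal((8, 8))
X = X - X.mean(axis=0)

D, d = X.shape[1], 2
W = 0.01 * rng.standard_normal((D, d))     # tied encoder/decoder weights

def loss(W):
    R = X @ W @ W.T - X                    # reconstruction residual
    return (R ** 2).mean()

lr, losses = 1e-3, []
for _ in range(200):
    R = X @ W @ W.T - X
    grad = 2 * (R.T @ X @ W + X.T @ R @ W) / X.size
    W -= lr * grad
    losses.append(loss(W))

assert losses[-1] < losses[0]              # training reduces the error
```

The middle activation `X @ W` plays the role of the d-node code layer from the excerpt.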

328 | Multidimensional scaling by optimizing goodness of fit to a non-metric hypothesis - Kruskal - 1964

299 | Random Geometric Graphs - Penrose - 2003

296 | An information maximization approach to blind separation and blind deconvolution
- Bell, Sejnowski
- 1995
Citation Context: ...dimensionality reduction technique with clustering, as a result of which they do not fit in the dimensionality reduction framework that we discussed in Section 2. Techniques for Independent Component Analysis [12] are not included in our review, because they were mainly designed for blind-source separation. Linear Discriminant Analysis [46], Generalized Discriminant Analysis [9], and Neighborhood Components Analysis ...

280 | GTM: The generative topographic mapping
- Bishop, Svensén, et al.
- 1998
Citation Context: ...addresses all main techniques for (nonlinear) dimensionality reduction. However, it is not exhaustive. The comparative review does not include self-organizing maps [73] and their probabilistic extension GTM [19], because these techniques combine a dimensionality reduction technique with clustering, as a result of which they do not fit in the dimensionality reduction framework that we discussed in Section 2. ...

273 | Generalized cross-validation as a method for choosing a good ridge parameter
- Golub, Heath, et al.
- 1979
Citation Context: ...for this incapability is that kernel-based methods require the selection of a proper kernel function. In general, model selection in kernel methods is performed using some form of hold-out testing [54], leading to high computational costs. Alternative approaches to model selection for kernel methods are based on, e.g., maximizing the between-class margins or the data variance (as in MVU) using semidefinite ...

265 | The principle of minimized iterations in the solution of the matrix eigenvalue problem
- Arnoldi
- 1951
Citation Context: ...ver, for these techniques the n × n matrix is sparse, which is beneficial, because it lowers the computational complexity of the eigenanalysis. Eigenanalysis of a sparse matrix (using Arnoldi methods [5] or Jacobi-Davidson methods [48]) has computational complexity O(pn^2), where p is the ratio of nonzero elements in the sparse matrix to the total number of elements. The memory complexity is O(pn^2) as well. From the discussion o...
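The sparse eigenanalysis described above is exactly what SciPy's `eigsh` provides: it wraps ARPACK's implicitly restarted Arnoldi/Lanczos iteration, which only needs matrix-vector products and so exploits sparsity. The matrix below is an illustrative random example.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import eigsh

# Sketch of sparse eigenanalysis: a symmetric n x n matrix with only a small
# fraction p of nonzero entries, analyzed with an Arnoldi-type iteration.
n = 300
A = sp.random(n, n, density=0.02, random_state=5, format="csr")
A = A + A.T                                  # make the sparse matrix symmetric

vals, vecs = eigsh(A, k=4, which="LM")       # 4 largest-magnitude eigenpairs

# Each returned pair satisfies A v = lambda v up to solver tolerance.
for lam_i, v in zip(vals, vecs.T):
    assert np.allclose(A @ v, lam_i * v, atol=1e-6)
```

This matches the O(pn^2) cost quoted in the excerpt: each iteration touches only the nonzero entries, so techniques like LLE and Laplacian Eigenmaps benefit from their sparse matrices.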

224 | The EM algorithm for mixtures of factor analyzers
- Ghahramani, Hinton
- 1996
Citation Context: ...aligning the local linear models in order to obtain the low-dimensional data representation using a variant of LLE. LLC first constructs a mixture of m factor analyzers (MoFA) using the EM algorithm [40, 50, 70]. Alternatively, a mixture of probabilistic PCA models (MoPPCA) could be employed [125]. The local linear models in the mixture are used to construct m data representations zij and their corresponding ...

217 | Generalized discriminant analysis using a kernel approach
- Baudat, Anouar
- 2000
Citation Context: ...Independent Component Analysis [12] are not included in our review, because they were mainly designed for blind-source separation. Linear Discriminant Analysis [46], Generalized Discriminant Analysis [9], Neighborhood Components Analysis [53, 106], and recently proposed metric learners [32, 8, 51, 137] are not included in the review, because of their supervised nature. Furthermore, our comparative ...

208 | Neighbourhood components analysis
- Goldberger, Roweis, et al.
- 2005
Citation Context: ...not included in our review, because they were mainly designed for blind-source separation. Linear Discriminant Analysis [46], Generalized Discriminant Analysis [9], Neighborhood Components Analysis [53, 106], and recently proposed metric learners [32, 8, 51, 137] are not included in the review, because of their supervised nature. Furthermore, our comparative review does not cover a number of techniques that ...

205 | Locality preserving projections
- He, Niyogi
- 2004
Citation Context: ...analysis of fMRI data [25]. In addition, variants of Laplacian Eigenmaps may be applied to supervised or semi-supervised learning problems [33, 11]. A linear variant of Laplacian Eigenmaps is presented in [59]. In spectral clustering, clustering is performed based on the sign of the coordinates obtained from Laplacian Eigenmaps [93, 116, 140]. 3.2.3 Hessian LLE. Hessian LLE (HLLE) [42] is a variant of LLE that ...

186 | Face Recognition Using Laplacianfaces - He, Yan, et al. - 2005

161 | Charting a manifold - Brand - 2003

156 | Semi-supervised learning on Riemannian manifolds
- Belkin, Niyogi
- 2004
Citation Context: ...successfully applied to, e.g., face recognition [58] and the analysis of fMRI data [25]. In addition, variants of Laplacian Eigenmaps may be applied to supervised or semi-supervised learning problems [33, 11]. A linear variant of Laplacian Eigenmaps is presented in [59]. In spectral clustering, clustering is performed based on the sign of the coordinates obtained from Laplacian Eigenmaps [93, 116, 140]. ...

150 | Curvilinear component analysis: a self-organizing neural network for nonlinear mapping of data sets
- Demartines, Hérault
- 1997
Citation Context: ...not on retaining the small pairwise distances, which are much more important to the geometry of the data. Several multidimensional scaling variants have been proposed that aim to address this weakness [3, 38, 81, 108, 62, 92, 129]. In this subsection, we discuss one such MDS variant called Sammon mapping [108]. Sammon mapping adapts the classical scaling cost function (see Equation 2) by weighting the contribution of each pair ...

142 | Global versus local methods in nonlinear dimensionality reduction
- de Silva, Tenenbaum
- 2003
Citation Context: ...impose considerable demands on computational resources, as compared to PCA. Attempts to reduce the computational and/or memory complexities of nonlinear techniques have been proposed for, e.g., Isomap [37, 79], MVU [136, 139], and Kernel PCA [124]. 5.3 Out-of-sample Extension. An important requirement for dimensionality reduction techniques is the ability to embed new high-dimensional datapoints into an existing ...

135 | Probabilistic non-linear principal component analysis with Gaussian process latent variable models
- Lawrence
- 2005
Citation Context: ...more efficient for very high-dimensional data. By using Gaussian processes, probabilistic PCA may also be extended to learn nonlinear mappings between the high-dimensional and the low-dimensional space [80]. Another extension of PCA also includes minor components (i.e., the eigenvectors corresponding to the smallest eigenvalues) in the linear mapping, as minor components may be of relevance in classification ...

130 | Metric learning by collapsing classes
- Globerson, Roweis
- 2006
Citation Context: ...mainly designed for blind-source separation. Linear Discriminant Analysis [46], Generalized Discriminant Analysis [9], Neighborhood Components Analysis [53, 106], and recently proposed metric learners [32, 8, 51, 137] are not included in the review, because of their supervised nature. Furthermore, our comparative review does not cover a number of techniques that are variants or extensions of the thirteen reviewed ...

121 | Introduction to multivariate analysis
- Chatfield, Collins
- 1980
Citation Context: ...the eigenvectors of the covariance matrix and the Gram matrix of the high-dimensional data: it can be shown that the eigenvectors ui and vi of the matrices X^T X and X X^T are related through sqrt(λi) vi = X ui [29]. The connection between PCA and classical scaling is described in more detail in, e.g., [143, 99]. PCA may also be viewed upon as a latent variable model called probabilistic PCA [103]. This model uses ...
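The stated duality between the eigenvectors of X^T X (covariance side) and X X^T (Gram side) is easy to verify numerically. The toy matrix below is an illustrative assumption.

```python
import numpy as np

# Numerical check of the duality: if u is a unit eigenvector of X^T X with
# eigenvalue lambda, then v = X u / sqrt(lambda) is a unit eigenvector of
# X X^T with the same eigenvalue (i.e., sqrt(lambda) v = X u).
rng = np.random.default_rng(6)
X = rng.standard_normal((7, 4))

lam_all, U = np.linalg.eigh(X.T @ X)         # eigenpairs of the 4 x 4 matrix
u, lam = U[:, -1], lam_all[-1]               # top eigenpair

v = X @ u / np.sqrt(lam)                     # the claimed relation
assert np.allclose((X @ X.T) @ v, lam * v)   # v is an eigenvector of X X^T
assert np.isclose(np.linalg.norm(v), 1.0)    # and it has unit norm
```

This is why classical scaling (which works with the n x n Gram matrix) and PCA (which works with the D x D covariance) yield equivalent solutions.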

118 | Stochastic neighbor embedding
- Hinton, Roweis
- 2003
Citation Context: ...not on retaining the small pairwise distances, which are much more important to the geometry of the data. Several multidimensional scaling variants have been proposed that aim to address this weakness [3, 38, 81, 108, 62, 92, 129]. In this subsection, we discuss one such MDS variant called Sammon mapping [108]. Sammon mapping adapts the classical scaling cost function (see Equation 2) by weighting the contribution of each pair ...

109 | A kernel view of the dimensionality reduction of manifolds
- Ham, Lee, et al.
- 2004
Citation Context: ...graph defined on the data. Third, the spectral techniques Kernel PCA, Isomap, LLE, and Laplacian Eigenmaps can all be viewed upon as special cases of the more general problem of learning eigenfunctions [14, 57]. As a result, Isomap, LLE, and Laplacian Eigenmaps can be considered as special cases of Kernel PCA that use a specific kernel function κ. For instance, this relation is visible in the out-of-sample ...

107 | Non-linear dimensionality reduction
- DeMers, Cottrell
- 1992
Citation Context: ...ing have been reported on, e.g., gene data [44] and geospatial data [119]. 4.2 Multilayer Autoencoders. Multilayer autoencoders are feed-forward neural networks with an odd number of hidden layers [39, 63] and shared weights between the top and bottom layers (although asymmetric network structures may be employed as well). The middle hidden layer has d nodes, and the input and the output layer have D nodes ...

107 | Learning a Mahalanobis metric from equivalence constraints
- Bar-Hillel, Hertz, et al.
- 2005
Citation Context: ...mainly designed for blind-source separation. Linear Discriminant Analysis [46], Generalized Discriminant Analysis [9], Neighborhood Components Analysis [53, 106], and recently proposed metric learners [32, 8, 51, 137] are not included in the review, because of their supervised nature. Furthermore, our comparative review does not cover a number of techniques that are variants or extensions of the thirteen reviewed ...

102 | Jacobi-Davidson style QR and QZ algorithms for the reduction of matrix pencils
- Fokkema, Sleijpen, van der Vorst
- 1998
Citation Context: ...the n × n matrix is sparse, which is beneficial, because it lowers the computational complexity of the eigenanalysis. Eigenanalysis of a sparse matrix (using Arnoldi methods [5] or Jacobi-Davidson methods [48]) has computational complexity O(pn^2), where p is the ratio of nonzero elements in the sparse matrix to the total number of elements. The memory complexity is O(pn^2) as well. From the discussion o...

100 | Dimension reduction by local principal component analysis
- Kambhatla, Leen
Citation Context: ...aligning the local linear models in order to obtain the low-dimensional data representation using a variant of LLE. LLC first constructs a mixture of m factor analyzers (MoFA) using the EM algorithm [40, 50, 70]. Alternatively, a mixture of probabilistic PCA models (MoPPCA) could be employed [125]. The local linear models in the mixture are used to construct m data representations zij and their corresponding ...

99 | Hessian eigenmaps: new locally linear embedding techniques for high-dimensional data - Donoho, Grimes

97 | Maximum likelihood estimation of intrinsic dimension
- Levina, Bickel
Citation Context: ...The value of k in the k-nearest neighbor classifiers was set to 1. We determined the target dimensionality in the experiments by means of the maximum likelihood intrinsic dimensionality estimator [84]. Note that for Hessian LLE and LTSA, the dimensionality of the actual low-dimensional data representation cannot be higher than the number of nearest neighbors that was used to construct the neighbor...
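The maximum likelihood intrinsic dimensionality estimator referenced above can be sketched as follows. The exact form shown (average the log-ratios of the k-th to the j-th nearest-neighbor distances per point, invert, then average over points) is one common variant of the Levina-Bickel estimator; the data and k = 10 are illustrative assumptions.

```python
import numpy as np

# Hedged sketch of an MLE intrinsic dimensionality estimator: locally, the
# number of neighbors within radius r grows like r^d, and d is estimated
# from the spacing of nearest-neighbor distances.
def intrinsic_dim_mle(X, k=10):
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    T = np.sort(D, axis=1)[:, 1:k + 1]        # k nearest-neighbor distances
    logratio = np.log(T[:, -1:] / T[:, :-1])  # log(T_k / T_j), j = 1..k-1
    m_hat = (k - 1) / logratio.sum(axis=1)    # per-point ML estimate
    return m_hat.mean()

# Data lying on a 2-D plane embedded in 5-D should give an estimate near 2.
rng = np.random.default_rng(7)
Z = rng.standard_normal((500, 2))
X = Z @ rng.standard_normal((2, 5))
assert 1.5 < intrinsic_dim_mle(X) < 2.5
```

The paper uses such an estimate to choose the target dimensionality before running each reduction technique.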

95 | The numerical treatment of integral equations
- Baker
- 1977
Citation Context: ...this relation is visible in the out-of-sample extensions of Isomap, LLE, and Laplacian Eigenmaps [17]. The out-of-sample extension for these techniques is performed by means of a so-called Nyström approximation [6, 99], which is known to be equivalent to the Kernel PCA projection (see 5.3 for more details). Laplacian Eigenmaps and Hessian LLE are also intimately related: they only differ in the type of differential ...
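The Nyström mechanism behind the out-of-sample extensions above can be sketched as follows: eigenvectors computed on a set of landmark points are extended to new points via their kernel values against the landmarks, K_new V Λ^{-1}. The Gaussian kernel, the landmark set, and the scaling convention are illustrative assumptions.

```python
import numpy as np

# Hedged sketch of a Nyström out-of-sample extension for a kernel
# eigendecomposition. K V = V diag(lam) on the landmarks; new points are
# embedded with the same eigenvectors via their kernel row K_new.
def gauss_kernel(A, B, sigma=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

rng = np.random.default_rng(8)
landmarks = rng.standard_normal((40, 3))

K = gauss_kernel(landmarks, landmarks)
lam, V = np.linalg.eigh(K)
lam, V = lam[::-1][:5], V[:, ::-1][:, :5]     # top-5 eigenpairs

def nystrom_embed(Xnew):
    return gauss_kernel(Xnew, landmarks) @ V / lam

# Embedding a landmark itself reproduces its eigenvector coordinates,
# since K V / lam = V row-wise.
assert np.allclose(nystrom_embed(landmarks), V)
```

This is the sense in which Isomap, LLE, and Laplacian Eigenmaps inherit an out-of-sample extension from their Kernel PCA interpretation.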

95 | Diffusion maps and coarse-graining: A unified framework for dimensionality reduction, graph partitioning and data set parameterization
- Lafon, Lee
Citation Context: ...of the manifold. Despite this weakness, MVU was successfully applied to, e.g., sensor localization [139] and DNA microarray data analysis [71]. 3.1.5 Diffusion Maps. The diffusion maps (DM) framework [76, 91] originates from the field of dynamical systems. Diffusion maps are based on defining a Markov random walk on the graph of the data. By performing the random walk for a number of timesteps, a measure ...
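The Markov random walk construction above can be sketched as follows. The Gaussian affinity, the bandwidth sigma, the number of timesteps t, and the omission of the usual alpha-normalization are illustrative assumptions; the essential steps (normalize affinities into a transition matrix, embed with eigenvalue-scaled eigenvectors) follow the framework described.

```python
import numpy as np

# Hedged sketch of diffusion maps: a Gaussian affinity matrix W is
# normalized into a row-stochastic transition matrix P, and the embedding
# uses the top non-trivial eigenvectors scaled by lambda^t.
rng = np.random.default_rng(9)
X = rng.standard_normal((80, 4))

sigma, t, d = 1.0, 2, 2
W = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1) / (2 * sigma ** 2))
P = W / W.sum(axis=1, keepdims=True)          # Markov random walk on the graph

eigvals, eigvecs = np.linalg.eig(P)           # real spectrum: P ~ symmetric
order = np.argsort(-eigvals.real)
lam = eigvals.real[order]
V = eigvecs.real[:, order]

Y = V[:, 1:d + 1] * lam[1:d + 1] ** t         # diffusion coordinates at time t

assert np.isclose(lam[0], 1.0)                # stationary eigenvalue of P
```

Running the walk for t steps (the lambda^t scaling) suppresses directions along which diffusion mixes quickly, which is how the method captures multiscale geometry.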

94 | Out-of-sample extensions for LLE, Isomap, MDS, eigenmaps, and spectral clustering
- Bengio, Paiement, et al.
Citation Context: ...can be considered as special cases of Kernel PCA that use a specific kernel function κ. For instance, this relation is visible in the out-of-sample extensions of Isomap, LLE, and Laplacian Eigenmaps [17]. The out-of-sample extension for these techniques is performed by means of a so-called Nyström approximation [6, 99], which is known to be equivalent to the Kernel PCA projection (see 5.3 for more details). ...

91 | EM algorithms for PCA and SPCA - Roweis - 1997

90 | Self-Organization and Associative Memory, 3rd Edition
- Kohonen
- 1989
Citation Context: ...The comparative review presented in this paper addresses all main techniques for (nonlinear) dimensionality reduction. However, it is not exhaustive. The comparative review does not include self-organizing maps [73] and their probabilistic extension GTM [19], because these techniques combine a dimensionality reduction technique with clustering, as a result of which they do not fit in the dimensionality reduction ...