## A review of dimension reduction techniques (1997)

Citations: 33 (4 self)

### BibTeX

    @MISC{Carreira-perpiñán97areview,
      author = {Miguel Á. Carreira-Perpiñán},
      title = {A review of dimension reduction techniques},
      year = {1997}
    }

### Abstract

The problem of dimension reduction is introduced as a way to overcome the curse of dimensionality when dealing with vector data in high-dimensional spaces and as a modelling tool for such data. It is defined as the search for a low-dimensional manifold that embeds the high-dimensional data. A classification of dimension reduction problems is proposed. A survey of several techniques for dimension reduction is given, including principal component analysis, projection pursuit and projection pursuit regression, principal curves and methods based on topologically continuous maps, such as Kohonen’s maps or the generative topographic mapping (GTM). Neural network implementations of several of these techniques are also reviewed, such as the projection pursuit learning network and the BCM neuron with an objective function. Several appendices complement the mathematical treatment of the main text.
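As a concrete illustration of the most classical technique surveyed in the review, principal component analysis, the following sketch (Python with NumPy; the synthetic data and sizes are invented for illustration, not taken from the review) finds a low-dimensional linear manifold by eigendecomposition of the sample covariance matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic 3-D data lying close to a 2-D plane (illustrative only).
latent = rng.normal(size=(500, 2))
mixing = np.array([[1.0, 0.0], [0.5, 1.0], [0.2, -0.3]])
X = latent @ mixing.T + 0.01 * rng.normal(size=(500, 3))

# PCA: eigendecomposition of the sample covariance matrix.
Xc = X - X.mean(axis=0)
C = Xc.T @ Xc / (len(X) - 1)
eigvals, eigvecs = np.linalg.eigh(C)          # ascending order
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Keep L = 2 principal components: the low-dimensional representation.
L = 2
Z = Xc @ eigvecs[:, :L]
explained = eigvals[:L].sum() / eigvals.sum()
print(round(explained, 3))   # close to 1: two components capture almost all variance
```

Since the data were generated on a noisy 2-D plane, two components recover nearly all the variance; this is the "low-dimensional manifold that embeds the high-dimensional data" in its simplest, linear form.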

### Citations

9489 |
Elements of Information Theory
- Cover, Thomas
- 1991
Citation Context ...onlinear manifolds. According to this and the following results: • For fixed variance, the normal distribution has the least information, in both the senses of Fisher information and negative entropy [16]. • For most high-dimensional clouds, most low-dimensional projections are approximately normal (Diaconis and Freedman [24]). We will consider the normal distribution as the least structured (or least...
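The Diaconis-Freedman observation quoted above can be checked numerically. A minimal sketch (NumPy; the uniform-cube data is an arbitrary, decidedly non-Gaussian choice): project a high-dimensional cloud onto a random direction and inspect the first moments of the projection:

```python
import numpy as np

rng = np.random.default_rng(1)
D, n = 50, 20000
# A non-Gaussian high-dimensional cloud: uniform on the hypercube [0, 1]^50.
X = rng.uniform(size=(n, D))

# Project onto a random unit direction.
u = rng.normal(size=D)
u /= np.linalg.norm(u)
p = X @ u

# Standardise and compare the first moments with those of N(0, 1).
z = (p - p.mean()) / p.std()
skew = np.mean(z**3)   # 0 for a normal distribution
kurt = np.mean(z**4)   # 3 for a normal distribution
print(round(skew, 2), round(kurt, 2))
```

Despite the uniform marginals, the projection is approximately normal (skewness near 0, kurtosis near 3), in line with the quoted result.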

9374 | Maximum Likelihood from incomplete data via the EM algorithm
- Dempster, Laird, et al.
- 1977
Citation Context ...R_in(W, β) = p(x_i|t_n, W, β) = p(t_n|x_i, W, β) / Σ_{i′=1}^K p(t_n|x_{i′}, W, β). – G is a diagonal K × K matrix with elements g_ii(W, β) = Σ_{n=1}^N R_in(W, β). (5.14) Because the EM algorithm increases the log-likelihood monotonically [23], the convergence of GTM is guaranteed. According to [7], convergence is usually achieved after a few tens of iterations. As initial weights one can take the first L principal components of the sample...
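The E-step quoted above, posterior responsibilities R_in normalised over the K latent grid points and the diagonal matrix G, can be sketched as follows (NumPy; the Gaussian centres stand in for the mapped grid points y(x_i; W), and all sizes are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
K, N, D = 5, 100, 3          # hypothetical sizes: K latent grid points, N data points
beta = 2.0                   # inverse noise variance
centers = rng.normal(size=(K, D))   # stands in for the mapped grid points y(x_i; W)
T = rng.normal(size=(N, D))         # data sample {t_n}

# p(t_n | x_i, W, beta): isotropic Gaussian around each mapped grid point.
sq = ((T[:, None, :] - centers[None, :, :]) ** 2).sum(-1)   # N x K squared distances
logp = -0.5 * beta * sq      # unnormalised log densities (common factors cancel below)

# Responsibilities R[n, i] = p(t_n|x_i) / sum_i' p(t_n|x_i'), computed stably in log space.
logp -= logp.max(axis=1, keepdims=True)
R = np.exp(logp)
R /= R.sum(axis=1, keepdims=True)    # each row sums to 1

# Diagonal of G: g_ii = sum_n R_in.
g = R.sum(axis=0)
print(np.allclose(R.sum(axis=1), 1.0), np.isclose(g.sum(), N))
```

Note the total responsibility mass equals N, so trace(G) = N; the normalisation is exactly the softmax in the quoted formula.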

5536 |
Neural Networks for Pattern Recognition
- Bishop
- 1995
Citation Context ...(e.g. degree of differentiability or computational effort required). This fact finds a parallel with radial basis functions in neural networks: the actual shape of the RBFs is relatively unimportant [5]. For h depending in some way on the sample size n, pointwise as well as global convergence (in probability for both the uniform and the integration sense) can be proven for kernels satisfying a few v...

4609 |
Classification and Regression Trees
- Breiman, Friedman, et al.
- 1984
Citation Context ...supersmoother (section F.3.4), no huge regression tables are needed. F.3.6 Projection pursuit regression See section 3.7. F.3.7 Partitioning methods (regression trees) Partitioning methods (e.g. CART [10], ID3 [82]) operate as follows: • Partition the input space into regions according to the data during training (typically with hyperplanes parallel to the coordinate axes). Binary partitions are commo...

3867 |
Self-Organizing Maps
- Kohonen
Citation Context ...ction, but unfortunately it is too linked to Kohonen’s maps in the literature. 5.1 Kohonen’s self-organising maps Let {t_n}_{n=1}^N be a sample in the data space R^D. Kohonen’s self-organising maps (SOMs) [69] can be considered as a form of dimension reduction in the sense that they learn, in an unsupervised way, a mapping between a 2-D lattice and the data space. The mapping preserves the two-dimension...
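A minimal SOM sketch along the lines described above (NumPy; the lattice size, annealing schedules and 2-D data are invented for illustration): each sample updates the winning node and, through a neighbourhood function defined on the lattice, its lattice neighbours:

```python
import numpy as np

rng = np.random.default_rng(3)
# Data in R^2 for simplicity; a SOM maps a lattice of nodes into data space.
T = rng.uniform(size=(400, 2))

# A 4x4 lattice: codebook vectors mu_i in data space, node coordinates on the lattice.
grid = np.array([(r, c) for r in range(4) for c in range(4)], dtype=float)
mu = rng.uniform(size=(16, 2))

steps = 2000
for s in range(steps):
    t = T[rng.integers(len(T))]
    # Winner: the node whose codebook vector is closest to the sample.
    win = np.argmin(((mu - t) ** 2).sum(1))
    # Gaussian neighbourhood in *lattice* coordinates, shrinking over time.
    sigma = 2.0 * (0.05 / 2.0) ** (s / steps)
    h = np.exp(-((grid - grid[win]) ** 2).sum(1) / (2 * sigma**2))
    # Decreasing learning rate; every node moves toward t, weighted by h.
    alpha = 0.5 * (0.01 / 0.5) ** (s / steps)
    mu += alpha * h[:, None] * (t - mu)

# After training, the codebook vectors remain inside the data's support.
print(mu.min() >= 0.0 and mu.max() <= 1.0)
```

The key point, consistent with the quoted context, is that the neighbourhood is measured on the lattice, not in data space: that is what makes the learned mapping topology-preserving.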

3725 | Induction of decision trees
- Quinlan
- 1986
Citation Context ...her (section F.3.4), no huge regression tables are needed. F.3.6 Projection pursuit regression See section 3.7. F.3.7 Partitioning methods (regression trees) Partitioning methods (e.g. CART [10], ID3 [82]) operate as follows: • Partition the input space into regions according to the data during training (typically with hyperplanes parallel to the coordinate axes). Binary partitions are common; each re...

1792 | Generalized Additive Models
- Hastie, Tibshirani
- 1984
Citation Context ...refore less general because there is no interaction between input variables (e.g. the function x1x2 cannot be modelled), but it is more easily interpretable (the functions gk(xk) can be plotted); see [47, 48] for some applications. One could add cross-terms of the form gkl(xk, xl) to achieve greater flexibility but the combinatorial explosion quickly sets in. 3.8.1 Backfitting The ridge functions {gk}...

1459 |
Multilayer feedforward networks are universal approximators
- Hornik, Stinchombe, et al.
- 1989
Citation Context ...rmance of PPL compared to that of BPL Although approximation theorems of the form “any noise-free square-integrable function can be approximated to an arbitrary degree of accuracy” exist for both BPL [18, 50, 51, 52] and PPL (see section 6.1), nothing is said about the number of neurons required in practice. Hwang et al. report some empirical results (based on a simulation with D = 2, q = 1): • PPL (with Hermite ...

1226 |
The algebraic eigenvalue problem
- Wilkinson, Ed
- 1988
Citation Context ...27, 61, 62] for a more comprehensive treatment. Also, see section C.2 for a comparison with other transformations of the covariance matrix. ...decomposition, Cholesky decomposition, etc.; see [81] or [99]. When the covariance matrix, of order D × D, is too large to be explicitly computed one could use neural network techniques (section 6.2), some of which do not require more memory space other than th...

960 |
Approximations by superpositions of sigmoidal functions
- Cybenko
- 1989
Citation Context ...rmance of PPL compared to that of BPL Although approximation theorems of the form “any noise-free square-integrable function can be approximated to an arbitrary degree of accuracy” exist for both BPL [18, 50, 51, 52] and PPL (see section 6.1), nothing is said about the number of neurons required in practice. Hwang et al. report some empirical results (based on a simulation with D = 2, q = 1): • PPL (with Hermite ...

722 | The Cascade-Correlation Learning Architecture
- Fahlman, Lebiere
- 1990
Citation Context ...ith the dimension for fixed accuracy: the curse of dimensionality has been reduced. 6.4 Cascade correlation learning network (CCLN) Figure 13 shows a cascade correlation learning network (CCLN). CCLN [30] is a supervised learning architecture that dynamically grows layers of hidden neurons with fixed nonlinear activations (e.g. sigmoids), so that the network topology (size, length) can be efficiently ...

717 | Quantization
- Gray, Neuhoff
- 1998
Citation Context ...as a plane sheet that twists around itself in D dimensions to resemble as much as possible the distribution of the data vectors. 5.1.1 The neighbourhood function In a SOM, as in vector quantisation [41], we have a set of reference or codebook vectors {µ_i}_{i=1}^M in data space R^D, initially distributed at random, but each of them is associated to a node i in a 2-D lattice —unlike in vector quan...

690 |
Multivariate Density Estimation: Theory, Practice and Visualization
- Scott
- 1992
Citation Context ...from tens to tens of thousands. Proteins with the same spatial structure —but often with very different aminoacid sequences— are grouped together in families. Several representation techniques (see [89] for a review) exist that allow one to visualise up to about 5-dimensional data sets, using colours, rotation, stereography, glyphs or other devices, but they lack the appeal of a simple plot; a well-know...

659 | Bayesian learning for neural networks
- Neal
- 1996
Citation Context ...l networks Assume we have data D which we want to model using parameters w and define the likelihood L(w) = p(D|w) as the probability of the data given the parameters. Learning can be classified into [76]: • Traditional (frequentist), e.g. the MLP: no distribution over the parameters w is assumed; we are interested in a single value w∗ which is often found as a maximum likelihood estimator: w∗ = arg...

625 |
A stochastic approximation method
- Robbins, Monro
- 1951
Citation Context ...e small for the approximation to be good. An online version of the M step is obtained by decomposing the objective function over the data points, l = Σ_{n=1}^N l_n, and using the Robbins-Monro procedure [85]: w_kj^(t+1) = w_kj^(t) + α^(t) (∂l_n/∂w_kj)^(t), β^(t+1) = β^(t) + α^(t) (∂l_n/∂β)^(t). If the learning rate α^(t) is an appropriately decreasing function of the iteration step t, convergence to an ex...
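The Robbins-Monro iteration quoted above can be illustrated with the simplest possible case, stochastic estimation of a mean (a sketch; the step size α_t = 1/t satisfies the usual conditions Σα_t = ∞, Σα_t² < ∞):

```python
import numpy as np

rng = np.random.default_rng(4)
true_mean = 3.0
theta = 0.0
# Robbins-Monro: theta_{t+1} = theta_t + alpha_t * (noisy step toward the target),
# with alpha_t -> 0, sum alpha_t = inf, sum alpha_t^2 < inf.
for t in range(1, 20001):
    x = true_mean + rng.normal()     # one noisy observation per step
    alpha = 1.0 / t                  # satisfies the step-size conditions
    theta += alpha * (x - theta)     # stochastic step toward E[x]
print(round(theta, 2))               # close to 3.0
```

With α_t = 1/t this recursion is exactly the running sample mean, which is why convergence follows; an online EM update replaces (x − θ) with the per-point gradient ∂l_n/∂w, as in the quoted formulas.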

575 | Hidden Markov models in computational biology: applications to protein modeling
- Krogh, Brown, et al.
- 1994
Citation Context ...lies and can also help to identify new members of a family or to discover new families. Probabilistic approaches to the investigation of the structure of protein families include hidden Markov models [70] and density networks [74]. • Speech modelling. It has been conjectured that speech recognition, undoubtedly an exceedingly complex process, could be accomplished with only about 5 variables. 1.2 De...

534 |
Adaptive Control Processes: A Guided Tour
- Bellman
- 1961
Citation Context ...t information, producing a more economic representation of the data. 1.4 The curse of the dimensionality and the empty space phenomenon The curse of the dimensionality (term coined by Bellman in 1961 [3]) refers to the fact that, in the absence of simplifying assumptions, the sample size needed to estimate a function of several variables to a given degree of accuracy (i.e. to get a reasonably low-var...

462 |
A nonlinear mapping for data structure analysis
- Sammon
- 1969
Citation Context ...ry it is related to the problem of data compression and coding. • Many visualisation techniques are actually performing some kind of dimension reduction: multidimensional scaling [71], Sammon mapping [87], etc. • Complexity reduction: if the complexity in time or memory of an algorithm depends on the dimension of its input data as a consequence of the curse of the dimensionality, reducing this will ma...

460 | Projection Pursuit Regression
- Friedman, Stuetzle
- 1981
Citation Context ...ality due to the feature extraction step. • Less biased to the training data due to the CART method. 3.7 Projection pursuit regression (PPR) Projection pursuit regression (PPR) (Friedman and Stuetzle [36]) is a nonparametric regression approach for the multivariate regression problem (see section F.1) based on projection pursuit. It works by additive composition, constructing an approximation to the d...

412 |
A user's guide to principal components
- Jackson
- 1990
Citation Context ...numerical techniques exist for finding all or the first few eigenvalues and eigenvectors of a square, symmetric, positive semidefinite matrix (the covariance matrix) in O(D³): singular value decomposition, Cholesky decomposition, etc.; see [81] or [99]. See [27, 61, 62] for a more comprehensive treatment. Also, see section C.2 for a comparison with other transformations of the covariance matrix. When ...

339 | Theory for the development of neuron selectivity: orientation specificity and binocular interaction in visual cortex
- Bienenstock, Cooper, et al.
- 1982
Citation Context ...en units (hidden layer not retrained). Table 2 compares CCLNs with PPLNs. 6.5 BCM neuron using an objective function 6.5.1 The BCM neuron Figure 14 shows the BCM neuron (Bienenstock, Cooper and Munro [4]). Let x ∈ R^D be the input to the neuron and c = x^T w ∈ R its output. We define the threshold θ_w = E[(x^T w)²] = E[c²] and the functions φ̂(c, θ_w) = c² − (1/2)cθ_w and φ(c, θ_w) = c² − cθ_w. ...

326 |
Principal curves
- Hastie, Stuetzle
- 1989
Citation Context ...4 Principal Curves and Principal Surfaces Principal curves (Hastie and Stuetzle [46]) are smooth 1-D curves that pass through the middle of a p-dimensional data set, providing a nonlinear summary of it. They are estimated in a nonparametric way, i.e. their shape is suggested by the d...

Growing Cell Structures — A Self-Organizing Network for Unsupervised and Supervised Learning
- Fritzke
Citation Context ...ed to dimension reduction that have not been included in this work due to lack of time. These would include the Helmholtz machine [19, 20], some variations of self-organising maps (growing neural gas [39, 40], Bayesian approaches [96, 97], etc.), population codes [100], and curvilinear component analysis [22], among others. A Glossary • Backfitting algorithm: an iterative method to fit additive models,...

272 |
A projection pursuit algorithm for exploratory data analysis
- Friedman, Tukey
- 1974
Citation Context ...on 3.2) is not a good indicator of structure. However, the projection on the plane spanned by e1 and e2 clearly shows both clusters. 8 The term projection pursuit was introduced by Friedman and Tukey [38] in 1974, along with the first projection index. Good reviews of projection pursuit can be found in Huber [53] and Jones and Sibson [65]. 9 Varimax rotation [66] is a procedure that, given a subspace ...

251 |
Projection pursuit
- Huber
- 1985
Citation Context ...y shows both clusters. 8 The term projection pursuit was introduced by Friedman and Tukey [38] in 1974, along with the first projection index. Good reviews of projection pursuit can be found in Huber [53] and Jones and Sibson [65]. 9 Varimax rotation [66] is a procedure that, given a subspace or projection, selects a new basis for it that maximises the variance but giving large loadings to as few vari...

250 |
Approximation capabilities of multilayer feedforward networks
- Hornik
- 1991
Citation Context ...rmance of PPL compared to that of BPL Although approximation theorems of the form “any noise-free square-integrable function can be approximated to an arbitrary degree of accuracy” exist for both BPL [18, 50, 51, 52] and PPL (see section 6.1), nothing is said about the number of neurons required in practice. Hwang et al. report some empirical results (based on a simulation with D = 2, q = 1): • PPL (with Hermite ...

246 |
The Use of Faces to Represent Points in K-Dimensional Space Graphically
- Chernoff
- 1973
Citation Context ...up to about 5-dimensional data sets, using colours, rotation, stereography, glyphs or other devices, but they lack the appeal of a simple plot; a well-known one is the grand tour [1]. Chernoff faces [13] allow even a few more dimensions, but are difficult to interpret and do not produce a spatial view of the data. [Figure 2: an example curve f(t) = (R sin 2πt, R cos 2πt, st)^T]

241 | Optimal unsupervised learning in a single-layer linear feedfonvard neural network
- Sanger
- 1989
Citation Context ...[11, 15] for applications. • Networks based on Oja’s rule [77] with some kind of decorrelating device (e.g. Kung and Diamantaras’ APEX [72], Földiák’s network [32], Sanger’s Generalised Hebbian Algorithm [88]). [Figure 9: two examples of neural networks able to perform a principal component analysis: a linear autoassociator and the APEX network]

206 | The Helmholtz machine
- Dayan, Hinton, et al.
- 1995
Citation Context ...ld like to conclude by mentioning a number of further techniques related to dimension reduction that have not been included in this work due to lack of time. These would include the Helmholtz machine [19, 20], some variations of self-organising maps (growing neural gas [39, 40], Bayesian approaches [96, 97], etc.), population codes [100], and curvilinear component analysis [22], among others. A Glossar...

203 |
Some aspects of the spline smoothing approach to non-parametric regression curve fitting
- Silverman
- 1985
Citation Context ...p″_i(x_i) = p″_{i−1}(x_i), p″_1(x_1) = p″_n(x_n) = 0. Smoothing grows with λ, from 0 (pure interpolation) to ∞ (least squares linear fit). The shape of the smoothing spline converges to the kernel [92]: K(u) = (1/2) e^{−|u|/√2} sin(|u|/√2 + π/4). (F.2) Unfortunately, there are theoretical difficulties in extending the variational problem (F.1) analytically to several dimensions, which prevents using ...

200 |
Neural networks and principal component analysis: Learning from examples without local minima. Neural networks
- Baldi, Hornik
- 1989
Citation Context ...en units and n outputs, trained to replicate the input in the output layer minimising the squared sum of errors, and typically trained with backpropagation. Bourlard and Kamp [9] and Baldi and Hornik [2] showed that this network finds a basis of the subspace spanned by the first h PCs, not necessarily coincident with them; see [11, 15] for applications. • Networks based on Oja’s rule [77] with so...

192 |
The Varimax criterion for analytic rotation in factor analysis
- Kaiser
- 1958
Citation Context ...t was introduced by Friedman and Tukey [38] in 1974, along with the first projection index. Good reviews of projection pursuit can be found in Huber [53] and Jones and Sibson [65]. 9 Varimax rotation [66] is a procedure that, given a subspace or projection, selects a new basis for it that maximises the variance but giving large loadings to as few variables as possible. The projection will be mainly ex...

189 |
Hyperdimensional Data Analysis Using Parallel Coordinates
- Wegman
- 1990
Citation Context ...s in its corners and the centre becomes less important. Table 7 and figure 22 show the volumes V(S_1^D), V(C_1^D) and the ratio between them for several dimensions. • Hypervolume of a thin shell [98]: consider the volume between two concentric spherical shells of respective radii R and R(1 − ε), with ε small. Then the ratio (V(S_R^D) − V(S_{R(1−ε)}^D)) / V(S_R^D) = 1 − (1 − ε)^D → 1 as D → ∞. Hence, ...
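The thin-shell ratio above is easy to evaluate numerically; with relative width ε = 0.05 the outer shell already contains essentially all of the ball's volume in high dimension:

```python
# Fraction of a D-ball's volume within a thin outer shell of relative width eps:
# 1 - (1 - eps)^D, which tends to 1 as D grows (the empty space phenomenon).
eps = 0.05
fractions = {D: 1 - (1 - eps) ** D for D in (1, 10, 100, 1000)}
for D, f in fractions.items():
    print(D, round(f, 4))
# D = 1 gives 0.05; by D = 100 over 99% of the volume is in the outer 5% shell.
```

This is the quantitative content of the quoted passage: in high dimension nearly all volume concentrates near the surface, so "the centre becomes less important".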

187 |
The Grand Tour: A Tool for Viewing Multidimensional Data
- Asimov
- 1985
Citation Context ...t allow one to visualise up to about 5-dimensional data sets, using colours, rotation, stereography, glyphs or other devices, but they lack the appeal of a simple plot; a well-known one is the grand tour [1]. Chernoff faces [13] allow even a few more dimensions, but are difficult to interpret and do not produce a spatial view of the data. [Figure 2: an example curve f(t) = (R sin 2πt, R cos 2πt, st)^T]

187 |
Smoothing by spline functions
- Reinsch
- 1967
Citation Context ...= E{y}; for k = 1, f̂_h(x) is a step function matching y_i at x_i and jumping in the middle between adjacent x_i. F.3.3 Spline regression smoothing Assume D = 1: x = x ∈ R and x_1 ≤ · · · ≤ x_n. Reinsch [83] proved that the variational problem (where λ > 0 is a smoothing parameter) min_f̂ S_λ(f̂) = Σ_{i=1}^n (y_i − f̂(x_i))² + λ‖f̂″‖₂² (F.1) has a unique solution f̂ = f̂_h(x) ∈ C²([x_1, x_n]): the ...
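A discrete analogue of the variational problem above can be solved as a linear system (a sketch, not the exact natural-spline solution: the second derivative is replaced by second differences on an equispaced grid, and the data are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(5)
# Minimise  sum_i (y_i - f_i)^2 + lam * sum_i (f_{i+1} - 2 f_i + f_{i-1})^2,
# i.e. solve (I + lam * D2^T D2) f = y  for the second-difference operator D2.
n = 50
x = np.linspace(0, 1, n)
y = np.sin(2 * np.pi * x) + 0.2 * rng.normal(size=n)

D2 = np.diff(np.eye(n), n=2, axis=0)     # (n-2) x n second-difference matrix

def smooth(lam):
    return np.linalg.solve(np.eye(n) + lam * D2.T @ D2, y)

# lam -> 0 interpolates the data; larger lam approaches a least-squares line.
f_rough = smooth(1e-8)
f_smooth = smooth(1e-1)
print(np.abs(f_rough - y).max() < 1e-3)  # tiny penalty: essentially interpolation
```

The two limits mirror the quoted behaviour of S_λ: λ near 0 reproduces the data, while increasing λ trades fidelity for a smoother (eventually linear) fit.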

Curvilinear component analysis: a self-organizing neural network for nonlinear mapping of data sets
- Demartines, Hérault
- 1997
Citation Context ...e the Helmholtz machine [19, 20], some variations of self-organising maps (growing neural gas [39, 40], Bayesian approaches [96, 97], etc.), population codes [100], and curvilinear component analysis [22], among others. A Glossary • Backfitting algorithm: an iterative method to fit additive models, by fitting each term to the residuals given the rest (see section 3.8.1). It is a version of the Gaus...

173 |
Neural networks: a review from a statistical perspective
- Cheng, Titterington
- 1994
Citation Context ...the activation of the hidden layer, v, and the weights of the second layer, β: y = f(ϕ(v, β)). The following particular cases of this network implement several of the techniques previously mentioned [12]: • Projection pursuit regression (cf. eq. (3.8)): ϕ(v, β) = v^T 1, v_k = g_k(w_k^T x) ⇒ y = Σ_{k=1}^j g_k(w_k^T x). The activation functions g_k are determined from the data during training; w_k represent th...

148 |
Neural networks and related methods for classification
- Ripley
- 1994
Citation Context ... approximated by a sigmoidal MLP of one input. Therefore, the approximation capabilities of MLPs and PP are very similar [25, 63]. This architecture admits generalisations to several output variables [84] depending on whether the outputs share the common “basis functions” gk and, if not, whether the separate gk share common projection directions wk. • Generalised additive models: j = D, wk0 = 0, wk = e...

147 |
An Introduction to Latent Variable Models
- Everitt
- 1984
Citation Context ...e system. Sometimes, a phenomenon which is in appearance high-dimensional, and thus complex, can actually be governed by a few simple variables (sometimes called “hidden causes” or “latent variables” [29, 20, 21, 6, 74]). Dimension reduction can be a powerful tool for modelling such phenomena and improving our understanding of them (as often the new variables will have an interpretation). For example: • Genome sequenc...

138 |
Smoothing Techniques: With Implementation in S
- Härdle
- 1990
Citation Context ...Problems: the choice of origin can affect the estimate. The next estimators are independent of origin choice. 33 This appendix is mainly based on the books by Silverman [93], Scott [89] and Haerdle [45]. 34 It is also possible to estimate the cdf F(x) as F̂(x) = (1/n) Σ_{i=1}^n I_{(−∞,x]}(X_i) and then obtain from it the pdf f = dF/dx. It can be shown that this is an unbiased estimator and has the smallest variance of all...

136 |
Principal components, minor components, and linear neural networks
- Oja
- 1992
Citation Context ...d Hornik [2] showed that this network finds a basis of the subspace spanned by the first h PCs, not necessarily coincident with them; see [11, 15] for applications. • Networks based on Oja’s rule [77] with some kind of decorrelating device (e.g. Kung and Diamantaras’ APEX [72], Földiák’s network [32], Sanger’s Generalised Hebbian Algorithm [88]). [Figure 9: two examples of neural networks able to perform a principal component analysis: a linear autoassociator and the APEX network]
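Oja's rule mentioned above can be sketched in a few lines (NumPy; the data, seed and learning rate are invented for illustration): the weight vector of a single linear neuron converges to the leading eigenvector of the data covariance, i.e. the first principal component direction:

```python
import numpy as np

rng = np.random.default_rng(6)
# Data with a dominant principal direction along the first axis (illustrative).
C_true = np.diag([4.0, 1.0, 0.25])
X = rng.normal(size=(5000, 3)) @ np.sqrt(C_true)

# Oja's rule: dw = eta * c * (x - c * w), with neuron output c = w . x.
# The decay term -eta * c^2 * w keeps ||w|| near 1 without explicit normalisation.
w = rng.normal(size=3)
w /= np.linalg.norm(w)
eta = 0.005
for x in X:
    c = w @ x
    w += eta * c * (x - c * w)

lead = np.array([1.0, 0.0, 0.0])         # true first PC direction here
alignment = abs(w @ lead) / np.linalg.norm(w)
print(round(alignment, 2))               # alignment with the first PC, near 1
```

Sanger's Generalised Hebbian Algorithm cited above extends this single-unit rule to extract several components by deflating each unit's input with the preceding units' reconstructions.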

131 |
Principal Component Neural Networks: Theory and Applications
- Diamantaras, Kung
- 1996
Citation Context ...on regularisation and wavelets. 6.2 Principal component analysis networks There exist several neural network architectures capable of extracting PCs (see fig. 9), which can be classified as follows (see [26] or [11] for more details): • Autoassociators (also called autoencoders, bottlenecks or n-h-n networks), which are linear two-layer perceptrons with n inputs, h hidden units and n outputs, trained to ...

98 |
What is projection pursuit
- Jones, Sibson
- 1987
Citation Context ...he term projection pursuit was introduced by Friedman and Tukey [38] in 1974, along with the first projection index. Good reviews of projection pursuit can be found in Huber [53] and Jones and Sibson [65]. 9 Varimax rotation [66] is a procedure that, given a subspace or projection, selects a new basis for it that maximises the variance but giving large loadings to as few variables as possible. The pro...

96 |
Asymptotics of graphical projection pursuit
- Diaconis, Freedman
- 1984
Citation Context ...t information, in both the senses of Fisher information and negative entropy [16]. • For most high-dimensional clouds, most low-dimensional projections are approximately normal (Diaconis and Freedman [24]). We will consider the normal distribution as the least structured (or least interesting) density. For example, figure 5 shows two 2-D projections of a 3-D data set consisting of two clusters. The pr...

87 | Objective function formulation of the BCM theory of visual cortical plasticity
- Intrator, Cooper
- 1992
Citation Context ...Jones and Sibson [65] propose two ways to evaluate the entropy index (3.6): – Implementing... (Footnotes: 11 Although many synaptic plasticity models are based on second-order statistics and lead to the extraction of principal components [59]. 12 The visualisation program Xgobi [95, 68] implements several of these indices. 13 In a variant of projection pursuit they call projection pursuit exploratory data analysis (PPEDA).)

85 |
Auto-association by multilayer perceptrons and singular value decomposition
- Bourlard, Kamp
- 1988
Citation Context ...ons with n inputs, h hidden units and n outputs, trained to replicate the input in the output layer minimising the squared sum of errors, and typically trained with backpropagation. Bourlard and Kamp [9] and Baldi and Hornik [2] showed that this network finds a basis of the subspace spanned by the first h PCs, not necessarily coincident with them; see [11, 15] for applications. • Networks based i...

79 |
Calculus on Manifolds: A Modern Approach to Classical Theorems of Advanced Calculus
- Spivak
- 1965
Citation Context ...eorem introduces the concept of coordinate system of a manifold. Theorem H.2. M ⊂ R^n is a k-manifold of R^n iff for all x ∈ M the following condition holds: (C) There exist two open sets U ⊂ R^n, W ⊂ R^k with x ∈ U, and a differentiable one-to-one mapping f : W −→ R^n such that: 1. f(W) = M ∩ U. 2. The Jacobian f′(y) has rank k for all y ∈... (Footnote 40: which is mainly based on Spivak’s book [94].)

69 | Regression Modeling in Back-Propagation and Projection Pursuit Learning
- Hwang, Lay, et al.
- 1994
Citation Context ...[Figure 10: autoencoder, implemented as a four-layer nonlinear perceptron where L < D and x̂ = G(R(x))] 6.3 Projection pursuit learning network (PPLN) Hwang et al. [54] propose the two-layer perceptron depicted in fig. 11 with a projection pursuit learning (PPL) algorithm to solve the multivariate nonparametric regression problem of section F.1. The outputs...

68 | Adaptive network for optimal linear feature extraction
- Foldiak
- 1989
Citation Context ...ecessarily coincident with them; see [11, 15] for applications. • Networks based on Oja’s rule [77] with some kind of decorrelating device (e.g. Kung and Diamantaras’ APEX [72], Földiák’s network [32], Sanger’s Generalised Hebbian Algorithm [88]). [Figure 9: two examples of neural networks able to perform a principal component analysis: a linear autoassociator and the APEX network]

67 | Projection Pursuit Density Estimation
- Friedman, Stuetzle, et al.
- 1984
Citation Context ...nue until j = D or D(f||g) = 0, when f̂ will be the Gaussian density. 3.7.3 Projection pursuit density estimation (PPDE) Projection pursuit density estimation (PPDE; Friedman, Stuetzle and Schroeder [37]) is appropriate when the variation of densities is concentrated in a linear manifold of the high-dimensional space. Given the data sample in R^D, it operates as follows: 1. Sphere the data. 2. Take ...