## Factor analysis using delta-rule wake-sleep learning (1997)


### Download Links

- [www.informatik.uni-osnabrueck.de]
- [www.gatsby.ucl.ac.uk]
- [www.gatsby.ucl.ac.uk]
- [ftp.cs.utoronto.ca]
- DBLP

Venue: Neural Computation

Citations: 26 (3 self)

### BibTeX

@ARTICLE{Neal97factoranalysis,
  author  = {Radford M. Neal and Peter Dayan},
  title   = {Factor analysis using delta-rule wake-sleep learning},
  journal = {Neural Computation},
  year    = {1997}
}


### Abstract

We describe a linear network that models correlations between real-valued visible variables using one or more real-valued hidden variables: a factor analysis model. This model can be seen as a linear version of the “Helmholtz machine”, and its parameters can be learned using the “wake-sleep” method, in which learning of the primary “generative” model is assisted by a “recognition” model, whose role is to fill in the values of hidden variables based on the values of visible variables. The generative and recognition models are jointly learned in “wake” and “sleep” phases, using just the delta rule. This learning procedure is comparable in simplicity to Oja’s version of Hebbian learning, which produces a somewhat different representation of correlations in terms of principal components. We argue that the simplicity of wake-sleep learning makes factor analysis a plausible alternative to Hebbian learning as a model of activity-dependent cortical plasticity.
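For concreteness, the single-factor case of this wake-sleep procedure can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's exact algorithm: the noise standard deviations are held fixed rather than learned, and the synthetic-data setup and names such as `g_true` are invented for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data from a true single-factor model: x = g_true * y + sensor noise.
n_vis = 5
g_true = rng.normal(size=n_vis)
data = g_true * rng.normal(size=(20000, 1)) + 0.1 * rng.normal(size=(20000, n_vis))

g = rng.normal(scale=0.1, size=n_vis)  # generative weights (factor loadings)
r = rng.normal(scale=0.1, size=n_vis)  # recognition weights
lr = 0.01
noise = 0.1  # fixed noise std for both models -- a simplification; the paper learns variances

for x in data:
    # Wake phase: the recognition model fills in the hidden factor y,
    # then each generative weight follows the delta rule on its prediction error.
    y = r @ x + noise * rng.normal()
    g += lr * (x - g * y) * y

    # Sleep phase: "dream" a case from the generative model,
    # then each recognition weight follows the delta rule on its prediction error.
    y_s = rng.normal()
    x_s = g * y_s + noise * rng.normal(size=n_vis)
    r += lr * (y_s - r @ x_s) * x_s
```

Both phases use only the delta rule: each weight moves in proportion to a prediction error times the activity at its input, which is what makes the procedure comparable in simplicity to Hebbian learning.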

### Citations

9486 | Maximum likelihood from incomplete data via the EM algorithm - Dempster, Laird, et al. - 1977 |

2420 | Principal Component Analysis - Jolliffe - 1986 |
Citation Context: ...oblem in neurobiological modeling. Although there are some special circumstances in which factor analysis is equivalent to principal component analysis, the techniques are in general quite different (Jolliffe 1986). Loosely speaking, principal component analysis pays attention to both variance and covariance, whereas factor analysis looks only at covariance. In particular, if one of the components of x is corr... |

1060 | Emergence of simple-cell receptive field properties by learning a sparse code for natural images - Olshausen, Field - 1996 |

869 | Adaptive mixture of local experts - Jacobs, Jordan, et al. - 1991 |

671 | Distributed hierarchical processing in the primate cerebral cortex - Felleman, Essen - 1991 |

490 | The Organization of Behavior: A Neuropsychological Theory - Hebb - 1949 |

337 | Mood and memory - Bower - 1981 |

322 | Self-organization in a perceptual network - Linsker - 1988 |

307 | Learning and relearning in Boltzmann machines - Hinton, Sejnowski - 1986 |
Citation Context: ...ases. Such balanced operation of the two phases is not essential for convergence to the correct solution, however. This contrasts with the "wake" and "sleep" phases of learning in Boltzmann Machines (Hinton and Sejnowski 1986), in which the two phases must be exactly balanced for the learning to follow an appropriate gradient. 3.4 Wake-sleep learning for multiple-factor models A Helmholtz machine with more than one hidden... |

249 | Unsupervised learning - Barlow - 1989 |
Citation Context: ...ing as a goal for subsequent levels of processing, once sensory signals have reached cortex. Several other computational goals have been suggested from this stage upwards, including factorial coding (Barlow 1989), sparsification (Olshausen and Field 1995), and various methods for encouraging the cortex to respect reasonable invariances, such as translation or scale invariance for visual processing (Li and At... |

243 | Optimal unsupervised learning in a single-layer linear feedforward neural network - Sanger - 1989 |
Citation Context: ...ance in the input space. Extracting the subsidiary eigenvectors of the covariance matrix of the inputs is somewhat more challenging, requiring some form of inhibition between successive output units (Sanger 1989; Földiák 1989; Plumbley 1993). Linsker (1988) views Hebbian learning as a way of maximising the information retained by y about x. Under the simplifying assumption that the distribution of the input... |

228 | Toward a modern theory of adaptive networks: Expectation and prediction - Sutton, Barto - 1981 |

212 | Neural Networks, Principal Components, and Subspaces - Oja - 1989 |
Citation Context: ...nobserved factors. Our interest in these models stems from their potential as a way of building high-level representations from sensory data. Oja's version of Hebbian learning (Oja and Karhunen 1985; Oja 1989, 1992) is a particularly convenient counterpoint. This rule applies to a linear unit with weight vector w that computes an output y = w^T x when presented with a real-valued input vector x (which, fo... |
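The rule referred to in this excerpt can be sketched as follows. This is a hypothetical minimal demo of Oja's single-unit rule on synthetic data; the data setup is invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Data with one dominant direction of variance, along the first axis.
n_vis = 5
u = np.zeros(n_vis)
u[0] = 1.0
data = 2.0 * rng.normal(size=(10000, 1)) * u + 0.3 * rng.normal(size=(10000, n_vis))

w = rng.normal(scale=0.1, size=n_vis)  # weight vector of the single linear unit
lr = 0.005
for x in data:
    y = w @ x                  # the unit's output y = w^T x
    w += lr * y * (x - y * w)  # Oja's rule: Hebbian term y*x with implicit weight decay

# w tends toward the unit-norm principal eigenvector of the input covariance,
# i.e. here roughly plus or minus the first coordinate axis.
```

Unlike wake-sleep factor analysis, the stable fixed point of this rule is a principal component: w settles on the unit-norm dominant eigenvector of the input covariance, capturing variance as well as covariance structure.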

209 | The Helmholtz machine - Dayan, Hinton, et al. - 1995 |

205 | From basic network principles to neural architecture - Linsker - 1986 |
Citation Context: ...te brain has typically been modeled in terms of Hebbian learning (Hebb 1949), in which weight changes are based on the covariance of pre-synaptic and post-synaptic activity (eg, von der Malsburg 1973; Linsker 1986; Miller, Keller, and Stryker 1989). These models derive support from neurobiological evidence of long-term potentiation (see, for example, Collingridge and Bliss (1987), and for a recent review, Baud... |

160 | Ocular dominance column development: Analysis and simulation - Miller, Keller, et al. - 1989 |

147 | An Introduction to Latent Variable Models - Everitt - 1984 |

136 | Principal components, minor components, and linear neural networks - Oja - 1992 |

120 | Multiscale recursive estimation, data fusion, and regularization - Chou, Willsky, et al. - 1994 |

116 | Autoencoders, minimum description length, and Helmholtz free energy. Advances in Neural Information Processing Systems (NIPS) - Hinton, Zemel - 1994 |

115 | Self-organization of orientation sensitive cells in the striate cortex - Malsburg - 1973 |

113 | EM algorithms for ML factor analysis - Rubin, Thayer - 1982 |
Citation Context: ...ts analysis using a standard matrix technique such as singular-value decomposition rather than by using Hebbian learning, factor analysis is probably better implemented on a computer using either EM (Rubin and Thayer 1982) or the second-order Newton methods of Jöreskog (1967, 1969, 1977) than by the wake-sleep algorithm. In our view, the application of the wake-sleep algorithm to factor analysis is interesting as a po... |
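The EM alternative mentioned here can be sketched for a single-factor model. This is a rough illustration assuming zero-mean data; the synthetic setup and variable names are invented, and a practical implementation would add a convergence check.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic single-factor data: x = g_true * y + diagonal Gaussian noise.
n_vis, n_data = 4, 5000
g_true = np.array([1.0, -0.5, 0.8, 0.3])
X = g_true * rng.normal(size=(n_data, 1)) + 0.2 * rng.normal(size=(n_data, n_vis))
S = X.T @ X / n_data                   # sample covariance (data assumed zero-mean)

g = rng.normal(scale=0.5, size=n_vis)  # factor loadings
psi = np.ones(n_vis)                   # diagonal noise variances

for _ in range(100):
    # E-step: exact posterior moments of the hidden factor for each case.
    sigma = np.outer(g, g) + np.diag(psi)  # model covariance of x
    beta = np.linalg.solve(sigma, g)       # E[y|x] = beta @ x
    Ey = X @ beta                          # posterior means, one per case
    Eyy = 1.0 - beta @ g + Ey**2           # posterior second moments E[y^2|x]
    # M-step: regression-style updates for the loadings and noise variances.
    g = (X.T @ Ey) / Eyy.sum()
    psi = np.diag(S) - g * (X.T @ Ey) / n_data
```

Each E-step computes the exact posterior over the hidden factor, and each M-step solves a regression for the loadings g and noise variances psi, in contrast to the purely local, online delta-rule updates of wake-sleep.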

103 | On stochastic approximation of the eigenvectors and eigenvalues of the expectation of a random matrix - Oja, Karhunen - 1985 |
Citation Context: ...of a small number of unobserved factors. Our interest in these models stems from their potential as a way of building high-level representations from sensory data. Oja's version of Hebbian learning (Oja and Karhunen 1985; Oja 1989, 1992) is a particularly convenient counterpoint. This rule applies to a linear unit with weight vector w that computes an output y = w^T x when presented with a real-valued input vector x... |

96 | A general approach to confirmatory maximum likelihood factor analysis - Jöreskog - 1969 |

85 | Neuronal architecture for pattern-theoretic problems - Mumford - 1993 |
Citation Context: ...nable invariances, such as translation or scale invariance for visual processing (Li and Atick 1994). In this paper, we pursue the suggestion of Hinton and Zemel (1994) (see also Grenander 1976-1981; Mumford 1994; Dayan, Hinton, Neal, and Zemel 1995) that the cortex might be constructing a hierarchical stochastic "generative" model of its input in the top-down connections, while implementing in the bottom-up... |

83 | A model for the development of simple cell receptive fields and the ordered arrangement of orientation columns through activity-dependent competition between ON- and OFF-center inputs - Miller - 1994 |
Citation Context: ...er than the other (Miller, Keller, and Stryker 1989), and orientation domains, in which groups of nearby cells respond to bars of light of nearby position and orientation on the retina (Linsker 1988; Miller 1994). However, Hebbian learning is theoretically unsatisfying in two respects. First, it is justified as maximising the transfer of Shannon information through a network. This may be undesirable in situa... |

73 | How Patterned Neural Connections Can Be Set Up by Self-Organization - Willshaw, Malsburg - 1976 |

72 | The role of constraints in Hebbian learning - Miller, MacKay - 1994 |

68 | Adaptive network for optimal linear feature extraction - Földiák - 1989 |
Citation Context: ...nput space. Extracting the subsidiary eigenvectors of the covariance matrix of the inputs is somewhat more challenging, requiring some form of inhibition between successive output units (Sanger 1989; Földiák 1989; Plumbley 1993). Linsker (1988) views Hebbian learning as a way of maximising the information retained by y about x. Under the simplifying assumption that the distribution of the inputs is Gaussian,... |

61 | Likelihood calculation for a class of multiscale stochastic models, with application to texture discrimination - Luettgen, Willsky - 1995 |
Citation Context: ...(Ghahramani and Hinton, personal communication; Rao and Ballard 1995). Willsky and his colleagues (Chou, Willsky, and Benveniste 1994; Chou, Willsky, and Nikoukhah 1994; Krim, Willsky, and Karl 1994; Luettgen and Willsky 1995) have built a sophisticated multi-resolution tree architecture for images that combines interconnected factor analysers at different spatial resolutions. The advantage of the tree is that the E-step... |

59 | Recognizing handwritten digits using mixtures of linear models - Hinton, Revow, et al. - 1995 |
Citation Context: ...bler divergence in which the two distributions appear in the wrong order, and in which the distribution over visible variables is that produced by the generative model rather than the external world (Hinton et al. 1995). An algorithm that correctly performs stochastic gradient descent in the recognition parameters using a correct cost function does exist (Dayan and Hinton 1996), but unfortunately, it involves reinf... |

55 | Some contributions to maximum likelihood factor analysis - Jöreskog - 1967 |
Citation Context: ...of the technique early in this century. In the 1960s, computationally feasible algorithms were developed for performing factor analysis by the statistically attractive method of maximum likelihood (Jöreskog 1967, 1969, 1977). In maximum likelihood learning, the parameters of the model are chosen so as to maximize the probability density assigned by the model to the data that were observed (the "likelihood")... |

43 | The Predictive Brain: Temporal Coincidence and Temporal Order in Synaptic Learning Mechanisms - Montague, Sejnowski - 1994 |
Citation Context: ...tion). Rules equivalent to the delta rule are conventional in classical conditioning (Rescorla and Wagner 1972; Sutton and Barto, 1981) and have also been suggested as underlying cortical plasticity (Montague and Sejnowski 1994). Of course, the wake-sleep learning rule requires two phases of activation, with different connections being primarily responsible for driving the cells in each phase. Although there is some suggest... |

38 | Multiscale systems, Kalman filters, and Riccati equations - Chou, Willsky, et al. - 1994 |

37 | Toward a theory of the striate cortex - Li, Atick - 1994 |
Citation Context: ...rlow 1989), sparsification (Olshausen and Field 1995), and various methods for encouraging the cortex to respect reasonable invariances, such as translation or scale invariance for visual processing (Li and Atick 1994). In this paper, we pursue the suggestion of Hinton and Zemel (1994) (see also Grenander 1976-1981; Mumford 1994; Dayan, Hinton, Neal, and Zemel 1995) that the cortex might be constructing a hierarch... |

36 | A theory of Pavlovian conditioning: the effectiveness of reinforcement and nonreinforcement - Rescorla, Wagner - 1972 |
Citation Context: ...tions (such as g_j y^(c) of equation (11)) or prediction errors (such as x_j^(c) - g_j y^(c) of the same equation). Rules equivalent to the delta rule are conventional in classical conditioning (Rescorla and Wagner 1972; Sutton and Barto, 1981) and have also been suggested as underlying cortical plasticity (Montague and Sejnowski 1994). Of course, the wake-sleep learning rule requires two phases of activation, with... |

28 | Efficient information transfer and anti-Hebbian neural networks - Plumbley - 1993 |
Citation Context: ...racting the subsidiary eigenvectors of the covariance matrix of the inputs is somewhat more challenging, requiring some form of inhibition between successive output units (Sanger 1989; Földiák 1989; Plumbley 1993). Linsker (1988) views Hebbian learning as a way of maximising the information retained by y about x. Under the simplifying assumption that the distribution of the inputs is Gaussian, setting the out... |

24 | How to label nerve cells so that they can interconnect in an ordered fashion - Malsburg, Willshaw - 1977 |

21 | A marker induction mechanism for the establishment of ordered neural mappings: Its application to the retinotectal problem - Willshaw, Malsburg - 1979 |

19 | NMDA receptors: Their role in long-term potentiation. Trends in Neurosciences - Collingridge, Bliss - 1987 |

19 | The wake-sleep algorithm for self-organizing neural networks - Hinton, Dayan, et al. - 1995 |
Citation Context: ...bler divergence in which the two distributions appear in the wrong order, and in which the distribution over visible variables is that produced by the generative model rather than the external world (Hinton et al. 1995). An algorithm that correctly performs stochastic gradient descent in the recognition parameters using a correct cost function does exist (Dayan and Hinton 1996), but unfortunately, it involves reinf... |

17 | Factor analysis by least-squares and maximum-likelihood methods - Jöreskog - 1977 |

15 | Sparse coding of natural images produces localized, oriented, bandpass receptive fields - Olshausen, Field - 1996 |
Citation Context: ...s largely determines the width of the stripes. Lateral connections have also been used with generative models similar to those of Helmholtz machines for which inference is done by mean field methods (Olshausen and Field, 1996; Rao and Ballard, 1995). Lateral connections are not present in the linear Helmholtz machines we have so far described, but they could play a role in inducing correlations between the hidden factors... |

9 | Dynamic model of visual memory predicts neural response properties in the visual cortex - Rao, Ballard - 1995 |
Citation Context: ...oportions to vary under the control of a gating network (Jacobs, Jordan, Nowlan, and Hinton 1991). Another possibility is to build a hierarchical model (Ghahramani and Hinton, personal communication; Rao and Ballard 1995). Willsky and his colleagues (Chou, Willsky, and Benveniste 1994; Chou, Willsky, and Nikoukhah 1994; Krim, Willsky, and Karl 1994; Luettgen and Willsky 1995) have built a sophisticated multi-resoluti... |

6 | Lectures in Pattern Theory I, II and III: Pattern Analysis, Pattern Synthesis and Regular Structures - Grenander - 1976 |


2 | Long-Term Potentiation: A Debate of Current Issues - Baudry, Davis - 1991 |
