## Optimization of Entropy with Neural Networks (1995)

Citations: 7 (3 self)

### BibTeX

```
@techreport{Schraudolph95optimizationof,
  author      = {Nicol Norbert Schraudolph},
  title       = {Optimization of Entropy with Neural Networks},
  institution = {},
  year        = {1995}
}
```

### Citations

9216 | Elements of Information Theory
- Cover, Thomas
- 1991
Citation Context: ...thermodynamics it never decreases in closed systems. Shannon's insight as to the equivalence of entropy and information spawned a half-century of extensive research into information theory, to which (Cover and Thomas, 1991) provide a comprehensive introduction. One of the early results of this endeavor was a number of theorems on the topic of optimal coding that provided further justification for H(X) as a measure of ...

4294 | Neural Networks: A Comprehensive Foundation
- Haykin
- 1994
Citation Context: ...as published, neural networks have become a popular (if not the preferred) framework for machine learning. A thorough introduction to this topic can be found in textbooks such as (Hertz et al., 1991; Haykin, 1994); the current state of the art in this very active research area is covered comprehensively in (Arbib, 1995). Here we limit ourselves to a brief introduction of the basic concepts, techniques and not...

4172 | Pattern Classification and Scene Analysis
- Duda, Hart
- 1973
Citation Context: ...metric model can represent, while the other optimizes an objective that is only indirectly related to the data. We therefore turn to another alternative: Parzen window (or kernel) density estimation (Duda and Hart, 1973, chapter 4.3). This nonparametric technique assumes that the output density is a smoothed version of an empirical sample rather than any particular functional form --- in a way, the data sample itsel...
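
The Parzen window estimator described in the context above admits a compact sketch. The following is a minimal, hypothetical 1-D illustration with a Gaussian kernel of width `h`; the function name and smoothing parameter are choices of this example, not the thesis's notation:

```python
import math

def parzen_density(x, sample, h):
    """Parzen window (kernel) density estimate at point x: the empirical
    sample is smoothed by a Gaussian kernel of width h, so the estimate
    assumes no particular parametric form for the underlying density."""
    n = len(sample)
    norm = 1.0 / (n * h * math.sqrt(2.0 * math.pi))
    return norm * sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in sample)
```

Shrinking `h` makes the estimate spiky around the sample points; growing it smooths the sample toward a single broad bump.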

3931 | Optimization by simulated annealing
- Kirkpatrick, Gelatt, et al.
- 1983
Citation Context: ...on has to be performed for each update. Moreover, the noise associated with small sample sizes can in fact be very effective in helping gradient descent escape from local minima. Simulated Annealing (Kirkpatrick et al., 1983) is a global optimization technique which gradually progresses from a noisy to a noise-free regime in order to obtain accurate results while retaining the ability to escape from local minima. Analogo...

3017 | Learning Internal Representations by Error Propagation (in Parallel Distributed Processing: Explorations in the Microstructure of Cognition)
- Rumelhart, Hinton, et al.
- 1986
Citation Context: ...supervised learning that can be implemented with neural networks: the minimum description length principle, and exploratory projection pursuit. I.B.1 Learning with Neural Networks: In the years since (Rumelhart et al., 1986b) was published, neural networks have become a popular (if not the preferred) framework for machine learning. A thorough introduction to this topic can be found in textbooks such as (Hertz et al., 19...

2296 | Principal Component Analysis
- Jolliffe
- 1986
Citation Context: ...iate weight normalization and differentiation characteristics. II.A Learning Invariant Structure, II.A.1 Motivation: Many unsupervised learning algorithms share with principal component analysis (Jolliffe, 1986) the strategy of extracting the directions of highest variance from the input. A single Hebbian node, for instance, will come to encode the input's first principal component (Oja and Karhunen, 1985);...

2188 | Numerical Recipes in C: The Art of Scientific Computing
- Press, Flannery, et al.
- 1992
Citation Context: ...be generated by far shorter programs: the digits of π or √2 for instance can be computed to arbitrary precision with short programs, so they have very low Kolmogorov complexity. Pseudo-random numbers (Press et al., 1992, chapter 7) are in fact designed to have very high entropy but low algorithmic complexity. In its pure form, Kolmogorov complexity is unfortunately not computable: there is no algorithm for finding t...

1793 | An Introduction to Kolmogorov Complexity and its Applications (3rd edition)
- Li, Vitányi
- 2008
Citation Context: ...ive objective functions for unsupervised learning from information-theoretic arguments. I.B.2 The Minimum Description Length Principle: The minimum description length or MDL principle (Rissanen, 1989; Li and Vitányi, 1993) can be understood as an attempt to put the idea of Kolmogorov complexity into practice. Its starting point is William of Occam's famous dictum: Causes shall not be multiplied beyond necessity. ...

1692 | Finding Structure in Time
- Elman
- 1990
Citation Context: ...ailable in the error signal of a supervised network. The most popular way of doing this is to require the network to predict the next patch of some structured input from the preceding context, as in (Elman, 1990); the same prediction technique can be used across space as well as time (Fontaine and Shastri, 1993). It is also possible to explicitly derive an error signal from the mutual information between two ...

1490 | Robot Vision
- Horn
- 1986
Citation Context: ...without constructing a model of the object. An important application area where this is required is the registration of medical images such as CT and MRI scans. Another example is photometric stereo (Horn, 1986), where two identical views of an object under different lighting conditions are used to infer a 3-D model whose pose can then be aligned to novel views. Nonparametric entropy optimization is capa...

1439 | Self-Organization and Associative Memory
- Kohonen
- 1988
Citation Context: ...rning rate be positive. By reversing the sign of this constant in a recurrent autoassociator, Kohonen constructed a "novelty filter" that learned to be insensitive to familiar features in its input (Kohonen, 1989). More recently, such anti-Hebbian synapses have been used for lateral decorrelation of feature detectors (Barlow and Földiák, 1989; Földiák, 1989; Leen, 1991) as well as --- in differential form ---...

464 | A learning algorithm for Boltzmann machines
- Ackley, Hinton, et al.
- 1985
Citation Context: ...bution is analytically available, information-theoretic objectives such as the entropy of the network can be optimized directly. A well-known example of this type of network is the Boltzmann Machine (Ackley et al., 1985; Hinton and Sejnowski, 1986); the Helmholtz Machine (Dayan et al., 1995) is an interesting recent development in this field. Note that networks of probabilistic nodes need not necessarily be stochasti...

452 | A simplified neuron model as a principal component analyzer
- Oja
- 1982
Citation Context: ...g weight vectors, as we did in our stereogram experiments. However, computation of the weight vector length is nonlocal, and therefore neurobiologically implausible and computationally unattractive. (Oja, 1982) has developed an "active decay" rule that locally approximates explicit weight normalization: Δw = η (y x − y² w) (II.10). Here the first term in parentheses represents the standard ...

360 | Increased rates of convergence through learning rate adaptation
- Jacobs
- 1988
Citation Context: ...mizes the log-likelihood elegantly and efficiently. For even faster and more robust convergence we can adapt the step size parameters η_k(t) via a mechanism akin to the delta-bar-delta algorithm (Jacobs, 1988): η_k(t+1) = 2 η_k(t) if δ_k(t) δ_k(t−1) > 0, and η_k(t)/3 otherwise (IV.13). We find that this algorithm reliably converges to the optimal kernel shape in just a few iteration...
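
The step-size schedule in IV.13 can be written out directly: each parameter keeps its own step size η_k, doubled when successive gradient signals δ_k agree in sign and cut to a third otherwise. A minimal sketch (function and argument names are this example's, not the thesis's):

```python
def adapt_step_sizes(eta, delta, delta_prev):
    """Per-parameter step-size schedule in the spirit of delta-bar-delta:
    eta_k doubles when delta_k(t) * delta_k(t-1) > 0, else shrinks to a third."""
    return [2.0 * e if d * dp > 0.0 else e / 3.0
            for e, d, dp in zip(eta, delta, delta_prev)]
```

The asymmetry (×2 up, ÷3 down) means oscillating parameters shrink their steps on balance, while parameters with a consistent gradient direction accelerate.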

333 | Theory of the development of neuron selectivity: orientation specificity and binocular interaction in visual cortex
- Bienenstock, Cooper, et al.
- 1982
Citation Context: ...lgorithm by incorporating additional information in the estimator. We share the goal of seeking highly informative, bimodal projections of the input with the Bienenstock-Cooper-Munro (BCM) algorithm (Bienenstock et al., 1982). In our notation, the BCM learning rule according to (Intrator, 1990, 1991a, 1991b, 1992) is Δw ∝ f′(y) x (z² − (4/3) z ⟨z²⟩) (III.17). Whereas in the derivation of BINGO some o...

315 | Self-organization in a perceptual network
- Linsker
- 1988
Citation Context: ...nik, 1989). Since the entropy of a probability density scales with its log-variance, these techniques can also be viewed as a special (linear) case of the maximum information preservation or Infomax (Linsker, 1988; Bell and Sejnowski, 1995) strategy. A comprehensive approach to unsupervised learning, however, must also take the cost of representing the data into account. From this perspective maximum informati...

305 | Learning and Relearning in Boltzmann Machines
- Hinton, Sejnowski
- 1986
Citation Context: ...y available, information-theoretic objectives such as the entropy of the network can be optimized directly. A well-known example of this type of network is the Boltzmann Machine (Ackley et al., 1985; Hinton and Sejnowski, 1986); the Helmholtz Machine (Dayan et al., 1995) is an interesting recent development in this field. Note that networks of probabilistic nodes need not necessarily be stochastic though: we may use a stoch...

269 | Control methods used in a study of vowels
- Peterson, Barney
- 1952
Citation Context: ...xample), a better theoretical argument (Figure I.4: A simple example of exploratory projection pursuit. On the left, a random 2-D projection of the 4-D data set of vowel formant frequencies due to (Peterson and Barney, 1952). On the right, a 2-D projection of the same data set that has been optimized by exploratory projection pursuit, using the Friedman-Tukey projection index.) can be made for entropy as a projection ind...

262 | A projection pursuit algorithm for exploratory data analysis
- Friedman, Tukey
- 1974
Citation Context: ...al network techniques. Variants of projection pursuit have been developed for tasks such as regression and density estimation; its unsupervised realization is known as exploratory projection pursuit (Friedman and Tukey, 1974). Like other approaches to unsupervised learning, exploratory projection pursuit must address the question of what to learn when there is no teacher. What objective function --- in this context also ...

255 | Learning invariance from transformation sequences
- Földiák
- 1991
Citation Context: ...this might be achieved. In fact several methods exist that reconcile this dichotomy obliquely; we briefly review these before turning to our own, more direct approach. II.A.2 Previous Approaches: In (Földiák, 1991), spatial invariance is turned into a temporal feature by using transformation sequences within invariance classes as a stimulus. For translation-invariant object recognition, for instance, short seq...

243 | Projection pursuit
- Huber
- 1985
Citation Context: ...e; we note here that when X is encoded with a neural network, the MDL objective demands that the entropy at the network's output be minimized. I.B.3 Exploratory Projection Pursuit: Projection pursuit (Huber, 1985) is a statistical method that condenses high-dimensional data by projecting it into a low-dimensional subspace before further processing. This facilitates subsequent analysis by methods that would be...

235 | Optimal unsupervised learning in a single-layer linear feedforward network
- Sanger
- 1989
Citation Context: ...rincipal component (Oja and Karhunen, 1985); various forms of lateral interaction can be used to force a layer of such nodes to differentiate and span the principal component subspace (Földiák, 1989; Sanger, 1989; Oja, 1989; Kung and Diamantaras, 1991; Leen, 1991). The same kind of representation also develops in the hidden layer of backpropagation autoassociator networks (Cottrell and Munro, 1988; Baldi and Ho...

206 | Neural networks, principal components and subspaces
- Oja
- 1989
Citation Context: ...nent (Oja and Karhunen, 1985); various forms of lateral interaction can be used to force a layer of such nodes to differentiate and span the principal component subspace (Földiák, 1989; Sanger, 1989; Oja, 1989; Kung and Diamantaras, 1991; Leen, 1991). The same kind of representation also develops in the hidden layer of backpropagation autoassociator networks (Cottrell and Munro, 1988; Baldi and Hornik, 1989)...

200 | The Helmholtz machine
- Dayan, Hinton, et al.
- 1995
Citation Context: ...s the entropy of the network can be optimized directly. A well-known example of this type of network is the Boltzmann Machine (Ackley et al., 1985; Hinton and Sejnowski, 1986); the Helmholtz Machine (Dayan et al., 1995) is an interesting recent development in this field. Note that networks of probabilistic nodes need not necessarily be stochastic though: we may use a stochastic model to compute the output probabilit...

195 | Neural networks and principal component analysis: learning from examples without local minima
- Baldi, Hornik
- 1989
Citation Context: ...Sanger, 1989; Oja, 1989; Kung and Diamantaras, 1991; Leen, 1991). The same kind of representation also develops in the hidden layer of backpropagation autoassociator networks (Cottrell and Munro, 1988; Baldi and Hornik, 1989). Since the entropy of a probability density scales with its log-variance, these techniques can also be viewed as a special (linear) case of the maximum information preservation or Infomax (Linsker, ...

142 | First- and second-order methods for learning: between steepest descent and Newton's method
- Battiti
- 1992
Citation Context: ...cal difference is that in a supervised network, each output node has a given target, and very efficient gradient descent techniques exist for approaching this target given a quadratic error function (Battiti, 1992). In an unsupervised network, on the other hand, there is no obvious assignment of targets to nodes: a node may have to select among multiple targets, or multiple nodes may compete for the same targe...

137 | Self-organizing neural network that discovers surfaces in random-dot stereograms
- Becker, Hinton
- 1992
Citation Context: ...que can be used across space as well as time (Fontaine and Shastri, 1993). It is also possible to explicitly derive an error signal from the mutual information between two patches of structured input (Becker and Hinton, 1992); this approach has been applied to viewpoint-invariant object recognition by (Zemel and Hinton, 1991). II.A.3 Anti-Hebbian Feedforward Learning: In most formulations of Hebbian learning it is tacitl...

102 | A Stochastic Approximation of the Eigenvectors and Eigenvalues of the Expectation of a Random Matrix
- Oja, Karhunen
- 1985
Citation Context: ...analysis (Jolliffe, 1986) the strategy of extracting the directions of highest variance from the input. A single Hebbian node, for instance, will come to encode the input's first principal component (Oja and Karhunen, 1985); various forms of lateral interaction can be used to force a layer of such nodes to differentiate and span the principal component subspace (Földiák, 1989; Sanger, 1989; Oja, 1989; Kung and Diamanta...

101 | Training stochastic model recognition algorithms as networks can lead to maximum mutual information estimation of parameters
- Bridle
- 1989
Citation Context: ...ochastic model to compute the output probability of a node but then prefer to use that probability directly as a deterministic input for the next stage of processing. The softmax activation function (Bridle, 1990) for instance models a multinomial stochastic system: the net inputs y_i to a layer of n nodes are interpreted as energy levels; the output probability z_i for each node is then given by the correspo...
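
The softmax activation in the context above can be written out as follows; the max-subtraction is a standard numerical-stability trick added by this sketch, not part of Bridle's formulation:

```python
import math

def softmax(y):
    """Softmax: net inputs y_i are treated as energy levels, and each output
    z_i is the corresponding Boltzmann probability exp(y_i) / sum_j exp(y_j)."""
    m = max(y)  # subtract the max before exponentiating, for stability
    exps = [math.exp(yi - m) for yi in y]
    total = sum(exps)
    return [e / total for e in exps]
```

The outputs sum to one, so a downstream stage can treat them directly as multinomial class probabilities.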

95 | Asymptotics of graphical projection pursuit
- Diaconis, Freedman
- 1984
Citation Context: ...the right, a 2-D projection of the same data set that has been optimized by exploratory projection pursuit, using the Friedman-Tukey projection index.) ...can be made for entropy as a projection index: (Diaconis and Freedman, 1984) have shown that most low-dimensional projections of high-dimensional data are approximately normally distributed. Ab-normal (that is, far from Gaussian) distributions are comparatively rare, and thu...

93 | Logical basis for information theory and probability theory
- Kolmogorov
- 1968
Citation Context: ...or task that is deemed relevant by fiat. By contrast, a generic measure of inherent information is the Kolmogorov or algorithmic complexity proposed independently by (Solomonoff, 1964; Chaitin, 1966; Kolmogorov, 1968). In essence, the Kolmogorov complexity of a message is the length (that is, entropy) of the shortest program that can send it. Due to the universality of computation, this length is independent of t...

80 | Maximum likelihood competitive learning
- Nowlan
- 1990
Citation Context: ...d activation functions at the outputs. That is, our layer of anti-Hebbian nodes mapped input vectors x to outputs z_i via z_i = e^(−y_i²), where y_i = w_i · x (II.1). Soft competition (Nowlan, 1990) between the nodes in the layer was then implemented by interpreting the z_i as class membership probabilities, normalizing them by dividing through their sum, then using them to scale the amount by ...
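
Equation II.1 and the normalization step described in the context above can be sketched together. This is a minimal reading, assuming plain dot products for the net inputs; the weight values in the usage example are made up:

```python
import math

def layer_responses(weights, x):
    """Gaussian activations z_i = exp(-(w_i . x)^2), per equation II.1."""
    return [math.exp(-sum(wi * xi for wi, xi in zip(w, x)) ** 2)
            for w in weights]

def soft_competition(z):
    """Soft competition: normalize the z_i into class-membership probabilities."""
    total = sum(z)
    return [zi / total for zi in z]
```

For example, with weights `[[1, 0], [0, 1]]` and input `[0, 2]`, the first node's net input is zero, so its Gaussian response (and hence its normalized membership probability) dominates.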

66 | Adaptive network for optimal linear feature extraction
- Földiák
- 1989
Citation Context: ...input's first principal component (Oja and Karhunen, 1985); various forms of lateral interaction can be used to force a layer of such nodes to differentiate and span the principal component subspace (Földiák, 1989; Sanger, 1989; Oja, 1989; Kung and Diamantaras, 1991; Leen, 1991). The same kind of representation also develops in the hidden layer of backpropagation autoassociator networks (Cottrell and Munro, 1988...

53 | Introduction to the Theory of Neural Computation
- Hertz, Krogh, et al.
- 1991
Citation Context: ...art et al., 1986b) was published, neural networks have become a popular (if not the preferred) framework for machine learning. A thorough introduction to this topic can be found in textbooks such as (Hertz et al., 1991; Haykin, 1994); the current state of the art in this very active research area is covered comprehensively in (Arbib, 1995). Here we limit ourselves to a brief introduction of the basic concepts, tech...

50 | Adaptation and decorrelation in the cortex
- Barlow, Földiák
- 1989
Citation Context: ...lty filter" that learned to be insensitive to familiar features in its input (Kohonen, 1989). More recently, such anti-Hebbian synapses have been used for lateral decorrelation of feature detectors (Barlow and Földiák, 1989; Földiák, 1989; Leen, 1991) as well as --- in differential form --- removal of temporal variations from the input (Mitchison, 1991). We suggest that in certain cases the use of anti-Hebbian feedforwa...

43 | Feature extraction using an unsupervised neural network - Intrator - 1992

40 | On the length of programs for computing binary sequences
- Chaitin
- 1966
Citation Context: ...e to a message or task that is deemed relevant by fiat. By contrast, a generic measure of inherent information is the Kolmogorov or algorithmic complexity proposed independently by (Solomonoff, 1964; Chaitin, 1966; Kolmogorov, 1968). In essence, the Kolmogorov complexity of a message is the length (that is, entropy) of the shortest program that can send it. Due to the universality of computation, this length i...

30 | Quadratic logistic discrimination
- Anderson
- 1975
Citation Context: ...1/(1 + e^(−2y/σ²)) (III.5), which is the logistic function with gain 2/σ². The logistic function is thus a natural choice of nonlinearity for performing "soft" (probabilistic) binary discrimination (Anderson, 1972). We will now derive a learning algorithm that uses logistic nodes to seek out informative binary features of the input data. III.B Mathematical Derivation: Our algorithm uses logistic nodes with unit...

21 | Principal Component Analysis of Images via back-propagation
- Cottrell, Munro
- 1988
Citation Context: ...subspace (Földiák, 1989; Sanger, 1989; Oja, 1989; Kung and Diamantaras, 1991; Leen, 1991). The same kind of representation also develops in the hidden layer of backpropagation autoassociator networks (Cottrell and Munro, 1988; Baldi and Hornik, 1989). Since the entropy of a probability density scales with its log-variance, these techniques can also be viewed as a special (linear) case of the maximum information preservati...

21 | Removing time variation with the anti-Hebbian differential synapse
- Mitchison
- 1991
Citation Context: ...e been used for lateral decorrelation of feature detectors (Barlow and Földiák, 1989; Földiák, 1989; Leen, 1991) as well as --- in differential form --- removal of temporal variations from the input (Mitchison, 1991). We suggest that in certain cases the use of anti-Hebbian feedforward connections to learn invariant structure may eliminate the need to bring in the heavy machinery of supervised learning algori...

17 | Visualization of 2-D hidden unit space
- Munro
- 1992
Citation Context: ...capes detection by standard Hebbian learning rules. Figure III.2 shows (from left to right) the initial, intermediate and final phase of this experiment, using a visualization technique suggested by (Munro, 1992). Each plot shows a set of three lines superimposed on a scatter plot of the data. (Figure III.2: A single BINGO node discovers the distinction between front and back vowels in an unlabelled data set...)

15 | Dynamics of Learning in Linear Feature-Discovery Networks
- Leen
- 1991
Citation Context: ...forms of lateral interaction can be used to force a layer of such nodes to differentiate and span the principal component subspace (Földiák, 1989; Sanger, 1989; Oja, 1989; Kung and Diamantaras, 1991; Leen, 1991). The same kind of representation also develops in the hidden layer of backpropagation autoassociator networks (Cottrell and Munro, 1988; Baldi and Hornik, 1989). Since the entropy of a probability den...

13 | Supervised Learning on Large Redundant Training Sets
- Møller
- 1993
Citation Context: ...ning the ability to escape from local minima. Analogously, nonparametric entropy optimization could be initiated with a small batch size |T| that is gradually increased over the course of learning. (Møller, 1993) has investigated methods to estimate the optimal batch size during optimization by conjugate gradient techniques; his results may well apply here. IV.C Application to Image Alignment: (Viola and Wells...

11 | A neural network for feature extraction
- Intrator
- 1990
Citation Context: ...goal of seeking highly informative, bimodal projections of the input with the Bienenstock-Cooper-Munro (BCM) algorithm (Bienenstock et al., 1982). In our notation, the BCM learning rule according to (Intrator, 1990, 1991a, 1991b, 1992) is Δw ∝ f′(y) x (z² − (4/3) z ⟨z²⟩) (III.17). Whereas in the derivation of BINGO some of the nonlinearities cancel (III.11), yielding a relatively straightfo...

10 | Comparing different neural network architectures for classifying handwritten digits - Guyon, Poujaud, et al. - 1989

7 | A non-linear information maximisation algorithm that performs blind separation
- Bell, Sejnowski
- 1995
Citation Context: ...ce the entropy of a probability density scales with its log-variance, these techniques can also be viewed as a special (linear) case of the maximum information preservation or Infomax (Linsker, 1988; Bell and Sejnowski, 1995) strategy. A comprehensive approach to unsupervised learning, however, must also take the cost of representing the data into account. From this perspective maximum information preservation is not alw...

5 | Exploratory Feature Extraction in Speech Signals - Intrator - 1991

4 | Neural networks for extracting unsymmetric principal components
- Kung, Diamantaras
- 1991
Citation Context: ...nd Karhunen, 1985); various forms of lateral interaction can be used to force a layer of such nodes to differentiate and span the principal component subspace (Földiák, 1989; Sanger, 1989; Oja, 1989; Kung and Diamantaras, 1991; Leen, 1991). The same kind of representation also develops in the hidden layer of backpropagation autoassociator networks (Cottrell and Munro, 1988; Baldi and Hornik, 1989). Since the entropy of a pro...

2 | Recognizing handprinted digit strings: A hybrid connectionist/procedural approach
- Fontaine, Shastri
- 1993
Citation Context: ...s is to require the network to predict the next patch of some structured input from the preceding context, as in (Elman, 1990); the same prediction technique can be used across space as well as time (Fontaine and Shastri, 1993). It is also possible to explicitly derive an error signal from the mutual information between two patches of structured input (Becker and Hinton, 1992); this approach has been applied to viewpoint-in...

1 | Tao Te Ching / Lao-Tsu
- Feng, English
- 1972
Citation Context: ...more information it contains. Thus while information is conveyed by the presence of a message, its measure relies on the potential absence of that message. Or, as Lao Tsu put it some 2,500 years ago (Feng and English, 1972): "Profit stems from what is there, Utility from what is not there." Owing to this holistic quality, information must be considered a property of the entire probability distribution of X, rather than ...