## Learning stochastic feedforward networks (1990)

Citations: 12 (1 self)

### BibTeX

```bibtex
@techreport{Neal90learningstochastic,
  author      = {Radford M. Neal},
  title       = {Learning stochastic feedforward networks},
  institution = {},
  year        = {1990}
}
```

### Abstract

The work reported here began with the desire to find a network architecture that shared with Boltzmann machines [6, 1, 7] the capacity to learn arbitrary probability distributions over binary vectors, but that did not require the negative phase of Boltzmann machine learning. It was hypothesized that eliminating the negative phase would improve learning performance. This goal was achieved by replacing the Boltzmann machine's symmetric connections with feedforward connections. In analogy with Boltzmann machines, the sigmoid function was used to compute the conditional probability of a unit being on from the weighted input from other units. Stochastic simulation of such a network is somewhat more complex than for a Boltzmann machine, but is still possible using local communication. Maximum likelihood, gradient-ascent learning can be done with a local Hebb-type rule.
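The sampling scheme the abstract describes — each unit turns on with probability given by the sigmoid of its weighted input from earlier units — can be sketched as below. This is a minimal illustration, not the report's own code; the function and variable names are mine, and the weights are random stand-ins.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_network(W, b, rng):
    """Sample a binary vector from a sigmoid feedforward network:
    visiting units in order, unit i turns on with probability
    sigmoid(b[i] + sum over earlier units j of W[i, j] * s[j])."""
    n = len(b)
    s = np.zeros(n)
    for i in range(n):
        p_on = sigmoid(b[i] + W[i, :i] @ s[:i])
        s[i] = 1.0 if rng.random() < p_on else 0.0
    return s

rng = np.random.default_rng(0)
W = np.tril(rng.normal(size=(4, 4)), k=-1)  # strictly lower-triangular: feedforward only
b = rng.normal(size=4)
v = sample_network(W, b, rng)
```

Because the connections are feedforward (the weight matrix is strictly lower-triangular), a single ordered pass yields an exact sample — no equilibration phase is needed, unlike sampling from a Boltzmann machine.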

### Citations

8090 | Maximum likelihood from incomplete data via the EM algorithm
- Dempster, Laird, et al.
- 1977

Citation Context: …obabilities that would be swamped by noise in the simulations. For comparison, results from a maximum-likelihood fit of a mixture model with six components to the training data using the EM algorithm [3, 15] are given as well, evaluated on a test sample of 5000 items. The sigmoid feedforward networks, the Boltzmann machine with −1/+1 valued units, and the mixture model all show signs of overfitti…

7052 | Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference
- Pearl
- 1988

Citation Context: …cal communication. Maximum likelihood, gradient-ascent learning can be done with a local Hebb-type rule. These networks turn out to fall within the general class of "belief networks" studied by Pearl [11] and others as a means of representing probabilistic knowledge in expert systems. However, the specific network architectures considered by Pearl do not use a sigmoid probability function. Rather, the…

1284 | Local Computations with Probabilities on Graphical Structures and Their Application to Expert Systems (with Discussion)
- Lauritzen, Spiegelhalter
- 1988

745 | Learning representations by back-propagating errors
- Rumelhart, Hinton, et al.
- 1986

Citation Context: …cation problems is a probability distribution over possible classes, conditional on the attributes presented as input. A deterministic feedforward network, trained by a method such as backpropagation [12], can represent a distribution over two classes by simply producing the probability of one of the classes as its output. Such a network that uses the sigmoid function to compute the output of a unit f…
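The two-class setting this context describes — a deterministic network whose single sigmoid output is read directly as a class probability — can be shown in a few lines. A sketch only; the layer sizes, weights, and names are illustrative, not from either paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def class_probability(x, W_hidden, w_out):
    """One-hidden-layer deterministic feedforward network whose single
    sigmoid output unit is interpreted as P(class 1 | x)."""
    h = sigmoid(W_hidden @ x)       # hidden activations
    return sigmoid(w_out @ h)       # output in (0, 1), read as a probability

p = class_probability(np.array([1.0, 0.0]),
                      np.array([[0.5, -0.3], [0.2, 0.8]]),
                      np.array([1.0, -1.0]))
```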

721 | Cross-Validatory Choice and Assessment of Statistical Predictions (with Discussion)
- Stone
- 1974

Citation Context: …he other two runs of this network were rather poor. Generalization performance for all these networks might well be improved by stopping learning before convergence using a cross-validation criterion [14]. As an aside, it is interesting that the weights found during network training generally bear only a vague resemblance to those that would result from manually solving the problem using the clusters…

625 | Statistical Analysis of Finite Mixture Distributions
- Titterington, Smith, et al.
- 1985

Citation Context: …re. Each component produces its own distribution for V, and these component distributions are then combined in the proportions P(M). Mixture distributions are commonly encountered and much studied [15]. To represent a mixture distribution in a network, we need first to represent the mixture variable, M. For a mixture of n components, one way to do this is via a cluster of n units, exactly one of w…
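The representation sketched in this context — the mixture variable M encoded as a cluster of n units with exactly one on, selecting which component generates V — can be illustrated as follows. The names and the Bernoulli component form are my own assumptions for the sketch.

```python
import numpy as np

def sample_mixture(mix_weights, component_probs, rng):
    """Draw the mixture variable M as a one-hot cluster of units
    (exactly one on), then draw each bit of V from the chosen
    component's Bernoulli probabilities."""
    m = rng.choice(len(mix_weights), p=mix_weights)
    cluster = np.zeros(len(mix_weights))
    cluster[m] = 1.0                                       # exactly one unit on
    v = (rng.random(len(component_probs[m])) < component_probs[m]).astype(float)
    return cluster, v

rng = np.random.default_rng(1)
cluster, v = sample_mixture(np.array([0.3, 0.7]),
                            np.array([[0.9, 0.1],        # component 0
                                      [0.1, 0.9]]), rng) # component 1
```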

431 | A learning algorithm for Boltzmann machines
- Ackley, Hinton, et al.
- 1985

Citation Context: …connectionist learning and work on the representation of expert knowledge. Introduction The work reported here began with the desire to find a network architecture that shared with Boltzmann machines [6, 1, 7] the capacity to learn arbitrary probability distributions over binary vectors, but that did not require the negative phase of Boltzmann machine learning. It was hypothesized that eliminating the nega…

283 | Learning and relearning in Boltzmann machines
- Hinton, Sejnowski
- 1986

Citation Context: …connectionist learning and work on the representation of expert knowledge. Introduction The work reported here began with the desire to find a network architecture that shared with Boltzmann machines [6, 1, 7] the capacity to learn arbitrary probability distributions over binary vectors, but that did not require the negative phase of Boltzmann machine learning. It was hypothesized that eliminating the nega…

129 | Probabilistic Inference and Influence Diagrams
- Shachter
- 1988

Citation Context: …tional probabilities and sampling from conditional distributions are in general difficult problems. Various methods for computing exact conditional probabilities in belief networks have been proposed [11, 13, 8], but all are either restricted to special forms of network or have exponential time complexity in the worst case. It appears that the only plausible method of sampling from conditional distributions…

93 | Evidential reasoning using stochastic simulation of causal models
- Pearl
- 1987

Citation Context: …mplexity in the worst case. It appears that the only plausible method of sampling from conditional distributions in belief networks with high connectivity is stochastic simulation, described by Pearl [10, 11]. As with Boltzmann machines, a step in the simulation requires selecting a new value for unit i from its distribution conditional on the values of the other units. For a belief network, this distribu…
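For a sigmoid belief network, the conditional distribution this context refers to combines unit i's own sigmoid probability given earlier units with the probabilities of every later unit that receives input from i. A single-unit Gibbs update along those lines might look like the sketch below — my own naming and structure, assuming a strictly lower-triangular weight matrix, not the report's code.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gibbs_update(s, i, W, b, rng):
    """Resample unit i conditional on all other units: weigh s[i] = 0
    against s[i] = 1 by unit i's own conditional probability times the
    conditional probabilities of the later units that i feeds into."""
    n = len(s)
    log_p = np.zeros(2)                    # unnormalized log-probs for s[i] = 0, 1
    for val in (0, 1):
        s[i] = float(val)
        p_i = sigmoid(b[i] + W[i, :i] @ s[:i])
        log_p[val] = np.log(p_i if val == 1 else 1.0 - p_i)
        for c in range(i + 1, n):          # later units receiving input from i
            if W[c, i] != 0.0:
                p_c = sigmoid(b[c] + W[c, :c] @ s[:c])
                log_p[val] += np.log(p_c if s[c] == 1.0 else 1.0 - p_c)
    p_on = 1.0 / (1.0 + np.exp(log_p[0] - log_p[1]))
    s[i] = 1.0 if rng.random() < p_on else 0.0
    return s

rng = np.random.default_rng(2)
W = np.tril(rng.normal(size=(3, 3)), k=-1) # feedforward weights
b = rng.normal(size=3)
s = np.array([1.0, 0.0, 1.0])
s = gibbs_update(s, 1, W, b, rng)
```

Note that only local quantities appear: unit i's own input and the inputs of its immediate successors, which is what makes simulation with local communication possible.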

66 | Boltzmann machines: Constraint satisfaction networks that learn
- Hinton, Sejnowski, et al.
- 1984

Citation Context: …connectionist learning and work on the representation of expert knowledge. Introduction The work reported here began with the desire to find a network architecture that shared with Boltzmann machines [6, 1, 7] the capacity to learn arbitrary probability distributions over binary vectors, but that did not require the negative phase of Boltzmann machine learning. It was hypothesized that eliminating the nega…

49 | Influence Diagrams, Belief Nets, and Decision Analysis
- Oliver, Smith
- 1990

Citation Context: …nfluence diagrams", and "relevance diagrams", are designed, like Boltzmann machines, to represent a probability distribution over a set of attributes. Study of these networks by Pearl [11] and others [9] has been motivated principally by the desire to represent knowledge obtained from human experts, however. Accordingly, hard-to-interpret parameters such as the weights in a Boltzmann machine have bee…

19 | Towards efficient probabilistic diagnosis in multiply connected belief networks
- Henrion
- 1990

Citation Context: …units requires 2^(i−1) parameters. Even if some of the preceding units are not connected to unit i, more compact specifications will generally be necessary. One method, termed the "noisy-OR" model [11, 5], views the units as 0/1-valued OR-gates with the preceding units as inputs. An input of 1 does not invariably force a unit to take on the value 1, however. Rather, there is a certain probability, q_ij…
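The noisy-OR model this context describes is compact enough to state directly: each active input independently fails with some probability, and the unit comes on unless every active input fails. In the sketch below, `q[j]` plays the role of q_ij from the text (the probability that active input j fails to activate unit i); the function name is mine.

```python
def noisy_or(inputs, q):
    """Noisy-OR unit: each input j that is on fails to activate the unit
    with probability q[j], independently. The unit is on unless every
    active input fails, so P(unit = 1) = 1 - product of q[j] over on inputs."""
    p_all_fail = 1.0
    for s_j, q_j in zip(inputs, q):
        if s_j == 1:
            p_all_fail *= q_j
    return 1.0 - p_all_fail

p = noisy_or([1, 1, 0], [0.5, 0.2, 0.9])  # 1 - 0.5 * 0.2 = 0.9
```

With k connected inputs this needs only k failure probabilities, versus the exponentially many parameters required to specify the conditional distribution in full.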

3 | Variations on the Boltzmann machine learning algorithm
- Derthick
- 1984

Citation Context: …similar for the various networks, since they all have the same number of free parameters. The learning procedure used. Numerous variations of the Boltzmann machine learning procedure have been tried [4], each of which requires fixing a number of parameters, such as the learning rate, and the temperatures in an annealing schedule. This presents a problem in comparing learning in Boltzmann machines…