## Learning in Boltzmann Trees (1995)

Venue: Neural Computation

Citations: 24 (2 self)

### BibTeX

```bibtex
@article{Saul95learningin,
  author  = {Lawrence Saul and Michael Jordan},
  title   = {Learning in Boltzmann Trees},
  journal = {Neural Computation},
  year    = {1995},
  volume  = {6},
  pages   = {1173--1183}
}
```

### Abstract

We introduce a large family of Boltzmann machines that can be trained using standard gradient descent. The networks can have one or more layers of hidden units, with tree-like connectivity. We show how to implement the supervised learning algorithm for these Boltzmann machines exactly, without resort to simulated or mean-field annealing. The stochastic averages that yield the gradients in weight space are computed by the technique of decimation. We present results on the problems of N-bit parity and the detection of hidden symmetries.

1 Introduction

Boltzmann machines (Ackley, Hinton, & Sejnowski, 1985) have several compelling virtues. Unlike simple perceptrons, they can solve problems that are not linearly separable. The learning rule, simple and locally based, lends itself to massive parallelism. The theory of Boltzmann learning, moreover, has a solid foundation in statistical mechanics. Unfortunately, Boltzmann machines, as originally conceived, also have some serious drawbacks...
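The exact supervised algorithm referred to in the abstract uses the classic Boltzmann update, Δw_ij ∝ ⟨S_i S_j⟩_clamped − ⟨S_i S_j⟩_free. As a minimal sketch of that rule, computed here by brute-force enumeration rather than the paper's decimation technique, on an illustrative three-unit chain with a made-up clamping pattern:

```python
import itertools
import math

# Illustrative three-unit chain (weights and clamp pattern are made up).
# Boltzmann learning moves each weight along
#     dw_ij ∝ <S_i S_j>_clamped - <S_i S_j>_free,
# with averages taken under p(S) ∝ exp(-H), H = -sum_{i<j} w_ij S_i S_j.
# For a tiny network the averages can be enumerated exactly.

weights = {(0, 1): 0.5, (1, 2): -0.3}   # symmetric couplings, w_ij = w_ji

def energy(state):
    return -sum(w * state[i] * state[j] for (i, j), w in weights.items())

def correlation(i, j, clamp=None):
    """Exact <S_i S_j>, optionally with some units clamped to fixed values."""
    num = den = 0.0
    for state in itertools.product((-1, 1), repeat=3):
        if clamp and any(state[k] != v for k, v in clamp.items()):
            continue
        p = math.exp(-energy(state))
        num += p * state[i] * state[j]
        den += p
    return num / den

# One gradient signal for w_01, clamping the end units as "input"/"output":
grad = correlation(0, 1, clamp={0: +1, 2: -1}) - correlation(0, 1)
```

Enumeration costs 2^N and is only viable for toy networks; the paper's point is that tree connectivity lets decimation deliver the same exact averages efficiently.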

### Citations

7092 | Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference - Pearl - 1988

3584 | Optimization by Simulated Annealing - Kirkpatrick, Gelatt, et al. - 1983

Citation Context: ...o compute the gradients in weight space directly. Instead, one must resort to estimating the correlations ⟨S_i S_j⟩ by Monte Carlo simulation (Binder et al., 1988). The method of simulated annealing (Kirkpatrick et al., 1983) leads to accurate estimates but has the disadvantage of being very computation-intensive. A mean-field version of the algorithm (Peterson & Anderson, 1987) was proposed to speed up learning. It make...

1788 | Introduction to the Theory of Neural Computation - Hertz, Krogh, et al. - 1991

Citation Context: ...of deterministic and true Boltzmann learning. Finally, we discuss a number of possible extensions to our work. 2 Boltzmann Machines We briefly review the learning algorithm for the Boltzmann machine (Hertz, Krogh, and Palmer, 1991). The Boltzmann machine is a recurrent network with binary units S_i = ±1 and symmetric weights w_ij = w_ji. Each configuration of units in the network represents a state of energy H = −Σ...

1293 | Local Computations with Probabilities on Graphical Structures and their Application to Expert Systems - Lauritzen, Spiegelhalter - 1988

Citation Context: ...actability of Boltzmann trees is reminiscent of the tractability of tree-like belief networks, proposed by Pearl (1986, 1988); more sophisticated rules for computing probabilities in belief networks (Lauritzen & Spiegelhalter, 1988) may have useful counterparts in Boltzmann machines. These issues and others are left for further study. Acknowledgements The authors thank Mehran Kardar for useful discussions. This research was sup...

978 | Quantum Field Theory - Itzykson, Zuber - 1980

Citation Context: ...ure and the generalization to many output units will be discussed later. The key technique to compute partition functions and expectation values in these trees is known as decimation (Eggarter, 1974; Itzykson & Drouffe, 1991). The idea behind decimation is the following. Consider three units connected in series, as shown in Figure 2a. Though not directly connected, the end units S_1 and S_2 have an effective interaction t...
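The series decimation step this excerpt describes can be checked numerically: summing a middle unit out of a chain leaves an effective coupling between its two neighbors obeying the standard Ising decimation rule tanh(w_eff) = tanh(w_1)·tanh(w_2). A small sanity check, with illustrative coupling values:

```python
import itertools
import math

# Decimation of the middle unit in a chain S1 -- S -- S2.
# Summing S = ±1 out of exp(w1*S1*S + w2*S*S2) leaves a constant times
# exp(w_eff*S1*S2), with tanh(w_eff) = tanh(w1)*tanh(w2) (the standard
# Ising decimation rule).  Couplings below are illustrative.

w1, w2 = 0.8, -0.4

def summed_middle(s1, s2):
    """Partition-function contribution after summing out the middle unit."""
    return sum(math.exp(w1 * s1 * s + w2 * s * s2) for s in (-1, 1))

w_eff = math.atanh(math.tanh(w1) * math.tanh(w2))

# The ratio summed_middle / exp(w_eff*s1*s2) should be the same constant
# for all four configurations of (s1, s2):
ratios = [summed_middle(s1, s2) / math.exp(w_eff * s1 * s2)
          for s1, s2 in itertools.product((-1, 1), repeat=2)]
```

Applying this rule repeatedly from the leaves inward is what lets tree-structured networks be collapsed exactly, which is why the paper restricts attention to tree-like connectivity.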

432 | A Learning Algorithm for Boltzmann Machines - Ackley, Hinton, et al. - 1985

Citation Context: ...he gradients in weight space are computed by the technique of decimation. We present results on the problems of N-bit parity and the detection of hidden symmetries. 1 Introduction Boltzmann machines (Ackley, Hinton, & Sejnowski, 1985) have several compelling virtues. Unlike simple perceptrons, they can solve problems that are not linearly separable. The learning rule, simple and locally based, lends itself to massive parallelism...

374 | Fusion, Propagation, and Structuring in Belief Networks - Pearl - 1986

266 | Learning Representations by Back-Propagating Errors - Rumelhart, Hinton, et al. - 1986 (Nature 323, 533–536; doi: 10.1038/323533a0)

Citation Context: ...In practice, they are relatively slow. Simulated annealing (Kirkpatrick, Gelatt, & Vecchi, 1983), though effective, entails a great deal of computation. Finally, compared to backpropagation networks (Rumelhart, Hinton, & Williams, 1986), where weight updates are computed by the chain rule, Boltzmann machines lack a certain degree of exactitude. Monte Carlo estimates of stochastic averages (Binder & Heerman, 1988) are not sufficient...

172 | The Upstart Algorithm: A Method for Constructing and Training Feedforward Neural Networks - Frean - 1990

Citation Context: ...viable option for problems in which the basic assumption behind mean-field learning, that the units in the network can be treated independently, does not hold. We know of constructive algorithms (Frean, 1990) for feed-forward nets that yield tree-like solutions; an analogous construction for Boltzmann machines has obvious appeal, in view of the potential for exact computations. Finally, the tractability...

145 | A Mean Field Theory Learning Algorithm for Neural Networks - Peterson, Anderson - 1987

Citation Context: ...al, 1988). The method of simulated annealing (Kirkpatrick et al., 1983) leads to accurate estimates but has the disadvantage of being very computation-intensive. A mean-field version of the algorithm (Peterson & Anderson, 1987) was proposed to speed up learning. It makes the approximation ⟨S_i S_j⟩ ≈ ⟨S_i⟩⟨S_j⟩ in the learning rule and estimates the magnetizations ⟨S_i⟩ by solving a set of nonlinear equations. This is...
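The mean-field procedure this excerpt describes replaces each correlation by a product of magnetizations and solves the self-consistent equations m_i = tanh(Σ_j w_ij m_j + h_i). A minimal sketch by damped fixed-point iteration; the network, weights, and fields below are illustrative, not from the paper:

```python
import math

# Mean-field approximation in the Peterson & Anderson style: estimate the
# magnetizations m_i ≈ <S_i> from the nonlinear equations
#     m_i = tanh(sum_j w_ij m_j + h_i)
# via damped fixed-point iteration.  Weights/fields are illustrative.

weights = {(0, 1): 0.5, (0, 2): -0.3, (1, 2): 0.2}   # symmetric, w_ij = w_ji
fields = [0.1, -0.2, 0.05]
n = 3

def net_input(m, i):
    """Total field on unit i given current magnetizations m."""
    total = fields[i]
    for (a, b), w in weights.items():
        if a == i:
            total += w * m[b]
        elif b == i:
            total += w * m[a]
    return total

m = [0.0] * n
for _ in range(200):   # damped iteration; converges for these small weights
    m = [0.5 * m[i] + 0.5 * math.tanh(net_input(m, i)) for i in range(n)]

# The mean-field learning rule would now use m[i]*m[j] in place of the
# exact correlation <S_i S_j> -- the approximation that can break down
# (Galland, 1993) when units are strongly coupled.
```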

67 | Deterministic Boltzmann Learning Performs Steepest Descent - Hinton - 1989

Citation Context: ...e 1: Boltzmann tree with two layers of hidden units. The input units (not shown) are fully connected to all the units in the tree. rule. For many problems, this approximation works surprisingly well (Hinton, 1989), so that mean-field Boltzmann machines learn much more quickly than their stochastic counterparts. Under certain circumstances, however, the approximation breaks down, and the mean-field learning ru...

63 | Unsupervised Learning of Distributions on Binary Vectors Using Two Layer Networks - Freund, Haussler - 1994

44 | A Scaled Conjugate Gradient Algorithm for Fast Supervised Learning - Møller - 1993

35 | Learning Algorithms and Probability Distributions in Feedforward and Feed-back Networks - Hopfield - 1987

Citation Context: ...ckly than their stochastic counterparts. Under certain circumstances, however, the approximation breaks down, and the mean-field learning rule works badly if at all (Galland, 1993). Another approach (Hopfield, 1987) is to focus on Boltzmann machines with architectures simple enough to permit exact computations. Learning then proceeds by straightforward gradient descent on the cost function (Yair & Gersho, 1988)...

4 | The Limitations of Deterministic Boltzmann - Galland - 1993

Citation Context: ...zmann machines learn much more quickly than their stochastic counterparts. Under certain circumstances, however, the approximation breaks down, and the mean-field learning rule works badly if at all (Galland, 1993). Another approach (Hopfield, 1987) is to focus on Boltzmann machines with architectures simple enough to permit exact computations. Learning then proceeds by straightforward gradient descent on the...

1 | Cayley Trees, the Ising Problem - Eggarter - 1974

Citation Context: ...basic architecture and the generalization to many output units will be discussed later. The key technique to compute partition functions and expectation values in these trees is known as decimation (Eggarter, 1974; Itzykson & Drouffe, 1991). The idea behind decimation is the following. Consider three units connected in series, as shown in Figure 2a. Though not directly connected, the end units S_1 and S_2 have...