## Information Geometry of the EM and em Algorithms for Neural Networks (1995)

Venue: Neural Networks

Citations: 101 (8 self)

### BibTeX

```bibtex
@ARTICLE{Amari95informationgeometry,
  author  = {Shun-ichi Amari},
  title   = {Information Geometry of the EM and em Algorithms for Neural Networks},
  journal = {Neural Networks},
  year    = {1995},
  volume  = {8},
  pages   = {1379--1408}
}
```


### Abstract

In order to realize an input-output relation given by noise-contaminated examples, it is effective to use a stochastic model of neural networks. A model network includes hidden units whose activation values are neither specified nor observed. It is useful to estimate the hidden variables from the observed or specified input-output data based on the stochastic model. Two algorithms, the EM- and em-algorithms, have so far been proposed for this purpose. The EM-algorithm is an iterative statistical technique using the conditional expectation, and the em-algorithm is a geometrical one given by information geometry. The em-algorithm iteratively minimizes the Kullback-Leibler divergence in the manifold of neural networks. These two algorithms are equivalent in most cases. The present paper gives a unified information-geometrical framework for studying stochastic models of neural networks, by focusing on the EM and em algorithms, and proves a condition which guarantees their equivalence.
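The abstract's picture of EM/em as iterative Kullback-Leibler minimization can be illustrated with a minimal sketch for a two-component Gaussian mixture with unit variances. Everything here (the data, the initialization, the component count) is an illustrative assumption, not taken from the paper: the e-step computes posterior responsibilities (the KL projection of the model onto the data constraint), and the m-step re-fits the parameters by maximizing the expected log-likelihood.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: two unit-variance Gaussian clusters (illustrative assumption)
x = np.concatenate([rng.normal(-2.0, 1.0, 200), rng.normal(3.0, 1.0, 200)])

# Parameters: mixing weight w and component means m1, m2
w, m1, m2 = 0.5, -1.0, 1.0

def gauss(x, m):
    # Unit-variance Gaussian density
    return np.exp(-0.5 * (x - m) ** 2) / np.sqrt(2 * np.pi)

for _ in range(50):
    # e-step: posterior responsibility of component 1 for each point
    p1 = w * gauss(x, m1)
    p2 = (1 - w) * gauss(x, m2)
    r = p1 / (p1 + p2)
    # m-step: maximize the expected complete-data log-likelihood
    w = r.mean()
    m1 = (r * x).sum() / r.sum()
    m2 = ((1 - r) * x).sum() / (1 - r).sum()

print(w, m1, m2)
```

With this seed the estimated means settle near the true cluster centers; each pass is one alternation of the two projections described in the abstract.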

### Citations

8089 | Maximum likelihood from incomplete data via the EM algorithm - Dempster, Laird, et al. - 1977 |

3719 | Stochastic Relaxation, Gibbs Distributions and the Bayesian Restoration of Images - Geman, Geman - 1984 |

Citation Context: ...algorithms nor shows computer simulated results. These remain to be studied separately in future by applying the framework proposed in the present paper. Applications to hidden Markov random fields (Geman and Geman, 1984; Besag and Green, 1993; Kunsch, Geman and Kehagias, 1994), dynamics of Boltzmann machines with asymmetric connections, etc. are also important subjects of future research. The present paper is organi... |

772 | A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains - Baum, Petrie, et al. - 1970 |

Citation Context: ...s much larger than that of Markov chains in spite of its simple mechanism of generation. It is widely used in applications, in particular in speech recognition. The so-called Baum-Welch algorithm (Baum et al., 1970) is the same as the EM algorithm. Recently, Kunsch, Geman and Kehagias (1994) tried to apply a hidden Markov random field to vision research. An old problem of state identification in the hidden Mark... |

764 | A view of the EM algorithm that justifies incremental sparse and other variants - Neal, Hinton - 1998 |

469 | Linear Statistical Inference and Its Applications - Rao - 1973 |

Citation Context: ...(r_1, …, r_T) is given through their arithmetic mean r̄, implying that r̄ is a sufficient statistic for estimating θ or u (see standard textbooks on statistics, for example Cox and Hinkley, 1974; Rao, 1973). Hence, the distributions of r̄ again form the same type of exponential family as S, except for the scale factor T and k(r̄). So it is possible to discuss repeated observations and estimation in the ... |

431 | A learning algorithm for Boltzmann machines - Ackley, Hinton, et al. - 1985 |

Citation Context: ...table distributions of Boltzmann machines. A Boltzmann machine is a stochastic neural network consisting of n neurons. The neurons are connected with symmetric connection weights W = (w_ij) (Ackley, Hinton and Sejnowski, 1985; Aarts and Korst, 1989). The self-connection is zero, w_ii = 0. Instead, we put w_ii = s_i for the sake of notational convenience, where s_i is the bias term of the ith unit. Let x = (x_1, …, x... |
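The snippet's conventions (symmetric weights W with zero diagonal, a bias s_i per unit) determine the machine's stable distribution p(x) ∝ exp{Σ_{i<j} w_ij x_i x_j + Σ_i s_i x_i}. A brute-force sketch for a tiny network makes this concrete; the size n = 3 and the random weights are illustrative assumptions, not values from the paper:

```python
import itertools
import numpy as np

rng = np.random.default_rng(2)
n = 3  # illustrative network size; 2**n states are enumerable

# Symmetric weights with zero diagonal, plus separate biases s_i
# (the snippet's w_ii = s_i is just notational shorthand for these).
W = rng.normal(size=(n, n))
W = (W + W.T) / 2
np.fill_diagonal(W, 0.0)
s = rng.normal(size=n)

def energy(x):
    # E(x) = -(sum_{i<j} w_ij x_i x_j + sum_i s_i x_i);
    # with zero diagonal, 0.5 * x^T W x equals the i<j sum.
    x = np.asarray(x, dtype=float)
    return -(0.5 * x @ W @ x + s @ x)

# Stable (Boltzmann) distribution over all 2^n binary states
states = list(itertools.product([0, 1], repeat=n))
unnorm = np.array([np.exp(-energy(x)) for x in states])
p = unnorm / unnorm.sum()

print(len(p), p.sum())
```

Enumerating states is only feasible for toy sizes; it is used here solely to exhibit the distribution the citation refers to.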

258 | I-divergence geometry of probability distributions and minimization problems. The Annals of Probability - Csiszár - 1975 |

Citation Context: ...When a network is of a stochastic nature, each network is accompanied with a probability distribution p(x; θ) or a conditional probability distribution p(y|x; θ). Information geometry (Amari, 1985; Csiszár, 1975; Chentsov, 1972) connects these two sources of ideas. It originated from the information structure of a manifold of probability distributions and has been developed to be a new mathematical subject w... |

147 | Information and accuracy attainable in the estimation of statistical parameters - Rao - 1945 |

Citation Context: ...re G = (g_ij) is an n × n positive-definite matrix. It is natural to define it by g_ij(θ) = E[(∂/∂θ_i) log p(x; θ) (∂/∂θ_j) log p(x; θ)], (A.6) where E denotes the expectation with respect to p(x; θ) (Rao, 1945). This G = (g_ij) is called the Fisher information matrix, which plays a central role in theoretical statistics. If we use the m-coordinate system, the same inner product should be given in terms of... |
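Definition (A.6) can be checked numerically for the simplest exponential family. The sketch below, with an illustrative Bernoulli parameter θ = 0.3 (an assumption, not from the paper), estimates E[(∂ log p/∂θ)²] by Monte Carlo and compares it with the closed-form Fisher information 1/(θ(1−θ)):

```python
import numpy as np

rng = np.random.default_rng(1)
theta = 0.3  # illustrative Bernoulli parameter

# For p(x; θ) = θ^x (1-θ)^(1-x), the score is
# ∂ log p/∂θ = x/θ - (1-x)/(1-θ).
x = rng.binomial(1, theta, 1_000_000)
score = x / theta - (1 - x) / (1 - theta)

# Monte Carlo estimate of (A.6): g(θ) = E[(∂ log p/∂θ)^2]
g_mc = (score ** 2).mean()
g_exact = 1.0 / (theta * (1 - theta))  # closed-form Fisher information

print(g_mc, g_exact)
```

The two values agree to within Monte Carlo error, illustrating why (A.6) defines a meaningful Riemannian metric on the parameter manifold.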

136 | Information geometry and alternating minimization procedures, Statistics and Decisions (Supplement 1) - Csiszár, Tusnády - 1984 |

Citation Context: ...though we do not mention it (see, e.g., Amari, 1987b; Barndorff-Nielsen and Jupp, 1989; Kawanabe and Amari, 1994). The information geometry and the EM algorithm are also applicable to this case (Csiszár and Tusnády, 1984). It should also be remarked that a stochastic perceptron reduces to the ordinary analog perceptron if the stochastic outputs z_i and y are replaced by their expected values. Hence, an ordinary multi... |

130 | Spatial statistics and Bayesian computation - Besag, Green - 1993 |

Citation Context: ...computer simulated results. These remain to be studied separately in future by applying the framework proposed in the present paper. Applications to hidden Markov random fields (Geman and Geman, 1984; Besag and Green, 1993; Kunsch, Geman and Kehagias, 1994), dynamics of Boltzmann machines with asymmetric connections, etc. are also important subjects of future research. The present paper is organized as follows. Section... |

96 | Convergence results for the EM approach to mixture of experts architectures - Jordan, Xu - 1995 |

54 | Differential geometry of curved exponential families — curvatures and information loss - Amari - 1982 |

52 | Differential geometry and statistics - Murray, Rice - 1993 |

Citation Context: ...Appendix: Information Geometry. A.1 Dual geometry of exponential family. Invariant geometrical structures of a general manifold S of probability distributions have been studied in detail (Amari, 1985; Murray and Rice, 1993, etc.) in order to obtain intrinsic properties of a statistical model. The geometrical theory has successfully been applied to various fields of information sciences such as statistics (Amari, 1985; ... |

36 | Information geometry of Boltzmann machines - Amari, Kurata, et al. - 1992 |

Citation Context: ...mation sciences such as statistics (Amari, 1985; Kass, 1989), systems theory (Amari, 1987; Ohara and Amari, 1992), information theory (Amari and Han, 1989; Amari, 1989), neural networks (Amari, 1991; Amari et al., 1992) and many others. Mathematicians are studying this new geometrical structure of differential geometry (Nomizu and Simon, 1991; Kurose, 1990). It is a Riemannian manifold equipped with a couple of dua... |

34 | Statistical Decision Rules and Optimal Inference - Chentsov - 1981 |

Citation Context: ...is of a stochastic nature, each network is accompanied with a probability distribution p(x; θ) or a conditional probability distribution p(y|x; θ). Information geometry (Amari, 1985; Csiszár, 1975; Chentsov, 1972) connects these two sources of ideas. It originated from the information structure of a manifold of probability distributions and has been developed to be a new mathematical subject with new differen... |

32 | Identifiability of Hidden Markov Information Sources - Ito, Amari, et al. - 1992 |

30 | Information geometry of estimating functions in semi-parametric statistical models - Amari, Kawanabe - 1997 |

Citation Context: ...the generalized Pythagoras theorem holds. Information geometry studies new geometrical structures existing in manifolds of probability distributions. It is generalized to the fibre bundle structure (Amari and Kawanabe, 1994), and the conformal structure (Okamoto, Amari and Takeuchi, 1988). It has been applied successfully to various fields of information sciences as is mentioned earlier. It is also related to completely... |

29 | Mathematical foundations of neurocomputing - Amari - 1990 |

Citation Context: ...ugh it behaves deterministically in the execution mode. This is a stochastic model of (deterministic) neural networks. This suggests the usefulness of statistical ideas in neural networks (see, e.g., Amari, 1990; Cheng and Titterington, 1994; Ripley, 1994; White, 1989). Another quite different but important idea for developing a theory of neural networks originates from geometry. Let us consider a neural net... |

21 | Limit Theorems For Large Deviations - Saulis, Statulevicius - 1991 |

Citation Context: ...Q* be the e-projection of P to D that minimizes K(Q ‖ P), Q ∈ D. Let Q be a point in D whose η-coordinates are denoted by η_Q = (η_Q^v, η_Q^h). By virtue of the large deviation theory (see, e.g., Saulis and Statulevicius, 1989), the probability density of r̄ = η_Q, that is, the probability of point Q being observed when the true distribution is P, is written as p(r̄; P) ≈ exp{−T K(Q ‖ P)}, (8.21) asymptotically. Hence, the... |

20 | Dualistic geometry of the manifold of higher-order neurons - Amari - 1991 |

Citation Context: ...a binary output y_i. In the present simplest case, we assume that N_i is a simple stochastic neuron such that it emits a binary output y_i depending on the weighted sum u_i = w_i · x of the input x (Amari, 1991). The probability of y_i given x is written as p(y_i | x; w_i) = φ(y_i; w_i · x) = exp{y_i x · w_i − ψ(w_i · x)}, (2.17) where φ is the sigmoidal function φ(y; u) = exp(yu) / (1 + exp(u)) (2.18) and ψ(w... |
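Equations (2.17) and (2.18) in the snippet define a Bernoulli exponential family whose log-partition function is ψ(u) = log(1 + exp(u)). The short check below, with made-up weight and input vectors (illustrative assumptions only), verifies that the two output probabilities sum to one and that p(y = 1 | x) is the logistic sigmoid of w · x:

```python
import numpy as np

def psi(u):
    # Log-partition function ψ(u) = log(1 + exp(u))
    return np.log1p(np.exp(u))

def p_y(y, x, w):
    # p(y | x; w) = exp{y (x · w) - ψ(w · x)}, as in (2.17)-(2.18)
    u = float(np.dot(w, x))
    return np.exp(y * u - psi(u))

# Illustrative (made-up) input and weight vectors
x = np.array([1.0, -0.5])
w = np.array([0.8, 0.3])

total = p_y(1, x, w) + p_y(0, x, w)
sigmoid = 1.0 / (1.0 + np.exp(-np.dot(w, x)))
print(total, p_y(1, x, w), sigmoid)
```

Writing the sigmoid in exponential-family form is what lets the information-geometric (e- and m-coordinate) machinery apply directly to the stochastic neuron.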

17 | Alternating minimization and Boltzmann machine learning - Byrne - 1992 |

16 | Differential geometry of smooth families of probability distributions. Mathematical Engineering - Nagaoka, Amari - 1982 |

15 | Parametric Statistical Models and Likelihood - Barndorff-Nielsen - 1988 |

15 | The Role of Differential Geometry in Statistical Theory - Barndorff-Nielsen, Cox, et al. - 1986 |

14 | Hierarchical mixtures of experts and the EM algorithm - Jordan, Jacobs - 1994 |

Citation Context: ...dom Markov fields is also represented in an exponential form. 2.3 Mixtures of expert neural nets. Example 6. Mixture of expert nets. The simplest example of mixtures of expert nets (Jacobs et al., 1991; Jordan and Jacobs, 1994) is presented here. There are various generalizations, including the hierarchical mixture (Jordan and Jacobs, 1994). Some generalization will be shown after this example. Let N_i (i = 0, 1, …, k... |

13 | Differential geometry of a parametric family of invertible linear systems - Amari - 1987 |

13 | Statistical inference under multiterminal rate restrictions: a differential geometric approach - Amari, Han - 1989 |

Citation Context: ...l theory has successfully been applied to various fields of information sciences such as statistics (Amari, 1985; Kass, 1989), systems theory (Amari, 1987; Ohara and Amari, 1992), information theory (Amari and Han, 1989; Amari, 1989), neural networks (Amari, 1991; Amari et al., 1992) and many others. Mathematicians are studying this new geometrical structure of differential geometry (Nomizu and Simon, 1991; Kurose, ... |

13 | The geometry of asymptotic inference (with discussion), Statistical Science - Kass - 1989 |

Citation Context: ..., etc.) in order to obtain intrinsic properties of a statistical model. The geometrical theory has successfully been applied to various fields of information sciences such as statistics (Amari, 1985; Kass, 1989), systems theory (Amari, 1987; Ohara and Amari, 1992), information theory (Amari and Han, 1989; Amari, 1989), neural networks (Amari, 1991; Amari et al., 1992) and many others. Mathematicians are stu... |

12 | The EM algorithm and information geometry in neural network learning - Amari - 1995 |

Citation Context: ...ton [1993]). See also Jordan and Xu [1994], Xu, Jordan [1994]. The present paper elucidates the relation between the statistical EM-algorithm and the geometrical em-algorithm reported in short notes (Amari 1994). They are the same in most cases, and we prove a necessary and sufficient condition which guarantees their equivalence. We also give a simple example where the two algorithms give different solutio... |

12 | Hidden Markov random fields - Kunsch, Geman, et al. - 1995 |

Citation Context: ...e remain to be studied separately in future by applying the framework proposed in the present paper. Applications to hidden Markov random fields (Geman and Geman, 1984; Besag and Green, 1993; Kunsch, Geman and Kehagias, 1994), dynamics of Boltzmann machines with asymmetric connections, etc. are also important subjects of future research. The present paper is organized as follows. Sections 2 and 3 are devoted to introduct... |

9 | Differential Geometrical Theory of Statistics - Amari - 1987 |

Citation Context: ...ntial family. However, it is possible to generalize the information geometry to be responsible for such cases by introducing the manifold of functions, although we do not mention it (see, e.g., Amari, 1987b; Barndorff-Nielsen and Jupp, 1989; Kawanabe and Amari, 1994). The information geometry and the EM algorithm are also applicable to this case (Csiszár and Tusnády, 1984). It should also be remarked ... |

7 | Approximating Exponential Models - Barndorff-Nielsen, Jupp - 1989 |

7 | Estimation of network parameters in semiparametric stochastic perceptron - Kawanabe, Amari - 1994 |

Citation Context: ...lize the information geometry to be responsible for such cases by introducing the manifold of functions, although we do not mention it (see, e.g., Amari, 1987b; Barndorff-Nielsen and Jupp, 1989; Kawanabe and Amari, 1994). The information geometry and the EM algorithm are also applicable to this case (Csiszár and Tusnády, 1984). It should also be remarked that a stochastic perceptron reduces to the ordinary analog ... |

5 | Fisher information under restriction of the Shannon information in multiterminal situations. Ann. Inst. Statist. Math. - Amari - 1989 |

Citation Context: ...ully been applied to various fields of information sciences such as statistics (Amari, 1985; Kass, 1989), systems theory (Amari, 1987; Ohara and Amari, 1992), information theory (Amari and Han, 1989; Amari, 1989), neural networks (Amari, 1991; Amari et al., 1992) and many others. Mathematicians are studying this new geometrical structure of differential geometry (Nomizu and Simon, 1991; Kurose, 1990). It is ... |

4 | A new criterion for selecting models from partially observed data - Shimodaira - 1994 |

4 | Learning in artificial neural networks: A statistical perspective. Neural Computation - White - 1989 |

Citation Context: ...his is a stochastic model of (deterministic) neural networks. This suggests the usefulness of statistical ideas in neural networks (see, e.g., Amari, 1990; Cheng and Titterington, 1994; Ripley, 1994; White, 1989). Another quite different but important idea for developing a theory of neural networks originates from geometry. Let us consider a neural network including modifiable parameters (connection weights)... |

3 | Asymptotic Theory of Sequential Estimation: Differential Geometrical Approach - Okamoto, Amari - 1991 |

1 | Dualistic dynamical systems in the framework of information geometry - Fujiwara, Amari - 1994 |

1 | Dual connections and affine geometry - Kurose - 1990 |

Citation Context: ...an, 1989; Amari, 1989), neural networks (Amari, 1991; Amari et al., 1992) and many others. Mathematicians are studying this new geometrical structure of differential geometry (Nomizu and Simon, 1991; Kurose, 1990). It is a Riemannian manifold equipped with a couple of dual affine connections. The duality in affine connections is a new notion introduced in differential geometry originated from information scie... |

1 | Differential geometric structures of stable feedback systems with dual connections - Ohara, Amari - 1992 |

Citation Context: ...rties of a statistical model. The geometrical theory has successfully been applied to various fields of information sciences such as statistics (Amari, 1985; Kass, 1989), systems theory (Amari, 1987; Ohara and Amari, 1992), information theory (Amari and Han, 1989; Amari, 1989), neural networks (Amari, 1991; Amari et al., 1992) and many others. Mathematicians are studying this new geometrical ... |

1 | Neural networks and related methods for classification - Ripley - 1994 |

Citation Context: ...cution mode. This is a stochastic model of (deterministic) neural networks. This suggests the usefulness of statistical ideas in neural networks (see, e.g., Amari, 1990; Cheng and Titterington, 1994; Ripley, 1994; White, 1989). Another quite different but important idea for developing a theory of neural networks originates from geometry. Let us consider a neural network including modifiable parameters (connec... |

1 | New gating net for mixture of experts, EM algorithm and piecewise function approximations, preprint - Xu, Jordan, et al. - 1994 |

1 | Piecewise-linear division of signal space by a multilayer neural network with the maximum detector (in Japanese), Trans. Inst - Zhuang, Amari - 1993 |