## Diffusion of context and credit information in Markovian models (1995)

### Download Links

- www.iro.umontreal.ca
- www-dsi.ing.unifi.it
- DBLP

### Other Repositories/Bibliography

Venue: Journal of Artificial Intelligence Research

Citations: 18 (2 self)

### BibTeX

```
@ARTICLE{Bengio95diffusionof,
  author  = {Yoshua Bengio and Paolo Frasconi},
  title   = {Diffusion of context and credit information in Markovian models},
  journal = {Journal of Artificial Intelligence Research},
  year    = {1995},
  volume  = {3},
  pages   = {249--270}
}
```

### Abstract

This paper studies the ergodicity of transition probability matrices in Markovian models, such as hidden Markov models (HMMs), and how it makes very difficult the task of learning to represent long-term context in sequential data. This phenomenon hurts the forward propagation of long-term context information, as well as the learning of a hidden state representation of long-term context, which depends on propagating credit information backwards in time. Using results from Markov chain theory, we show that this problem of diffusion of context and credit is reduced when the transition probabilities approach 0 or 1, i.e., when the transition probability matrices are sparse and the model essentially deterministic. The results in this paper apply to learning approaches based on continuous optimization, such as gradient descent and the Baum-Welch algorithm.
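The abstract's central claim can be illustrated numerically: under an ergodic transition matrix, information about the initial state washes out as the chain runs, while a near-deterministic (sparse) matrix preserves it. The sketch below uses two small matrices invented for illustration; it is not from the paper's experiments.

```python
import numpy as np

# An ergodic, well-mixing stochastic matrix: every state reaches every other.
A_ergodic = np.array([[0.5, 0.3, 0.2],
                      [0.2, 0.5, 0.3],
                      [0.3, 0.2, 0.5]])

# A near-deterministic matrix: transition probabilities close to 0 or 1.
A_sparse = np.array([[0.98, 0.01, 0.01],
                     [0.01, 0.98, 0.01],
                     [0.01, 0.01, 0.98]])

def context_distance(A, t):
    """L1 distance between the state distributions after t steps,
    starting the chain from two different initial states."""
    p = np.array([1.0, 0.0, 0.0])   # chain started in state 1
    q = np.array([0.0, 0.0, 1.0])   # chain started in state 3
    At = np.linalg.matrix_power(A, t)
    return np.abs(p @ At - q @ At).sum()

# Ergodic case: the two starting conditions become indistinguishable
# (context has diffused); near-deterministic case: they stay well apart.
print(context_distance(A_ergodic, 50))   # essentially 0
print(context_distance(A_sparse, 50))    # still clearly positive
```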

### Citations

8089 | Maximum likelihood from incomplete data via the EM algorithm
- Dempster, Laird, et al.
- 1977
Citation Context: ...controls transition probabilities, i.e. inputs for IOHMMs and actions for POMDPs. The negative results presented in this paper are directly applicable to learning algorithms such as the EM algorithm (Dempster, Laird, & Rubin, 1977) or other gradient-based optimization algorithms, which rely on gradually and iteratively modifying continuous-valued parameters (such as transition probabilities, or parameters of a function computi...

4273 | A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition
- Rabiner
- 1989
Citation Context: ...iable x_t can take values in {1, ..., n} at each time step. We will write A_ij for the element (i, j) of a matrix A, A^n = AA...A for the n-th power of A, and (A^n)_ij for the element (i, j) of A^n. See (Rabiner, 1989) for an introduction to HMMs, and (Seneta, 1981) for a basic reference on positive matrices. The Markovian independence assumption implies that the state variable x_t summarizes the past of the sequen...

2723 | Learning internal representations by error propagation - Rumelhart, Hinton, et al. - 1986 |

772 |
A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains
- Baum, Petrie, et al.
- 1970
Citation Context: ...Publishers. All rights reserved. Bengio & Frasconi later in the sequence, one can recursively propagate credit or error information backwards in time. For example, the Baum-Welch algorithm for HMMs (Baum, Petrie, Soules, & Weiss, 1970; Levinson, Rabiner, & Sondhi, 1983) and the back-propagation through time algorithm for recurrent neural networks (Rumelhart et al., 1986) rely on this kind of recursion. Numerous gradient-descent ba...
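The backward recursion mentioned in this context (the beta pass of Baum-Welch) is the mechanism by which credit is propagated toward earlier time steps. A minimal sketch on a toy two-state HMM, with parameters invented purely for illustration:

```python
import numpy as np

# Toy 2-state HMM; all parameters chosen arbitrarily for illustration.
A = np.array([[0.7, 0.3],   # A[i, j] = P(x_{t+1} = j | x_t = i)
              [0.4, 0.6]])
B = np.array([[0.9, 0.1],   # B[i, o] = P(observation o | state i)
              [0.2, 0.8]])
obs = [0, 1, 1, 0]

# Backward pass: beta[t, i] = P(o_{t+1}, ..., o_T | x_t = i).
# Credit flows from the end of the sequence toward the start -- the
# direction in which the paper shows diffusion hurts learning.
T = len(obs)
beta = np.ones((T, 2))
for t in range(T - 2, -1, -1):
    beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])

print(beta[0])   # credit assigned to each possible initial state
```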

373 |
The estimation of stochastic context-free grammars using the Inside–Outside algorithm. Computer Speech and Language
- Lari, Young
- 1990
Citation Context: ...are found in the area of grammar inference for natural language modeling (e.g., variable memory length Markov models, Ron et al., 1994, or constructive algorithms for learning context-free grammars, Lari & Young, 1990, Stolcke & Omohundro, 1993). The problem of diffusion studied here applies only to algorithms that use gradient information (such as the Baum-Welch and gradient-based algorithms) and a gradual modifica...

336 | The Optimal Control of Partially Observable Markov Processes - Sondik - 1971 |

315 | Non-negative Matrices and Markov Chains - Seneta - 1981 |

253 | Learning longterm dependencies with gradient descent is difficult
- Bengio, Simard, et al.
- 1994
Citation Context: ...). Yet, many researchers have found practical difficulties in training recurrent networks to perform tasks in which the temporal contingencies present in the input/output sequences span long intervals (Bengio, Simard, & Frasconi, 1994; Mozer, 1992; Rohwer, 1994). Bengio et al. (1994) have also found theoretical reasons for this difficulty and proved a negative result for parametric dynamical systems with a non-linear state to next-s...

205 |
Automatic Speech Recognition: The Development of the Sphinx Recognition System
- Lee
- 1988
Citation Context: ...tion 5 and illustrated in Figure 5). Unfortunately, this generally supposes prior knowledge of an appropriate connectivity graph. In practical applications of HMMs, for example to speech recognition (Lee, 1989; Rabiner, 1989) or protein secondary structure modeling (Chauvin & Baldi, 1995), prior knowledge is heavily used in setting up the connectivity graph. As illustrated in Figure 4, in speech recognitio...

194 | Reinforcement Learning with Perceptual Aliasing: The Perceptual Distinctions Approach
- Chrisman
- 1992
Citation Context: ...0; Levinson et al., 1983) as well as variations of HMMs such as Input/Output HMMs (IOHMMs) (Bengio & Frasconi, 1995b), and Partially Observable Markov Decision Processes (POMDPs) (Sondik, 1973, 1978; Chrisman, 1992). We find that in general, a phenomenon of diffusion of context and credit assignment, due to the ergodicity of the transition probability matrices, hampers both the representation and the learning of l...

181 |
An introduction to the application of the theory of probabilistic functions of a Markov process to automatic speech recognition
- Levinson, Rabiner
- 1983
Citation Context: ...gio & Frasconi later in the sequence, one can recursively propagate credit or error information backwards in time. For example, the Baum-Welch algorithm for HMMs (Baum, Petrie, Soules, & Weiss, 1970; Levinson, Rabiner, & Sondhi, 1983) and the back-propagation through time algorithm for recurrent neural networks (Rumelhart et al., 1986) rely on this kind of recursion. Numerous gradient-descent based algorithms have been proposed f...

163 | Maximum Mutual Information Estimation of Hidden Markov Model Parameters for Speech Recognition - Bahl, Brown, de Souza, et al. - 1986 |

134 | Hidden markov model induction by bayesian model merging - Stolcke, Omohundro - 1993 |

108 | An input output HMM architecture
- Bengio, Frasconi
- 1995
Citation Context: ...found by Bengio et al. (1994) to the case of Markovian models, which include standard HMMs (Baum et al., 1970; Levinson et al., 1983) as well as variations of HMMs such as Input/Output HMMs (IOHMMs) (Bengio & Frasconi, 1995b), and Partially Observable Markov Decision Processes (POMDPs) (Sondik, 1973, 1978; Chrisman, 1992). We find that in general, a phenomenon of diffusion of context and credit assignment, due to the ergod...

76 | The power of amnesia
- RON, SINGER, et al.
- 1994
Citation Context: ...te to explore the (legal) corners of this hypercube. Examples of this approach are found in the area of grammar inference for natural language modeling (e.g., variable memory length Markov models, Ron et al., 1994, or constructive algorithms for learning context-free grammars, Lari & Young, 1990, Stolcke & Omohundro, 1993). The problem of diffusion studied here applies only to algorithms that use gradient infor...

69 | Global optimization of a neural network-hidden Markov model hybrid - Bengio, Mori, et al. - 1992 |

55 |
Induction of multiscale temporal structure
- Mozer
- 1992
Citation Context: ...d practical difficulties in training recurrent networks to perform tasks in which the temporal contingencies present in the input/output sequences span long intervals (Bengio, Simard, & Frasconi, 1994; Mozer, 1992; Rohwer, 1994). Bengio et al. (1994) have also found theoretical reasons for this difficulty and proved a negative result for parametric dynamical systems with a non-linear state to next-state recurren...

33 | Unified integration of explicit rules and learning by example in recurrent networks - Frasconi, Gori, et al. - 1995 |

31 |
ALPHA-NETS: A recurrent `neural' network architecture with a hidden Markov model interpretation
- Bridle
- 1990
Citation Context: ...e that this product gives the gradient of π_T with respect to π_t (from equation 1) and is used in the EM algorithm (Baum et al., 1970; Levinson et al., 1983) as well as in gradient-based algorithms (Bridle, 1990; Bengio, De Mori, Flammia, & Kompe, 1992; Bengio & Frasconi, 1995b). For example, in the case of a learning criterion L, ∂L/∂π_t = V_t ∂L/∂π_T, where ∂L/∂π_t is the vector [∂L/∂π_{1,t} ... ∂L/∂π_{n,t}]. Since...
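This context describes the gradient of the final state distribution with respect to an earlier one as a product of matrices. For a homogeneous model that product is simply a power of the transition matrix, so gradients flowing backwards shrink geometrically when the matrix is ergodic. A sketch with a matrix and a gradient vector both invented for illustration:

```python
import numpy as np

# Illustrative ergodic transition matrix (invented, not from the paper).
A = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.5, 0.3],
              [0.3, 0.2, 0.5]])

# For a homogeneous model the state distribution evolves as pi_{t+1} = A.T @ pi_t,
# so back-propagating a gradient g = dL/d(pi_T) through T - t steps multiplies
# it by A**(T - t).
g = np.array([1.0, -0.5, -0.5])   # hypothetical gradient w.r.t. pi_T (sums to 0)
for steps in (1, 10, 50):
    back = np.linalg.matrix_power(A, steps) @ g
    print(steps, np.linalg.norm(back))   # norm shrinks geometrically with steps
```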

26 | Credit assignment through time: Alternative to backpropagation
- Bengio
- 1994
Citation Context: ...arguments were also supported by experiments on artificial data, studying the phenomenon of diffusion of credit and the corresponding difficulty in training HMMs to learn long-term dependencies. IOHMMs (Bengio & Frasconi, 1994, 1995b) and POMDPs (Sondik, 1973, 1978; Chrisman, 1992) are non-homogeneous variants of HMMs, i.e., the transition probabilities are a function of the input (for IOHMMs) or the action (for POMDPs) at e...

16 |
Hidden Markov Models of the G-Protein Coupled Receptor Family
- Baldi, Chauvin
- 1994
Citation Context: ...supposes prior knowledge of an appropriate connectivity graph. In practical applications of HMMs, for example to speech recognition (Lee, 1989; Rabiner, 1989) or protein secondary structure modeling (Chauvin & Baldi, 1995), prior knowledge is heavily used in setting up the connectivity graph. As illustrated in Figure 4, in speech recognition systems the meaning of individual states is usually fixed a priori except withi...

9 | Diffusion of credit in markovian models - Bengio, Frasconi - 1995 |

7 |
Introduction to matrix analysis, 2nd Edition
- Bellman
- 1970
Citation Context: ...that a primitive stochastic matrix is necessarily allowable. 2.3 The Perron-Frobenius Theorem Right eigenvectors v of a matrix A and their corresponding eigenvalues λ have the following properties (see Bellman, 1974, for more on eigenvalues and eigenvectors): Av = λv, i.e., Σ_j A_ij v_j = λ v_i, where I is the identity matrix and det(A − λI) = 0. Note that for a stochastic matrix A the largest eigenvalue has norm...
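The Perron-Frobenius property quoted in this context (the largest eigenvalue of a stochastic matrix has norm 1) is easy to check numerically, and the magnitude of the second-largest eigenvalue is what governs how fast A^n converges, i.e., the diffusion rate. The matrix below is invented for illustration:

```python
import numpy as np

# Any row-stochastic matrix (rows sum to 1); entries invented for illustration.
A = np.array([[0.6, 0.3, 0.1],
              [0.2, 0.5, 0.3],
              [0.1, 0.4, 0.5]])

mags = np.sort(np.abs(np.linalg.eigvals(A)))[::-1]

# Perron-Frobenius: the dominant eigenvalue is exactly 1 (the all-ones
# vector is a right eigenvector, since each row sums to 1).
print(mags[0])   # 1.0 up to floating point
# For a primitive (all-positive) matrix the second eigenvalue lies strictly
# inside the unit circle; its magnitude sets how fast A^n converges, i.e.
# how fast information about the initial state diffuses away.
print(mags[1])
```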

5 |
The time dimension of neural network models
- Rohwer
- 1994
Citation Context: ...difficulties in training recurrent networks to perform tasks in which the temporal contingencies present in the input/output sequences span long intervals (Bengio, Simard, & Frasconi, 1994; Mozer, 1992; Rohwer, 1994). Bengio et al. (1994) have also found theoretical reasons for this difficulty and proved a negative result for parametric dynamical systems with a non-linear state to next-state recurrence x_t = f_t(x...

