## Variational learning for switching state-space models (1998)

Venue: Neural Computation

Citations: 148 (6 self)

### BibTeX

```bibtex
@ARTICLE{Ghahramani98variationallearning,
  author  = {Zoubin Ghahramani and Geoffrey E. Hinton},
  title   = {Variational learning for switching state-space models},
  journal = {Neural Computation},
  year    = {1998},
  volume  = {12},
  pages   = {963--996}
}
```

### Abstract

We introduce a new statistical model for time series which iteratively segments data into regimes with approximately linear dynamics and learns the parameters of each of these linear regimes. This model combines and generalizes two of the most widely used stochastic time series models -- hidden Markov models and linear dynamical systems -- and is closely related to models that are widely used in the control and econometrics literatures. It can also be derived by extending the mixture of experts neural network (Jacobs et al., 1991) to its fully dynamical version, in which both expert and gating networks are recurrent. Inferring the posterior probabilities of the hidden states of this model is computationally intractable, and therefore the exact Expectation Maximization (EM) algorithm cannot be applied. However, we present a variational approximation that maximizes a lower bound on the log likelihood and makes use of both the forward-backward recursions for hidden Markov models and the Kalman filter recursions for linear dynamical systems. We tested the algorithm both on artificial data sets and on a natural data set of respiration force from a patient with sleep apnea. The results suggest that variational approximations are a viable method for inference and learning in switching state-space models.

### Citations

9088 | Elements of information theory - Cover, Thomas - 1991

Citation Context: ... (10) $\ge \sum_{\{S_t\}} \int Q(\{S_t, X_t\}) \log \frac{P(\{S_t, X_t, Y_t\} \mid \theta)}{Q(\{S_t, X_t\})} \, d\{X_t\} = B(Q, \theta)$ (11), where $\theta$ denotes the parameters of the model and we have made use of Jensen's inequality (Cover and Thomas, 1991) to establish (11). Both steps of EM increase the lower bound on the log probability of the observed data. The E-step holds the parameters fixed and sets Q to be the posterior distribution over the hid...
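The Jensen's-inequality bound quoted in this context can be checked numerically. Below is a minimal sketch with a hypothetical two-state toy joint distribution (our own example, not from the paper): any distribution Q gives a lower bound on the log likelihood, with equality at the exact posterior.

```python
import numpy as np

# Toy joint P(s, y) over a binary hidden state s and one fixed observation y.
p_joint = np.array([0.3, 0.1])          # P(s=0, y), P(s=1, y)
log_lik = np.log(p_joint.sum())          # log P(y)

def lower_bound(q):
    """B(Q) = sum_s Q(s) log [P(s, y) / Q(s)]  (Jensen's inequality)."""
    q = np.asarray(q, dtype=float)
    return float(np.sum(q * (np.log(p_joint) - np.log(q))))

# Any Q gives B(Q) <= log P(y) ...
assert lower_bound([0.5, 0.5]) <= log_lik + 1e-12
# ... with equality when Q is the exact posterior P(s | y).
posterior = p_joint / p_joint.sum()
assert abs(lower_bound(posterior) - log_lik) < 1e-9
```

The E-step of EM corresponds to the last two lines: setting Q to the posterior makes the bound tight.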

8844 | Maximum likelihood from incomplete data via the EM algorithm - Dempster, Laird, et al. - 1977

Citation Context: ...jung and Soderstrom, 1983). Similar gradient-based methods can be obtained for off-line methods. An alternative method for off-line learning makes use of the Expectation Maximization (EM) algorithm (Dempster et al., 1977). This procedure iterates between an E-step that fixes the current parameters and computes posterior probabilities over the hidden states given the observations, and an M-step that maximizes the expect...

7414 | Probabilistic reasoning in intelligent systems: Networks of plausible inference - Pearl - 1988

Citation Context: ...Smyth, Heckerman and Jordan (1997), the forward-backward algorithm is a special case of exact inference algorithms for more general graphical probabilistic models (Lauritzen and Spiegelhalter, 1988; Pearl, 1988). The same observation holds true for the Kalman smoothing recursions. The other inference problem commonly posed for HMMs is to compute the single most likely sequence of hidden states. The solution...

2373 | Time Series Analysis - Hamilton - 1994

1334 | Local computations with probabilities on graphical structures and their applications to expert systems - Lauritzen, Spiegelhalter - 1988

Citation Context: ...(personal communication, 1985) and Smyth, Heckerman and Jordan (1997), the forward-backward algorithm is a special case of exact inference algorithms for more general graphical probabilistic models (Lauritzen and Spiegelhalter, 1988; Pearl, 1988). The same observation holds true for the Kalman smoothing recursions. The other inference problem commonly posed for HMMs is to compute the single most likely sequence of hidden states....

1221 | A new approach to the economic analysis of nonstationary time series and the business cycle - Hamilton - 1989

Citation Context: ...from one linear operating regime to another. There is a large literature on models of this kind in econometrics, signal processing, and other fields (Harrison and Stevens, 1976; Chang and Athans, 1978; Hamilton, 1989; Shumway and Stoffer, 1991; Bar-Shalom and Li, 1993). Here we extend these models to allow for multiple real-valued state vectors, draw connections between these fields and the relevant literature on ...

895 | An introduction to hidden Markov models - Rabiner, Juang - 1986

Citation Context: ...l., 1994), and fault detection (Smyth, 1994). Given an HMM with known parameters and a sequence of observations, two algorithms are commonly used to solve two different forms of the inference problem (Rabiner and Juang, 1986). The first computes the posterior probabilities of the hidden states using a recursive algorithm known as the forward-backward algorithm. The computations in the forward pass are exactly analogous to ...
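The forward-backward recursions mentioned in this context can be sketched for a discrete-observation HMM. This is the standard scaled version; the variable names and toy numbers are ours, not the paper's.

```python
import numpy as np

def forward_backward(pi, A, B, obs):
    """Posterior marginals P(S_t | Y_1..Y_T) for a discrete HMM (scaled recursions).

    pi: initial state probs (K,); A: transitions (K, K) with A[i, j] = P(j | i);
    B: emission probs (K, M); obs: list of observed symbol indices.
    """
    T, K = len(obs), len(pi)
    alpha = np.zeros((T, K)); beta = np.ones((T, K)); scale = np.zeros(T)
    # Forward pass: alpha[t] = P(S_t | Y_1..Y_t), rescaled to sum to 1 each step.
    alpha[0] = pi * B[:, obs[0]]
    scale[0] = alpha[0].sum(); alpha[0] /= scale[0]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
        scale[t] = alpha[t].sum(); alpha[t] /= scale[t]
    # Backward pass with matching scaling: beta[t] ~ P(Y_{t+1}..Y_T | S_t).
    for t in range(T - 2, -1, -1):
        beta[t] = (A @ (B[:, obs[t + 1]] * beta[t + 1])) / scale[t + 1]
    gamma = alpha * beta
    return gamma / gamma.sum(axis=1, keepdims=True)

# Toy 2-state, 2-symbol example.
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
gamma = forward_backward(pi, A, B, [0, 1, 0])   # each row sums to 1
```

For such a tiny chain the result can be cross-checked by enumerating all state sequences, which is how one verifies the recursion's O(TK^2) shortcut.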

866 | An Introduction to Variational Methods for Graphical Models - Jordan, Ghahramani, et al. - 1997

836 | A Maximization Technique Occurring in the Statistical Analysis of Probabilistic Functions of Markov Chains - Baum, Petrie, et al. - 1970

Citation Context: ...ich also consists of a forward and backward pass through the model. To learn maximum likelihood parameters for an HMM given sequences of observations, one can use the well-known Baum-Welch algorithm (Baum et al., 1970). This algorithm is a special case of EM that uses the forward-backward algorithm to infer the posterior probabilities of the hidden states in the E-step. The M-step uses expected counts of transition...
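The "expected counts" M-step mentioned in this context reduces to renormalizing E-step statistics. A sketch with toy numbers (the `xi` array below is a hypothetical E-step output, not computed from data):

```python
import numpy as np

# Expected transition counts from the E-step (forward-backward):
# xi[t, i, j] ~ P(S_t = i, S_{t+1} = j | observations).  Toy values.
xi = np.array([[[0.5, 0.1], [0.2, 0.2]],
               [[0.3, 0.3], [0.1, 0.3]]])

# M-step: sum expected counts over time, then renormalize each row
# to obtain the updated transition matrix A[i, j] = P(j | i).
counts = xi.sum(axis=0)
A_new = counts / counts.sum(axis=1, keepdims=True)
assert np.allclose(A_new.sum(axis=1), 1.0)
```

Initial-state and emission parameters are re-estimated from analogous expected counts in the same pass.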

826 | Adaptive mixtures of local experts - Jacobs, Jordan, et al. - 1991

Citation Context: ...linear dynamical systems -- and is closely related to models that are widely used in the control and econometrics literatures. It can also be derived by extending the mixture of experts neural network (Jacobs et al., 1991) to its fully dynamical version, in which both expert and gating networks are recurrent. Inferring the posterior probabilities of the hidden states of this model is computationally intractable, and t...

767 | Optimal Filtering - Anderson, Moore - 1979

Citation Context: ...ian and the priors for the hidden states are Gaussian, the resulting posterior is also Gaussian. Three special cases of the inference problem are often considered: filtering, smoothing, and prediction (Anderson and Moore, 1979; Goodwin and Sin, 1984). The goal of filtering is to compute the probability of the current hidden state $X_t$ given the sequence of inputs and outputs up to time $t$: $P(X_t \mid \{Y\}_1^t, \{U\}_1^t)$. The re...
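The filtering problem described in this context has the classic Kalman recursion as its solution. A minimal sketch without the input sequence $\{U\}$ (our own notation and toy model, not the paper's):

```python
import numpy as np

def kalman_filter(A, C, Q, R, mu0, V0, ys):
    """Filtered means of P(X_t | Y_1..Y_t) for the linear Gaussian model
    X_t = A X_{t-1} + w_t,  Y_t = C X_t + v_t,  w ~ N(0, Q), v ~ N(0, R)."""
    mu, V = mu0, V0
    means = []
    for y in ys:
        # Predict one step ahead.
        mu_p = A @ mu
        V_p = A @ V @ A.T + Q
        # Update with the new observation via the Kalman gain.
        S = C @ V_p @ C.T + R
        K = V_p @ C.T @ np.linalg.inv(S)
        mu = mu_p + K @ (y - C @ mu_p)
        V = (np.eye(len(mu)) - K @ C) @ V_p
        means.append(mu)
    return np.array(means)
```

With very small observation noise R the gain approaches the identity and the filtered mean tracks the observations closely, which makes a convenient sanity check.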

755 | Hierarchical mixtures of experts and the EM algorithm - Jordan, Jacobs - 1994

Citation Context: ...large family of models. With regard to the literature on neural computation, the model presented in this paper is a generalization both of the mixture of experts neural network (Jacobs et al., 1991; Jordan and Jacobs, 1994) and the related mixture of factor analyzers (Hinton et al., 1996; Ghahramani and Hinton, 1996b). Previous dynamical generalizations of the mixture of experts architecture consider the case in which ...

511 | Factorial hidden Markov models - Ghahramani, Jordan - 1997

Citation Context: ...ion probabilities for the HMM. Note that the state vectors could be concatenated into one large state vector with factorized (block-diagonal) transition matrices (cf. factorial hidden Markov model; Ghahramani and Jordan, 1997). However, this obscures the decoupled structure of the model. Both classes of methods can be seen as minimizing Kullback-Leibler (KL) divergences. However, the KL divergence is asymmetrical, and w...

369 | On Gibbs Sampling for State Space Models - Carter, Kohn - 1994

Citation Context: ...eter estimation for this model, although without making reference to the EM algorithm. Other authors have used Markov chain Monte Carlo methods for state and parameter estimation in switching models (Carter and Kohn, 1994; Athaide, 1995) and in other related dynamic probabilistic networks (Dean and Kanazawa, 1989; Kanazawa et al., 1995). Hamilton (1989; 1994, section 22.4) describes a class of switching models in whic...

276 | A unifying review of linear Gaussian models - Roweis, Ghahramani - 1999

255 | An approach to time series smoothing and forecasting using the EM algorithm - Shumway, Stoffer - 1982

249 | Statistical Field Theory - Parisi - 1988

Citation Context: ...parameters. A completely factorized approximation is often used in statistical physics, where it provides the basis for simple yet powerful mean-field approximations to statistical mechanical systems (Parisi, 1988). Theoretical arguments motivating approximate E-steps are presented in Neal and Hinton (1998; originally in a technical report in 1993). Saul and Jordan (1996) showed that approximate E-steps could ...

233 | The EM algorithm for mixtures of factor analyzers - Ghahramani, Hinton - 1996

Citation Context: ...sented in this paper is a generalization both of the mixture of experts neural network (Jacobs et al., 1991; Jordan and Jacobs, 1994) and the related mixture of factor analyzers (Hinton et al., 1996; Ghahramani and Hinton, 1996b). Previous dynamical generalizations of the mixture of experts architecture consider the case in which the gating network has Markovian dynamics (Cacciatore and Nowlan, 1994; Kadirkamanathan and Kad...

232 | Dynamic Linear Models with Markov-Switching - Kim - 1994

197 | Multitarget-Multisensor Tracking - Bar-Shalom - 1992

Citation Context: ...There is a large literature on models of this kind in econometrics, signal processing, and other fields (Harrison and Stevens, 1976; Chang and Athans, 1978; Hamilton, 1989; Shumway and Stoffer, 1991; Bar-Shalom and Li, 1993). Here we extend these models to allow for multiple real-valued state vectors, draw connections between these fields and the relevant literature on neural computation and probabilistic graphical models...

173 | Probabilistic independence networks for hidden Markov probability models - Smyth, Heckerman, et al. - 1997

164 | Parameter Estimation for Linear Dynamical Systems - Ghahramani, Hinton

Citation Context: ...sented in this paper is a generalization both of the mixture of experts neural network (Jacobs et al., 1991; Jordan and Jacobs, 1994) and the related mixture of factor analyzers (Hinton et al., 1996; Ghahramani and Hinton, 1996b). Previous dynamical generalizations of the mixture of experts architecture consider the case in which the gating network has Markovian dynamics (Cacciatore and Nowlan, 1994; Kadirkamanathan and Kad...

161 | Hidden Markov models of biological primary sequence information - Baldi, Chauvin, et al. - 1994

Citation Context: ...y different forms, such as a Gaussian, mixture of Gaussians, or a neural network. HMMs have been applied extensively to problems in speech recognition (Juang and Rabiner, 1991), computational biology (Baldi et al., 1994), and fault detection (Smyth, 1994). Given an HMM with known parameters and a sequence of observations, two algorithms are commonly used to solve two different forms of the inference problem (Rabiner ...

155 | Modeling the Manifolds of Images of Handwritten Digits - Hinton, Dayan, et al. - 1997

Citation Context: ...tation, the model presented in this paper is a generalization both of the mixture of experts neural network (Jacobs et al., 1991; Jordan and Jacobs, 1994) and the related mixture of factor analyzers (Hinton et al., 1996; Ghahramani and Hinton, 1996b). Previous dynamical generalizations of the mixture of experts architecture consider the case in which the gating network has Markovian dynamics (Cacciatore and Nowlan, ...

155 | Stochastic simulation algorithms for dynamic probabilistic networks - Kanazawa, Koller, et al. - 1995

Citation Context: ...v chain Monte Carlo methods for state and parameter estimation in switching models (Carter and Kohn, 1994; Athaide, 1995) and in other related dynamic probabilistic networks (Dean and Kanazawa, 1989; Kanazawa et al., 1995). Hamilton (1989; 1994, section 22.4) describes a class of switching models in which the real-valued observation at time $t$, $Y_t$, depends both on the observations at times $t-1$ to $t-r$ and on the discr...

111 | An input/output HMM architecture - Bengio, Frasconi - 1995

Citation Context: ...dels, HMMs can be augmented to allow for input variables, such that they model the conditional distribution of sequences of output observations given sequences of inputs (Cacciatore and Nowlan, 1994; Bengio and Frasconi, 1995; Meila and Jordan, 1996). 2.3 Hybrids A burgeoning literature on models which combine the discrete transition structure of HMMs with the linear dynamics of SSMs has developed in fields ranging from eco...

103 | Exploiting tractable substructures in intractable networks - Saul, Jordan - 1996

99 | Hidden Markov Models: Estimation and Control - Elliott, Aggoun, et al. - 1995

96 | Hidden Markov Models for Speech Recognition - Strengths and Limitations - Rabiner, Juang - 1991

Citation Context: ...ation vector, $P(Y_t \mid S_t)$ can be modeled in many different forms, such as a Gaussian, mixture of Gaussians, or a neural network. HMMs have been applied extensively to problems in speech recognition (Juang and Rabiner, 1991), computational biology (Baldi et al., 1994), and fault detection (Smyth, 1994). Given an HMM with known parameters and a sequence of observations, two algorithms are commonly used to solve two differ...

88 | ML Estimation of a Stochastic Linear System with the EM Algorithm and Its Application to Speech Recognition - Digalakis, Rohlicek, et al. - 1993

Citation Context: ...E-step. For linear Gaussian state-space models, the E-step is exactly the Kalman smoothing problem as defined above, and the M-step simplifies to a linear regression problem (Shumway and Stoffer, 1982; Digalakis et al., 1993). Details on the EM algorithm for state-space models can be found in Ghahramani and Hinton (1996b), as well as in the original Shumway and Stoffer (1982) paper. 2.2 Hidden Markov models Hidden Markov ...

71 | Dynamic linear models with switching - Shumway, Stoffer - 1991

70 | Annealed competition of experts for a segmentation and classification of switching dynamics - Pawelzik, Kohlmorgen, et al. - 1996

61 | Bayesian Forecasting (with discussion) - Harrison, Stevens - 1976

Citation Context: ...ch the dynamics can transition in a discrete manner from one linear operating regime to another. There is a large literature on models of this kind in econometrics, signal processing, and other fields (Harrison and Stevens, 1976; Chang and Athans, 1978; Hamilton, 1989; Shumway and Stoffer, 1991; Bar-Shalom and Li, 1993). Here we extend these models to allow for multiple real-valued state vectors, draw connections between th...

59 | Mixtures of controllers for jump linear and non-linear plants - Cacciatore, Nowlan - 1994

Citation Context: ...tributed). Like state-space models, HMMs can be augmented to allow for input variables, such that they model the conditional distribution of sequences of output observations given sequences of inputs (Cacciatore and Nowlan, 1994; Bengio and Frasconi, 1995; Meila and Jordan, 1996). 2.3 Hybrids A burgeoning literature on models which combine the discrete transition structure of HMMs with the linear dynamics of SSMs has develop...

58 | Solutions to the Linear Smoothing Problem - Rauch - 1963

Citation Context: ...ard direction to compute the probability of $X_t$ given $\{Y\}_1^t$ and $\{U\}_1^t$. A similar set of backward recursions from $T$ to $t$ complete the computation by accounting for the observations after time $t$ (Rauch, 1963). We will refer to the combined forward and backward recursions for smoothing as the Kalman smoothing recursions (also known as the RTS or Rauch-Tung-Striebel smoother). Finally, the goal of predicti...

57 | A view of the EM algorithm that justifies incremental, sparse, and other variants. Learning in Graphical Models - Neal, Hinton - 1998

50 | On State Estimation in Switching Environments - Ackerson, Fu - 1970

Citation Context: ...ed neural network models. Shortly after Kalman and Bucy solved the problem of state estimation for linear Gaussian state-space models, attention turned to the analogous problem for switching models (Ackerson and Fu, 1970). Chang and Athans (1978) derive the equations for computing the conditional mean and variance of the state when the parameters of a linear state-space model switch according to arbitrary and Markovi...

41 | Hidden Markov models for fault detection in dynamic systems - Smyth - 1994

Citation Context: ...ure of Gaussians, or a neural network. HMMs have been applied extensively to problems in speech recognition (Juang and Rabiner, 1991), computational biology (Baldi et al., 1994), and fault detection (Smyth, 1994). Given an HMM with known parameters and a sequence of observations, two algorithms are commonly used to solve two different forms of the inference problem (Rabiner and Juang, 1986). The first computes ...

40 | Forecasting probability densities by using hidden Markov models with mixed states - Fraser, Dimitriadis - 1993

Citation Context: ...dden Markov model driving an $r$th order auto-regressive process, and are tractable for small $r$ and number of discrete states in $S$. Hamilton's models are closely related to the Hidden Filter HMM (HFHMM; Fraser and Dimitriadis, 1993). HFHMMs have both discrete and real-valued states. However, the real-valued states are assumed to be either observed or a known, deterministic function of the past observations (i.e. an embedding)....

26 |
State estimation for discrete systems with switching parameters
- Chang, Athans
- 1978
(Show Context)
Citation Context ...on in a discrete manner from one linear operating regime to another. There is a large literature on models of this kind in econometrics, signal processing, and otherselds (Harrison and Stevens, 1976; =-=Chang and Athans, 197-=-8; Hamilton, 1989; Shumway and Stoer, 1991; 1 Bar-Shalom and Li, 1993). Here we extend these models to allow for multiple real-valued state vectors, draw connections between theseselds and the relevan... |

26 | Time-series segmentation using predictive modular neural networks - Kehagias, Petridis - 1997

23 | Deterministic annealing variant of the EM algorithm - Ueda, Nakano - 1995

Citation Context: ...ature parameter, which is initialized to a large value and gradually reduced to 1. The above equations maximize a modified form of the bound B in (11), where the entropy of Q has been multiplied by T (Ueda and Nakano, 1995). 4.2 Merging Gaussians Almost all the approximate inference methods that are described in the literature for switching state-space models are based on the idea of merging, at each time step, a mixtu...
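The temperature-scaled entropy term described in this context has a simple closed-form maximizer: multiplying the entropy of Q by T turns the optimal responsibilities into a softmax at temperature T. A sketch in our own notation (not the paper's equations):

```python
import numpy as np

def annealed_responsibilities(log_p, T):
    """Maximizer of sum_s Q(s) log p(s) + T * H(Q) over distributions Q:
    Q(s) proportional to p(s)^(1/T), i.e. a softmax at temperature T."""
    z = np.asarray(log_p, dtype=float) / T
    z -= z.max()                      # subtract max for numerical stability
    q = np.exp(z)
    return q / q.sum()

# High T flattens Q toward uniform; T = 1 recovers the ordinary posterior.
```

Annealing starts with a large T (nearly uniform responsibilities, so no state is committed to too early) and cools toward T = 1, where the unmodified bound is recovered.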

19 | A Mixture-of-Experts Framework for Adaptive Kalman Filtering - Chaer, Bishop, et al. - 1997

Citation Context: ...to control engineering (Harrison and Stevens, 1976; Chang and Athans, 1978; Hamilton, 1989; Shumway and Stoffer, 1991; Bar-Shalom and Li, 1993; Deng, 1993; Kadirkamanathan and Kadirkamanathan, 1996; Chaer et al., 1997). These models are known alternately as hybrid models, state-space models with switching, and jump-linear systems. We briefly review some of this literature, including some related neural network mode...

15 | Multi-channel physiological data: Description and analysis - Rigney, Goldberger, et al. - 1993

11 | A stochastic model of speech incorporating hierarchical nonstationarity - Deng - 1993

Citation Context: ...of SSMs has developed in fields ranging from econometrics to control engineering (Harrison and Stevens, 1976; Chang and Athans, 1978; Hamilton, 1989; Shumway and Stoffer, 1991; Bar-Shalom and Li, 1993; Deng, 1993; Kadirkamanathan and Kadirkamanathan, 1996; Chaer et al., 1997). These models are known alternately as hybrid models, state-space models with switching, and jump-linear systems. We briefly review some...

11 | On Structured Variational Approximations - Ghahramani - 1997

Citation Context: ...quence, the zeros of the derivatives of KL with respect to the variational parameters can be obtained simply by equating derivatives of $\langle H \rangle$ and $\langle H_Q \rangle$ with respect to corresponding sufficient statistics (Ghahramani, 1997): $\partial \langle H_Q - H \rangle / \partial \langle S_t^{(m)} \rangle = 0$ (43), $\partial \langle H_Q - H \rangle / \partial \langle X_t^{(m)} \rangle = 0$ (44), $\partial \langle H_Q - H \rangle / \partial P_t^{(m)} = 0$ (45), where $P_t^{(m)} = \langle X_t^{(m)} X_t^{(m)\top} \rangle - \langle X_t^{(m)} \rangle \langle X_t^{(m)} \rangle^\top$ is the covariance of $X_t^{(m)}$ under Q. Many terms c...

11 | Recursive estimation of dynamic modular RBF networks - Kadirkamanathan, Kadirkamanathan - 1996

Citation Context: ...developed in fields ranging from econometrics to control engineering (Harrison and Stevens, 1976; Chang and Athans, 1978; Hamilton, 1989; Shumway and Stoffer, 1991; Bar-Shalom and Li, 1993; Deng, 1993; Kadirkamanathan and Kadirkamanathan, 1996; Chaer et al., 1997). These models are known alternately as hybrid models, state-space models with switching, and jump-linear systems. We briefly review some of this literature, including some related...

10 | Theory and Practice of Recursive Identification - Ljung, Soderstrom - 1983

8 | A model for reasoning about persistence and causation - Dean, Kanazawa - 1989

Citation Context: ...r authors have used Markov chain Monte Carlo methods for state and parameter estimation in switching models (Carter and Kohn, 1994; Athaide, 1995) and in other related dynamic probabilistic networks (Dean and Kanazawa, 1989; Kanazawa et al., 1995). Hamilton (1989; 1994, section 22.4) describes a class of switching models in which the real-valued observation at time $t$, $Y_t$, depends both on the observations at times $t-1$ ...

5 | Likelihood evaluation and state estimation for nonlinear state space models. Unpublished doctoral dissertation - Athaide - 1995

Citation Context: ...s model, although without making reference to the EM algorithm. Other authors have used Markov chain Monte Carlo methods for state and parameter estimation in switching models (Carter and Kohn, 1994; Athaide, 1995) and in other related dynamic probabilistic networks (Dean and Kanazawa, 1989; Kanazawa et al., 1995). Hamilton (1989; 1994, section 22.4) describes a class of switching models in which the real-valu...