## Learning Latent Variable and Predictive Models of Dynamical Systems (2009)

### BibTeX

```bibtex
@misc{Siddiqi09learninglatent,
  author = {Sajid M. Siddiqi and Andrew W. Moore and Jeff Schneider and Zoubin Ghahramani},
  title  = {Learning Latent Variable and Predictive Models of Dynamical Systems},
  year   = {2009}
}
```

### Abstract

Despite the single author listed on the cover, this dissertation is not the product of one person alone. I would like to acknowledge the many, many people who influenced me, my life and my work. They have all aided this research in different ways over the years and helped it come to a successful conclusion. Geoff Gordon, my advisor, has taught me a lot over the years: how to think methodically and analyze a problem, how to formulate problems mathematically, and how to choose interesting problems. From the outset, he has helped me develop the ideas that went into the thesis. Andrew Moore, my first advisor, got me started in machine learning and data mining and helped make this field fun and accessible to me, and his guidance and mentoring were crucial for work done early in my Ph.D. Both Geoff and Andrew are the very best kind of advisor I could have asked for: really smart, knowledgeable, caring and hands-on. They showed me how to be a good researcher while staying relaxed, calm and happy. Though I wasn't always able to strike that balance, the example they set was essential for me to be able to make it through without burning out in the process. All the members of the AUTON lab deserve much thanks, especially Artur Dubrawski

### Citations

8090 | Maximum likelihood from incomplete data via the EM algorithm
- Dempster, Laird, et al.
- 1977
Citation Context: ...algorithm in detail in Chapter 6. 2.4.1 Expectation Maximization: Given one or several sequences of observations and a desired number of states m, we can fit an HMM to the data using an instance of EM [14] called Baum-Welch [15], which was discovered before the general EM algorithm. Baum-Welch alternates between steps of computing a set of expected sufficient statistics from the observed data (the E-ste...
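The E-step/M-step alternation this snippet describes can be sketched in code. The following is a minimal discrete-observation Baum-Welch in Python; the model sizes, random initialization, and toy data are my own illustrative assumptions, not the thesis's setup:

```python
import numpy as np

def baum_welch(obs, m, k, iters=20, seed=0):
    """Minimal Baum-Welch (EM for a discrete HMM): alternate computing
    expected sufficient statistics (E-step) with closed-form parameter
    updates (M-step). obs: array of symbols in {0..k-1}; m: state count."""
    rng = np.random.default_rng(seed)
    pi = rng.dirichlet(np.ones(m))              # initial state distribution
    T = rng.dirichlet(np.ones(m), size=m)       # T[i, j] = Pr[j | i]
    O = rng.dirichlet(np.ones(k), size=m)       # O[i, x] = Pr[x | i]
    tau = len(obs)
    for _ in range(iters):
        # E-step: scaled forward-backward recursions
        alpha = np.zeros((tau, m)); c = np.zeros(tau)
        alpha[0] = pi * O[:, obs[0]]
        c[0] = alpha[0].sum(); alpha[0] /= c[0]
        for t in range(1, tau):
            alpha[t] = (alpha[t - 1] @ T) * O[:, obs[t]]
            c[t] = alpha[t].sum(); alpha[t] /= c[t]
        beta = np.ones((tau, m))
        for t in range(tau - 2, -1, -1):
            beta[t] = (T @ (O[:, obs[t + 1]] * beta[t + 1])) / c[t + 1]
        gamma = alpha * beta
        gamma /= gamma.sum(axis=1, keepdims=True)   # Pr[h_t | obs]
        xi = np.zeros((m, m))                       # expected transition counts
        for t in range(tau - 1):
            joint = alpha[t][:, None] * T * (O[:, obs[t + 1]] * beta[t + 1])[None, :]
            xi += joint / joint.sum()
        # M-step: re-normalized expected counts
        pi = gamma[0]
        T = xi / xi.sum(axis=1, keepdims=True)
        for x in range(k):
            O[:, x] = gamma[obs == x].sum(axis=0)
        O /= O.sum(axis=1, keepdims=True)
    return pi, T, O

# Toy run on a strictly alternating symbol sequence
pi, T, O = baum_welch(np.array([0, 1] * 50), m=2, k=2)
```

Each iteration monotonically increases the data likelihood, which is the monotonicity property the text appeals to.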

4828 | Neural Networks for Pattern Recognition
- Bishop
- 1995

Citation Context: ...rning HMM parameters from data. We focus on two of the most common techniques here, namely Baum-Welch and Viterbi Training. These are both iterative methods analogous to the popular k-means algorithm [10, 11] for clustering independent and identically distributed (IID) data, in the sense that they monotonically minimize a distortion function of the data with respect to a fixed number of “centers” (here, s...

4676 | Matrix Analysis
- Horn, Johnson
- 1986

Citation Context: ...Also note that the state covariance matrix Q̂ (see Equation (3.8d)) is positive semi-definite by construction, since it is the Schur complement [32] of the positive semi-definite matrix ∑_{t=1}^{T} E{ [x_t; x_{t−1}] [x_t; x_{t−1}]ᵀ | y_{1:T} } ≥ 0 (3.9). 3.3.2 Subspace Identification: Learning algorithms based on Expectation-Maximization (EM) iteratively optimize the observed d...
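The Schur-complement fact invoked here is easy to check numerically. A small sketch (the random PSD matrix is an illustrative choice of mine):

```python
import numpy as np

# The Schur complement of a positive semi-definite block matrix
# [[S11, S12], [S21, S22]] with respect to S22 is S11 - S12 S22^{-1} S21,
# and is itself positive semi-definite -- the property used to argue
# that the estimated state covariance Q_hat is PSD by construction.
rng = np.random.default_rng(1)
M = rng.standard_normal((4, 4))
S = M @ M.T + 1e-6 * np.eye(4)            # random symmetric PSD (in fact PD)
S11, S12 = S[:2, :2], S[:2, 2:]
S21, S22 = S[2:, :2], S[2:, 2:]
schur = S11 - S12 @ np.linalg.inv(S22) @ S21
eigs = np.linalg.eigvalsh(schur)           # all eigenvalues should be >= 0
```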

3921 | Pattern Classification and Scene Analysis
- Duda, Hart
- 1973

Citation Context: ...rning HMM parameters from data. We focus on two of the most common techniques here, namely Baum-Welch and Viterbi Training. These are both iterative methods analogous to the popular k-means algorithm [10, 11] for clustering independent and identically distributed (IID) data, in the sense that they monotonically minimize a distortion function of the data with respect to a fixed number of “centers” (here, s...

3667 | Convex Optimization
- Boyd, Vandenberghe
- 2004

Citation Context: ...e first obtain an estimate of the underlying state sequence using subspace identification. We then formulate the least-squares minimization problem for the dynamics matrix as a quadratic program (QP) [59], initially without constraints. When we solve this QP, the estimate Â we obtain may be unstable. However, any unstable solution allows us to derive a linear constraint which we then add to our origin...
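As a sketch of the starting point this snippet describes (the unconstrained least-squares problem for the dynamics matrix, plus the instability check that would trigger constraint generation), one might write the following. It is not the thesis's full constraint-generation algorithm, and the matrices and data are illustrative:

```python
import numpy as np

def estimate_dynamics(X):
    """Unconstrained least-squares estimate of A from a state sequence:
    minimize sum_t ||x_{t+1} - A x_t||^2 over A."""
    X0, X1 = X[:-1].T, X[1:].T            # columns are x_1..x_{T-1} and x_2..x_T
    return X1 @ np.linalg.pinv(X0)

def is_stable(A, tol=1e-9):
    """An LDS dynamics matrix is stable iff its spectral radius
    (largest eigenvalue magnitude) is at most 1."""
    return np.max(np.abs(np.linalg.eigvals(A))) <= 1 + tol

# Noiseless trajectory from a known stable A (illustrative)
A_true = np.array([[0.9, 0.1], [0.0, 0.8]])
x = np.array([1.0, 1.0])
traj = [x]
for _ in range(10):
    x = A_true @ x
    traj.append(x)
A_hat = estimate_dynamics(np.array(traj))
```

On noisy state estimates, `A_hat` can come out unstable even when the true system is stable, which is what motivates adding stability constraints to the QP.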

2307 | Estimating the dimension of a model
- Schwarz
- 1978

Citation Context: ...es method. We also evaluate STACS as an alternative learning algorithm for models of predetermined size. To determine the stopping point for state-splitting, we use the Bayesian Information Criterion [41], or BIC score. We discuss the benefits and drawbacks of this in Section 4.3.3. We compare our approach to previous work on synthetic data as well as several real-world data sets from the literature, ...
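The BIC score mentioned here trades off log-likelihood against model complexity. A minimal sketch, using the common "higher is better" sign convention; the parameter accounting for a discrete HMM is my own, not quoted from the thesis:

```python
import numpy as np

def bic_score(log_likelihood, num_params, n):
    """BIC under the 'higher is better' convention:
    log-likelihood minus a (k/2) log n complexity penalty."""
    return log_likelihood - 0.5 * num_params * np.log(n)

def hmm_param_count(m, k):
    """Free parameters of a discrete HMM with m states and k symbols:
    initial distribution, transition rows, and observation rows."""
    return (m - 1) + m * (m - 1) + m * (k - 1)

score = bic_score(-100.0, hmm_param_count(2, 2), 100)
```

During state-splitting, the penalty term grows with the state count, so splitting stops once the likelihood gain no longer pays for the extra parameters.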

2162 | Density Estimation for Statistics and Data Analysis. Monographs on Statistics and Applied Probability
- Silverman
- 1986

Citation Context: ...ce ɛ < 1 by assumption) = 0.9827ɛ < ɛ. This completes the proof of Theorem 2. □ A.1.5 Proof of Theorem 2 for Continuous Observations: For continuous observations, we use Kernel Density Estimation (KDE) [109] to model the observation probability density function (PDF). We use a fraction of the training data points as kernel centers, placing one multivariate Gaussian kernel at each point. The KDE estimat...
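The KDE estimator described here (one Gaussian kernel per kernel center) can be sketched as follows; the spherical-kernel form and fixed bandwidth are simplifying assumptions of mine:

```python
import numpy as np

def kde_logpdf(x, centers, bandwidth):
    """Log-density of a Gaussian KDE: the average of spherical
    multivariate Gaussian kernels, one placed at each center."""
    d = centers.shape[1]
    sq = ((x[None, :] - centers) ** 2).sum(axis=1) / (2.0 * bandwidth ** 2)
    log_norm = -0.5 * d * np.log(2.0 * np.pi * bandwidth ** 2)
    a = log_norm - sq                     # per-kernel log densities
    mx = a.max()                          # log-mean-exp for stability
    return mx + np.log(np.exp(a - mx).mean())

# Two 1-D kernel centers at -1 and +1, bandwidth 1 (illustrative)
density_at_0 = np.exp(kde_logpdf(np.array([0.0]), np.array([[-1.0], [1.0]]), 1.0))
```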

2112 | A New Approach to Linear Filtering and Prediction Problems
- Kalman
- 1960

Citation Context: ...ystems In the case where the state of an LVM is multivariate real-valued and the noise terms are Gaussian, the resulting model is called a linear dynamical system (LDS), also known as a Kalman Filter [24] or a state-space model [25]. LDSs are an important tool for modeling time series in engineering, controls and economics as well as the physical and social sciences. In this section we define LDSs and...

1191 | System Identification: Theory for the User
- Ljung
- 1999

Citation Context: ...s it relates to the LDS transition model, which will be relevant later in Chapter 6. More details on LDSs and algorithms for inference and learning in LDSs can be found in several standard references [26, 27, 28, 29]. 3.1 Definition: Linear dynamical systems can be described by the following two equations: x_{t+1} = A x_t + w_t, w_t ∼ N(0, Q) (3.1a); y_t = C x_t + v_t, v_t ∼ N(0, R) (3.1b). Time is indexed by the discrete index...
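Equations (3.1a)-(3.1b) quoted in this snippet are directly simulable. A small sketch; the specific A, C, Q, R below are illustrative choices of mine, not values from the thesis:

```python
import numpy as np

# x_{t+1} = A x_t + w_t,  w_t ~ N(0, Q)   (3.1a)
# y_t     = C x_t + v_t,  v_t ~ N(0, R)   (3.1b)
A = np.array([[0.95, 0.10], [0.00, 0.90]])   # stable: spectral radius < 1
C = np.array([[1.0, 0.0]])                    # observe the first state coordinate
Q = 0.01 * np.eye(2)
R = np.array([[0.05]])

def simulate_lds(A, C, Q, R, x0, steps, rng):
    """Roll the LDS forward, drawing process and observation noise each step."""
    xs, ys = [x0], []
    for _ in range(steps):
        x = xs[-1]
        ys.append(C @ x + rng.multivariate_normal(np.zeros(R.shape[0]), R))
        xs.append(A @ x + rng.multivariate_normal(np.zeros(Q.shape[0]), Q))
    return np.array(xs[:-1]), np.array(ys)

X, Y = simulate_lds(A, C, Q, R, np.array([1.0, 1.0]), 100, np.random.default_rng(0))
```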

1164 | Error bounds for convolutional codes and an asymptotically optimum decoding algorithm
- Viterbi
- 1967

Citation Context: ...Let λ denote the HMM and X denote the sequence of observations. Then, path inference computes H* such that H* = argmax_H Pr[X, H | λ]. For an observation sequence of length τ, the Viterbi algorithm [9] computes an optimal path in running time O(τm²) using dynamic programming. Define δ(t, i) as δ(t, i) = max_{h_1,…,h_{t−1}} Pr[h_1 h_2 ⋯ h_t = i, x_1 x_2 ⋯ x_t | λ]. Though computing δ(t, i) for all t, ...
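The δ(t, i) recursion and O(τm²) dynamic program sketched in this snippet translate almost directly to code. A minimal version; the toy parameters at the end are mine, for illustration only:

```python
import numpy as np

def viterbi(pi, T, O, obs):
    """Most likely hidden path for a discrete HMM via dynamic programming.

    pi: (m,) initial state distribution
    T:  (m, m) transition matrix, T[i, j] = Pr[h_{t+1}=j | h_t=i]
    O:  (m, k) observation matrix, O[i, x] = Pr[x | h=i]
    Runs in O(tau * m^2) time, as in the text.
    """
    tau, m = len(obs), len(pi)
    delta = np.zeros((tau, m))            # delta[t, i] in log space
    back = np.zeros((tau, m), dtype=int)  # backpointers
    with np.errstate(divide="ignore"):
        logpi, logT, logO = np.log(pi), np.log(T), np.log(O)
    delta[0] = logpi + logO[:, obs[0]]
    for t in range(1, tau):
        scores = delta[t - 1][:, None] + logT     # (m, m): prev state i -> state j
        back[t] = np.argmax(scores, axis=0)
        delta[t] = np.max(scores, axis=0) + logO[:, obs[t]]
    # Backtrack the optimal path
    path = np.zeros(tau, dtype=int)
    path[-1] = int(np.argmax(delta[-1]))
    for t in range(tau - 2, -1, -1):
        path[t] = back[t + 1, path[t + 1]]
    return path

# Toy 2-state example (hypothetical numbers, for illustration only)
pi = np.array([0.6, 0.4])
T = np.array([[0.7, 0.3], [0.4, 0.6]])
O = np.array([[0.9, 0.1], [0.2, 0.8]])
path = viterbi(pi, T, O, [0, 0, 1, 1])
```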

764 | A view of the EM algorithm that justifies incremental sparse and other variants
- Neal, Hinton
- 1998
Citation Context: ...is problem by guaranteeing a consistent increase in data likelihood as it searches the space of HMMs of varying state-space sizes, based on monotonicity results regarding variants of the EM algorithm [44]. We describe the overall algorithm as well as details of STACS and V-STACS below. First, some notation: when considering a split of state h, HMM parameters related to state h (denoted by λ_h), includi...

651 | UCI repository of machine learning databases. URL http://www.ics.uci.edu/~mlearn/MLRepository.html
- Newman, Hettich, et al.
- 1998

Citation Context: ...alistic domains. The dimensionality of all datasets was reduced by PCA to 5 for efficiency, except for the Motionlogger dataset, which is 2-D. The AUSL and Vowel data sets are from the UCI KDD archive [46]. We list the data sets with (training-set, test-set) sizes below. Robot: This data set consists of laser-range data provided by the Radish Robotics Data set Repository [47] gathered by a Pioneer indo...

647 | Time Series: Theory and Methods
- Brockwell, Davis
- 1991

Citation Context: ... state of an LVM is multivariate real-valued and the noise terms are Gaussian, the resulting model is called a linear dynamical system (LDS), also known as a Kalman Filter [24] or a state-space model [25]. LDSs are an important tool for modeling time series in engineering, controls and economics as well as the physical and social sciences. In this section we define LDSs and describe their inference an...

564 | Dynamic Bayesian networks: representation, inference and learning
- Murphy

Citation Context: ...and smoothing inference algorithms [24, 30] are instantiations of the junction tree algorithm for Bayesian Networks on a dynamic Bayesian network analogous to the one in Figure 2.1 (see, for example, [31]). 3.2.1 The Forward Pass (Kalman Filter): Let the mean and covariance of the belief state estimate Pr[X_t | y_{1:t}] at time t be denoted by x̂_t and P̂_t respectively. The estimates x̂_t and P̂_t can be pr...
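The forward pass described here alternates a measurement update of the belief (x̂_t, P̂_t) with a time update through the dynamics. A minimal sketch; the variable names and the toy system are my own, not the thesis's notation:

```python
import numpy as np

def kalman_filter(ys, A, C, Q, R, x0, P0):
    """Forward (filtering) pass: returns the filtered means and covariances
    of Pr[x_t | y_{1:t}] for each time step."""
    x, P = x0, P0
    means, covs = [], []
    for y in ys:
        # Measurement update: condition the predicted belief on y_t
        S = C @ P @ C.T + R                       # innovation covariance
        K = P @ C.T @ np.linalg.inv(S)            # Kalman gain
        x = x + K @ (y - C @ x)
        P = P - K @ C @ P
        means.append(x); covs.append(P)
        # Time update: propagate the belief through the dynamics
        x = A @ x
        P = A @ P @ A.T + Q
    return np.array(means), np.array(covs)

# Toy 1-D random walk observed directly, with constant observations
ys = [np.array([1.0])] * 20
means, covs = kalman_filter(ys, np.array([[1.0]]), np.array([[1.0]]),
                            np.array([[0.01]]), np.array([[0.1]]),
                            np.array([0.0]), np.array([[1.0]]))
```

With repeated observations of 1.0, the filtered mean converges toward 1 while the filtered covariance settles at a small steady-state value.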

538 | Hierarchical Dirichlet processes
- Teh, Jordan, et al.
- 2006

Citation Context: ...ailored for a particular task is the constrained HMM [20], which was developed originally in the context of speech recognition. Nonparametric methods such as Hierarchical Dirichlet Processes (HDPs) [21] have been used to define sampling-based versions of HMMs with “infinitely” many states [21, 22] which integrate out the hidden state parameter. This class of models has since been improved upon in sev...

512 | An inequality and associated maximization technique in statistical estimation of probabilistic functions of a Markov process
- Baum
- 1972

Citation Context: ...Chapter 6. 2.4.1 Expectation Maximization: Given one or several sequences of observations and a desired number of states m, we can fit an HMM to the data using an instance of EM [14] called Baum-Welch [15], which was discovered before the general EM algorithm. Baum-Welch alternates between steps of computing a set of expected sufficient statistics from the observed data (the E-step) and updating the par...

489 | Factorial Hidden Markov Models
- Ghahramani, Jordan
- 1998

Citation Context: ...modeling multiple interacting processes, Input-Output HMMs [17] which incorporate inputs into the model, hierarchical HMMs [18] for modeling hierarchically structured state spaces, and factorial HMMs [19] that model the state space in a distributed fashion. Another notable example of a specialized sub-class of HMMs tailored for a particular task is the constrained HMM [20], which was developed original...

391 | A Maximum Likelihood Approach to Continuous Speech Recognition
- Bahl, Jelinek, et al.
- 1983

Citation Context: ...e DBN view of sequential data modeling can be found in Murphy (2002). 3.1 Hidden Markov Models: Introduced in the late 1960s, HMMs have been used most extensively in speech recognition (Rabiner, 1989; Bahl et al., 1983), language modeling and bioinformatics (Krogh et al., 1994) but also in diverse application areas such as computer vision (Sunderesan et al., 2003) and information extraction (Seymore et al., 1999). ...

389 | On the method of bounded differences
- McDiarmid
- 1989

Citation Context: ...Before proving this lemma, we need some definitions and a preliminary result. First, we restate McDiarmid's Inequality [108]: Theorem 19: Let Z_1, . . . , Z_m be independent random variables all taking values in the set Z. Let c_i be some positive real numbers. Further, let f : Z^m → ℝ be a function of Z_1, . . . , Z_m that sat...
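For reference, the standard statement of the inequality restated in this snippet (my paraphrase of the usual form, not the thesis's exact wording):

```latex
% McDiarmid's inequality (bounded differences).
% If Z_1,\dots,Z_m are independent and f satisfies, for each i,
% |f(z_1,\dots,z_i,\dots,z_m) - f(z_1,\dots,z_i',\dots,z_m)| \le c_i,
% then for all \varepsilon > 0:
\Pr\bigl[f(Z_1,\dots,Z_m) - \mathbb{E}\,f(Z_1,\dots,Z_m) \ge \varepsilon\bigr]
  \le \exp\!\left(\frac{-2\varepsilon^2}{\sum_{i=1}^m c_i^2}\right)
```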

369 | Coupled Hidden Markov Models for complex action recognition
- Brand, Oliver, et al.
- 1997

Citation Context: ...ed in the late 1960s, HMMs have been used most extensively in speech recognition [4, 5] and bioinformatics [6] but also in diverse application areas such as computer vision and information extraction [7, 8]. For an excellent tutorial on HMMs, see Rabiner [4]. In this chapter we define HMMs and describe their standard inference and learning algorithms. 2.1 Definition: Let h_t ∈ {1, . . . , m} denote the disc...

364 | A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition
- Rabiner
- 1989

Citation Context: ...th survey of the DBN view of sequential data modeling can be found in Murphy (2002). 3.1 Hidden Markov Models: Introduced in the late 1960s, HMMs have been used most extensively in speech recognition (Rabiner, 1989; Bahl et al., 1983), language modeling and bioinformatics (Krogh et al., 1994) but also in diverse application areas such as computer vision (Sunderesan et al., 2003) and information extraction (Seym...

288 | Dynamic textures
- Doretto, Chiuso, et al.
- 2003
Citation Context: ...raint generation approach also results in much greater efficiency than previous methods in nearly all cases. One application of LDSs in computer vision is learning dynamic textures from video data [61]. An advantage of learning dynamic textures is the ability to play back a realistic-looking generated sequence of desired duration. In practice, however, videos synthesized from dynamic texture models ...

288 | Non-negative Matrix Factorization with Sparseness Constraints
- Hoyer
- 2004
Citation Context: ...nsition matrix. We could use EM to estimate T followed by (or combined with) matrix factorization algorithms such as Singular Value Decomposition (SVD) [32] or Non-negative Matrix Factorization (NMF) [76]. This approach has several drawbacks. For example, if the noisy estimate of a low-rank transition matrix is not low-rank itself, SVD could cause negative numbers to appear in the reconstructed transi...

279 | Visual recognition of American Sign Language using Hidden Markov models
- Starner, Pentland
- 1995
Citation Context: ... is carried out by scoring a test sequence with each HMM, and the sequence is labeled with the class of the highest-scoring HMM. One such classification problem is automatic sign-language recognition [50]. We test the effectiveness of our automatically learned HMMs at classification of Australian sign language using the AUSL dataset [1]. The data consists of sensor readings from a pair of Flock instru...

275 | Variational algorithms for approximate Bayesian inference
- Beal
- 2003

Citation Context: ... posterior, and hence is only an approximate scoring and stopping criterion. Furthermore, regular BIC is not entirely suitable for temporal models [45]. A scoring criterion based on variational Bayes [58] might be a better option, though scoring and stopping based on test-set likelihood, as mentioned earlier, would be best. In Chapter 6 we see a different approach to learning HMMs using matrix decompo...

245 | Point-based value iteration: an anytime algorithm for POMDPs
- Pineau, Gordon, et al.
- 2003

Citation Context: ... out using hand-coded POMDPs whose belief evolves in a pre-specified state space. However, the spectral learning algorithm for RR-HMMs, together with efficient point-based planning algorithms such as [98, 99, 100], can allow us to close the loop by learning models, planning in them and using the resultant data to update the learnt model. This effectively leaves the task of state space formulation up to spectra...

236 | The Hierarchical Hidden Markov Model: Analysis and Applications
- Fine, Singer, et al.
- 1998

228 | A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition
- Rabiner
- 1989

Citation Context: ... Models (HMMs) are LVMs where the underlying hidden variable can take on one of finitely many discrete values. Introduced in the late 1960s, HMMs have been used most extensively in speech recognition [4, 5] and bioinformatics [6] but also in diverse application areas such as computer vision and information extraction [7, 8]. For an excellent tutorial on HMMs, see Rabiner [4]. In this chapter we define H...

222 | An approach to time series smoothing and forecasting using the EM algorithm
- Shumway, Stoffer
- 1982

Citation Context: ... the matrix Q in equation (3.22) above is required to be positive definite. If ρ(A) = 1, the LDS is said to be marginally stable. 3.5 Related Work: The EM algorithm for LDS was originally presented in [34]. Auto-Regressive (AR), Moving Average (MA) and Auto-Regressive Moving Average (ARMA) models are simpler time series modeling methods that are provably subsumed by LDSs [25]. Nonlinear dynamical syste...

181 | Estimation and Tracking: Principles, Techniques, and Software
- Bar-Shalom, Li

Citation Context: ...onentially over time. They also propose an approximate filtering algorithm for this model based on a single Gaussian. [87] proposes learning algorithms for an LDS with switching observation matrices. [88] reviews models where both the observations and state variable switch according to a discrete variable with Markov transitions. Hidden Filter HMMs (HFHMMs) [89] combine discrete and real-valued state ...

180 | Predictive representations of state
- Littman, Sutton, et al.
- 2001

Citation Context: ...iple restarts to escape local minima. Though the RR-HMM is in itself novel, its low-dimensional R^k representation is a special case of several existing models such as Predictive State Representations [68], Observable Operator Models [69], generalized HMMs [70] and multiplicity automata [71, 72], ... [Figure: training data from a 4-state HMM; estimated 3-state HMM; estimated rank-3 RR-HMM]

156 | Parameter estimation for linear dynamical systems
- Ghahramani, Hinton
- 1996

Citation Context: ...or Bayesian Networks (see, for example, (Murphy, 2002)). Standard algorithms for learning LDS parameters learn locally optimal values by gradient descent (Ljung, 1999), Expectation Maximization (EM) (Ghahramani & Hinton, 1996) or least squares on a state sequence estimate obtained by subspace identification methods (Van Overschee & De Moor, 1996). 2.3 Predictive Models: Predictive State Representations (Littman et al., 200...

155 | Learning hidden Markov model structure for information extraction, AAAI-99
- Seymore, McCallum, et al.
- 1999

Citation Context: ...1989; Bahl et al., 1983), language modeling and bioinformatics (Krogh et al., 1994) but also in diverse application areas such as computer vision (Sunderesan et al., 2003) and information extraction (Seymore et al., 1999). A classic tutorial on HMMs can be found in the work of Rabiner (1989). More recently, HMMs and their algorithms have been re-examined in light of their connections to Bayesian Networks, such as in ...

151 | Mixture Kalman filters
- Chen, Liu
- 2000

Citation Context: .... The real-valued state is deterministically dependent on previous observations in a known manner, and only the discrete variable is hidden. This allows exact inference in this model to be tractable. [90] formulates the Mixture Kalman Filter (MKF) model along with a filtering algorithm, similar to [86] except that the filtering algorithm is based on sequential Monte-Carlo sampling. The commonly used H...

143 | Variational learning for switching state-space models
- Ghahramani, Hinton
- 2000

Citation Context: ...e shortly after the advent of LDSs, there have been attempts to combine the discrete states of HMMs with the smooth dynamics of LDSs. We perform a brief review of the literature on hybrid models; see [85] for a more thorough review. [86] formulates a switching LDS variant where both the state and observation variable noise models are mixtures of Gaussians with the mixture switching variable evolving ac...

124 | Learning dynamic Bayesian networks
- Ghahramani
- 1998

111 | Subspace Identification for Linear Systems: Theory, Implementation, Applications
- Van Overschee, De Moor
- 1996

Citation Context: ...ing EM, though the latter needs multiple restarts to discover the overlapping states and avoid local minima. ...and is also related to the representation of LDSs learned using Subspace Identification [27]. These and other related models and algorithms are discussed further in Section 6.4. To learn RR-HMMs from data, we adapt and extend a recently proposed spectral learning algorithm by Hsu, Kakade and...

109 | Handbook of Global Optimization
- Pardalos, Romeijn (Eds.)
- 2002

Citation Context: ...wo steps are iterated until we reach a stable solution, which is then refined by a simple interpolation to obtain the best possible stable estimate. Our method can be viewed as constraint generation (Horst & Pardalos, 1995) for an underlying convex program with a feasible set of all matrices with singular values at most 1, similar to work in control systems such as (Lacy & Bernstein, 2002). However, we terminate before...

108 | An input output HMM architecture
- Bengio, Frasconi
- 1995
Citation Context: ...nections to Bayesian Networks, such as in [16]. Many variations on the basic HMM model have also been proposed, such as coupled HMMs [7] for modeling multiple interacting processes, Input-Output HMMs [17] which incorporate inputs into the model, hierarchical HMMs [18] for modeling hierarchically structured state spaces, and factorial HMMs [19] that model the state space in a distributed fashion. Anoth...

108 | A generalization of principal component analysis to the exponential family
- Collins, Dasgupta, et al.
- 2001

Citation Context: ...al learning of exponential family RR-HMMs: Belief Compression [102] exploits the stochasticity of HMM beliefs in order to compress them more effectively for POMDP planning using exponential family PCA [103]. In a similar vein, it should be possible to extend RR-HMMs and their learning algorithms to the exponential family case. Gordon (2002) [104] presents a more efficient and general algorithm for carryi...

99 | Predictive State Representations: A New Theory for Modeling Dynamical Systems
- Singh, James, et al.
- 2004
Citation Context: ...ing SVD of a probability matrix relating past and future observations. This idea has roots in subspace identification [27, 29] and multiplicity automata [71, 72, 70] as well as the PSR/OOM literature [69, 77] and was recently formulated in a paper by Hsu, Kakade and Zhang [12] for full-rank HMMs. We use their algorithm, extending its theoretical guarantees for the low-rank HMM case where the rank of the t...

92 | Best-First Model Merging for Hidden Markov Model Induction, TR-94-003, International Computer Science Institute
- Stolcke, Omohundro
- 1994

Citation Context: ...ions. Bottom-up topology learning techniques start off with a superfluously large HMM and prune parameters and/or states incrementally to shrink the model to an appropriate size. Stolcke and Omohundro [42] demonstrate a Bayesian technique for learning HMMs by successively merging pairs of states for discrete-observation HMMs, followed by Baum-Welch to optimize parameter settings. Their model-merging te...

88 | Linear time inference in hierarchical HMMs
- Murphy, Paskin
- 2001

86 | Heteroscedastic discriminant analysis and reduced rank HMMs for improved speech recognition
- Kumar, Andreou
- 1998

Citation Context: ...stributions. 6.4.2 Hybrid Models, Mixture Models and other recent approaches: RR-HMMs and their algorithms are also related to other hybrid models. Note that previous models of the same name (e.g. [84]) address a completely different problem, i.e. reducing the rank of the Gaussian observation parameters. Since shortly after the advent of LDSs, there have been attempts to combine the discrete states...

84 | Gaussian process dynamical models
- Wang, Fleet, et al.
- 2006
Citation Context: ...ar dynamics, though the resulting model was less interpretable than the best SSSM. More recently, models for nonlinear time series modeling such as Gaussian Process Dynamical Models have been proposed [91]. However, the parameter learning algorithm is only locally optimal, and exact inference and simulation are very expensive, requiring MCMC over a long sequence of frames all at once. This necessitates...

72 | Learning nonlinear dynamical systems using an EM algorithm
- Ghahramani, Roweis
- 1999

67 | Observable Operator Models for Discrete Stochastic Time Series
- Jaeger
- 2000

Citation Context: ...ima. Though the RR-HMM is in itself novel, its low-dimensional R^k representation is a special case of several existing models such as Predictive State Representations [68], Observable Operator Models [69], generalized HMMs [70] and multiplicity automata [71, 72], ... [Figure: training data from a 4-state HMM; estimated 3-state HMM; estimated rank-3 RR-HMM]

66 | Structure learning in conditional probability models via an entropic prior and parameter extinction
- Brand
- 1998

64 | Dynamic linear models with switching
- Shumway, Stoffer
- 1991

Citation Context: ...uations where the number of Gaussians needed to represent the belief increases exponentially over time. They also propose an approximate filtering algorithm for this model based on a single Gaussian. [87] proposes learning algorithms for an LDS with switching observation matrices. [88] reviews models where both the observations and state variable switch according to a discrete variable with Markov tra...

63 | Finding Approximate POMDP Solutions Through Belief Compression
- Roy, Gordon, et al.
- 2005

Citation Context: ...stic, the bounds and proofs may differ. However, the same basic perturbation and sampling error ideas should, in theory, hold. 7.1.7 Spectral learning of exponential family RR-HMMs: Belief Compression [102] exploits the stochasticity of HMM beliefs in order to compress them more effectively for POMDP planning using exponential family PCA [103]. In a similar vein, it should be possible to extend RR-HMMs...