## An Unsupervised Ensemble Learning Method for Nonlinear Dynamic State-Space Models (2001)

### Download Links

- [www.cis.hut.fi]
- [cochlea.hut.fi]
- [www.lce.hut.fi]
- [www.gatsby.ucl.ac.uk]
- DBLP

### Other Repositories/Bibliography

Venue: Neural Computation

Citations: 89 (32 self)

### BibTeX

```bibtex
@ARTICLE{Valpola01anunsupervised,
  author  = {Harri Valpola and Juha Karhunen},
  title   = {An Unsupervised Ensemble Learning Method for Nonlinear Dynamic State-Space Models},
  journal = {Neural Computation},
  year    = {2001},
  volume  = {14},
  pages   = {2647--2692}
}
```

### Abstract

A Bayesian ensemble learning method is introduced for unsupervised extraction of dynamic processes from noisy data. The data are assumed to be generated by an unknown nonlinear mapping from unknown factors. The dynamics of the factors are modeled using a nonlinear state-space model. The nonlinear mappings in the model are represented using multilayer perceptron networks. The proposed method is computationally demanding, but it allows the use of higher-dimensional nonlinear latent variable models than other existing approaches. Experiments with chaotic data show that the new method is able to blindly estimate the factors and the dynamic process that have generated the data. It clearly outperforms currently available nonlinear prediction techniques in this very difficult test problem.
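The generative model summarized in the abstract pairs a nonlinear state-space model with MLP mappings for the dynamics and the observations. A minimal simulation sketch is below; the dimensions, network sizes, random initialization, and noise levels are illustrative assumptions, not the paper's settings:

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(x, W1, b1, W2, b2):
    """One-hidden-layer MLP with tanh units, the function family used for f and g."""
    return W2 @ np.tanh(W1 @ x + b1) + b2

# Illustrative sizes (not the paper's settings): 8 latent states,
# 10 observed signals, 30 hidden units.
n_s, n_x, n_h = 8, 10, 30

def init_params(n_in, n_out):
    """Small random weights for a hypothetical, untrained network."""
    return (0.1 * rng.standard_normal((n_h, n_in)), np.zeros(n_h),
            0.1 * rng.standard_normal((n_out, n_h)), np.zeros(n_out))

f_par = init_params(n_s, n_s)   # dynamics mapping f
g_par = init_params(n_s, n_x)   # observation mapping g

def simulate(T, proc_std=0.1, obs_std=0.1):
    """Draw a trajectory from s(t) = f(s(t-1)) + m(t), x(t) = g(s(t)) + n(t)."""
    s = rng.standard_normal(n_s)
    X = np.empty((T, n_x))
    for t in range(T):
        s = mlp(s, *f_par) + proc_std * rng.standard_normal(n_s)
        X[t] = mlp(s, *g_par) + obs_std * rng.standard_normal(n_x)
    return X

X = simulate(200)
print(X.shape)  # (200, 10)
```

The paper's contribution is learning f, g, and the factors s(t) from X alone with variational Bayesian (ensemble) learning; the sketch only shows the forward generative direction.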

### Citations

5575 | Neural Networks for Pattern Recognition - Bishop - 1995
Citation Context: ... assumed that the factors or sources s_i(t) (components of the vector s(t)) are gaussian, and the nonlinear mapping f is modeled by the standard multilayer perceptron (MLP) network. The MLP network (Bishop, 1995; Haykin, 1998) is used here because of its universal approximation property and ability to model well both mildly and strongly nonlinear mappings f. However, the cost function used in Bayesian ensem...

4555 | Neural Networks: A Comprehensive Foundation (2nd ed.) - Haykin - 1999
Citation Context: ...the factors or sources s_i(t) (components of the vector s(t)) are gaussian, and the nonlinear mapping f is modeled by the standard multilayer perceptron (MLP) network. The MLP network (Bishop, 1995; Haykin, 1998) is used here because of its universal approximation property and ability to model well both mildly and strongly nonlinear mappings f. However, the cost function used in Bayesian ensemble learning i...

1783 | Survey on independent component analysis - Hyvärinen - 1999
Citation Context: ...FA). It is also related to nonlinear blind source separation (BSS) and independent component analysis (ICA). For these problems we have recently developed several methods described in more detail in (Hyvärinen et al., 2001), Chapter 17, and in (Lappalainen and Honkela, 2000; Valpola et al., 2000; Valpola, 2000b). In particular, a method called nonlinear factor analysis (NFA) (Lappalainen and Honkela, 2000) serves as th...

1600 | Bayesian Data Analysis - Gelman - 2004
Citation Context: ...red to appendices for better readability. 2 Bayesian methods and ensemble learning 2.1 Bayesian inference In Bayesian data analysis and estimation methods (see for example (Bishop, 1995; Jordan, 1999; Gelman et al., 1995; Neal, 1996)) for continuous variables, all the uncertain quantities are modeled in terms of their joint probability density function (pdf). The key principle is to construct the joint posterior pdf ...

1067 | Monte Carlo Statistical Methods - Robert, Casella - 2004
Citation Context: ... enough in practical real-world problems. Another group of Bayesian methods used in neural network models consists of Markov Chain Monte Carlo (MCMC) techniques for numerical integration (Neal, 1996; Robert and Casella, 1999). They perform the necessary integrations needed in evaluating the evidence (3) and the posterior density (2) numerically by drawing samples from the true posterior distribution. MCMC techniques have...

1030 | Deterministic nonperiodic flow - Lorenz - 1963
Citation Context: ...ator with angular velocity 1/3. The harmonic oscillator has a two-dimensional state representation and linear dynamics. The two other dynamic processes were chosen to be independent Lorenz processes (Lorenz, 1963). The time series of the 8 states of the combined process are shown in the topmost subfigure in Fig. 3. There the three uppermost signals correspond to the first Lorenz process, the next three to the...

898 | An introduction to variational methods for graphical models - Jordan, Ghahramani, et al. - 1999
Citation Context: ...e unsupervised learning problems, because the number of parameters grows far too large for estimating them in any reasonable time. Other fully Bayesian approaches include various variational methods (Jordan et al., 1999) for approximating the integration by a tractable problem, and mean field approach (Winther, 1998), where the problem is simplified by neglecting certain dependences between the random variables. In ...

663 | Bayesian learning for neural networks - Neal - 1995
Citation Context: ...better readability. 2 Bayesian methods and ensemble learning 2.1 Bayesian inference In Bayesian data analysis and estimation methods (see for example (Bishop, 1995; Jordan, 1999; Gelman et al., 1995; Neal, 1996)) for continuous variables, all the uncertain quantities are modeled in terms of their joint probability density function (pdf). The key principle is to construct the joint posterior pdf for all the ...

603 | Detecting strange attractors in turbulence - Takens - 1981
Citation Context: ...suitable conditions, a sequence [x(t), x(t-1), ..., x(t-D)] of the data vectors contains all the information needed to reconstruct the original state if the number D of the delays is large enough (Takens, 1981). The solution used here is to initialize the sources to the principal components of concatenated 2d+1 subsequent data vectors z^T(t) = [x^T(t+d) x^T(t+d-1) ... x^T(t-d+1) x^T(t-...
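The initialization described in this excerpt (Takens-style delay embedding followed by PCA of the concatenated vectors) can be sketched as follows; the synthetic data, the choice d = 2, and the number of sources are illustrative assumptions:

```python
import numpy as np

def delay_embed(X, d):
    """Concatenate 2d+1 consecutive data vectors:
    output row t holds [x(t+d), x(t+d-1), ..., x(t-d)]."""
    T, n = X.shape
    blocks = [X[d + k : T - d + k] for k in range(d, -d - 1, -1)]
    return np.hstack(blocks)                    # shape (T - 2d, (2d+1)*n)

def pca_init(X, d, n_sources):
    """Initialize the sources to the leading principal components
    of the delay-embedded data."""
    Z = delay_embed(X, d)
    Zc = Z - Z.mean(axis=0)                     # center before PCA
    U, S, _ = np.linalg.svd(Zc, full_matrices=False)
    return U[:, :n_sources] * S[:n_sources]     # PC scores as initial sources

# Illustrative data: 500 ten-dimensional observations, d = 2, 8 sources.
rng = np.random.default_rng(1)
X = rng.standard_normal((500, 10))
S0 = pca_init(X, d=2, n_sources=8)
print(S0.shape)  # (496, 8)
```

Each embedded vector spans a window of 2d+1 time steps, so the embedding loses d samples at each end of the series; the PC scores then serve only as a starting point for the subsequent ensemble learning.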

445 | A practical Bayesian framework for backpropagation networks - MacKay - 1992
Citation Context: ...ough they often show a high but very narrow peak in their posterior pdf's corresponding to the overfitted parameters. The first practical Bayesian method for neural networks was introduced by MacKay (MacKay, 1992). He used for the posterior density of the parameters a gaussian approximation around their MAP estimate to evaluate the evidence of the model. However, such parametric approximation methods (Bishop,...

245 | Independent factor analysis - Attias - 1999
Citation Context: ... learning in similar problems using slightly different assumptions and methods (Choudrey et al., 2000; Miskin and MacKay, 2000; Miskin and MacKay, 2001; Roberts and Everson, 2001). The work by Attias (Attias, 1999a; Attias, 1999b; Attias, 2000), summarized in (Attias, 2001), is also closely related to ours, even though point estimates are partly used in his early work. In (Attias, 2000), this type of methods h...

192 | Nonlinear prediction of chaotic time series - Casdagli - 1989
Citation Context: ...proceeding, we give relevant references on nonlinear dynamic modeling. A review article on nonlinear predictive dynamic models containing comparisons of various techniques for chaotic time series is (Casdagli, 1989). A tutorial paper on using neural networks, in particular the well-known radial-basis function (RBF) networks (Bishop, 1995; Haykin, 1998) to dynamically model nonlinear chaotic time series is (Hayk...

151 | Variational learning for switching state-space models - Ghahramani, Hinton - 2000
Citation Context: ...e reviewed in (Ghahramani, 2001). These models comprise another popular technique of dynamic modeling, which in practice often provides fairly similar results as state-space models. See for instance (Ghahramani and Hinton, 2000) where variational methods resembling ensemble learning are applied to a state-space model where a hidden Markov model switches between different linear dynamical models. Cichocki et ...

137 | Keeping the neural networks simple by minimizing the description length of the weights - Hinton, Camp - 1993

125 | Neural and Adaptive Systems: Fundamentals through Simulations - Principe, Euliano, et al. - 2000
Citation Context: ...haotic time series is (Haykin and Principe, 1998). Dynamic modeling using various neural network structures such as recurrent or time-delayed neural networks is discussed on an introductory level in (Principe et al., 2000), and on somewhat more advanced level in (Haykin, 1998). Quite recently, Bayesian techniques have been introduced for the problem of learning the mappings f and g in Eqs. (10) and (11) in (Briegel an...

112 | Propagation algorithms for variational Bayesian learning - Ghahramani, Beal - 2001
Citation Context: ...each other. It is also possible that the iterations become unstable. For linear gaussian models it is possible to derive algorithms similar to Kalman smoothing using ensemble learning as was done in (Ghahramani and Beal, 2001). Our algorithm is designed for learning nonlinear models, and only propagates information one step forward and backward in time in the forward and backward phase. This makes learning stable but does...

79 | Learning nonlinear dynamical systems using an EM algorithm - Ghahramani, Roweis - 1999
Citation Context: ...hat more advanced level in (Haykin, 1998). Quite recently, Bayesian techniques have been introduced for the problem of learning the mappings f and g in Eqs. (10) and (11) in (Briegel and Tresp, 1999; Ghahramani and Roweis, 1999; Roweis and Ghahramani, 2001). In (Ghahramani and Roweis, 1999; Roweis and Ghahramani, 2001), the nonlinear mappings are modeled by RBF networks, and only the linear output layers of the RBF networks...

64 | Ensemble learning - Lappalainen, Miskin - 2000

62 | An introduction to hidden Markov models and Bayesian networks - Ghahramani
Citation Context: ...). Quite recently, Bayesian techniques have been introduced for the problem of learning the mappings f and g in Eqs. (10) and (11) in (Briegel and Tresp, 1999; Ghahramani and Roweis, 1999; Roweis and Ghahramani, 2001). In (Ghahramani and Roweis, 1999; Roweis and Ghahramani, 2001), the nonlinear mappings are modeled by RBF networks, and only the linear output layers of the RBF networks are adapted. This yields a v...

62 | Parameter Estimation: Principles and Problems - Sorenson - 1980
Citation Context: ...ion in all but the simplest problems. Furthermore, such classical methods still provide a point estimate θ which is somewhat arbitrarily chosen from the possible values of θ in the posterior density (Sorenson, 1980). Instead of searching for some point estimate, the correct Bayesian procedure is to perform estimation by averaging over the posterior distribution p(θ|X, H). This means that the estimates will be s...

59 | Bayesian nonlinear independent component analysis by multi-layer perceptrons - Lappalainen, Honkela - 2000
Citation Context: ...ource separation (BSS) and independent component analysis (ICA). For these problems we have recently developed several methods described in more detail in (Hyvärinen et al., 2001), Chapter 17, and in (Lappalainen and Honkela, 2000; Valpola et al., 2000; Valpola, 2000b). In particular, a method called nonlinear factor analysis (NFA) (Lappalainen and Honkela, 2000) serves as the starting point of the new method introduced in thi...

51 | Developments in probabilistic modelling with neural networks - ensemble learning - MacKay - 1995
Citation Context: ...e following. 2.2 Ensemble learning Ensemble learning (Barber and Bishop, 1998; Lappalainen and Miskin, 2000), also called variational Bayes, is a method developed recently (Hinton and van Camp, 1993; MacKay, 1995) for approximating the posterior density (2). It can be used both for parametric and variational approximation. In the former, some parameters characterizing the posterior pdf are optimized while in ...

47 | Ensemble learning for blind image separation and deconvolution - Miskin, MacKay - 2000
Citation Context: ...ion, with application to real-world speech data. Recently, several authors have studied ensemble learning in similar problems using slightly different assumptions and methods (Choudrey et al., 2000; Miskin and MacKay, 2000; Miskin and MacKay, 2001; Roberts and Everson, 2001). The work by Attias (Attias, 1999a; Attias, 1999b; Attias, 2000), summarized in (Attias, 2001), is also closely related to ours, even though point...

46 | Ensemble learning for independent component analysis - Lappalainen - 1999
Citation Context: ...ed learning problems (for example (Barber and Bishop, 1998)), but it can be used for unsupervised learning as well. The first author employed it for standard linear independent component analysis in (Lappalainen, 1999) using a fixed form approximation, with application to real-world speech data. Recently, several authors have studied ensemble learning in similar problems using slightly different assumptions and m...

29 | Building blocks for hierarchical latent variable models - Valpola, Raiko, et al. - 2001
Citation Context: ...in (Miskin and MacKay, 2001)) since they yield simple update equations. However, our choice of log-normal model for the variances makes it much easier to build a hierarchical model for the variances (Valpola et al., 2001). It should be noted that assuming all the parameters gaussian is not too restrictive, because the tanh nonlinearities in the MLP networks are able to transform the gaussian distributions to virtuall...

27 | An ensemble learning approach to independent component analysis - Choudrey, Penny, et al. - 2000
Citation Context: ...a fixed form approximation, with application to real-world speech data. Recently, several authors have studied ensemble learning in similar problems using slightly different assumptions and methods (Choudrey et al., 2000; Miskin and MacKay, 2000; Miskin and MacKay, 2001; Roberts and Everson, 2001). The work by Attias (Attias, 1999a; Attias, 1999b; Attias, 2000), summarized in (Attias, 2001), is also closely related t...

24 | Ensemble learning for blind source separation - Miskin, MacKay - 2001
Citation Context: ...ribution p(θ|X, H). This means that the estimates will be sensitive to regions where the probability mass is large instead of being sensitive to high values of the pdf (Lappalainen and Honkela, 2000; Miskin and MacKay, 2001). One can go even one step further if possible and reasonable (for example in supervised learning problems), and make use of the complete set of models. This means that predicted values are obtained ...

23 | Ensemble learning in Bayesian neural networks - Barber, Bishop - 1998
Citation Context: ...nal approximation method, called ensemble learning. It is applicable to unsupervised learning problems, too, and is discussed in more detail in the following. 2.2 Ensemble learning Ensemble learning (Barber and Bishop, 1998; Lappalainen and Miskin, 2000), also called variational Bayes, is a method developed recently (Hinton and van Camp, 1993; MacKay, 1995) for approximating the posterior density (2). It can be used bot...

22 | Nonlinear independent component analysis using ensemble learning: Theory - Valpola - 2000

21 | Bayesian approach for neural networks — review and case studies - Lampinen, Vehtari - 2001
Citation Context: ...he posterior density (2) numerically by drawing samples from the true posterior distribution. MCMC techniques have been successfully applied to practical supervised learning problems, for example in (Lampinen and Vehtari, 2001) by using a MLP network structure. However, MCMC methods cannot be used in large scale unsupervised learning problems, because the number of parameters grows far too large for estimating them in any ...

20 | Making sense of a complex world - Haykin, Principe - 1998
Citation Context: ...not a restriction because any dynamic model depending on older factors s(t-2), s(t-3), ... can be converted into an equivalent model of the type (11) in the way shown in Figure 2. It is well known (Haykin and Principe, 1998) that the nonlinear model (10)--(11) is not uniquely identifiable. This is because any smooth transformation of the latent space can be absorbed in the mappings f and g. If several different parameter...

19 | Fisher scoring and a mixture of modes approach for approximate inference and learning in nonlinear state space models - Briegel, Tresp - 1999
Citation Context: ... al., 2000), and on somewhat more advanced level in (Haykin, 1998). Quite recently, Bayesian techniques have been introduced for the problem of learning the mappings f and g in Eqs. (10) and (11) in (Briegel and Tresp, 1999; Ghahramani and Roweis, 1999; Roweis and Ghahramani, 2001). In (Ghahramani and Roweis, 1999; Roweis and Ghahramani, 2001), the nonlinear mappings are modeled by RBF networks, and only the linear outp...

15 | Independent factor analysis with temporally structured factors - Attias
Citation Context: ...using slightly different assumptions and methods (Choudrey et al., 2000; Miskin and MacKay, 2000; Miskin and MacKay, 2001; Roberts and Everson, 2001). The work by Attias (Attias, 1999a; Attias, 1999b; Attias, 2000), summarized in (Attias, 2001), is also closely related to ours, even though point estimates are partly used in his early work. In (Attias, 2000), this type of methods have been extended to take into...

13 | Bayesian ensemble learning for nonlinear factor analysis - Valpola - 2000

12 | Dynamical factor analysis of rhythmic magnetoencephalographic activity - Särelä, Valpola, et al. - 2001
Citation Context: ...l., 2001). In our method, latent (source) space dimensions of the order of tens is attainable by splitting the model into parts. This approach was applied to modeling magnetoencephalographic data in (Särelä et al., 2001). The emphasis in our method is on finding a good model describing well the observed data. A sufficiently good posterior approximation is important, but in large problems at least, resources are better...

10 | Unsupervised learning of nonlinear dynamic state-space models - Valpola - 2000

9 | Detecting process state changes by nonlinear blind source separation - Iline, Valpola, et al. - 2001
Citation Context: ...because it usually makes it much easier to interpret the state representation. This makes it possible for example to analyse the type of change in process change detection problem as was proposed in (Iline et al., 2001). The quality of the estimate of the underlying process was tested by studying the prediction accuracy for new samples. It should be noted that since the Lorenz processes are chaotic, the best that a...

6 | Nonlinear dynamic independent component analysis using state-space and neural network models - Cichocki, Zhang, et al.
Citation Context: ...Hinton, 2000) where variational methods resembling ensemble learning are applied to a state-space model where a hidden Markov model switches between different linear dynamical models. Cichocki et al. (Cichocki et al., 1999) have considered a nonlinear dynamic extension of the standard linear model used in ICA and BSS by applying state-space models and hyper RBF networks. However, their learning algorithms are not Bayes...

3 | graphical models and variational methods - Attias - 2001

3 | An Introduction to - Roberts - 1967
Citation Context: ...ecently, several authors have studied ensemble learning in similar problems using slightly different assumptions and methods (Choudrey et al., 2000; Miskin and MacKay, 2000; Miskin and MacKay, 2001; Roberts and Everson, 2001). The work by Attias (Attias, 1999a; Attias, 1999b; Attias, 2000), summarized in (Attias, 2001), is also closely related to ours, even though point estimates are partly used in his early work. In (At...

2 | Learning a hierarchical belief network of independent factor analyzers - Attias - 1999
Citation Context: ... learning in similar problems using slightly different assumptions and methods (Choudrey et al., 2000; Miskin and MacKay, 2000; Miskin and MacKay, 2001; Roberts and Everson, 2001). The work by Attias (Attias, 1999a; Attias, 1999b; Attias, 2000), summarized in (Attias, 2001), is also closely related to ours, even though point estimates are partly used in his early work. In (Attias, 2000), this type of methods h...

2 | Chapter 7: The unscented Kalman filter - Wan, Merwe - 2001
Citation Context: ... of these methods can be found in (Wan and Nelson, 2001) and Kalman filtering is also briefly discussed in Section 5.1. An interesting modification called the unscented Kalman filter is presented in (Wan and Merwe, 2001). In (Wan and Nelson, 2001; Wan and Merwe, 2001), the unknown state and dynamics are estimated using extended Kalman filtering while the observation mapping f is assumed to be the identity mapping. T...

1 | An ensemble learning approach to nonlinear independent component analysis - Honkela, Karhunen - 2001
Citation Context: ...7. We have successfully applied the NFA and NIFA methods to finding useful compact representations for both artificial and real-world data sets in (Lappalainen and Honkela, 2000; Valpola et al., 2000; Honkela and Karhunen, 2001). However, the NFA and NIFA methods have still the drawback that they do not take into account the possible dependences of subsequent data vectors. This means that like in standard principal or indep...

1 | Chapter 6: An EM algorithm for identification of nonlinear dynamical systems - Roweis, Ghahramani - 2001
Citation Context: ...aykin, 1998). Quite recently, Bayesian techniques have been introduced for the problem of learning the mappings f and g in Eqs. (10) and (11) in (Briegel and Tresp, 1999; Ghahramani and Roweis, 1999; Roweis and Ghahramani, 2001). In (Ghahramani and Roweis, 1999; Roweis and Ghahramani, 2001), the nonlinear mappings are modeled by RBF networks, and only the linear output layers of the RBF networks are adapted. This yields a v...

1 | Chapter 5: Dual extended Kalman filter methods - Wan, Nelson - 2001
Citation Context: ...ate is inferred given the observations and the known mappings f and g. It has also been modified for joint estimation of the model parameters and the state. A review of these methods can be found in (Wan and Nelson, 2001) and Kalman filtering is also briefly discussed in Section 5.1. An interesting modification called the unscented Kalman filter is presented in (Wan and Merwe, 2001). In (Wan and Nelson, 2001; Wan and...