Results 1  10
of
619
Dynamic Bayesian Networks: Representation, Inference and Learning
, 2002
"... Modelling sequential data is important in many areas of science and engineering. Hidden Markov models (HMMs) and Kalman filter models (KFMs) are popular for this because they are simple and flexible. For example, HMMs have been used for speech recognition and biosequence analysis, and KFMs have bee ..."
Abstract

Cited by 564 (3 self)
 Add to MetaCart
Modelling sequential data is important in many areas of science and engineering. Hidden Markov models (HMMs) and Kalman filter models (KFMs) are popular for this because they are simple and flexible. For example, HMMs have been used for speech recognition and biosequence analysis, and KFMs have been used for problems ranging from tracking planes and missiles to predicting the economy. However, HMMs
and KFMs are limited in their “expressive power”. Dynamic Bayesian Networks (DBNs) generalize HMMs by allowing the state space to be represented in factored form, instead of as a single discrete random variable. DBNs generalize KFMs by allowing arbitrary probability distributions, not just (unimodal) linearGaussian. In this thesis, I will discuss how to represent many different kinds of models as DBNs, how to perform exact and approximate inference in DBNs, and how to learn DBN models from sequential data.
In particular, the main novel technical contributions of this thesis are as follows: a way of representing
Hierarchical HMMs as DBNs, which enables inference to be done in O(T) time instead of O(T 3), where T is the length of the sequence; an exact smoothing algorithm that takes O(log T) space instead of O(T); a simple way of using the junction tree algorithm for online inference in DBNs; new complexity bounds on exact online inference in DBNs; a new deterministic approximate inference algorithm called factored frontier; an analysis of the relationship between the BK algorithm and loopy belief propagation; a way of
applying RaoBlackwellised particle filtering to DBNs in general, and the SLAM (simultaneous localization
and mapping) problem in particular; a way of extending the structural EM algorithm to DBNs; and a variety of different applications of DBNs. However, perhaps the main value of the thesis is its catholic presentation of the field of sequential data modelling.
Maximum A Posteriori Estimation for Multivariate Gaussian Mixture Observations of Markov Chains
 IEEE Transactions on Speech and Audio Processing
, 1994
"... In this paper a framework for maximum a posteriori (MAP) estimation of hidden Markov models (HMM) is presented. Three key issues of MAP estimation, namely the choice of prior distribution family, the specification of the parameters of prior densities and the evaluation of the MAP estimates, are addr ..."
Abstract

Cited by 491 (39 self)
 Add to MetaCart
In this paper a framework for maximum a posteriori (MAP) estimation of hidden Markov models (HMM) is presented. Three key issues of MAP estimation, namely the choice of prior distribution family, the specification of the parameters of prior densities and the evaluation of the MAP estimates, are addressed. Using HMMs with Gaussian mixture state observation densities as an example, it is assumed that the prior densities for the HMM parameters can be adequately represented as a product of Dirichlet and normalWishart densities. The classical maximum likelihood estimation algorithms, namely the forwardbackward algorithm and the segmental kmeans algorithm, are expanded and MAP estimation formulas are developed. Prior density estimation issues are discussed for two classes of applications: parameter smoothing and model adaptation, and some experimental results are given illustrating the practical interest of this approach. Because of its adaptive nature, Bayesian learning is shown to serve as a unified approach for a wide range of speech recognition applications
The Infinite Hidden Markov Model
 Machine Learning
, 2002
"... We show that it is possible to extend hidden Markov models to have a countably infinite number of hidden states. By using the theory of Dirichlet processes we can implicitly integrate out the infinitely many transition parameters, leaving only three hyperparameters which can be learned from data. Th ..."
Abstract

Cited by 488 (33 self)
 Add to MetaCart
We show that it is possible to extend hidden Markov models to have a countably infinite number of hidden states. By using the theory of Dirichlet processes we can implicitly integrate out the infinitely many transition parameters, leaving only three hyperparameters which can be learned from data. These three hyperparameters define a hierarchical Dirichlet process capable of capturing a rich set of transition dynamics. The three hyperparameters control the time scale of the dynamics, the sparsity of the underlying statetransition matrix, and the expected number of distinct hidden states in a finite sequence. In this framework it is also natural to allow the alphabet of emitted symbols to be infiniteconsider, for example, symbols being possible words appearing in English text.
A Unifying Review of Linear Gaussian Models
, 1999
"... Factor analysis, principal component analysis, mixtures of gaussian clusters, vector quantization, Kalman filter models, and hidden Markov models can all be unified as variations of unsupervised learning under a single basic generative model. This is achieved by collecting together disparate observa ..."
Abstract

Cited by 260 (17 self)
 Add to MetaCart
Factor analysis, principal component analysis, mixtures of gaussian clusters, vector quantization, Kalman filter models, and hidden Markov models can all be unified as variations of unsupervised learning under a single basic generative model. This is achieved by collecting together disparate observations and derivations made by many previous authors and introducing a new way of linking discrete and continuous state models using a simple nonlinearity. Through the use of other nonlinearities, we show how independent component analysis is also a variation of the same basic generative model. We show that factor analysis and mixtures of gaussians can be implemented in autoencoder neural networks and learned using squared error plus the same regularization term. We introduce a new model for static data, known as sensible principal component analysis, as well as a novel concept of spatially adaptive observation noise. We also review some of the literature involving global and local mixtures of the basic models and provide pseudocode for inference and learning for all the basic models.
Deterministic Annealing for Clustering, Compression, Classification, Regression, and Related Optimization Problems
 Proceedings of the IEEE
, 1998
"... this paper. Let us place it within the neural network perspective, and particularly that of learning. The area of neural networks has greatly benefited from its unique position at the crossroads of several diverse scientific and engineering disciplines including statistics and probability theory, ph ..."
Abstract

Cited by 248 (11 self)
 Add to MetaCart
this paper. Let us place it within the neural network perspective, and particularly that of learning. The area of neural networks has greatly benefited from its unique position at the crossroads of several diverse scientific and engineering disciplines including statistics and probability theory, physics, biology, control and signal processing, information theory, complexity theory, and psychology (see [45]). Neural networks have provided a fertile soil for the infusion (and occasionally confusion) of ideas, as well as a meeting ground for comparing viewpoints, sharing tools, and renovating approaches. It is within the illdefined boundaries of the field of neural networks that researchers in traditionally distant fields have come to the realization that they have been attacking fundamentally similar optimization problems.
A Multiscale Random Field Model for Bayesian Image Segmentation
, 1996
"... Many approaches to Bayesian image segmentation have used maximum a posteriori (MAP) estimation in conjunction with Markov random fields (MRF). While this approach performs well, it has a number of disadvantages. In particular, exact MAP estimates cannot be computed, approximate MAP estimates are com ..."
Abstract

Cited by 233 (18 self)
 Add to MetaCart
Many approaches to Bayesian image segmentation have used maximum a posteriori (MAP) estimation in conjunction with Markov random fields (MRF). While this approach performs well, it has a number of disadvantages. In particular, exact MAP estimates cannot be computed, approximate MAP estimates are computationally expensive to compute, and unsupervised parameter estimation of the MRF is difficult. In this paper, we propose a new approach to Bayesian image segmentation which directly addresses these problems. The new method replaces the MRF model with a novel multiscale random field (MSRF), and replaces the MAP estimator with a sequential MAP (SMAP) estimator derived from a novel estimation criteria. Together, the proposed estimator and model result in a segmentation algorithm which is not iterative and can be computed in time proportional to MN where M is the number of classes and N is the number of pixels. We also develop a computationally effcient method for unsupervised estimation of m...
Mel Frequency Cepstral Coefficients for Music Modeling
 In International Symposium on Music Information Retrieval
, 2000
"... We examine in some detail Mel Frequency Cepstral Coefficients (MFCCs)  the dominant features used for speech recognition  and investigate their applicability to modeling music. In particular, we examine two of the main assumptions of the process of forming MFCCs: the use of the Mel frequency scale ..."
Abstract

Cited by 226 (3 self)
 Add to MetaCart
We examine in some detail Mel Frequency Cepstral Coefficients (MFCCs)  the dominant features used for speech recognition  and investigate their applicability to modeling music. In particular, we examine two of the main assumptions of the process of forming MFCCs: the use of the Mel frequency scale to model the spectra; and the use of the Discrete Cosine Transform (DCT) to decorrelate the Melspectral vectors.
An Introduction to MCMC for Machine Learning
, 2003
"... This purpose of this introductory paper is threefold. First, it introduces the Monte Carlo method with emphasis on probabilistic machine learning. Second, it reviews the main building blocks of modern Markov chain Monte Carlo simulation, thereby providing and introduction to the remaining papers of ..."
Abstract

Cited by 222 (2 self)
 Add to MetaCart
This purpose of this introductory paper is threefold. First, it introduces the Monte Carlo method with emphasis on probabilistic machine learning. Second, it reviews the main building blocks of modern Markov chain Monte Carlo simulation, thereby providing and introduction to the remaining papers of this special issue. Lastly, it discusses new interesting research horizons.
Dialogue act modeling for automatic tagging and recognition of conversational speech
 COMPUTATIONAL LINGUISTICS
, 2000
"... We describe a statistical approach for modeling dialogue acts in conversational speech, i.e., speecactlike ..."
Abstract

Cited by 205 (14 self)
 Add to MetaCart
We describe a statistical approach for modeling dialogue acts in conversational speech, i.e., speecactlike
The Helmholtz Machine
, 1995
"... Discovering the structure inherent in a set of patterns is a fundamental aim of statistical inference or learning. One fruitful approach is to build a parameterized stochastic generative model, independent draws from which are likely to produce the patterns. For all but the simplest generative model ..."
Abstract

Cited by 194 (22 self)
 Add to MetaCart
Discovering the structure inherent in a set of patterns is a fundamental aim of statistical inference or learning. One fruitful approach is to build a parameterized stochastic generative model, independent draws from which are likely to produce the patterns. For all but the simplest generative models, each pattern can be generated in exponentially many ways. It is thus intractable to adjust the parameters to maximize the probability of the observed patterns. We describe a way of finessing this combinatorial explosion by maximizing an easily computed lower bound on the probability of the observations. Our method can be viewed as a form of hierarchical selfsupervised learning that may relate to the function of bottomup and topdown cortical processing pathways.