Results 1–10 of 14
Dynamic Bayesian Networks: Representation, Inference and Learning
, 2002
"... Modelling sequential data is important in many areas of science and engineering. Hidden Markov models (HMMs) and Kalman filter models (KFMs) are popular for this because they are simple and flexible. For example, HMMs have been used for speech recognition and biosequence analysis, and KFMs have bee ..."
Abstract

Cited by 700 (3 self)
 Add to MetaCart
Modelling sequential data is important in many areas of science and engineering. Hidden Markov models (HMMs) and Kalman filter models (KFMs) are popular for this because they are simple and flexible. For example, HMMs have been used for speech recognition and biosequence analysis, and KFMs have been used for problems ranging from tracking planes and missiles to predicting the economy. However, HMMs and KFMs are limited in their "expressive power". Dynamic Bayesian Networks (DBNs) generalize HMMs by allowing the state space to be represented in factored form, instead of as a single discrete random variable. DBNs generalize KFMs by allowing arbitrary probability distributions, not just (unimodal) linear-Gaussian. In this thesis, I will discuss how to represent many different kinds of models as DBNs, how to perform exact and approximate inference in DBNs, and how to learn DBN models from sequential data.

In particular, the main novel technical contributions of this thesis are as follows: a way of representing Hierarchical HMMs as DBNs, which enables inference to be done in O(T) time instead of O(T^3), where T is the length of the sequence; an exact smoothing algorithm that takes O(log T) space instead of O(T); a simple way of using the junction tree algorithm for online inference in DBNs; new complexity bounds on exact online inference in DBNs; a new deterministic approximate inference algorithm called factored frontier; an analysis of the relationship between the BK algorithm and loopy belief propagation; a way of applying Rao-Blackwellised particle filtering to DBNs in general, and the SLAM (simultaneous localization and mapping) problem in particular; a way of extending the structural EM algorithm to DBNs; and a variety of different applications of DBNs. However, perhaps the main value of the thesis is its catholic presentation of the field of sequential data modelling.
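For orientation (this sketch is not from the thesis; the names and array shapes are illustrative), the flat-HMM forward recursion below is the O(T) baseline that the DBN representation of hierarchical HMMs recovers, in place of the O(T^3) cost of the original HHMM inference algorithm:

```python
import numpy as np

def hmm_forward_loglik(pi, A, B, obs):
    """Scaled forward recursion for a flat HMM.

    pi  : (K,)   initial state distribution
    A   : (K, K) transitions, A[i, j] = P(s_t = j | s_{t-1} = i)
    B   : (K, V) emissions,   B[j, o] = P(o_t = o | s_t = j)
    obs : length-T sequence of integer observation symbols
    Returns log P(obs); cost is O(T * K^2), i.e. linear in T.
    """
    alpha = pi * B[:, obs[0]]
    c = alpha.sum()                          # rescale to avoid underflow
    alpha, log_lik = alpha / c, np.log(c)
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]        # propagate, then weight by emission
        c = alpha.sum()
        alpha, log_lik = alpha / c, log_lik + np.log(c)
    return log_lik
```

A DBN keeps the state factored into several variables per time slice instead of one flat K-valued state, so the same linear-in-T structure applies while K itself stays small per factor.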
Bilmes, “MVA processing of speech features”
 IEEE Trans. Audio, Speech, Lang. Process.
, 2007
"... In this paper, we investigate a technique consisting of mean subtraction, variance normalization and time sequence ltering. Unlike other techniques, it applies autoregression movingaverage (ARMA) ltering to the time sequence in the cepstral domain. We called this technique the MVA postprocessing ..."
Abstract

Cited by 38 (2 self)
 Add to MetaCart
(Show Context)
In this paper, we investigate a technique consisting of mean subtraction, variance normalization and time-sequence filtering. Unlike other techniques, it applies autoregressive moving-average (ARMA) filtering to the time sequence in the cepstral domain. We call this technique MVA post-processing, and the speech features with MVA post-processing MVA features. Overall, compared to raw features without MVA post-processing, MVA features achieve improvements of 45% on matched tasks and 65% on mismatched tasks on the Aurora 2.0 noisy speech database, and well above a 50% improvement on the Aurora 3.0 database. These improvements are comparable to systems with much more complicated techniques, even though MVA is relatively simple and requires practically no additional computational cost. In this paper, in addition to describing MVA processing, we also present a novel analysis of the distortion of mel-frequency cepstral coefficients and the log energy in the presence of different types of noise. The effectiveness of MVA is extensively investigated with respect to several variations: the configurations used to extract raw features, the domains where MVA is applied, the filters that are used, and the orders of the ARMA filters. Specifically, it is argued and demonstrated that MVA works better when applied to the zeroth cepstral coefficient than to the log energy, that MVA works better in the cepstral domain, that an ARMA filter is better than either designed FIR filters or data-driven filters, and that a five-tap ARMA filter is sufficient to achieve good performance in a variety ...
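The processing chain is simple enough to sketch in a few lines. The following is an illustrative implementation, not the paper's code; the symmetric ARMA form below (average of the previous M outputs and the current plus next M inputs) is one plausible choice of taps:

```python
import numpy as np

def mva(feats, M=2):
    """Sketch of MVA post-processing of a (T, D) matrix of cepstral
    features for one utterance. M is the ARMA filter order."""
    # 1. Mean subtraction and variance normalization, per dimension.
    z = (feats - feats.mean(axis=0)) / (feats.std(axis=0) + 1e-8)
    # 2. ARMA smoothing along time. Edge frames are left unfiltered
    #    here for simplicity; the paper's exact taps may differ.
    y = z.copy()
    T = len(z)
    for t in range(M, T - M):
        y[t] = (y[t - M:t].sum(axis=0) + z[t:t + M + 1].sum(axis=0)) / (2 * M + 1)
    return y
```

Because the filter reuses already-smoothed outputs on its autoregressive side, a small order such as M = 2 (a five-tap filter) already gives strong smoothing, which is consistent with the abstract's claim.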
Generalised linear Gaussian models
, 2001
"... This paper addresses the timeseries modelling of high dimensional data. Currently, the hidden Markov model (HMM) is the most popular and successful model especially in speech recognition. However, there are well known shortcomings in HMMs particularly in the modelling of the correlation between suc ..."
Abstract

Cited by 22 (7 self)
 Add to MetaCart
This paper addresses the time-series modelling of high-dimensional data. Currently, the hidden Markov model (HMM) is the most popular and successful model, especially in speech recognition. However, there are well-known shortcomings in HMMs, particularly in the modelling of the correlation between successive observation vectors, that is, inter-frame correlation. Standard diagonal covariance matrix HMMs also lack the modelling of the spatial correlation in the feature vectors, that is, intra-frame correlation. Several other time-series models have been proposed recently, especially in the segment model framework, to address the inter-frame correlation problem, such as Gauss-Markov and dynamical system segment models. The lack of intra-frame correlation has been compensated for with transform schemes such as semi-tied full covariance matrices (STC). All these models can be regarded as belonging to the broad class of generalised linear Gaussian models. Linear Gaussian models (LGM) are popular as many forms may be trained efficiently using the expectation maximisation algorithm. In this paper, several LGMs and generalised LGMs are reviewed. The models can be roughly categorised into four combinations according to two different state evolution and two different observation processes. The state evolution process can be based on a discrete finite state machine, such as in the HMMs, or a linear first-order Gauss-Markov process, such as in the traditional linear dynamical systems. The observation process can be represented as a factor analysis model or a linear discriminant analysis model. General HMMs and schemes proposed to improve their performance, such as STC, can be regarded as special cases in this framework.
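All four combinations share the generic linear-Gaussian state-space skeleton. In standard notation (not necessarily the paper's symbols), the linear dynamical system corner of the family is:

```latex
\begin{aligned}
  x_t &= A\,x_{t-1} + w_t, \qquad w_t \sim \mathcal{N}(0, Q) && \text{(state evolution)}\\
  y_t &= C\,x_t + v_t,     \qquad v_t \sim \mathcal{N}(0, R) && \text{(observation)}
\end{aligned}
```

Replacing the Gauss-Markov state evolution with a discrete finite state machine recovers the HMM corner of the family, while a factor-analysis observation process corresponds to a low-rank C with diagonal R.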
A comparative study of several feature transformation and learning methods for phoneme classification
 Journal of Speech Technology
, 2000
"... Abstract. This paper examines the applicability of some learning techniques for speech recognition, more precisely, for the classification of phonemes represented by a particular segment model. The methods compared were TiMBL (the IB1 algorithm), C4.5 (ID3 tree learning), OC1 (oblique tree learning) ..."
Abstract

Cited by 15 (6 self)
 Add to MetaCart
This paper examines the applicability of some learning techniques for speech recognition, more precisely, for the classification of phonemes represented by a particular segment model. The methods compared were TiMBL (the IB1 algorithm), C4.5 (ID3 tree learning), OC1 (oblique tree learning), artificial neural nets (ANN) and Gaussian mixture modeling (GMM); as a reference, an HMM recognizer was also trained on the same corpus. Before feeding them into the learners, the segmental features were additionally transformed using either linear discriminant analysis (LDA), principal component analysis (PCA) or independent component analysis (ICA). Each learner was tested with each transformation in order to find the best combination. Furthermore, we experimented with several feature sets such as filterbank energies, mel-frequency cepstral coefficients (MFCC) and gravity centers. We found LDA helped all the learners, in several cases quite considerably. PCA was beneficial only for some of the algorithms, while ICA improved the results quite rarely and was bad for certain learning methods. From the learning viewpoint, ANN was the most effective and attained the same results independently of the transformation applied. GMM behaved worse, which shows the advantages of discriminative over generative learning. TiMBL produced reasonable results, while C4.5 and OC1 could not compete, no matter what transformation was tried.
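The transform-then-classify protocol the paper follows maps directly onto off-the-shelf tooling. A minimal modern sketch (the dataset, component counts and classifier hyperparameters are placeholders, not the paper's setup):

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA, FastICA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score

def compare_transforms(X, y, n_components=10):
    """X: (N, D) segmental feature vectors, y: (N,) phoneme labels.
    Note: LDA requires n_components <= n_classes - 1."""
    transforms = {
        "LDA": LinearDiscriminantAnalysis(n_components=n_components),
        "PCA": PCA(n_components=n_components),
        "ICA": FastICA(n_components=n_components, max_iter=1000),
    }
    scores = {}
    for name, tf in transforms.items():
        clf = make_pipeline(StandardScaler(), tf,
                            MLPClassifier(hidden_layer_sizes=(100,), max_iter=500))
        scores[name] = cross_val_score(clf, X, y, cv=5).mean()
    return scores
```

Swapping the MLP for a Gaussian mixture or tree learner reproduces the paper's grid of learner-by-transform combinations.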
Statistical Modelling in Continuous Speech Recognition (CSR)
 In Conference on Uncertainty in Artificial Intelligence
, 2001
"... Automatic continuous speech recognition (CSR) is sufficiently ..."
Abstract

Cited by 11 (1 self)
 Add to MetaCart
Automatic continuous speech recognition (CSR) is sufficiently ...
Switching dynamic system models for speech articulation and acoustics
 In Proceedings of the IMA Workshop
, 2000
"... Abstract. A statistical generative model for the speech process is described that embeds a substantially richer structure than the HMM currently in predominant use for automatic speech recognition. This switching dynamicsystem model generalizes and integrates the HMM and the piecewise stationary n ..."
Abstract

Cited by 9 (5 self)
 Add to MetaCart
(Show Context)
A statistical generative model for the speech process is described that embeds a substantially richer structure than the HMM currently in predominant use for automatic speech recognition. This switching dynamic-system model generalizes and integrates the HMM and the piecewise stationary nonlinear dynamic system (state-space) model. Depending on the level and the nature of the switching in the model design, various key properties of the speech dynamics can be naturally represented in the model. Such properties include the temporal structure of the speech acoustics, its causal articulatory movements, and the control of such movements by the multidimensional targets correlated with the phonological (symbolic) units of speech in terms of overlapping articulatory features. One main challenge of using this multilevel switching dynamic-system model for successful speech recognition is the computationally intractable inference (decoding with confidence measure) on the posterior probabilities of the hidden states. This leads to computationally intractable optimal parameter learning (training) also. Several versions of Bayes nets have been devised with detailed dependency implementation specified to ...
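In generic form (standard switching state-space notation, not necessarily the authors' exact parameterization), such a model pairs a discrete Markov switch s_t with continuous hidden dynamics:

```latex
\begin{aligned}
  s_t &\sim P(s_t \mid s_{t-1}) && \text{(discrete regime, as in an HMM)}\\
  x_t &= A_{s_t} x_{t-1} + u_{s_t} + w_t, \quad w_t \sim \mathcal{N}(0, Q_{s_t}) && \text{(hidden articulatory dynamics)}\\
  y_t &= C\,x_t + v_t, \quad v_t \sim \mathcal{N}(0, R) && \text{(acoustic observation)}
\end{aligned}
```

Exact posterior inference must track a Gaussian mixture whose size grows with the number of switch paths, exponentially in the sequence length, which is the intractability the abstract refers to.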
Segmental Modeling Using a Continuous Mixture of Nonparametric Models
 IEEE Trans. on Speech and Audio Processing
, 1997
"... The aim of the research described in this paper is to overcome the modeling limitation of conventional hidden Markov models. We present a segmental model that consists of two elements. The first is a nonparametric representation of both the mean and variance trajectories, which describes the local d ..."
Abstract

Cited by 8 (1 self)
 Add to MetaCart
(Show Context)
The aim of the research described in this paper is to overcome the modeling limitation of conventional hidden Markov models. We present a segmental model that consists of two elements. The first is a nonparametric representation of both the mean and variance trajectories, which describes the local dynamics. The second element is some parameterized transformation (e.g., random shift) of the trajectory that is global to the segment and models long-term variations such as speaker identity.

Introduction: Speech sounds are produced by a time-varying dynamic system. Consequently, speech signals are highly correlated and nonstationary. In spite of this fact, in most implementations of hidden Markov models (HMMs) for speech recognition, the assumption that successive observations in a state are independent and identically distributed is inherent to the model. These limitations of the HMM are due to the fact that the HMM is a frame-based approach. An alternative approach is segmental modeling, w...
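Schematically, a segment model of this two-element kind can be written as (illustrative notation, not the paper's):

```latex
y_t = \mu_t + b + e_t, \qquad
b \sim \mathcal{N}(0, \Sigma_b), \quad
e_t \sim \mathcal{N}(0, \Sigma_t), \qquad t = 1, \dots, N,
```

where the nonparametric trajectories \mu_t and \Sigma_t carry the local dynamics within the length-N segment, and the shift b is drawn once per segment, so it models variation (such as speaker identity) that is global to the segment rather than per frame.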
A Maximumentropy Solution to the Framedependency Problem in Speech Recognition
, 2001
"... The HMM assumption of conditional independence of observations causes a variety of problems for speechrecognition applications. Previous attempts to construct acoustic models that remove this assumption have suffered from a significant increase in the number of parameters to train. Another weakness ..."
Abstract

Cited by 5 (0 self)
 Add to MetaCart
(Show Context)
The HMM assumption of conditional independence of observations causes a variety of problems for speech-recognition applications. Previous attempts to construct acoustic models that remove this assumption have suffered from a significant increase in the number of parameters to train. Another weakness of current acoustic models is that they do not account for the origin of derived features (estimated derivatives). We show how to both remove the independence assumption and properly account for derived features, with little or no increase in the number of parameters to train, by applying the principle of maximum entropy. We also show that ignoring the origins of derived features in training HMM acoustic models can lead to severe distortions of the effective language model. Evaluation of our maxent model on a simple problem cuts an already-low error rate in half compared to an equivalent HMM with the same number of parameters.
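The maximum-entropy construction referred to has the standard exponential form (generic notation; the feature functions f_i are what would encode the cross-frame dependencies and the derived-feature origins):

```latex
p_\Lambda(y \mid x) = \frac{1}{Z_\Lambda(x)} \exp\!\Big(\sum_i \lambda_i f_i(x, y)\Big),
\qquad
Z_\Lambda(x) = \sum_{y'} \exp\!\Big(\sum_i \lambda_i f_i(x, y')\Big)
```

Maximizing entropy subject to matching the empirical expectations of the f_i yields exactly this exponential family, with one weight \lambda_i per constraint rather than per state-conditional density, which is how the parameter count is kept down.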
Scaled random trajectory segment models
 Computer Speech and Language
, 1998
"... Speech recognition systems that are based on hidden Markov modeling (HMM), assume that the mean trajectory feature vector within a state is constant over time. In recent years, segment models that attempt to describe the dynamics of the speech signal within a phonetic unit, have been proposed. Some ..."
Abstract

Cited by 2 (1 self)
 Add to MetaCart
Speech recognition systems that are based on hidden Markov modeling (HMM) assume that the mean trajectory feature vector within a state is constant over time. In recent years, segment models that attempt to describe the dynamics of the speech signal within a phonetic unit have been proposed. Some of these models describe the mean trajectory over time as a random process. In this paper we present the concept of a scaled random trajectory segment model, which aims to overcome the modeling problem created by the fact that segment realizations of the same phonetic unit differ in length. The new model is supported by direct experimental evidence. It offers the following advantages over the standard (non-scaled) model. First, it shows improved performance compared to the non-scaled model. This is demonstrated using phone classification experiments. Second, it yields closed-form expressions for the estimated parameters, unlike the previously suggested non-scaled model, which requires more complicated iterative estimation procedures.
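For orientation, a generic random trajectory segment model (illustrative, not necessarily the authors' exact formulation) treats a length-N segment as

```latex
y_t = z_t^{\top}\beta + e_t, \qquad
\beta \sim \mathcal{N}(\mu_\beta, \Sigma_\beta), \quad
e_t \sim \mathcal{N}(0, \Sigma_e), \qquad t = 1, \dots, N,
```

where \beta is a trajectory parameter drawn once per segment and z_t is a time-dependent design vector. One natural reading of the "scaled" variant is to evaluate z_t at the normalized time t/N, so that realizations of different durations share a single trajectory shape, which is the length-mismatch problem the abstract names.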
Clustering Wide-Contexts and HMM Topologies for Spontaneous Speech Recognition
, 2001
"... In most speech recognition systems today, all the acoustic variation associated with a phoneme is characterized in terms of the identity of its neighboring phonemes. The neighbors influence only the state observation density of a fixed Hidden Markov Model. Other sources of variation are captured imp ..."
Abstract

Cited by 1 (0 self)
 Add to MetaCart
In most speech recognition systems today, all the acoustic variation associated with a phoneme is characterized in terms of the identity of its neighboring phonemes. The neighbors influence only the state observation density of a fixed Hidden Markov Model. Other sources of variation are captured implicitly by using Gaussian mixture models for the state observations. Consequently, these models can be very broad, particularly for casual spontaneous speech. In this thesis, we explore conditioning of phonemes on higher-level linguistic structure, specifically syllable- and word-level structure, to learn models for phonemes that are more specific to the context, reporting experimental results on a large-vocabulary (35k words) conversational speech task (Switchboard). In particular, this thesis makes three main contributions related to wide-context conditioning.

First, we demonstrate that syllable- and word-level structure can be incorporated into current acoustic models to improve recognition accuracy over triphones. For a fixed number of parameters, these models are computationally more efficient than pentaphones, both in training and in testing. In addition, use of syllable and word features leads to a small but significant improvement in performance. The wide contexts used in our acoustic model can implicitly capture resyllabification effects to a certain extent. However, we find that explicitly modeling resyllabification does not improve recognition further, because there are only a small number of phones that exhibit acoustic differences after resyllabification.

The second contribution addresses the difficulties that arise when a large number of additional conditioning features are used. As the number of conditioning features increases, the training cost can increase exponentially. Moreover, a large fraction of the training labels tends to have too few examples to have reliable statistics associated with them, and this could potentially cause decision trees to learn bad clusters. A new method has been developed for clustering with multiple stages, where each stage clusters a different subset of features and also has a choice of using the partitions learned in the previous stages. Apart from reducing the risk of unreliable statistics, it is designed to ameliorate the data-fragmentation problem and is computationally less expensive. This method was successfully demonstrated with pentaphones, resulting in equivalent performance at a lower cost.

Finally, a new algorithm is described to design context-specific HMMs. The idea is to model reduction of a phone for certain contexts and to learn a more constrained topology. Using contextual information, the algorithm clusters HMM paths where each path has a different number of states. An HMM distance measure has been formulated to prune out the paths which are similar. During decoding, the paths are allocated dynamically for each subword unit according to their context. We investigated this algorithm to model phone topologies, finding improved characterization of speech given known word sequences but no significant improvement in word error rate.
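Decision-tree state clustering of the kind extended here is conventionally driven by a pooled-Gaussian log-likelihood score. A minimal sketch of that standard split criterion (the generic single-stage version, not the thesis's multi-stage variant):

```python
import numpy as np

def pooled_loglik(n, var):
    """Log-likelihood of n frames pooled into one diagonal-covariance
    Gaussian with per-dimension variance `var` (the mean term cancels)."""
    d = len(var)
    return -0.5 * n * (d * np.log(2 * np.pi) + np.log(var).sum() + d)

def split_gain(parent, yes, no):
    """Likelihood gain of splitting a state cluster by a phonetic question.
    Each argument is an (n, var) pair of sufficient statistics for the
    cluster; greedy tree building picks the question maximizing this gain."""
    return pooled_loglik(*yes) + pooled_loglik(*no) - pooled_loglik(*parent)
```

A multi-stage scheme of the kind described would run such greedy splitting repeatedly, restricting each stage's questions to a different feature subset and optionally seeding a stage with the partitions from the previous one.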