Results 1–10 of 16
Dynamic Bayesian Networks: Representation, Inference and Learning
, 2002
Abstract

Cited by 770 (3 self)
Modelling sequential data is important in many areas of science and engineering. Hidden Markov models (HMMs) and Kalman filter models (KFMs) are popular for this because they are simple and flexible. For example, HMMs have been used for speech recognition and biosequence analysis, and KFMs have been used for problems ranging from tracking planes and missiles to predicting the economy. However, HMMs
and KFMs are limited in their “expressive power”. Dynamic Bayesian Networks (DBNs) generalize HMMs by allowing the state space to be represented in factored form, instead of as a single discrete random variable. DBNs generalize KFMs by allowing arbitrary probability distributions, not just (unimodal) linear-Gaussian. In this thesis, I will discuss how to represent many different kinds of models as DBNs, how to perform exact and approximate inference in DBNs, and how to learn DBN models from sequential data.
In particular, the main novel technical contributions of this thesis are as follows: a way of representing
Hierarchical HMMs as DBNs, which enables inference to be done in O(T) time instead of O(T^3), where T is the length of the sequence; an exact smoothing algorithm that takes O(log T) space instead of O(T); a simple way of using the junction tree algorithm for online inference in DBNs; new complexity bounds on exact online inference in DBNs; a new deterministic approximate inference algorithm called factored frontier; an analysis of the relationship between the BK algorithm and loopy belief propagation; a way of applying Rao-Blackwellised particle filtering to DBNs in general, and the SLAM (simultaneous localization and mapping) problem in particular; a way of extending the structural EM algorithm to DBNs; and a variety of different applications of DBNs. However, perhaps the main value of the thesis is its catholic presentation of the field of sequential data modelling.
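For background on the models this thesis generalizes, the forward recursion of a plain discrete-state HMM can be sketched as follows; a DBN replaces the single state variable with a factored set of variables, which is what enables the representational savings described above. This is a minimal sketch with invented parameters, not code from the thesis.

```python
import numpy as np

# Toy forward recursion for a 2-state discrete HMM -- the model class that
# DBNs generalize by factoring the state.  All parameter values are invented.
A = np.array([[0.7, 0.3],        # transition matrix P(z_t | z_{t-1})
              [0.4, 0.6]])
B = np.array([[0.9, 0.1],        # emission matrix P(x_t | z_t)
              [0.2, 0.8]])
pi = np.array([0.5, 0.5])        # initial state distribution

def forward(obs):
    """Return P(x_1..x_T) by the O(T * K^2) forward recursion."""
    alpha = pi * B[:, obs[0]]
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
    return alpha.sum()

print(forward([0, 1, 1, 0]))
```

With K factored state variables, the single transition matrix A above becomes a product of smaller conditional distributions, which is the representational gain a DBN offers over a flat HMM.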
The graphical models toolkit: An open source software system for speech and time-series processing
 In Proceedings of IEEE Int. Conf. Acoust., Speech, and Signal Processing
, 2002
Abstract

Cited by 124 (30 self)
This paper describes the Graphical Models Toolkit (GMTK), an open source, publicly available toolkit for developing graphical-model-based speech recognition and general time series systems. Graphical models are a flexible, concise, and expressive probabilistic modeling framework with which one may rapidly specify a vast collection of statistical models. This paper begins with a brief description of the representational and computational aspects of the framework. Following that is a detailed description of GMTK’s features, including a language for specifying structures and probability distributions, logarithmic-space exact training and decoding procedures, the concept of switching parents, and a generalized EM training method which allows arbitrary sub-Gaussian parameter tying. Taken together, these features endow GMTK with a degree of expressiveness and functionality that significantly complements other publicly available packages. GMTK was recently used in the 2001 Johns Hopkins Summer Workshop, and experimental results are described in detail both herein and in a companion paper.
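The "switching parents" feature mentioned above can be illustrated by its underlying semantics: a variable's conditional distribution selects among different parent sets according to the value of a switch variable. This sketch does not use GMTK's actual specification language, and the CPTs are invented.

```python
# Underlying semantics of a "switching parent": the conditional distribution
# of X reads a different parent depending on the switch variable S.
# All CPT values below are invented for illustration.
P_X_GIVEN_A = {0: [0.9, 0.1],    # P(X | A=a), active when S == 0
               1: [0.3, 0.7]}
P_X_GIVEN_B = {0: [0.6, 0.4],    # P(X | B=b), active when S == 1
               1: [0.2, 0.8]}

def p_x(x, s, a, b):
    """P(X=x | S=s, A=a, B=b): S decides which parent X actually reads."""
    table = P_X_GIVEN_A[a] if s == 0 else P_X_GIVEN_B[b]
    return table[x]

print(p_x(1, 0, 0, 1))    # when S == 0, only A matters and B is ignored
```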
Lattice-Based Unsupervised MLLR For Speaker Adaptation
 Proc. ISCA ITRW ASR2000
, 2000
Abstract

Cited by 17 (1 self)
In this paper we explore the use of lattice-based information for unsupervised speaker adaptation. As initially formulated, maximum likelihood linear regression (MLLR) aims to linearly transform the means of the Gaussian models in order to maximize the likelihood of the adaptation data given the correct hypothesis (supervised MLLR) or the decoded hypothesis (unsupervised MLLR). For the latter, if the first-pass decoded hypothesis is extremely erroneous (as is the case for large vocabulary telephony applications) MLLR will often find a transform that increases the likelihood for the incorrect models, and may even lower the likelihood of the correct hypothesis. Since the oracle word error rate of a lattice is much lower than that of the 1-best or N-best hypotheses, by performing adaptation against a word lattice, the correct models are more likely to be used in estimating the transform. Furthermore, the particular MAP lattice that we propose enables the use of a natural confidence measure ...
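To make the MLLR mean transform concrete, here is a minimal numerical sketch for the special case of one regression class and identity covariances, where the maximum-likelihood estimate of the affine transform W (applied to extended mean vectors ξ = [1, μ]) reduces to a least-squares problem. The model means, the "true" speaker transform, and the adaptation frames below are all invented.

```python
import numpy as np

# Least-squares estimate of the MLLR mean transform W for one regression
# class with identity covariances: x_t ~ W @ [1, mu_{m(t)}].
rng = np.random.default_rng(0)
means = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 3.0]])   # SI model means
true_W = np.array([[0.5, 1.1, 0.0],                       # d x (d+1): [b | A]
                   [-0.3, 0.0, 0.9]])

X, Xi = [], []
for mu in means:
    xi = np.concatenate(([1.0], mu))       # extended mean vector [1, mu]
    for _ in range(200):                   # simulated adaptation frames
        X.append(true_W @ xi + 0.01 * rng.standard_normal(2))
        Xi.append(xi)
X, Xi = np.array(X), np.array(Xi)

# Normal equations of min_W sum_t ||x_t - W xi_t||^2
W = np.linalg.solve(Xi.T @ Xi, Xi.T @ X).T
print(np.round(W, 2))
```

In the full algorithm the frames are weighted by state occupancy posteriors and the covariances enter the accumulators; the sketch drops both to keep the closed form visible.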
Techniques for Modelling Phonological Processes in Automatic Speech Recognition
, 2001
Abstract

Cited by 7 (0 self)
Declaration This dissertation is the result of my own work and includes nothing which is the outcome of work done in collaboration, except where stated. It has not been submitted in whole or part for a degree at any other university. The length of this thesis including footnotes and appendices does not exceed 29,500 words and includes no more than 40 figures. Systems which automatically transcribe carefully dictated speech are now commercially available, but their performance degrades dramatically when the speaking style of users becomes more relaxed or conversational. This dissertation focuses on techniques that aim to improve the robustness of statistical speech transcription systems to conversational speaking styles. The dissertation shows first that the performance degradation occurring as speech becomes more conversational is severe and is partially attributable to differences in the acoustic realizations of sentences. Hypothesizing that the quantifiably wider range of ...
Clustering Wide-Contexts and HMM Topologies for Spontaneous Speech Recognition
, 2001
Abstract

Cited by 1 (0 self)
In most speech recognition systems today, all the acoustic variation associated with a phoneme is characterized in terms of the identity of its neighboring phonemes. The neighbors influence only the state observation density of a fixed Hidden Markov Model. Other sources of variation are captured implicitly by using Gaussian mixture models for the state observations. Consequently, these models can be very broad, particularly for casual spontaneous speech. In this thesis, we explore conditioning of phonemes on higher-level linguistic structure, specifically syllable- and word-level structure, to learn models for phonemes that are more specific to the context, reporting experimental results on a large vocabulary (35k words) conversational speech task (Switchboard). In particular, this thesis makes three main contributions related to wide-context conditioning. First, we demonstrate that syllable- and word-level structure can be incorporated into current acoustic models to improve recognition accuracy over triphones. For a fixed number of parameters, these models are computationally more efficient than pentaphones, both in training and in testing. In addition, use of syllable and word features leads to a small but significant improvement in performance. The wide contexts used in our acoustic model can implicitly capture resyllabification effects to a certain extent. However, we find that explicitly modeling resyllabification does not improve recognition further, because there are only a small number of phones that exhibit acoustic differences after resyllabification. The second contribution addresses the difficulties that arise when a large number of additional conditioning features are used. As the number of conditioning features increases, the training cost can increase exponentially.
Moreover, a large fraction of the training labels tends to have too few examples to have reliable statistics associated with them, which could potentially cause decision trees to learn bad clusters. A new method has been developed for clustering in multiple stages, where each stage clusters a different subset of features and can also reuse the partitions learned in previous stages. Apart from reducing the risk of unreliable statistics, it is designed to ameliorate the data fragmentation problem and is computationally less expensive. This method was successfully demonstrated with pentaphones, resulting in equivalent performance at a lower cost. Finally, a new algorithm is described to design context-specific HMMs. The idea is to model the reduction of a phone in certain contexts, and to learn a more constrained topology. Using contextual information, the algorithm clusters HMM paths, where each path has a different number of states. An HMM distance measure has been formulated to prune out paths which are similar. During decoding, the paths are allocated dynamically for each subword unit according to their context. We investigated this algorithm to model phone topologies, finding improved characterization of speech given known word sequences but no significant improvement in word error rate.
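The decision-tree clustering underlying this kind of context conditioning can be sketched as choosing, among candidate questions about the context, the split that maximizes the gain in Gaussian log-likelihood of the pooled frames. The context labels, questions, and data below are invented for illustration, not taken from the thesis.

```python
import numpy as np

# Likelihood-gain question selection, as in decision-tree state clustering:
# pick the context question whose yes/no split most increases the total
# Gaussian log-likelihood over modelling the pooled data with one Gaussian.
def gauss_loglik(x):
    """Log-likelihood of x under its own ML diagonal Gaussian."""
    n, d = x.shape
    var = x.var(axis=0) + 1e-6
    return -0.5 * n * (np.log(2 * np.pi * var).sum() + d)

rng = np.random.default_rng(1)
data = {                                    # frames pooled by left context
    "l=m": rng.normal(0.0, 1.0, (100, 3)),  # nasal contexts: similar acoustics
    "l=n": rng.normal(0.2, 1.0, (100, 3)),
    "l=s": rng.normal(3.0, 1.0, (100, 3)),  # fricative context: distinct
}
questions = [{"l=m", "l=n"}, {"l=m", "l=s"}, {"l=n", "l=s"}]

base = gauss_loglik(np.vstack(list(data.values())))
gains = {}
for q in questions:
    yes = np.vstack([v for k, v in data.items() if k in q])
    no = np.vstack([v for k, v in data.items() if k not in q])
    gains[frozenset(q)] = gauss_loglik(yes) + gauss_loglik(no) - base

best = max(gains, key=gains.get)
print(sorted(best))                         # question chosen for the split
```

The data-fragmentation problem the thesis targets shows up here when many rare labels each contribute only a handful of frames, making `gauss_loglik` unreliable; staged clustering over feature subsets keeps each split's statistics better populated.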
A TIME-SYNCHRONOUS PHONETIC DECODER FOR A LONG-CONTEXTUAL-SPAN HIDDEN TRAJECTORY MODEL
Abstract

Cited by 1 (1 self)
A novel time-synchronous decoder, designed specifically for a Hidden Trajectory Model (HTM) whose likelihood score computation depends on long-span phonetic contexts, is presented. HTM is a recently developed acoustic model that aims to capture the underlying dynamic structure of speech coarticulation and reduction using a compact set of parameters. The long-span nature of the HTM has posed a great technical challenge for developing efficient search algorithms for full evaluation of the model. Taking on this challenge, the decoding algorithm is developed to deal effectively with the exponentially increased search space through HTM-specific techniques for hypothesis representation, word-ending recombination, and hypothesis pruning. Experimental results obtained on the TIMIT phonetic recognition task are reported, extending our earlier HTM evaluation paradigms based on N-best and A* lattice rescoring.
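The generic skeleton of a time-synchronous decoder with per-state recombination and beam pruning can be sketched on a toy HMM; a hidden-trajectory decoder would additionally key each hypothesis on its long-span phonetic context, which is what blows up the search space. All log-probabilities below are invented.

```python
import math

# Skeleton of a time-synchronous Viterbi pass with recombination and beam
# pruning, on a toy 3-state model with invented log scores.
logA = [[-0.4, -1.1, -3.0],      # log transition scores
        [-2.3, -0.5, -1.2],
        [-3.0, -2.3, -0.2]]
logB = [[-0.1, -2.3],            # log emission scores for 2 symbols
        [-1.6, -0.7],
        [-2.3, -0.1]]
obs = [0, 0, 1, 1, 1]
beam = 4.0

hyps = {0: 0.0}                  # active hypotheses: state -> best log score
for o in obs:
    new = {}
    for s, sc in hyps.items():
        for t in range(3):
            cand = sc + logA[s][t] + logB[t][o]
            if cand > new.get(t, -math.inf):
                new[t] = cand    # recombination: keep best score per state
    cutoff = max(new.values()) - beam
    hyps = {s: sc for s, sc in new.items() if sc >= cutoff}   # beam pruning

print(round(max(hyps.values()), 6))
```

With long-span context, two paths reaching the same state are no longer interchangeable, so the recombination key must include the context history, and the hypothesis set grows accordingly; that is the pressure the HTM-specific techniques above are addressing.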
Discriminatively Structured Graphical Models for Speech Recognition
The Graphical Models Team
, 2001
Abstract
In recent years there has been growing interest in discriminative parameter training techniques, resulting from notable improvements in speech recognition performance on tasks ranging in size from digit recognition to Switchboard. Typified by Maximum Mutual Information (MMI) or Minimum Classification Error (MCE) training, these methods assume a fixed statistical modeling structure, and then optimize only the associated numerical parameters (such as means, variances, and transition matrices). Such is also the state of typical structure learning and model selection procedures in statistics, where the goal is to determine the structure (edges and nodes) of a graphical model (and thereby the set of conditional independence statements) that best describes the data. This report describes the process and results from the 2001 Johns Hopkins summer workshop on graphical models. Specifically, in this report we explore the novel and significantly different methodology of discriminative structure learning. Here, the fundamental dependency relationships between random variables in a probabilistic model are ...
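The MMI criterion mentioned above scores the correct word sequence against the total score of all candidate transcriptions. A tiny numerical illustration, with invented candidate scores:

```python
import math

# MMI objective for one utterance: log joint score of the correct word
# sequence minus the log total score of all candidates.  The candidates
# and their joint scores p(X | W) * P(W) below are invented.
scores = {"yes": 1e-4, "no": 4e-4, "maybe": 1e-5}
correct = "no"

mmi = math.log(scores[correct]) - math.log(sum(scores.values()))
print(round(mmi, 4))
```

The objective is 0 only when the correct sequence carries all the probability mass, so raising it forces the model to discriminate against the competitors, not just fit the correct one.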
ISCA Archive
LATTICE-BASED UNSUPERVISED MLLR FOR SPEAKER ADAPTATION
Abstract
In this paper we explore the use of lattice-based information for unsupervised speaker adaptation. As initially formulated, maximum likelihood linear regression (MLLR) aims to linearly transform the means of the Gaussian models in order to maximize the likelihood of the adaptation data given the correct hypothesis (supervised MLLR) or the decoded hypothesis (unsupervised MLLR). For the latter, if the first-pass decoded hypothesis is extremely erroneous (as is the case for large vocabulary telephony applications) MLLR will often find a transform that increases the likelihood for the incorrect models, and may even lower the likelihood of the correct hypothesis. Since the oracle word error rate of a lattice is much lower than that of the 1-best or N-best hypotheses, by performing adaptation against a word lattice, the correct models are more likely to be used in estimating the transform. Furthermore, the particular MAP lattice that we propose enables the use of a natural confidence measure given by the posterior occupancy probability of a state; that is, the statistics of a particular state will be updated with the current frame only if the a posteriori probability of the state at that particular time is greater than a predefined threshold. Experiments performed on a voicemail speech recognition task indicate a relative 2% improvement in the word error rate of lattice MLLR over 1-best MLLR.
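The confidence gating described above (accumulating a state's statistics only at frames where its posterior occupancy exceeds a threshold) can be sketched as follows; the posteriors and frame values are invented for illustration.

```python
import numpy as np

# Confidence-gated accumulation of adaptation statistics: a state's soft
# counts are updated with a frame only when its posterior occupancy gamma
# exceeds a threshold (as would be read off a word lattice).
threshold = 0.7
frames = np.array([[1.0], [2.0], [3.0], [4.0]])
gamma = np.array([[0.90, 0.10],   # gamma[t, s]: occupancy of state s at t
                  [0.50, 0.50],   # frame 2 is ambiguous: gated out entirely
                  [0.20, 0.80],
                  [0.95, 0.05]])

mask = gamma >= threshold               # gate out low-confidence frames
occ = (gamma * mask).sum(axis=0)        # gated soft counts per state
first = (gamma * mask).T @ frames       # gated first-order statistics
means = first / occ[:, None]            # per-state mean estimates
print(means.ravel())
```

Frames whose posterior mass is spread across states contribute to no state at all, so a badly decoded region of the signal cannot pull the transform toward the wrong models.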
Discriminatively Structured Graphical Models for Speech Recognition
The Graphical Models Team
Abstract
In recent years there has been growing interest in discriminative parameter training techniques, resulting from notable improvements in speech recognition performance on tasks ranging in size from digit recognition to Switchboard. Typified by Maximum Mutual Information (MMI) or Minimum Classification Error (MCE) training, these methods assume a fixed statistical modeling structure, and then optimize only the associated numerical parameters (such as means, variances, and transition matrices). Such is also the state of typical structure learning and model selection procedures in statistics, where the goal is to determine the structure (edges and nodes) of a graphical model (and thereby the set of conditional independence statements) that best describes the data. This report describes the process and results from the 2001 Johns Hopkins summer workshop on graphical models. Specifically, in this report we explore the novel and significantly different methodology of discriminative structure learning. Here, the fundamental dependency relationships between random variables in a probabilistic model are learned in a discriminative fashion, and are learned separately and in isolation from the numerical parameters. The resulting independence properties of the model might in fact be wrong with respect to the true model, but are made only for the sake of optimizing classification performance. In order to apply the principles of structural discriminability, we adopt the framework of graphical models, which allows an arbitrary set of random variables and their ...