Precision matrix modelling for large vocabulary continuous speech recognition
, 2004
Cited by 12 (5 self)
Recently, structured precision matrix models were found to outperform the conventional diagonal covariance matrix models. Minimum phone error discriminative training of these models gave very good unadapted performance on large vocabulary continuous speech recognition systems. To obtain state-of-the-art performance, it is important to apply adaptation techniques efficiently to these models. In this paper, simple row-by-row iterative formulae are described for both MLLR mean and constrained MLLR transform estimation for these models. These update formulae are derived within the standard expectation maximisation framework and are guaranteed to increase the likelihood of the adaptation data. Efficient approximate schemes for these adaptation methods are also investigated to further reduce the computation. Experimental results are presented based on the MPE trained Subspace for Precision and Mean models, evaluated on both broadcast news and conversational telephone speech English tasks.
Rao-Blackwellised Gibbs Sampling for Switching Linear Dynamical Systems
- In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2004)
, 2004
Cited by 10 (2 self)
This paper describes the application of Rao-Blackwellised Gibbs sampling (RBGS) to speech recognition using switching linear dynamical systems (SLDSs). The SLDS is a hybrid of standard hidden Markov models (HMMs) and linear dynamical systems. It is an extension of the stochastic segment model as it relaxes the assumption of independent segments. SLDSs explicitly take into account the strong co-articulation present in speech. Unfortunately, inference in SLDS is intractable unless the discrete state sequence is known. RBGS is one approach that may be applied for both improved training and decoding for this form of intractable model. The theory of SLDS and RBGS is described, along with an efficient proposal mechanism. The performance of the SLDS using RBGS for training and inference is evaluated on the ARPA Resource Management task.
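To make the hybrid structure concrete, the generative side of an SLDS can be sketched as below: a Markov chain picks a discrete regime, and that regime selects the linear dynamics applied to a continuous state that persists across regime switches. This is only an illustrative sketch under stated assumptions, not the paper's model or its RBGS procedure: the scalar state, the function name `sample_slds`, and all parameter values are hypothetical.

```python
import random


def sample_slds(num_frames, trans, dynamics, seed=0):
    """Sample from a toy scalar switching linear dynamical system.

    trans[i][j]  : probability of moving from discrete state i to j (assumed values)
    dynamics[i]  : (a, q_std, c, r_std) for discrete state i, i.e.
                   x_t = a * x_{t-1} + N(0, q_std^2),  y_t = c * x_t + N(0, r_std^2)
    """
    rng = random.Random(seed)
    state = 0      # initial discrete state (assumption)
    x = 0.0        # initial continuous state (assumption)
    states, observations = [], []
    for _ in range(num_frames):
        # HMM part: draw the next discrete state from the current row of trans.
        state = rng.choices(range(len(trans)), weights=trans[state])[0]
        a, q_std, c, r_std = dynamics[state]
        # LDS part: the continuous state is carried across discrete switches,
        # so neighbouring segments are not independent.
        x = a * x + rng.gauss(0.0, q_std)
        y = c * x + rng.gauss(0.0, r_std)
        states.append(state)
        observations.append(y)
    return states, observations


# Hypothetical two-regime example.
trans = [[0.9, 0.1], [0.2, 0.8]]
dynamics = [(0.95, 0.1, 1.0, 0.05), (0.50, 0.3, 2.0, 0.05)]
states, observations = sample_slds(20, trans, dynamics, seed=1)
```

Given `states`, the model reduces to an ordinary time-varying linear Gaussian system, which is why the abstract notes that inference is only tractable when the discrete state sequence is known.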
Linear Gaussian models for speech recognition
- CAMBRIDGE UNIVERSITY
, 2004
Cited by 10 (0 self)
Currently the most popular acoustic model for speech recognition is the hidden Markov model (HMM). However, HMMs are based on a series of assumptions some of which are known to be poor. In particular, the assumption that successive speech frames are conditionally independent given the discrete state that generated them is not a good assumption for speech recognition. State space models may be used to address some shortcomings of this assumption. State space models are based on a continuous state vector evolving through time according to a state evo-
Basis Superposition Precision Matrix Modelling For Large Vocabulary Continuous Speech Recognition
- in Proc. ICASSP
, 2004
Cited by 8 (5 self)
An important aspect of using Gaussian mixture models in HMM-based speech recognition systems is the form of the covariance matrix. One successful approach has been to model the inverse covariance (precision) matrix by superimposing multiple bases. This paper presents a general framework of basis superposition. Models are described in terms of parameter tying of the basis coefficients and restrictions on the number of bases. Two forms of parameter tying are described which provide a compact model structure. The first constrains the basis coefficients over multiple basis vectors (or matrices). This is related to the subspace for precision and mean (SPAM) model. The second constrains the basis coefficients over multiple components, yielding as one example heteroscedastic LDA (HLDA). Both maximum likelihood and minimum phone error training of these models are discussed. The performance of various configurations is examined on a conversational telephone speech task, SwitchBoard.
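The superposition idea, where each component's precision matrix is a weighted sum of bases shared across components, can be sketched as follows. This is a minimal illustration rather than the paper's formulation: the helper names, the choice of rank-one bases, and the numeric values are all assumptions made for the example.

```python
def rank_one_basis(vec):
    """Build the symmetric rank-one matrix vec * vec^T (an assumed basis form)."""
    return [[a * b for b in vec] for a in vec]


def superpose_precision(coeffs, bases):
    """Return sum_i coeffs[i] * bases[i] for square basis matrices."""
    dim = len(bases[0])
    precision = [[0.0] * dim for _ in range(dim)]
    for lam, basis in zip(coeffs, bases):
        for r in range(dim):
            for c in range(dim):
                precision[r][c] += lam * basis[r][c]
    return precision


# Bases shared by all Gaussian components (hypothetical values).
bases = [
    rank_one_basis([1.0, 0.0]),
    rank_one_basis([0.0, 1.0]),
    rank_one_basis([1.0, 1.0]),
]

# Each component stores only its own coefficients, not a full matrix.
precision_a = superpose_precision([2.0, 1.0, 0.5], bases)
precision_b = superpose_precision([1.0, 3.0, 0.0], bases)
```

Here `precision_a` has off-diagonal terms because it uses the third basis, while `precision_b` stays diagonal; only the coefficient vectors differ between components, which is where the compactness comes from.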
Covariance Modelling for Noise-Robust Speech Recognition
Cited by 5 (5 self)
Model compensation is a standard way of improving speech recognisers’ robustness to noise. Most model compensation techniques produce diagonal covariances. However, this fails to handle any changes in the feature correlations due to the noise. This paper presents a scheme that allows full-covariance matrices to be estimated. One problem is that full covariance matrix estimation will be more sensitive to approximations, and those for the dynamic parameters are known to be crude. In this paper a linear transformation of a window of consecutive frames is used as the basis for dynamic parameter compensation. A second problem is that the resulting full covariance matrices slow down decoding. This is addressed by using predictive linear transforms that decorrelate the feature space, so that the decoder can then use diagonal covariance matrices. On a noise-corrupted Resource Management task, the proposed scheme outperformed the standard VTS compensation scheme.
Structured Precision Matrix Modelling for Speech Recognition
, 2006
Cited by 2 (0 self)
Declaration: This dissertation is the result of my own work and includes nothing which is the outcome of the work done in collaboration, except where stated. It has not been submitted in whole or part for a degree at any other university. The length of this thesis including footnotes and appendices is approximately 53,000 words.
Summary: The most extensively and successfully applied acoustic model for speech recognition is the Hidden Markov Model (HMM). In particular, a multivariate Gaussian Mixture Model (GMM) is typically used to represent the output density function of each HMM state. For reasons of efficiency, the covariance matrix associated with each Gaussian component is assumed diagonal and the probability of successive observations is assumed independent given the HMM state sequence. Consequently, the spectral (intra-frame) and temporal (inter-frame) correlations are poorly modelled. This thesis investigates ways of improving these aspects by extending the standard HMM. Parameters for these extended models are estimated discriminatively using the
Noisy CMLLR for noise-robust speech recognition
Cited by 1 (1 self)
Adaptive training is a widely used technique for building speech recognition systems on non-homogeneous training data. Recently there has been interest in applying these approaches to situations where there are significant levels of background noise. Various schemes for adaptive training are based on noise-specific, or speaker-specific, transforms of the observed noise-corrupted speech to yield estimates of the clean speech. However, when there are high levels of background noise, these clean speech estimates may be poor, resulting in degradations in performance. In this work, a new approach for adaptive training on noise-corrupted training data is presented. It extends a popular form of linear transform for model-based adaptation and adaptive training, constrained MLLR (CMLLR), to reflect additional uncertainty from noise-corrupted observations. This new form of transform is called noisy CMLLR (NCMLLR). NCMLLR uses a modified version of the generative model between clean speech and noisy observation, similar to factor analysis (FA). However, in contrast to FA, here the generative model describes a transformation rather than a covariance matrix structure. The use of NCMLLR for adaptation and adaptive training using an expectation-maximisation approach is described. Discriminative adaptive training with NCMLLR is also presented based on the minimum phone error criterion. Experiments are conducted on noise-corrupted versions of Resource Management and in-car recorded digit data. In preliminary experiments this new approach achieves improvements in recognition performance over the standard approach in low signal-to-noise ratio conditions. In addition, the need for adaptive training when there is a range of noise conditions in the training data is shown.
Discriminative Complexity Control and Linear Projections for Large Vocabulary Speech Recognition
, 2005
Selecting the optimal model structure with the “appropriate” complexity is a standard problem for training large vocabulary continuous speech recognition (LVCSR) systems, and machine learning in general. State-of-the-art LVCSR systems are highly complex. A wide variety of techniques may be used which alter the system complexity and word error rate (WER). Explicitly evaluating systems for all possible configurations is infeasible. Automatic model complexity control criteria are needed. Most existing complexity control schemes can be classified into two types, Bayesian learning techniques and information theory approaches. An implicit assumption is made in both that increasing the likelihood on held-out data decreases the WER. However, this correlation is found to be quite weak for current speech recognition systems. Hence it is preferable to employ discriminative methods for complexity control. In this thesis a novel discriminative model selection technique, the marginalization of a discriminative growth function, is presented. This is a closer approximation to the true WER than standard likelihood based approaches. The number of Gaussian components and feature dimensions of an HMM based LVCSR system is controlled. Experimental results on a wide range of LVCSR tasks showed that
Optimization Tool for Speech Recognition
The proposal can be realized using the HMM, which is the universal model for speech recognition. This proposal is an approach to increase the effectiveness of Hidden Markov Models (HMMs) in the speech recognition field. The approach can determine the optimum topology within a practical computation time, and its performance is comparable to the best recognition performance obtained by the conventional maximum likelihood approach with manual tuning that takes the decoding process into account. Thus, the proposed method can automatically and rapidly determine the acoustic model topology with the highest performance, dispensing with manual tuning procedures when constructing acoustic models for speech recognition that take the decoding process into account. In this proposal I have compared the HMM with the Maximum Entropy Markov Model (MEMM), Factor analysed hidden Markov models

