Results 1  10
of
18
Precision matrix modelling for large vocabulary continuous speech recognition
, 2004
"... Recently, structured precision matrix models were found to outperform the conventional diagonal covariance matrix models. Minimum phone error discriminative training of these models gave very good unadapted performance on large vocabulary continuous speech recognition systems. To obtain stateofthe ..."
Abstract

Cited by 19 (6 self)
 Add to MetaCart
(Show Context)
Recently, structured precision matrix models were found to outperform the conventional diagonal covariance matrix models. Minimum phone error discriminative training of these models gave very good unadapted performance on large vocabulary continuous speech recognition systems. To obtain stateoftheart performance, it is important to apply adaptation techniques efficiently to these models. In this paper, simple rowbyrow iterative formulae are described for both MLLR mean and constrained MLLR transform estimations of these models. These update formulae are derived within the standard expectation maximisation framework and are guaranteed to increase the likelihood of the adaptation data. Efficient approximate schemes for these adaptation methods are also investigated to further reduce the computation. Experimental results are presented based on the MPE trained Subspace for Precision and Mean models, evaluated on both broadcast news and conversational telephone speech English tasks. 1.
Linear Gaussian models for speech recognition
 CAMBRIDGE UNIVERSITY
, 2004
"... Currently the most popular acoustic model for speech recognition is the hidden Markov model (HMM). However, HMMs are based on a series of assumptions some of which are known to be poor. In particular, the assumption that successive speech frames are conditionally independent given the discrete stat ..."
Abstract

Cited by 17 (0 self)
 Add to MetaCart
(Show Context)
Currently the most popular acoustic model for speech recognition is the hidden Markov model (HMM). However, HMMs are based on a series of assumptions some of which are known to be poor. In particular, the assumption that successive speech frames are conditionally independent given the discrete state that generated them is not a good assumption for speech recognition. State space models may be used to address some shortcomings of this assumption. State space models are based on a continuous state vector evolving through time according to a state evo
RaoBlackwellised Gibbs Sampling for Switching Linear Dynamical Systems
 In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2004
, 2004
"... This paper describes the application of RaoBlackwellised Gibbs sampling (RBGS) to speech recognition using switching linear dynamical systems (SLDSs). The SLDS is a hybrid of standard hidden Markov models (HMMs) and linear dynamical systems. It is an extension of the stochastic segment model as it ..."
Abstract

Cited by 17 (2 self)
 Add to MetaCart
(Show Context)
This paper describes the application of RaoBlackwellised Gibbs sampling (RBGS) to speech recognition using switching linear dynamical systems (SLDSs). The SLDS is a hybrid of standard hidden Markov models (HMMs) and linear dynamical systems. It is an extension of the stochastic segment model as it relaxes the assumption of independent segments. SLDSs explicitly take into account the strong coarticulation present in speech. Unfortunately, inference in SLDS is intractable unless the discrete state sequence is known. RBGS is one approach that may be applied for both improved training and decoding for this form of intractable model. The theory of SLDS and RBGS is described, along with an efficient proposal mechanism. The performance of the SLDS using RBGS for training and inference is evaluated on the ARPA Resource Management task.
Basis Superposition Precision Matrix Modelling For Large Vocabulary Continuous Speech Recognition
 in Proc. ICASSP
, 2004
"... An important aspect of using Gaussian mixture models in a HMMbased speech recognition systems is the form of the covariance matrix. One successful approach has been to model the inverse covariance, precision, matrix by superimposing multiple bases. This paper presents a general framework of basis su ..."
Abstract

Cited by 8 (5 self)
 Add to MetaCart
(Show Context)
An important aspect of using Gaussian mixture models in a HMMbased speech recognition systems is the form of the covariance matrix. One successful approach has been to model the inverse covariance, precision, matrix by superimposing multiple bases. This paper presents a general framework of basis superposition. Models are described in terms of parameter tying of the basis coefficients and restrictions in the number of basis. Two forms of parameter tying are described which provide a compact model structure. The first constrains the basis coefficients over multiple basis vectors (or matrices). This is related to the subspace for precision and mean (SPAM) model. The second constrains the basis coefficients over multiple components, yielding as one example heteroscedastic LDA (HLDA). Both maximum likelihood and minimum phone error training of these models are discussed. The performance of various configurations is examined on a conversational telephone speech task, SwitchBoard.
HMMS AND RELATED SPEECH RECOGNITION TECHNOLOGIES
 SPRINGER HANDBOOK ON SPEECH PROCESSING AND SPEECH COMMUNICATION 1
"... Almost all present day continuous speech recognition (CSR) systems are based on Hidden Markov Models (HMMs). Although the fundamentals of HMMbased CSR have been understood for several decades, there has been steady progress in refining the technology both in terms of reducing the impact of the inhe ..."
Abstract

Cited by 8 (0 self)
 Add to MetaCart
Almost all present day continuous speech recognition (CSR) systems are based on Hidden Markov Models (HMMs). Although the fundamentals of HMMbased CSR have been understood for several decades, there has been steady progress in refining the technology both in terms of reducing the impact of the inherent assumptions, and in adapting the models for specific applications and environments. The aim of this chapter is to review the core architecture of a HMMbased CSR system and then outline the major areas of refinement incorporated into modernday systems.
Covariance Modelling for NoiseRobust Speech Recognition
"... Model compensation is a standard way of improving speech recognisers’ robustness to noise. Most model compensation techniques produce diagonal covariances. However, this fails to handle any changes in the feature correlations due to the noise. This paper presents a scheme that allows fullcovariance ..."
Abstract

Cited by 8 (5 self)
 Add to MetaCart
Model compensation is a standard way of improving speech recognisers’ robustness to noise. Most model compensation techniques produce diagonal covariances. However, this fails to handle any changes in the feature correlations due to the noise. This paper presents a scheme that allows fullcovariance matrices to be estimated. One problem is that full covariance matrix estimation will be more sensitive approximations, those for the dynamic parameters are known to crude. In this paper a linear transformation of a window of consecutive frames is used as the basis for dynamic parameter compensation. A second problem is that the resulting full covariance matrices slow down decoding. This is addressed by using predictive linear transforms that decorrelate the feature space, so that the decoder can then use diagonal covariance matrices. On a noisecorrupted Resource Management task, the proposed scheme outperformed the standard VTS compensation scheme.
Probabilistic Linear Discriminant Analysis with Bottleneck Features for Speech Recognition
"... We have recently proposed a new acoustic model based on probabilistic linear discriminant analysis (PLDA) which enjoys the flexibility of using higher dimensional acoustic features, and is more capable to capture the intraframe feature correlations. In this paper, we investigate the use of bottlen ..."
Abstract

Cited by 5 (3 self)
 Add to MetaCart
(Show Context)
We have recently proposed a new acoustic model based on probabilistic linear discriminant analysis (PLDA) which enjoys the flexibility of using higher dimensional acoustic features, and is more capable to capture the intraframe feature correlations. In this paper, we investigate the use of bottleneck features obtained from a deep neural network (DNN) for the PLDAbased acoustic model. Experiments were performed on the Switchboard dataset — a large vocabulary conversational telephone speech corpus. We observe significant word error reduction by using the bottleneck features. In addition, we have also compared the PLDAbased acoustic model to three others using Gaussian mixture models (GMMs), subspace GMMs and hybrid deep neural networks (DNNs), and PLDA can achieve comparable or slightly higher recognition accuracy from our experiments. Index Terms: speech recognition, bottleneck features, probabilistic linear discriminant analysis
Structured Precision Matrix Modelling for Speech Recognition
, 2006
"... Declaration This dissertation is the result of my own work and includes nothing which is the outcome of the work done in collaboration, except where stated. It has not been submitted in whole or part for a degree at any other university. The length of this thesis including footnotes and appendices i ..."
Abstract

Cited by 2 (0 self)
 Add to MetaCart
(Show Context)
Declaration This dissertation is the result of my own work and includes nothing which is the outcome of the work done in collaboration, except where stated. It has not been submitted in whole or part for a degree at any other university. The length of this thesis including footnotes and appendices is approximately 53,000 words. ii Summary The most extensively and successfully applied acoustic model for speech recognition is the Hidden Markov Model (HMM). In particular, a multivariate Gaussian Mixture Model (GMM) is typically used to represent the output density function of each HMM state. For reasons of efficiency, the covariance matrix associated with each Gaussian component is assumed diagonal and the probability of successive observations is assumed independent given the HMM state sequence. Consequently, the spectral (intraframe) and temporal (interframe) correlations are poorly modelled. This thesis investigates ways of improving these aspects by extending the standard HMM. Parameters for these extended models are estimated discriminatively using the
Noisy CMLLR for noiserobust speech recognition
"... Adaptive training is a widely used technique for building speech recognition systems on nonhomogeneous training data. Recently there has been interest in applying these approaches for situations where there is significant levels of background noise. Various schemes for adaptive training are based o ..."
Abstract

Cited by 1 (1 self)
 Add to MetaCart
(Show Context)
Adaptive training is a widely used technique for building speech recognition systems on nonhomogeneous training data. Recently there has been interest in applying these approaches for situations where there is significant levels of background noise. Various schemes for adaptive training are based on noise, or speaker, specific transforms of the observed noisecorrupted speech to yield estimates of the clean speech. However when there are high levels of background noise, these clean speech estimates may be poor resulting in degradations in performance. In this work, a new approach for adaptive training on noisecorrupted training data is presented. It extends a popular form of linear transform for modelbased adaptation and adaptive training, constrained MLLR (CMLLR), to reflect additional uncertainty from noisecorrupted observations. This new form of transform is called noisy CMLLR (NCMLLR). NCMLLR uses a modified version of generative model between clean speech and noisy observation, similar to factor analysis (FA). However in contrast in FA here the generative model describes a transformation, rather than a covariance matrix structure. The use of NCMLLR for adaptation and adaptive training using an expectationmaximisation approach is described. Discriminative adaptive training with NCMLLR is also presented based on the minimum phone error criterion. Experiments are conducted on noisecorrupted version of Resource Management and incar recorded digit data. In preliminary experiments this new approach achieves improvements in recognition performance over the standard approach in low signaltonoise ratio conditions. In addition the need for adaptive training when there are a range of noise conditions in the training data is shown. 2 1
Spectrotemporal postenhancement using MMSE estimation in NMF based singlechannel source separation
 in Annual Conference of the International Speech Communication Association (InterSpeech
, 2013
"... We propose to use minimum mean squared error (MMSE) estimates to enhance the signals that are separated by nonnegative matrix factorization (NMF). In single channel source separation (SCSS), NMF is used to train a set of basis vectors for each source from their training spectrograms. Then NMF is u ..."
Abstract

Cited by 1 (1 self)
 Add to MetaCart
(Show Context)
We propose to use minimum mean squared error (MMSE) estimates to enhance the signals that are separated by nonnegative matrix factorization (NMF). In single channel source separation (SCSS), NMF is used to train a set of basis vectors for each source from their training spectrograms. Then NMF is used to decompose the mixed signal spectrogram as a weighted linear combination of the trained basis vectors from which estimates of each corresponding source can be obtained. In this work, we deal with the spectrogram of each separated signal as a 2D distorted signal that needs to be restored. A multiplicative distortion model is assumed where the logarithm of the true signal distribution is modeled with a Gaussian mixture model (GMM) and the distortion is modeled as having a lognormal distribution. The parameters of the GMM are learned from training data whereas the distortion parameters are learned online from each separated signal. The initial source estimates are improved and replaced with their MMSE estimates under this new probabilistic framework. The experimental results show that using the proposed MMSE estimation technique as a post enhancement after NMF improves the quality of the separated signal. Index Terms: Single channel source separation, nonnegative matrix factorization, Minimum mean square error estimates, and Gaussian mixture models. 1.