Results 1–10 of 45
Joint uncertainty decoding for robust large vocabulary speech recognition
, 2006
"... Standard techniques to increase automatic speech recognition noise robustness typically assume recognition models are clean trained. This “clean ” training data may in fact not be clean at all, but may contain channel variations, varying noise conditions, as well as different speakers. Hence rather ..."
Abstract

Cited by 36 (28 self)
Standard techniques to increase automatic speech recognition noise robustness typically assume recognition models are clean trained. This “clean” training data may in fact not be clean at all, but may contain channel variations, varying noise conditions, as well as different speakers. Hence, rather than considering noise robustness techniques as compensating clean acoustic models for environmental noise, they may be thought of as reducing the acoustic mismatch between training and test conditions. This report examines the application of VTS model compensation or model-based Joint uncertainty decoding to clean and multi-style trained systems. An EM-based noise estimation procedure is also presented to produce ML VTS or Joint noise models depending on the form of compensation used. Alternatively, compared to multi-style training, adaptive training with Joint uncertainty transforms, also referred to as JAT in this work, provides a better method for handling heterogeneous data. With JAT, the uncertainty bias added to the model variances de-weights observations proportional to the noise level. In this way, Joint transforms normalise the noise from the data, allowing the canonical model to solely represent the underlying “clean” acoustic signal. This …
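The de-weighting behaviour the abstract describes can be sketched numerically. This is a toy illustration, not the paper's actual JAT estimation: scalar Gaussians stand in for the acoustic model components, and `bias_var` stands in for the Joint uncertainty variance bias; all names here are hypothetical.

```python
import math

def gaussian_loglik(y, mean, var):
    """Log-likelihood of a scalar observation under N(mean, var)."""
    return -0.5 * (math.log(2 * math.pi * var) + (y - mean) ** 2 / var)

def jud_loglik(y, mean, var, bias_var):
    """Joint-uncertainty-decoding style scoring: an uncertainty bias is added
    to the model variance, so noisy observations yield flatter likelihoods."""
    return gaussian_loglik(y, mean, var + bias_var)

# With no uncertainty bias the two competing components are well separated;
# a large bias (high noise) shrinks the gap, de-weighting the observation.
clean_gap = jud_loglik(1.0, 0.0, 1.0, 0.0) - jud_loglik(1.0, 3.0, 1.0, 0.0)
noisy_gap = jud_loglik(1.0, 0.0, 1.0, 9.0) - jud_loglik(1.0, 3.0, 1.0, 9.0)
print(clean_gap, noisy_gap)  # 1.5 0.15
```

The larger the uncertainty bias, the smaller the likelihood gap between competing components, which is exactly the "observations de-weighted proportional to the noise level" property.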
Adaptive training with joint uncertainty decoding for robust recognition of noisy data
 In Proceedings of ICASSP, Volume IV
, 2007
"... Standard noise compensation techniques for automatic speech recognition assume a clean trained acoustic model. What is thought of as “clean” data, may still have a variety of speakers, different channels and varying noise conditions. Hence it may be more reasonable to consider such data multicondit ..."
Abstract

Cited by 32 (17 self)
Standard noise compensation techniques for automatic speech recognition assume a clean trained acoustic model. What is thought of as “clean” data may still have a variety of speakers, different channels and varying noise conditions. Hence it may be more reasonable to consider such data multi-condition, suited for multi-style training. This paper shows that multi-style models benefit from VTS compensation or Joint uncertainty decoding by reducing the mismatch between training and test. An EM-based noise estimation procedure that produces ML VTS or Joint noise models is also described. Alternatively, adaptive training with Joint uncertainty transforms factors out the noise from the data. The uncertainty variance bias de-weights observations in the training data where the SNR is low. This property allows data with a wide SNR range to be used and produces canonical models that truly represent clean speech, whereas multi-style trained models must account for all acoustic variation associated with different noise conditions. This paper presents Joint adaptive training, including formulae for estimating the transforms and canonical model parameters. Experiments are conducted on the …
Discriminative classifiers with adaptive kernels for noise robust speech recognition
 Comput. Speech Lang
, 2010
"... Discriminative classifiers are a popular approach to solving classification problems. However one of the problems with these approaches, in particular kernel based classifiers such as Support Vector Machines (SVMs), is that they are hard to adapt to mismatches between the training and test data. Thi ..."
Abstract

Cited by 24 (18 self)
Discriminative classifiers are a popular approach to solving classification problems. However, one of the problems with these approaches, in particular kernel-based classifiers such as Support Vector Machines (SVMs), is that they are hard to adapt to mismatches between the training and test data. This paper describes a scheme for overcoming this problem for speech recognition in noise by adapting the kernel rather than the SVM decision boundary. Generative kernels, defined using generative models, are one type of kernel that allows SVMs to handle sequence data. By compensating the parameters of the generative models for each noise condition, noise-specific generative kernels can be obtained. These can be used to train a noise-independent SVM on a range of noise conditions, which can then be used with a test-set noise kernel for classification. The noise-specific kernels used in this paper are based on Vector Taylor Series (VTS) model-based compensation. VTS allows all the model parameters to be compensated and the background noise to be estimated in a maximum likelihood fashion. A brief discussion of VTS, and of the optimisation of the mismatch function representing the impact of noise on the clean speech, is also included. Experiments using these VTS-based test-set noise kernels were run on the AURORA 2 continuous digit task. The proposed SVM rescoring scheme yields large gains in performance over the VTS compensated models. Key words: speech recognition, noise robustness, support vector machines, generative kernels
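A minimal sketch of the generative-kernel idea, under strong simplifying assumptions: single Gaussians stand in for the HMM generative models, and the score space is just the length-normalised log-likelihood ratio (the paper's kernels also use parameter derivatives and VTS-compensated model parameters). All function and variable names here are hypothetical.

```python
import math

def loglik(seq, mean, var):
    """Sequence log-likelihood under a single Gaussian (a toy stand-in for
    the HMM/GMM generative models used in the paper)."""
    return sum(-0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)
               for x in seq)

def score_feature(seq, model_a, model_b):
    """One-dimensional score space: length-normalised log-likelihood ratio
    of the two class-conditional generative models."""
    return (loglik(seq, *model_a) - loglik(seq, *model_b)) / len(seq)

def generative_kernel(seq_i, seq_j, model_a, model_b):
    """Linear kernel in the score space; noise-specific kernels would first
    replace the model parameters with VTS-compensated ones."""
    return (score_feature(seq_i, model_a, model_b)
            * score_feature(seq_j, model_a, model_b))

model_a, model_b = (0.0, 1.0), (2.0, 1.0)   # hypothetical class models
seq_a = [0.1, -0.2, 0.3]                    # resembles class A
seq_b = [1.9, 2.1, 2.2]                     # resembles class B
k = generative_kernel(seq_a, seq_b, model_a, model_b)
print(k < 0)  # opposite-class sequences land on opposite sides of the score space
```

Because the score feature is computed from the generative models, compensating those models for a new noise condition changes the kernel without retraining the SVM decision boundary, which is the adaptation route the paper takes.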
Issues with uncertainty decoding for noise robust speech recognition
 Speech Communication
, 2008
"... Interest is growing in a class of robustness algorithms that exploit the notion of uncertainty introduced by environmental noise. The majority of these techniques share the property that the uncertainty of an observation due to noise is propagated to the recogniser, resulting in increased model vari ..."
Abstract

Cited by 24 (9 self)
Interest is growing in a class of robustness algorithms that exploit the notion of uncertainty introduced by environmental noise. The majority of these techniques share the property that the uncertainty of an observation due to noise is propagated to the recogniser, resulting in increased model variances. Using appropriate approximations, efficient implementations may be obtained, with the goal of achieving near model-based performance without the associated computational cost. Unfortunately, uncertainty decoding forms that compute the uncertainty in the front-end and pass this to the decoder may suffer from a theoretical problem in low signal-to-noise ratio conditions. This report discusses how this fundamental issue arises, and demonstrates it through two schemes: SPLICE with uncertainty and front-end Joint uncertainty decoding. A method to mitigate this in the Joint form is presented, as well as how SPLICE implicitly addresses it. However, it is shown that a model-based Joint uncertainty decoding approach does not suffer from this limitation, as these front-end forms do, and is also competitive computationally. The issues described and the performance of the various schemes are examined on two artificially corrupted corpora: the AURORA 2.0 digit recognition database and the thousand-word Resource Management task.
Transforming Binary Uncertainties for Robust Speech Recognition
"... Abstract—Recently, several algorithms have been proposed to enhance noisy speech by estimating a binary mask that can be used to select those time–frequency regions of a noisy speech signal that contain more speech energy than noise energy. This binary mask encodes the uncertainty associated with en ..."
Abstract

Cited by 22 (8 self)
Abstract—Recently, several algorithms have been proposed to enhance noisy speech by estimating a binary mask that can be used to select those time–frequency regions of a noisy speech signal that contain more speech energy than noise energy. This binary mask encodes the uncertainty associated with enhanced speech in the linear spectral domain. The use of the cepstral transformation smears the information from the noise-dominant time–frequency regions across all the cepstral features. We propose a supervised approach using regression trees to learn the nonlinear transformation of the uncertainty from the linear spectral domain to the cepstral domain. This uncertainty is used by a decoder that exploits the variance associated with the enhanced cepstral features to improve robust speech recognition. Systematic evaluations on a subset of the Aurora4 task using the estimated uncertainty show substantial improvement over the baseline performance across various noise conditions. Index Terms—Binary time–frequency mask, computational auditory scene analysis (CASA), robust automatic speech recognition, spectrogram reconstruction, uncertainty decoding.
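The mask construction described above can be sketched on a toy spectrogram. This sketch assumes oracle access to separate speech and noise energies, which in practice must be estimated; all names are hypothetical.

```python
def binary_mask(speech_energy, noise_energy):
    """Per time-frequency unit: keep the unit when it is speech-dominated
    (more speech energy than noise energy, i.e. local SNR above 0 dB)."""
    return [[1 if s > n else 0 for s, n in zip(s_row, n_row)]
            for s_row, n_row in zip(speech_energy, noise_energy)]

def apply_mask(spectrogram, mask):
    """Zero out the noise-dominant units before reconstruction."""
    return [[v * b for v, b in zip(v_row, b_row)]
            for v_row, b_row in zip(spectrogram, mask)]

# Toy 2x3 energies (rows: frequency channels, columns: time frames).
speech = [[4.0, 0.5, 3.0], [0.2, 2.0, 0.1]]
noise  = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]
mask = binary_mask(speech, noise)
print(mask)  # [[1, 0, 1], [0, 1, 0]]
```

The zeros in the mask mark exactly the unreliable units whose uncertainty the paper's regression trees then map from the linear spectral domain into the cepstral domain.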
Predictive linear transforms for noise robust speech recognition
 in Proceedings of the ASRU Workshop
"... It is well known that the addition of background noise alters the correlations between the elements of, for example, the MFCC feature vector. However, standard modelbased compensation techniques do not modify the featurespace in which the diagonal covariance matrix Gaussian mixture models are esti ..."
Abstract

Cited by 20 (16 self)
It is well known that the addition of background noise alters the correlations between the elements of, for example, the MFCC feature vector. However, standard model-based compensation techniques do not modify the feature-space in which the diagonal covariance matrix Gaussian mixture models are estimated. One solution to this problem, which yields good performance, is Joint Uncertainty Decoding (JUD) with full transforms. Unfortunately, this results in a high computational cost during decoding. This paper contrasts two approaches to approximating full JUD while lowering the computational cost. Both use predictive linear transforms to modify the feature-space: adaptation-based linear transforms, where the model parameters are restricted to be the same as the original clean system; and precision matrix modelling approaches, in particular semi-tied covariance matrices. These predictive transforms are estimated using statistics derived from the full JUD transforms rather than noisy data. The schemes are evaluated on AURORA 2 and a noise-corrupted Resource Management task. Index Terms — Noise robust speech recognition, joint uncertainty …
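The core idea, decorrelating the feature space so that diagonal-covariance decoding stays valid, can be illustrated on a single 2x2 covariance. The rotation below is a toy stand-in for the predictive transforms the paper estimates from full JUD statistics; it is not the paper's estimation procedure, and all names are hypothetical.

```python
import math

def decorrelating_transform(cov):
    """For a symmetric 2x2 covariance, the rotation A with A . cov . A^T
    diagonal; the angle zeroes the off-diagonal term of the rotated matrix."""
    (a, b), (_, c) = cov
    theta = 0.5 * math.atan2(2 * b, a - c)
    ct, st = math.cos(theta), math.sin(theta)
    return [[ct, st], [-st, ct]]

def transform_cov(A, cov):
    """Compute A . cov . A^T for 2x2 matrices."""
    tmp = [[sum(A[i][k] * cov[k][j] for k in range(2)) for j in range(2)]
           for i in range(2)]
    return [[sum(tmp[i][k] * A[j][k] for k in range(2)) for j in range(2)]
            for i in range(2)]

cov = [[2.0, 0.8], [0.8, 1.0]]     # noise has introduced feature correlation
A = decorrelating_transform(cov)
rotated = transform_cov(A, cov)
print(abs(rotated[0][1]) < 1e-9)   # off-diagonal removed: diagonal decoding is valid
```

Applying such a transform once to the features is far cheaper than decoding with full covariance matrices, which is the motivation for the predictive schemes in the paper.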
EXTENDED VTS FOR NOISE-ROBUST SPEECH RECOGNITION
"... Model compensation is a standard way of improving speech recognisers’ robustness to noise. Currently popular schemes are based on vector Taylor series (VTS) compensation. They often use the continuous time approximation to compensate dynamic parameters. In this paper, the accuracy of dynamic paramet ..."
Abstract

Cited by 14 (10 self)
Model compensation is a standard way of improving speech recognisers’ robustness to noise. Currently popular schemes are based on vector Taylor series (VTS) compensation. They often use the continuous-time approximation to compensate dynamic parameters. In this paper, the accuracy of dynamic parameter compensation is improved by representing the dynamic features as a linear transformation of a window of static features. A modified version of VTS compensation is applied to the distribution of the window of static features and, importantly, their correlations. These compensated distributions are then transformed to standard static and dynamic distributions. The proposed scheme outperformed the standard VTS scheme by about 10 % relative. Index Terms — Speech recognition, acoustic noise, robustness
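As a toy illustration of the linear relation above: the standard regression delta coefficients express the dynamic feature exactly as one row of a linear transform applied to a stacked window of static features (a window half-width of 2 is assumed here; names are hypothetical).

```python
def delta_from_window(window):
    """Standard delta feature as a function of a 5-frame window of statics:
    d_t = sum_k k*(s_{t+k} - s_{t-k}) / (2 * sum_k k^2), for k = 1, 2."""
    s_m2, s_m1, s_0, s_p1, s_p2 = window
    num = 1 * (s_p1 - s_m1) + 2 * (s_p2 - s_m2)
    return num / (2 * (1 ** 2 + 2 ** 2))

# The same computation as a single row vector applied to the stacked window;
# it is this joint static-window representation (including cross-frame
# correlations) that extended VTS compensates.
D = [-2 / 10, -1 / 10, 0.0, 1 / 10, 2 / 10]
window = [0.0, 1.0, 2.0, 3.0, 4.0]
as_formula = delta_from_window(window)
as_matrix = sum(d * s for d, s in zip(D, window))
print(as_formula, as_matrix)  # both approximately 1.0 (a linear ramp has slope 1)
```

Because the delta is just a linear map of the static window, compensating the window's joint distribution and then applying this map yields compensated dynamic distributions, which is the route the paper takes instead of the continuous-time approximation.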
Incremental predictive and adaptive noise compensation
 In Proc. ICASSP
, 2009
"... Model compensation schemes are a powerful approach to handling mismatches between training and testing conditions. Normally these schemes are run in a batch adaptation mode, rerecognising the utterance used to estimate the noise model parameters. For many applications this introduces unacceptable l ..."
Abstract

Cited by 9 (7 self)
Model compensation schemes are a powerful approach to handling mismatches between training and testing conditions. Normally these schemes are run in a batch adaptation mode, re-recognising the utterance used to estimate the noise model parameters. For many applications this introduces unacceptable latency. This paper examines three forms of incremental-mode model-based compensation: vector Taylor series; joint uncertainty decoding; and predictive CMLLR. These predictive schemes can also be combined with adaptive schemes such as CMLLR. By combining the approaches, the weaknesses of each can be addressed. The performance is evaluated on in-car recorded data, where the combined incremental scheme shows gains over either scheme individually. Index Terms: noise robustness, speaker adaptation
Covariance Modelling for Noise-Robust Speech Recognition
"... Model compensation is a standard way of improving speech recognisers’ robustness to noise. Most model compensation techniques produce diagonal covariances. However, this fails to handle any changes in the feature correlations due to the noise. This paper presents a scheme that allows fullcovariance ..."
Abstract

Cited by 8 (5 self)
Model compensation is a standard way of improving speech recognisers’ robustness to noise. Most model compensation techniques produce diagonal covariances. However, this fails to handle any changes in the feature correlations due to the noise. This paper presents a scheme that allows full-covariance matrices to be estimated. One problem is that full covariance matrix estimation is more sensitive to approximations, and those for the dynamic parameters are known to be crude. In this paper a linear transformation of a window of consecutive frames is used as the basis for dynamic parameter compensation. A second problem is that the resulting full covariance matrices slow down decoding. This is addressed by using predictive linear transforms that decorrelate the feature space, so that the decoder can then use diagonal covariance matrices. On a noise-corrupted Resource Management task, the proposed scheme outperformed the standard VTS compensation scheme.
On noise estimation for robust speech recognition using vector Taylor series
 Proc. ICASSP
, 2010
"... In this paper, we propose a novel noise variance estimation method using the fixed point method for the VTSbased robust speech recognition. Noise parameters are reestimated over a given utterance using an EM algorithm. The derivative of the auxiliary function with respect to the noise variance i ..."
Abstract

Cited by 7 (3 self)
In this paper, we propose a novel noise variance estimation method using the fixed point method for VTS-based robust speech recognition. Noise parameters are re-estimated over a given utterance using an EM algorithm. The derivative of the auxiliary function with respect to the noise variance is resolved, and the fixed point algorithm estimates the noise variance by recursively approximating the root of the resulting derivative. The method leads to a re-estimation formula with a flavor similar to the standard ML variance estimation, and the iteration procedure is step-size free. We also investigate improving the noise estimation for efficient VTS adaptation. Several fast noise estimation methods are examined, including estimation from non-speech areas and incremental adaptation. In the evaluation on the Aurora 2 database, the proposed noise variance estimation method obtains a significant improvement in recognition accuracy over the method using the sample variance. Further experiments show that VTS ML estimation over non-speech areas is an effective fast adaptation method. The final refined approach achieves 8.75 % WER, a 13 % relative improvement over conventional VTS adaptation. Index Terms — Robust speech recognition, vector Taylor series, noise estimation
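The step-size-free recursion described above can be sketched generically. The update function below is a toy contraction (Newton's square-root recursion), standing in for the rearranged derivative of the EM auxiliary function; it illustrates only the fixed-point mechanism, not the paper's actual variance update.

```python
def fixed_point(g, x0, tol=1e-10, max_iter=100):
    """Step-size-free fixed-point iteration: repeatedly apply x <- g(x)
    until it converges to a root of x - g(x) = 0."""
    x = x0
    for _ in range(max_iter):
        x_new = g(x)
        if abs(x_new - x) < tol:
            return x_new
        x = x_new
    return x

# Toy stand-in for the variance re-estimation: solve v = 0.5*(v + 2/v),
# whose fixed point is v = sqrt(2). No step size or line search is needed,
# which is the practical appeal of the fixed-point formulation.
v = fixed_point(lambda v: 0.5 * (v + 2.0 / v), x0=1.0)
print(round(v ** 2, 6))  # 2.0
```

In the paper, the analogue of `g` is obtained by setting the auxiliary function's derivative with respect to the noise variance to zero and rearranging it into a recursion over the variance estimate.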