Joint uncertainty decoding for robust large vocabulary speech recognition
, 2006
Cited by 23 (20 self)
Standard techniques to increase automatic speech recognition noise robustness typically assume recognition models are clean trained. This “clean” training data may in fact not be clean at all, but may contain channel variations, varying noise conditions, as well as different speakers. Hence rather than considering noise robustness techniques as compensating clean acoustic models for environmental noise, they may be thought of as reducing the acoustic mismatch between training and test conditions. This report examines the application of VTS model compensation or model-based Joint uncertainty decoding to clean and multistyle trained systems. An EM-based noise estimation procedure is also presented to produce ML VTS or Joint noise models depending on the form of compensation used. Alternatively, compared to multistyle training, adaptive training with Joint uncertainty transforms, also referred to as JAT in this work, provides a better method for handling heterogeneous data. With JAT, the uncertainty bias added to the model variances de-weights observations proportional to the noise level. In this way, Joint transforms normalise the noise from the data allowing the canonical model to solely represent the underlying “clean” acoustic signal. This
Adaptive training with joint uncertainty decoding for robust recognition of noisy data
- IN PROCEEDINGS OF ICASSP, VOLUME IV
, 2007
Cited by 15 (13 self)
Standard noise compensation techniques for automatic speech recognition assume a clean trained acoustic model. What is thought of as “clean” data may still have a variety of speakers, different channels and varying noise conditions. Hence it may be more reasonable to consider such data multi-conditional for multistyle training. This paper shows that multistyle models benefit from VTS compensation or Joint uncertainty decoding by reducing the mismatch between training and test. An EM-based noise estimation procedure that produces ML VTS or Joint noise models is also described. Alternatively, adaptive training with Joint uncertainty transforms factors out the noise from the data. The uncertainty variance bias de-weights observations in the training data where the SNR is low. This property allows data with a wide SNR range to be used and produces canonical models that truly represent clean speech, whereas multistyle trained models must account for all acoustic variation associated with different noise conditions. This paper presents Joint adaptive training including formulae for estimating the transforms and canonical model parameters. Experiments are conducted on the
Issues with uncertainty decoding for noise robust speech recognition
- Speech Communication
, 2008
Cited by 14 (9 self)
Interest is growing in a class of robustness algorithms that exploit the notion of uncertainty introduced by environmental noise. The majority of these techniques share the property that the uncertainty of an observation due to noise is propagated to the recogniser, resulting in increased model variances. Using appropriate approximations, efficient implementations may be obtained, with the goal of achieving near model-based performance without the associated computational cost. Unfortunately, uncertainty decoding forms that compute the uncertainty in the front-end and pass this to the decoder may suffer from a theoretical problem in low signal-to-noise ratio conditions. This report discusses how this fundamental issue arises, and demonstrates it through two schemes: SPLICE with uncertainty and front-end Joint uncertainty decoding. A method to mitigate this in the Joint form is presented, as well as how SPLICE implicitly addresses it. However, it is shown that, unlike these front-end forms, a model-based Joint uncertainty decoding approach does not suffer from this limitation and is also competitive computationally. The issues described and performance of the various schemes are examined on two artificially corrupted corpora: the AURORA 2.0 digit recognition database and the thousand-word Resource Management task.
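The effect this abstract describes, noise uncertainty reaching the recogniser as increased model variances, can be illustrated for a single scalar Gaussian. The following is a minimal sketch under stated assumptions, not the report's implementation: the function names and the scalar, equal-variance setup are hypothetical, and real front-end schemes also transform the features.

```python
import math

def log_gauss(y, mean, var):
    """Log density of a scalar Gaussian N(y; mean, var)."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (y - mean) ** 2 / var)

def uncertainty_log_likelihood(y, mean, var, bias_var):
    """Score y with the model variance inflated by an uncertainty bias.

    bias_var is a hypothetical per-frame uncertainty variance; it would be
    large when the local SNR is low and near zero for clean frames.
    """
    return log_gauss(y, mean, var + bias_var)

def class_log_odds(y, mean_a, mean_b, var, bias_var):
    """Log-likelihood ratio between two equal-variance model components."""
    return (uncertainty_log_likelihood(y, mean_a, var, bias_var)
            - uncertainty_log_likelihood(y, mean_b, var, bias_var))

# A larger uncertainty bias shrinks the log odds, so the frame
# contributes less to the decision between the two components.
confident = class_log_odds(0.5, 0.0, 2.0, 1.0, 0.0)
uncertain = class_log_odds(0.5, 0.0, 2.0, 1.0, 3.0)
```

In this toy setup the observation favours the first component in both cases, but the margin falls as the bias grows, which is the de-weighting behaviour the abstract refers to.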
Predictive linear transforms for noise robust speech recognition
- in Proceedings of the ASRU Workshop
Cited by 14 (12 self)
It is well known that the addition of background noise alters the correlations between the elements of, for example, the MFCC feature vector. However, standard model-based compensation techniques do not modify the feature-space in which the diagonal covariance matrix Gaussian mixture models are estimated. One solution to this problem, which yields good performance, is Joint Uncertainty Decoding (JUD) with full transforms. Unfortunately, this results in a high computational cost during decoding. This paper contrasts two approaches to approximating full JUD while lowering the computational cost. Both use predictive linear transforms to modify the feature-space: adaptation-based linear transforms, where the model parameters are restricted to be the same as the original clean system; and precision matrix modelling approaches, in particular semi-tied covariance matrices. These predictive transforms are estimated using statistics derived from the full JUD transforms rather than noisy data. The schemes are evaluated on AURORA 2 and a noise-corrupted Resource Management task. Index Terms — Noise robust speech recognition, joint uncertainty
Discriminative classifiers with adaptive kernels for noise robust speech recognition
- Comput. Speech Lang
, 2010
Cited by 12 (10 self)
Discriminative classifiers are a popular approach to solving classification problems. However, one of the problems with these approaches, in particular kernel-based classifiers such as Support Vector Machines (SVMs), is that they are hard to adapt to mismatches between the training and test data. This paper describes a scheme for overcoming this problem for speech recognition in noise by adapting the kernel rather than the SVM decision boundary. Generative kernels, defined using generative models, are one type of kernel that allows SVMs to handle sequence data. By compensating the parameters of the generative models for each noise condition, noise-specific generative kernels can be obtained. These can be used to train a noise-independent SVM on a range of noise conditions, which can then be used with a test-set noise kernel for classification. The noise-specific kernels used in this paper are based on Vector Taylor Series (VTS) model-based compensation. VTS allows all the model parameters to be compensated and the background noise to be estimated in a maximum likelihood fashion. A brief discussion of VTS, and the optimisation of the mismatch function representing the impact of noise on the clean speech, is also included. Experiments using these VTS-based test-set noise kernels were run on the AURORA 2 continuous digit task. The proposed SVM rescoring scheme yields large gains in performance over the VTS compensated models. Key words: speech recognition, noise robustness, support vector machines, generative kernels
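To make the idea of VTS model-based compensation more concrete, here is a minimal scalar sketch. It assumes the commonly used log-spectral mismatch function y = x + log(1 + exp(n - x)) with additive noise only, no channel term, and independent speech and noise; the abstract does not state the exact form used in the paper, and the function names are hypothetical.

```python
import math

def mismatch(x, n):
    """Noisy log-spectral value for clean speech x and additive noise n."""
    return x + math.log1p(math.exp(n - x))

def vts_compensate(mu_x, var_x, mu_n, var_n):
    """First-order VTS: linearise the mismatch function around the means.

    Returns the approximate mean and variance of the corrupted speech,
    assuming speech and noise are independent Gaussians.
    """
    # Partial derivative of y with respect to x at the expansion point.
    jac_x = 1.0 / (1.0 + math.exp(mu_n - mu_x))
    # Partial derivative with respect to n; the two sum to one.
    jac_n = 1.0 - jac_x
    mu_y = mismatch(mu_x, mu_n)
    var_y = jac_x ** 2 * var_x + jac_n ** 2 * var_n
    return mu_y, var_y

# Equal speech and noise levels: the Jacobians are both 0.5.
mu_equal, var_equal = vts_compensate(0.0, 1.0, 0.0, 1.0)
# High SNR: the compensated model stays close to the clean one.
mu_clean, var_clean = vts_compensate(20.0, 1.0, 0.0, 1.0)
```

At high SNR the compensated parameters approach the clean ones, while at low SNR they move towards the noise model.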
EXTENDED VTS FOR NOISE-ROBUST SPEECH RECOGNITION
Cited by 10 (9 self)
Model compensation is a standard way of improving speech recognisers’ robustness to noise. Currently popular schemes are based on vector Taylor series (VTS) compensation. They often use the continuous time approximation to compensate dynamic parameters. In this paper, the accuracy of dynamic parameter compensation is improved by representing the dynamic features as a linear transformation of a window of static features. A modified version of VTS compensation is applied to the distribution of the window of static features and, importantly, their correlations. These compensated distributions are then transformed to standard static and dynamic distributions. The proposed scheme outperformed the standard VTS scheme by about 10% relative. Index Terms — Speech recognition, acoustic noise, robustness
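The key step in this abstract, treating a dynamic feature as a linear transformation of a window of static features, can be sketched as follows. This is only an illustration under assumptions: it uses the widely used regression formula for delta coefficients over a window of 2K+1 frames and a single scalar feature, whereas the abstract does not say which weights the paper uses. Given a Gaussian over the window, the delta distribution follows by applying the same linear map to the mean and covariance.

```python
def delta_weights(K):
    """Regression weights mapping 2K+1 static values to one delta value."""
    denom = 2.0 * sum(k * k for k in range(1, K + 1))
    return [k / denom for k in range(-K, K + 1)]

def delta_mean_var(weights, window_mean, window_cov):
    """Mean and variance of the delta feature for a Gaussian window.

    window_mean has length 2K+1 and window_cov is a (2K+1) x (2K+1)
    nested list; off-diagonal entries carry the frame correlations.
    """
    n = len(weights)
    mean = sum(w * m for w, m in zip(weights, window_mean))
    var = sum(weights[i] * window_cov[i][j] * weights[j]
              for i in range(n) for j in range(n))
    return mean, var

weights = delta_weights(2)
# A static mean rising by one unit per frame gives a delta mean of one.
ramp_mean = [0.0, 1.0, 2.0, 3.0, 4.0]
identity_cov = [[1.0 if i == j else 0.0 for j in range(5)] for i in range(5)]
d_mean, d_var = delta_mean_var(weights, ramp_mean, identity_cov)
```

Because the covariance of the whole window enters the variance computation, correlations between neighbouring frames affect the compensated dynamic parameters, which is the property the abstract highlights.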
Transforming Binary Uncertainties for Robust Speech Recognition
Cited by 7 (5 self)
Recently, several algorithms have been proposed to enhance noisy speech by estimating a binary mask that can be used to select those time–frequency regions of a noisy speech signal that contain more speech energy than noise energy. This binary mask encodes the uncertainty associated with enhanced speech in the linear spectral domain. The use of the cepstral transformation smears the information from the noise dominant time–frequency regions across all the cepstral features. We propose a supervised approach using regression trees to learn the nonlinear transformation of the uncertainty from the linear spectral domain to the cepstral domain. This uncertainty is used by a decoder that exploits the variance associated with the enhanced cepstral features to improve robust speech recognition. Systematic evaluations on a subset of the Aurora4 task using the estimated uncertainty show substantial improvement over the baseline performance across various noise conditions. Index Terms—Binary time–frequency mask, computational auditory scene analysis (CASA), robust automatic speech recognition, spectrogram reconstruction, uncertainty decoding.
Transforming Features to Compensate Speech Recogniser Models for Noise
Cited by 5 (4 self)
To make speech recognisers robust to noise, either the features or the models can be compensated. Feature enhancement is often fast; model compensation is often more accurate, because it predicts the corrupted speech distribution. It is therefore able, for example, to take uncertainty about the clean speech into account. This paper re-analyses the recently-proposed predictive linear transformations for noise compensation as minimising the KL divergence between the predicted corrupted speech and the adapted models. New schemes are then introduced which apply observation-dependent transformations in the front-end to adapt the back-end distributions. One applies transforms in the exact same manner as the popular minimum mean square error (MMSE) feature enhancement scheme, and is as fast. The new method performs better on AURORA 2. Index Terms: speech recognition, noise robustness
Covariance Modelling for Noise-Robust Speech Recognition
Cited by 5 (5 self)
Model compensation is a standard way of improving speech recognisers’ robustness to noise. Most model compensation techniques produce diagonal covariances. However, this fails to handle any changes in the feature correlations due to the noise. This paper presents a scheme that allows full-covariance matrices to be estimated. One problem is that full covariance matrix estimation will be more sensitive to approximations, and those for the dynamic parameters are known to be crude. In this paper a linear transformation of a window of consecutive frames is used as the basis for dynamic parameter compensation. A second problem is that the resulting full covariance matrices slow down decoding. This is addressed by using predictive linear transforms that decorrelate the feature space, so that the decoder can then use diagonal covariance matrices. On a noise-corrupted Resource Management task, the proposed scheme outperformed the standard VTS compensation scheme.
Discriminative Classifiers with Generative Kernels for Noise Robust ASR
Cited by 4 (4 self)
Discriminative classifiers are a popular approach to solving classification problems. However, one of the problems with these approaches, in particular kernel-based classifiers such as Support Vector Machines (SVMs), is that they are hard to adapt to mismatches between the training and test data. This paper describes a scheme for overcoming this problem for speech recognition in noise. Generative kernels, defined using generative models, allow SVMs to handle sequence data. By compensating the generative models for the noise conditions, noise-specific generative kernels can be obtained. These can be used to train a noise-independent SVM on a range of noise conditions, which can then be used with a test-set noise kernel for classification. Initial experiments using an idealised version of model-based compensation were run on the AURORA 2.0 continuous digit task. The proposed scheme yielded large gains in performance over the compensated models.

