## Adaptive training with joint uncertainty decoding for robust recognition of noisy data (2007)

Venue: Proceedings of ICASSP, Vol. IV

Citations: 26 (17 self)

### BibTeX

```bibtex
@INPROCEEDINGS{Liao07adaptivetraining,
  author    = {H. Liao and M. J. F. Gales},
  title     = {Adaptive training with joint uncertainty decoding for robust recognition of noisy data},
  booktitle = {Proceedings of ICASSP, Volume IV},
  year      = {2007},
  pages     = {389--392}
}
```

### Abstract

Standard noise compensation techniques for automatic speech recognition assume a clean trained acoustic model. What is thought of as “clean” data may still contain a variety of speakers, different channels and varying noise conditions, so it may be more reasonable to treat such data as multi-condition and suited to multistyle training. This paper shows that multistyle models benefit from VTS compensation or Joint uncertainty decoding, which reduce the mismatch between training and test. An EM-based noise estimation procedure that produces ML VTS or Joint noise models is also described. Alternatively, adaptive training with Joint uncertainty transforms factors the noise out of the data. The uncertainty variance bias de-weights observations in the training data where the SNR is low. This property allows data with a wide SNR range to be used and produces canonical models that truly represent clean speech, whereas multistyle trained models must account for all acoustic variation associated with different noise conditions. This paper presents Joint adaptive training, including formulae for estimating the transforms and canonical model parameters. Experiments are conducted on the…
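The variance-bias property described in the abstract can be sketched with the decoding rule used in the joint uncertainty decoding literature; this form is not given in the excerpt itself, so the exact notation is an assumption. For model component m and a Joint transform associated with regression class r:

```latex
p(\mathbf{o}_t \mid m) \;\approx\; |\mathbf{A}^{(r)}|\,
\mathcal{N}\!\big(\mathbf{A}^{(r)}\mathbf{o}_t + \mathbf{b}^{(r)};\;
\boldsymbol{\mu}_s^{(m)},\;
\boldsymbol{\Sigma}_s^{(m)} + \boldsymbol{\Sigma}_b^{(r)}\big)
```

As the SNR falls, the uncertainty variance bias Σ_b^{(r)} grows, flattening the component likelihoods and thereby de-weighting those frames in the adaptive-training statistics.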

### Citations

145 | A compact model for speaker-adaptive training
- Anastasakos
- 1996
Citation Context: ...acoustic models. Alternatively, adaptive training may be applied to remove these unwanted factors, such as speaker differences or the acoustic environment, from being included in the acoustic models [1, 2]. Rather than force the acoustic model to represent all these factors, as expected in multistyle training, transforms are used instead to model the variation from different factors. MLLR transforms ca...

91 | Speech Recognition in Noisy Environments
- Moreno
- 1996
Citation Context: ...ing VTS, with the noise model iteratively improved using EM to maximise the likelihood of the test condition. Such an EM framework for estimating the additive and convolutional noise was presented in [9], but in this work estimation is conducted in the cepstral domain. Also, a simple iterative 1st-order gradient search is used to find an MLE of the noise variance. Although using the MLE noise model der...

82 | Model-Based Techniques for Noise Robust Speech Recognition
- Gales
- 1995
Citation Context: ...have an explicit model of the noise. Typically the uncompensated “clean” acoustic model M consists of Gaussian components, each defined by a prior c^{(m)}, mean μ_s^{(m)}, and variance Σ_s^{(m)}. PMC [5] and VTS compensation [6] approximate the integrals in equation 1 for each acoustic model component. This assumes that the frame/state alignment of the clean speech does not change with noise. In the ...

80 | HMM adaptation using vector Taylor series for noisy speech recognition
- Acero, Deng, et al.
- 2000
Citation Context: ...of the noise. Typically the uncompensated “clean” acoustic model M consists of Gaussian components, each defined by a prior c^{(m)}, mean μ_s^{(m)}, and variance Σ_s^{(m)}. PMC [5] and VTS compensation [6] approximate the integrals in equation 1 for each acoustic model component. This assumes that the frame/state alignment of the clean speech does not change with noise. In the cepstral domain, the rela...
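The compensation this context refers to can be illustrated with a minimal sketch of the corrupted-speech mismatch function in the log-spectral domain; the paper itself works in the cepstral domain, which would add DCT projections and Taylor-series Jacobians omitted here, and the function name and values below are purely illustrative:

```python
import math

def vts_compensate_mean(mu_s, mu_n):
    """Zeroth-order mean compensation, per log-spectral bin:
    corrupted mean mu_y = mu_s + log(1 + exp(mu_n - mu_s)),
    i.e. log(exp(mu_s) + exp(mu_n)) computed stably with log1p."""
    return [s + math.log1p(math.exp(n - s)) for s, n in zip(mu_s, mu_n)]

# Clean-speech log-energies and a flat noise floor (made-up values)
mu_s = [10.0, 4.0, 1.0]
mu_n = [2.0, 2.0, 2.0]
mu_y = vts_compensate_mean(mu_s, mu_n)
# High-SNR bins are barely shifted; low-SNR bins move towards the noise floor.
```

VTS proper linearises this nonlinearity around the current speech and noise means so that Gaussian parameters can be propagated in closed form.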

36 | Uncertainty decoding for noise robust speech recognition
- Liao
- 2007
Citation Context: ...g the features, e.g. CMN or SPLICE, or modifying the model parameters. The latter approach, often called model-based compensation, tends to give better results than the former, feature-based approach [3]. It is frequently assumed that a noise corrupted speech observation, o_t = [y_t^T Δy_t^T Δ²y_t^T]^T, at time t is conditionally independent of all other observations given the clean speech s_t and the noi...

32 | Joint uncertainty decoding for robust large vocabulary speech recognition
- Liao, Gales
- 2006
Citation Context: ...a simple iterative 1st-order gradient search is used to find an MLE of the noise variance. Although using the MLE noise model derived using VTS compensation may give good results for Joint compensation [10], there is a mismatch between the compensation used during noise estimation and that applied during recognition. Hence, it is sensible to generate ML noise parameters explicitly tuned for Joint compen...

18 | Issues with uncertainty decoding for noise robust automatic speech recognition
- Liao, Gales
Citation Context: ...tion for each component of a front-end GMM representing the observed, corrupted acoustic space; this form however suffers from a fundamental problem and is less efficient than the model-based approach [8]. Previously, the joint distribution was estimated using stereo data [4, 8]. It may be predicted given the clean speech and noise model using VTS or PMC [3, 7], resulting in noise compensating Joint t...

11 | Recent advances in broadcast news transcription
- Kim, Evermann, et al.
Citation Context: ...sitive. Thus, the change of variable ς = log Σ_s^{(m)} is made. The derivatives may be easily recomputed to now optimise ς. 4. EXPERIMENTS A simplified Broadcast News system based on the 2003 CU-HTK system [11] was evaluated. MFCC parameters with the 0th cepstra, and associated 1st- and 2nd-order features for 39 dimensions, were used with cross-word triphones and decision-tree clustered states. There were 16...
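The change of variable ς = log Σ mentioned in this context is a standard trick: optimising ς without constraints guarantees the variance Σ = exp(ς) stays positive. A minimal single-Gaussian sketch of such a 1st-order gradient search (the data, step size and iteration count are made up for illustration):

```python
import math

def ml_variance_via_log_domain(xs, mu, steps=500, lr=0.1):
    """Gradient ascent on the Gaussian log-likelihood w.r.t. zeta = log(var).
    d logL / d zeta = 0.5 * sum((x - mu)^2 / var - 1); positivity of var is
    guaranteed because var = exp(zeta) for any real zeta."""
    zeta = 0.0  # start at var = 1
    n = len(xs)
    for _ in range(steps):
        var = math.exp(zeta)
        grad = 0.5 * sum(((x - mu) ** 2) / var - 1.0 for x in xs)
        zeta += lr * grad / n  # normalised step for stability
    return math.exp(zeta)

xs = [1.8, 2.2, 2.5, 1.5, 2.0]
var = ml_variance_via_log_domain(xs, mu=2.0)
# Converges towards the closed-form ML variance sum((x - mu)^2) / n
```

The fixed point of the update is exactly the closed-form ML variance, while a naive gradient step on Σ itself could step into negative values.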

10 | Acoustic factorisation
- Gales
Citation Context: ...acoustic models. Alternatively, adaptive training may be applied to remove these unwanted factors, such as speaker differences or the acoustic environment, from being included in the acoustic models [1, 2]. Rather than force the acoustic model to represent all these factors, as expected in multistyle training, transforms are used instead to model the variation from different factors. MLLR transforms ca...

9 | Vector Taylor series based joint uncertainty decoding
- Xu, Rigazio, et al.
- 2006
Citation Context: ...el parameters ˇM are single-pass retrained, as in [4], then no noise model is explicitly estimated. ... has shown to be more efficient than PMC [7] and a better approximation than the log-normal [6]; it is still computationally expensive, as every model component must be individually adapted with respect to the noise. This involves the computation...