## Transforming Features to Compensate Speech Recogniser Models for Noise

Citations: 5 (4 self)

### BibTeX

```bibtex
@MISC{Dalen_transformingfeatures,
  author = {R. C. Van Dalen and F. Flego and M. J. F. Gales},
  title  = {Transforming Features to Compensate Speech Recogniser Models for Noise},
  year   = {}
}
```

### Abstract

To make speech recognisers robust to noise, either the features or the models can be compensated. Feature enhancement is often fast; model compensation is often more accurate, because it predicts the corrupted-speech distribution and is therefore able, for example, to take uncertainty about the clean speech into account. This paper re-analyses the recently proposed predictive linear transformations for noise compensation as minimising the KL divergence between the predicted corrupted speech and the adapted models. New schemes are then introduced which apply observation-dependent transformations in the front-end to adapt the back-end distributions. One applies transforms in exactly the same manner as the popular minimum mean square error (MMSE) feature enhancement scheme, and is equally fast. The new method performs better on AURORA 2.

Index Terms: speech recognition, noise robustness
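The abstract's framing of transform estimation as KL-divergence minimisation can be illustrated with the closed-form divergence between diagonal-covariance Gaussians. This is a minimal sketch with hypothetical numbers, not code or values from the paper:

```python
import numpy as np

def kl_diag_gauss(mu0, var0, mu1, var1):
    """KL( N(mu0, diag(var0)) || N(mu1, diag(var1)) ), closed form."""
    return 0.5 * np.sum(np.log(var1 / var0)
                        + (var0 + (mu0 - mu1) ** 2) / var1 - 1.0)

# Hypothetical predicted corrupted-speech statistics.
mu_p, var_p = np.array([1.0, -0.5]), np.array([2.0, 0.5])

# Matching the predicted mean and variance drives the divergence to zero,
# which is why KL-based fitting amounts to matching predicted statistics.
assert kl_diag_gauss(mu_p, var_p, mu_p, var_p) == 0.0
assert kl_diag_gauss(mu_p, var_p, mu_p + 0.1, var_p) > 0.0
```

The asymmetry of KL matters here: minimising the divergence *from* the predicted distribution *to* the adapted model is what yields the moment-matching behaviour shown above.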

### Citations

406 | Maximum Likelihood Linear Transformations For HMM-Based Speech Recognition
- Gales
- 1998
Citation Context: ...mation methods as fast as MMSE, but of an entirely different nature. Based on predictive CMLLR, they aim to minimise the KL divergence to the corrupted speech distributions. Adaptive (standard) CMLLR [6] is an instance of a general method of adapting a speech recogniser: applying linear transformations to the component distributions. The transformations... (This work was partly funded by Toshiba Research...)
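The CMLLR idea referenced in this excerpt, i.e. applying a linear transformation to the features while keeping the likelihood a proper density, can be sketched as follows. The function names and parameter values are hypothetical illustrations, not the paper's implementation:

```python
import numpy as np

def log_gauss(x, mu, var):
    """Log density of a diagonal-covariance Gaussian."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var)

def cmllr_log_likelihood(y, A, b, mu, var):
    """CMLLR-style scoring: transform the observation with (A, b) and add
    log|det A| so the result is a valid density over the original features."""
    x_hat = A @ y + b
    _, logdet = np.linalg.slogdet(A)
    return log_gauss(x_hat, mu, var) + logdet

# Hypothetical 2-D component and transform.
mu, var = np.zeros(2), np.ones(2)
A = np.array([[0.9, 0.1], [0.0, 1.1]])
b = np.array([0.2, -0.1])
y = np.array([0.5, 0.3])
ll = cmllr_log_likelihood(y, A, b, mu, var)
```

With the identity transform and zero bias this reduces exactly to the unadapted component likelihood, which is the sanity check one would expect of any feature-space adaptation scheme.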

343 | The aurora experimental framework for the performance evaluation of speech recognition systems under noisy conditions
- Hirsch, Pearce
- 2000
Citation Context: ...Post-processing of the transformed observation as if it represented the clean speech is therefore not possible. 5. Results: The performance of the proposed schemes was evaluated on the AURORA 2 task [13]. AURORA 2 is a small-vocabulary digit-string recognition task, based on the TIDIGITS database with noise artificially added. The clean training data comprises 8440 utterances from 55 male and 55 fema...

38 | Uncertainty decoding with SPLICE for noise robust speech recognition
- Droppo, Acero, et al.
- 2002
Citation Context: ...sending multiple observations to the back-end models at once [2]. Other schemes aim to match model parameters to the expected distribution. They adapt model covariances depending on the observation [3, 4], though this can have issues [5]. Both these approaches are computationally more expensive than just feature transformations, which is what the popular minimum mean square error (MMSE) feature enhanc...

36 | Uncertainty decoding for noise robust speech recognition
- Liao
- 2007
Citation Context: ...x̂_mmse; µ_x^(m), Σ_x^(m)). (7) Model-based noise-robustness methods are predictive: they predict the distribution of the corrupted speech per component. For example, joint uncertainty decoding (JUD) [11] associates every component m of the speech recogniser HMM with one base class n, whose distribution is approximated by front-end component n. It finds compensation for all components in a base class...

32 | Joint uncertainty decoding for robust large vocabulary speech recognition
- Liao, Gales
- 2006
Citation Context: ...An initial estimate of the additive noise for each utterance was obtained using the first and last 20 frames. This was then re-estimated for every utterance using the ML VTS-based scheme described in [14], applied on the base class level. This noise model was then used to VTS-compensate the 64-component clean front-end GMM, producing the joint distribution (3). This scheme for estimating the joint d...

19 | Predictive linear transforms for noise robust speech recognition
- Gales, Dalen
Citation Context: ...ely represent the distribution of the noise-corrupted speech. Recently, schemes have been proposed that train “predictive” linear transformations from statistics predicted by noise-compensated models [1]. This paper gives a more principled analysis of these predictive linear transforms as minimising the Kullback-Leibler (KL) divergence between the predicted distributions and the models effectively use...

18 | Issues with uncertainty decoding for noise robust automatic speech recognition
- Liao, Gales
Citation Context: ...he back-end models at once [2]. Other schemes aim to match model parameters to the expected distribution. They adapt model covariances depending on the observation [3, 4], though this can have issues [5]. Both these approaches are computationally more expensive than just feature transformations, which is what the popular minimum mean square error (MMSE) feature enhancement scheme uses. However, MMSE ...

17 | A minimum mean square error approach for speech enhancement
- Ephraim
- 1990
Citation Context: ...Σ_yx^(n) Σ_x^(n)⁻¹ (x − µ_x^(n)), Σ_y^(n) − Σ_yx^(n) Σ_x^(n)⁻¹ Σ_xy^(n)). (4) A standard approach to reconstruct a clean speech feature vector x̂_t from the observation y_t is to find the minimum mean square error (MMSE) estimate [10]. This uses the mean of (4), with x and y swapped around: x̂_t = E{x | y_t} = Σ_n P(n | y_t) E{x | n, y_t} = Σ_n P(n | y_t) (µ_x^(n) + Σ_xy^(n) Σ_y^(n)⁻¹ (y_t − ...
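The posterior-weighted conditional mean in this excerpt can be sketched for the diagonal-covariance case, where each dimension reduces to scalar algebra. This is an illustrative reconstruction of the generic GMM-based MMSE estimate, not the paper's code; all names, shapes, and values are assumptions:

```python
import numpy as np

def mmse_estimate(y, weights, mu_x, mu_y, var_y, cov_xy):
    """MMSE clean-speech estimate from a joint front-end GMM:
    x_hat = sum_n P(n|y) * ( mu_x[n] + cov_xy[n]/var_y[n] * (y - mu_y[n]) ),
    with diagonal covariances (arrays of shape [components, dims])."""
    # Log posterior of each front-end component given the noisy observation y.
    log_p = -0.5 * np.sum(np.log(2 * np.pi * var_y)
                          + (y - mu_y) ** 2 / var_y, axis=1)
    log_p += np.log(weights)
    post = np.exp(log_p - log_p.max())
    post /= post.sum()                                # P(n | y)
    cond_mean = mu_x + cov_xy / var_y * (y - mu_y)    # E{x | n, y} per component
    return post @ cond_mean

# Hypothetical single-component check: with zero cross-covariance the
# observation carries no information and the estimate is the clean prior mean.
est = mmse_estimate(np.array([0.5, -0.5]), np.array([1.0]),
                    np.array([[1.0, 2.0]]), np.zeros((1, 2)),
                    np.ones((1, 2)), np.zeros((1, 2)))
```

Note that, as the excerpt's algebra shows, each component contributes an affine function of y_t, which is why the estimate can also be written as a posterior-weighted sum of per-component linear transforms.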

14 | Using observation uncertainty
- Arrowood, Clements
- 2002
Citation Context: ...sending multiple observations to the back-end models at once [2]. Other schemes aim to match model parameters to the expected distribution. They adapt model covariances depending on the observation [3, 4], though this can have issues [5]. Both these approaches are computationally more expensive than just feature transformations, which is what the popular minimum mean square error (MMSE) feature enhanc...

12 | Extended vts for noise-robust speech recognition
- Dalen, Gales
- 2009
Citation Context: ...for each clean speech component n. This paper uses standard VTS, which produces diagonal matrices Σ_xy^(n), Σ_y^(n) in (3). It is also possible to estimate full covariance matrices with extended VTS [9]. In all the schemes that will be discussed, the most computationally expensive part of the estimation is the VTS compensation, to be precise, computing the derivative of the corrupted speech with rel...

9 | Vector Taylor series based joint uncertainty decoding
- Xu, Rigazio, et al.
- 2006
Citation Context: ...his front-end distribution is found from a clean speech GMM (from the training data) and a noise model (estimated e.g. per utterance). A variant of first-order vector Taylor series (VTS) compensation [7, 8] yields the distributions in (3) for each clean speech component n. This paper uses standard VTS, which produces diagonal matrices Σ_xy^(n), Σ_y^(n) in (3). It is also possible to estimate full covar...
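First-order VTS compensation of the kind referenced here is often written, in the log-spectral domain, with the mismatch function y = x + log(1 + exp(n − x)), linearised at the clean-speech and noise means. The sketch below is a deliberately simplified schematic of that technique (it ignores convolutional noise and the cepstral transform, and the variable names are assumptions):

```python
import numpy as np

def vts_compensate(mu_x, var_x, mu_n, var_n):
    """First-order VTS in the log-spectral domain (simplified sketch):
    y = x + log(1 + exp(n - x)), expanded around (mu_x, mu_n)."""
    g = np.log1p(np.exp(mu_n - mu_x))       # mismatch term at the expansion point
    mu_y = mu_x + g
    # Jacobian dy/dx at the expansion point; dy/dn = 1 - dy/dx.
    J = 1.0 / (1.0 + np.exp(mu_n - mu_x))
    var_y = J ** 2 * var_x + (1.0 - J) ** 2 * var_n
    cov_xy = J * var_x                      # diagonal cross-covariance, as in (3)
    return mu_y, var_y, cov_xy
```

In the quiet limit (noise mean far below the speech mean) the Jacobian tends to 1, so the corrupted-speech statistics collapse back to the clean ones, which is a useful sanity check on the linearisation.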

8 | Accounting for the uncertainty of speech estimates in the context of model-based feature enhancement
- Stouten, Hamme, et al.
Citation Context: ...on schemes for noise-robustness that integrate knowledge about observation uncertainty have been proposed. Some use a number of transforms, sending multiple observations to the back-end models at once [2]. Other schemes aim to match model parameters to the expected distribution. They adapt model covariances depending on the observation [3, 4], though this can have issues [5]. Both these approaches are...

7 | Joint removal of additive and convolutional noise with model-based feature enhancement
- Stouten, Hamme, et al.
- 2004
Citation Context: ...his front-end distribution is found from a clean speech GMM (from the training data) and a noise model (estimated e.g. per utterance). A variant of first-order vector Taylor series (VTS) compensation [7, 8] yields the distributions in (3) for each clean speech component n. This paper uses standard VTS, which produces diagonal matrices Σ_xy^(n), Σ_y^(n) in (3). It is also possible to estimate full covar...

7 | Incremental predictive and adaptive noise compensation
- Flego, Gales
- 2009
Citation Context: ...e expressed as E{y_i | m} = (µ_xi^(m) − b^(n)) / a^(n). (14) The statistics then become very fast to find online, because many of them can be cached. To save space, this is not derived here, but see [1, 12] for details. 4. Component-independent PCMLLR: The form of decoding with PCMLLR is the same as the one for CMLLR, in (2). Therefore, which transformation A^(n) is chosen from the set of trans...