## EXTENDED VTS FOR NOISE-ROBUST SPEECH RECOGNITION

Citations: 12 (10 self)

### BibTeX

@MISC{Dalen_extendedvts,

author = {R. C. Van Dalen and M. J. F. Gales},

title = {EXTENDED VTS FOR NOISE-ROBUST SPEECH RECOGNITION},

year = {}

}

### Abstract

Model compensation is a standard way of improving speech recognisers’ robustness to noise. Currently popular schemes are based on vector Taylor series (VTS) compensation. They often use the continuous time approximation to compensate dynamic parameters. In this paper, the accuracy of dynamic parameter compensation is improved by representing the dynamic features as a linear transformation of a window of static features. A modified version of VTS compensation is applied to the distribution of the window of static features and, importantly, their correlations. These compensated distributions are then transformed to standard static and dynamic distributions. The proposed scheme outperformed the standard VTS scheme by about 10% relative.

Index Terms: Speech recognition, acoustic noise, robustness

### Citations

343 | The aurora experimental framework for the performance evaluation of speech recognition systems under noisy conditions - Hirsch, Pearce - 2000 |

256 | Unscented filtering and nonlinear estimation - Julier, Uhlmann - 2004 |

172 | Speaker-independent isolated word recognition using dynamic features of speech spectrum - Furui - 1986 |

169 | Environmental robustness in automatic speech recognition - Acero, Stern - 1990 |

123 | The DARPA 1000-Word Resource Management Database for Continuous Speech Recognition - Price, Fisher, et al. - 1988 |

90 | Speech Recognition in Noisy Environments - Moreno - 1996 |

Citation Context: ...ons are required. In this work, the noise model gives the distributions of $n$ and $h$. $n$ (including the dynamic parameters) is assumed Gaussian with mean $\mu_n$ and covariance $\Sigma_n$; $h = \mu_h$ is assumed constant [3]. These distributions can be estimated using maximum-likelihood estimation and some data from the testing noise condition [4]. 2.1. Vector Taylor series. Equation (1) can be approximated with a first-o...

84 | Assessment for Automatic Speech Recognition: II. NOISEX-92: A Database and an Experiment to Study the Effect of Additive Noise - Varga, Steeneken - 1993 |

81 | Model-based techniques for noise robust speech recognition - Gales - 1995 |

Citation Context: ...τ $- \, y^{s}_{t-\tau}\big) \big/ \big(2\sum_{\tau=1}^{w}\tau^{2}\big) \approx \frac{\partial y^{s}}{\partial t}\Big\vert_{t}$ (6); $\mu^{\Delta}_{y} = J_x \mu^{\Delta}_{x}$; $\Sigma^{\Delta}_{y} = \operatorname{diag}\big(J_x \Sigma^{\Delta}_{x} J_x^{T} + J_n \Sigma^{\Delta}_{n} J_n^{T}\big)$ (7). 2.2. Data-driven parallel model combination. Data-driven parallel model combination [6] (DPMC) is a Monte Carlo method for estimating the distribution of the corrupted speech. Samples are drawn from the distributions of $x^{s}$ and $n^{s}$. Equation (1) then gives the value of $y^{s}$ for each sample. The ...
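As a rough illustration of the Monte Carlo procedure described in this context, the sketch below draws samples of clean speech and noise and passes each pair through a mismatch function. Equation (1) is not reproduced in this excerpt, so the scalar log-spectral form `y = log(exp(x + h) + exp(n))` used here is an assumption, as are the names `mismatch` and `dpmc_moments`, the sample count, and the fixed seed.

```python
import math
import random


def mismatch(x, n, h=0.0):
    """Assumed scalar mismatch function: y = log(exp(x + h) + exp(n)).

    Written in the max-shifted form so large inputs do not overflow.
    """
    a, b = x + h, n
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))


def dpmc_moments(mu_x, var_x, mu_n, var_n, h=0.0, num_samples=20000, seed=0):
    """Estimate the mean and variance of corrupted speech by sampling.

    Clean speech x and noise n are drawn from independent Gaussians;
    each pair is mapped to a corrupted-speech sample y.
    """
    rng = random.Random(seed)  # fixed seed: illustrative choice only
    total = 0.0
    total_sq = 0.0
    for _ in range(num_samples):
        x = rng.gauss(mu_x, math.sqrt(var_x))
        n = rng.gauss(mu_n, math.sqrt(var_n))
        y = mismatch(x, n, h)
        total += y
        total_sq += y * y
    mean = total / num_samples
    var = total_sq / num_samples - mean * mean
    return mean, var


if __name__ == "__main__":
    print(dpmc_moments(mu_x=0.0, var_x=0.01, mu_n=0.0, var_n=0.01))
```

Because every component needs its own set of samples, the cost of this kind of estimate grows with the sample count, which is the trade-off against a closed-form approximation such as VTS.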

79 | HMM adaptation using vector Taylor series for noisy speech recognition - Acero, Deng, et al. - 2000 |

Citation Context: ...at $\mu^{s}_{n}$, $\mu^{s}_{x}$, $\mu^{s}_{h}$, (1) becomes $y^{s}_{t} \approx f(\mu^{s}_{x}, \mu^{s}_{n}, \mu^{s}_{h}) + J_x (x^{s}_{t} - \mu^{s}_{x}) + J_n (n^{s}_{t} - \mu^{s}_{n})$ (3), with $J_x = \partial y^{s} / \partial x^{s}$; $J_n = \partial y^{s} / \partial n^{s}$ (4). The corrupted static mean and covariance become [5] $\mu^{s}_{y} = f(\mu^{s}_{x}, \mu^{s}_{n}, \mu^{s}_{h})$ (5a); $\Sigma^{s}_{y} = \operatorname{diag}\big(J_x \Sigma^{s}_{x} J_x^{T} + J_n \Sigma^{s}_{n} J_n^{T}\big)$ (5b). To compensate dynamic parameters, the continuous time approximation [1] is often used in conjunction with VT...
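The first-order compensation in (3) to (5b), together with the continuous time approximation for the dynamic parameters, can be sketched for a single scalar feature. This is only a sketch under stated assumptions: the mismatch function `y = log(exp(x + h) + exp(n))` is assumed because equation (1) is not shown in this excerpt, and the function names (`vts_jacobians`, `vts_static`, `cta_delta`) are hypothetical.

```python
import math


def mismatch(x, n, h=0.0):
    """Assumed scalar mismatch function: y = log(exp(x + h) + exp(n))."""
    a, b = x + h, n
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))


def vts_jacobians(mu_x, mu_n, mu_h=0.0):
    """Partial derivatives of the assumed mismatch at the expansion point.

    For this mismatch, dy/dx = exp(x + h - y) and dy/dn = 1 - dy/dx.
    """
    y = mismatch(mu_x, mu_n, mu_h)
    j_x = math.exp(mu_x + mu_h - y)
    return j_x, 1.0 - j_x


def vts_static(mu_x, var_x, mu_n, var_n, mu_h=0.0):
    """Scalar analogue of (5a) and (5b): compensated static mean and variance."""
    j_x, j_n = vts_jacobians(mu_x, mu_n, mu_h)
    mean_y = mismatch(mu_x, mu_n, mu_h)
    var_y = j_x * j_x * var_x + j_n * j_n * var_n
    return mean_y, var_y


def cta_delta(mu_dx, var_dx, var_dn, j_x, j_n):
    """Scalar analogue of (7): continuous time approximation for deltas.

    The delta statistics reuse the Jacobians computed for the statics.
    """
    return j_x * mu_dx, j_x * j_x * var_dx + j_n * j_n * var_dn


if __name__ == "__main__":
    j_x, j_n = vts_jacobians(1.0, 1.0)
    print(vts_static(1.0, 0.2, 1.0, 0.4))
    print(cta_delta(2.0, 0.1, 0.3, j_x, j_n))
```

In the scalar case the `diag` operation is trivial; with vector features the same expressions use Jacobian matrices and keep only the diagonal of the resulting covariance.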

45 | The HTK book (for HTK version 3.4) - Young, Ollason, et al. - 2009 |

36 | Uncertainty decoding for noise robust speech recognition - Liao - 2007 |

Citation Context: ...as a 18 dB average signal-to-noise ratio. Noise compensation was applied to a speech recogniser trained on clean data from the Wall Street Journal corpus. The system was based on the one described in [10], but the number of states was reduced to about 650, more appropriate for an embedded system. The number of components was about 7800. The language model was an open digit loop. VTS Word error rat...

32 | Joint uncertainty decoding for robust large vocabulary speech recognition - Liao, Gales - 2006 |

Citation Context: ...ssumed Gaussian with mean $\mu_n$ and covariance $\Sigma_n$; $h = \mu_h$ is assumed constant [3]. These distributions can be estimated using maximum-likelihood estimation and some data from the testing noise condition [4]. 2.1. Vector Taylor series. Equation (1) can be approximated with a first-order vector Taylor series (VTS) [3]. Evaluating the partial derivatives of $f$ at $\mu^{s}_{n}$, $\mu^{s}_{x}$, $\mu^{s}_{h}$, (1) becomes $y^{s}_{t} \approx$...

27 | Speech recognition in noisy environments using first-order vector Taylor series - Kim, Un, et al. - 1998 |

26 | Adaptive Training with Joint Uncertainty Decoding for Robust Recognition of Noisy Data - Liao, Gales - 2007 |

25 | High-performance HMM adaptation with joint compensation of additive and convolutive distortions via vector Taylor series - Li - 2007 |

Citation Context: ...ions, the computational cost is much greater than for VTS. 3. EXTENDED VTS. The continuous time approximation does not yield accurate compensation. In some cases, performance decreases when it is used [7]. This work uses an alternative method, the key intuition to which is the following. Since dynamic coefficients are a linear combination of consecutive static feature vectors, a distribution over dyna...
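The intuition in this context can be made concrete with a small sketch: if a window of static features is Gaussian, any linear transformation of it is Gaussian too, with mean `D mu` and covariance `D Sigma D^T`. The example below is a hedged illustration for one scalar feature and a window of three frames; the matrix `D_W1` (static row plus a delta row with weights -0.5, 0, 0.5) and the function name `transform_gaussian` are assumptions chosen for the example, not taken from the paper.

```python
def transform_gaussian(D, mu, sigma):
    """Return the mean D*mu and covariance D*sigma*D^T of a linear map.

    D is a list of rows, mu a list, and sigma a square list of lists.
    """
    rows, cols = len(D), len(mu)
    mean = [sum(D[i][k] * mu[k] for k in range(cols)) for i in range(rows)]
    # Intermediate product D * sigma.
    ds = [[sum(D[i][k] * sigma[k][j] for k in range(cols)) for j in range(cols)]
          for i in range(rows)]
    # Final product (D * sigma) * D^T.
    cov = [[sum(ds[i][k] * D[j][k] for k in range(cols)) for j in range(rows)]
           for i in range(rows)]
    return mean, cov


# Hypothetical projection for a window of three frames [y_{t-1}, y_t, y_{t+1}]:
# first row picks the static y_t, second row forms a simple delta.
D_W1 = [[0.0, 1.0, 0.0],
        [-0.5, 0.0, 0.5]]

if __name__ == "__main__":
    mu = [1.0, 2.0, 3.0]
    independent = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    fully_correlated = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]
    print(transform_gaussian(D_W1, mu, independent))
    print(transform_gaussian(D_W1, mu, fully_correlated))
```

Comparing the two covariance matrices shows why correlations between time instances matter: the delta variance depends strongly on how the frames in the window co-vary, even when each frame has the same variance.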

19 | Predictive linear transforms for noise robust speech recognition - Gales, Dalen |

Citation Context: ... However, this yields lower gains than going from VTS to eVTS, and decoding with full covariances is computationally expensive, though joint uncertainty decoding and predictive linear transformations [9] can be used. Therefore, this paper concentrates on diagonal covariance compensation. 4.2. Toshiba In-car corpus. Initial experiments were run on a task with real recorded noise: the Toshiba in-car dat...

15 | Robust speech recognition in noise - performance of the IBM continuous speech recognizer on the ARPA noise spoke task - Gopinath - 1995 |

Citation Context: ...a-delta coefficients, are appended to the static features to form the feature vector. The standard approach to compensate the associated dynamic parameters is to use the continuous time approximation [1]. It assumes that the dynamic coefficients are the time derivatives of the statics. The form of compensation for the dynamic parameters is then closely related to the static parameters. In previous wo...

15 | Noise adaptive training using a vector Taylor series approach for noise robust automatic speech recognition - Kalinli, Seltzer, et al. - 2009 |

9 | Vector Taylor series based joint uncertainty decoding - Xu, Rigazio, et al. - 2006 |

5 | Bayesian feature enhancement using a mixture of unscented transformations for uncertainty decoding of noisy speech - Shinohara, Akamine |

4 | Covariance modelling for noise robust speech recognition - Dalen, Gales |

Citation Context: ... proposed to improve the dynamic parameter compensation [2]. The dynamic coefficients can be expressed as a linear transformation over a window of static feature coefficients. The distribution over this “extended” feature vector is then computed. By linearly ... (Footnote in the source: Rogier van Dalen is funded by Toshiba Research Europe Ltd. Thanks go to Dr F. Flego and H. Liao for initial code and experiment configurations.)
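To illustrate how dynamic coefficients can be written as a linear transformation over a window of statics, the sketch below builds a projection matrix that maps a stacked window of frames to the static and delta features of the centre frame. It assumes the common regression form of the delta, with weight `tau / (2 * sum(tau^2))` on the frame at offset `tau`, and it covers first-order deltas only. The function names and the stacking order are hypothetical.

```python
def delta_weights(w):
    """Regression weights for offsets -w..w (assumed standard delta form)."""
    norm = 2.0 * sum(tau * tau for tau in range(1, w + 1))
    return [tau / norm for tau in range(-w, w + 1)]


def projection_matrix(w, dim):
    """Build D so that D * [y_{t-w}; ...; y_{t+w}] = [y_t; delta y_t].

    Each frame has `dim` static coefficients; frames are stacked oldest first.
    """
    weights = delta_weights(w)
    width = (2 * w + 1) * dim
    D = [[0.0] * width for _ in range(2 * dim)]
    for i in range(dim):
        D[i][w * dim + i] = 1.0  # static rows copy the centre frame
        for k, weight in enumerate(weights):
            D[dim + i][k * dim + i] = weight  # delta rows weight each frame
    return D


def apply_projection(D, stacked):
    """Multiply the projection matrix by a stacked window of statics."""
    return [sum(a * b for a, b in zip(row, stacked)) for row in D]


if __name__ == "__main__":
    w, dim = 2, 2
    # Two coefficients per frame that change linearly with time t.
    frames = [[3.0 * t, 1.0 - t] for t in range(-w, w + 1)]
    stacked = [value for frame in frames for value in frame]
    print(apply_projection(projection_matrix(w, dim), stacked))
```

For features that vary linearly over the window, the delta rows recover the slope of each coefficient, which is a quick sanity check on the weights.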

3 | Statistical adaptation of acoustic models to noise conditions for robust speech recognition - Torre, Fohr, et al. |

Citation Context: ...directly and the static elements duplicated for each time instance. A related scheme that also attempts to improve compensation for dynamic parameters, but in the log-spectral domain, is described in [8]. However, a large number of approximations were made to derive the VTS form, including ignoring correlations between time instances and parameter differences between time instances. 4. EXPERIMENTS. Th...