## Speaker Adaptive Modeling by Vocal Tract Normalization (2002)

Venue: IEEE Trans. on Speech and Audio Processing

Citations: 29 (1 self)

### BibTeX

@ARTICLE{Welling02speakeradaptive,
  author  = {Lutz Welling and Hermann Ney and Stephan Kanthak},
  title   = {Speaker Adaptive Modeling by Vocal Tract Normalization},
  journal = {IEEE Trans. on Speech and Audio Processing},
  year    = {2002},
  volume  = {10},
  pages   = {415--426}
}

### Abstract

This paper presents methods for speaker adaptive modeling using vocal tract normalization (VTN), along with experimental tests on three databases. We propose a new training method for VTN: by using single-density acoustic models per HMM state to select the scale factor of the frequency axis, we avoid the problem that a mixture density tends to learn the scale factors of the training speakers and thus cannot be used for selecting the scale factor. We show that using single Gaussian densities for selecting the scale factor in training results in lower error rates than using mixture densities.
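The selection step described in the abstract can be sketched as a grid search: warp the speaker's features with each candidate scale factor and keep the factor with the highest single-Gaussian log-likelihood. The function names, the diagonal-Gaussian scorer, and the collapse to one global Gaussian (rather than one density per HMM state) are simplifying assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def log_gauss_diag(x, mean, var):
    """Log-density of a diagonal Gaussian, evaluated at each frame x[t]."""
    return -0.5 * np.sum(
        np.log(2 * np.pi * var) + (x - mean) ** 2 / var, axis=-1
    )

def select_warp_factor(frames_by_alpha, mean, var):
    """Pick the frequency-axis scale factor alpha maximizing the
    likelihood of the speaker's pre-warped frames under a single
    Gaussian (toy stand-in for the single-density-per-state models)."""
    best_alpha, best_ll = None, -np.inf
    for alpha, frames in frames_by_alpha.items():
        ll = log_gauss_diag(frames, mean, var).sum()
        if ll > best_ll:
            best_alpha, best_ll = alpha, ll
    return best_alpha

# Toy usage: feature copies pre-warped on a grid of candidate alphas.
rng = np.random.default_rng(0)
mean, var = np.zeros(4), np.ones(4)
grid = {a: rng.normal(0.1 * (a - 1.0), 1.0, size=(50, 4))
        for a in (0.88, 0.94, 1.00, 1.06, 1.12)}
alpha = select_warp_factor(grid, mean, var)
```

In practice the warped feature copies would come from rescaling the frequency axis in the filterbank front-end; here they are random placeholders.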

### Citations

592 | Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models
- Leggetter, Woodland
- 1995
Citation Context: ...f such a speaker-specific parameter is the scale factor of the frequency axis as used in VTN. As we will discuss later, also the widely used MLLR approach (MLLR = maximum likelihood linear regression) [13] fits into this framework, where the single parameter has to be extended to a matrix of parameters. In addition to speaker adaptive modeling, the same method can be used to describe other types of a...

145 | A compact model for speaker-adaptive training
- Anastasakos
- 1996
Citation Context: ...on) [13, 9]. In MLLR, the parameter set η defines a transformation matrix A(η) that is applied to the mean vector of Gaussian distributions: μ → A(η)μ (15), possibly with an additive offset. In [2], the training problem of such matrix transformations as formulated by Eq. (12) is referred to as speaker-adaptive training (SAT). ...transformation of the observations X: This can be formulated as a ma...
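The mean update in Eq. (15) of this snippet can be illustrated in a few lines. The shapes, the helper name, and the toy transform below are assumptions for illustration, not the estimation procedure from the cited papers:

```python
import numpy as np

def mllr_adapt_means(means, A, b=None):
    """MLLR-style mean update: mu -> A @ mu (+ b), one shared
    speaker-dependent transform applied to every Gaussian mean."""
    adapted = means @ A.T            # apply A to each row (a mean vector)
    if b is not None:
        adapted = adapted + b        # optional additive offset
    return adapted

means = np.eye(3)                    # three toy 3-dimensional means
A = 2.0 * np.eye(3)                  # toy transform: pure scaling
b = np.full(3, 0.5)
out = mllr_adapt_means(means, A, b)
```

Estimating A and b by maximum likelihood over adaptation data is the substance of MLLR; the sketch only shows how the fitted transform is applied.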

106 | A Maximum-Likelihood Approach to Stochastic Matching for Robust Speech Recognition
- Sankar, Lee
Citation Context: ...er to use VTN appropriately, it is useful to study it in the framework of speaker adaptive modeling. Speaker adaptive modeling can be viewed as a special case of stochastic matching as presented in [17]. 2.1 Principles. In speech recognition, we have a sequence of acoustic vectors or observations over time t = 1, ..., T: X = x_1 ... x_t ... x_T (1). In conventional, i.e. non-adaptive, acoustic modeling...

78 | A parametric approach to vocal tract length normalization
- Eide, Gish
- 1996
Citation Context: ...speaker adaptive modeling and training. EDICS: 1-RECO (speech recognition), 1-SPRD (speech production and perception). 1 Introduction. This paper deals with methods for vocal tract normalization (VTN) [7, 11, 12, 16, 18, 19]. The main idea of VTN is to normalize the frequency axis of the acoustic vectors for each speaker in the recognition process and thus to remove speaker-dependent variability in the acoustic vectors. ...

75 | Structural maximum a posteriori linear regression for fast HMM adaptation
- Siohan, Myrvoll, et al.
- 2002
Citation Context: ...raining from normalized observations X_α. As usual, the prior distribution p(θ|W) is assumed to be uniform. Only recently, there have been some studies that make explicit use of the prior distribution [5, 6]. In the following, we will focus on methods for training and recognition with VTN. Since we will use different types of acoustic models with different model parameters, we have summarized the differen...

55 | A frequency warping approach to speaker normalization
- Lee
- 1996
Citation Context: ...id the cost of the full optimization. In this paper, we will distinguish and use two variants, namely a baseline and an improved two-pass strategy. The baseline two-pass strategy has been proposed in [11, 12] and works as follows: 1. A first recognition pass with non-normalized acoustic vectors X and a normalized acoustic model θ̃ produces a preliminary transcription W̃: W̃ = argmax_W { p(W) p(X|W; ...
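The baseline two-pass strategy quoted in this context can be sketched with stand-in stubs (the recognizer, warping, and scoring functions below are toy placeholders, not a real ASR system): pass 1 decodes unwarped features, the preliminary transcript guides warp-factor selection, and pass 2 decodes the warped features.

```python
def two_pass_decode(frames, recognize, warp, alphas, score):
    """Baseline two-pass VTN: preliminary decode, warp-factor selection
    given that transcript, then a second decode on normalized frames."""
    prelim = recognize(frames)                        # pass 1
    alpha = max(alphas, key=lambda a: score(warp(frames, a), prelim))
    return recognize(warp(frames, alpha)), alpha      # pass 2

# Toy usage with placeholder components.
frames = [1.0, 2.0, 3.0]
recognize = lambda fr: "word" if sum(fr) > 0 else ""
warp = lambda fr, a: [a * x for x in fr]              # fake "warping"
score = lambda fr, w: -abs(sum(fr) - 6.0)             # toy likelihood proxy
best, alpha = two_pass_decode(frames, recognize, warp,
                              (0.9, 1.0, 1.1), score)
```

The improved variant the paper compares against changes which model scores the warped frames; the control flow stays the same.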

55 | Speaker normalization on conversational telephone speech
- Wegmann, McAllaster, et al.
- 1996
Citation Context: ...speaker adaptive modeling and training. EDICS: 1-RECO (speech recognition), 1-SPRD (speech production and perception). 1 Introduction. This paper deals with methods for vocal tract normalization (VTN) [7, 11, 12, 16, 18, 19]. The main idea of VTN is to normalize the frequency axis of the acoustic vectors for each speaker in the recognition process and thus to remove speaker-dependent variability in the acoustic vectors. ...

34 | The RWTH Large Vocabulary Continuous Speech Recognition System
- Ney, Welling, et al.
- 1998
Citation Context: ...model consisting of a single Gaussian density per HMM state is estimated from the non-normalized acoustic vectors X_r of all training speakers r, with r = 1, ..., R, by maximum likelihood training [15]: θ̂ = argmax_{θ'} ∏_{r=1}^{R} p(X_r|W_r; θ') (19). 2. For each training speaker r, a scale factor α_r is chosen as the scale factor which results in the maximum likelihood for the training data of...
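The two training steps quoted in this context can be sketched as an alternating loop: pool all speakers' currently-selected warps to re-estimate the model (Eq. 19), then re-select each speaker's scale factor by maximum likelihood. One global diagonal Gaussian instead of per-state models, and pre-warped feature copies per candidate alpha, are assumed simplifications.

```python
import numpy as np

def vtn_train(frames_by_speaker_alpha, alphas, iters=3):
    """Alternate single-Gaussian ML estimation with per-speaker
    warp-factor selection (toy version of the quoted procedure)."""
    choice = {r: 1.0 for r in frames_by_speaker_alpha}  # start unwarped
    for _ in range(iters):
        data = np.vstack([frames_by_speaker_alpha[r][choice[r]]
                          for r in frames_by_speaker_alpha])
        mean, var = data.mean(0), data.var(0) + 1e-6    # ML estimate
        for r, per_alpha in frames_by_speaker_alpha.items():
            def ll(a):
                x = per_alpha[a]
                return -0.5 * np.sum(np.log(2 * np.pi * var)
                                     + (x - mean) ** 2 / var)
            choice[r] = max(alphas, key=ll)             # per-speaker alpha
    return mean, var, choice

# Toy usage: three speakers, pre-warped copies per candidate alpha.
rng = np.random.default_rng(1)
alphas = (0.9, 1.0, 1.1)
speakers = {}
for r in range(3):
    base = rng.normal(0.0, 1.0, size=(40, 2))
    speakers[r] = {a: a * base for a in alphas}         # fake "warping"
mean, var, choice = vtn_train(speakers, alphas)
```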

27 | Normalization of vowels by vocal-tract length and its application to vowel identification
- Wakita
- 1977
Citation Context: ...speaker adaptive modeling and training. EDICS: 1-RECO (speech recognition), 1-SPRD (speech production and perception). 1 Introduction. This paper deals with methods for vocal tract normalization (VTN) [7, 11, 12, 16, 18, 19]. The main idea of VTN is to normalize the frequency axis of the acoustic vectors for each speaker in the recognition process and thus to remove speaker-dependent variability in the acoustic vectors. ...

22 | Improved methods for vocal tract normalization
- Welling, Kanthak, et al.
- 1999
Citation Context: ...tently better results. Fast VTN: A significant reduction in computation time can be obtained by using a text-independent acoustic model p(X|θ) with model parameter θ for scale factor selection [11, 12, 19, 21, 22]: α̃ = argmax_α p(X_α|θ) (28). The training of the text-independent model parameters will be addressed later. As in the two-pass strategy before, the recognition pass itself is then carried out ...

18 | Speaker adaptive training: a maximum likelihood approach to speaker normalization - Anastasakos, McDonough, et al. - 1997

12 | Acoustic Front End Optimization for Large Vocabulary Speech Recognition
- Welling, Haberland, et al.
- 1997
Citation Context: ...experiments on the large vocabulary corpora WSJ0 and Verbmobil, we used the recognizer described in [15]. This recognizer employs a cepstrum front-end that includes linear discriminant analysis (LDA) [20]. For acoustic modeling, continuous density hidden Markov models along with decision-tree based state tying are used. A time-synchronous left-to-right beam search strategy in combination with a tree-o...

11 | Maximum a posteriori linear regression with elliptically symmetric matrix variate priors
- Chou
- 1999
Citation Context: ...raining from normalized observations X_α. As usual, the prior distribution p(θ|W) is assumed to be uniform. Only recently, there have been some studies that make explicit use of the prior distribution [5, 6]. In the following, we will focus on methods for training and recognition with VTN. Since we will use different types of acoustic models with different model parameters, we have summarized the differen...

10 | Speaker and gender normalization for continuous-density Hidden Markov Models - Acero, Huang - 1996

9 | Speaker adaptation with all-pass transforms
- McDonough, Schaaf, et al.
- 2004
Citation Context: ...6) so that we have for the distribution: p(X|W; θ, α) = p(X_α|W; θ) (17). When changing the random variable from X to X_α, we must include the so-called Jacobian determinant for the transformation [14]. This is not expressed explicitly in our notation since we use the same symbols for the non-adaptive model p(X|W; θ) and the adaptive model p(X_α|W; θ). Anyway, the effect as such tends to be small...
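The Jacobian point in this snippet can be checked numerically: for a linear feature transform x_α = A x applied to Gaussian observations, the two densities differ by exactly |det A|. The matrix, dimensions, and helper below are invented for the illustration.

```python
import numpy as np

def gauss_pdf(x, cov):
    """Zero-mean multivariate Gaussian density at x."""
    d = len(x)
    inv = np.linalg.inv(cov)
    norm = 1.0 / np.sqrt((2 * np.pi) ** d * np.linalg.det(cov))
    return norm * np.exp(-0.5 * x @ inv @ x)

A = np.array([[2.0, 0.0], [0.0, 3.0]])   # toy linear "warp", det = 6
x = np.array([0.3, -0.7])

# If Y = A X with X ~ N(0, I), then Y ~ N(0, A A^T) and the change of
# variables gives p_X(x) = p_Y(A x) * |det A|.
lhs = gauss_pdf(x, np.eye(2))
rhs = gauss_pdf(A @ x, A @ A.T) * abs(np.linalg.det(A))
```

The two sides agree, which is why dropping the determinant (as the notation in the quote silently does) changes likelihood values, even if the effect on recognition tends to be small.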

9 | Experiments in speaker normalization and adaptation for large vocabulary speech recognition
- Pye, Woodland
- 1997

9 | A study on speaker normalization using vocal tract normalization and speaker adaptive training
- Welling, Haeb-Umbach, et al.
- 1998
Citation Context: ...the multimodality of the mixture distributions. As a result, it is difficult to reliably select the speaker-dependent scale factor of the frequency axis. This problem is addressed in a number of papers [11, 19, 21]. The method used in this paper for selecting the speaker-dependent scale factor was first presented in [21] and is based on using single Gaussian densities. Using this single-density method for scale fa...

7 | The Karlsruhe Verbmobil Speech Recognition Engine
- Finke
Citation Context: ...In recognition tests, the fast VTN is compared with the improved two-pass strategy. We will present recognition tests on three different databases, namely the German spontaneous speech task Verbmobil [8], the Wall Street Journal task, and the German telephone digit string corpus SieTill. The results will show that the proposed method for VTN in training in combination with the improved two-pass strate...

6 | Speaker Normalization Using Efficient Frequency Warping Procedures
- Lee, Rose
- 1996

3 | Source Normalization Training for HMM Applied to Noisy Telephone Speech Recognition - Gong - 1997

2 | Speaker adaptive training applied to continuous mixture density modeling - Aubert, Thelen - 1997

1 | Maximum likelihood linear transformations for HMM-based speech recognition
- Gales
- 1998
Citation Context: ...(13) so that we have for the distribution: p(X|W; θ, η) = p(X|W; θ_η) (14). The prototypical example of this approach is the so-called MLLR method (MLLR = maximum likelihood linear regression) [13, 9]. In MLLR, the parameter set η defines a transformation matrix A(η) that is applied to the mean vector of Gaussian distributions: μ → A(η)μ (15), possibly with an additive offset. In [2], the tr...