## Maximum Likelihood Linear Transformations for HMM-Based Speech Recognition (1998)

Venue: Computer Speech and Language

Citations: 437 (60 self)

### BibTeX

```bibtex
@ARTICLE{Gales98maximumlikelihood,
  author  = {M. J. F. Gales},
  title   = {Maximum Likelihood Linear Transformations for HMM-Based Speech Recognition},
  journal = {Computer Speech and Language},
  year    = {1998},
  volume  = {12},
  pages   = {75--98}
}
```


### Abstract

This paper examines the application of linear transformations for speaker and environmental adaptation in an HMM-based speech recognition system. In particular, transformations that are trained in a maximum likelihood sense on adaptation data are investigated. Other than in the form of a simple bias, strict linear feature-space transformations are inappropriate in this case. Hence, only model-based linear transforms are considered. The paper compares the two possible forms of model-based transforms: (i) unconstrained, where any combination of mean and variance transform may be used, and (ii) constrained, which requires the variance transform to have the same form as the mean transform (sometimes referred to as feature-space transforms). Re-estimation formulae for all appropriate cases of transform are given. This includes a new and efficient "full" variance transform and the extension of the constrained model-space transform from the simple diagonal case to the full or block-diagonal case. The constrained and unconstrained transforms are evaluated in terms of computational cost, recognition time efficiency, and use for speaker adaptive training. The recognition performance of the two model-space transforms on a large vocabulary speech recognition task using incremental adaptation is investigated. In addition, initial experiments using the constrained model-space transform for speaker adaptive training are detailed.

¹ The author is now at the IBM T.J. Watson Research Center, Yorktown Heights, NY 10598, USA.
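The constrained case described in the abstract is exactly what makes these transforms usable as feature-space transforms: applying `{A, b}` to both the mean and the variance of a Gaussian is equivalent to applying the inverse transform to the observation, up to a Jacobian term. A minimal numpy sketch of this identity (variable names and values are illustrative, not from the paper):

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
d = 3

# A hypothetical speaker-independent Gaussian component.
L = rng.normal(size=(d, d))
mu = rng.normal(size=d)
Sigma = L @ L.T + d * np.eye(d)   # symmetric positive definite

# A constrained (CMLLR-style) transform: the same matrix A acts on
# both the mean and the variance.
A = np.eye(d) + 0.1 * rng.normal(size=(d, d))
b = rng.normal(size=d)

o = rng.normal(size=d)            # one observation vector

# Model-space view: transform the Gaussian parameters.
p_model = multivariate_normal.pdf(o, mean=A @ mu + b, cov=A @ Sigma @ A.T)

# Feature-space view: transform the observation instead, and scale by
# the Jacobian |A|^-1.
o_hat = np.linalg.solve(A, o - b)
p_feat = multivariate_normal.pdf(o_hat, mean=mu, cov=Sigma) / abs(np.linalg.det(A))

assert np.isclose(p_model, p_feat)
```

Because the feature-space view transforms each observation once per frame rather than every Gaussian, the constrained form is attractive at recognition time, which is one of the efficiency comparisons the paper draws.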

### Citations

8919 | Maximum likelihood from incomplete data via the EM algorithm (with discussion)
- Dempster, Laird, et al.
- 1977
Citation Context: ...on of the variance must correspond to that applied to the mean. Both these transforms are described in detail below. In all cases the parameters of the linear transform are found using an EM approach [3]. The parameters of the transforms are found by optimising the following auxiliary function:

$$Q(\mathcal{M}, \hat{\mathcal{M}}) = K - \frac{1}{2}\sum_{m=1}^{M}\sum_{\tau=1}^{T}\gamma_{m}(\tau)\left[K^{(m)} + \log\bigl(|\hat{\Sigma}^{(m)}|\bigr) + \bigl(\mathbf{o}(\tau)-\hat{\boldsymbol{\mu}}^{(m)}\bigr)^{\mathsf{T}}\hat{\Sigma}^{(m)-1}\bigl(\mathbf{o}(\tau)-\hat{\boldsymbol{\mu}}^{(m)}\bigr)\right] \tag{1}$$
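Evaluating this auxiliary function is straightforward once the occupation probabilities γ_m(τ) are known. A toy numpy sketch (all values are placeholders; K and K^(m) are set to zero here since they do not depend on the transform parameters):

```python
import numpy as np

rng = np.random.default_rng(1)
d, M, T = 2, 3, 5   # feature dim, Gaussian components, frames (toy sizes)

obs = rng.normal(size=(T, d))            # observations o(tau)
gamma = rng.random(size=(M, T))          # occupation probs gamma_m(tau)
gamma /= gamma.sum(axis=0)               # normalise over components

mu_hat = rng.normal(size=(M, d))         # transformed means
Sigma_hat = np.array([np.diag(rng.random(d) + 0.5) for _ in range(M)])

K = 0.0            # constant independent of the transform parameters
K_m = np.zeros(M)  # per-component normalisation constants

# Accumulate Q(M, M_hat) term by term, as in equation (1).
Q = K
for m in range(M):
    inv = np.linalg.inv(Sigma_hat[m])
    _, logdet = np.linalg.slogdet(Sigma_hat[m])
    for t in range(T):
        diff = obs[t] - mu_hat[m]
        Q -= 0.5 * gamma[m, t] * (K_m[m] + logdet + diff @ inv @ diff)

print(Q)
```

In the EM approach of [3], the transform parameters are chosen to maximise this quantity at each iteration, which guarantees the adaptation-data likelihood does not decrease.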

650 | Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models
- Leggetter, Woodland
- 1995
Citation Context: ...on Research Center, Yorktown Heights, NY 10598, USA 1 Introduction In recent years there has been a vast amount of work done on estimating and applying linear transformations to HMM-based recognisers [2, 4, 13, 17]. Though not the only possible model adaptation scheme, for example maximum a-posteriori adaptation [10] may be used, linear transforms have been shown to be a powerful tool for both speaker and envir...

201 | Tree-based state tying for high accuracy acoustic modeling
- Young, Odell, et al.
- 1994
Citation Context: ...consisted of 36493 sentences from the SI-284 WSJ0 and WSJ1 sets, and the LIMSI 1993 WSJ lexicon and phone set were used. The standard HTK system was trained using decision-tree-based state clustering [24] to define 6399 speech states. For the H1 task a 65k word list and dictionary was used with the trigram language model described in [23]. For the S5 task a 5K vocabulary with trigram language model wa...

157 | A compact model for speaker-adaptive training
- Anastasakos, McDonough, et al.
- 1996
Citation Context: ...ted. These transforms are then compared in terms of efficiency at run-time and in training the transform. There has also been much interest in using adaptation techniques in both training and testing [1, 11]. Here, instead of applying the test set adaptation transforms to a speaker-independent model-set they are applied to a model set trained using that adaptation scheme. Thus the model-set used in adapta...

120 | Mean and variance adaptation within the MLLR framework
- Gales
- 1996
Citation Context: ...ce transformations, which act on the model parameters themselves, have been shown to be useful. There are two main forms of model-space transformation. First, there is the unconstrained case (e.g. [13, 9]) where the transforms on the means and variances are unrelated to each other. Alternatively, for the constrained case (e.g. [4]), the mean transformation and variance transformation are required to h...

111 | A maximum-likelihood approach to stochastic matching for robust speech recognition
- Sankar, Lee
- 1996
Citation Context: ...ed on a particular set of adaptation data, such that it maximises the likelihood of that adaptation data given the current model-set. The theory behind these ML trained transforms is well established [22]. However the actual forms of the transform that have been applied to date are limited, due to the complexity of optimising the transformation parameters. The aim of this paper is to present the vario...

95 | Speaker adaptation using constrained estimation of Gaussian mixtures
- Digalakis, Rtischev, et al.
- 1995
Citation Context: ...on Research Center, Yorktown Heights, NY 10598, USA 1 Introduction In recent years there has been a vast amount of work done on estimating and applying linear transformations to HMM-based recognisers [2, 4, 13, 17]. Though not the only possible model adaptation scheme, for example maximum a-posteriori adaptation [10] may be used, linear transforms have been shown to be a powerful tool for both speaker and envir...

90 | Model-Based Techniques for Noise Robust Speech Recognition
- Gales
- 1995
Citation Context: ...s then trained for each tied state, a total of about 6 million parameters. For the secondary channel experiments, S5, a PLP version of the standard MFCC models was built using single-pass retraining [5] on the secondary channel training data. This was to ensure that a reasonable initial model set was used in the adaptation process. All recognition tests were carried out on the 1994 ARPA Hub 1 and S5...

66 | Maximum a-posteriori estimation for multivariate Gaussian mixture observations of Markov chains
- Gauvain
- 1994

65 | The development of the 1994 HTK large vocabulary speech recognition system
- Woodland
- 1994
Citation Context: ...ed for the recognition task was a gender-independent cross-word-triphone mixture-Gaussian tied-state HMM system. This was the same as the "HMM-1" model set used in the HTK 1994 ARPA evaluation system [23]. The speech was parameterised into 12 MFCCs, C1 to C12, along with normalised log-energy and the first and second differentials of these parameters. This yielded a 39-dimensional feature vector. C...
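The 39-dimensional feature vector in this excerpt (12 MFCCs plus log-energy, with first and second differentials appended) can be sketched with the standard regression formula for delta coefficients. The ±2-frame regression window is an assumption here (it is HTK's usual default), not something stated in the excerpt:

```python
import numpy as np

def deltas(feats, window=2):
    """Regression-based differentials (HTK-style); window width assumed.

    feats: (T, d) array of static features.
    Returns a (T, d) array of delta coefficients.
    """
    T = feats.shape[0]
    denom = 2 * sum(w * w for w in range(1, window + 1))
    # Replicate edge frames so the regression is defined at the boundaries.
    padded = np.pad(feats, ((window, window), (0, 0)), mode="edge")
    out = np.zeros_like(feats)
    for w in range(1, window + 1):
        out += w * (padded[window + w: window + w + T]
                    - padded[window - w: window - w + T])
    return out / denom

T = 50
static = np.random.default_rng(2).normal(size=(T, 13))  # 12 MFCCs + log-energy
full = np.hstack([static, deltas(static), deltas(deltas(static))])
assert full.shape == (T, 39)
```

Applying `deltas` twice gives the second differentials ("accelerations"), yielding the 13 + 13 + 13 = 39 dimensions described above.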

64 | The generation and use of regression class trees for MLLR adaptation
- Gales
- 1996
Citation Context: ...ted. The question of what the appropriate number of transforms is for a particular set of adaptation data and how the Gaussian components should be grouped together is interesting and is discussed in [6]. The question of complexity versus specificity was also examined in [17], where using an unconstrained block-diagonal mean transformation was shown to be better than a diagonal transformation. This m...

53 | A one pass decoder design for large vocabulary recognition
- Odell, Valtchev, et al.
- 1994
Citation Context: ... word list and dictionary was used with the trigram language model described in [23]. For the S5 task a 5K vocabulary with trigram language model was used. All decoding used a dynamic-network decoder [18] which can either operate in a single-pass or rescore pre-computed word lattices. A 12 component mixture Gaussian distribution was then trained for each tied state, a total of about 6 million paramete...

50 | Integrated models of signal and background with application to speaker identification in noise
- Rose, Hofstetter, et al.
- 1994
Citation Context: ... (21) Thus by appropriately modifying the means the additional cost at recognition time is just a matrix-vector multiplication and a simple addition. The transform using a simple bias on the variance [20, 22] is not considered here, as for many situations it can give an inappropriate transformation. For cases where the variance bias is not constrained to be positive any unobserved Gaussian component may e...
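The efficiency point in this excerpt, that a shared variance transform can be absorbed into precomputed means so the per-frame cost is a single matrix-vector multiply, can be illustrated with a small numpy sketch. The diagonal-covariance setup and variable names are illustrative, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(3)
d, M = 3, 4
mu = rng.normal(size=(M, d))                   # component means
Sigma = rng.random(size=(M, d)) + 0.5          # diagonal covariances
H = np.eye(d) + 0.1 * rng.normal(size=(d, d))  # shared full variance transform
o = rng.normal(size=d)                         # one frame

# Direct evaluation with transformed covariances H Sigma H^T: one
# full-covariance quadratic form per Gaussian component.
direct = []
for m in range(M):
    cov = H @ np.diag(Sigma[m]) @ H.T
    diff = o - mu[m]
    direct.append(diff @ np.linalg.inv(cov) @ diff)

# Efficient evaluation: modify the means offline (mu_t = H^-1 mu), then
# pay only one matrix-vector multiply per frame (o_t = H^-1 o), after
# which each component costs a cheap diagonal quadratic form.
Hinv = np.linalg.inv(H)
mu_t = mu @ Hinv.T    # done once, offline
o_t = Hinv @ o        # once per frame
fast = [((o_t - mu_t[m]) ** 2 / Sigma[m]).sum() for m in range(M)]

assert np.allclose(direct, fast)
```

The log-determinant correction, log|H Sigma H^T| = 2 log|H| + log|Sigma|, is likewise a per-component constant that can be folded in offline.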

44 | Probabilistic optimum filtering for robust speech recognition
- Neumeyer, Weintraub
- 1994
Citation Context: ... not allowed to alter the recogniser stage in any way. A variety of linear feature-space transformations for adaptation and compensation for speech recognition have been proposed in the literature [12, 15, 16]. ML training of linear feature-space transformations may be shown to be, not surprisingly, inappropriate for speech recognition (see [7]). In contrast, model-space transformations, which act on the m...

34 | Improved Acoustic Modeling for HMMs using Linear Transform
- Leggetter
- 1995
Citation Context: ... not allowed to alter the recogniser stage in any way. A variety of linear feature-space transformations for adaptation and compensation for speech recognition have been proposed in the literature [12, 15, 16]. ML training of linear feature-space transformations may be shown to be, not surprisingly, inappropriate for speech recognition (see [7]). In contrast, model-space transformations, which act on the m...

25 | Experiments in speaker normalisation and adaptation for large vocabulary speech recognition
- Pye, Woodland
- 1997
Citation Context: ...ansforms used are vocal tract normalisation (VTN) [11] and speaker adaptive training (SAT) [1]. The gains obtained using VTN have been shown to be essentially additive to the gains obtained using SAT [19]. This paper does not consider the use of VTN as it is only concerned with linear transformations, though VTN would similarly be expected to improve results quoted here. The standard SAT training uses...

21 | Robust Speech Recognition Based on Stochastic
- Sankar, Lee
- 1995
Citation Context: ...may be applied to an HMM-based speech recognition system and how they may be simply estimated. Usually linear transformations are described as being applied in either the model-space or feature-space [21]. This paper uses the same terminology, but applied in a very strict sense. Thus a feature-space transform is required to only act on the features, it is not allowed to alter the recogniser stage in a...

17 | Unsupervised Speaker Adaptation by Probabilistic Spectrum
- Cox, Bridle
- 1989
Citation Context: ...on Research Center, Yorktown Heights, NY 10598, USA 1 Introduction In recent years there has been a vast amount of work done on estimating and applying linear transformations to HMM-based recognisers [2, 4, 13, 17]. Though not the only possible model adaptation scheme, for example maximum a-posteriori adaptation [10] may be used, linear transforms have been shown to be a powerful tool for both speaker and envir...

16 | Speaker Normalisation Using Efficient Frequency Warping Procedures
- Lee, Rose
- 1996
Citation Context: ...ted. These transforms are then compared in terms of efficiency at run-time and in training the transform. There has also been much interest in using adaptation techniques in both training and testing [1, 11]. Here, instead of applying the test set adaptation transforms to a speaker-independent model-set they are applied to a model set trained using that adaptation scheme. Thus the model-set used in adapta...

14 | Semi-Tied Full-Covariance matrices for hidden markov models
- Gales
- 1997
Citation Context: ...here is also the constraint that there are no numerical accuracy problems. ...increase the likelihood at each iteration. The optimisation has the same form as the semi-tied full-covariance optimisation [8], where an indirect method over the rows was previously presented. The advantage of the indirect method was that it did not involve the inversion of G^{(i)}. In contrast to the variance transform in eq...

13 | Practical implementations of speaker adaptive training
- Schwartz
- 1997
Citation Context: ... (35) and {A^{(s)}, b^{(s)}} is the transformation associated with speaker s. Unfortunately when implementing these re-estimation formulae there are severe computational and memory overheads [14, 19]. In order to update the means as described in equation 32 it is necessary to store a full, or block-diagonal, matrix for each Gaussian component. This rapidly becomes impractical as the number of Gau...

10 | Unsupervised Speaker-Adaptation For Hybrid HMM-MLP Continuous Speech Recognition System
- Neto, Martins, et al.
- 1995
Citation Context: ... not allowed to alter the recogniser stage in any way. A variety of linear feature-space transformations for adaptation and compensation for speech recognition have been proposed in the literature [12, 15, 16]. ML training of linear feature-space transformations may be shown to be, not surprisingly, inappropriate for speech recognition (see [7]). In contrast, model-space transformations, which act on the m...

5 | A comparative study of speaker adaptation techniques
- Neumeyer, Sankar, et al.
- 1995

4 | Maximum likelihood linear regression for speaker adaptation of continuous density HMMs
- Leggetter, Woodland
- 1995
Citation Context: ...ansform for speaker adaptive training are detailed. 1 Introduction In recent years there has been a vast amount of work done on estimating and applying linear transformations to HMM-based recognisers [2, 4, 12, 16]. Though not the only possible model adaptation scheme, for example maximum a-posteriori adaptation [9] may be used, linear transforms have been shown to be a powerful tool for both speaker and enviro...