## Semi-Tied Covariance Matrices For Hidden Markov Models (1999)

Venue: IEEE Transactions on Speech and Audio Processing

Citations: 202 (31 self)

### BibTeX

```bibtex
@ARTICLE{Gales99semi-tiedcovariance,
  author  = {M.J.F. Gales},
  title   = {Semi-Tied Covariance Matrices For Hidden Markov Models},
  journal = {IEEE Transactions on Speech and Audio Processing},
  year    = {1999},
  volume  = {7},
  pages   = {272--281}
}
```

### Abstract

There is normally a simple choice made in the form of the covariance matrix to be used with continuous-density HMMs. Either a diagonal covariance matrix is used, with the underlying assumption that elements of the feature vector are independent, or a full or block-diagonal matrix is used, where all or some of the correlations are explicitly modelled. Unfortunately, when using full or block-diagonal covariance matrices there tends to be a dramatic increase in the number of parameters per Gaussian component, limiting the number of components which may be robustly estimated. This paper introduces a new form of covariance matrix which allows a few "full" covariance matrices to be shared over many distributions, whilst each distribution maintains its own "diagonal" covariance matrix. In contrast to other schemes which have hypothesised a similar form, this technique fits within the standard maximum-likelihood criterion used for training HMMs. The new form of covariance matrix is evaluated on a large-vocabulary speech-recognition task. In initial experiments the performance of the standard system was achieved using approximately half the number of parameters. Moreover, a 10% reduction in word error rate compared to a standard system can be achieved with less than a 1% increase in the number of parameters and little increase in recognition time.
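The construction described in the abstract, a shared "full" transform per class plus a component-specific diagonal covariance, writes the component covariance as Σ^(m) = H^(r) Σ^(m)_diag H^(r)T. Working with A^(r) = H^(r)-1, the Gaussian log-likelihood never requires the full covariance to be formed. A minimal numpy sketch of this identity (function name is illustrative, not from the paper):

```python
import numpy as np

def semi_tied_log_likelihood(o, mu, diag_var, A):
    """Log-likelihood of o under N(mu, H diag_var H^T), where
    A = H^{-1} is the shared (semi-tied) transform.

    With z = A(o - mu), the quadratic form reduces to a sum over
    independent dimensions, and log|Sigma| = -2 log|det A| + sum(log var),
    so the full covariance matrix is never constructed."""
    n = len(o)
    z = A @ (o - mu)
    log_det_A = np.linalg.slogdet(A)[1]
    return log_det_A - 0.5 * (n * np.log(2 * np.pi)
                              + np.sum(np.log(diag_var))
                              + np.sum(z ** 2 / diag_var))
```

Because A^(r) is shared over many components, the transformed observation A^(r) o(τ) can be cached once per transform, which is consistent with the abstract's claim of little increase in recognition time.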

### Citations

8929 | Maximum likelihood from incomplete data via the EM algorithm
- Dempster
- 1977
Citation Context: ...This is used to generate the component's covariance matrix as described in equation 13. It is very complex to optimise these parameters directly, so an expectation-maximisation approach is adopted [4]. Furthermore, rather than dealing with H^(r), it is simpler to deal with its inverse, A^(r), thus A^(r) = H^(r)-1. If ML estimates of all the parameters are made then the auxiliary function bel...
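The EM-style optimisation of A^(r) referred to in this context can be sketched as row-by-row coordinate ascent on an auxiliary function of the form Q(A) = β log|det A| − ½ Σ_i a_i G^(i) a_i^T, where a_i is the i-th row of A and c_i the matching cofactor row. This is a hedged reconstruction (variable names are mine, and the G^(i) below are synthetic statistics, not accumulated from data):

```python
import numpy as np

def update_semi_tied_transform(A, G, beta, n_sweeps=10):
    """Row-by-row coordinate ascent for the semi-tied transform A.

    G is a list of d x d statistics matrices (one per dimension),
    beta the total occupancy.  Each row update
        a_i = c_i G[i]^{-1} * sqrt(beta / (c_i G[i]^{-1} c_i^T))
    maximises the auxiliary function over a_i with the other rows
    fixed, so the auxiliary function never decreases."""
    A = A.copy()
    d = A.shape[0]
    for _ in range(n_sweeps):
        for i in range(d):
            # i-th row of the cofactor matrix: cof(A) = det(A) * inv(A)^T
            c = np.linalg.det(A) * np.linalg.inv(A).T[i]
            g = np.linalg.solve(G[i], c)          # c_i G[i]^{-1}
            A[i] = g * np.sqrt(beta / (c @ g))
    return A
```

Since each row update is an exact maximiser given the other rows, a sweep can only increase the auxiliary function, matching the guaranteed-likelihood-increase property the text attributes to the scheme.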

4569 | A tutorial on hidden Markov models and selected applications in speech recognition
- Rabiner
- 1989
Citation Context: ...ters and little increase in recognition time. 1 Introduction. There is normally a simple choice made in the form of the covariance matrix to be used with continuous-density hidden Markov models (HMMs) [19]. Either a diagonal covariance matrix is used, with the underlying assumption that each element of the feature vector is independent, or a full or block-diagonal matrix is used, where all or some of t...

2894 | Introduction to Statistical Pattern Recognition
- Fukunaga
- 1972
Citation Context: ...e independent. The use of the discrete cosine transform in speech recognition is common for this reason [3]. Other schemes include linear discriminant analysis (LDA) and the Karhunen-Loeve transform [5]. However, it is hard to find a single transform which decorrelates all elements of the feature vector for all states. Model-based schemes are a more flexible approach, which allow many decorrelating tran...
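The context's point, that a single global transform rarely decorrelates every state, is easy to illustrate with the Karhunen-Loeve transform it mentions: the eigenvectors of the pooled covariance decorrelate the pooled data, but not each class's data individually. A small numpy sketch (function name is illustrative):

```python
import numpy as np

def karhunen_loeve_transform(X):
    """KLT for a feature matrix X (n_samples x d): the rows of the
    returned matrix are eigenvectors of the sample covariance, so
    X @ T.T has a diagonal sample covariance."""
    cov = np.cov(X, rowvar=False)
    _, eigvecs = np.linalg.eigh(cov)
    return eigvecs.T
```

A single such transform diagonalises only the global covariance; if different states have differently oriented covariances, no one rotation serves them all, which is the gap the paper's model-based (semi-tied) scheme addresses.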

827 | Comparison of Parametric Representations for Monosyllabic Word Recognition in Continuously Spoken Sentences
- Davis, Mermelstein
- 1980
Citation Context: ...the front-end processing is modified to try and ensure that all elements of the feature vector are independent. The use of the discrete cosine transform in speech recognition is common for this reason [3]. Other schemes include linear discriminant analysis (LDA) and the Karhunen-Loeve transform [5]. However, it is hard to find a single transform which decorrelates all elements of the feature vector for...

652 | Maximum likelihood linear regression for speaker adaptation of the continuous density hidden Markov models. Computer Speech and Language
- Leggetter, Woodland
- 1995
Citation Context: ...Thus, when using equation 22, the inverse of G^(ri) may not exist. There are two solutions to this problem, similar to those used to ensure robustness in maximum likelihood linear regression (MLLR) [15]. The first is to use block-diagonal transformations, thus dramatically reducing the chance of non-full-rank matrices. Furthermore it decreases both the computational load (it is cheaper to invert three ...
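The block-diagonal trick mentioned here works because the inverse of a block-diagonal matrix is the block-diagonal matrix of the block inverses: only three small matrices (e.g. statics, deltas, delta-deltas) need inverting, and each small block is far less likely to be rank-deficient under limited data. A sketch (helper names are mine):

```python
import numpy as np

def block_diag(blocks):
    """Assemble a block-diagonal matrix from square blocks."""
    n = sum(B.shape[0] for B in blocks)
    out = np.zeros((n, n))
    i = 0
    for B in blocks:
        b = B.shape[0]
        out[i:i + b, i:i + b] = B
        i += b
    return out

def invert_blockwise(blocks):
    """Invert each block separately: for k blocks of size n/k this
    costs ~k * (n/k)^3 = n^3 / k^2 flops instead of n^3."""
    return [np.linalg.inv(B) for B in blocks]
```

For three blocks the inversion cost drops by roughly a factor of nine relative to inverting the full matrix, consistent with the snippet's "cheaper to invert three" remark.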

437 | Maximum likelihood linear transformations for hmm-based speech recognition
- Gales
- 1997
Citation Context: ...ions to the case when full covariance matrices are used is also possible [7]. Another closely related problem is ML linear transformations of the variances for speaker and environmental adaptation [8]. Here a linear transform, typically tied over many components, is required to adapt the variances to be representative of a new speaker or acoustic environment. When adapted in an unconstrained mode...

201 | Tree-based state tying for high accuracy acoustic modeling
- Young, Odell, et al.
- 1994
Citation Context: ...consisted of 36493 sentences from the SI-284 WSJ0 and WSJ1 sets, and the LIMSI 1993 WSJ lexicon and phone set were used. The standard HTK system was trained using decision-tree-based state clustering [22] to define 6399 speech states. For the H1 task a 65k word list and dictionary were used with the trigram language model described in [20]. All decoding used a dynamic-network decoder [18]. When generati...

120 | Mean and variance adaptation within the MLLR framework
- Gales
- 1996
Citation Context: ...he current model set. A further modification can be used to generate a transform that is guaranteed to increase the likelihood. This transform has a similar form to the variance transform described in [10]. The component-specific variance may be written as

Σ^(m) = L^(m)_diag Σ^(s)'_full L^(m)T_diag    (10)

where

Σ^(s)'_full = Σ_{m ∈ M^(s)} L^(m)-1_diag [ Σ_τ γ_{s_m}(τ) (o(τ) − μ^(m)) (o(τ) − μ^(m))^T ] L^(m)-1T_diag ...

85 | Investigation of Silicon-Auditory Models and Generalization of Linear Discriminant Analysis for Improved Speech Recognition
- Kumar
- 1997
Citation Context: ...meters. This optimisation is performed using a simple iterative scheme, which is guaranteed to increase the likelihood of the training data. Recently an extension to LDA based on ML has been proposed [14], heteroscedastic LDA (HLDA). Although addressing a different problem, that of dimensionality reduction, optimising the HLDA transform requires solving similar equations to the ones described here. In ...

67 | Maximum-Likelihood Estimation for Mixture Multivariate Stochastic Observations of Markov Chains
- Juang
- 1985
Citation Context: ...number of parameters per Gaussian component, limiting the number of components which may be robustly estimated. To overcome this problem multiple diagonal-covariance Gaussian distributions may be used [16, 13]. In addition to being able to model non-Gaussian distributions they can model correlations. However, it is preferable to decorrelate the feature vector as far as possible, as otherwise components must...
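The claim that mixtures of diagonal-covariance Gaussians can model correlations is easy to demonstrate: placing two diagonal components along the line y = x yields a pooled distribution with a strong off-diagonal covariance term, even though neither component models any correlation. A toy numpy check (values are illustrative):

```python
import numpy as np

def mixture_covariance(seed=0, n=5000):
    """Pooled covariance of a two-component diagonal-covariance
    mixture with means at (-2,-2) and (+2,+2): each component's
    covariance is diagonal (scale 0.5 per dimension), but the
    pooled covariance picks up an off-diagonal term from the means."""
    rng = np.random.default_rng(seed)
    c1 = rng.normal(loc=[-2.0, -2.0], scale=0.5, size=(n, 2))
    c2 = rng.normal(loc=[2.0, 2.0], scale=0.5, size=(n, 2))
    return np.cov(np.vstack([c1, c2]), rowvar=False)
```

The pooled off-diagonal term here comes entirely from the component means (about 4 in this construction), which is exactly the snippet's point: components get spent modelling correlation, so decorrelating first is preferable.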

65 | The development of the 1994 HTK large vocabulary speech recognition system
- Woodland
- 1994
Citation Context: ...ed for the recognition task was a gender-independent cross-word-triphone mixture-Gaussian tied-state HMM system. This was the same as the "HMM-1" model set used in the HTK 1994 ARPA evaluation system [20]. In this model set all the speech models had a three-emitting-state, left-to-right topology. Two silence models were used. The first silence model, a short pause model, had a single emitting state whic...

64 | The generation and use of regression class trees for MLLR adaptation
- Gales
- 1996
Citation Context: ...gle component system and performing agglomerative clustering. An alternative scheme, that has previously been used for generating regression class trees, is based on locally maximising the likelihood [6]. Here a modified version of K-means clustering is used. If there are R semi-tied transforms then for each state, s, the semi-tied transform associated with that state, r̂^(s), is determined by ...
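The assignment step of the modified K-means scheme described here can be sketched as follows: each state is (re)assigned to whichever semi-tied transform gives its data the highest likelihood, with the diagonal variances profiled out at their ML values. This is a simplified reconstruction (the per-state statistics are reduced to a scatter matrix S and frame count n; all names are mine):

```python
import numpy as np

def assign_transforms(state_stats, transforms):
    """For each state (scatter S, count n), pick the transform A_r
    maximising the profile log-likelihood
        n * log|det A_r| - (n/2) * sum_i log sigma_i^2,
    where sigma_i^2 = (A_r S A_r^T)_{ii} / n are the ML diagonal
    variances of the state's data after transformation."""
    assignments = []
    for S, n in state_stats:
        scores = []
        for A in transforms:
            var = np.diag(A @ S @ A.T) / n
            ll = n * np.linalg.slogdet(A)[1] - 0.5 * n * np.sum(np.log(var))
            scores.append(ll)
        assignments.append(int(np.argmax(scores)))
    return assignments
```

By Hadamard's inequality the product of the diagonal variances is smallest when the transform actually decorrelates the state's data, so each state gravitates to the transform that fits its own correlation structure.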

53 | A one pass decoder design for large vocabulary recognition
- Odell, Valtchev, et al.
- 1994

27 | Context Dependent Modeling of Phones in Continuous Speech Using Decision Trees
- Bahl, Souza, et al.
- 1991
Citation Context: ...covariance HMM model set is shown as the baseline. 5.3 Parameter tying. A variety of techniques have been used for clustering Gaussian components in speech recognition, for example decision-tree tying [1]. Unfortunately, it is harder to decide how to group the components into semi-tied classes. The simplest approach is to tie all states together, or all the states of the same monophone together. The c...

27 | The importance of cepstral parameter correlations in speech recognition
- Ljolje
- 1994
Citation Context: ...matrix system. This paper introduces a new model-based scheme, semi-tied covariance matrices. The scheme which is most closely related to the one described in this paper is the state-specific rotation [17], which normally uses a separate transform for each state, but may be applied at any level of clustering. The model-space transform introduced in this paper is a natural extension of the state-specific...

14 | Semi-Tied Full-Covariance matrices for hidden markov models
- Gales
- 1997
Citation Context: ...d robustly. Each iteration is guaranteed to increase the likelihood of the training data. An alternative optimisation scheme was given in the original presentation of semi-tied covariance matrices [9]. In all cases the two schemes converged to the same solution; however, the scheme presented here is felt to be more elegant. It makes no difference whether the positive or negative root is selected ...

9 | Training and Speaker Adaptation in Template-based Speech Recognition
- Hewett
- 1989
Citation Context: ...current estimate of the component mean. This still does not yield a transform that is guaranteed to increase the likelihood (it uses the same sort of approximation as least-squares linear regression [11]), but does relate the transform to the current model set. A further modification can be used to generate a transform that is guaranteed to increase the likelihood. This transform has a similar form to...

5 | Recent improvements in the AT&T speech-to-text (STT) system
- Hindle, Ljolje, et al.
- 1996
Citation Context: ...). For further details of this type of adaptation and its limitations see [7]. These ML adaptation schemes may be contrasted with the least-squares linear regression (LSLR) adaptation implemented in [12] when the decorrelating rotation described in [17] was used. Using LSLR it is not possible to guarantee that the likelihood of the adaptation data will increase. However the computational cost is far ...

4 | Maximum likelihood estimation for multivariate observations of Markov sources
- Liporace
- 1982
Citation Context: ...number of parameters per Gaussian component, limiting the number of components which may be robustly estimated. To overcome this problem multiple diagonal-covariance Gaussian distributions may be used [16, 13]. In addition to being able to model non-Gaussian distributions they can model correlations. However, it is preferable to decorrelate the feature vector as far as possible, as otherwise components must...

3 | Adapting semi-tied full-covariance HMMs
- Gales
- 1997
Citation Context: ...mponent-specific covariance matrix need not necessarily be diagonal; it need only be more constrained than the semi-tied transform. Though estimation formulae may be simply derived in this case (see [7] for how to estimate the semi-tied transform in the non-diagonal case), the diagonal component-specific covariance case is felt to be the most practically useful and will be the one described in this p...

1 | IBM's LVCSR system for transcription of broadcast news used in the 1997 Hub4 English evaluation
- Chen, Gales, Gopinath, Kanevsky, Olsen, et al.
- 1998
Citation Context: ...imately the same, the time-efficient optimisation only requires the storage of a full covariance matrix at the state level, rather than the Gaussian component level. This was the approach adopted in [2] (except a numerical optimisation scheme was used to find the model parameters). 5.2 Number of iterations required. For the semi-tied covariance matrix the estimation process is an iterative one. Figure ...