## Cluster Adaptive Training Of Hidden Markov Models (1999)

Venue: IEEE Transactions on Speech and Audio Processing

Citations: 63 (16 self)

### BibTeX

@ARTICLE{Gales99clusteradaptive,
  author  = {M.J.F. Gales},
  title   = {Cluster Adaptive Training Of Hidden Markov Models},
  journal = {IEEE Transactions on Speech and Audio Processing},
  year    = {1999},
  volume  = {8},
  pages   = {417--428}
}


### Abstract

When performing speaker adaptation there are two conflicting requirements. First, the transform must be powerful enough to represent the speaker. Second, the transform must be quickly and easily estimated for any particular speaker. The most popular adaptation schemes have used many parameters to adapt the models to be representative of an individual speaker. This limits how rapidly the models may be adapted to a new speaker or acoustic environment. This paper examines an adaptation scheme requiring very few parameters, cluster adaptive training (CAT). CAT may be viewed as a simple extension to speaker clustering. Rather than selecting a single cluster as representative of a particular speaker, a linear interpolation of all the cluster means is used as the mean of the particular speaker. This scheme naturally falls into an adaptive training framework. Maximum likelihood estimates of the interpolation weights are given. Furthermore, simple re-estimation formulae for cluster means, represented both explicitly and by sets of transforms of some canonical mean, are given. On a speaker-independent task CAT reduced the word error rate using very little adaptation data. In addition, when combined with other adaptation schemes it gave a 5% reduction in word error rate over adapting a speaker-independent model set.
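The abstract's core idea — the speaker mean as a linear interpolation of the cluster means — can be sketched as follows. This is a minimal NumPy illustration with made-up dimensions and weights, not the paper's implementation:

```python
import numpy as np

# Hypothetical dimensions: P cluster means, each d-dimensional.
P, d = 4, 3
rng = np.random.default_rng(0)

# Cluster means stacked as the columns of M (d x P).
M = rng.normal(size=(d, P))

# Speaker-specific interpolation weights (one per cluster).
lam = np.array([0.4, 0.3, 0.2, 0.1])

# CAT speaker mean: a linear interpolation of all the cluster means.
mu_s = M @ lam

# Plain speaker clustering is the special case of a one-hot weight vector:
one_hot = np.eye(P)[2]
assert np.allclose(M @ one_hot, M[:, 2])
```

Because only the P-dimensional weight vector is speaker-specific, very little adaptation data is needed per speaker, which is the rapid-adaptation property the abstract claims.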

### Citations

9054 | Maximum likelihood from incomplete data via the EM algorithm - Dempster, Laird, et al. - 1977 |

655 | Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models - Leggetter, Woodland - 1995 |

Citation Context ...er-independent model set. 2 1 Introduction In recent years there has been a great deal of work done on adapting speech recognition systems to acoustic environment differences or to particular speakers [8, 12, 3, 17]. To adapt large numbers of parameters with very little adaptation data, a transform that modifies all the model parameters, not just those observed in the adaptation data, to be representative of a pa... |

441 | Maximum likelihood linear transformations for HMM-based speech recognition - Gales - 1998 |

Citation Context ... or acoustic environment. A variety of transforms have been examined, for example, vocal tract normalisation [10], maximum likelihood linear regression (MLLR) [12], constrained model-space transforms [3, 7] and speaker clustering [16, 14]. The majority of these techniques apply some transformation to a canonical model. Originally a speaker-independent (SI) model was used as the canonical model. During r... |

160 | A Compact Model for Speaker-Adaptive Training - Anastasakos, McDonough, et al. - 1996 |

Citation Context ..., the speaker-specific transform is applied to the SI model to generate the speaker-specific model. Recently adaptive training 2 was proposed as an alternative technique to generate the canonical model [1, 10]. Since the vast majority of training databases contain speech from many speakers, and in some cases acoustic environments, the adaptation scheme applied in recognition could also be used during train... |

115 | A maximum-likelihood approach to stochastic matching for robust speech recognition - Sankar, Lee - 1996 |

Citation Context ...er-independent model set. 2 1 Introduction In recent years there has been a great deal of work done on adapting speech recognition systems to acoustic environment differences or to particular speakers [8, 12, 3, 17]. To adapt large numbers of parameters with very little adaptation data, a transform that modifies all the model parameters, not just those observed in the adaptation data, to be representative of a pa... |

95 | Speaker adaptation using constrained estimation of Gaussian mixtures - Digalakis, Rtischev, et al. - 1995 |

Citation Context ...er-independent model set. 2 1 Introduction In recent years there has been a great deal of work done on adapting speech recognition systems to acoustic environment differences or to particular speakers [8, 12, 3, 17]. To adapt large numbers of parameters with very little adaptation data, a transform that modifies all the model parameters, not just those observed in the adaptation data, to be representative of a pa... |

66 | Maximum A-Posteriori Estimation for Multivariate Gaussian Mixture Observations of Markov Chains - Gauvain, Lee - 1994 |

65 | The generation and use of regression class trees for MLLR adaptation - Gales - 1996 |

Citation Context ...or it, which is used to adapt the Gaussian components belonging to that class. There are a variety of methods for partitioning the Gaussian components, similar to the partitions for MLLR discussed in [4]. Standard schemes for partitioning the components include clustering in acoustic space and using phonetic questions. Using multiple cluster weight vectors increases the number of parameters to be est... |

43 | Mixture-Model Adaptation for - Foster, Kuhn - 2007 |

Citation Context ...l set, the sufficient statistics G(m), K(m) and L(m) may be calculated as defined by equations 16, 17 and 18. Recently an alternative scheme for very rapid adaptation has been proposed, eigenvoices [9]. This scheme may also be used to initialise CAT. Again a set of simple speaker-dependent models are generated for every speaker in the training data. The means of each speaker are concatenated into a... |
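The eigenvoice-style initialisation described in this context — concatenate each training speaker's means into a supervector, then take principal directions — can be sketched as follows. This is a hypothetical NumPy sketch with invented dimensions, not the paper's recipe:

```python
import numpy as np

# Hypothetical setup: S training speakers, each summarised by a mean
# "supervector" (all of that speaker's Gaussian means concatenated, length D).
S, D, P = 10, 6, 3
rng = np.random.default_rng(1)
supervectors = rng.normal(size=(S, D))

# Eigenvoice-style PCA: principal directions of the centred supervectors.
mean_sv = supervectors.mean(axis=0)
centred = supervectors - mean_sv
_, _, Vt = np.linalg.svd(centred, full_matrices=False)
eigenvoices = Vt[:P]  # top P principal directions, shape (P, D)

# The mean supervector plus the leading directions could seed the
# CAT cluster means before re-estimation.
init_clusters = mean_sv + eigenvoices
```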

38 | Speaker Clustering and Transformation for Speaker Adaptation - Padmanabhan, Bahl, et al. - 1998 |

Citation Context ...riety of transforms have been examined, for example, vocal tract normalisation [10], maximum likelihood linear regression (MLLR) [12], constrained model-space transforms [3, 7] and speaker clustering [16, 14]. The majority of these techniques apply some transformation to a canonical model. Originally a speaker-independent (SI) model was used as the canonical model. During recognition, the speaker-specific ... |

34 | Improved Acoustic Modeling for HMMs using Linear Transform - Leggetter - 1995 |

Citation Context ...ingle cluster weight vector for each speaker. To add flexibility to the system multiple cluster weight vectors may be used in CAT in a similar fashion to the piecewise linear approximation used in MLLR [11]. The Gaussian components are partitioned into a set of R disjoint cluster weight classes, M_w^(1) to M_w^(R). Each cluster weight class has a separate cluster weight vector calculated for it, which ... |
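The multiple-weight-class scheme in this context can be sketched as follows: each Gaussian component belongs to one of R disjoint classes, and each class owns its own weight vector over the P clusters. A minimal NumPy sketch with invented dimensions and a made-up class assignment:

```python
import numpy as np

# Hypothetical sketch: G Gaussian components split into R disjoint
# cluster-weight classes over P clusters (one weight vector per class,
# rather than a single vector per speaker).
G, P, R, d = 8, 3, 2, 2
rng = np.random.default_rng(2)

# Per-component cluster means: M[g] is a d x P matrix for component g.
M = rng.normal(size=(G, d, P))

# Class assignment for each component (e.g. from acoustic clustering
# or phonetic questions, as with MLLR regression classes).
class_of = np.array([0, 0, 0, 1, 1, 1, 1, 0])

# One weight vector per class; each row sums to one.
weights = rng.dirichlet(np.ones(P), size=R)  # shape (R, P)

# Each component's adapted mean uses its own class's weight vector.
adapted = np.stack([M[g] @ weights[class_of[g]] for g in range(G)])
```

More classes means more weights to estimate per speaker, which is the trade-off the context alludes to.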

30 | Structural MAP speaker adaptation using hierarchical priors - Shinoda, Lee - 1997 |

Citation Context ... In the case of large vocabulary systems with one adaptation sentence, only a small fraction of Gaussian components will thus be adapted. Extensions to MAP are possible, such as structured MAP (SMAP) [18]. SMAP allows unobserved Gaussian components to be adapted, but requires linear transforms to be estimated. As previously mentioned, even the simplest linear transforms typically require more paramete... |

26 | Cluster adaptive training for speech recognition - Gales - 1998 |

Citation Context ... of distinct speaker clusters, the speaker model's mean parameters are determined by a linear combination of all the cluster means. [Footnotes: Authors have also used the term adapted training. A shortened version of this paper describing the basic CAT algorithm was presented at ICSLP'98 [6]. Figure 1: Cluster adaptive training.] The Gaussia... |

25 | Experiments in speaker normalisation and adaptation for large vocabulary speech recognition - Pye, Woodland - 1997 |

Citation Context ...omes large, for example on the ARPA Hub4 task where there are greater than 5,000 such conditions, standard SAT training rapidly becomes impractical. While there are possible solutions to this problem [13, 15], they typically significantly increase the training time, or are no longer based on an ML training criterion. 8 This is not true of the modified SAT scheme described in [7]. 5 4.2 Model-based clusters ... |

14 | Training data clustering for improved speech recognition - Sankar, Beaufays, et al. - 1995 |

Citation Context ...riety of transforms have been examined, for example, vocal tract normalisation [10], maximum likelihood linear regression (MLLR) [12], constrained model-space transforms [3, 7] and speaker clustering [16, 14]. The majority of these techniques apply some transformation to a canonical model. Originally a speaker-independent (SI) model was used as the canonical model. During recognition, the speaker-specific ... |

13 | Practical implementations of speaker adaptive training - Schwartz - 1997 |

Citation Context ...omes large, for example on the ARPA Hub4 task where there are greater than 5,000 such conditions, standard SAT training rapidly becomes impractical. While there are possible solutions to this problem [13, 15], they typically significantly increase the training time, or are no longer based on an ML training criterion. 8 This is not true of the modified SAT scheme described in [7]. 5 4.2 Model-based clusters ... |

11 | Transformation Smoothing for Speaker and Environmental Adaptation - Gales - 1997 |

3 | Speaker Normalisation Using Efficient Frequency Warping Procedures - Lee, Rose - 1996 |

Citation Context ...ated. However, the transform should also be powerful enough to accurately model the speaker or acoustic environment. A variety of transforms have been examined, for example, vocal tract normalisation [10], maximum likelihood linear regression (MLLR) [12], constrained model-space transforms [3, 7] and speaker clustering [16, 14]. The majority of these techniques apply some transformation to a canonical... |