## Online Bayesian tree-structured transformation of HMMs with optimal model selection for speaker adaptation (2001)

Venue: IEEE Trans. Speech and Audio Proc.

Citations: 7 (2 self)

### BibTeX

@ARTICLE{Wang01onlinebayesian,

author = {Shaojun Wang and Yunxin Zhao},

title = {Online Bayesian tree-structured transformation of HMMs with optimal model selection for speaker adaptation},

journal = {IEEE Trans. Speech and Audio Proc},

year = {2001},

pages = {663--677}

}

### Abstract

This paper presents a new recursive Bayesian learning approach to transformation parameter estimation in speaker adaptation. Our goal is to incrementally transform, or adapt, a set of hidden Markov model (HMM) parameters for a new speaker and to gain a large performance improvement from a small amount of adaptation data. By constructing a clustering tree of HMM Gaussian mixture components, the linear regression (LR) or affine transformation parameters for the HMM Gaussian mixture components are dynamically searched. An online Bayesian learning technique is proposed for recursive maximum a posteriori (MAP) estimation of LR and affine transformation parameters. This technique can accommodate flexible forms of transformation functions as well as of a priori probability density functions (pdfs). To balance model complexity against goodness of fit to the adaptation data, a dynamic programming algorithm is developed for selecting models using a Bayesian variant of the “minimum description length” (MDL) principle. Speaker adaptation experiments with a 26-letter English alphabet vocabulary were conducted, and the results confirmed the effectiveness of the online learning framework.

Index Terms: Affine transformation, Bayesian model selection, hidden Markov models (HMMs), linear regression (LR), model …

### Citations

8797 |
Introduction to Algorithms
- Cormen, Leiserson, et al.
- 1990
Citation Context ... Either a bottom-up dynamic programming algorithm or a top-down recursive algorithm can be performed to obtain the MDL. In general, the bottom-up approach is more efficient than the top-down approach [12]. However, for this problem, the two approaches have the same computational complexity. Li et al. [35] used the top-down recursive algorithm to calculate the MDL in the MLE sense for generalizing case... |
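The bottom-up dynamic program over tree cuts described in this snippet can be sketched as follows. This is a hypothetical illustration, not the paper's implementation: the `Node` structure and the per-node costs are invented, and `dl_self` stands in for whatever description length the chosen MDL criterion assigns to keeping a single transformation at that node.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    dl_self: float              # description length if this node is kept as a cut point
    children: list = field(default_factory=list)

def min_description_length(node):
    """Return (minimal DL, cut nodes) for the subtree rooted at node.

    Bottom-up rule: DL(node) = min(dl_self(node), sum of DL over children).
    """
    if not node.children:                      # a leaf must be a cut point
        return node.dl_self, [node]
    child_dl, child_cut = 0.0, []
    for c in node.children:
        dl, cut = min_description_length(c)
        child_dl += dl
        child_cut += cut
    if node.dl_self <= child_dl:               # one shared model here is cheaper
        return node.dl_self, [node]
    return child_dl, child_cut                 # otherwise split into the children

# Toy tree: the root alone costs 10.0; its two subtrees cost 4.0 + 3.0 together.
root = Node(10.0, [Node(4.0), Node(3.0, [Node(2.5), Node(1.0)])])
best, cut = min_description_length(root)
```

Each node is visited once, so the cost is linear in the tree size, which is why the bottom-up and top-down variants coincide in complexity for this problem.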

1090 |
Linear and Nonlinear Programming
- Luenberger
- 1984
Citation Context ..., as suggested in [22], i.e., (7) (8) (9) and the modified algorithm has the desirable property of being locally monotonic when [37]. The optimal choice of at each step is determined by a line search [36] to maximize , where is approximated by In the choice of a priori pdf , we adopt the generalized Gaussian density (GGD), which has the form (10) In (9), the a posteriori estimate of can be solved by t... |

1002 |
The EM Algorithm and Extensions
- McLachlan, Krishnan
- 1997
Citation Context ...lly, our findings, together with future research directions and open problems, are summarized in Section VII. , where denotes the auxiliary function of log likelihood as defined in the EM algorithm [15], [37] and is the a priori pdf of with a hyperparameter . It follows that maximizing leads to improvements in [15], [37]. Inspired by Titterington’s work on recursive estimation using incomplete data [46], ... |

796 | A view of the EM algorithm that justifies incremental, sparse, and other variants
- Neal, Hinton
- 1998
Citation Context ...the model parameters updated at that time, the parameter estimates are not as accurate as batch training. Another approach [20], [21], [23] used an incremental version of the EM algorithm proposed in [39]. In the incremental EM approach, the conditional sufficient statistics of the nth observation are computed using the (n−1)th model estimates, and the sufficient statistics of the previous observations are unch... |
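The incremental-EM idea from this snippet can be sketched on a toy one-dimensional, two-component Gaussian mixture. This is a hypothetical illustration under simplifying assumptions (equal weights, unit variances, only the means estimated); the key mechanic is that revisiting observation i replaces only that observation's cached sufficient statistics while all other contributions stay untouched.

```python
import math

def resp(x, means, var=1.0):
    """Posterior responsibilities under two equal-weight unit-variance Gaussians."""
    w = [math.exp(-(x - m) ** 2 / (2 * var)) for m in means]
    s = sum(w)
    return [v / s for v in w]

def incremental_em(data, means, sweeps=20):
    stats = [[0.0, 0.0] for _ in range(2)]          # running totals: [sum r, sum r*x]
    cache = [[[0.0, 0.0], [0.0, 0.0]] for _ in data]  # per-observation contributions
    for sweep in range(sweeps):
        for i, x in enumerate(data):
            r = resp(x, means)
            for k in range(2):
                # swap out observation i's stale statistics, swap in the fresh ones
                stats[k][0] += r[k] - cache[i][k][0]
                stats[k][1] += r[k] * x - cache[i][k][1]
                cache[i][k] = [r[k], r[k] * x]
            if sweep > 0:                            # partial M-step after each point
                means = [stats[k][1] / stats[k][0] for k in range(2)]
        if sweep == 0:                               # warm-up sweep: M-step at the end
            means = [stats[k][1] / stats[k][0] for k in range(2)]
    return means

means = incremental_em([-2.1, -1.9, 1.8, 2.2], [-1.0, 1.0])
```

The warm-up sweep here is a guard against early collapse with only a few cached points; after it, the frequent partial M-steps are what distinguish this variant from batch EM.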

693 |
An Introduction to Signal Detection and Estimation
- Poor
- 1994
Citation Context ...tree after each or several updates of parameter estimates. This problem has been investigated by Chien [6] in a batch algorithm. Another direction is using the sequential hypothesis testing technique [40] as a verification scheme for evaluating the transformation reliability, which was shown very useful for unsupervised speaker adaptation [9]. In designing a learning algorithm, it is often important t... |

622 | Maximum likelihood linear regression for speaker adaptation of HMMs
- Leggetter, Woodland
- 1995
Citation Context ...quires exploiting relationship among acoustic-phonetic units. Speaker adaptation techniques can be categorized into the approaches of Bayesian estimation [22], [32] and parameter transformation [16], [34]. Bayesian estimation has the asymptotic property that by using a sufficiently large amount of adaptation data from a speaker, SI acoustic models will be converged to speaker-dependent acoustic models... |

608 |
Introduction to computational Learning Theory
- Kearns, Vazirani
- 1994
Citation Context ...tural complexity and parameterization complexity to best fit the adaptation data, the goal being minimization of recognition error. The model selection problem is coarsely prefigured by Occam’s Razor [29]: given two hypotheses that fit the data equally well, prefer the simpler one. Rissanen distills such thinking in his Principle of MDL: choose the model that gives the shortest description of data. Or... |
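The MDL trade-off this snippet describes can be made concrete with a standard two-part code: model cost (k/2)·log n for k parameters plus data cost as the negative log-likelihood. This is a generic textbook instantiation, not the paper's Bayesian variant; the Gaussian model family and the data are invented for illustration.

```python
import math

def nll_gauss(xs, mu, sigma=1.0):
    """Negative log-likelihood of xs under a unit-variance Gaussian with mean mu."""
    return sum(0.5 * math.log(2 * math.pi * sigma**2)
               + (x - mu) ** 2 / (2 * sigma**2) for x in xs)

def mdl(groups, n):
    """Two-part MDL score: parameter cost (k/2) log n + data cost (one mean per group)."""
    k = len(groups)
    data_cost = sum(nll_gauss(g, sum(g) / len(g)) for g in groups)
    return data_cost + 0.5 * k * math.log(n)

data = [-3.1, -2.9, -3.0, 3.0, 2.9, 3.1]
one_mean = mdl([data], len(data))                  # one shared mean: k = 1
two_means = mdl([data[:3], data[3:]], len(data))   # one mean per cluster: k = 2
```

The two-mean model pays for an extra parameter but fits the well-separated clusters far better, so its total score wins, which is exactly the "fit equally well → prefer simpler; fit much better → pay for complexity" behavior of MDL.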

602 |
The computational complexity of probabilistic inference using Bayesian belief networks
- Cooper
- 1990
Citation Context ...algorithm, the current parameter estimates are used to decode the current adaptation utterance only, leaving the previous decoding results unchanged. In general, learning a graphical model is NP-hard [11]. For a tree-structured model, we could in principle calculate model description length for every possible tree cut and take the model for... |

514 | Maximum a Posteriori Estimation for Multivariate Gaussian Mixture Observations of Markov Chains
- Gauvain, Lee
- 1994
Citation Context ...ount of enrollment speech, which in general requires exploiting relationship among acoustic-phonetic units. Speaker adaptation techniques can be categorized into the approaches of Bayesian estimation [22], [32] and parameter transformation [16], [34]. Bayesian estimation has the asymptotic property that by using a sufficiently large amount of adaptation data from a speaker, SI acoustic models will be ... |

425 | Maximum likelihood linear transformations for HMMbased speech recognition
- Gales
- 1998
Citation Context ...cumulated sufficient statistics are computed from each utterance using the model parameters updated at that time, the parameter estimates are not as accurate as batch training. Another approach [20], [21], [23] used an incremental version of the EM algorithm proposed in [39]. In the incremental EM approach, the conditional sufficient statistics of the nth observation are computed using the (n−1)th model estimat... |

420 |
Maximum likelihood estimation from incomplete data using the EM algorithm (with discussion)
- DEMPSTER, LAIRD, et al.
- 1977
Citation Context .... Finally, our findings, together with future research directions and open problems, are summarized in Section VII. , where denotes the auxiliary function of log likelihood as defined in the EM algorithm [15], [37] and is the a priori pdf of with a hyperparameter . It follows that maximizing leads to improvements in [15], [37]. Inspired by Titterington’s work on recursive estimation using incomplete data ... |

112 | Mean and variance adaptation within the MLLR framework
- Gales, Woodland
- 1996
Citation Context ...the accumulated sufficient statistics are computed from each utterance using the model parameters updated at that time, the parameter estimates are not as accurate as batch training. Another approach [20], [21], [23] used an incremental version of the EM algorithm proposed in [39]. In the incremental EM approach, the conditional sufficient statistics of the nth observation are computed using the (n−1)th model e... |

106 | Generalizing case frames using a thesaurus and the MDL principle
- Li, Abe
- 1998
Citation Context ...o obtain the MDL. In general, the bottom-up approach is more efficient than the top-down approach [12]. However, for this problem, the two approaches have the same computational complexity. Li et al. [35] used the top-down recursive algorithm to calculate the MDL in the MLE sense for generalizing case frames in natural language processing. Shinoda et al. [42] used the top-down recursive algorithm to c... |

92 | Speaker adaptation using constrained estimation of Gaussian mixtures
- Digalakis, Rtischev, et al.
- 1995
Citation Context ...ral requires exploiting relationship among acoustic-phonetic units. Speaker adaptation techniques can be categorized into the approaches of Bayesian estimation [22], [32] and parameter transformation [16], [34]. Bayesian estimation has the asymptotic property that by using a sufficiently large amount of adaptation data from a speaker, SI acoustic models will be converged to speaker-dependent acoustic ... |

Structural maximum a posteriori linear regression for fast HMM adaptation (submitted for publication)
- Siohan, Myrvoll, et al.
Citation Context ... forms of transformation functions were limited to those having reproducible a priori/a posteriori probability density function (pdf) pairs, which were either conjugate [22] or elliptically symmetric [5], [7], and were unfortunately few. Recognition performance was shown sensitive to the parameter update interval lengths, the longer the better. In this paper, we propose applying a recursive Bayesian ... |

71 |
Semi-continuous hidden Markov models for speech signals
- Huang, Jack
- 1989
Citation Context ... widely studied. Since given a small adaptation data set, it is unlikely to have sufficient speech data for all hidden Markov model (HMM) units, certain parameter correlation [1], [13] and tying [2], [25] are introduced so that the model parameters can be consistently and fully adjusted. Correlations among Gaussian mean parameter vectors have been used in HMM parameter adaptation [31] and tying has be... |

68 |
Tied mixture continuous parameter modeling for speech recognition
- Bellegarda, Nahamoo
- 1990
Citation Context ... been widely studied. Since given a small adaptation data set, it is unlikely to have sufficient speech data for all hidden Markov model (HMM) units, certain parameter correlation [1], [13] and tying [2], [25] are introduced so that the model parameters can be consistently and fully adjusted. Correlations among Gaussian mean parameter vectors have been used in HMM parameter adaptation [31] and tying ... |

60 |
Recursive parameter estimation using incomplete data
- Titterington
- 1984
Citation Context ..., [37] and is the a priori pdf of with a hyperparameter . It follows that maximizing leads to improvements in [15], [37]. Inspired by Titterington’s work on recursive estimation using incomplete data [46], a recursive estimation formula can be derived for by taking the normalized auxiliary function as the objective function. Maximizing the second-order Taylor series expansion of with respect to and de... |

54 | Natural Statistical Models for Automatic Speech Recognition
- Bilmes
- 1998
Citation Context ...ively small number of parameters are used to characterize the dependency. Examples include Markov random fields [41], multi-scale tree processes [28], tree-structural MAP adaptation [43], buried HMMs [3], and dynamic Bayesian networks [51]. Among these methods, Digalakis et al. [19] compared the first three and concluded that significant gain in accuracy can be obtained by exploiting dependency among... |

48 | Speaker adaptation using combined transformation and Bayesian methods
- Digalakis, Neumeyer
- 1996
Citation Context ...ch is small. However, parameter transformation may not lead to convergence to speaker-dependent models. Adaptation algorithms have been proposed to exploit the advantages of both approaches [6], [8], [17], [44]. These algorithms can achieve a large adaptation effect when using a small amount of data and maintain the asymptotic property when using a large amount of data. Furthermore, speaker adaptation... |

41 |
On stochastic feature and model compensation approaches to robust speech recognition
- Lee
- 1998
Citation Context ...f enrollment speech, which in general requires exploiting relationship among acoustic-phonetic units. Speaker adaptation techniques can be categorized into the approaches of Bayesian estimation [22], [32] and parameter transformation [16], [34]. Bayesian estimation has the asymptotic property that by using a sufficiently large amount of adaptation data from a speaker, SI acoustic models will be conver... |

32 | Probabilistic modeling with bayesian networks for automatic speech recognition
- Zweig, Russell
- 1998
Citation Context ...e used to characterize the dependency. Examples include Markov random fields [41], multi-scale tree processes [28], tree-structural MAP adaptation [43], buried HMMs [3], and dynamic Bayesian networks [51]. Among these methods, Digalakis et al. [19] compared the first three and concluded that significant gain in accuracy can be obtained by exploiting dependency among acoustic model parameters. Chien [7... |

30 | On-line adaptive learning of the continuous density hidden Markov model based on approximate recursive Bayes estimate
- Huo, Lee
- 1997
Citation Context ...ence, online speaker adaptation in general requires less computation and memory as compared with batch adaptation. Several approaches appeared in the literature for online adaptation [7], [18], [23], [26], [27], [48], [50]. One approach [48], [50] applied the expectation-maximization (EM) algorithm or the segmental k-means algorithm sequentially to online test speech to accomplish unsupervised learning of mode... |

29 |
Schwarz, Wallace, and Rissanen: Intertwining themes in theories of model selection
- Lanterman
Citation Context ...nd mixture sequence, and the feature vectors are again assigned into the tree nodes for estimation of transformation parameters. A Bayesian variant of the “minimum description length” (MDL) principle [30] is then used to determine optimal tree size and transformation matrix forms and the HMM parameters are adapted by the transformation functions. The steps from Viterbi decoding through model parameter... |

29 | On Adaptive decision rules and decision parameter adaptation for automatic speech recognition
- Lee, Huo
- 2000
Citation Context ... adaptation effect when using a small amount of data and maintain the asymptotic property when using a large amount of data. Furthermore, speaker adaptation may operate in batch or online modes [32], [33]. In batch mode, adaptation is performed over a set of enrollment speech data. In online mode, adaptation is performed incrementally and data are discarded after usage. As a consequence, online speake... |

29 |
Structural MAP speaker adaptation using hierarchical priors
- Shinoda, Lee
- 1997
Citation Context ... and hence a relatively small number of parameters are used to characterize the dependency. Examples include Markov random fields [41], multi-scale tree processes [28], tree-structural MAP adaptation [43], buried HMMs [3], and dynamic Bayesian networks [51]. Among these methods, Digalakis et al. [19] compared the first three and concluded that significant gain in accuracy can be obtained by exploiting... |

23 | The sample complexity of learning fixed-structure bayesian networks
- Dasgupta
- 1997
Citation Context ...ples needed for training a model, such that for new testing data, one has certain confidence that the probability of making an error is under certain level. This problem is known as sample complexity [14] and it is often formulated in a probably approximately correct (PAC) sense of learning [29]. Upper and lower bounds on sample complexity can be derived in the PAC framework for a learning algorithm, ... |

20 |
Combined Bayesian and predictive techniques for rapid speaker adaptation of continuous density hidden Markov models
- Ahadi, Woodland
- 1997
Citation Context ...eaker adaptation has been widely studied. Since given a small adaptation data set, it is unlikely to have sufficient speech data for all hidden Markov model (HMM) units, certain parameter correlation [1], [13] and tying [2], [25] are introduced so that the model parameters can be consistently and fully adjusted. Correlations among Gaussian mean parameter vectors have been used in HMM parameter adapta... |

19 |
A posteriori estimation of correlated jointly Gaussian mean vectors
- Lasry, Stern
- 1984
Citation Context ...] and tying [2], [25] are introduced so that the model parameters can be consistently and fully adjusted. Correlations among Gaussian mean parameter vectors have been used in HMM parameter adaptation [31] and tying has been widely used in transformation-based adaptation [34]. Certain techniques also relate model parameters across all classes by making Markovian assumptions on the dependency structure,... |

18 |
Predictive speaker adaptation in speech recognition
- Cox
- 1995
Citation Context ... adaptation has been widely studied. Since given a small adaptation data set, it is unlikely to have sufficient speech data for all hidden Markov model (HMM) units, certain parameter correlation [1], [13] and tying [2], [25] are introduced so that the model parameters can be consistently and fully adjusted. Correlations among Gaussian mean parameter vectors have been used in HMM parameter adaptation [... |

15 |
Speaker adaptation with autonomous model complexity control by MDL principle
- Shinoda, Watanabe
- 1996
Citation Context ...ansformation cluster. The tree-structured clustering technique provides a hierarchical way of defining transformation clusters. To construct a hierarchical tree, we follow the procedure of Chien [7], [42] where the Gaussian mixture components of HMMs are clustered by using the binary split k-means algorithm with a divergence measure (trace form), where the merged Gaussian pdf was obtained from the Gaussian pdfs grouped i... |
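A divergence of the kind used above for clustering Gaussian mixture components can be sketched for the diagonal-covariance case. Since the exact measure is garbled in this snippet, the code below uses the standard symmetric (J-) KL divergence between Gaussians, whose trace terms match the trace form mentioned in the text; it is an assumed stand-in, not necessarily the paper's exact formula.

```python
def sym_kl_diag(mu1, var1, mu2, var2):
    """Symmetric KL (J-) divergence between two diagonal-covariance Gaussians.

    Per dimension: 0.5*(v1/v2 + v2/v1 - 2)  — the tr(S1 S2^-1) + tr(S2 S1^-1) part —
    plus 0.5*(m1 - m2)^2 * (1/v1 + 1/v2)    — the mean-separation part.
    """
    d = 0.0
    for m1, v1, m2, v2 in zip(mu1, var1, mu2, var2):
        d += 0.5 * (v1 / v2 + v2 / v1 - 2.0)
        d += 0.5 * (m1 - m2) ** 2 * (1.0 / v1 + 1.0 / v2)
    return d

# identical Gaussians have zero divergence; mean separation increases it
same = sym_kl_diag([0.0, 0.0], [1.0, 1.0], [0.0, 0.0], [1.0, 1.0])
far  = sym_kl_diag([0.0, 0.0], [1.0, 1.0], [3.0, 0.0], [1.0, 1.0])
```

In a binary-split k-means over mixture components, such a divergence plays the role of the distance between a component and a cluster centroid obtained by merging the component pdfs assigned to that cluster.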

14 | A speaker-independent continuous speech recognition system using continuous mixture Gaussian density HMM of phoneme-sized units - Zhao - 1993 |

13 |
Batch incremental and instantaneous adaptation techniques for speech recognition
- Zavaliagkos, Schwartz, et al.
- 1995
Citation Context ... speaker adaptation in general requires less computation and memory as compared with batch adaptation. Several approaches appeared in the literature for online adaptation [7], [18], [23], [26], [27], [48], [50]. One approach [48], [50] applied the expectation-maximization (EM) algorithm or the segmental k-means algorithm sequentially to online test speech to accomplish unsupervised learning of model parameters... |

12 |
Maximum a posterior linear regression with elliptically symmetric matrix variate priors
- Chou
- 1999
Citation Context ...rameters, and the information matrix will be . A fast algorithm for matrix inversion is for matrix, so the proposed method needs operations for each , while an incremental EM algorithm for MAPLR [5], [10] needs operations for each . One direction to further improve the performance of our adaptation technique is to incorporate tree-structural learning during the online learning process, that is, to upd... |

12 |

Hidden Markov modeling using a dominant state sequence with application to speech recognition

- Merhav, Ephraim

Citation Context ...vations , with each generated by an HMM, the likelihood of observation sequences is approximated by the joint likelihood of the dominant state and mixture index sequences and the observation sequences [38], that is , where , and are the optimal state and mixture index sequences determined by the Viterbi algorithm. Each feature ve... |

11 |
A Markov random field approach to Bayesian speaker adaptation
- Shahshahani
- 1997
Citation Context ...int correlation is represented by a low-order conditional distribution and hence a relatively small number of parameters are used to characterize the dependency. Examples include Markov random fields [41], multi-scale tree processes [28], tree-structural MAP adaptation [43], buried HMMs [3], and dynamic Bayesian networks [51]. Among these methods, Digalakis et al. [19] compared the first three and con... |

11 |
Non-linear compensation for stochastic matching
- Surendran, Lee, et al.
- 1999
Citation Context ...many cases the mismatch is nonlinear and the functional form is unknown, extensive studies show that LR and affine transformation give rather large improvement to speech recognition performance [33], [45], and LR and affine transformation can be viewed as first-order approximations to nonlinear mismatch functions in both model and feature spaces. However, previous efforts in online Bayesian estimation... |

10 | Online adaptation of hidden Markov models using incremental estimation algorithms
- Digalakis
- 1999
Citation Context ...As a consequence, online speaker adaptation in general requires less computation and memory as compared with batch adaptation. Several approaches appeared in the literature for online adaptation [7], [18], [23], [26], [27], [48], [50]. One approach [48], [50] applied the expectation-maximization (EM) algorithm or the segmental k-means algorithm sequentially to online test speech to accomplish unsupervised lear... |

8 | A hybrid algorithm for speaker adaptation using MAP transformation and adaptation - Chien, Lee, et al. - 1997 |

6 | Efficient training algorithms for HMM’s using incremental estimation
- Gotoh, Hochberg, et al.
- 1998
Citation Context ...onsequence, online speaker adaptation in general requires less computation and memory as compared with batch adaptation. Several approaches appeared in the literature for online adaptation [7], [18], [23], [26], [27], [48], [50]. One approach [48], [50] applied the expectation-maximization (EM) algorithm or the segmental k-means algorithm sequentially to online test speech to accomplish unsupervised learning o... |

4 |
Comments on “Efficient training algorithms for HMMs using incremental estimation
- Byrne, Gunawardana
- 2000
Citation Context ...onditional sufficient statistics of all the training data are recalculated using the latest parameter estimates. Even though likelihood may not be monotonically increased as in the batch EM algorithm [4], convergence of this incremental EM algorithm has... |

4 |
Convergence of EM variants
- Gunawardana, Byrne
- 1999
Citation Context ... been proved recently [24] by Csiszár’s alternating minimization procedure. However, as pointed out by Digalakis [18], this incremental EM algorithm is not an online algorithm since multiple passes through the data are perform... |

4 | On-line adaptive learning of CDHMM parameters based on multiple-stream prior evolution and posterior pooling - Huo, Ma - 1999 |

3 |
Unsupervised hierarchical adaptation using reliable selection of cluster-dependent parameters
- Chien, Junqua
- 2000
Citation Context ...ction is using the sequential hypothesis testing technique [40] as a verification scheme for evaluating the transformation reliability, which was shown very useful for unsupervised speaker adaptation [9]. In designing a learning algorithm, it is often important to know the amount of samples needed for training a model, such that for new testing data, one has certain confidence that the probability of... |

2 |
Rapid speech recognizer adaptation to new speakers

- Digalakis, et al.
Citation Context ...les include Markov random fields [41], multi-scale tree processes [28], tree-structural MAP adaptation [43], buried HMMs [3], and dynamic Bayesian networks [51]. Among these methods, Digalakis et al. [19] compared the first three and concluded that significant gain in accuracy can be obtained by exploiting dependency among acoustic model parameters. Chien [7] recently developed an online Bayesian tran... |

2 |
Statistical recursive estimation algorithms for speaker adaptation
- Wang
- 2000
Citation Context ...certain cases, can be replaced by its expectation, i.e., the complete-data Fisher information matrix . In this paper, however, we consider only the case of using , and more details on can be found in [47]. From (2) and (3), we can see that the effect of a priori information decreases as the number of observations becomes large. The batch algorithm of (2) and (3) is next converted into a recursive esti... |

1 |
Hybrid adaptation of tree structure and hidden Markov models for robust speech recognition
- Chien
- 1999
Citation Context ...lment speech is small. However, parameter transformation may not lead to convergence to speaker-dependent models. Adaptation algorithms have been proposed to exploit the advantages of both approaches [6], [8], [17], [44]. These algorithms can achieve a large adaptation effect when using a small amount of data and maintain the asymptotic property when using a large amount of data. Furthermore, speaker... |

1 |
Online hierarchical transformation of hidden Markov models for speech recognition

- Chien
- 1999
Citation Context ...age. As a consequence, online speaker adaptation in general requires less computation and memory as compared with batch adaptation. Several approaches appeared in the literature for online adaptation [7], [18], [23], [26], [27], [48], [50]. One approach [48], [50] applied the expectation-maximization (EM) algorithm or the segmental k-means algorithm sequentially to online test speech to accomplish unsupervise... |

1 |
Modeling parameter dependence in speaker adaptation using multiscale tree processes
- Kannan, Ostendorf
- 1998
Citation Context ... a low-order conditional distribution and hence a relatively small number of parameters are used to characterize the dependency. Examples include Markov random fields [41], multi-scale tree processes [28], tree-structural MAP adaptation [43], buried HMMs [3], and dynamic Bayesian networks [51]. Among these methods, Digalakis et al. [19] compared the first three and concluded that significant gain in a... |

1 |
Joint maximum a posterior adaptation of transformation and hidden Markov model parameters
- Siohan, Chesta, et al.
Citation Context ...small. However, parameter transformation may not lead to convergence to speaker-dependent models. Adaptation algorithms have been proposed to exploit the advantages of both approaches [6], [8], [17], [44]. These algorithms can achieve a large adaptation effect when using a small amount of data and maintain the asymptotic property when using a large amount of data. Furthermore, speaker adaptation may o... |