## Maximum A Posteriori Estimation for Multivariate Gaussian Mixture Observations of Markov Chains (1994)

Venue: IEEE Transactions on Speech and Audio Processing

Citations: 546 (42 self)

### BibTeX

@ARTICLE{Gauvain94maximuma,
  author  = {Jean-Luc Gauvain and Chin-Hui Lee},
  title   = {Maximum A Posteriori Estimation for Multivariate Gaussian Mixture Observations of Markov Chains},
  journal = {IEEE Transactions on Speech and Audio Processing},
  year    = {1994},
  volume  = {2},
  pages   = {291--298}
}


### Abstract

In this paper a framework for maximum a posteriori (MAP) estimation of hidden Markov models (HMMs) is presented. Three key issues of MAP estimation, namely the choice of prior distribution family, the specification of the parameters of the prior densities, and the evaluation of the MAP estimates, are addressed. Using HMMs with Gaussian mixture state observation densities as an example, it is assumed that the prior densities for the HMM parameters can be adequately represented as a product of Dirichlet and normal-Wishart densities. The classical maximum likelihood estimation algorithms, namely the forward-backward algorithm and the segmental k-means algorithm, are expanded, and MAP estimation formulas are developed. Prior density estimation issues are discussed for two classes of applications, parameter smoothing and model adaptation, and some experimental results are given illustrating the practical interest of this approach. Because of its adaptive nature, Bayesian learning is shown to serve as a unified approach for a wide range of speech recognition applications.

### Citations

9054 | Maximum likelihood from incomplete data via the EM algorithm
- Dempster, Laird, et al.
- 1977
Citation Context: ...idden process, i.e. the state mixture component and the state sequence of a Markov chain for an HMM. In these cases ML estimates are usually obtained using the expectation-maximization (EM) algorithm [6, 1, 28]. For HMM parameter estimation this algorithm is also called the Baum-Welch algorithm. The EM algorithm is an iterative procedure for approximating ML estimates in the general case of models involving...
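The EM iteration this context refers to can be illustrated on a toy case. The sketch below runs one EM step for a two-component 1-D Gaussian mixture; the function and variable names are mine, and this is a simplification of the Baum-Welch case, which additionally sums over hidden state sequences.

```python
import math

def em_step(data, weights, means, variances):
    """One EM iteration for a 1-D Gaussian mixture (illustrative sketch).

    E-step: posterior responsibility of each component for each point.
    M-step: reestimate weights, means, and variances from responsibilities.
    """
    K = len(weights)
    resp = []
    for x in data:
        p = [weights[k] * math.exp(-(x - means[k]) ** 2 / (2 * variances[k]))
             / math.sqrt(2 * math.pi * variances[k]) for k in range(K)]
        s = sum(p)
        resp.append([pk / s for pk in p])
    n_k = [sum(r[k] for r in resp) for k in range(K)]
    new_w = [n / len(data) for n in n_k]
    new_m = [sum(r[k] * x for r, x in zip(resp, data)) / n_k[k] for k in range(K)]
    new_v = [sum(r[k] * (x - new_m[k]) ** 2 for r, x in zip(resp, data)) / n_k[k]
             for k in range(K)]
    return new_w, new_m, new_v
```

Each such iteration is guaranteed not to decrease the likelihood of the data, which is the property the EM/Baum-Welch algorithm relies on.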

4178 | Pattern Classification and Scene Analysis
- Duda, Hart
- 1973
Citation Context: ...tored into two terms f(x|θ) = h(x)k(θ|t(x)), such that h(x) is independent of θ and k(θ|t(x)) is the kernel density, which is a function of θ and depends on x only through the sufficient statistic t(x) [27, 5, 7]. In this case, the natural solution is to choose the prior density in a conjugate family {k(·|φ); φ ∈ Φ}, which includes the kernel density of f(·|θ). The MAP estimation is then reduced to ...

842 | A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains (The Annals of Mathematical Statistics)
- Baum, Soules, et al.
- 1970
Citation Context: ... recognition applications. 1 Introduction. Estimation of a probabilistic function of a Markov chain, also called a hidden Markov model (HMM), is usually obtained by the method of maximum likelihood (ML) [1, 2, 23, 15], which assumes that the size of the training data is large enough to provide robust estimates. This paper investigates maximum a posteriori (MAP) estimation of continuous density hidden Markov models ...

826 | Optimal Statistical Decisions
- DeGroot
- 1971
Citation Context: ...ht-forwardly be extended to the subcases of discrete density HMM and tied-mixture HMM. The MAP estimate can be seen as a Bayes estimate of the vector parameter when the loss function is not specified [5]. The MAP estimation framework provides a way of incorporating prior information in the training process, which is particularly useful for dealing with problems posed by sparse training data, for which...

806 | The Viterbi algorithm
- Forney
- 1973
Citation Context: ... G(λ^(m+1), s^(m+1) | x) ≥ G(λ^(m), s^(m) | x), with s^(m+1) = argmax_s f(x, s | λ^(m)) (42) and λ^(m+1) = argmax_λ f(x, s^(m+1) | λ) G(λ) (43). The most likely state sequence s^(m+1) is decoded by the Viterbi algorithm [9]. Maximization over s can also be replaced by any hill-climbing procedure over s, subject to the constraint that f(x, s^(m+1) | λ^(m+1)) G(λ^(m+1)) ≥ f(x, s^(m+1) | λ^(m)) G(λ^(m)). The EM algorithm is once agai...
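For reference, the Viterbi decoding step cited in this context can be sketched in its textbook dynamic-programming form for a discrete-observation HMM. The dictionary-based interface below is my own; the paper applies the same recursion with Gaussian-mixture emission densities instead of lookup tables.

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Viterbi decoding: most likely state sequence for an observation
    sequence under a discrete HMM (standard dynamic-programming sketch).
    V[t][s] stores (best path probability ending in s at time t, predecessor)."""
    V = [{s: (start_p[s] * emit_p[s][obs[0]], None) for s in states}]
    for t in range(1, len(obs)):
        V.append({})
        for s in states:
            best_prev = max(states, key=lambda r: V[t - 1][r][0] * trans_p[r][s])
            V[t][s] = (V[t - 1][best_prev][0] * trans_p[best_prev][s] * emit_p[s][obs[t]],
                       best_prev)
    # Backtrack from the best final state.
    last = max(states, key=lambda s: V[-1][s][0])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(V[t][path[-1]][1])
    return list(reversed(path))
```

In practice, log probabilities are used to avoid underflow on long observation sequences; the sketch keeps raw probabilities for readability.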

550 | An inequality and associated maximization technique in statistical estimation for probabilistic functions of Markov processes (Inequalities)
- Baum
- 1972
Citation Context: ... recognition applications. 1 Introduction. Estimation of a probabilistic function of a Markov chain, also called a hidden Markov model (HMM), is usually obtained by the method of maximum likelihood (ML) [1, 2, 23, 15], which assumes that the size of the training data is large enough to provide robust estimates. This paper investigates maximum a posteriori (MAP) estimation of continuous density hidden Markov models ...

549 |
Mixture densities, maximum likelihood and the EM algorithm
- Redner, Walker
- 1984
Citation Context: ...idden process, i.e. the state mixture component and the state sequence of a Markov chain for an HMM. In these cases ML estimates are usually obtained using the expectation-maximization (EM) algorithm [6, 1, 28]. For HMM parameter estimation this algorithm is also called the Baum-Welch algorithm. The EM algorithm is an iterative procedure for approximating ML estimates in the general case of models involving...

105 |
The Segmental K-Means Algorithm for Estimating Parameters of Hidden Markov Models
- Juang, Rabiner
- 1990
Citation Context: ... maximized. The estimation procedure becomes λ̃ = argmax_λ max_s G(λ, s | x) = argmax_λ max_s f(x, s | λ) G(λ) (41), where λ̃ is referred to as the segmental MAP estimate of λ. As for the segmental k-means algorithm [16], it is straightforward to prove that, starting with any estimate λ^(m), alternate maximization over s and λ gives a sequence of estimates with non-decreasing values of G(λ, s | x), i.e. G(λ^(m+1), s^(m+1) | ...
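The alternate-maximization argument in this context is an instance of the generic coordinate-ascent pattern: fixing the parameters, pick the best segmentation; fixing the segmentation, reestimate the parameters; the objective can never decrease. A minimal sketch (function names and the toy objective are mine, not the paper's):

```python
def alternate_maximize(g, s0, theta0, best_s, best_theta, iters=10):
    """Alternating maximization: each half-step maximizes g over one block
    of variables with the other held fixed, so g(s, theta) is monotone
    non-decreasing across iterations (the property proved for the
    segmental MAP estimate)."""
    s, theta = s0, theta0
    history = [g(s, theta)]
    for _ in range(iters):
        s = best_s(theta)        # e.g. Viterbi segmentation given the parameters
        theta = best_theta(s)    # e.g. MAP reestimation given the segmentation
        history.append(g(s, theta))
    return s, theta, history

# Toy objective: g(s, theta) = -(s - theta)^2 - (theta - 3)^2.
g = lambda s, th: -(s - th) ** 2 - (th - 3.0) ** 2
s, theta, hist = alternate_maximize(g, 0.0, 0.0,
                                    best_s=lambda th: th,
                                    best_theta=lambda s: (s + 3.0) / 2.0)
```

As in the segmental MAP case, convergence is to a local maximum of the joint objective, not necessarily the global one.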

101 |
Maximum likelihood estimation for multivariate observations of Markov sources
- Liporace
- 1982
Citation Context: ... recognition applications. 1 Introduction. Estimation of a probabilistic function of a Markov chain, also called a hidden Markov model (HMM), is usually obtained by the method of maximum likelihood (ML) [1, 2, 23, 15], which assumes that the size of the training data is large enough to provide robust estimates. This paper investigates maximum a posteriori (MAP) estimation of continuous density hidden Markov models ...

75 |
Linear Statistical Inference and its Applications, 2nd Edition
- Rao
- 1973
Citation Context: ...tored into two terms f(x|θ) = h(x)k(θ|t(x)), such that h(x) is independent of θ and k(θ|t(x)) is the kernel density, which is a function of θ and depends on x only through the sufficient statistic t(x) [27, 5, 7]. In this case, the natural solution is to choose the prior density in a conjugate family {k(·|φ); φ ∈ Φ}, which includes the kernel density of f(·|θ). The MAP estimation is then reduced to ...

67 |
Maximum likelihood estimation for mixture multivariate stochastic observations of Markov chains
- Juang
- 1985

66 |
A study on speaker adaptation of the parameters of continuous density hidden Markov models
- Lee, Lin, et al.
- 1991
Citation Context: ...o be known and the prior density limited to a Gaussian. Brown et al. [3] used Bayesian estimation for speaker adaptation of CDHMM parameters in a connected digit recognizer. More recently, Lee et al. [20] investigated various training schemes of Gaussian mean and variance parameters using normal-gamma prior densities for speaker adaptation. They showed that on the alpha-digit vocabulary, with only a sm...

57 |
A Segmental K-Means Training Procedure for Connected Word
- Rabiner, Wilpon, et al.
- 1986
Citation Context: ...mine two ways of approximating MAP by local maximization of f(x | λ) G(λ) or of f(x, s | λ) G(λ). These two solutions are the MAP versions of the Baum-Welch algorithm [2] and of the segmental k-means algorithm [26], algorithms which were developed for ML estimation. 4.1 Forward-Backward MAP Estimate. From equation (24) it is straightforward to show that the auxiliary function of the EM algorithm applied to ML es...

42 |
Acoustic modeling for large vocabulary speech recognition
- Lee, Rabiner, et al.
- 1990
Citation Context: ...ance matrices are used and the transition probabilities are assumed fixed and known. Details of the recognition system and the basic assumptions for acoustic modeling of subword units can be found in [19]. As described in [21], a 38-dimensional feature vector composed of LPC-derived cepstrum coefficients and first- and second-order time derivatives was computed after the data were down-sampled to 8 kHz...

42 |
The empirical Bayes approach to statistical decision problems
- Robbins
- 1964
Citation Context: ...this family of p.d.f.'s {G(·|φ); φ ∈ Φ} is also assumed known, based on common or subjective knowledge about the stochastic process. An alternate solution is to adopt an empirical Bayes approach [29], where the prior parameters are estimated directly from the data. The estimation is then based on the marginal distribution of the data given the estimated prior parameters. In fact, part of the available...

38 | Bayesian learning for hidden Markov model with Gaussian mixture state observation densities
- Gauvain, Lee
- 1992
Citation Context: ...ur speech recognition applications: parameter smoothing, speaker adaptation, speaker group modeling and corrective training. We have previously reported experimental results for these applications in [10, 11, 12, 22]. In order to demonstrate the effectiveness of Bayesian estimation for such applications, some results are given here. I...

38 |
Distributions in Statistics
- Johnson, Kotz
- 1970
Citation Context: ...tion is in the form of a multinomial distribution. Then a practical candidate to model the prior knowledge about the mixture gain parameter vector is a conjugate density such as a Dirichlet density [14]: g(ω_1, ..., ω_K | ν_1, ..., ν_K) ∝ ∏_{k=1}^{K} ω_k^{ν_k − 1} (6). [Footnote: In the following the same term f is used to denote both the joint and the marginal p.d.f.'s, since it is not likely to cause confusion.] ...
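Under a Dirichlet prior of this form, the MAP estimate of the mixture gains takes a smoothed-count form: the posterior mode is proportional to (ν_k − 1 + c_k), where c_k is the (expected) count for component k. A sketch, with my own function name, valid when every ν_k ≥ 1:

```python
def map_mixture_weights(counts, nu):
    """Mode of the Dirichlet posterior for mixture gains:
    omega_k = (nu_k - 1 + c_k) / sum_j (nu_j - 1 + c_j), for nu_k >= 1.
    With nu_k = 1 for all k (a flat prior), this reduces to the ML
    relative-frequency estimate c_k / sum_j c_j."""
    num = [n - 1.0 + c for n, c in zip(nu, counts)]
    total = sum(num)
    return [x / total for x in num]
```

In the HMM setting the counts would be the expected component occupancies accumulated by the forward-backward pass, so the prior acts as pseudo-counts that smooth sparse-data estimates.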

33 |
Speaker adaptation based on MAP estimation of HMM parameters
- Lee, Gauvain
- 1993
Citation Context: ...ur speech recognition applications: parameter smoothing, speaker adaptation, speaker group modeling and corrective training. We have previously reported experimental results for these applications in [10, 11, 12, 22]. In order to demonstrate the effectiveness of Bayesian estimation for such applications, some results are given here. I...

30 | Bayesian learning of Gaussian mixture densities for hidden Markov models
- Gauvain, Lee
- 1991
Citation Context: ...ur speech recognition applications: parameter smoothing, speaker adaptation, speaker group modeling and corrective training. We have previously reported experimental results for these applications in [10, 11, 12, 22]. In order to demonstrate the effectiveness of Bayesian estimation for such applications, some results are given here. I...

27 | MAP estimation of continuous density HMM: theory and applications
- Gauvain, Lee
- 1992

24 |
On distributions admitting a sufficient statistic
- Koopman
- 1936
Citation Context: ...mation problem of finding the mode of the kernel density k(·|t(x)). However, among the distribution families of interest, only exponential families have a sufficient statistic of fixed dimension [4, 17]. When there is no sufficient statistic of a fixed dimension, MAP estimation, like ML estimation, is a much more difficult problem, because the posterior density is not expressible in terms of a fixed ...

21 | Cross-lingual Experiments with Phone Recognition
- Lamel, Gauvain
- 1993
Citation Context: ...00 utterances of a general English corpus [13], served as seed models for speaker and task adaptation. Another use of MAP estimation has recently been proposed for text-independent speaker identification [18], using a small amount of speaker-specific training data. 7 Conclusion. The theoretical framework for MAP estimation of multivariate Gaussian mixture density and HMM with Gaussian mixture state observat...

19 |
Probability Theory
- Prohorov, Rozanov
- 1969
Citation Context: ...ponent is non-degenerate, i.e. ω_k > 0, then c_k1, c_k2, ..., c_kT is a sequence of T i.i.d. random variables with a non-degenerate distribution, and lim sup_{T→∞} Σ_{t=1}^{T} c_kt = ∞ with probability one [25]. It follows that w̃_k converges to Σ_{t=1}^{T} c_kt / T with probability one when T → ∞. Applying the same reasoning to m̃_k and r̃_k, it can be seen that the EM reestimation formulas for the MAP and M...

16 |
Improved Acoustic Modeling for Large Vocabulary Continuous Speech Recognition
- Lee, Giachin, et al.
- 1992
Citation Context: ... and the transition probabilities are assumed fixed and known. Details of the recognition system and the basic assumptions for acoustic modeling of subword units can be found in [19]. As described in [21], a 38-dimensional feature vector composed of LPC-derived cepstrum coefficients and first- and second-order time derivatives was computed after the data were down-sampled to 8 kHz to simulate the telep...

16 |
Dynamic speaker adaptation for feature-based isolated word recognition
- Stern, Lasry
- 1987
Citation Context: ...n learning of Gaussian densities has been widely used for sequential learning of the mean vectors of feature- and template-based recognizers (see, for example, Zelinski and Class [31], Stern and Lasry [30]). Ferretti and Scarci [8] used Bayesian estimation of mean vectors to build speaker-specific codebooks in an HMM framework. In all these cases, the precision parameter was assumed to be known and the ...
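The kind of sequential mean adaptation described here has a closed form in the conjugate case: with known variance and a Gaussian prior on the mean, the MAP estimate interpolates between the prior mean and the sample mean. A sketch under those assumptions (`tau`, the prior weight, is my notation, not taken from the cited works):

```python
def map_mean(prior_mean, tau, data):
    """MAP update for a Gaussian mean with known variance and a conjugate
    Gaussian prior: an interpolation between the prior mean and the sample
    mean, weighted by tau (a prior "pseudo-count") and the data count.
    Illustrative sketch of conjugate mean adaptation, not the cited systems'
    exact formulas."""
    n = len(data)
    sample_mean = sum(data) / n
    return (tau * prior_mean + n * sample_mean) / (tau + n)
```

With little adaptation data the estimate stays close to the prior (e.g. speaker-independent) mean; as more data arrives it converges to the sample mean, which is what makes this form attractive for incremental speaker adaptation.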

12 |
Bayesian adaptation in speech recognition
- Brown, Lee, et al.
- 1983
Citation Context: ...n of mean vectors to build speaker-specific codebooks in an HMM framework. In all these cases, the precision parameter was assumed to be known and the prior density limited to a Gaussian. Brown et al. [3] used Bayesian estimation for speaker adaptation of CDHMM parameters in a connected digit recognizer. More recently, Lee et al. [20] investigated various training schemes of Gaussian mean and variance...

11 |
Vocabulary-independent speech recognition: The VOCIND System
- Hon
- 1992
Citation Context: ...aker clustering and corrective training. MAP estimation has also been applied to task adaptation [22]. In this case task-independent SI models, trained from 10,000 utterances of a general English corpus [13], served as seed models for speaker and task adaptation. Another use of MAP estimation has recently been proposed for text-independent speaker identification [18], using a small amount of speaker-specifi...

10 |
Sur les lois de probabilité à estimation exhaustive (On probability laws admitting exhaustive [sufficient] estimation)
- Darmois
- 1935
Citation Context: ...mation problem of finding the mode of the kernel density k(·|t(x)). However, among the distribution families of interest, only exponential families have a sufficient statistic of fixed dimension [4, 17]. When there is no sufficient statistic of a fixed dimension, MAP estimation, like ML estimation, is a much more difficult problem, because the posterior density is not expressible in terms of a fixed ...

10 |
A learning procedure for speaker-dependent word recognition systems based on sequential processing of input tokens
- Zelinski, Class
- 1983
Citation Context: ...mental Results. Bayesian learning of Gaussian densities has been widely used for sequential learning of the mean vectors of feature- and template-based recognizers (see, for example, Zelinski and Class [31], Stern and Lasry [30]). Ferretti and Scarci [8] used Bayesian estimation of mean vectors to build speaker-specific codebooks in an HMM framework. In all these cases, the precision parameter was assume...

5 |
Large-vocabulary speech recognition with speaker-adapted codebook and HMM parameters
- Ferretti, Scarci
- 1989
Citation Context: ...ities has been widely used for sequential learning of the mean vectors of feature- and template-based recognizers (see, for example, Zelinski and Class [31], Stern and Lasry [30]). Ferretti and Scarci [8] used Bayesian estimation of mean vectors to build speaker-specific codebooks in an HMM framework. In all these cases, the precision parameter was assumed to be known and the prior density limited to a...

1 |
A Database for Continuous Speech Recognition in a 1000-Word Domain
- Price, Fisher, et al.
- 1988
Citation Context: ...to the same phone [10], and for p.d.f. smoothing the same marginal prior density was used for all the components of a given mixture [11]. In experiments using the DARPA Naval Resource Management (RM) [24] and the TI connected digit corpora, MAP estimation always outperformed ML estimation, with error rate reductions on the order of 10 to 25%. In the case of model adaptation, MAP estimation may be view...