Maximum Likelihood Linear Transformations for HMM-Based Speech Recognition (1998)

by M.J.F. Gales
Venue:COMPUTER SPEECH AND LANGUAGE
Results 1 - 10 of 570

Semi-tied covariance matrices for hidden Markov models

by M. J. F. Gales - IEEE Trans. Speech and Audio Processing, 1999
Cited by 262 (35 self)
Abstract not found

Citation Context

...ions to the case when full covariance matrices are used is also possible [7]. Another closely related problem is ML linear transformations of the variances for speaker and environmental adaptation [8]. Here a linear transform, typically tied over many components, is required to adapt the variances to be representative of a new speaker, or acoustic environment. When adapted in an unconstrained mode...
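The tied variance transform described in this context can be sketched as follows: one linear transform H, shared over many Gaussian components, maps each component covariance to H Σ Hᵀ. This is an illustrative NumPy sketch under assumed names and shapes, not the cited paper's implementation.

```python
import numpy as np

def adapt_covariances(covariances, H):
    """Apply one shared linear transform to a set of component covariances.

    Each covariance Sigma becomes H @ Sigma @ H.T, so all components
    tied to H are adapted jointly toward the new speaker/environment.
    """
    return [H @ sigma @ H.T for sigma in covariances]

rng = np.random.default_rng(0)
d = 3
# Two diagonal component covariances, as in typical HMM systems.
covs = [np.diag(rng.uniform(0.5, 2.0, d)) for _ in range(2)]
# A hypothetical estimated transform: a small perturbation of identity.
H = np.eye(d) + 0.1 * rng.standard_normal((d, d))

adapted = adapt_covariances(covs, H)
# The congruence transform preserves symmetry and positive definiteness.
for sigma in adapted:
    assert np.allclose(sigma, sigma.T)
    assert np.all(np.linalg.eigvalsh(sigma) > 0)
```

In practice H would be estimated under a maximum-likelihood criterion from adaptation data; the sketch only shows how the tied transform is applied.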

Parametric Hidden Markov Models for Gesture Recognition

by Andrew D. Wilson, Aaron F. Bobick - IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE , 1999
Cited by 208 (3 self)
A new method for the representation, recognition, and interpretation of parameterized gesture is presented. By parameterized gesture we mean gestures that exhibit a systematic spatial variation; one example is a point gesture where the relevant parameter is the two-dimensional direction. Our approach is to extend the standard hidden Markov model method of gesture recognition by including a global parametric variation in the output probabilities of the HMM states. Using a linear model of dependence, we formulate an expectation-maximization (EM) method for training the parametric HMM. During testing, a similar EM algorithm simultaneously maximizes the output likelihood of the PHMM for the given sequence and estimates the quantifying parameters. Using visually derived and directly measured three-dimensional hand position measurements as input, we present results that demonstrate the recognition superiority of the PHMM over standard HMM techniques, as well as greater robustness in parameter estimation with respect to noise in the input features. Last, we extend the PHMM to handle arbitrary smooth (nonlinear) dependencies. The nonlinear formulation requires the use of a generalized expectation-maximization (GEM) algorithm for both training and the simultaneous recognition of the gesture and estimation of the value of the parameter. We present results on a pointing gesture, where the nonlinear approach permits the natural spherical coordinate parameterization of pointing direction.
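The linear model of dependence used by the parametric HMM above can be sketched in a few lines: each state's output mean depends linearly on a global gesture parameter θ, roughly μ_j(θ) = W_j θ + μ̄_j. The matrices and values below are illustrative, not taken from the paper.

```python
import numpy as np

def state_mean(W, mu_bar, theta):
    """Output mean of one PHMM state as a linear function of theta."""
    return W @ theta + mu_bar

d_obs, d_param = 3, 2
# Hypothetical state-specific dependence matrix and base mean.
W = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.5, 0.5]])
mu_bar = np.zeros(d_obs)
# The global parameter, e.g. a 2-D pointing direction.
theta = np.array([2.0, -1.0])

mu = state_mean(W, mu_bar, theta)
# With mu_bar = 0 the mean is just W @ theta.
assert np.allclose(mu, np.array([2.0, -1.0, 0.5]))
```

Training fits W_j and μ̄_j by EM; at test time a similar EM loop jointly maximizes the sequence likelihood over θ, which is how the PHMM recovers the quantifying parameter.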

Statistical parametric speech synthesis

by Alan W Black, Heiga Zen, Keiichi Tokuda - in Proc. ICASSP, 2007 , 2007
Cited by 179 (18 self)
This paper gives a general overview of techniques in statistical parametric speech synthesis. One of the instances of these techniques, called HMM-based generation synthesis (or simply HMM-based synthesis), has recently been shown to be very effective in generating acceptable speech synthesis. This paper also contrasts these techniques with the more conventional unit selection technology that has dominated speech synthesis over the last ten years. Advantages and disadvantages of statistical parametric synthesis are highlighted, as well as identifying where we expect the key developments to appear in the immediate future.

Index Terms — Speech synthesis, hidden Markov models

The Kaldi speech recognition toolkit

by Daniel Povey, Arnab Ghoshal, Gilles Boulianne, Lukáš Burget, Ondřej Glembek, Nagendra Goel, Mirko Hannemann, Petr Motlíček, Yanmin Qian, Petr Schwarz, Jan Silovský, Georg Stemmer, Karel Veselý - Proc. ASRU, 2011
Cited by 147 (16 self)
We describe the design of Kaldi, a free, open-source toolkit for speech recognition research. Kaldi provides a speech recognition system based on finite-state transducers (using the freely available OpenFst), together with detailed documentation and scripts for building complete recognition systems. Kaldi is written in C++, and the core library supports modeling of arbitrary phonetic-context sizes, acoustic modeling with subspace Gaussian mixture models (SGMM) as well as standard Gaussian mixture models, together with all commonly used linear and affine transforms. Kaldi is released under the Apache License v2.0, which is highly nonrestrictive, making it suitable for a wide community of users.

Citation Context

...r adaptation We support both model-space adaptation using maximum likelihood linear regression (MLLR) [8] and feature-space adaptation using feature-space MLLR (fMLLR), also known as constrained MLLR [9]. For both MLLR and fMLLR, multiple transforms can be estimated using a regression tree [10]. When a single fMLLR transform is needed, it can be used as an additional processing step in the feature pi...
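The feature-pipeline use of a single fMLLR transform mentioned above can be sketched simply: every frame x is mapped to A x + b before scoring, leaving the acoustic model itself untouched. The values of A and b here are toy placeholders; in a real system they are estimated per speaker under a maximum-likelihood criterion.

```python
import numpy as np

def apply_fmllr(frames, A, b):
    """Apply an affine feature-space (fMLLR) transform.

    frames: (T, d) matrix of T feature frames.
    Returns frames mapped through x -> A @ x + b, row by row.
    """
    return frames @ A.T + b

T, d = 4, 3
frames = np.ones((T, d))
A = 2.0 * np.eye(d)            # hypothetical speaker transform
b = np.array([1.0, 0.0, -1.0]) # hypothetical bias

out = apply_fmllr(frames, A, b)
assert out.shape == (T, d)
# Each all-ones frame maps to 2*[1,1,1] + b = [3, 2, 1].
assert np.allclose(out[0], np.array([3.0, 2.0, 1.0]))
```

Because the transform lives entirely in feature space, the same adapted features can feed any downstream model, which is why constrained MLLR is convenient as a pipeline step.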

Maximum Likelihood Modeling With Gaussian Distributions For Classification

by R. A. Gopinath - Proceedings of ICASSP , 1998
Cited by 121 (26 self)
Maximum Likelihood (ML) modeling of multiclass data for classification often suffers from the following problems: a) data insufficiency implying overtrained or unreliable models; b) large storage requirement; c) large computational requirement; and/or d) ML is not discriminating between classes. Sharing parameters across classes (or constraining the parameters) clearly tends to alleviate the first three problems. In this paper we show that in some cases it can also lead to better discrimination (as evidenced by reduced misclassification error). The parameters considered are the means and variances of the Gaussians and linear transformations of the feature space (or equivalently the Gaussian means). Some constraints on the parameters are shown to lead to Linear Discriminant Analysis (a well-known result) while others are shown to lead to optimal feature spaces (a relatively new result). Applications of some of these ideas to the speech recognition problem are also given.
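The well-known result the abstract alludes to can be illustrated directly: when two Gaussian classes are constrained to share one covariance matrix, the ML decision rule reduces to a linear discriminant. The means and covariance below are toy values chosen for the sketch.

```python
import numpy as np

def linear_discriminant(mu1, mu2, sigma_shared):
    """For equal priors and a shared covariance, return (w, c) such that
    the ML rule is: choose class 1 iff w @ x > c."""
    sigma_inv = np.linalg.inv(sigma_shared)
    w = sigma_inv @ (mu1 - mu2)        # direction of the linear boundary
    c = 0.5 * (mu1 + mu2) @ w          # threshold at the midpoint
    return w, c

mu1, mu2 = np.array([1.0, 0.0]), np.array([-1.0, 0.0])
sigma = np.eye(2)

w, c = linear_discriminant(mu1, mu2, sigma)
# With an identity shared covariance, the boundary is the perpendicular
# bisector of the two means.
assert np.allclose(w, np.array([2.0, 0.0]))
assert np.isclose(c, 0.0)
```

With unconstrained per-class covariances the boundary would be quadratic; the sharing constraint is exactly what makes it linear.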

Channel compensation for SVM speaker recognition

by Alex Solomonoff, Carl Quillen, William M. Campbell - in Proceedings of Odyssey-04, The Speaker and Language Recognition Workshop
Cited by 113 (16 self)
One of the major remaining challenges to improving accuracy in state-of-the-art speaker recognition algorithms is reducing the impact of channel and handset variations on system performance. For Gaussian Mixture Model based speaker recognition systems, a variety of channel-adaptation techniques are known and available for adapting models between different channel conditions, but for the much more recent Support Vector Machine (SVM) based approaches to this problem, much less is known about the best way to handle this issue. In this paper we explore techniques that are specific to the SVM framework in order to derive fully non-linear channel compensations. The result is a system that is less sensitive to specific kinds of labeled channel variations observed in training.

Citation Context

... of GMM based systems, a variety of maximum likelihood model and feature space adaptations are known and have been studied in detail in the speech recognition and speaker-recognition literature (e.g. [3], [4]). Because we often combine an SVM-based speaker recognition system with a GMM system running in parallel, it’s quite natural to consider ...

Analysis of speaker adaptation algorithms for HMM-based speech synthesis and a constrained SMAPLR adaptation algorithm

by Junichi Yamagishi, Takao Kobayashi, Senior Member, Yuji Nakano, Katsumi Ogata, Juri Isogai - IEEE Trans. Audio Speech Lang. Process , 2009
Cited by 90 (28 self)
In this paper, we analyze the effects of several factors and configuration choices encountered during training and model construction when we want to obtain better and more stable adaptation in HMM-based speech synthesis. We then propose a new adaptation algorithm called constrained structural maximum a posteriori linear regression (CSMAPLR) whose derivation is based on the knowledge obtained in this analysis and on the results of comparing several conventional adaptation algorithms. Here, we investigate six major aspects of the speaker adaptation: initial models; the amount of the training data for the initial models; the transform functions, estimation criteria, and sensitivity of several linear regression adaptation algorithms; and combination algorithms. Analyzing the effect of the initial model, we compare speaker-dependent models, gender-independent models, and the simultaneous use of the gender-dependent models to single use of the gender-dependent models. Analyzing the effect of the transform functions, we compare the transform function for only mean vectors with that for mean vectors and covariance matrices. Analyzing the effect of the estimation criteria, we compare the ML criterion with a robust estimation criterion called structural MAP. We evaluate the sensitivity of several thresholds for the piecewise linear regression algorithms and take up methods combining MAP adaptation with the linear regression algorithms. We incorporate these adaptation algorithms into our speech synthesis system and present several subjective and objective evaluation results showing the utility and effectiveness of these algorithms in speaker adaptation for HMM-based speech synthesis.

Index Terms — Average voice, hidden Markov model (HMM)-based speech synthesis, speaker adaptation, speech synthesis, voice conversion.

Citation Context

... function for MLLR with that for constrained MLLR (CMLLR) [37], [38]. We then analyze the effect of the estimation criteria by comparing the ML criterion with a robust estimation criterion called structural MAP (SMAP) [39] in the linear regression adaptation algorithm...

Large Scale Discriminative Training For Speech Recognition

by P.C. Woodland, D. Povey , 2000
Cited by 88 (5 self)
This paper describes, and evaluates on a large scale, the lattice based framework for discriminative training of large vocabulary speech recognition systems based on Gaussian mixture hidden Markov models (HMMs). The paper concentrates on the maximum mutual information estimation (MMIE) criterion which has been used to train HMM systems for conversational telephone speech transcription using up to 265 hours of training data. These experiments represent the largest-scale application of discriminative training techniques for speech recognition of which the authors are aware, and have led to significant reductions in word error rate for both triphone and quinphone HMMs compared to our best models trained using maximum likelihood estimation. The MMIE lattice-based implementation used; techniques for ensuring improved generalisation; and interactions with maximum likelihood based adaptation are all discussed. Furthermore several variations to the MMIE training scheme are introduced with the a...
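The MMIE criterion the abstract concentrates on can be written, in standard notation rather than as copied from the paper, as a sum over the R training utterances of the log posterior of the reference transcription:

```latex
% O_r: acoustics of utterance r; w_r: reference transcription;
% M_w: composite HMM for word sequence w; P(w): language model prior.
% The denominator sums over competing hypotheses w, in practice
% approximated by a recognition lattice (hence "lattice based").
\mathcal{F}_{\mathrm{MMIE}}(\lambda) =
  \sum_{r=1}^{R} \log
  \frac{p_\lambda(\mathbf{O}_r \mid \mathcal{M}_{w_r})\, P(w_r)}
       {\sum_{w} p_\lambda(\mathbf{O}_r \mid \mathcal{M}_{w})\, P(w)}
```

Maximizing this objective raises the likelihood of the correct transcription relative to all competitors, which is what distinguishes it from plain maximum likelihood estimation.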

Cluster adaptive training of hidden Markov models

by M. J. F. Gales - IEEE Trans. Speech and Audio Processing, 2000
Cited by 82 (17 self)
Abstract not found

Citation Context

... or acoustic environment. A variety of transforms have been examined, for example, vocal tract normalisation [10], maximum likelihood linear regression (MLLR) [12], constrained model-space transforms [3, 7] and speaker clustering [16, 14]. The majority of these techniques apply some transformation to a canonical model. Originally a speaker-independent (SI) model was used as the canonical model. During r...

Feature engineering in context-dependent deep neural networks for conversational speech transcription

by Frank Seide, Xie Chen, Dong Yu - in Proc. ASRU, 2011
Cited by 76 (15 self)
We investigate the potential of Context-Dependent Deep-Neural-Network HMMs, or CD-DNN-HMMs, from a feature-engineering perspective. Recently, we had shown that for speaker-independent transcription of phone calls (NIST RT03S Fisher data), CD-DNN-HMMs reduced the word error rate by as much as one third—from 27.4%, obtained by discriminatively trained Gaussian-mixture HMMs with HLDA features, to 18.5%—using 300+ hours of training data (Switchboard), 9000+ tied triphone states, and up to 9 hidden network layers. In this paper, we evaluate the effectiveness of feature transforms developed for GMM-HMMs—HLDA, VTLN, and fMLLR—applied to CD-DNN-HMMs. Results show that HLDA is subsumed (expected), as is much of the gain from VTLN (not expected): Deep networks learn vocal-tract length invariant structures to a significant degree. Unsupervised speaker adaptation with discriminatively estimated fMLLR-like transforms works (as hoped for) nearly as well as fMLLR for GMM-HMMs. We also improve model training by a discriminative pretraining procedure, yielding a small accuracy gain due to a better internal feature representation.

Citation Context

...ks and yields large error reductions for deep networks. This paper aims at evaluating the effectiveness of several feature-engineering techniques developed for GMM-HMMs—HLDA [7], VTLN [8], and fMLLR [9]—when applied to CD-DNN-HMMs. We find that CD-DNN-HMMs subsume not only HLDA (expected) but also most of the gains of VTLN (unexpected). The DNN can be viewed as a complex discriminative feature extra...
