Frame Discrimination training of HMMs for Large Vocabulary Speech Recognition (1999)

by D Povey, P C Woodland
Venue: Proc. ICASSP’99

Results 1 - 10 of 16

Large Scale Discriminative Training For Speech Recognition

by P.C. Woodland, D. Povey , 2000
Cited by 58 (5 self)
This paper describes, and evaluates on a large scale, the lattice-based framework for discriminative training of large vocabulary speech recognition systems based on Gaussian mixture hidden Markov models (HMMs). The paper concentrates on the maximum mutual information estimation (MMIE) criterion, which has been used to train HMM systems for conversational telephone speech transcription using up to 265 hours of training data. These experiments represent the largest-scale application of discriminative training techniques for speech recognition of which the authors are aware, and have led to significant reductions in word error rate for both triphone and quinphone HMMs compared to our best models trained using maximum likelihood estimation. The MMIE lattice-based implementation used; techniques for ensuring improved generalisation; and interactions with maximum likelihood based adaptation are all discussed. Furthermore several variations to the MMIE training scheme are introduced with the a...

Comparison of Discriminative Training Criteria and Optimization Methods for Speech Recognition

by Ralf Schlüter, Wolfgang Macherey, Boris Müller, Hermann Ney , 2001
Cited by 32 (6 self)
The aim of this work is to build up a common framework for a class of discriminative training criteria and optimization methods for continuous speech recognition. A unified discriminative criterion based on likelihood ratios of correct and competing models with optional smoothing is presented. The unified criterion leads to particular criteria through the choice of competing word sequences and the choice of smoothing. Analytic and experimental comparisons are presented for both the maximum mutual information (MMI) and the minimum classification error (MCE) criterion together with the optimization methods gradient descent (GD) and extended Baum (EB) algorithm. A tree search-based restricted recognition method using word graphs is presented, so as to reduce the computational complexity of large vocabulary discriminative training. Moreover, for MCE training, a method using word graphs for efficient calculation of discriminative statistics is introduced. Experiments were performed for continuous speech recognition using the ARPA Wall Street Journal (WSJ) corpus with a vocabulary of 5k words and for the recognition of continuously spoken digit strings using both the TI digit string corpus for American English digits, and the SieTill corpus for telephone line recorded German digits. For the MMI criterion, neither analytical nor experimental results indicate significant differences between EB and GD optimization. For acoustic models of low complexity, MCE training gave significantly better results than MMI training. The recognition results for large vocabulary MMI training on the WSJ corpus show a significant dependence on the context length of the language model used for training. Best results were obtained using a unigram language model for MMI training. No significant co...

The CU-HTK March 2000 Hub5E Transcription System

by Thomas Hain, Philip Woodland, Gunnar Evermann, Dan Povey , 2000
Cited by 18 (1 self)
This paper describes the Cambridge University HTK (CU-HTK) system developed for the NIST March 2000 evaluation of English conversational telephone speech transcription (Hub5E). A range of new features have been added to the HTK system used in the 1998 Hub5 evaluation, and the changes taken together have resulted in an 11% relative decrease in word error rate on the 1998 evaluation test set. Major changes include the use of maximum mutual information estimation in training as well as conventional maximum likelihood estimation; the use of a full variance transform for adaptation; the inclusion of unigram pronunciation probabilities; and word-level posterior probability estimation using confusion networks for use in minimum word error rate decoding, confidence score estimation and system combination. On the March 2000 Hub5 evaluation set the CU-HTK system gave an overall word error rate of 25.4%, which was the best performance by a statistically significant margin. This paper describes th...

Frame-Discriminative And Confidence-Driven Adaptation For LVCSR

by Frank Wallhoff, Daniel Willett, Gerhard Rigoll , 2000
Cited by 9 (1 self)
Maximum Likelihood Linear Regression (MLLR) has become the most popular approach for adapting speaker-independent Hidden Markov Models to a specific speaker's characteristics. However, it is well known that discriminative training objectives outperform Maximum Likelihood training approaches, especially in cases where training data is very limited, as is always the case in adaptation tasks. Therefore, this paper explores the application of a frame-based discriminative training objective for adaptation. It presents evaluations for supervised as well as for unsupervised adaptation on the 1993 WSJ adaptation tests of native and non-native speakers. Relative improvements in word error rate of up to 25% could be measured compared to the MLLR adapted recognition systems. Along with unsupervised adaptation, the paper also presents the improvements achieved by the application of confidence measures. They provided an average relative improvement of 10% compared to ordinary unsupervised MLLR.

Improved Discriminative Training Techniques for Large Vocabulary Continuous Speech Recognition

by D. Povey, P.C. Woodland - IEEE ICASSP'01 , 2001
Cited by 7 (0 self)
This paper investigates the use of discriminative training techniques for large vocabulary speech recognition with training datasets up to 265 hours. Techniques for improving lattice-based Maximum Mutual Information Estimation (MMIE) training are described and compared to Frame Discrimination (FD). An objective function which is an interpolation of MMIE and standard Maximum Likelihood Estimation (MLE) is also discussed. Experimental results on both the Switchboard and North American Business News tasks show that MMIE training can yield significant performance improvements over standard MLE even for the most complex speech recognition problems with very large training sets.

Recent Advances in Speech Recognition System for IBM DARPA Communicator

by Yuqing Gao, Yongxin Li, Vaibhava Goel, Michael Picheny - in Proceedings of the Eurospeech , 2001
Cited by 6 (2 self)
In this paper, we present methods to improve speech recognition performance of the IBM DARPA Communicator system. Our efforts for acoustic modeling include training a domain specific yet broad acoustic model, speaker clustering and speaker adaptation using feature space transforms. For language modeling, we achieved improvements by using compound words, carefully designed LM classes and adjusting the within class probabilities, using NLU state information to enhance the language model and building a language model with embedded grammar objects. Our efforts produced a relative error rate reduction of 34.6% on the test set that consists of 1173 utterances that IBM received during the NIST evaluation of the DARPA Communicator systems in June 2000. We also tested our decoding on the data from some other sites to further demonstrate the robustness of the system improvements.

Hidden Model Sequence Models for Automatic Speech Recognition

by Thomas Hain , 2001
Cited by 4 (0 self)
Most modern automatic speech recognition systems make use of acoustic models based on hidden Markov models. To obtain reasonable recognition performance within a large vocabulary framework, the acoustic models usually include a pronunciation model, together with complex parameter tying schemes. In many cases the pronunciation model operates on a phoneme level and is derived independently of the underlying models. In contrast, this work is aimed at improving pronunciation modelling on a sub-phone level in a combined framework. The modelling of pronunciation variation is assumed to be of special importance for recognition of spontaneous speech.

The 1998 HTK Broadcast News Transcription System: Development and Results

by P. C. Woodland, T. Hain, G. L. Moore, T. R. Niesler, D. Povey, A. Tuerk, E. W. D. Whittaker , 1999
Cited by 3 (0 self)
This paper presents the development of the HTK broadcast news transcription system for the November 1998 Hub4 evaluation. Relative to the previous year's system, a number of features were added including vocal tract length normalisation; cluster-based variance normalisation; double the quantity of acoustic training data; interpolated word level language models to combine text sources; increased broadcast news language model training data; and an extra adaptation stage using a full-variance transform. Overall these changes to the system reduced the error rate by 13% on the 1997 evaluation data and the final system had an overall word error rate of 13.8% for the 1998 evaluation data sets.

Structured Precision Matrix Modelling for Speech Recognition

by Khe Chai Sim , 2006
Cited by 2 (0 self)
Declaration: This dissertation is the result of my own work and includes nothing which is the outcome of the work done in collaboration, except where stated. It has not been submitted in whole or part for a degree at any other university. The length of this thesis including footnotes and appendices is approximately 53,000 words. Summary: The most extensively and successfully applied acoustic model for speech recognition is the Hidden Markov Model (HMM). In particular, a multivariate Gaussian Mixture Model (GMM) is typically used to represent the output density function of each HMM state. For reasons of efficiency, the covariance matrix associated with each Gaussian component is assumed diagonal and the probability of successive observations is assumed independent given the HMM state sequence. Consequently, the spectral (intra-frame) and temporal (inter-frame) correlations are poorly modelled. This thesis investigates ways of improving these aspects by extending the standard HMM. Parameters for these extended models are estimated discriminatively using the

Substate Tying With Combined Parameter Training and Reduction in Tied-Mixture HMM Design

by Liang Gu, Kenneth Rose - Transactions on Speech and Audio Processing , 2002
Cited by 1 (0 self)
Two approaches are proposed for the design of tied-mixture hidden Markov models (TMHMM). One approach improves parameter sharing via partial tying of TMHMM states. To facilitate tying at the substate level, the state emission probabilities are constructed in two stages or, equivalently, are viewed as a "mixture of mixtures of Gaussians." This paradigm allows, and is complemented with, an optimization technique to seek the best complexity-accuracy tradeoff solution, which jointly exploits Gaussian density sharing and substate tying. Another approach to enhance model training is combined training and reduction of model parameters. The procedure starts by training a system with a large universal codebook of Gaussian densities. It then iteratively reduces the size of both the codebook and the mixing coefficient matrix, followed by parameter re-training. The additional cost in design complexity is modest. Experimental results on the ISOLET database and its E-set subset show that substate tying reduces the classification error rate by over 15%, compared to standard Gaussian sharing and whole-state tying. TMHMM design with combined training and reduction of parameters reduces the classification error rate by over 20% compared to conventional TMHMM design. When the two proposed approaches were integrated, 25% error rate reduction over TMHMM with whole-state tying was achieved.