Large Scale Discriminative Training For Speech Recognition
, 2000
Cited by 58 (5 self)
This paper describes, and evaluates on a large scale, the lattice-based framework for discriminative training of large vocabulary speech recognition systems based on Gaussian mixture hidden Markov models (HMMs). The paper concentrates on the maximum mutual information estimation (MMIE) criterion which has been used to train HMM systems for conversational telephone speech transcription using up to 265 hours of training data. These experiments represent the largest-scale application of discriminative training techniques for speech recognition of which the authors are aware, and have led to significant reductions in word error rate for both triphone and quinphone HMMs compared to our best models trained using maximum likelihood estimation. The MMIE lattice-based implementation used; techniques for ensuring improved generalisation; and interactions with maximum likelihood based adaptation are all discussed. Furthermore several variations to the MMIE training scheme are introduced with the a...
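For orientation, the MMIE criterion named in this abstract is commonly written as shown here. This is the standard textbook form rather than a formula quoted from the paper, and the symbols are notation assumed for illustration: training utterances r = 1..R, observations O_r, reference transcription w_r, composite HMM M_w for word sequence w, and language model probability P(w). In lattice-based training, the denominator sum over competing hypotheses is approximated by the paths in a recognition lattice.

```latex
\mathcal{F}_{\mathrm{MMIE}}(\lambda)
  = \sum_{r=1}^{R} \log
    \frac{p_{\lambda}(O_r \mid M_{w_r})\, P(w_r)}
         {\sum_{\hat{w}} p_{\lambda}(O_r \mid M_{\hat{w}})\, P(\hat{w})}
```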
Support vector machines for speech recognition
- Proceedings of the International Conference on Spoken Language Processing
, 1998
Cited by 47 (2 self)
Statistical techniques based on hidden Markov Models (HMMs) with Gaussian emission densities have dominated signal processing and pattern recognition literature for the past 20 years. However, HMMs trained using maximum likelihood techniques suffer from an inability to learn discriminative information and are prone to overfitting and over-parameterization. Recent work in machine learning has focused on models, such as the support vector machine (SVM), that automatically control generalization and parameterization as part of the overall optimization process. In this paper, we show that SVMs provide a significant improvement in performance on a static pattern classification task based on the Deterding vowel data. We also describe an application of SVMs to large vocabulary speech recognition, and demonstrate an improvement in error rate on a continuous alphadigit task (OGI Alphadigits) and a large vocabulary conversational speech task (Switchboard). Issues related to the development and optimization of an SVM/HMM hybrid system are discussed.
Hybrid SVM/HMM Architectures for Speech Recognition
- in Speech Transcription Workshop
, 2000
Cited by 21 (3 self)
In this paper, we describe the use of a powerful machine learning scheme, Support Vector Machines (SVM), within the framework of hidden Markov model (HMM) based speech recognition. The hybrid SVM/HMM system has been developed based on our public domain toolkit. The hybrid system has been evaluated on the OGI Alphadigits corpus and performs at 11.6% WER, as compared to 12.7% with a triphone mixture-Gaussian HMM system, while using only a fifth of the training data used by the triphone system. Several important issues that arise out of the nature of SVM classifiers have been addressed. We are in the process of migrating this technology to large vocabulary recognition tasks like SWITCHBOARD. 1. INTRODUCTION Speech recognition can be viewed as a pattern recognition problem where we desire each unique sound to be distinguishable from all other sounds. Traditionally statistical models, such as Gaussian mixture models, have been used to "represent" th...
On Supervised Learning From Sequential Data With Applications For Speech Recognition
, 1999
Cited by 12 (1 self)
visualization of the problem to model human speech. A large number of example sequences of observation vectors (shown connected as continuous trajectories) depending on a given sequence of class labels, with each class representing for example a phoneme (here the name Keiko with given durations). In this synthetic example, the one-dimensional target data would be represented poorly by a uni-modal Gaussian distribution with a constant variance (which corresponds to using the squared-error objective function), which would average the two separate branches, indicated by the fat lines as the mean and constant variance of the single Gaussian. Compare this figure with Figure 3.10, Figure 3.11 and Figure 3.12 to see a subsequent improvement of the model.
Articulatory Methods for Speech Production and Recognition
, 1996
Cited by 9 (0 self)
roduction-based knowledge into the recognition framework. By using an explicit time-domain articulatory model of the mechanisms of co-articulation, it is hoped to obtain a more accurate model of contextual effects in the acoustic signal, while using fewer parameters than traditional acoustically-driven approaches. Separate articulatory and acoustic models are provided, and in each case the parameters of the models are automatically optimised over a training data set. A predictive statistically-based model of co-articulation is described, and found to yield improved articulatory modelling accuracy compared with X-ray articulatory traces. Parameterised acoustic vectors are synthesised by a set of artificial neural networks, and the resulting acoustic representations are used to re-score N-best recognition hypothesis lists produced by an HMM-based recogniser. The system is evaluated on two test databases, one including speaker-specific X-ray training data and the other aco
Fast Implementation Methods For Viterbi-Based Word-Spotting
- In Proc. ICASSP
, 1996
Cited by 6 (1 self)
This paper explores methods of increasing the speed of a Viterbi-based word-spotting system for audio document retrieval. Fast processing is essential since the user expects to receive the results of a keyword search many times faster than the actual length of the speech. A number of computational short-cuts to the standard Viterbi word-spotter are presented. These are based on exploiting the background Viterbi phone recognition path that is computed to provide a normalisation base. An initial approximation using the phone transition boundaries reduces the retrieval time by a factor of 5, while achieving a slight improvement in word-spotting performance. To further reduce retrieval time, pattern matching, feature selection, and Gaussian selection techniques are applied to this approximate pass to give a total ×50 increase in speed with little loss in performance. In addition, a low memory requirement means that these approaches can be implemented on any platform, including hand-he...
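As background for the "standard Viterbi" pass that this abstract builds on, the following is a minimal sketch of log-domain Viterbi decoding over a generic HMM. It does not reproduce the paper's word-spotter or any of its speed-ups; the function name `viterbi` and the array layout (`log_init`, `log_trans`, `log_emit`) are assumptions made for illustration.

```python
import numpy as np


def viterbi(log_init, log_trans, log_emit):
    """Return the most likely state path and its log score.

    log_init:  shape (S,)    log prior of each state at the first frame
    log_trans: shape (S, S)  log_trans[i, j] = log P(next state j | state i)
    log_emit:  shape (T, S)  log likelihood of frame t under state s
    (All names and shapes are hypothetical, chosen for this sketch.)
    """
    n_frames, n_states = log_emit.shape
    delta = log_init + log_emit[0]            # best score ending in each state
    backptr = np.zeros((n_frames, n_states), dtype=int)

    for t in range(1, n_frames):
        # scores[i, j]: best path ending in state i, followed by transition i -> j
        scores = delta[:, None] + log_trans
        backptr[t] = np.argmax(scores, axis=0)
        delta = scores[backptr[t], np.arange(n_states)] + log_emit[t]

    # Trace the best path backwards from the best final state.
    path = [int(np.argmax(delta))]
    for t in range(n_frames - 1, 0, -1):
        path.append(int(backptr[t, path[-1]]))
    path.reverse()
    return path, float(np.max(delta))


if __name__ == "__main__":
    init = np.log([0.5, 0.5])
    trans = np.log([[0.9, 0.1], [0.1, 0.9]])
    emit = np.log([[0.9, 0.1], [0.9, 0.1], [0.1, 0.9], [0.1, 0.9]])
    best_path, best_score = viterbi(init, trans, emit)
    print(best_path, best_score)
```

Working in the log domain avoids numerical underflow on long utterances, which is why sums replace the products of the probability-domain recursion.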
Training algorithms for hidden conditional random fields
- In Proc. ICASSP
, 2006
Cited by 6 (3 self)
We investigate algorithms for training hidden conditional random fields (HCRFs) – a class of direct models with hidden state sequences. We compare stochastic gradient ascent with the RProp algorithm, and investigate stochastic versions of RProp. We propose a new scheme for model flattening, and compare it to the state of the art. Finally we give experimental results on the TIMIT phone classification task showing how these training options interact, comparing HCRFs to HMMs trained using extended Baum-Welch as well as
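To make the RProp side of this comparison concrete, here is a minimal sketch of a batch RProp update in its ascent form (the iRprop- variant, which skips the update for a coordinate right after a gradient sign change). It is a generic illustration, not the paper's stochastic variants or its HCRF objective; the function name `rprop_maximize`, the default hyperparameters, and the toy objective in the usage example are all assumptions.

```python
import numpy as np


def rprop_maximize(grad_fn, theta, n_iters=200, step_init=0.1,
                   eta_plus=1.2, eta_minus=0.5, step_min=1e-6, step_max=50.0):
    """Maximise an objective using only the sign of its gradient (iRprop-).

    grad_fn: callable returning the gradient at theta (hypothetical interface).
    Each parameter keeps its own step size, grown when successive gradients
    agree in sign and shrunk when they disagree.
    """
    theta = np.asarray(theta, dtype=float).copy()
    step = np.full_like(theta, step_init)
    prev_grad = np.zeros_like(theta)

    for _ in range(n_iters):
        grad = np.asarray(grad_fn(theta), dtype=float)
        agreement = grad * prev_grad
        # Same sign as last time: accelerate. Sign flip: we overshot, so back off.
        step = np.where(agreement > 0, np.minimum(step * eta_plus, step_max), step)
        step = np.where(agreement < 0, np.maximum(step * eta_minus, step_min), step)
        # iRprop-: suppress the update for coordinates whose sign just flipped.
        grad = np.where(agreement < 0, 0.0, grad)
        theta += step * np.sign(grad)          # ascent: move along the gradient sign
        prev_grad = grad
    return theta


if __name__ == "__main__":
    # Toy concave objective f(x, y) = -(x - 3)^2 - (y + 1)^2, maximised at (3, -1).
    toy_grad = lambda th: np.array([-2.0 * (th[0] - 3.0), -2.0 * (th[1] + 1.0)])
    print(rprop_maximize(toy_grad, [0.0, 0.0]))
```

Because only the gradient sign is used, the update is insensitive to gradient scale, which is one reason RProp is often contrasted with plain gradient ascent, where a single learning rate must suit every parameter.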
Automatic Model Complexity Control Using Marginalized Discriminative Growth Functions
- In Proceedings of the IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU)
, 2003
Cited by 6 (3 self)
Designing a large vocabulary speech recognition system is a highly complex problem. Many techniques affect both the system complexity and recognition performance. Automatic complexity control criteria are needed to quickly predict the recognition performance ranking of systems with varying complexity, in order to select an optimal model structure with the minimum word error. In this paper a novel complexity control technique is proposed by using the marginalization of discriminative growth functions. A two-stage approach is adopted to make the marginalization efficient. First, a lower bound, related to the auxiliary function, is used to remove the dependence on the latent variables. Second, a Laplace approximation is used for the integration. Experimental results on a spontaneous speech recognition task show that the marginalized MMI growth function outperforms held-out data likelihood and standard Bayesian schemes in terms of both recognition performance ranking error and word error.
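As a reminder of what the second stage involves, a Laplace approximation replaces an intractable integral with a Gaussian integral centred on the mode of the integrand. The generic form is given here under assumed notation: f is a smooth log-integrand over a k-dimensional parameter vector, and the mode is taken to be an interior maximum. The paper's own growth function and lower bound are not reproduced.

```latex
\int \exp\{ f(\theta) \}\, d\theta
  \;\approx\;
  \exp\{ f(\hat{\theta}) \}\,
  (2\pi)^{k/2}\,
  \bigl| -\nabla^{2} f(\hat{\theta}) \bigr|^{-1/2},
\qquad
\hat{\theta} = \arg\max_{\theta} f(\theta)
```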
Subspace Models For Speech Transitions Using Principal Curves
- Proceedings of Institute of Acoustics
, 1998
Cited by 6 (0 self)
this paper shows that much of the discriminatory information is retained in the projection we propose. This is illustrated on a simple problem involving the discrimination of /b/, /d/ & /g/, on the ISOLET [3] and TIMIT [5] databases. A Receiver Operating Characteristic (ROC) curve is used to present the compromise between detection of a transition and false alarms. 2. SUBSPACE MODEL
String and Lattice based Discriminative Training for the Corpus of Spontaneous Japanese Lecture Transcription Task
- in Proc. Interspeech, 2007
Cited by 2 (2 self)
This article aims to provide a comprehensive set of acoustic model discriminative training results for the Corpus of Spontaneous Japanese (CSJ) lecture speech transcription task. Discriminative training was carried out for this task using a 100,000 word trigram for several acoustic model topologies, using both diagonal and full covariance models, and using both string-based and lattice-based training paradigms. We describe our implementation of the proposal by Macherey et al. for numerical subtraction of the reference lattice statistics from the competitor lattice statistics during lattice-based Minimum Classification Error (MCE) training. We also present results for lattice-based training that does not use such subtraction, corresponding to the well-known Maximum Mutual Information (MMI) approach. Discriminative training yielded relative reductions in Word Error Rate of up to 13%. Specific problems encountered in implementing discriminative training for this task are discussed.

