"Comparison of large margin training to other discriminative methods for phonetic recognition by hidden Markov models," in ICASSP, 2007

by F. Sha and L. Saul
Citing documents, results 1-10 of 36:

Large margin hidden Markov models for automatic speech recognition

by Fei Sha, Lawrence K. Saul - in Advances in Neural Information Processing Systems 19, 2007
"... We study the problem of parameter estimation in continuous density hidden Markov models (CD-HMMs) for automatic speech recognition (ASR). As in support vector machines, we propose a learning algorithm based on the goal of margin maximization. Unlike earlier work on max-margin Markov networks, our ap ..."
Abstract - Cited by 83 (7 self) - Add to MetaCart
We study the problem of parameter estimation in continuous density hidden Markov models (CD-HMMs) for automatic speech recognition (ASR). As in support vector machines, we propose a learning algorithm based on the goal of margin maximization. Unlike earlier work on max-margin Markov networks, our approach is specifically geared to the modeling of real-valued observations (such as acoustic feature vectors) using Gaussian mixture models. Unlike previous discriminative frameworks for ASR, such as maximum mutual information and minimum classification error, our framework leads to a convex optimization, without any spurious local minima. The objective function for large margin training of CD-HMMs is defined over a parameter space of positive semidefinite matrices. Its optimization can be performed efficiently with simple gradient-based methods that scale well to large problems. We obtain competitive results for phonetic recognition on the TIMIT speech corpus.

Citation Context

...simple case where the observations in each hidden state are modeled by a single ellipsoid. The extension to multiple mixture components closely follows the approach in section 2.3 and can be found in [14, 16]. Margin-based learning of transition probabilities is likewise straightforward but omitted for brevity. Both these extensions were implemented, however, for the experiments on phonetic recognition in...
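A rough sketch of the single-ellipsoid, token-level criterion described above, in notation of my own (a hedged reconstruction of the general large margin GMM idea, not the paper's exact formulation): each class c scores an augmented feature vector through a positive semidefinite matrix Phi_c, and training minimizes a convex hinge loss over margin violations,

    \min_{\{\Phi_c \succeq 0\}} \; \sum_n \sum_{c \neq y_n}
        \Big[\, 1 + z_n^{\top} \Phi_{y_n} z_n \,-\, z_n^{\top} \Phi_c z_n \,\Big]_+ ,
    \qquad z_n = \begin{bmatrix} x_n \\ 1 \end{bmatrix},

where [u]_+ = max(0, u). Each score is linear in Phi_c and the set of positive semidefinite matrices is convex, which is why the optimization has no spurious local minima, as the abstract emphasizes.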

Speech Recognition Using Augmented Conditional Random Fields

by Yasser Hifny, Steve Renals
"... Abstract—Acoustic modeling based on hidden Markov models (HMMs) is employed by state-of-the-art stochastic speech recognition systems. Although HMMs are a natural choice to warp the time axis and model the temporal phenomena in the speech signal, their conditional independence properties limit their ..."
Abstract - Cited by 29 (2 self) - Add to MetaCart
Acoustic modeling based on hidden Markov models (HMMs) is employed by state-of-the-art stochastic speech recognition systems. Although HMMs are a natural choice to warp the time axis and model the temporal phenomena in the speech signal, their conditional independence properties limit their ability to model spectral phenomena well. In this paper, a new acoustic modeling paradigm based on augmented conditional random fields (ACRFs) is investigated and developed. This paradigm addresses some limitations of HMMs while maintaining many of the aspects which have made them successful. In particular, the acoustic modeling problem is reformulated in a data driven, sparse, augmented space to increase discrimination. Acoustic context modeling is explicitly integrated to handle the sequential phenomena of the speech signal. We present an efficient framework for estimating these models that ensures scalability and generality. In the TIMIT ...

Citation Context

...ction I), which is not addressed in the ACRF framework. The other two enhancements are addressed within the ACRF framework. In general, improvements based on using different objective functions [62], [63] do not address the acoustic modeling formulation and can be used to train -ACRFs as well. -ACRFs can also take advantage of the TRAP tandem approach [64], [65] as a powerful frontend [20]. System com...
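For reference, the standard linear-chain CRF that ACRFs augment (this is the textbook form; the paper's sparse, augmented feature space is built on top of it) models the label-sequence posterior directly:

    p(y_{1:T} \mid x_{1:T}) \;=\; \frac{1}{Z(x)}
        \exp\Big( \sum_{t=1}^{T} \sum_k \lambda_k \, f_k(y_{t-1}, y_t, x, t) \Big),

where Z(x) normalizes over all label sequences. Because the model conditions on the whole observation sequence x, it avoids the frame-level conditional independence assumptions of HMMs that the abstract identifies as the limiting factor.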

Hierarchical large-margin Gaussian mixture models for phonetic classification

by Hung-an Chang, James R. Glass - IEEE Workshop on ASRU, 2007
"... In this paper we present a hierarchical large-margin Gaussian mixture modeling framework and evaluate it on the task of phonetic classification. A two-stage hierarchical classifier is trained by alternately updating parameters at different levels in the tree to maximize the joint margin of the overa ..."
Abstract - Cited by 21 (2 self) - Add to MetaCart
In this paper we present a hierarchical large-margin Gaussian mixture modeling framework and evaluate it on the task of phonetic classification. A two-stage hierarchical classifier is trained by alternately updating parameters at different levels in the tree to maximize the joint margin of the overall classification. Since the loss function required in the training is convex in the parameter space, the problem of spurious local minima is avoided. The model achieves good performance with fewer parameters than single-level classifiers. In the TIMIT benchmark task of context-independent phonetic classification, the proposed modeling scheme achieves a state-of-the-art phonetic classification error of 16.7% on the core test set. This is an absolute reduction of 1.6% from the best previously reported result on this task, and 4-5% lower than a variety of classifiers that have been recently examined on this task. Index Terms — hierarchical classifier, committee classifier, large margin GMM, phonetic classification

Citation Context

...on is needed to better fit the nature of ASR systems for LVCSR tasks. There are several possible ways to modify the loss function, one of which is to expand the token-level loss to string-level as in [34]. Let Xn be the observation sequence of the n-th utterance in the training data, and Yn be the corresponding label sequence. The distance metric of Xn with respect to Yn can be computed by summing up ...
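A plausible reading of the string-level extension this context sketches, in hedged notation of my own: sum the token-level distances over the frames of the utterance, and require the correct label sequence Y_n to beat every competitor Y by a margin that grows with their Hamming distance,

    \mathcal{D}(X_n, Y) \;=\; \sum_t d(x_{n,t}, y_t), \qquad
    \mathcal{D}(X_n, Y) - \mathcal{D}(X_n, Y_n) \;\ge\; \mathcal{H}(Y, Y_n)
    \quad \forall\, Y \neq Y_n,

with hinge penalties on violated constraints, mirroring the token-level margin criterion of the classifier itself.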

Conditional random fields for integrating local discriminative classifiers

by Jeremy Morris, Eric Fosler-Lussier - IEEE Transactions on Audio, Speech, and Language Processing, 2008
"... Abstract—Conditional random fields (CRFs) are a statistical framework that has recently gained in popularity in both the automatic speech recognition (ASR) and natural language processing communities because of the different nature of assumptions that are made in predicting sequences of labels compa ..."
Abstract - Cited by 18 (2 self) - Add to MetaCart
Conditional random fields (CRFs) are a statistical framework that has recently gained in popularity in both the automatic speech recognition (ASR) and natural language processing communities because of the different nature of assumptions that are made in predicting sequences of labels compared to the more traditional hidden Markov model (HMM). In the ASR community, CRFs have been employed in a method similar to that of HMMs, using the sufficient statistics of input data to compute the probability of label sequences given acoustic input. In this paper, we explore the application of CRFs to combine local posterior estimates provided by multilayer perceptrons (MLPs) corresponding to the frame-level prediction of phone classes and phonological attribute classes. We compare phonetic recognition using CRFs to an HMM system trained on the same input features and show that the monophone label CRF is able to achieve superior performance to a monophone-based HMM and performance comparable to a 16 Gaussian mixture triphone-based HMM; in both of these cases, the CRF obtains these results with far fewer free parameters. The CRF is also able to better combine these posterior estimators, achieving a substantial increase in performance over an HMM-based triphone system by mixing the two highly correlated sets of phone class and phonetic attribute class posteriors. Index Terms — Automatic speech recognition (ASR), random fields.
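A minimal sketch of the posterior-combination idea in the abstract, assuming frame-level posterior matrices from two MLPs (the function and array names are hypothetical, not the authors' code; the paper's CRF feature functions may differ in detail):

    import numpy as np

    def crf_observation_features(phone_post, attr_post, eps=1e-10):
        """Concatenate frame-level MLP posteriors over phone classes
        (shape (T, n_phones)) and phonological attribute classes
        (shape (T, n_attrs)) into one observation-feature vector per
        frame for a linear-chain CRF. Log-domain posteriors keep the
        CRF exponent linear in the local classifier outputs."""
        return np.concatenate([np.log(phone_post + eps),
                               np.log(attr_post + eps)], axis=1)

The CRF then learns one weight per (label, feature) pair plus transition weights, which is consistent with the abstract's observation that the CRF needs far fewer free parameters than a triphone HMM trained on the same inputs.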

PAC-Bayesian approach for minimization of phoneme error rate

by Joseph Keshet, David McAllester, Tamir Hazan
"... We describe a new approach for phoneme recognition which aims at minimizing the phoneme error rate. Building on structured prediction techniques, we formulate the phoneme recognizer as a linear combination of feature functions. We state a PAC-Bayesian generalization bound, which gives an upper-bound ..."
Abstract - Cited by 17 (7 self) - Add to MetaCart
We describe a new approach for phoneme recognition which aims at minimizing the phoneme error rate. Building on structured prediction techniques, we formulate the phoneme recognizer as a linear combination of feature functions. We state a PAC-Bayesian generalization bound, which gives an upper-bound on the expected phoneme error rate in terms of the empirical phoneme error rate. Our algorithm is derived by finding the gradient of the PAC-Bayesian bound and minimizing it by stochastic gradient descent. The resulting algorithm is iterative and easy to implement. Experiments on the TIMIT corpus show that our method achieves the lowest phoneme error rate compared to other discriminative and generative models with the same expressive power. Index Terms — PAC-Bayesian theorem, phoneme recognition, structured prediction, discriminative training, kernels
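The abstract does not reproduce the bound itself; one standard McAllester-style PAC-Bayesian bound (stated here for orientation only, and possibly differing in constants from the bound actually used in the paper) reads

    \mathbb{E}_{w \sim Q}\big[\ell(w)\big] \;\le\;
    \mathbb{E}_{w \sim Q}\big[\hat{\ell}(w)\big] \;+\;
    \sqrt{\frac{\mathrm{KL}(Q \,\|\, P) + \ln(n/\delta)}{2(n-1)}}

with probability at least 1 - \delta over a training sample of size n. For a Gaussian posterior Q = N(w, I) and prior P = N(0, I), KL(Q || P) = ||w||^2 / 2, so taking stochastic gradients of the bound amounts to SGD on the empirical phoneme error rate plus an L2-style regularizer, matching the iterative algorithm the abstract describes.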

Modified MMI/MPE: A Direct Evaluation of the Margin in Speech Recognition

by Georg Heigold, Thomas Deselaers, Ralf Schlüter, Hermann Ney
"... In this paper we show how common speech recognition training criteria such as the Minimum Phone Error criterion or the Maximum Mutual Information criterion can be extended to incorporate a margin term. Different margin-based training algorithms have been proposed to refine existing training algorith ..."
Abstract - Cited by 15 (8 self) - Add to MetaCart
In this paper we show how common speech recognition training criteria such as the Minimum Phone Error criterion or the Maximum Mutual Information criterion can be extended to incorporate a margin term. Different margin-based training algorithms have been proposed to refine existing training algorithms for general machine learning problems. However, for speech recognition, some special problems have to be addressed, and the approaches proposed so far either lack practical applicability or require significant changes to the underlying model, e.g. to the optimization algorithm, the loss function, or the parameterization of the model. In our approach, the conventional training criteria are modified to incorporate a margin term. This allows us to do large-margin training in speech recognition using the same efficient algorithms for accumulation and optimization, and to use the same software as for conventional discriminative training. We show that the proposed criteria are equivalent to Support Vector Machines with suitable smooth loss functions, approximating the non-smooth hinge loss function or the hard error (e.g. phone error). Experimental results are given for two different tasks: the rather simple digit string recognition task Sietill, which severely suffers from overfitting, and the large-vocabulary European Parliament Plenary Sessions English task, which is presumably dominated by the risk term, so that generalization does not seem to be such an issue.

Citation Context

...ber of classes (number of possible word sequences). Stimulated by the success of SVMs, different margin-based training algorithms have been proposed for ASR, e.g. (Yu et al., 2007; Yin & Jiang, 2007; Sha & Saul, 2007; Li et al., 2007). Although the reported results for these approaches are very promising, the approaches have some shortcomings, in particular for large-scale appli...
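One way to write the margin-modified MMI criterion the abstract alludes to (my reconstruction; the paper's exact scaling and notation may differ) is to boost high-error competitors in the MMI denominator:

    \mathcal{F}(\theta) \;=\; \sum_n \log
    \frac{p(W_n)\, p_\theta(X_n \mid W_n)}
         {\sum_W p(W)\, p_\theta(X_n \mid W)\, e^{\rho\, E(W, W_n)}},

where E(W, W_n) counts the errors of hypothesis W against the reference W_n and \rho \ge 0 sets the margin scale; \rho = 0 recovers standard MMI. Because only the denominator weights change, the same accumulators and optimization code as in conventional discriminative training still apply, which is the practical point the abstract emphasizes.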

Phone recognition using restricted Boltzmann machines

by Abdel-rahman Mohamed, Geoffrey Hinton
"... For decades, Hidden Markov Models (HMMs) have been the state-of-the-art technique for acoustic modeling despite their unrealistic independence assumptions and the very limited representational capacity of their hidden states. Conditional Restricted Boltzmann Machines (CRBMs) have recently proved to ..."
Abstract - Cited by 11 (3 self) - Add to MetaCart
For decades, Hidden Markov Models (HMMs) have been the state-of-the-art technique for acoustic modeling despite their unrealistic independence assumptions and the very limited representational capacity of their hidden states. Conditional Restricted Boltzmann Machines (CRBMs) have recently proved to be very effective for modeling motion capture sequences, and this paper investigates the application of this more powerful type of generative model to acoustic modeling. On the standard TIMIT corpus, one type of CRBM outperforms HMMs and is comparable with the best other methods, achieving a phone error rate (PER) of 26.7% on the TIMIT core test set. Index Terms — phone recognition, restricted Boltzmann machines, distributed representations.

Citation Context

...05. Table 3 compares the results achieved by the ICRBM model to other proposed models.

Table 3. Reported results on TIMIT core test set
    Method                                     PER
    Conditional Random Field [11]              34.8%
    Large-Margin GMM [12]                      28.2%
    CD-HMM [2]                                 27.3%
    ICRBM (this paper)                         26.7%
    Augmented Conditional Random Fields [2]    26.6%
    Recurrent Neural Nets [13]                 26.1%
    Monophone HTMs [1]                         24.8%
    Heterogeneous Classifiers [14]             24.4%

7. CONCLUS...
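For orientation, a generic contrastive divergence (CD-1) update for a plain binary RBM (a sketch under simplifying assumptions: the paper's CRBM additionally conditions on previous acoustic frames, and real acoustic models typically use Gaussian visible units):

    import numpy as np

    rng = np.random.default_rng(0)

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def cd1_update(v0, W, b, c, lr=0.1):
        """One CD-1 step for a binary RBM. v0: (n_visible,) data
        vector; W: (n_visible, n_hidden) weights; b, c: visible and
        hidden biases. Updates the parameters in place."""
        h0 = sigmoid(v0 @ W + c)                  # hidden probabilities
        h_s = (rng.random(h0.shape) < h0) * 1.0   # sampled hidden states
        v1 = sigmoid(h_s @ W.T + b)               # mean-field reconstruction
        h1 = sigmoid(v1 @ W + c)
        W += lr * (np.outer(v0, h0) - np.outer(v1, h1))
        b += lr * (v0 - v1)
        c += lr * (h0 - h1)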

An exploration of large vocabulary tools for small vocabulary phonetic recognition

by Tara N. Sainath, Bhuvana Ramabhadran, Michael Picheny - in Proc. IEEE Automatic Speech Recognition and Understanding Workshop, 2009
"... Abstract-While research in large vocabulary continuous speech recognition (LVCSR) has sparked the development of many state of the art research ideas, research in this domain suffers from two main drawbacks. First, because of the large number of parameters and poorly labeled transcriptions, gaining ..."
Abstract - Cited by 10 (3 self) - Add to MetaCart
While research in large vocabulary continuous speech recognition (LVCSR) has sparked the development of many state-of-the-art research ideas, research in this domain suffers from two main drawbacks. First, because of the large number of parameters and poorly labeled transcriptions, gaining insight into further improvements based on error analysis is very difficult. Second, LVCSR systems often take a significantly longer time to train and test new research ideas compared to small vocabulary tasks. A small vocabulary task like TIMIT provides a phonetically rich and hand-labeled corpus and offers a good test bed to study algorithmic improvements. However, research ideas explored for small vocabulary tasks do not always provide gains on LVCSR systems. In this paper, we address these issues by taking the standard "recipe" used in typical LVCSR systems and applying it to the TIMIT phonetic recognition corpus, which provides a standard benchmark to compare methods. We find that at the speaker-independent (SI) level, our results offer comparable performance to other SI HMM systems. By taking advantage of speaker adaptation and discriminative training techniques commonly used in LVCSR systems, we achieve an error rate of 20%, the best result reported on the TIMIT task to date, moving us closer to the reported human phonetic recognition error rate of 15%. We propose the use of this system as the baseline for future research and believe that it will serve as a good framework to explore ideas that will carry over to LVCSR systems.

Citation Context

...ning on small vocabulary [7] have translated into huge gains for LVCSR [8]. In addition, improvements seen on TIMIT using neural nets [2] have also been successfully applied to LVCSR systems [9]. Our phonetic recognition experiments reveal that at the SI level, we are able to achieve a phonetic error rate (PER) of 25.6%, which compares to one of the best SI Hidden Markov Model (HMM) results reported in the literature [10]. Next, we find that utilizing discriminative training, the results are significantly better than the performance of other discriminative training systems on the TIMIT task [11]. Incorporating speaker adaptation allows us to achieve an error rate of 20.0%. To our knowledge, our full system offers the best results on the TIMIT task to date. A spectrogram reading experiment in [12] reported a human error rate for reading phonemes of approximately 15.0%. Our error rate of 20.0% illustrates the benefits of an LVCSR recipe for speech recognition, pushing speech research closer towards the ultimate goal of reaching human-level performance. A further error analysis indicates that most of the errors are due to confusions between phonemes within...

A fast online algorithm for large margin training of continuous density hidden Markov models

by Chih-Chieh Cheng, Fei Sha, Lawrence K. Saul - in Proceedings of Interspeech, 2009
"... We propose an online learning algorithm for large margin training of continuous density hidden Markov models. The online algorithm updates the model parameters incrementally after the decoding of each training utterance. For large margin training, the algorithm attempts to separate the log-likelihoo ..."
Abstract - Cited by 9 (3 self) - Add to MetaCart
We propose an online learning algorithm for large margin training of continuous density hidden Markov models. The online algorithm updates the model parameters incrementally after the decoding of each training utterance. For large margin training, the algorithm attempts to separate the log-likelihoods of correct and incorrect transcriptions by an amount proportional to their Hamming distance. We evaluate this approach to hidden Markov modeling on the TIMIT speech database. We find that the algorithm yields significantly lower phone error rates than other approaches (both online and batch) that do not attempt to enforce a large margin. We also find that the algorithm converges much more quickly than analogous batch optimizations for large margin training. Index Terms: hidden Markov models, online learning, large margin classification, discriminative training, automatic speech recognition

Citation Context

...over time. However, researchers continue to experiment with new and improved methods for parameter estimation. Recently, several researchers have proposed methods for large margin training of CD-HMMs [2, 3, 4, 5, 6, 7]. In large margin training, acoustic models are estimated to assign significantly higher scores to correct transcriptions than competing ones; in particular, the margin between these scores may be req...
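A hypothetical sketch of the per-utterance update the abstract describes (every function name here is a placeholder, not the authors' code): decode with the current model, and if the correct transcription fails to outscore the best competitor by a margin proportional to their Hamming distance, take a gradient step toward the correct one.

    def online_margin_step(theta, x, y_ref, score, grad, decode,
                           hamming, lr=0.01):
        """One online large-margin update after decoding utterance x.
        score(theta, x, y): log-likelihood of transcription y;
        grad(theta, x, y): its gradient with respect to theta;
        decode(theta, x): best-scoring competing transcription;
        hamming(y, y_ref): margin target. All placeholders."""
        y_hat = decode(theta, x)
        loss = (score(theta, x, y_hat) + hamming(y_hat, y_ref)
                - score(theta, x, y_ref))
        if loss > 0:  # margin violated for this utterance
            theta = theta + lr * (grad(theta, x, y_ref)
                                  - grad(theta, x, y_hat))
        return theta

Updating after every utterance, rather than accumulating statistics over the whole corpus, is what gives the online algorithm its faster convergence relative to batch large margin training.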

Investigations on Convex Optimization Using Log-Linear HMMs for Digit String Recognition

by Georg Heigold, David Rybach, Ralf Schlüter, Hermann Ney
"... Discriminative methods are an important technique to refine the acoustic model in speech recognition. Conventional discriminative training is initialized with some baseline model and the parameters are re-estimated in a separate step. This approach has proven to be successful, but it includes many h ..."
Abstract - Cited by 9 (7 self) - Add to MetaCart
Discriminative methods are an important technique to refine the acoustic model in speech recognition. Conventional discriminative training is initialized with some baseline model and the parameters are re-estimated in a separate step. This approach has proven to be successful, but it includes many heuristics, approximations, and parameters to be tuned. This tuning involves much engineering and makes it difficult to reproduce and compare experiments. In contrast to conventional training, convex optimization techniques provide a sound approach to estimate all model parameters from scratch. Such a straightforward approach hopefully dispenses with additional heuristics, e.g. the scaling of posteriors. This paper addresses the question of how well this concept, using log-linear models, carries over to practice. Experimental results are reported for a digit string recognition task, which allows for the investigation of this issue without approximations. Index Terms: convex optimization, conditional random fields, acoustic modeling, digit string recognition

Citation Context

.... The motivation for this design decision is the fact that it is hard to initialize the density indices without a reasonable generative model. Note that this is an important difference to the work in [4, 3, 5]. The initialization of some log-linear models by generative models is possible due to the equivalence relation of Gaussian and log-linear models [6]. For instance, such a log-linear model using first ...
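The Gaussian-to-log-linear equivalence invoked here is standard and easy to state for the shared-covariance case (second-order features extend it to class-specific covariances):

    p(c \mid x) \;=\;
    \frac{\exp(\lambda_c^{\top} x + \alpha_c)}
         {\sum_{c'} \exp(\lambda_{c'}^{\top} x + \alpha_{c'})},
    \qquad
    \lambda_c = \Sigma^{-1} \mu_c, \quad
    \alpha_c = \log p(c) - \tfrac{1}{2}\, \mu_c^{\top} \Sigma^{-1} \mu_c,

since the class-independent quadratic term cancels in the posterior. A trained generative model therefore yields an exact log-linear initialization, which is how the density indices mentioned above can be seeded from a reasonable generative model.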
