## Comparison of large margin training to other discriminative methods for phonetic recognition by hidden Markov models (2007)

Venue: Proceedings of ICASSP 2007

Citations: 27 (4 self)

### BibTeX

```bibtex
@INPROCEEDINGS{Sha07comparisonof,
  author    = {Fei Sha},
  title     = {Comparison of large margin training to other discriminative methods for phonetic recognition by hidden Markov models},
  booktitle = {Proceedings of ICASSP 2007},
  year      = {2007}
}
```

### Abstract

In this paper we compare three frameworks for discriminative training of continuous-density hidden Markov models (CD-HMMs). Specifically, we compare two popular frameworks, based on conditional maximum likelihood (CML) and minimum classification error (MCE), to a new framework based on margin maximization. Unlike CML and MCE, our formulation of large margin training explicitly penalizes incorrect decodings by an amount proportional to the number of mislabeled hidden states. It also leads to a convex optimization over the parameter space of CD-HMMs, thus avoiding the problem of spurious local minima. We used discriminatively trained CD-HMMs from all three frameworks to build phonetic recognizers on the TIMIT speech corpus. The different recognizers employed exactly the same acoustic front end and hidden state space, thus enabling us to isolate the effect of different cost functions, parameterizations, and numerical optimizations. Experimentally, we find that our framework for large margin training yields significantly lower error rates than both CML and MCE training.

Index Terms: speech recognition, discriminative training, MMI, MCE, large margin, phoneme recognition

### Citations

435 | Max-margin Markov networks
- Taskar, Guestrin, et al.
- 2003
Citation Context: ...tion [1, 2]. Our framework has two salient features: (i) it attempts to separate the likelihoods of correct versus incorrect label sequences by margins proportional to the number of mislabeled states [10]; (ii) the required optimization is convex, thus avoiding the pitfall of spurious local minima. These features also distinguish our approach to large margin training of CD-HMMs from other recent formu...
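The margin criterion quoted in this context, proportional to the number of mislabeled states, can be sketched in a few lines; `hamming_margin` and `margin_violation` are illustrative names, not from the paper:

```python
def hamming_margin(correct_states, decoded_states):
    """Number of mislabeled hidden states: the margin by which the
    correct sequence's score must exceed the incorrect one's."""
    return sum(c != d for c, d in zip(correct_states, decoded_states))

def margin_violation(score_correct, score_incorrect,
                     correct_states, decoded_states):
    """Hinge-style penalty: positive when the incorrect decoding comes
    within the required (Hamming-scaled) margin of the correct one."""
    margin = hamming_margin(correct_states, decoded_states)
    return max(0.0, margin - (score_correct - score_incorrect))
```

Decodings with more mislabeled states must be beaten by a larger score gap, which is what ties the training objective to the error count rather than to a fixed margin.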

250 | Speaker-independent phone recognition using hidden Markov models
- Lee, Hon
- 1989
Citation Context: ...several variants of these frameworks) to explore the effects of different parameterizations, initializations, and cost functions. 3.1. Setup CD-HMMs were evaluated on the task of phonetic recognition [12], namely, mapping speech utterances to sequences of phonemes, as opposed to higher-level units, such as words. Phonetic label sequences of test utterances were inferred using Viterbi decoding. Note th...

183 | Discriminative Learning for Minimum Error Classification
- Juang, Katagiri
- 1992
Citation Context: ...iversity of California (San Diego) 9500 Gilman Drive, Mail Code 0404 La Jolla, CA 92093-0404 estimated directly to maximize the conditional likelihood [3, 4] or minimize the classification error rate [5]. Though not as straightforward to implement as ML estimation, discriminative methods yield much lower error rates on most tasks in automatic speech recognition (ASR). We investigate salient differenc...

163 | Maximum Mutual Information Estimation of Hidden Markov Model Parameters for Speech Recognition
- Souza, V, et al.
- 1986
Citation Context: ...estimated directly to maximize the conditional likelihood [3, 4] or minimize the classification error rate [5]. Though not as straightforward to implement as ML estimation, discriminative methods yield much lower error rates on most tasks in automatic speech recog...

137 | Speech Database Development: Design and Analysis of the Acoustic-Phonetic Corpus
- Lamel, Kassel, et al.
- 1986
Citation Context: ...s on most tasks in automatic speech recognition (ASR). We investigate salient differences between CML, MCE, and large margin training through carefully designed experiments on the TIMIT speech corpus [6]. Though much smaller than typical corpora used for large vocabulary ASR, the TIMIT corpus provides an apt benchmark for evaluating the intrinsic merits of different frameworks for discriminative trai...

99 | Large Scale Discriminative Training of Hidden Markov Models for Speech Recognition
- Woodland, Povey
Citation Context: ...n more explicitly by rewriting eq. (4) as:

$$\theta_{\mathrm{CML}} = \arg\max_\theta \sum_n \left[ \log p(X_n, Y_n) - \log \sum_S p(X_n, S) \right] \quad (5)$$

The CML estimator in eq. (4) is closely related to the maximum mutual information (MMI) estimator [7, 8], given by:

$$\theta_{\mathrm{MMI}} = \arg\max_\theta \sum_n \log \frac{p(X_n, Y_n)}{p(X_n)\,p(Y_n)} \quad (6)$$

Note that eqs. (4) and (6) yield identical estimators in the setting where the (language model) probabilities p(Y_n) are held fixed. ...
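As a quick numeric sanity check of the relationship quoted in this context, one can verify that the CML and MMI per-utterance objectives differ only by the constant log p(Y_n) when the language-model probability is held fixed; the toy probabilities below are made up for illustration, not taken from the paper:

```python
import math

# Toy joint scores p(X_n, S) over all competing state sequences S for one
# utterance, with "Y" the correct labeling (values chosen arbitrarily).
p_joint = {"Y": 0.12, "S1": 0.05, "S2": 0.03}   # p(X_n, S)
p_X = sum(p_joint.values())                      # p(X_n) = sum_S p(X_n, S)
p_Y = 0.4                                        # fixed language-model prob p(Y_n)

cml = math.log(p_joint["Y"]) - math.log(p_X)     # per-utterance term of eq. (5)
mmi = math.log(p_joint["Y"] / (p_X * p_Y))       # per-utterance term of eq. (6)

# The two terms differ by exactly log p(Y_n), a constant in theta,
# so both objectives share the same maximizer when p(Y_n) is fixed.
print(cml - mmi, math.log(p_Y))
```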

52 | Heterogeneous acoustic measurements for phonetic classification
- Halberstadt, Glass
- 1997
Citation Context: ...ies between phonetic segments were correctly located prior to decoding. This distinguishes the task of phonetic recognition, considered in this paper, from the simpler task of phonetic classification [13] considered in our earlier work [1].

| M | ML | CML | MCE | Margin |
|---|-------|-------|-------|--------|
| 1 | 40.1% | 36.4% | 35.6% | 31.2% |
| 2 | 36.5% | 34.6% | 34.5% | 30.8% |
| 4 | 34.7% | 32.8% | 32.4% | 29.8% |
| 8 | 32.7% | 31.5% | 30.9% | 28.2% |

Table 1. Phonetic error rates from diff...

51 | Large margin Gaussian mixture modeling for phonetic classification and recognition
- Sha, Saul
Citation Context: ...per, we present a systematic comparison of several leading frameworks for parameter estimation in CD-HMMs. These frameworks include a recently proposed scheme based on the goal of margin maximization [1, 2], an idea that has been widely applied in the field of machine learning. We compare the objective function and learning algorithm in this framework for large margin training to those of other traditio...

49 | MMIE training of large vocabulary recognition systems
- Valtchev, Odell, et al.
- 1997
Citation Context: ...n more explicitly by rewriting eq. (4) as:

$$\theta_{\mathrm{CML}} = \arg\max_\theta \sum_n \left[ \log p(X_n, Y_n) - \log \sum_S p(X_n, S) \right] \quad (5)$$

The CML estimator in eq. (4) is closely related to the maximum mutual information (MMI) estimator [7, 8], given by:

$$\theta_{\mathrm{MMI}} = \arg\max_\theta \sum_n \log \frac{p(X_n, Y_n)}{p(X_n)\,p(Y_n)} \quad (6)$$

Note that eqs. (4) and (6) yield identical estimators in the setting where the (language model) probabilities p(Y_n) are held fixed. ...

48 | Large margin hidden Markov models for automatic speech recognition
- Sha, Saul
- 2006
Citation Context: ...raining, we used steepest gradient descent (which worked better); for margin maximization, we used a combination of conjugate gradient and projected subgradient descent, as described in previous work [1, 2]. For CML training, we obtained competitive results from conjugate gradient descent and did not experiment with the extended Baum-Welch algorithm [14]. 3.2. Experimental results Table 1 shows the erro...

39 | MMI training for continuous phoneme recognition on the TIMIT database
- Kapadia, Valtchev, et al.
- 1993
Citation Context: ...gradient descent, as described in previous work [1, 2]. For CML training, we obtained competitive results from conjugate gradient descent and did not experiment with the extended Baum-Welch algorithm [14]. 3.2. Experimental results Table 1 shows the error rates of different CD-HMMs trained by ML, CML, MCE, and margin maximization. Here, M denotes the number of mixture components per state (in each GMM...

26 | Discriminative training for large vocabulary speech recognition using minimum classification error
- McDermott, Hazen, et al.
- 2007
Citation Context: ...(7) does not matter, as long as it is finite. The nondifferentiability of the sign and max functions in eq. (7) makes it difficult to minimize the misclassification error directly. Thus, MCE training [9] adopts the surrogate cost function:

$$N_{\mathrm{err}} \approx \sum_n \sigma\!\left( -\log p(X_n, Y_n) + \frac{1}{\eta} \log \sum_{S \neq Y_n} e^{\eta \log p(X_n, S)} \right) \quad (8)$$

In this approximation, a sigmoid function $\sigma(z) = (1 + e^{-\alpha z})^{-1}$ replaces the sign...
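A minimal sketch of the MCE surrogate quoted in this context, assuming scores are given as log-probabilities; the function name and default smoothing parameters are illustrative, not from the paper:

```python
import math

def mce_loss(logp_correct, logp_incorrect, eta=1.0, alpha=1.0):
    """Smoothed MCE cost for one utterance, in the spirit of eq. (8):
    a sigmoid of the gap between a softmax over incorrect decodings
    and the score of the correct one."""
    # log-sum-exp smoothing of the max over incorrect state sequences
    competitor = (1.0 / eta) * math.log(
        sum(math.exp(eta * lp) for lp in logp_incorrect))
    d = -logp_correct + competitor              # misclassification measure
    return 1.0 / (1.0 + math.exp(-alpha * d))   # sigmoid replaces the sign
```

When the correct decoding scores well above every competitor the loss approaches 0; when a competitor dominates it approaches 1, so summing over utterances mimics a smoothed 0/1 error count.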

19 | Large margin HMMs for speech recognition
- Li, Jiang, et al.
Citation Context: ...he required optimization is convex, thus avoiding the pitfall of spurious local minima. These features also distinguish our approach to large margin training of CD-HMMs from other recent formulations [11]. We start by reviewing the discriminant functions in large margin CD-HMMs [1, 2]. These parameterized functions of observations X and states S take a form analogous to the log-probability in eq. (1)....

16 | A decision-theoretic formulation of a training problem in speech recognition and a comparison of training by unconditional versus conditional maximum likelihood
- Nádas
- 1983
Citation Context: ...estimated directly to maximize the conditional likelihood [3, 4] or minimize the classification error rate [5]. Though not as straightforward to implement as ML estimation, discriminative methods yield much lower error rates on most tasks in automatic speech recog...