## Hidden conditional random fields for phone classification (2005)

Venue: Interspeech

Citations: 83 (6 self)

### BibTeX

@INPROCEEDINGS{Gunawardana05hiddenconditional,
    author = {Asela Gunawardana and Milind Mahajan and Alex Acero and John C. Platt},
    title = {Hidden conditional random fields for phone classification},
    booktitle = {Interspeech},
    year = {2005},
    pages = {1117--1120}
}

### Abstract

In this paper, we show the novel application of hidden conditional random fields (HCRFs) – conditional random fields with hidden state sequences – for modeling speech. Hidden state sequences are critical for modeling the non-stationarity of speech signals. We show that HCRFs can easily be trained using the simple direct optimization technique of stochastic gradient descent. We present results on the TIMIT phone classification task and show that HCRFs outperform comparable ML and CML/MMI trained HMMs. In fact, the HCRF results on this task are the best single-classifier results known to us. We note that the HCRF framework is easily extensible to recognition, since it is a state and label sequence modeling technique. We also note that HCRFs can handle complex features without any change in the training procedure.

### Citations

2310 | Conditional random fields: probabilistic models for segmenting and labeling sequence data
- Lafferty, McCallum, et al.
- 2001
Citation Context: ...conditional state transition probabilities are exponential (“maximum entropy”) distributions that may depend on arbitrary features of the entire observation sequence. Conditional random fields (CRFs) [4] are generalizations of MEMMs where the conditional probability of the entire state sequence giv...
Authors' address: Microsoft Research, One Microsoft Way, Redmond, WA 98052, USA, {aselag,milindm,alexac,jplatt}@microsoft.com

1894 | Numerical Optimization
- Nocedal, Wright
- 2000
Citation Context: ...necessary to use special purpose algorithms such as the EBW algorithm used in MMI and MPE estimation. CRFs are typically trained using iterative scaling methods or quasi-Newton methods such as L-BFGS [8]. It is possible to train HCRFs using Generalized EM (GEM) where the M-step is an iterative algorithm such as GIS or L-BFGS, rather than a closed form solution. As an alternative to (G)EM, direct opt...

488 | Discriminative training methods for hidden Markov models: theory and experiments with perceptron algorithms. The ACL-02 Conference on Empirical Methods in Natural Language Processing, Volume 10
- Collins
- 2002
Citation Context: ...the training set and the same sample can be processed multiple times. We also used a parameter averaging technique which is known to benefit robustness of stochastic approximation algorithms like SGD [9, 15]. The averaged parameters are obtained as $\lambda_{\text{avg}} = \frac{1}{N} \sum_{n=1}^{N} \lambda^{(n)}$. SGD training can be viewed as a softened extension of perceptron training [15] to hidden variable problems. Both L-BFGS and SGD requi...

444 | Shallow parsing with conditional random fields
- Sha, Pereira
- 2003
Citation Context: ...kelihood of the training set $L(\lambda) = \sum_{n=1}^{N} \log p(w^{(n)} \mid o^{(n)}; \lambda)$. L-BFGS is a well-known low-memory quasi-Newton method which has been applied successfully to the estimation of CRF parameters [14]. L-BFGS approximates the inverse of the Hessian using the history of the changes in parameter and gradient values (known as correction pairs) at previous L-BFGS iterations. Typically, 3 to 20 such mo...

439 | Maximum entropy Markov models for information extraction and segmentation
- McCallum, Freitag, et al.
- 2000
Citation Context: ... states need to model the observations in a uniform way, and that it is difficult to incorporate long-range dependencies between the states and the observations. Maximum entropy Markov models (MEMMs) [3] are direct (non-generative) models that attempt to remedy this – instead of observations being generated at each state, the state sequence is generated conditioned on the observations. The state at e...

431 | Generalized iterative scaling for log-linear models
- Darroch, Ratcliff
- 1972
Citation Context: ...veal the “correct” training state sequence through Viterbi alignment, which is used as ground truth during training. This allows the models to be trained using the generalized iterative scaling (GIS) [7] algorithm and its variants. We generalize this work and use CRFs with hidden state sequences for modeling speech. We term these models hidden CRFs (HCRFs). HCRFs are able to use features which can be...

371 | Stochastic Approximation Algorithms and Applications
- Kushner, Yin
- 1997
Citation Context: ...bly desirable since it avoids the indirection involved in the use of the EM auxiliary function. We have successfully used direct optimization techniques such as L-BFGS and stochastic gradient descent [9] to estimate HCRF parameters. We note that this approach is generalizable to other smooth discriminative criteria such as the conditional expectation of the raw phone or word error rate [10], or the s...

183 | Discriminative Learning for Minimum Error Classification
- Juang, Katagiri
- 1992
Citation Context: ...is approach is generalizable to other smooth discriminative criteria such as the conditional expectation of the raw phone or word error rate [10], or the smoothed empirical error of the training data [11]. We compare the performance of the novel HCRF models for speech to that of ML trained HMMs and maximum mutual information (MMI) trained HMMs on the TIMIT phone classification task and show that HCRFs...

179 | Minimum phone error and I-smoothing for improved discriminative training
- Povey, Woodland
- 2002
Citation Context: ...nt descent [9] to estimate HCRF parameters. We note that this approach is generalizable to other smooth discriminative criteria such as the conditional expectation of the raw phone or word error rate [10], or the smoothed empirical error of the training data [11]. We compare the performance of the novel HCRF models for speech to that of ML trained HMMs and maximum mutual information (MMI) trained HMMs...

116 | Discriminative training for large vocabulary speech recognition
- Povey
- 2003
Citation Context: ...the success of extended Baum-Welch (EBW) based techniques such as maximum mutual information (MMI) and minimum phone error (MPE) training in large vocabulary conversational speech recognition (LVCSR) [1]. However, the methods are poorly understood as they are used in ways in which their convergence guarantees no longer hold, and their successful use is as much art as it is science [1]. The rationale ...

97 | An inequality for rational functions with applications to some statistical estimation problems
- Gopalakrishnan, Kanevsky, et al.
- 1991
Citation Context: ...hniques is that general unconstrained optimization algorithms are not well-suited to optimizing generative hidden Markov models (HMMs) under discriminative criteria such as the conditional likelihood [2]. We present a class of models that in contrast to HMMs are discriminative rather than generative in nature, and are amenable to the use of general purpose unconstrained optimization algorithms. The H...

72 | On the use of support vector machines for phonetic classification. ICASSP
- Clarkson, Moreno
- 1999
Citation Context: ... HMMs using the same feature set and the model structure. The performance of HCRFs is the best single classifier results we know of on this task – including techniques such as support vector machines [12] and neural networks [13]. The advantage of HCRFs is that the model is a state sequence probability model, even when applied to the phone classification task, and can easily be extended to recognition...

52 | Heterogeneous acoustic measurements for phonetic classification
- Halberstadt, Glass
- 1997
Citation Context: ... just the squared terms as shown above. 4. Experimental Results: In this paper, we validate the ideas described above on the TIMIT phone classification task. We use the experimental setup described in [16]. Results are reported on the MIT development test set [16] and the NIST core test set. The training, development, and evaluation sets have 142,910, 15,334, and 7,333 phonetic segments respectively. We...

19 | Phone classification with segmental features and a binary-pair partitioned neural network classifier
- Zahorian, Silsbee, et al.
- 1997
Citation Context: ...ure set and the model structure. The performance of HCRFs is the best single classifier results we know of on this task – including techniques such as support vector machines [12] and neural networks [13]. The advantage of HCRFs is that the model is a state sequence probability model, even when applied to the phone classification task, and can easily be extended to recognition tasks where the boundari...

7 | A comparative study on maximum entropy and discriminative training for acoustic modeling in automatic speech recognition
- Macherey, Ney
- 2003
Citation Context: ...fully for tasks such as part-of-speech (POS) tagging and information extraction [3, 4]. MEMMs have also been applied to ASR with some success [5], while recent work on maximum entropy acoustic models [6] can be interpreted as an application of a somewhat constrained CRF to ASR. In ASR, the use of mixture models and multiple state models in modeling the observations means that the training data is inc...

1 | Maximum entropy direct model as a unified direct model for acoustic modeling in speech recognition
- Gao
- 2004
Citation Context: ...hm for decoding [4]. MEMMs and CRFs have been used successfully for tasks such as part-of-speech (POS) tagging and information extraction [3, 4]. MEMMs have also been applied to ASR with some success [5], while recent work on maximum entropy acoustic models [6] can be interpreted as an application of a somewhat constrained CRF to ASR. In ASR, the use of mixture models and multiple state models in mod...
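
### Illustration: SGD with parameter averaging

The citation context for Collins (2002) describes stochastic gradient descent in which samples are revisited over several passes and the parameters are averaged across updates, λavg = (1/N) Σ λ(n). The following is a minimal sketch of that idea only, not the paper's HCRF trainer: it assumes a toy binary logistic model with no hidden states, and the function name `sgd_with_averaging`, the learning rate, and the epoch count are all hypothetical choices made for illustration.

```python
import math
import random


def sgd_with_averaging(samples, dim, epochs=20, lr=0.1, seed=0):
    """Maximize the conditional log-likelihood of a toy logistic model by SGD.

    Assumption: a plain binary logistic model stands in for the HCRF.
    Returns (final_params, averaged_params); the average runs over the
    parameter vector obtained after every single-sample update.
    """
    rng = random.Random(seed)
    lam = [0.0] * dim       # current parameters, lambda^(n)
    lam_sum = [0.0] * dim   # running sum of all lambda^(n)
    n_updates = 0
    order = list(range(len(samples)))
    for _ in range(epochs):
        rng.shuffle(order)  # each pass revisits the samples in a new order
        for i in order:
            x, y = samples[i]  # y is 0 or 1
            z = sum(l * xi for l, xi in zip(lam, x))
            p = 1.0 / (1.0 + math.exp(-z))
            # Gradient of log p(y | x; lam) with respect to lam is (y - p) * x.
            for j in range(dim):
                lam[j] += lr * (y - p) * x[j]
            n_updates += 1
            for j in range(dim):
                lam_sum[j] += lam[j]
    lam_avg = [s / n_updates for s in lam_sum]
    return lam, lam_avg


# Toy data: first feature is a constant bias term, label is 1 when the
# second feature is positive.
toy = [((1.0, -2.0), 0), ((1.0, -1.0), 0), ((1.0, 1.0), 1), ((1.0, 2.0), 1)]
final_params, averaged_params = sgd_with_averaging(toy, dim=2)
```

Keeping a running sum rather than storing every iterate keeps memory use constant in the number of updates; the averaged vector is what would be used at test time under this scheme.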