Results 1  10
of
116
Hidden Markov processes
 IEEE Trans. Inform. Theory
, 2002
"... Abstract—An overview of statistical and informationtheoretic aspects of hidden Markov processes (HMPs) is presented. An HMP is a discretetime finitestate homogeneous Markov chain observed through a discretetime memoryless invariant channel. In recent years, the work of Baum and Petrie on finite ..."
Abstract

Cited by 259 (5 self)
 Add to MetaCart
(Show Context)
Abstract—An overview of statistical and informationtheoretic aspects of hidden Markov processes (HMPs) is presented. An HMP is a discretetime finitestate homogeneous Markov chain observed through a discretetime memoryless invariant channel. In recent years, the work of Baum and Petrie on finitestate finitealphabet HMPs was expanded to HMPs with finite as well as continuous state spaces and a general alphabet. In particular, statistical properties and ergodic theorems for relative entropy densities of HMPs were developed. Consistency and asymptotic normality of the maximumlikelihood (ML) parameter estimator were proved under some mild conditions. Similar results were established for switching autoregressive processes. These processes generalize HMPs. New algorithms were developed for estimating the state, parameter, and order of an HMP, for universal coding and classification of HMPs, and for universal decoding of hidden Markov channels. These and other related topics are reviewed in this paper. Index Terms—Baum–Petrie algorithm, entropy ergodic theorems, finitestate channels, hidden Markov models, identifiability, Kalman filter, maximumlikelihood (ML) estimation, order estimation, recursive parameter estimation, switching autoregressive processes, Ziv inequality. I.
P.C.: Minimum phone error and Ismoothing for improved discriminative training
 In: Proc. ICASSP
, 2002
"... In this paper we introduce the Minimum Phone Error (MPE) and Minimum Word Error (MWE) criteria for the discriminative training of HMM systems. The MPE/MWE criteria are smoothed approximations to the phone or word error rate respectively. We also discuss Ismoothing which is a novel technique for s ..."
Abstract

Cited by 240 (9 self)
 Add to MetaCart
(Show Context)
In this paper we introduce the Minimum Phone Error (MPE) and Minimum Word Error (MWE) criteria for the discriminative training of HMM systems. The MPE/MWE criteria are smoothed approximations to the phone or word error rate respectively. We also discuss Ismoothing which is a novel technique for smoothing discriminative training criteria using statistics for maximum likelihood estimation (MLE). Experiments have been performed on the Switchboard/Call Home corpora of telephone conversations with up to 265 hours of training data. It is shown that for the maximum mutual information estimation (MMIE) criterion, Ismoothing reduces the word error rate (WER) by 0.4 % absolute over the MMIE baseline. The combination of MPE and Ismoothing gives an improvement of 1 % over MMIE and a total reduction in WER of 4.8 % absolute over the original MLE system. 1.
Hidden conditional random fields for phone classification
 in Interspeech
, 2005
"... In this paper, we show the novel application of hidden conditional random fields (HCRFs) – conditional random fields with hidden state sequences – for modeling speech. Hidden state sequences are critical for modeling the nonstationarity of speech signals. We show that HCRFs can easily be trained u ..."
Abstract

Cited by 109 (7 self)
 Add to MetaCart
(Show Context)
In this paper, we show the novel application of hidden conditional random fields (HCRFs) – conditional random fields with hidden state sequences – for modeling speech. Hidden state sequences are critical for modeling the nonstationarity of speech signals. We show that HCRFs can easily be trained using the simple direct optimization technique of stochastic gradient descent. We present the results on the TIMIT phone classification task and show that HCRFs outperforms comparable ML and CML/MMI trained HMMs. In fact, HCRF results on this task are the best single classifier results known to us. We note that the HCRF framework is easily extensible to recognition since it is a state and label sequence modeling technique. We also note that HCRFs have the ability to handle complex features without any change in training procedure. 1.
Large Scale Discriminative Training For Speech Recognition
, 2000
"... This paper describes, and evaluates on a large scale, the lattice based framework for discriminative training of large vocabulary speech recognition systems based on Gaussian mixture hidden Markov models (HMMs). The paper concentrates on the maximum mutual information estimation (MMIE) criterion whi ..."
Abstract

Cited by 86 (5 self)
 Add to MetaCart
This paper describes, and evaluates on a large scale, the lattice based framework for discriminative training of large vocabulary speech recognition systems based on Gaussian mixture hidden Markov models (HMMs). The paper concentrates on the maximum mutual information estimation (MMIE) criterion which has been used to train HMM systems for conversational telephone speech transcription using up to 265 hours of training data. These experiments represent the largestscale application of discriminative training techniques for speech recognition of which the authors are aware, and have led to significant reductions in word error rate for both triphone and quinphone HMMs compared to our best models trained using maximum likelihood estimation. The MMIE latticebased implementation used; techniques for ensuring improved generalisation; and interactions with maximum likelihood based adaptation are all discussed. Furthermore several variations to the MMIE training scheme are introduced with the a...
Large margin hidden Markov models for automatic speech recognition
 in Advances in Neural Information Processing Systems 19
, 2007
"... We study the problem of parameter estimation in continuous density hidden Markov models (CDHMMs) for automatic speech recognition (ASR). As in support vector machines, we propose a learning algorithm based on the goal of margin maximization. Unlike earlier work on maxmargin Markov networks, our ap ..."
Abstract

Cited by 81 (9 self)
 Add to MetaCart
(Show Context)
We study the problem of parameter estimation in continuous density hidden Markov models (CDHMMs) for automatic speech recognition (ASR). As in support vector machines, we propose a learning algorithm based on the goal of margin maximization. Unlike earlier work on maxmargin Markov networks, our approach is specifically geared to the modeling of realvalued observations (such as acoustic feature vectors) using Gaussian mixture models. Unlike previous discriminative frameworks for ASR, such as maximum mutual information and minimum classification error, our framework leads to a convex optimization, without any spurious local minima. The objective function for large margin training of CDHMMs is defined over a parameter space of positive semidefinite matrices. Its optimization can be performed efficiently with simple gradientbased methods that scale well to large problems. We obtain competitive results for phonetic recognition on the TIMIT speech corpus. 1
Comparison of Discriminative Training Criteria and Optimization Methods for Speech Recognition
, 2001
"... The aim of this work is to build up a common framework for a class of discriminative training criteria and optimization methods for continuous speech recognition. A unified discriminative criterion based on likelihood ratios of correct and competing models with optional smoothing is presented. The u ..."
Abstract

Cited by 60 (8 self)
 Add to MetaCart
The aim of this work is to build up a common framework for a class of discriminative training criteria and optimization methods for continuous speech recognition. A unified discriminative criterion based on likelihood ratios of correct and competing models with optional smoothing is presented. The unified criterion leads to particular criteria through the choice of competing word sequences and the choice of smoothing. Analytic and experimental comparisons are presented for both the maximum mutual information (MMI) and the minimum classification error (MCE) criterion together with the optimization methods gradient descent (GD) and extended Baum (EB) algorithm. A tree searchbased restricted recognition method using word graphs is presented, so as to reduce the computational complexity of large vocabulary discriminative training. Moreover, for MCE training, a method using word graphs for efficient calculation of discriminative statistics is introduced. Experiments were performed for continuous speech recognition using the ARPA wall street journal (WSJ) corpus with a vocabulary of 5k words and for the recognition of continuously spoken digit strings using both the TI digit string corpus for American English digits, and the SieTill corpus for telephone line recorded German digits. For the MMI criterion, neither analytical nor experimental results do indicate significant differences between EB and GD optimization. For acoustic models of low complexity, MCE training gave significantly better results than MMI training. The recognition results for large vocabulary MMI training on the WSJ corpus show a significant dependence on the context length of the language model used for training. Best results were obtained using a unigram language model for MMI training. No significant co...
The Dynamics of Nonlinear Relaxation Labeling Processes
, 1997
"... We present some new results which definitively explain the behavior of the classical, heuristic nonlinear relaxation labeling algorithm of Rosenfeld, Hummel, and Zucker in terms of the HummelZucker consistency theory and dynamical systems theory. In particular, it is shown that, when a certain symm ..."
Abstract

Cited by 41 (12 self)
 Add to MetaCart
We present some new results which definitively explain the behavior of the classical, heuristic nonlinear relaxation labeling algorithm of Rosenfeld, Hummel, and Zucker in terms of the HummelZucker consistency theory and dynamical systems theory. In particular, it is shown that, when a certain symmetry condition is met, the algorithm possesses a Liapunov function which turns out to be (the negative of) a wellknown consistency measure. This follows almost immediately from a powerful result of Baum and Eagon developed in the context of Markov chain theory. Moreover, it is seen that most of the essential dynamical properties of the algorithm are retained when the symmetry restriction is relaxed. These properties are also shown to naturally generalize to higherorder relaxation schemes. Some applications and implications of the presented results are finally outlined.
Speech Recognition Using Augmented Conditional Random Fields
"... Abstract—Acoustic modeling based on hidden Markov models (HMMs) is employed by stateoftheart stochastic speech recognition systems. Although HMMs are a natural choice to warp the time axis and model the temporal phenomena in the speech signal, their conditional independence properties limit their ..."
Abstract

Cited by 31 (2 self)
 Add to MetaCart
Abstract—Acoustic modeling based on hidden Markov models (HMMs) is employed by stateoftheart stochastic speech recognition systems. Although HMMs are a natural choice to warp the time axis and model the temporal phenomena in the speech signal, their conditional independence properties limit their ability to model spectral phenomena well. In this paper, a new acoustic modeling paradigm based on augmented conditional random fields (ACRFs) is investigated and developed. This paradigm addresses some limitations of HMMs while maintaining many of the aspects which have made them successful. In particular, the acoustic modeling problem is reformulated in a data driven, sparse, augmented space to increase discrimination. Acoustic context modeling is explicitly integrated to handle the sequential phenomena of the speech signal. We present an efficient framework for estimating these models that ensures scalability and generality. In the TIMIT
Discriminative speaker adaptation with conditional maximum likelihood linear regression
 In Eurospeech
, 2001
"... We present a simplified derivation of the extended BaumWelch procedure, which shows that it can be used for Maximum Mutual Information (MMI) of a large class of continuous emission density hidden Markov models (HMMs). We use the extended BaumWelch procedure for discriminative estimation of MLLRty ..."
Abstract

Cited by 31 (2 self)
 Add to MetaCart
We present a simplified derivation of the extended BaumWelch procedure, which shows that it can be used for Maximum Mutual Information (MMI) of a large class of continuous emission density hidden Markov models (HMMs). We use the extended BaumWelch procedure for discriminative estimation of MLLRtype speaker adaptation transformations. The resulting adaptation procedure, termed Conditional Maximum Likelihood Linear Regression (CMLLR), is used successfully for supervised and unsupervised adaptation tasks on the Switchboard corpus, yielding an improvement over MLLR. The interaction of unsupervised CMLLR with segmental minimum Bayes risk lattice voting procedures is also explored, showing that the two procedures are complimentary. 1.
On Reversing Jensen's Inequality
 In Advances in Neural Information Processing Systems 13
, 2000
"... Jensen's inequality is a powerful mathematical tool and one of the workhorses in statistical learning. Its applications therein include the EM algorithm, Bayesian estimation and Bayesian inference. Jensen computes simple lower bounds on otherwise intractable quantities such as products of s ..."
Abstract

Cited by 29 (3 self)
 Add to MetaCart
(Show Context)
Jensen's inequality is a powerful mathematical tool and one of the workhorses in statistical learning. Its applications therein include the EM algorithm, Bayesian estimation and Bayesian inference. Jensen computes simple lower bounds on otherwise intractable quantities such as products of sums and latent loglikelihoods. This simplification then permits operations like integration and maximization. Quite often (i.e. in discriminative learning) upper bounds are needed as well. We derive and prove an efficient analytic inequality that provides such variational upper bounds. This inequality holds for latent variable mixtures of exponential family distributions and thus spans a wide range of contemporary statistical models. We also discuss applications of the upper bounds including maximum conditional likelihood, large margin discriminative models and conditional Bayesian inference. Convergence, efficiency and prediction results are shown.