## Speech Recognition Using Augmented Conditional Random Fields

Citations: 22 (0 self)

### BibTeX

@MISC{Hifny_speechrecognition,

author = {Yasser Hifny and Steve Renals},

title = {Speech Recognition Using Augmented Conditional Random Fields},

year = {}

}

### Abstract

Acoustic modeling based on hidden Markov models (HMMs) is employed by state-of-the-art stochastic speech recognition systems. Although HMMs are a natural choice to warp the time axis and model the temporal phenomena in the speech signal, their conditional independence properties limit their ability to model spectral phenomena well. In this paper, a new acoustic modeling paradigm based on augmented conditional random fields (ACRFs) is investigated and developed. This paradigm addresses some limitations of HMMs while maintaining many of the aspects which have made them successful. In particular, the acoustic modeling problem is reformulated in a data-driven, sparse, augmented space to increase discrimination. Acoustic context modeling is explicitly integrated to handle the sequential phenomena of the speech signal. We present an efficient framework for estimating these models that ensures scalability and generality. In the TIMIT

### Citations

9458 | The nature of statistical learning theory
- Vapnik
- 1995
Citation Context ...one models which use a window of left and right neighboring phones [23]–[27]. • Augmenting the observation space with a large number of dimensions, which can simplify the classification problem [28], =-=[29]-=-. • Relaxing the HMM conditional independence assumptions, which can be done by integrating acoustic context information in the modeling process to take into account longer time intervals [30]–[32]. A... |

8539 | Maximum likelihood from incomplete data via the EM algorithm
- Dempster, Laird, et al.
- 1977
Citation Context ...and directly related to the definition of a pattern classification problem. Generative HMMs are well understood models and may be trained efficiently using the expectation-maximization (EM) algorithm =-=[5]-=-. Using Bayes rule, the coarse density estimates provided by HMMs can be used for discrimination. Consequently, HMMs provide a means to learn and generate spectral information in order to discriminate... |

5086 | Neural Networks for Pattern Recognition
- Bishop
- 1995
Citation Context ...g the CML nonlinear objective function using Taylor’s expansion around the current model point in the parameter space [42]. Such approaches are well established in artificial neural networks research =-=[43]-=-, [28]. For example, the CRF training process has been accelerated by using a stochastic meta-descent algorithm which utilizes second-order information to adapt the gradient step sizes [44]. Similar m... |

4490 | A tutorial on hidden Markov models and selected applications in speech recognition
- Rabiner
- 1989
Citation Context ...ty of the hidden state sequence given a model to be approximated using a first order Markov chain. I. Introduction. State-of-the-art automatic speech recognition systems use hidden Markov models (HMMs) =-=[1]-=-–[4] to model the temporal variation, with local spectral variability modeled using flexible distributions such as mixtures of Gaussian densities. HMMs can divide the acoustic space into a large numbe... |

3895 | Neural Networks: A Comprehensive Foundation
- Haykin
- 1999
Citation Context ...ent phone models which use a window of left and right neighboring phones [23]–[27]. • Augmenting the observation space with a large number of dimensions, which can simplify the classification problem =-=[28]-=-, [29]. • Relaxing the HMM conditional independence assumptions, which can be done by integrating acoustic context information in the modeling process to take into account longer time intervals [30]–[... |

2457 | Conditional random fields: Probabilistic models for segmenting and labeling sequence data
- Lafferty, McCallum, et al.
- 2001
Citation Context ...ANDOM FIELDS (ACRFS) ACRFs incorporate acoustic context information into an augmented space in order to model the sequential phenomena of the speech signal. ACRFs are derived from linear chain CRFs =-=[40]-=-, which are undirected graphical models that maintain the Markov properties of HMMs, formulated using the maximum entropy (MaxEnt) principle [41]. Linear chain CRFs can be thought of as the undirected gr... |

2172 | The Elements of Statistical Learning
- Hastie, Tibshirani, et al.
- 2001
Citation Context ...rameters that must be estimated robustly. The L1 regularizer, or Lasso penalty, is often used to increase the sparseness of the model since it can lead to solutions where some elements of the parameter vector are exactly zero =-=[50]-=-. The gradient of the objective function is given by Eqs. (19) and (20) [equations not recoverable from the extracted text] ... to ensure the increase o... |

2035 | Numerical optimization
- Nocedal, Wright
- 1999
Citation Context ...hes. These methods rely on a locally linear or quadratic approximation by expanding the CML nonlinear objective function using Taylor’s expansion around the current model point in the parameter space =-=[42]-=-. Such approaches are well established in artificial neural networks research [43], [28]. For example, the CRF training process has been accelerated by using a stochastic meta-descent algorithm which ... |

1363 | A training algorithm for optimal margin classifiers
- Boser, Guyon, et al.
- 1992
Citation Context ...r consideration (i.e., the acoustic regions most likely to account for the current frame). The selection of an -best 1 More recently, approaches that employ feature spaces induced by Mercer’s kernels =-=[35]-=-, [29] have been widely used since they are theoretically attractive, enabling computations in possibly infinite-dimensional feature spaces to be performed in finite-dimensional kernel spaces. 2 Typic... |

1311 | Statistical Decision Theory and Bayesian Analysis
- Berger
- 1985
Citation Context ...odel . Selecting a suitable value for can usually be achieved via cross validation. Alternatively, the problem can also be cast as model selection within the marginal likelihood or evidence framework =-=[52]-=-. A simple and pragmatic method that can be useful for large-scale optimization required for speech recognition is proposed. This method is based on training the unregularized simple ACRFs ( and ) for... |

761 | Statistical methods for speech recognition - Jelinek - 1997 |

564 | Inducing features of random fields
- Della Pietra, Della Pietra, et al.
- 1997
Citation Context ...CRFs since these methods usually tend to be less heuristic than numerical optimization methods. Exact lower bound optimization of linear chain CRFs [40] based on iterative scaling (IS) variants [46], =-=[47]-=- is very slow. In our work, we use a family of iterative scaling algorithms, which we call Approximate Iterative Scaling (AIS), to speed up the training process. While AIS algorithms follow the exact ... |

473 | Connectionist speech recognition: A hybrid approach
- Bourlard, Morgan
- 1994
Citation Context ...coustic Context Acoustic context, taking into account a longer time interval for state discrimination within the modeling process, has been used previously in hybrid connectionist/HMM acoustic models =-=[38]-=- and for discriminant feature extraction [39], [18], [19]. Once the high-dimensional augmented vector is constructed, it is possible to take advantage of acoustic context by adding the surrounding aug... |

455 | A maximum entropy model for Part-Of-speech tagging
- Ratnaparkhi
- 1996
Citation Context ...rly to speech recognition. 6 Some researchers prune the parameter space by removing the parameters associated with constraints that have low empirical expectation values before the training process 7 =-=[56]-=-, [57]. This technique belongs to learning model parameters methods only, while fixing the structure of the model. Of course, learning model structure is a harder problem with respect to learning the ... |

434 | Generalized iterative scaling for log-linear models
- Darroch, Ratcliff
- 1972
Citation Context ...rain ACRFs since these methods usually tend to be less heuristic than numerical optimization methods. Exact lower bound optimization of linear chain CRFs [40] based on iterative scaling (IS) variants =-=[46]-=-, [47] is very slow. In our work, we use a family of iterative scaling algorithms, which we call Approximate Iterative Scaling (AIS), to speed up the training process. While AIS algorithms follow the ... |

434 | Optimal brain damage
- Le Cun, Denker, et al.
- 1990
Citation Context ...ch recognition and will lead to very sparse models as the results show in Section VI. Other pruning methods have been developed based on forward greedy constraint induction [47], Optimal Brain Damage =-=[53]-=-, or Optimal Brain Surgeon [54]. However, these methods scale Authorized licensed use limited to: The University of Edinburgh. Downloaded on February 10, 2009 at 12:51 from IEEE Xplore. Restrictions a... |

411 | Exploiting generative models in discriminative classifiers
- Jaakkola, Haussler
- 1999
Citation Context ...te the weights of these e-family activation functions as well as the parameters associated with transition constraints. Score-space kernels [75], [76], which are a generalization of the Fisher kernel =-=[77]-=-, are used to extract new sufficient statistics, which may relax the conditional independence assumptions in a systematic fashion. These sufficient statistics are used to train conditional statistical... |

284 | Spoken language processing: a guide to theory, algorithm, and system development - Huang, Acero, et al. - 2001 |

262 | Speaker independent phone recognition using hidden Markov models
- Lee, Hon
- 1989
Citation Context ...sented the speech using 12th-order Mel frequency cepstral coefficients (MFCCs), energy, along with their first and second temporal derivatives, resulting in a 39-element feature vector. Following Lee =-=[58]-=-, the original 61 phone classes in TIMIT were mapped to a set of 48 labels, which were used for training. This set of 48 phone classes was mapped down to a set of 39 classes [58], after decoding, and ... |

258 | Adaptive control processes
- Bellman
- 1961
Citation Context ... in the system and is the dimensionality of the constructed augmented space. 5 Training a large number of parameters can lead to overfitting and poor generalization due to the curse of dimensionality =-=[49]-=-. To address this, we have employed an regularizer (Section V-A), which we use in the context of an efficient, incremental training algorithm (Section V-B). A. -ACRF Models Regularization is a common ... |

236 | A comparison of algorithms for maximum entropy parameter estimation
- Malouf
- 2002
Citation Context ...he available optimization techniques for this class of model are based on either iterative scaling or gradient descent. In the case of maximum entropy modeling for natural language processing, Malouf =-=[69]-=- demonstrated that gradient-based optimization is considerably more efficient than approaches based on iterative scaling. However, we note that in this case the structure of the problem—binary feature... |

229 | Geometrical and statistical properties of systems of linear inequalities with application in pattern recognition
- Cover
- 1965
Citation Context ...ture projection into high-dimensional spaces is a powerful tool to simplify classification problems, since high-dimensional spaces are more likely to be linearly separable than low-dimensional spaces =-=[34]-=-, as illustrated in Fig. 1. This is usually achieved by mapping the low-dimensional input space into a high-dimensional space, with linear decision boundaries used Fig. 1. Two-dimensional classificati... |

188 | On the rationale of maximum entropy methods
- Jaynes
- 1982
Citation Context ...ech signal. ACRFs are derived from linear chain CRFs [40], which are undirected graphical models that maintain the Markov properties of HMMs, formulated using the maximum entropy (MaxEnt) principle =-=[41]-=-. Linear chain CRFs can be thought of as the undirected graphical twins for HMMs regardless of their training (generative or discriminative). ACRF acoustic models are a particular implementation of linea... |

170 | Maximum mutual information estimation of hidden Markov model parameters for speech recognition
- Bahl, Brown, et al.
- 1986
Citation Context ...tion error. One way to address this problem within the HMM framework is to utilize the parameters efficiently to improve the discrimination between speech classes via discriminative training for HMMs =-=[7]-=-–[12]. Large-vocabulary continuous speech recognition systems based on continuous Gaussian mixture HMMs are very successful [13], mainly because the associated algorithms are computationally very effi... |

136 | Efficient backprop
- LeCun, Bottou, et al.
- 1998
Citation Context ... acoustic context) hyperparameter estimation becomes demanding. Stochastic or online updates based on gradient descent algorithms have proven to be very efficient for a number of large-scale problems =-=[70]-=-. In our work, we investigated the use of an online iterative scaling algorithm [48]. However, while it was easy to show it provides faster convergence, we found this advantage outweighed by the necess... |

101 | Accelerated training of conditional random fields with stochastic meta-descent
- Vishwanathan, Schraudolph, et al.
- 2006
Citation Context ...s research [43], [28]. For example, the CRF training process has been accelerated by using a stochastic meta-descent algorithm which utilizes second-order information to adapt the gradient step sizes =-=[44]-=-. Similar methods have been used to train HMMs by relaxing the probabilistic constraints during the HMM training process [45]. For an e-family activation function based on first-order sufficient stati... |

98 | An Inequality for Rational Functions with Applications to Some Statistical Estimation Problems - Gopalakrishnan, Kanevsky, et al. - 1991 |

90 | Heteroscedastic discriminant analysis and reduced rank HMMs for improved speech recognition, Speech Communication - Kumar, Andreou - 1998 |

90 | Hidden-articulator Markov models for speech recognition, ITRW ASR2000
- Richardson, Bilmes, et al.
- 2000
Citation Context ...], [29]. • Relaxing the HMM conditional independence assumptions, which can be done by integrating acoustic context information in the modeling process to take into account longer time intervals [30]–=-=[32]-=-. Acoustic context information may be incorporated using dynamic features [33] or implicitly based on feature projection [20], [21]. In this paper, a new acoustic model closely related to the HMM fram... |

85 | Hidden conditional random fields for phone classification
- Gunawardana, Mahajan, et al.
- 2005
Citation Context ...ction matrix in fMPE). fMPE feature projection methods may be used within the -ACRF framework. Maximum entropy acoustic modeling based on low-dimensional spaces has become an active area of research [72]–=-=[74]-=-, [48]. A linear chain CRF model analogous to an HMM (as it is used in speech recognition) relaxes the stochastic transition constraints and its local observation scoring is based on quadratic activation... |

83 | fMPE: Discriminatively trained features for speech recognition
- Povey, Kingsbury, et al.
Citation Context ...the modeling process to take into account longer time intervals [30]–[32]. Acoustic context information may be incorporated using dynamic features [33] or implicitly based on feature projection [20], =-=[21]-=-. In this paper, a new acoustic model closely related to the HMM framework is proposed and evaluated. This framework focuses on augmenting the observation space and integrating the acoustic context in... |

81 | Linear discriminant analysis for improved large vocabulary continuous speech recognition
- Haeb-Umbach, Ney
- 1992
Citation Context ...res representing the observation space, while preserving the information needed to discriminate between speech classes. These limitations of the HMM may be addressed in part through the use of linear =-=[14]-=-–[16] or nonlinear feature projection methods [17]–[22], which extract new sufficient statistics that take into account acoustic context and improve the discrimination between speech classes. This may... |

79 | A review of large-vocabulary continuous-speech recognition
- Young
- 1996
Citation Context ...scrimination between speech classes via discriminative training for HMMs [7]–[12]. Large-vocabulary continuous speech recognition systems based on continuous Gaussian mixture HMMs are very successful =-=[13]-=-, mainly because the associated algorithms are computationally very efficient and scale well as the amount of training data increases. These attractive properties arise from two assumptions that lead ... |

79 | Grafting: Fast, incremental feature selection by gradient descent in function spaces
- Perkins, Lacker, et al.
Citation Context ...ization algorithm to train ACRFs and prune their parameter space concurrently. Alternatively, gradient based optimization can be used to train general models as the gradient is defined and calculated =-=[51]-=-. The value for the hyperparameter specifies the compromise between the complexity of the model and modeling accuracy. Increasing the value of will lead to reduction of the number of the active parame... |

77 | Maximum likelihood discriminant feature spaces
- Saon, Padmanabhan, et al.
- 2000
Citation Context ...epresenting the observation space, while preserving the information needed to discriminate between speech classes. These limitations of the HMM may be addressed in part through the use of linear [14]–=-=[16]-=- or nonlinear feature projection methods [17]–[22], which extract new sufficient statistics that take into account acoustic context and improve the discrimination between speech classes. This may be a... |

74 | The Acoustic-Modeling Problem in Automatic Speech Recognition - Brown - 1987 |

74 | Heterogeneous measurements and multiple classifiers for speech recognition
- Halberstadt, Glass
- 1998
Citation Context ...ed to train -ACRFs as well. -ACRFs can also take advantage of the TRAP tandem approach [64], [65] as a powerful frontend [20]. System combination and rescoring, committee-based methods, such as [65], =-=[66]-=-, may be applied for -ACRFs. Note that 50 speakers from the full test set were used for cross validation in [66]. Bilmes [32] experimentally compared acoustic context modeling with the basic HMM formula... |

73 | Speech recognition using SVMs
- Smith, Gales
- 2002
Citation Context ...ilable). The goal of the training process is to estimate the weights of these e-family activation functions as well as the parameters associated with transition constraints. Score-space kernels [75], =-=[76]-=-, which are a generalization of the Fisher kernel [77], are used to extract new sufficient statistics, which may relax the conditional independence assumptions in a systematic fashion. These sufficien... |

72 | Large scale discriminative training for speech recognition
- Woodland, Povey
- 2000
Citation Context ... error. One way to address this problem within the HMM framework is to utilize the parameters efficiently to improve the discrimination between speech classes via discriminative training for HMMs [7]–=-=[12]-=-. Large-vocabulary continuous speech recognition systems based on continuous Gaussian mixture HMMs are very successful [13], mainly because the associated algorithms are computationally very efficient... |

56 | Context-dependent modeling for acoustic-phonetic recognition of continuous speech
- Schwartz, Chow, et al.
- 1985
Citation Context ...• Augmenting the state space by increasing the number of hidden states. This can be done by using context-dependent phone models which use a window of left and right neighboring phones =-=[23]-=-–[27]. • Augmenting the observation space with a large number of dimensions, which can simplify the classification problem [28], [29]. • Relaxing the HMM conditional independence assumptions, which ca... |

55 | Vector Quantization for the Efficient Computation of Continuous Density Likelihoods
- Bocchieri
- 1993
Citation Context ...likelihood score will take the role of the constraint posterior score (3) in the augmented spaces framework. Scoring a large number of Gaussians may be accelerated using Gaussian selection techniques =-=[36]-=-, [37]. The augmented spaces framework supports other e-family activation functions. Samples of these activation functions are The e-family activation functions (5) and (6) can be estimated by accumul... |

50 | Optimal brain surgeon and general network pruning
- Hassibi, Stork, et al.
- 1993
Citation Context ... very sparse models as the results show in Section VI. Other pruning methods have been developed based on forward greedy constraint induction [47], Optimal Brain Damage [53], or Optimal Brain Surgeon =-=[54]-=-. However, these methods scale ... IEEE TRANSACTIONS ON ... |

46 | Comparison of Discriminative Training Criteria - Schluter, Macherey - 1998 |

46 | Tandem connectionist feature stream extraction for conventional HMM systems
- Hermansky, Ellis, et al.
- 2000
Citation Context ...onger time interval for state discrimination within the modeling process, has been used previously in hybrid connectionist/HMM acoustic models [38] and for discriminant feature extraction [39], [18], =-=[19]-=-. Once the high-dimensional augmented vector is constructed, it is possible to take advantage of acoustic context by adding the surrounding augmented frames to the current frame during state scoring. ... |

45 | Explicit Time Correlation in Hidden Markov Models for Speech Recognition
- Wellekens
- 1987
Citation Context ...m [28], [29]. • Relaxing the HMM conditional independence assumptions, which can be done by integrating acoustic context information in the modeling process to take into account longer time intervals =-=[30]-=-–[32]. Acoustic context information may be incorporated using dynamic features [33] or implicitly based on feature projection [20], [21]. In this paper, a new acoustic model closely related to the HMM... |

41 | MMI training for continuous phoneme recognition on the TIMIT database
- Kapadia, Valtchev, et al.
Citation Context ...on (Section I), which is not addressed in the ACRF framework. The other two enhancements are addressed within the ACRF framework. In general, improvements based on using different objective functions =-=[62]-=-, [63] do not address the acoustic modeling formulation and can be used to train -ACRFs as well. -ACRFs can also take advantage of the TRAP tandem approach [64], [65] as a powerful frontend [20]. Syst... |

40 | A decision theoretic formulation of a training problem in speech recognition and a comparison of training by unconditional versus conditional maximum likelihood
- Nadas
- 1983
Citation Context ...tion that generated the data is indeed an HMM, then, given sufficient data, Bayes classification based on HMMs estimated using maximum likelihood will minimize the probability of classification error =-=[6]-=-. In practice, the decision boundaries constructed after generative training are not optimal and generative HMMs are not guaranteed to minimize the classification error. One way to address this proble... |

38 | Hierarchical Structures of Neural Networks for Phoneme Recognition
- Schwarz, Matejka, et al.
- 2006
Citation Context ... on using different objective functions [62], [63] do not address the acoustic modeling formulation and can be used to train -ACRFs as well. -ACRFs can also take advantage of the TRAP tandem approach =-=[64]-=-, [65] as a powerful frontend [20]. System combination and rescoring, committee-based methods, such as [65], [66], may be applied for -ACRFs. Note that 50 speakers from the full test set were used for c... |

34 | What HMMs can do
- Bilmes
- 2002
Citation Context ...f the hidden state sequence given a model to be approximated using a first order Markov chain. I. Introduction. State-of-the-art automatic speech recognition systems use hidden Markov models (HMMs) [1]–=-=[4]-=- to model the temporal variation, with local spectral variability modeled using flexible distributions such as mixtures of Gaussian densities. HMMs can divide the acoustic space into a large number of... |

34 | Large vocabulary speaker-independent continuous speech recognition: the SPHINX system. Unpublished Doctoral dissertation - Lee - 1988 |