
## Discriminative models for speech recognition (2007)

Venue: Information Theory and Applications Workshop

Citations: 22 (8 self)

### Citations

13233 | Statistical Learning Theory
- Vapnik
- 1998
Citation Context: ...two classes that the label changes part way through. One form of discriminative classifier that has been found to yield good empirical results on a range of tasks is the Support Vector Machine (SVM) [34]. By using these generative kernel features SVMs can be applied to binary classification tasks with sequence data. This approach has been applied in the speech processing area to simple small vocabula...

11970 | Maximum likelihood from incomplete data via the EM algorithm
- Dempster, Laird, et al.
- 1977
Citation Context: ...Likelihood (ML) training. The likelihood criterion may be expressed as F_ml(λ) = (1/R) Σ_{r=1}^{R} log p(O^(r) | w^(r)_ref; λ) (3). This optimisation is normally performed using Expectation Maximisation (EM) [16]. During inference, or decoding, classification is based on Bayes' decision rule ŵ = argmax_w { P(w | O_{1:T}; λ) } (4), where the word sequence posterior is then obtained using Bayes' rule: P(w | O_{1:T}; ...
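The context above quotes the ML criterion (Eq. 3) and Bayes' decision rule (Eq. 4) in truncated form. A minimal sketch of both, assuming per-utterance log-likelihoods and log-priors have already been computed (all names here are illustrative, not from the paper):

```python
def f_ml(ref_log_liks):
    """ML criterion: F_ml = (1/R) * sum_r log p(O^(r) | w^(r)_ref; lambda)."""
    return sum(ref_log_liks) / len(ref_log_liks)

def decode(hypotheses):
    """Bayes' decision rule: pick w maximising log p(O|w) + log P(w),
    which is proportional to the posterior P(w | O_{1:T})."""
    return max(hypotheses, key=lambda h: h[1] + h[2])[0]

# Toy usage: two reference utterances, then two competing hypotheses.
avg = f_ml([-10.0, -12.0])                    # mean reference log-likelihood
best = decode([("hello there", -5.0, -1.0),   # (w, log p(O|w), log P(w))
               ("hollow hare", -4.5, -3.0)])
```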

5891 | A tutorial on hidden Markov models and selected applications in speech recognition.
- Rabiner
- 1989
Citation Context: ...s to the acoustic models have been made, for example speaker adaptation [2], adaptive training [3] and semi-tied covariance matrices [4], the underlying model has remained a Hidden Markov Model (HMM) [5]. One of the major developments that has significantly improved the performance of ASR systems is the use of discriminative criteria for training HMMs, rather than using the Maximum Likelihood (ML) cr...

3484 | Conditional random fields: Probabilistic models for segmenting and labeling sequence data
- Lafferty, McCallum, et al.
- 2001
Citation Context: ...elegant approach to incorporating a language model in this framework. This has limited possible gains with this form of model [12]. V. HIDDEN CONDITIONAL RANDOM FIELD. Conditional Random Fields (CRFs) [28] are one approach to constructing discriminative models. Given the observation sequence O_{1:T} = {o_1, ..., o_T} and label sequence w = {w_1, ..., w_L}, the standard form for this model is P(w | O_{1:T}...
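The CRF form quoted above is a globally normalised log-linear model. A toy sketch of the normalised posterior over a small candidate set, using explicit feature dictionaries as a flat, unstructured stand-in for the sequence case (names are illustrative, not from the paper):

```python
import math

def crf_posterior(alpha, feats):
    """P(w|O) = exp(alpha^T phi(O, w)) / Z, where Z sums over the
    candidate label sequences given in `feats` (w -> feature dict)."""
    scores = {w: sum(alpha.get(k, 0.0) * v for k, v in phi.items())
              for w, phi in feats.items()}
    z = math.log(sum(math.exp(s) for s in scores.values()))
    return {w: math.exp(s - z) for w, s in scores.items()}

# With zero weights every candidate is equally likely.
post = crf_posterior({}, {"w1": {"f": 1.0}, "w2": {"f": 2.0}})
```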

2303 | Text categorization with support vector machines.
- Joachims
- 1998
Citation Context: ...res from the observation and word sequences can be used for inference compared to generative models. These discriminative models have started to dominate the area of Natural Language Processing (NLP) [4], [5]. One issue in NLP training is that text data comprises variable length sequences of words yielding a vast number of possible classes. It is thus rarely possible to robustly construct models of c...

1655 | Error bounds for convolutional codes and an asymptotically optimum decoding algorithm
- Viterbi
- 1967
Citation Context: ...where the model parameters for a particular word sequence, λ^(ω), define the set of valid state sequences. Inference with these forms of model can be efficiently achieved using the Viterbi algorithm [23], where the likelihood is approximated using the best-state sequence. In most state-of-the-art ASR systems, the parameters of the distributions in Equation 3 are trained using discriminative criteria...
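The Viterbi algorithm mentioned in the context replaces the sum over state sequences with a max. A self-contained log-domain sketch of generic HMM best-path decoding (not tied to the paper's notation; the toy model at the bottom is illustrative):

```python
import math

def viterbi(obs, states, log_start, log_trans, log_emit):
    """Best-state-sequence HMM decoding in the log domain."""
    V = [{s: log_start[s] + log_emit[s][obs[0]] for s in states}]
    back = []
    for o in obs[1:]:
        prev, col, ptr = V[-1], {}, {}
        for s in states:
            # best predecessor for state s at this frame
            best = max(states, key=lambda p: prev[p] + log_trans[p][s])
            col[s] = prev[best] + log_trans[best][s] + log_emit[s][o]
            ptr[s] = best
        V.append(col)
        back.append(ptr)
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for ptr in reversed(back):       # trace back the best path
        path.append(ptr[path[-1]])
    return list(reversed(path)), V[-1][last]

# Toy 2-state example.
L = math.log
states = ["A", "B"]
path, score = viterbi(
    [0, 0, 1], states,
    {"A": L(0.9), "B": L(0.1)},
    {"A": {"A": L(0.8), "B": L(0.2)}, "B": {"A": L(0.3), "B": L(0.7)}},
    {"A": {0: L(0.9), 1: L(0.1)}, "B": {0: L(0.2), 1: L(0.8)}},
)
```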

922 | Statistical Methods for Speech Recognition.
- Jelinek
- 1997
Citation Context: ...s (HMMs) [1] are typically used as the acoustic models to derive the likelihood of a particular class generating an observation sequence. This is combined with a prior, e.g., an N-gram language model [2], to yield a posterior probability of the class given the observation. Acceptable performance in generative models is accomplished via refinements to the standard HMM acoustic models, including contex...

818 | Maximum Likelihood Linear Regression for Speaker Adaptation of Continuous Density Hidden Markov Models
- Leggetter, Woodland
- 1995
Citation Context: ...ous speech recognition (LVCSR) tasks, such as Broadcast News transcription [1], to be addressed. Though a number of modifications to the acoustic models have been made, for example speaker adaptation [2], adaptive training [3] and semi-tied covariance matrices [4], the underlying model has remained a Hidden Markov Model (HMM) [5]. One of the major developments that has significantly improved the perf...

660 | Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms.
- Collins
- 2002
Citation Context: ...(P(w|O; α)) (16). This is the form typically used for training discriminative models such as CRFs [5] and is usually the starting point for structured discriminative models [17]. Perceptron Algorithm [42]: F_per(α, w, O) = [ max_{w̃ ≠ w} { −log( P(w|O; α) / P(w̃|O; α) ) } ]_+ (17), where [x]_+ is the hinge-loss function. This can be extended to the averaged perceptron algorithm, where the parameters α are averaged...
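The perceptron criterion quoted above (Eq. 17) penalises any competitor scoring at least as well as the reference. A minimal structured-perceptron step for a log-linear model, assuming an explicit candidate list stands in for full decoding (all names illustrative):

```python
def score(phi, alpha):
    """Linear score alpha^T phi for a sparse feature dict."""
    return sum(alpha.get(k, 0.0) * v for k, v in phi.items())

def perceptron_step(alpha, phi, O, w_ref, candidates):
    """If the best competitor w~ != w_ref scores at least as well as the
    reference, update alpha += phi(O, w_ref) - phi(O, w~) (hinge at zero)."""
    w_hat = max((w for w in candidates if w != w_ref),
                key=lambda w: score(phi(O, w), alpha))
    if score(phi(O, w_hat), alpha) >= score(phi(O, w_ref), alpha):
        for k, v in phi(O, w_ref).items():
            alpha[k] = alpha.get(k, 0.0) + v
        for k, v in phi(O, w_hat).items():
            alpha[k] = alpha.get(k, 0.0) - v
    return alpha

# Toy joint feature: one indicator per label.
phi = lambda O, w: {("label", w): 1.0}
alpha = perceptron_step({}, phi, "obs", "cat", ["cat", "dog"])
```

Averaging the parameter vectors over updates gives the averaged perceptron mentioned in the snippet.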

624 | Large margin methods for structured and interdependent output variables.
- Tsochantaridis, Joachims, et al.
- 2005
Citation Context: ...make use of standard optimisation approaches associated with the perceptron criterion and structured SVMs discussed in section V. For some structured discriminative models, such as the structured SVM [39], this approximation is essential. With a single segmentation the following posterior is obtained: P(w_{1:L} | O_{1:T}, â; α) = (1/Z) exp( α^T Σ_{τ=1}^{|â|} φ(O_{{â_τ}}, â_τ) ) (11). The issue now is how the segmentatio...

551 | Exploiting generative models in discriminative classifiers.
- Jaakkola, Haussler
- 1998
Citation Context: ...ndard form is used then score-spaces from the kernel can be used as the basis for the feature-function. One interesting form of score-space for this form of kernel is based on generative models [18], [51]: φ(O_{{a_i}}, a_i) = [ log p(O_{{a_i}}; λ^(a_i)), ∇_{λ^(a_i)} log p(O_{{a_i}}; λ^(a_i)), ..., ∇^ρ_{λ^(a_i)} log p(O_{{a_i}}; λ^(a_i)) ]^T (28), where ∇^ρ_λ represents the (diagonalised) ρ-th ...
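Eq. 28 in the context above stacks the generative log-likelihood with its parameter derivatives. A one-parameter sketch for a univariate Gaussian generative model, where the derivative with respect to the mean gives the first-order, Fisher-style feature (an illustration only, not the paper's model):

```python
import math

def gaussian_score_space(obs, mu, var):
    """[ log p(O; lambda), d/dmu log p(O; lambda) ] for a univariate
    Gaussian generative model with parameters lambda = (mu, var)."""
    log_p = sum(-0.5 * (math.log(2 * math.pi * var) + (o - mu) ** 2 / var)
                for o in obs)
    d_mu = sum((o - mu) / var for o in obs)   # Fisher score w.r.t. the mean
    return [log_p, d_mu]

# Observations symmetric about the mean: the score term vanishes.
feat = gaussian_score_space([0.5, 1.5], mu=1.0, var=1.0)
```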

449 | Support vector machine learning for interdependent and structured output spaces.
- Tsochantaridis, Hofmann, et al.
- 2004
Citation Context: ...odels described above have been based on whole sentence models, where the generative model parameters λ^(ω), the prior P(ω), and the discriminative model featur... (In this presentation, in common with work on CRFs [5] and SSVMs [30], joint feature-spaces involving both features and labels will be used. Even when structure is introduced this requirement to handle variable length data is still necessary.)

321 | Cutting-plane training of structural SVMs.
- Joachims, Finley, et al.
- 2009
Citation Context: ...necessary to only use the one-best alignment. Initially this can be obtained from the compensated HMMs used to derive the features. The parameters can then be found using the cutting-plane algorithm [60], which has been found to be an efficient method for training these forms of model. This has been used to train models for speech recognition in [14]. The initial segmentation from the compensated HMM...

277 | From HMMs to segment models: A unified view of stochastic modeling for speech recognition.
- Ostendorf, Digalakis, et al.
- 1996

262 | Semi-tied covariance matrices for hidden Markov models.
- Gales
- 1999
Citation Context: ...transcription [1], to be addressed. Though a number of modifications to the acoustic models have been made, for example speaker adaptation [2], adaptive training [3] and semi-tied covariance matrices [4], the underlying model has remained a Hidden Markov Model (HMM) [5]. One of the major developments that has significantly improved the performance of ASR systems is the use of discriminative criteria...

254 | Semi-Markov conditional random fields for information extraction.
- Sarawagi, Cohen
- 2004
Citation Context: ...t can be argued that once the segmentation has been obtained it can be converted into a frame-label sequence that could then be used for CRF training. This is the form examined in the Semi-Markov CRF [38]. ...that of the generative model. The optimal segmentation for the discriminative model is given by â = argmax_a { P(a | O_{1:T}) P(w_{1:L} | O_{1:T}, a; α) } (12). Since this "best" segmentation is a function of the model...

250 | Minimum phone error and I-smoothing for improved discriminative training.
- Povey, Woodland
- 2002
Citation Context: ...scriminative criteria for training HMMs, rather than using the Maximum Likelihood (ML) criterion. A number of criteria, such as Maximum Mutual Information (MMI) [6], [7] and Minimum Phone Error (MPE) [8], [9], have been used to train the parameters of the HMM. Initially these criteria were applied to small vocabulary speech recognition tasks. A number of techniques were then developed to enable th...

228 | Discriminative Learning for Minimum Error Classification
- Juang, Katagiri
- 1992
Citation Context: ...s form of training criterion is used with discriminative models it is also known as Conditional Maximum Likelihood (CML) training. Minimum Classification Error (MCE) is a smooth measure of the error [19]. This is normally based on a smooth function of the difference between the log-likelihood of the correct sequence and all other competing word sequences: F_mce(λ) = (1/R) Σ_{r=1}^{R} [ 1 + ( p(O^(r...

215 | Learning structural SVMs with latent variables.
- Yu, Joachims
- 2009
Citation Context: ...s section has considered summing over all possible segmentations, a, of the data. Though it is possible to define recursions for this task [17], the resulting parameter estimation is no longer convex [37], and the decoding and training time can become slow depending on the exact nature of the feature-extraction process. Also as the optimisation approaches used to train discriminative model parameters...

211 | Weighted finite-state transducers in speech recognition.
- Mohri, Pereira, et al.
- 2002
Citation Context: ...specifies a phone/word/sub-unit identity for segment τ as a^i_τ, and a range of frames, O_{{a_τ}}. The same notation can be used for phone, HMM state, and Weighted Finite State Transducer (WFST) [32] arc sequences. a^i = {a^i_1, ..., a^i_{|a|}} is the sequence of segment identities. Thus P(a^i | w_{1:L}) is the pronunciation probability when the segmentation is associated with phones. An N-gram language mo...

208 | A Compact Model for Speaker-Adaptive Training.
- Anastasakos, McDonough, et al.
- 1996
Citation Context: ...(LVCSR) tasks, such as Broadcast News transcription [1], to be addressed. Though a number of modifications to the acoustic models have been made, for example speaker adaptation [2], adaptive training [3] and semi-tied covariance matrices [4], the underlying model has remained a Hidden Markov Model (HMM) [5]. One of the major developments that has significantly improved the performance of ASR systems...

180 | Discriminative Training for Large Vocabulary Speech Recognition (PhD thesis)
- Povey
- 2003
Citation Context: ...inative criteria for training HMMs, rather than using the Maximum Likelihood (ML) criterion. A number of criteria, such as Maximum Mutual Information (MMI) [6], [7] and Minimum Phone Error (MPE) [8], [9], have been used to train the parameters of the HMM. Initially these criteria were applied to small vocabulary speech recognition tasks. A number of techniques were then developed to enable their u...

171 | Dynamic Conditional Random Fields: Factorized Probabilistic Models for Labeling and Segmenting Sequence Data.
- Sutton, McCallum
- 2004

123 | Large scale discriminative training of hidden Markov models for speech recognition.
- Woodland, Povey
- 2002
Citation Context: ...ce of ASR systems is the use of discriminative criteria for training HMMs, rather than using the Maximum Likelihood (ML) criterion. A number of criteria, such as Maximum Mutual Information (MMI) [6], [7] and Minimum Phone Error (MPE) [8], [9], have been used to train the parameters of the HMM. Initially these criteria were applied to small vocabulary speech recognition tasks. A number of technique...

120 | Finding Consensus Among Words: Lattice-based Word Error Minimization.
- Mangu, Brill, et al.
- 1999
Citation Context: ...ith word sequence posteriors being produced using Bayes' rule. Note in recent years MBR decoding, associated normally with the word-level cost function, has become popular in speech recognition [25], [26], [27]. IV. MAXIMUM ENTROPY MARKOV MODELS. The DBN in figure 1 may be modified to produce a discriminative (or direct model) by reversing the direction of the arcs from the states to the observations a...

117 | An inequality for rational functions with applications to some statistical estimation problems.
- Gopalakrishnan, Kanevsky, et al.
- 1991
Citation Context: ...ormance of ASR systems is the use of discriminative criteria for training HMMs, rather than using the Maximum Likelihood (ML) criterion. A number of criteria, such as Maximum Mutual Information (MMI) [6], [7] and Minimum Phone Error (MPE) [8], [9], have been used to train the parameters of the HMM. Initially these criteria were applied to small vocabulary speech recognition tasks. A number of tech...

114 | Hidden conditional random fields for phone classification.
- Gunawardana, Mahajan, et al.
- 2005
Citation Context: ...CSR tasks, and tasks in challenging acoustic conditions, is still not satisfactory for many speech-enabled applications. This has led to interest in discriminative models for speech recognition [12], [13], [14], [15] where the posterior of the word-sequence given the observation is directly modelled. This paper briefly reviews HMMs, discriminative training criteria, and the current forms of discrimina...

110 | The HTK Book (Version 3.4)
- Young, Kershaw, et al.
- 2006
Citation Context: ...Training and inference can now be implemented in a similar fashion to the discriminative training implementation in HTK [40]. Initially a lattice is generated using the current model λ. This is then "model-marked", where time-stamps are added to the lattice at the model level; this may be either at the phone or word level. ...

93 | Marginalized kernels for biological sequences.
- Tsuda, Kin, et al.
- 2002
Citation Context: ...dopted with dynamic kernels to give a systematic way of extracting features from the sequences. A number of kernels have been proposed for handling sequence data, including marginalised count kernels [30], Fisher kernels [31], string kernels [32] and generative kernels [33]. An interesting class of these sequence kernels are based on generative models. Both Fisher kernels [31] and generative kernels [...

87 | Adaptation of maximum entropy capitalizer: Little data can help a lot.
- Chelba, Acero
- 2004
Citation Context: ...n contrast to the majority of adaptation approaches for generative models, which are based on maximum likelihood, discriminative model adaptation is usually based on conditional maximum likelihood. In [45], two approaches for adapting log-linear models (MAP adaptation and minimum divergence training) are discussed. These approaches yield a general adaptation scheme that makes no assumption about the...

83 | The Acoustic-Modeling Problem in Automatic Speech Recognition.
- Brown
- 1987
Citation Context: ...NING CRITERIA. For ML to be the "best" training criterion, the data and models are assumed to satisfy a number of requirements, for example the quantity of training data available and model-correctness [17]. These requirements are not satisfied when modelling speech data. This has led to the use of discriminative training criteria, which are more closely linked to minimising the error rate, rather than...

83 | Large margin hidden Markov models for automatic speech recognition.
- Sha, Saul
- 2007
Citation Context: ...re the likelihood is approximated using the best-state sequence. In most state-of-the-art ASR systems, the parameters of the distributions in Equation 3 are trained using discriminative criteria [24]–[26] (see section III-C) rather than maximizing the likelihood of the observations [1]. An alternative approach, discussed next, is to change the model to directly discriminate between sentences. ...

82 | Speech recognition using SVMs.
- Smith, Gales
- 2002
Citation Context: ...nd tasks in challenging acoustic conditions, is still not satisfactory for many speech-enabled applications. This has led to interest in discriminative models for speech recognition [12], [13], [14], [15] where the posterior of the word-sequence given the observation is directly modelled. This paper briefly reviews HMMs, discriminative training criteria, and the current forms of discriminative models...

78 | Graphical models and automatic speech recognition.
- Bilmes
- 2001
Citation Context: ...states left-to-right topology (left), and DBN (right). Note for the DBN the dependence of the state on the sentence has not been shown. Figure 1 shows the topology and Dynamic Bayesian Network (DBN) [22] associated with a typical HMM. The left diagram illustrates a standard phone topology, strictly left-to-right with three emitting states, the right diagram the DBN with conditional independence assum...

77 | Explicit word error minimization in N-best list rescoring.
- Stolcke, Konig, et al.
- 1997
Citation Context: ...del, with word sequence posteriors being produced using Bayes' rule. Note in recent years MBR decoding, associated normally with the word-level cost function, has become popular in speech recognition [25], [26], [27]. IV. MAXIMUM ENTROPY MARKOV MODELS. The DBN in figure 1 may be modified to produce a discriminative (or direct model) by reversing the direction of the arcs from the states to the observat...

77 | Discriminative language modeling with conditional random fields and the perceptron algorithm.
- Roark, Saraclar, et al.
- 2004
Citation Context: ...g log-linear models for language modelling has been an active research area for many years, for example see [43], [53]. These exponential models allow a very rich set of features, for example lexical [21], linguistic, and hierarchical features [54], [55], to be used. (Footnote: The form of score-space described here can also be related to information geometry and more general forms of generative model...)

70 | The concave-convex procedure (CCCP).
- Yuille, Rangarajan
- 2002
Citation Context: ...ood score-space this expression is related to inference for factorial HMMs [16]. This optimal segmentation can then be integrated into the overall training procedure using concave-convex optimisation [61]. Extending SSVMs to larger vocabulary tasks is nontrivial. The number of possible constraints to be satisfied can become very large, impacting both the computational load and memory requirements. Th...

65 | Speaker verification using sequence discriminant support vector machines.
- Wan, Renals
- 2005
Citation Context: ...n applied in the speech processing area to simple small vocabulary speech recognition tasks [15], LVCSR tasks by making use of the acoustic code-breaking framework [35], [36] and speaker verification [37], [38]. The kernel between two sequences, O^(1) and O^(2), has the form K(O^(1), O^(2); λ) = φ(O^(1); λ)^T G^{-1} φ(O^(2); λ) (23), where G defines the metric. An interesting aspect of these generati...
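The kernel in the context above (Eq. 23) is an inner product between score-space features under a metric G. A sketch assuming a diagonal G, a common practical simplification (the function name and inputs are illustrative):

```python
def generative_kernel(phi1, phi2, g_diag):
    """K(O1, O2; lambda) = phi(O1)^T G^{-1} phi(O2), with G diagonal
    so the inverse is an elementwise division."""
    return sum(a * b / g for a, b, g in zip(phi1, phi2, g_diag))

k = generative_kernel([1.0, 2.0], [3.0, 4.0], [1.0, 2.0])
```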

62 | Rational kernels: Theory and algorithms.
- Cortes, Haffner, et al.
- 2004
Citation Context: ...tion and word sequences vary over the training and test samples, a sequence kernel is required, a range of which can be described in the rational kernel (both discrete and continuous) framework [12], [49]. More generally, when sequence kernels are combined with feature-functions and static kernels the following form can be obtained (assuming segmentations a^(r) and a at the word level, a_i = w_i): k(O...

54 | The Application of Hidden Markov Models in Speech Recognition (Foundations and Trends in Signal Processing).
- Gales, Young
- 2008
Citation Context: ...speaker adaptation, discriminative training, and noise compensation [3]. Though current state-of-the-art systems yield satisfactory recognition rates in some domains, performance is generally not good enough for speech applications to become ubiquitous. In discriminative...

46 | A Decision Theoretic Formulation of the Training Problem in Speech Recognition and a Comparison of Training by Unconditional Versus Conditional Maximum Likelihood
- Nadas
- 1983
Citation Context: ...where the likelihood is approximated using the best-state sequence. In most state-of-the-art ASR systems, the parameters of the distributions in Equation 3 are trained using discriminative criteria [24]–[26] (see section III-C) rather than maximizing the likelihood of the observations [1]. An alternative approach, discussed next, is to change the model to directly discriminate between sentences. ...

45 | Investigations on error minimizing training criteria for discriminative training in automatic speech recognition.
- Macherey, Haferkamp, et al.
- 2005
Citation Context: ...base the loss function on the specific task for which the classifier is being built [21]. A comparison of the above criteria on the Wall Street Journal (WSJ) task and a general framework is given in [22]. Both MCE and MPE were found to outperform MMI on this task. In addition to the above criteria there has also been some work on estimating model parameters based on maximising the margin [23]. To ena...

42 | A segmental CRF approach to large vocabulary continuous speech recognition
- Zweig, Nguyen
Citation Context: ...approaches that have been applied to ASR which can be described within this framework: log-linear models [13]–[15], Structured Support Vector Machines (SSVMs) [16], HCRFs [9], Segmental CRFs (SCRFs) [17], Conditional Augmented Models (CAugs) [18], Maximum Entropy Markov Models (MEMMs) [19], Augmented CRFs (ACRFs) [20]. These models differ from each other in terms of the observation features considered...

38 | Discriminative syntactic language modeling for speech recognition.
- Collins, Roark, et al.
- 2005
Citation Context: ...as been an active research area for many years, for example see [43], [53]. These exponential models allow a very rich set of features, for example lexical [21], linguistic, and hierarchical features [54], [55], to be used. (Footnote: The form of score-space described here can also be related to information geometry and more general forms of generative model [18]. For discrete cases it has also been co...)

37 | Maximum entropy direct models for speech recognition.
- Kuo, Gao
- 2006
Citation Context: ...on LVCSR tasks, and tasks in challenging acoustic conditions, is still not satisfactory for many speech-enabled applications. This has led to interest in discriminative models for speech recognition [12], [13], [14], [15] where the posterior of the word-sequence given the observation is directly modelled. This paper briefly reviews HMMs, discriminative training criteria, and the current forms of disc...

36 | A novel loss function for the overall risk criterion based discriminative training of HMM models.
- Kaiser, Horvat, et al.
- 2000
Citation Context: ...= 1 − (1/R) Σ_{r=1}^{R} P(w^(r)_ref | O^(r); λ) (8). Minimum Bayes' Risk (MBR): rather than trying to model the correct distribution, as in the MMI criterion, the expected loss during inference is minimised [20], [21]: F_mbr(λ) = (1/R) Σ_{r=1}^{R} Σ_w P(w | O^(r); λ) L(w, w^(r)_ref) (9), where L(w, w^(r)_ref) is the loss function of word sequence w against the reference for sequence r, w^(r)_ref. There are a number of ...
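The MBR criterion in the context above (Eq. 9) is an expectation of a loss under the hypothesis posterior. A sketch over an n-best list with a toy per-word loss (both the list and the loss function are illustrative; real systems use lattices and Levenshtein alignment):

```python
def mbr_risk(nbest, w_ref, loss):
    """Expected loss sum_w P(w|O) * L(w, w_ref) over an n-best list
    of (hypothesis, posterior) pairs."""
    return sum(p * loss(w, w_ref) for w, p in nbest)

def word_loss(w, ref):
    # toy loss: positional word mismatches for equal-length sequences
    return sum(a != b for a, b in zip(w, ref))

risk = mbr_risk([(("the", "cat"), 0.6), (("a", "cat"), 0.4)],
                ("the", "cat"), word_loss)
```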

34 | Support vector machines for segmental minimum Bayes-risk decoding of continuous speech
- Venkataramani, Chakrabartty, et al.
- 2003
Citation Context: ...sequence data. This approach has been applied in the speech processing area to simple small vocabulary speech recognition tasks [15], LVCSR tasks by making use of the acoustic code-breaking framework [35], [36] and speaker verification [37], [38]. The kernel between two sequences, O^(1) and O^(2), has the form K(O^(1), O^(2); λ) = φ(O^(1); λ)^T G^{-1} φ(O^(2); λ) (23), where G defines the metric. An...

33 | Large Margin Hidden Markov Models for speech recognition.
- Jiang, Li, et al.
- 2006
Citation Context: ...iven in [22]. Both MCE and MPE were found to outperform MMI on this task. In addition to the above criteria there has also been some work on estimating model parameters based on maximising the margin [23]. To enable these discriminative training criteria to be successfully applied to LVCSR tasks, a number of techniques have been developed to improve generalisation. These include: acoustic de-weighting...

32 | Lattice-based discriminative training for large vocabulary speech recognition.
- Valtchev, Odell, et al.
- 1996
Citation Context: ...ks. In particular, schemes such as I-smoothing [8] and language model weakening [10] have been developed to improve generalisation and the use of lattices to compactly represent the denominator score [11]. Though large reductions in word error rate (WER) have been obtained on a range of tasks, the performance on LVCSR tasks, and tasks in challenging acoustic conditions, is still not satisfactory for m...

29 | Speech recognition using augmented conditional random fields
- Hifny, Renals
- 2009
Citation Context: ..., Structured Support Vector Machines (SSVMs) [16], HCRFs [9], Segmental CRFs (SCRFs) [17], Conditional Augmented Models (CAugs) [18], Maximum Entropy Markov Models (MEMMs) [19], Augmented CRFs (ACRFs) [20]. These models differ from each other in terms of the observation features considered, training criterion and how the latent variables are handled. In addition to models that directly map from the obs...

27 | Augmented statistical models for speech recognition
- Layton, Gales
- 2006
Citation Context: ...sks, and tasks in challenging acoustic conditions, is still not satisfactory for many speech-enabled applications. This has led to interest in discriminative models for speech recognition [12], [13], [14], [15] where the posterior of the word-sequence given the observation is directly modelled. This paper briefly reviews HMMs, discriminative training criteria, and the current forms of discriminative m...

27 | Task dependent loss functions in speech recognition: Application to named entity extraction.
- Goel, Byrne
- 1999
Citation Context: ...rd sequence posteriors being produced using Bayes' rule. Note in recent years MBR decoding, associated normally with the word-level cost function, has become popular in speech recognition [25], [26], [27]. IV. MAXIMUM ENTROPY MARKOV MODELS. The DBN in figure 1 may be modified to produce a discriminative (or direct model) by reversing the direction of the arcs from the states to the observations and usi...

26 | Shrinking Exponential Language Models.
- Chen
- 2009
Citation Context: ...of supra-segmental features are associated with the word (or phone) sequences. Applying log-linear models for language modelling has been an active research area for many years, for example see [43], [53]. These exponential models allow a very rich set of features, for example lexical [21], linguistic, and hierarchical features [54], [55], to be used. (Footnote: The form of score-space described here c...)

24 | Using Augmented Statistical Models and Score Spaces for Classification
- Smith
- 2003
Citation Context: ...3), where G defines the metric. An interesting aspect of these generative kernels is that estimating the decision boundary may be related to estimating the parameters of an Augmented Statistical Model [39], [36]. Though good performance has been obtained, it is non-trivial to apply this to tasks with large numbers of classes (without the use of schemes such as acoustic code-breaking). VII. CONDITIONAL...

24 | Discriminative classifiers with adaptive kernels for noise robust speech recognition
- Gales, Flego
- 2010
Citation Context: ...the model parameters, the features are modified to make them independent of the speaker or environment. This is simplest to do when the feature extraction process is based on generative models [15], [48]. This approach is discussed in more detail in section V-C. E. Kernel Representations. The discussion of the model parameters and feature-functions has so far assumed that there is an explicit represen...

22 | Discriminative n-gram language modeling (Computer Speech and Language).
- Roark, Saraclar, et al.
- 2006
Citation Context: ...ave been examined. Note for speech recognition the language model (or class prior), P(w), is not normally trained in conjunction with the acoustic model (though there has been some work in this area [18]). Typically the amount of text training data for the language model is far greater (orders of magnitude) than the available acoustic training data. Maximum Mutual Information (MMI): the following for...

22 | Exploiting generative models in discriminative classifiers
- Jaakkola, Haussler
- 1999
Citation Context: ...ernels to give a systematic way of extracting features from the sequences. A number of kernels have been proposed for handling sequence data, including marginalised count kernels [30], Fisher kernels [31], string kernels [32] and generative kernels [33]. An interesting class of these sequence kernels are based on generative models. Both Fisher kernels [31] and generative kernels [33] make use of gener...

22 | Efficient sampling and feature selection in whole sentence maximum entropy language models
- Chen, Rosenfeld
- 1999
Citation Context: ...sequences. This loss may be at the frame level or at a higher level, e.g. word or phone. One issue that can occur is that the normalisation term can be very expensive, or even intractable, to compute [43]. However for some criteria, the perceptron (17)... D. Adaptation. For generative models adaptation to a particular speaker or environment condition is an essential part of current speech recognition sy...

21 | Augmented Statistical Models for Classifying Sequence Data
- Layton
- 2006
Citation Context: ...es from this segment. It is possible to hypothesise a range of features that could be used. However it is more interesting to consider this process in the context of sequence kernels and score-spaces [18]. These sequence kernels map variable length sequences to a fixed length score-space in which the inner product can be computed. All the acoustic feature extraction schemes for feature extraction sati...

20 | Interdependence of language models and discriminative training
- Schluter, Muller, et al.
- 1999
Citation Context ..., available at http://htk.eng.cam.ac.uk/, supports many of the current state-of-the-art techniques used in ASR. ... LVCSR tasks. In particular, schemes such as I-smoothing [8] and language model weakening [10] have been developed to improve generalisation and the use of lattices to compactly represent the denominator score [11]. Though large reductions in word error rate (WER) have been obtained on a range...

19 | Structured log linear models for noise robust speech recognition
- Zhang, Ragni, et al.
- 2010
Citation Context ...m of L1 or L2 regularisation, are introduced [17], [20], [44]. When combined with maximum margin training these regularisation terms result in discriminative models closely related to structured SVMs [14]. Furthermore, for some feature-functions one can introduce a more informative prior on the discriminative model parameters by using non-zero mean priors for α [16]. Conditional Maximum Likelihood [24]...

18 | Conditional random fields for integrating local discriminative classifiers.
- Morris, Fosler-Lussier
- 2008
Citation Context ...contain an implicit segmentation; for example, a best-path frame labeling posterior with multiple labels per segment can give rise to a segmentation by collapsing repeated instances of labels together [7]. Single class labels can also be obtained using SVMs and sequence kernels [8]. However this paper will focus on the situations where there are sequences of labels associated with the observations. SP...
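The label-collapsing construction described in the context above can be sketched in a few lines; `collapse_labels` and the example phone labels are illustrative assumptions, not code from the cited paper:

```python
def collapse_labels(frame_labels):
    """Collapse runs of repeated per-frame labels into (label, start, end)
    segments, yielding an implicit segmentation of the observation sequence."""
    segments = []
    start = 0
    for t in range(1, len(frame_labels) + 1):
        # A segment ends when the label changes or the sequence ends.
        if t == len(frame_labels) or frame_labels[t] != frame_labels[start]:
            segments.append((frame_labels[start], start, t))
            start = t
    return segments

# Example: a best-path frame labelling with repeated labels per segment.
print(collapse_labels(["sil", "sil", "k", "k", "k", "ae", "t", "t"]))
# → [('sil', 0, 2), ('k', 2, 5), ('ae', 5, 6), ('t', 6, 8)]
```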

16 | Buried Markov Models: A Graphical-Modeling Approach to Automatic Speech Recognition, Computer Speech and Language, 17
- Bilmes
- 2003
Citation Context ...e of dependencies. The easiest approach is to hypothesise possible dependencies and then select the dependencies that improve discrimination most. This is the approach adopted in Buried Markov Models [29]. One interesting aspect of handling speech data is that, since sequences of observations are being classified, the space of possible dependencies is very large making the choice of an appropriate hyp...

14 | String kernels, Fisher kernels and finite state automata
- Saunders, Shawe-Taylor, et al.
- 2003
Citation Context ...ematic way of extracting features from the sequences. A number of kernels have been proposed for handling sequence data, including marginalised count kernels [30], Fisher kernels [31], string kernels [32] and generative kernels [33]. An interesting class of these sequence kernels are based on generative models. Both Fisher kernels [31] and generative kernels [33] make use of generative models to map t...

13 | Maximum margin training of generative kernels
- Layton
- 2004
Citation Context ...tures from the sequences. A number of kernels have been proposed for handling sequence data, including marginalised count kernels [30], Fisher kernels [31], string kernels [32] and generative kernels [33]. An interesting class of these sequence kernels are based on generative models. Both Fisher kernels [31] and generative kernels [33] make use of generative models to map the variable length sequences...

13 | Investigations on features for log-linear acoustic models in continuous speech recognition
- Wiesler, Nußbaum-Thom, et al.
- 2009
Citation Context ...ISSUE ON FUNDAMENTAL TECHNOLOGIES IN MODERN SPEECH RECOGNITION ... models. There are a number of approaches that have been applied to ASR which can be described within this framework: log-linear models [13]–[15], Structured Support Vector Machines (SSVMs) [16], HCRFs [9], Segmental CRFs (SCRFs) [17], Conditional Augmented Models (CAugs) [18], Maximum Entropy Markov Models (MEMMs) [19], Augmented CRFs (AC...

13 | Derivative kernels for noise robust ASR
- Ragni, Gales
- 2011
Citation Context ... models. There are a number of approaches that have been applied to ASR which can be described within this framework: log-linear models [13]–[15], Structured Support Vector Machines (SSVMs) [16], HCRFs [9], Segmental CRFs (SCRFs) [17], Conditional Augmented Models (CAugs) [18], Maximum Entropy Markov Models (MEMMs) [19], Augmented CRFs (ACRFs) ...

13 | On the equivalence of Gaussian HMM and Gaussian HMM-like hidden conditional random fields
- Heigold, Schlüter, et al.
- 2007
Citation Context ... or states, it will still generate T vectors for a sequence of T observations. For a particular form of feature function, see (24), HCRFs can be shown to be equivalent to discriminative training of HMMs [34]. Segmental feature-functions in models such as Conditional Augmented Models (CAugs) [18], and Segmental CRFs (SCRFs) [17], [35] can allow observations across a segment to contribute to the function (s...

12 | Hidden conditional random fields for phone recognition
- Sung, Jurafsky
- 2009
Citation Context ...e is an implied assumption that the number of labels and observations are the same.¹ To address this problem it is possible to introduce latent variables into CRFs, yielding Hidden CRFs (HCRFs) [9], [10], and make use of sequence kernels and score-spaces [11], [12]. Models that handle this type of data will be referred to as structured discriminative ... ¹ CRFs (and related approaches) can be applied to ...

12 | Acoustic modeling using continuous rational kernels
- Layton, Gales
- 2005
Citation Context ...vations are the same.¹ To address this problem it is possible to introduce latent variables into CRFs, yielding Hidden CRFs (HCRFs) [9], [10], and make use of sequence kernels and score-spaces [11], [12]. Models that handle this type of data will be referred to as structured discriminative ... ¹ CRFs (and related approaches) can be applied to ASR by using labels that contain an implicit segmentation; for...

12 | Speech Recognition with Segmental Conditional Random Fields: A ...
- Zweig, Nguyen
- 2011
Citation Context ...24), HCRFs can be shown to be equivalent to discriminative training of HMMs [34]. Segmental feature-functions in models such as Conditional Augmented Models (CAugs) [18], and Segmental CRFs (SCRFs) [17], [35] can allow observations across a segment to contribute to the function (similar to generative segmental HMMs [36]); the feature functions relate to the segmentation of the observations O{a_τ}: P(w_{1:L}|O...

11 | Minimum Bayes risk estimation and decoding in large vocabulary continuous speech recognition
- Byrne
- 2006
Citation Context ... (1/R) Σ_{r=1}^{R} P(w_ref^(r) | O^(r); λ) (8). Minimum Bayes' Risk (MBR): rather than trying to model the correct distribution, as in the MMI criterion, the expected loss during inference is minimised [20], [21]: F_mbr(λ) = (1/R) Σ_{r=1}^{R} Σ_w P(w | O^(r); λ) L(w, w_ref^(r)) (9), where L(w, w_ref^(r)) is the loss function of word sequence w against the reference for sequence r, w_ref^(r). There are a number of loss f...
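The MBR risk in equation (9) of the context above is straightforward to compute once a hypothesis set and posteriors are available. A toy sketch, assuming a word-level Levenshtein distance as the loss L(w, w_ref) and a hand-picked three-hypothesis set (not data from the cited work):

```python
def levenshtein(hyp, ref):
    """Word-level edit distance, a common choice of loss L(w, w_ref)."""
    d = [[i + j if i * j == 0 else 0 for j in range(len(ref) + 1)]
         for i in range(len(hyp) + 1)]
    for i in range(1, len(hyp) + 1):
        for j in range(1, len(ref) + 1):
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1,
                          d[i - 1][j - 1] + (hyp[i - 1] != ref[j - 1]))
    return d[len(hyp)][len(ref)]

def mbr_risk(posteriors, ref):
    """Expected loss Σ_w P(w|O;λ) L(w, w_ref) over a hypothesis set."""
    return sum(p * levenshtein(w, ref) for w, p in posteriors)

# Hypothetical posteriors P(w|O;λ) over three candidate word sequences.
hyps = [(("the", "cat", "sat"), 0.7),
        (("the", "cat", "sat", "down"), 0.2),
        (("a", "cat", "sat"), 0.1)]
assert abs(mbr_risk(hyps, ("the", "cat", "sat")) - 0.3) < 1e-9  # 0.2*1 + 0.1*1
```

In practice the sum over w runs over a lattice or N-best list rather than an explicit set.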

11 | Large vocabulary continuous speech recognition based on WFST structured classifiers and deep bottleneck features
- Kubo, Hori, et al.
Citation Context ... exp( α^T [ φ(a_i, w) ; φ(w) ] ) (32). Here the feature-function for the observations is a single element, the log-likelihood from the HMM. The N-gram language model log-probability can also be added, e.g. [56]. The model parameters for these two elements are sometimes fixed and not updated. This is the basis of discriminative language models in section V-A. A summary of the features described here can be f...

8 | Regularization, adaptation, and non-independent features improve hidden conditional random fields for phone classification
- Sung, Boulis, et al.
Citation Context ... Σ_{q∈Q_a} exp( α^T [ Σ_{τ=1}^{|a|} Σ_{t∈{a_τ}} φ(o_t, q_t, a_τ^i) ; Σ_{τ=1}^{|a|} Σ_{t∈{a_τ}} φ(q_t, q_{t−1}, a_τ^i) ] ), where Q_a is the set of all state sequences for which |q| = T and which satisfy the segmentation defined by a [33]. If the segmentation of the data is at the word-level then a_τ^i = w_τ. As there are latent variables (states) in an HCRF it is possible to associate these states with particular words in the feature...

8 | Maximum conditional likelihood linear regression and maximum a posteriori for hidden conditional random fields speaker adaptation,” in ICASSP,
- Sung, Boulis, et al.
- 2008
Citation Context ...roaches can be used for discriminative models, they do not take advantage of any structure in the features. Alternatively, linear-transformation-based approaches for log-linear models are described in [46], [47]. These schemes use approaches similar to the linear transformations for HMMs. Assumptions are made about the relationships between features. To date they have only been applied to models where ...

8 | Learning a Discriminative Weighted Finite-State Transducer for Speech Recognition
- Lehr, Shafran
Citation Context ...arge training corpora and models are sparse feature representation and convex optimization. As DLMs typically use discrete features, e.g., long-context N-gram word/Part-Of-Speech (POS) counts [21], [57], [58], the representation is usually sparse. Furthermore, as there are no latent variables associated with the DLM (or a single segmentation/latent variable value used), it is a convex optimisation p...

7 | Discriminative adaptation for speaker verification
- Longworth, Gales
- 2006
Citation Context ...ied in the speech processing area to simple small vocabulary speech recognition tasks [15], LVCSR tasks by making use of the acoustic code-breaking framework [35], [36] and speaker verification [37], [38]. The kernel between two sequences, O^(1) and O^(2), has the form K(O^(1), O^(2); λ) = φ(O^(1); λ)^T G^{-1} φ(O^(2); λ) (23), where G defines the metric. An interesting aspect of these generative ker...
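The generative kernel in equation (23) of the context above reduces to a metric-weighted inner product once the score-space features φ(O; λ) have been extracted. A minimal sketch with hypothetical 3-dimensional score vectors and an identity metric G:

```python
import numpy as np

def generative_kernel(phi1, phi2, G):
    """K(O1, O2; λ) = φ(O1; λ)^T G^{-1} φ(O2; λ), as in eq (23).
    Solving a linear system avoids explicitly forming G^{-1}."""
    return float(phi1 @ np.linalg.solve(G, phi2))

# Hypothetical score-space features for two observation sequences.
phi1 = np.array([1.0, -0.5, 2.0])
phi2 = np.array([0.5, 1.0, -1.0])
G = np.eye(3)  # identity metric: the kernel reduces to a plain inner product
print(generative_kernel(phi1, phi2, G))  # → -2.0
```

In practice G might be an (approximate) Fisher information or empirical covariance matrix rather than the identity.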

7 | Round-robin duel discriminative language models
- Oba, Hori, et al.
- 2012
Citation Context ... [flattened fragment of TABLE I, SUMMARY OF FEATURE FUNCTIONS IN COMMON USE: feature functions such as δ(a_i, v_j)P(v|o_t) and δ(a_i, v_1)φ(O{a_i}), paired with example papers [9], [13], [34]; [7], [19], [20], [35]; [18], [44], [51], [52]; [17], [35]; [17], [21], [43], [56]–[58]] V. EXAMPLE APPLICATIONS A. Discriminative LMs and WFSTs As discussed in section IV-C, it is possible to use structured discriminative modelling appr...

6 | Speech recognition with flat direct models
- Nguyen, Heigold, et al.
- 2010
Citation Context ...t to handle variable length data is still necessary. ... discriminative model feature-functions, φ(O_{1:T}, ω), are based on the sentence label ω. For some tasks, predicting the whole sentence ω is reasonable [31], but as the vocabulary size and number of possible sentences increases, this approach becomes impractical. To address this issue structure can be introduced into the statistical model, where the se...

6 | Syntactic and Sub-Lexical Features for Turkish Discriminative Language Models
- Arisoy, Saraclar, et al.
- 2010
Citation Context ...n an active research area for many years, for example see [43], [53]. These exponential models allow a very rich set of features, for example lexical [21], linguistic, and hierarchical features [54], [55], to be used. ... (Footnote 7: The form of score-space described here can also be related to information geometry and more general forms of generative model [18]. For discrete cases it has also been connecte...)

5 | Training augmented models using SVMs, IEICE Special Issue on Statistical Modelling for Speech Recognition
- Gales, Layton
- 2006
Citation Context ...ce data. This approach has been applied in the speech processing area to simple small vocabulary speech recognition tasks [15], LVCSR tasks by making use of the acoustic code-breaking framework [35], [36] and speaker verification [37], [38]. The kernel between two sequences, O^(1) and O^(2), has the form K(O^(1), O^(2); λ) = φ(O^(1); λ)^T G^{-1} φ(O^(2); λ) (23), where G defines the metric. An intere...

5 | A maximum entropy approach to natural language processing
- Berger, Pietra, et al.
- 1996
Citation Context ... B. Discriminative Models: Discriminative models directly model the sentence (class) posterior given the observation sequence [27]. One fairly broad class is the maximum entropy (maxent) model [28], also known as a log-linear model. Here P(ω|O_{1:T}; α) = (1/Z) exp(α^T φ(O_{1:T}, ω)) (4), where Z is the normalisation term to ensure a valid probability mass function over all sentences, and α the discrim...
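The maxent model in equation (4) of the context above can be made concrete for a toy candidate set, where the normalisation term Z is computed by explicit enumeration. This is feasible only for tiny hypothesis sets; the feature vectors and weights here are invented for illustration:

```python
import math

def maxent_posterior(alpha, feats, candidates):
    """P(ω|O; α) = exp(α^T φ(O, ω)) / Z, as in eq (4), with Z summed over an
    explicit (toy) candidate set; real systems need lattices or sampling."""
    scores = {w: math.exp(sum(a * f for a, f in zip(alpha, feats[w])))
              for w in candidates}
    Z = sum(scores.values())
    return {w: s / Z for w, s in scores.items()}

# Hypothetical feature vectors φ(O, ω) for three candidate sentences.
feats = {"the cat": [1.0, 0.0], "the bat": [0.5, 0.5], "a cat": [0.0, 1.0]}
post = maxent_posterior([2.0, -1.0], feats, feats)
assert abs(sum(post.values()) - 1.0) < 1e-12  # a valid probability mass function
print(max(post, key=post.get))  # → the cat
```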

5 | Feature Selection for Log-Linear Acoustic Models
- Wiesler, Richard, et al.
- 2011
Citation Context ... issues as the feature-function can result in a very high-dimensional feature-space. To address this, regularisation terms, normally in the form of L1 or L2 regularisation, are introduced [17], [20], [44]. When combined with maximum margin training these regularisation terms result in discriminative models closely related to structured SVMs [14]. Furthermore for some feature-functions one can introduc...

4 | CTS decoding improvements at ...
- Saon, Povey, et al.
- 2003
Citation Context ...isation “robust” parameter priors may be used when estimating the models. These priors may either be based on the ML parameter estimates [8] or, for example when using MPE training, the MMI estimates [24]. For MPE this was found to be essential to achieve performance gains [8]. In both MMI and MPE (and for the ϱ = 1 MCE) the optimisation criterion is a function of the word sequence posterior. Thus the cr...

4 | Extending noise robust structured support vector machines to larger vocabulary tasks - Zhang, Gales

4 | Discriminative adaptation for log-linear acoustic models.
- Loof, Schluter, et al.
- 2010
Citation Context ...s can be used for discriminative models, they do not take advantage of any structure in the features. Alternatively, linear-transformation-based approaches for log-linear models are described in [46], [47]. These schemes use approaches similar to the linear transformations for HMMs. Assumptions are made about the relationships between features. To date they have only been applied to models where the fe...

8 | Kernel Methods for Text-Independent Speaker Verification
- Longworth
- 2010
Citation Context ...nd k_st(·,·) is the static kernel. Here the score-space is the feature-space associated with the sequence kernel. This form of kernel combination has previously been discussed for speaker verification [50]. IV. MODEL FEATURES The previous section has assumed the existence of an appropriate feature-function: the selection of this function is central to the performance of these classifiers. Features can ...

3 | Augmented Statistical Models for Classifying Sequence Data
- Layton
- 2006
Citation Context ... or word level. In training, statistics are then accumulated given these fixed segment boundaries. In inference, the best path is found given these fixed boundaries. This is discussed in more detail in [41]. VIII. PRELIMINARY CAUG EXPERIMENTS This section presents some preliminary experimental results on the TIMIT classification task taken from [14]. The experimental setup described in [13] was used. Mo...

2 | Efficient decoding with continuous rational kernels using the expectation semiring
- Dalen, Ragni, et al.
- 2012
Citation Context ...or features [44]. An interesting aspect of using structured generative models in this fashion is that feature-extraction can be made efficient using an expectation semiring within the WFST framework [52]. Similar in spirit to the score-space paradigm are other methods that utilize detections of longer-term acoustic events. In [17], a baseline HMM system hypothesizes linguistic units, which are then e...

2 | Integrating meta-information into exemplar-based speech recognition with segmental conditional random fields
- Demuynck, Seppi, et al.
Citation Context ...and w_i is direct (assuming word segmentation a_i = w_i): the feature fires if O{a_i} is a valid representation of w_i. It is also possible to integrate features derived from exemplar-based systems. In [59] features derived from a k-NN template list are used to derive a range of features based on a DTW match, including common word positions and counts and average template duration (warping factor). D...