## An overview of discriminative training for speech recognition

Citations: 7 (0 self)

### BibTeX

@TECHREPORT{Vertanen_anoverview,
  author      = {Keith Vertanen},
  title       = {An overview of discriminative training for speech recognition},
  institution = {},
  year        = {}
}

### Abstract

This paper gives an overview of discriminative training as it pertains to the speech recognition problem. The basic theory of discriminative training will be discussed and an explanation of maximum mutual information (MMI) given. Common problems inherent to discriminative training will be explored as well as practicalities associated with implementing discriminative training for large vocabulary recognition. Alternatives to the MMI objective function such as minimum word error (MWE) and minimum phone error (MPE) will be discussed. The application of discriminative techniques for adaptation will be described. Finally, possible future avenues of research will be given.
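For reference, the MMI objective discussed in the abstract is commonly written as follows (a standard formulation; the paper's exact notation may differ):

```latex
% MMI objective: log posterior of the correct transcription w_r
% given acoustics O_r, summed over the R training utterances.
F_{\mathrm{MMI}}(\lambda) = \sum_{r=1}^{R} \log
  \frac{p_\lambda(O_r \mid \mathcal{M}_{w_r})\, P(w_r)}
       {\sum_{\hat{w}} p_\lambda(O_r \mid \mathcal{M}_{\hat{w}})\, P(\hat{w})}
```

Maximizing the numerator alone is maximum likelihood estimation; the denominator sums over all competing hypotheses, so MMI raises the probability of the correct transcription relative to its competitors.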

### Citations

592 | Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models
- Leggetter, Woodland
- 1995

Citation Context: ...t recognizers is the use of linear transforms to adapt speaker independent models to better approximate a particular speaker. The most popular technique is maximum likelihood linear regression (MLLR) [24]. As the name implies, MLLR uses the maximum likelihood principle and has been found to yield significant performance improvements. One immediate concern is whether the gains made by MLLR on MLE train...
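As a sketch of the technique the context describes: MLLR adapts each Gaussian mean with a shared affine transform estimated by maximum likelihood (standard formulation, not taken verbatim from the paper):

```latex
% MLLR mean adaptation: a shared transform (A, b) maps the
% speaker-independent mean to the speaker-adapted mean.
\hat{\mu} = A\mu + b = W\xi, \qquad \xi = [1,\ \mu^{\top}]^{\top}
```

Because one transform W is shared across many Gaussians, it can be estimated robustly from small amounts of adaptation data.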

179 | Minimum phone error and I-smoothing for improved discriminative training
- Povey, Woodland
- 2002

Citation Context: ...g it less critical exactly when training is stopped. 4.5. I-smoothing As the H-criterion failed to improve accuracy for large amounts of training data, a variant technique, I-smoothing, was proposed in [20]. I-smoothing increases the weight of the MLE counts depending on the amount of data available for each Gaussian. This is done by multiplying the numerator terms γ^num, θ^num(O), and θ^num(O²...
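The scaling the context describes can be written compactly. Assuming the usual parameterisation with a smoothing constant τ, each Gaussian's numerator statistics are scaled as:

```latex
% I-smoothing: boost the numerator (MLE-like) statistics of
% Gaussian (j,m); the boost is larger when the occupancy is small.
\left\{\gamma^{\mathrm{num}}_{jm},\ \theta^{\mathrm{num}}_{jm}(O),\ \theta^{\mathrm{num}}_{jm}(O^{2})\right\}
\;\times\;
\frac{\gamma^{\mathrm{num}}_{jm} + \tau}{\gamma^{\mathrm{num}}_{jm}}
```

Gaussians with little data are thus pulled toward their MLE estimates, while well-trained Gaussians are barely affected.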

145 | A compact model for speaker-adaptive training
- Anastasakos
- 1996

Citation Context: ...aker independent recognizers are trained is usually collected from a wide range of speakers. The inter-speaker variability is a possible source of recognizer error. In speaker adaptive training (SAT) [27], these differences are minimized by the application of transforms calculated on the training speakers. In discriminative speaker adaptive training (DSAT), similar to the speaker adaptation case, DLTs...

129 | Minimum classification error rate methods for speech recognition
- Juang, Chou, et al.
- 1997

Citation Context: ...errors made by the recognizer on the training set. The minimum classification error (MCE) objective function is designed to minimize these errors and has been shown to outperform MMIE on small tasks [21]. However, using MCE on large vocabulary tasks is problematic for long sentences, and it also cannot easily be implemented on lattices. In [20] the alternatives minimum word error (MWE) and minimum phone er...

99 | Large Scale Discriminative Training of Hidden Markov Models for Speech Recognition
- Woodland, Povey

Citation Context: ...overall contribution made by mixture weights and transition probabilities is highly dependent on what type of sharing (if any) is being done between states or Gaussian mixture models. As discussed in [10], updating mixture weights and transition probabilities for a decision-tree tied-state mixture Gaussian HMM caused only a small increase in performance. Other HMM systems based on tied mixture models ...

97 | An inequality for rational functions with applications to some statistical estimation problems
- Gopalakrishnan, Kanevsky, et al.
- 1991

Citation Context: ...criminative training cannot be optimized using the conventional Baum-Welch algorithm. The only known methods that converge for MMIE are steepest gradient descent and the extended Baum-Welch algorithm [4]. Given the high dimensionality of the parameter space, gradient descent may require a large number of iterations to obtain an optimal solution [3]. Thus extended Baum-Welch is the predominant algorit...
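For context, the extended Baum-Welch mean and variance updates take the following standard form (diagonal-covariance case; D_jm is a per-Gaussian smoothing constant chosen large enough to keep variances positive):

```latex
\hat{\mu}_{jm} =
  \frac{\theta^{\mathrm{num}}_{jm}(O) - \theta^{\mathrm{den}}_{jm}(O) + D_{jm}\,\mu_{jm}}
       {\gamma^{\mathrm{num}}_{jm} - \gamma^{\mathrm{den}}_{jm} + D_{jm}}
\qquad
\hat{\sigma}^{2}_{jm} =
  \frac{\theta^{\mathrm{num}}_{jm}(O^{2}) - \theta^{\mathrm{den}}_{jm}(O^{2})
        + D_{jm}\!\left(\sigma^{2}_{jm} + \mu^{2}_{jm}\right)}
       {\gamma^{\mathrm{num}}_{jm} - \gamma^{\mathrm{den}}_{jm} + D_{jm}}
  - \hat{\mu}^{2}_{jm}
```

The numerator statistics come from the correct transcription, the denominator statistics from the competing-hypothesis lattice; D_jm interpolates the update toward the current parameters.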

71 | The Acoustic-Modelling Problem in Automatic Speech Recognition, unpublished Ph.D. thesis
- Brown
- 1987

71 | Large scale discriminative training for speech recognition
- Woodland, Povey
- 2000

Citation Context: ...word sequences for a given model with a certain level of pruning. [Figure 1: An example word lattice. The path in bold was the correct transcription of the utterance.] Lattice-Based Training In [19], an explanation of the use of lattices for computing the denominator statistics is given. First, word lattices are constructed using the initial numerator and denominator MLE-trained HMMs for each of...
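The denominator statistics the context mentions are typically obtained by running a forward-backward pass over each word lattice to get per-arc posterior occupancies. The sketch below is illustrative only (the arc representation and scores are assumptions, not the paper's implementation): nodes are integers in topological order and each arc carries a combined acoustic-plus-language-model log score.

```python
import math
from collections import defaultdict

def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    if m == -math.inf:
        return -math.inf
    return m + math.log(sum(math.exp(v - m) for v in values))

def arc_posteriors(arcs, start, end):
    """Posterior occupancy of each lattice arc via forward-backward.

    arcs: list of (src, dst, log_score) tuples; node ids are integers
    in topological order (src < dst). Returns (posteriors, total),
    where posteriors[i] is the posterior of arcs[i] and total is the
    log score of all paths from start to end.
    """
    # Forward pass: alpha[n] = log total score of paths start -> n.
    alpha = defaultdict(lambda: -math.inf)
    alpha[start] = 0.0
    for src, dst, score in sorted(arcs, key=lambda a: a[0]):
        alpha[dst] = logsumexp([alpha[dst], alpha[src] + score])

    # Backward pass: beta[n] = log total score of paths n -> end.
    beta = defaultdict(lambda: -math.inf)
    beta[end] = 0.0
    for src, dst, score in sorted(arcs, key=lambda a: -a[1]):
        beta[src] = logsumexp([beta[src], score + beta[dst]])

    total = alpha[end]
    # Arc posterior = exp(alpha[src] + score + beta[dst] - total).
    return [math.exp(a + s + beta[d] - total)
            for (src, d, s), a in ((arc, alpha[arc[0]]) for arc in arcs)], total
```

In MMIE training these arc posteriors would be accumulated into the denominator counts γ^den and θ^den(O) for the Gaussians aligned to each arc.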

43 | An Improved MMIE Training Algorithm for Speaker-Independent Small Vocabulary, Continuous Speech Recognition
- Normandin, Morgera
- 1991

Citation Context: ...onverge to an optimal solution while MLE will not [2, 12]. However, it is also possible to construct problems in which neither MMIE nor MLE will converge to an optimal solution [13]. As pointed out in [5], MMIE’s robustness to model incorrectness on toy problems does not necessarily indicate it will be robust for real problems. MMIE’s utility relies on how well it performs in practice. 2.3. Problems w...

40 | Hidden Markov Models, Maximum Mutual Information Estimation and the Speech Recognition Problem
- Normandin
- 1991

Citation Context: ...milar manner. Unfortunately, this results in a formula that is extremely sensitive to small-valued parameters. A more robust approximation was given in [9] and has been shown to work well in practice [7, 8]. This approximation also has a drawback in that it can lead to instability as training proceeds, as the objective function may decrease as it approaches its maximum [8]. In [6], an update formula is p...

39 | MMI training for continuous phoneme recognition on the TIMIT database
- Kapadia, Valtchev, et al.
- 1993

Citation Context: ...ariance matrices. The formula can be extended to full covariance matrices as shown in [1]. Experimentally, full covariance matrices have been shown to improve results on continuous phoneme recognition [7]. The author is not aware of any recent experimental results on the use of full covariance matrices for large vocabulary tasks. This is probably due to an unacceptably large increase in parameters req...

27 | On a Model-Robust Training Method for Speech Recognition
- Nadas, Nahamoo, et al.
- 1988

Citation Context: ...ypotheses less probable. For certain simple types of estimation problems, it has been shown that given incorrect modeling assumptions, MMIE will converge to an optimal solution while MLE will not [2, 12]. However, it is also possible to construct problems in which neither MMIE nor MLE will converge to an optimal solution [13]. As pointed out in [5], MMIE’s robustness to model incorrectness on toy prob...

24 | Maximum mutual information estimation of HMM parameters for continuous speech recognition using the N-best algorithm
- Chow
- 1990

Citation Context: ...lows the approximate denominator occupancies to be cheaply computed. One possibility is to use N-best lists which are calculated once and used to constrain the search during subsequent MMIE iterations [14]. However, N-best lists are not good at representing the many possibilities that can be present in large vocabulary tasks. Lattice structures are a better candidate as they are able to succinctly repr...

23 | A Generalization of the Baum Algorithm to Rational Objective Functions
- Gopalakrishnan, Kanevsky, et al.
- 1989

Citation Context: ...assumptions, MMIE will converge to an optimal solution while MLE will not [2, 12]. However, it is also possible to construct problems in which neither MMIE nor MLE will converge to an optimal solution [13]. As pointed out in [5], MMIE’s robustness to model incorrectness on toy problems does not necessarily indicate it will be robust for real problems. MMIE’s utility relies on how well it performs in pr...

23 | Discriminative training of language models for speech recognition
- Kuo, Fosler-Lussier, et al.
- 2002

Citation Context: ...might benefit from additional HMM states. • Discriminative training for language models: There has been little investigation into the application of discriminative techniques to language modeling. In [29], modest gains were shown using a very small amount of training data. It would be interesting to know if similar techniques could provide gains for state-of-the-art language models trained on many meg...

20 | Discriminative Training of Hidden Markov Models
- Kapadia
- 1998

Citation Context: ...ing to the transcription w_r of observation O_r. The popularity of MLE is due to its ability to produce accurate systems that can be quickly trained using the globally convergent Baum-Welch algorithm [1]. MLE also offers the theoretical advantage that if certain modeling assumptions hold, no other training criterion will do better; MLE is a minimum variance, consistent estimator of the true model para...

19 | Discriminative Methods in HMM-Based Speech Recognition
- Valtchev
- 1995

Citation Context: ...adient descent and the extended Baum-Welch algorithm [4]. Given the high dimensionality of the parameter space, gradient descent may require a large number of iterations to obtain an optimal solution [3]. Thus extended Baum-Welch is the predominant algorithm used for parameter re-estimation in discriminative training. The details of extended Baum-Welch will be discussed in the next section. The expen...

19 | MMIE Training for Large Vocabulary Continuous Speech Recognition
- Normandin, Lacouture, et al.
- 1994

Citation Context: ...vocabulary tasks. Lattice structures are a better candidate as they are able to succinctly represent the many hypotheses. While other types of lattices such as looped lattices have been investigated [15], we will focus on the word-based lattice. A word lattice (see Figure 1) consists of a number of nodes placed along the time axis of a particular training utterance. Arcs between nodes correspond to t...

18 | Phonetic recognition using hidden Markov models and maximum mutual information training
- Merialdo
- 1988

Citation Context: ...mula for mixture weights can be obtained in a similar manner. Unfortunately, this results in a formula that is extremely sensitive to small-valued parameters. A more robust approximation was given in [9] and has been shown to work well in practice [7, 8]. This approximation also has a drawback in that it can lead to instability as training proceeds, as the objective function may decrease as it approac...

17 | Discriminative linear transforms for speaker adaptation
- Uebel, Woodland
- 2001

Citation Context: ...5], but results for MMI-MLLR were noticeably absent. A discriminative linear transform (DLT) for adaptation can be arrived at by interpolating the ML and MMI objective functions using the H-criterion [26]. While the iterative formula given is not guaranteed to increase with each step, with a properly chosen H-criterion constant it works well in practice. For the task of recognizing non-native speaker...
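The interpolation the context describes is often written as follows; this is one common parameterisation, and the exact constant and sign conventions in the cited work may differ:

```latex
% H-criterion: interpolate the MMI and ML objectives, with a
% constant h >= 0 controlling the amount of ML smoothing.
F_{H}(\lambda) = F_{\mathrm{MMI}}(\lambda) + h\, F_{\mathrm{ML}}(\lambda)
```

Setting h = 0 recovers pure MMI training, while large h pulls the estimate back toward maximum likelihood.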

16 | Optimal Splitting of HMM Gaussian Mixture Components with MMIE Training
- Normandin
- 1995

Citation Context: ...one using MMIE. The derivative of the MMI objective function with respect to the mixture component weights can be used as an indicator of how much a particular Gaussian would benefit from being split [11]. Using this derivative, an algorithm can be devised which iteratively splits the top components which would most benefit from the discrimination that an additional Gaussian might provide. For a conne...
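Under the standard MMI count notation, the derivative the context refers to reduces (up to the usual occupancy approximations) to a difference of numerator and denominator occupancies; this is a sketch of the criterion, not the cited paper's exact formula. A component with a large positive value is a promising split candidate:

```latex
\frac{\partial F_{\mathrm{MMI}}}{\partial c_{jm}}
\;\approx\;
\frac{\gamma^{\mathrm{num}}_{jm} - \gamma^{\mathrm{den}}_{jm}}{c_{jm}}
```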

16 | Interdependence of language models and discriminative training
- Schlüter, Müller, et al.
- 1999

Citation Context: ...On the Wall Street Journal task, using a unigram for lattice generation and a trigram for recognition resulted in a 0.32% absolute increase in performance over using a trigram for lattice generation [18]. For the harder Hub5 task, it was shown in [19] that using a unigram for lattice generation and a trigram for recognition slightly improved performance over using a bigram for lattice generation. As ...

1 | Frame Discrimination of HMMs for Large Vocabulary Speech Recognition
- Povey, Woodland
- 2000

Citation Context: ...mination has met with mixed results. The original author showed FD could improve MMIE results for digit recognition [1]. However, for harder tasks such as Resource Management, North American Business [16], and broadcast news transcription [17], FD did not significantly improve accuracy for complex models. Using clever selection of the best Gaussians, FD has been shown to speed computation, but this adv...