## Adaptive Training for Large Vocabulary Continuous Speech Recognition (2006)

Citations: 6 (2 self)

### BibTeX

    @MISC{Yu06adaptivetraining,
      author = {Kai Yu},
      title  = {Adaptive Training for Large Vocabulary Continuous Speech Recognition},
      year   = {2006}
    }


### Abstract

In recent years, there has been a trend towards training large vocabulary continuous speech recognition (LVCSR) systems on a large amount of found data. Found data is recorded from spontaneous speech without careful control of the recording acoustic conditions, for example, conversational telephone speech. Hence, it typically has greater variability in terms of speaker and acoustic conditions than specially collected data. Thus, in addition to the desired speech variability required to discriminate between words, it also includes various non-speech variabilities, for example, the change of speakers or acoustic environments. The standard approach to handle this type of data is to train hidden Markov models (HMMs) on the whole data set as if all data comes from a single acoustic condition. This is referred to as multi-style training, for example speaker-independent training. Effectively, the non-speech variabilities are ignored. Though good performance has been obtained with multi-style systems, a single model set must account for all variabilities. Improvement may be obtained if the two types of variabilities in the found data are modelled separately. Adaptive training has been proposed for this purpose. In contrast to multi-style training, a set of transforms is used to represent the non-speech variabilities. A canonical

### Citations

8090 | Maximum likelihood from incomplete data via the EM algorithm
- Dempster, Laird, et al.
- 1977
Citation Context ...xistence of hidden variables in HMMs, direct optimisation of equation (2.10) with respect to M is nontrivial. One solution for this type of optimisation is the expectation maximisation (EM) algorithm [21]. 2.3.1 Expectation Maximisation (EM) Algorithm The expectation maximisation (EM) algorithm is widely used for optimisation of statistical models with hidden variables [21]. The basic idea of the algor... |
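The E- and M-steps mentioned above can be sketched for a simple case. The snippet below runs one EM iteration for a one-dimensional, two-component Gaussian mixture; the function name and plain-list representation are illustrative and not from the thesis.

```python
import math

def em_step(data, weights, means, variances):
    """One EM iteration for a 1-D Gaussian mixture (illustrative sketch)."""
    K = len(means)
    # E-step: posterior responsibility of each component for each point.
    resp = []
    for x in data:
        probs = [w * math.exp(-(x - m) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)
                 for w, m, v in zip(weights, means, variances)]
        total = sum(probs)
        resp.append([p / total for p in probs])
    # M-step: re-estimate parameters from the responsibilities.
    counts = [sum(r[k] for r in resp) for k in range(K)]
    new_weights = [c / len(data) for c in counts]
    new_means = [sum(r[k] * x for r, x in zip(resp, data)) / counts[k]
                 for k in range(K)]
    new_vars = [sum(r[k] * (x - new_means[k]) ** 2 for r, x in zip(resp, data)) / counts[k]
                for k in range(K)]
    return new_weights, new_means, new_vars
```

Each iteration is guaranteed not to decrease the likelihood, which is the key property EM inherits from the Jensen-inequality lower bound.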

2649 |
Introduction to Statistical Pattern Recognition”, 2nd edition
- Fukunaga
- 1990
Citation Context ...th training and recognition are normalised using the same linear transforms so that the original acoustic space is projected to one or more uncorrelated sub-spaces. Linear discriminant analysis (LDA) [27, 12] and heteroscedastic linear discriminant analysis (HLDA) [70] are widely used linear projection schemes. The HLDA transform is used as the linear projection scheme for the LVCSR systems in this work. ... |

1923 |
Pattern Classification
- Duda, Hart, et al.
- 2000
Citation Context ... given the MAP estimate of transform LMAP(O|H, T̂) and H(·) is the entropy function defined in equation (2.16). For all point estimates of T̂, the entropy of the Dirac delta function remains −∞ [24]. As H(δ(T − T̂)) is a negative constant with infinite value, it can be ignored without affecting the rank ordering of the lower bound. The rank ordering of LMAP(T̂) can be derived from KMAP(T̂... |

1489 |
Fundamentals of Speech Recognition
- Rabiner, Juang
- 1993
Citation Context ...hemes are the same. First, the speech signal is split into discrete segments, usually with a 10 ms shift and a 25 ms window length. This reflects the short-term stationary property of speech signals [95]. These discrete segments are often referred to as frames. A feature vector will be extracted for each frame. A pre-emphasising technique is normally used during the feature extraction, where overlapp... |
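The 25 ms window / 10 ms shift segmentation described above can be sketched directly; the 16 kHz sample rate and function name below are assumptions for illustration only.

```python
def frame_signal(samples, sample_rate=16000, win_ms=25, shift_ms=10):
    """Split a waveform into overlapping frames (25 ms window, 10 ms shift)."""
    win = int(sample_rate * win_ms / 1000)      # 400 samples at 16 kHz
    shift = int(sample_rate * shift_ms / 1000)  # 160 samples at 16 kHz
    frames = []
    start = 0
    while start + win <= len(samples):
        frames.append(samples[start:start + win])
        start += shift
    return frames
```

Adjacent frames overlap by 15 ms, which is what makes the short-term stationarity assumption reasonable within each frame.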

1240 |
Statistical decision theory and Bayesian analysis. Springer series in Statistics
- Berger
- 1985
Citation Context ...L distance is always positive unless the two distributions are the same, in which case the distance is zero. Hyper-parameters estimated using this approach are also referred to as an ML-II estimate [10, 96]. From the properties of Jensen's inequality, the inequality in equation (5.8) becomes an equality when q(M) = p(M|O, H) (5.11). Maximisi... |

1164 |
Error bounds for convolutional codes and an asymptotically optimum decoding algorithm
- Viterbi
- 1967
Citation Context ...s desirable to find the best path. A widely used approach for LVCSR is to find the state sequence that has the highest probability of generating the observation sequence. This is the Viterbi algorithm [116]. Here the maximum likelihood of the observation sequence given one hidden state sequence is used to approximate the marginal likelihood over all... |
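A minimal log-domain Viterbi sketch over a generic HMM, assuming dense transition and observation log-probabilities as plain lists; all names are illustrative, not from the thesis.

```python
import math

def viterbi(obs_logprobs, log_trans, log_init):
    """Most likely state sequence; obs_logprobs[t][j] = log b_j(o_t)."""
    T, N = len(obs_logprobs), len(log_init)
    delta = [[log_init[j] + obs_logprobs[0][j] for j in range(N)]]
    back = []
    for t in range(1, T):
        row, ptr = [], []
        for j in range(N):
            # Best predecessor state for state j at time t.
            best_i = max(range(N), key=lambda i: delta[-1][i] + log_trans[i][j])
            row.append(delta[-1][best_i] + log_trans[best_i][j] + obs_logprobs[t][j])
            ptr.append(best_i)
        delta.append(row)
        back.append(ptr)
    # Trace back the best path from the final best state.
    path = [max(range(N), key=lambda j: delta[-1][j])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))
```

Working in the log domain avoids the numerical underflow that products of many small probabilities would otherwise cause.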

1113 |
Pattern recognition and neural networks
- Ripley
- 1996
Citation Context ...L distance is always positive unless the two distributions are the same, in which case the distance is zero. Hyper-parameters estimated using this approach are also referred to as an ML-II estimate [10, 96]. From the properties of Jensen's inequality, the inequality in equation (5.8) becomes an equality when q(M) = p(M|O, H) (5.11). Maximisi... |

903 | Monte Carlo Statistical Methods
- Robert, Casella
- 2004
Citation Context ...pproximating intractable probabilistic integrals. The basic idea is to draw samples from the distribution and use the average integral function value to approximate the real probabilistic expectation [99]. Thus p(O|H) ≈ (1/N) Σ_{n=1}^{N} p(O|H, T̂_n) (5.65), where N is the total number of samples and T̂_n is the n-th sample drawn from p(T). In the limit as N → ∞ this will tend to the true integral [99]. Ther... |
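Equation (5.65) amounts to plain Monte Carlo averaging. A sketch, with hypothetical callables standing in for the likelihood p(O|H, T) and the transform prior p(T):

```python
import math

def mc_marginal_likelihood(loglik_fn, sample_transform, n_samples=1000):
    """Approximate p(O|H) = E_{T ~ p(T)}[p(O|H, T)] by sample averaging."""
    total = 0.0
    for _ in range(n_samples):
        t = sample_transform()           # draw T_n from p(T)
        total += math.exp(loglik_fn(t))  # accumulate p(O|H, T_n)
    return total / n_samples
```

The estimate is unbiased, and its standard error shrinks as 1/√N, which is why the snippet above notes convergence only in the limit N → ∞.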

757 |
Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences
- Davis, Mermelstein
- 1980
Citation Context ...eech variabilities. These parametric vectors are often referred to as feature vectors or observations. There are two widely used feature extraction schemes: Mel-frequency cepstral coefficients (MFCC) [18] and perceptual linear prediction (PLP) [53]. Both schemes are based on cepstral analysis. The initial frequency analysis of the two schemes is the same. First, the speech signal is split into discre... |

693 |
Optimal Statistical Decisions
- DeGroot
- 1970
Citation Context ...iliary function for equation (2.67) becomes QMPE(M; M̂) = Qn(M; M̂) − Qd(M; M̂) + S(M; M̂) + log p(M|Φ) (2.68). One commonly used distribution for model parameters is the normal-Wishart distribution [20], which was also used for maximum a posteriori (MAP) training in [42]. This prior distribution has a similar form to the standard auxiliary function. Ignoring the constants independent of the parameter... |

592 | Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models
- Leggetter, Woodland
- 1995
Citation Context ...gression (MLLR) Maximum likelihood linear regression (MLLR) uses the ML criterion to estimate a linear transform to adapt the Gaussian parameters of HMMs. It was originally proposed to adapt mean vectors [74] and was later extended to variance adaptation [41, 34]. To avoid confusion, the term “MLLR” will only refer to mean-based linear transforms in this work. In MLLR, the mean of Gaussian component m is adap... |
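The MLLR mean update µ̂ = Aµ + b is commonly written as Wξ, with ξ = [1, µᵀ]ᵀ the extended mean vector and W = [b A]. A minimal sketch with plain lists and illustrative names:

```python
def mllr_adapt_mean(W, mu):
    """Apply an MLLR mean transform: mu_hat = W @ [1; mu] = A mu + b."""
    xi = [1.0] + list(mu)  # extended mean vector [1, mu^T]^T
    return [sum(w * x for w, x in zip(row, xi)) for row in W]
```

For example, with W = [[1.0, 2.0, 0.0], [0.5, 0.0, 1.0]], the first column acts as the bias b and the remaining columns as the matrix A.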

490 | Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov chains
- Gauvain, Lee
- 1994
Citation Context ...− Qd(M; M̂) + S(M; M̂) + log p(M|Φ) (2.68). One commonly used distribution for model parameters is the normal-Wishart distribution [20], which was also used for maximum a posteriori (MAP) training in [42]. This prior distribution has a similar form to the standard auxiliary function. Ignoring the constants independent of the parameters, the logarithm of the distribution is expressed as... |

457 |
A model for reasoning about persistence and causation
- Dean, Kanazawa
- 1989
Citation Context ...stance is not only dependent on the state at that time instance, but also on the adaptation transform associated with the homogeneous block. A DBN is a graph that shows statistical dependencies [19]. In DBNs, a circle represents a continuous variable, a square represents a discrete variable, blank ones represent observable variables, and shaded ones represent unobservable variables. Lack of an a... |

406 | Maximum Likelihood Linear Transformations For HMM-Based Speech Recognition
- Gales
- 1998
Citation Context ...ing The linear transform based discriminative adaptive training has been previously investigated [60, 47]. The most commonly used linear transforms are mean transforms [74] and constrained transforms [34]. Discriminative adaptive training with the two types of transforms is reviewed in this section. A more detailed review can be found in [118]. 4.2.1 DAT with Mean Transform ML adaptive training with... |

353 |
The population frequencies of the species and the estimation of population parameters
- Good
- 1953
Citation Context ...on of the allocated probability mass is controlled by a discounting factor. Commonly used discounting approaches include Good-Turing discounting [45, 66], Witten-Bell discounting [125] and absolute discounting [84]. • Back-off The basic idea of back-off is to make use of shorter histories which can be estimated more robustly, rather than assigning pro... |

275 | Variational algorithms for approximate bayesian inference
- Beal
- 2003
Citation Context ... will be discussed in detail in section 2.6.3. Normally, the form of the prior distribution is determined in advance. In the Bayesian community, a conjugate prior to the likelihood is a common choice [9]. This is because when a conjugate prior is used, the posterior distribution of the parameters given the observations will have the same functional form as the prior. The estimation of the posterior d... |

228 |
The zero-frequency problem: Estimating the probabilities of novel events in adaptive text compression
- Witten, Bell
- 1991
Citation Context ...ss is controlled by a discounting factor. Commonly used discounting approaches include Good-Turing discounting [45, 66], Witten-Bell discounting [125] and absolute discounting [84]. • Back-off The basic idea of back-off is to make use of shorter histories which can be estimated more robustly, rather than assigning probability mass to those unlikely... |

214 |
An Inequality with Applications to Statistical Estimation for Probabilistic Functions of a Markov Process and to a Model for Ecology
- Baum, Egon
- 1967
Citation Context ...tion of the two state posterior distributions is a key stage in HMM parameter estimation. They can be efficiently computed using the forward-backward algorithm, also known as the Baum-Welch algorithm [8]. This algorithm is an efficient re-arrangement of equation (2.22) and equation (2.23), making use of two intermediate probabilities and the conditional independence assumptions of HMMs. The forward... |
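The forward half of the recursion can be sketched directly: α_t(j) = [Σ_i α_{t−1}(i) a_ij] b_j(o_t). A toy implementation with illustrative names (probabilities kept in the linear domain for brevity; real systems work in the log domain or with scaling):

```python
def forward(obs_probs, trans, init):
    """Forward probabilities: alpha[t][j] = p(o_1..o_t, q_t = j)."""
    N = len(init)
    # Initialisation: alpha_1(j) = pi_j * b_j(o_1).
    alpha = [[init[j] * obs_probs[0][j] for j in range(N)]]
    for t in range(1, len(obs_probs)):
        # Recursion: sum over all predecessor states, then emit o_t.
        alpha.append([sum(alpha[-1][i] * trans[i][j] for i in range(N)) * obs_probs[t][j]
                      for j in range(N)])
    return alpha
```

Summing the final column gives the total likelihood p(O|H), the same value a brute-force sum over all state sequences would produce, but in O(TN²) time.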

197 |
Accurate Approximations for Posterior Moments and Marginal Densities
- Tierney, Kadane
- 1986
Citation Context ...is to the real likelihood. Another approach to approximate the Bayesian integral in the marginal likelihood calculation is to use a Laplace approximation or normal approximation [108, 57]. In this approach, the integral in equation (2.96) is approximated by p(O|H) ≈ p(O|M̂_MAP, H) p(M̂_MAP|O_trn, H_trn) (2π)^{D/2} |Σ_MAP|^{1/2} (2.99), where D is the total number of parameters of M and M̂_MAP is... |

196 | Some statistical issues in the comparison of speech recognition algorithms
- Gillick, Cox
- 1989
Citation Context ...ing the NIST scoring toolkit sctk-1.2. Significant differences were reported using the Matched-Pair Sentence-Segment Word Error (MAPSSWE) test at a significance level of 5%, or 95% confidence [43]. 6.2 Discriminative Cluster Adaptive Training This section presents the development experiments for discriminative cluster adaptive training (CAT). All systems in this section are 16 Gaussian componen... |

185 | Effectiveness of linear prediction characteristics of the speech wave for automatic speaker identification and verification - Atal - 1974 |

182 | Continuous speech recognition by statistical methods - Jelinek - 1976 |

181 | Semi-tied covariance matrices for hidden Markov models
- Gales
- 1999
Citation Context ...e diagonal. The use of a diagonal covariance matrix may give poor modelling of the correlation between different dimensions. Hence, many complicated covariance modelling techniques have been investigated [35, 5]. However, diagonal covariance matrices are still widely used because of their low computational cost and their successful use in state-of-the-art LVCSR systems. In this thesis, only diagonal covariance... |

181 | Tree-Based State Tying for High Accuracy Acoustic Modelling
- Young, Odell, et al.
- 1994
Citation Context ...rossword tri-phones is about 100,000. It is hard to collect sufficient training data to robustly train all tri-phones. To solve this problem, parameter tying, or clustering, techniques are often used [135, 134]. The basic idea of the technique is to consider a group of parameters as sharing the same set of values. In training, the statistics of the whole group are used to estimate the shared parameter. Tying can... |

179 |
Minimum phone error and I-smoothing for improved discriminative training
- Povey, Woodland
- 2002
Citation Context ...bustness issue is effectively addressed by using full Bayesian approaches, as shown in the experiments. In most state-of-the-art systems, discriminative training is used to obtain the best performance [6, 64, 93]. It takes into account the competing incorrect hypotheses during training and aims at directly reducing the recognition error. Discriminative training has been investigated within the linear transfor... |

163 |
Maximum Mutual Information Estimation of Hidden Markov Model Parameters for Speech Recognition
- Bahl, Brown, de Souza, et al.
- 1986
Citation Context ...bustness issue is effectively addressed by using full Bayesian approaches, as shown in the experiments. In most state-of-the-art systems, discriminative training is used to obtain the best performance [6, 64, 93]. It takes into account the competing incorrect hypotheses during training and aims at directly reducing the recognition error. Discriminative training has been investigated within the linear transfor... |

145 | A compact model for speaker-adaptive training
- Anastasakos
- 1996
Citation Context ...e other training schemes that are more powerful in handling the non-speech variabilities in training data. Adaptive training is a powerful solution for building systems on non-homogeneous training data [3]. Rather than treating all the data as a single block, the training data is split into several homogeneous blocks, for example by speaker side or by data block with the same acoustic environment. Thus,... |

129 |
Minimum classification error rate methods for speech recognition
- Juang, Chou, et al.
- 1997
Citation Context ...bustness issue is effectively addressed by using full Bayesian approaches, as shown in the experiments. In most state-of-the-art systems, discriminative training is used to obtain the best performance [6, 64, 93]. It takes into account the competing incorrect hypotheses during training and aims at directly reducing the recognition error. Discriminative training has been investigated within the linear transfor... |

116 |
Discriminative training for large vocabulary speech recognition
- Povey
- 2003
Citation Context ...on may not yield the most appropriate estimate for recognition. It is then preferable to use other training criteria that explicitly aim at reducing the recognition error rate. Discriminative criteria [86, 90] are widely investigated to achieve this goal, and will be discussed in detail in section 2.4. • Estimation Error An assumption for ML training to be optimal is that there is sufficient training dat... |

109 | Mean and variance adaptation within the mllr framework
- Gales, Woodland
- 1996
Citation Context ...ssion (MLLR) uses the ML criterion to estimate a linear transform to adapt the Gaussian parameters of HMMs. It was originally proposed to adapt mean vectors [74] and was later extended to variance adaptation [41, 34]. To avoid confusion, the term “MLLR” will only refer to mean-based linear transforms in this work. In MLLR, the mean of Gaussian component m is adapted to a particular acoustic condition by µ̂(m) =... |

101 | Confidence measures for large vocabulary continuous speech recognition
- Wessel, Schlüter, et al.
- 2001
Citation Context ...threshold are ignored [122]. Word posterior probabilities from the recogniser are widely used to calculate the confidence scores for each word in the hypothesis [123, 124]. This technique is effective when the word error rate of the hypothesis supervision is high. Since confidence score based adaptation “eliminates” incorrect words from the hypothesis, in an ideal case... |

99 |
Large Scale Discriminative Training of Hidden Markov Models for Speech Recognition
- Woodland, Povey
Citation Context ...hard to find a strong-sense auxiliary function due to the denominator term in the criterion. To allow the discriminative criterion to be optimised, the extended Baum-Welch (EBW) algorithm was proposed [46, 86, 131], which extends the Baum-Eagon inequality to rational functions by using an additional smoothing term to ensure the convexity of the auxiliary function. This allows discriminative training to be done... |

95 | Rapid speaker adaptation in eigenvoice space
- Kuhn, Junqua, et al.
Citation Context ...ive the number of clusters. Given the basis eigenvoices and the meta-vectors for each speaker, initial weights for each speaker can be obtained using either a projection scheme or an ML based approach [36, 68]. This approach will naturally output a bias cluster, i.e. a cluster with a weight value of 1.0, which is the mean of all meta-vectors. During CAT training, the weight of this bias cluster can either... |

89 | Fast Speaker Adaptation Using Constrained Estimation of Gaussian Mixtures - Digalakis, Rtischev, et al. - 1995 |

83 |
The N-Best Algorithm: An Efficient and Exact Procedure for Finding the N Most Likely Sentence Hypotheses
- Schwartz, Chow
- 1990
Citation Context ...n of HMMs. As the assumption is not valid for adaptive HMMs due to the additional dependency on transforms, the Viterbi algorithm is not suitable for Bayesian adaptive inference. Instead, N-best rescoring [103] is used in this work to reflect the nature of adaptive HMMs. Though N-best rescoring may limit the performance gain, and loss, due to the limited number of candidate hypothesis sequences, given su... |

82 |
Speaker normalization using efficient frequency warping procedures
- Lee, Rose
- 1996
Citation Context ...N) One major non-speech variability that affects the performance of a speech recognition system is the variability of the human voice among different speakers. Vocal tract length normalisation (VTLN) [71] is a technique that can reduce the mismatch between speakers. The basic idea is to map the actual speech signal to a normalised signal with less variability due to the different vocal tract lengths of di... |

82 | Flexible Speaker Adaptation Using Maximum Likelihood Linear Regression
- Leggetter, Woodland
- 1995
Citation Context ...nce need to be accumulated. This is the most efficient form and will be used in this work. It is interesting to compare the information strategy described above to standard incremental adaptation [73]. In standard incremental adaptation, an ML transform is re-estimated for each utterance using the accumulated statistics from the previous and current utterances. In this estimation, the state/co... |

77 |
The Dragon System - an Overview
- Baker
- 1975
Citation Context ...ohn Pierce of Bell Labs said that ASR would not be a reality for several decades. However, the 1970s witnessed a significant theoretical breakthrough in speech recognition: hidden Markov models (HMMs) [7, 61]. In the following decades, HMMs were extensively investigated and became the most successful technique for acoustic modelling in speech recognition. The fast development of computer hardware and algo... |

77 | Investigation of silicon auditory models and generalization of linear discriminant analysis for improved speech recognition - Kumar - 1997 |

75 | Structural maximum a posteriori linear regression for fast HMM adaptation
- Siohan, Myrvoll, et al.
- 2002
Citation Context ...AT, due to the row-independence assumption in the prior, the resultant covariance matrix of the predictive distribution is also diagonal. MAP linear regression (MAPLR) with a single Gaussian prior was presented in [13]. The multiple component prior MAP estimate is a straightforward extension, yielding forms similar to that for CAT in equation (5.83). Given the ML sufficient statistics GML,d and kML,d in equation (3... |

71 | Large scale discriminative training for speech recognition
- Woodland, Povey
- 2000
Citation Context ...n due to errorful hypotheses may be significantly reduced. In lattice based adaptation, an extended forward-backward algorithm is performed over the recognised lattice, or alternative hypotheses [130, 131]. This lattice forward-backward algorithm gives the posterior probability of each Gaussian component given all possible hypotheses in the lattice. Using this posterior probability to accumulate statis... |

70 |
An empirical Bayes approach to statistics
- Robbins
- 1955
Citation Context ...st be estimated from the training data, i.e., by maximising the marginal likelihood in equation (2.81) with respect to the hyper-parameters Φ. This is the basic idea of the empirical Bayesian approach [97, 98]. In this case, it can be shown that the empirically estimated prior distribution must have the same form and hyper-parameters as the posterior distribution p(M|O, H), which is estimated on the trainin... |

64 | The 1994 HTK large vocabulary speech recognition system
- Woodland, Leggetter, et al.
- 1995
Citation Context ...VN) One standard normalisation transform is to sphere the data, i.e., transform the data so that it has zero mean and unit variance in each dimension of the feature. Cepstral mean normalisation (CMN) [4, 129] and cepstral variance normalisation (CVN) are simple techniques to achieve this goal. The idea is to normalise the mean and variance of each dimension of the observations. ... |
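Per-dimension CMN/CVN is a simple two-pass normalisation. A sketch, assuming the features are plain lists of equal-length frames (names illustrative):

```python
def cmn_cvn(features):
    """Per-dimension cepstral mean and variance normalisation."""
    T = len(features)
    D = len(features[0])
    # First pass: per-dimension mean and variance over the utterance.
    means = [sum(f[d] for f in features) / T for d in range(D)]
    variances = [sum((f[d] - means[d]) ** 2 for f in features) / T
                 for d in range(D)]
    # Second pass: subtract the mean, divide by the standard deviation.
    return [[(f[d] - means[d]) / variances[d] ** 0.5 for d in range(D)]
            for f in features]
```

After this transform each dimension of the utterance has zero mean and unit variance, i.e. the data is "sphered" as described above.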

63 |
Maximum likelihood estimation for multivariate mixture observations of markov chains
- Juang, Levinson, et al.
- 1986
Citation Context ...times the state transition probability. Considering the distinct Gaussian component (sub-state) sequence as the hidden variable sequence, the Gaussian component posterior occupancy can be derived as [65]: γ_jm(t) = Σ_{i=2}^{N−1} α_i(t−1) a_ij c_jm b_jm(o_t) β_j(t) / p(O|H, M̂_k) (2.35), where jm denotes the m-th Gaussian component of state j, b_jm(o_t) is a Gaussian distribution N(o_t; µ^(jm), Σ^(jm)) and c_jm is the w... |

62 | The generation and use of regression class trees for MLLR adaptation
- Gales
- 1996
Citation Context ...data, which is not flexible. Rather than specifying static classes, a dynamic scheme is often used to construct additional transforms as more adaptation data becomes available. A regression class tree [72, 31] is used to group Gaussian components so that the number of transforms to be estimated can be chosen dynamically according to the amount of available adaptation data. The regression class tree is... |

57 | Cluster Adaptive Training of Hidden Markov Models
- Gales
- 2000
Citation Context ...n s, where λ^(sr) = [λ_1^(sr), ..., λ_P^(sr)]^T (3.43) and λ_p^(sr) is the interpolation weight for cluster p. In some systems a bias cluster is used, whose weight is 1 for all acoustic conditions [36]. The adapted mean for a particular acoustic condition s can then be written as µ̂^(sm) = M^(m) λ^(s r_m) (3.44), where r_m is the regression base class that component m belongs to. There are two kinds o... |
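The CAT adapted mean in equation (3.44) is a matrix-vector product: the columns of M^(m) are the cluster means for component m and λ holds the interpolation weights. A sketch with illustrative names:

```python
def cat_adapt_mean(M, lam):
    """Cluster adaptive training: adapted mean = M @ lambda,
    where column p of M is the mean of cluster p for this component."""
    D = len(M)  # feature dimensionality (rows of M)
    return [sum(M[d][p] * lam[p] for p in range(len(lam))) for d in range(D)]
```

With two clusters and equal weights, the adapted mean is simply the midpoint of the two cluster means, which illustrates how the weights interpolate between acoustic conditions.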

57 |
Minimum Bayes-risk automatic speech recognition. Computer Speech and Language
- Goel, Byrne
- 2000
Citation Context ...servation sequence and the model parameters, l(H, H_ref) is a loss function of H given the reference, or correct, transcription H_ref. The minimum Bayesian risk (MBR) criterion was first used in decoding [44]. As the criterion is a good description of recognition error, it has also been adopted in discriminative training [23]. An MBR estimator finds the model parameters by minimising the Bayesian risk of... |

57 |
Large vocabulary continuous speech recognition using HTK
- Woodland, Odell, et al.
- 1994
Citation Context ...ndaries. Hence, bi-phones have to be used to model the start and end phones at the word boundaries. In this work, cross-word tri-phones are considered as they yield good performance for LVCSR systems [128]. One issue with using tri-phones is that the number of possible acoustic units is significantly increased. For example, for a mono-phone set with 46 phones, the number of possible crossword tri-phone... |

51 |
RASTA-PLP speech analysis technique
- Hermansky, Morgan, et al.
- 1992
Citation Context ...sequence. From the signal processing point of view, CMN is similar to the RASTA approach, where a high-pass filter is applied to a log-spectral representation of speech, such as cepstral coefficients [54]. Equation (3.23) is also a high-pass filter in the cepstral domain. This filtering will suppress the constant spectral components, which reflect the effect of convolutive noise factors in the input s... |

48 | Using word probabilities as confidence measures
- Wessel, Macherey, Schlüter
- 1998
Citation Context ...threshold are ignored [122]. Word posterior probabilities from the recogniser are widely used to calculate the confidence scores for each word in the hypothesis [123, 124]. This technique is effective when the word error rate of the hypothesis supervision is high. Since confidence score based adaptation “eliminates” incorrect words from the hypothesis, in an ideal case... |