## Uncertainty decoding for noise robust speech recognition (2004)

Venue: Proc. Interspeech

Citations: 37 (12 self)

### BibTeX

@INPROCEEDINGS{Liao04uncertaintydecoding,
  author    = {Hank Liao},
  title     = {Uncertainty decoding for noise robust speech recognition},
  booktitle = {Proc. Interspeech},
  year      = {2004}
}

### Abstract

This dissertation is the result of my own work and includes nothing which is the outcome of work done in collaboration. It has not been submitted in whole or in part for a degree at any other university. Some of the work has been published previously in conference proceedings.

### Citations

8613 | Maximum Likelihood From Incomplete Data via the EM Algorithm
- Dempster, Laird, et al.
- 1977
Citation Context: ...maximises this function [118]. One iterative method to find a local maximum is the Baum-Welch (BW) or forward-backward algorithm [9]. This is an example of the expectation maximisation (EM) algorithm [21]. Instead of maximising L, an auxiliary function Q is optimised which guarantees the log-likelihood will not decrease: L(M̂) − L(M) ≥ Q(M; M̂) − Q(M; M) (2.27), which is derived using Jensen’s inequa...
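The EM bound quoted in this context can be checked numerically. Below is a minimal sketch (my own illustrative example, not code from the cited work): EM updates for a toy one-dimensional two-component Gaussian mixture, recording the log-likelihood at each iteration, which the bound guarantees never decreases.

```python
import numpy as np

# Toy check of the EM bound: each EM update of a 1-D two-component GMM
# must not decrease the log-likelihood L, mirroring
# L(M') - L(M) >= Q(M; M') - Q(M; M).
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 200), rng.normal(3, 1, 200)])

def log_lik(x, w, mu, var):
    comp = w * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
    return np.log(comp.sum(axis=1)).sum()

w, mu, var = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])
liks = []
for _ in range(20):
    liks.append(log_lik(x, w, mu, var))
    # E-step: component posteriors gamma
    comp = w * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
    gamma = comp / comp.sum(axis=1, keepdims=True)
    # M-step: re-estimate weights, means, variances from the posteriors
    n = gamma.sum(axis=0)
    w, mu = n / len(x), (gamma * x[:, None]).sum(axis=0) / n
    var = (gamma * (x[:, None] - mu) ** 2).sum(axis=0) / n
liks.append(log_lik(x, w, mu, var))
```

Each successive entry of `liks` is at least as large as the previous one, up to floating-point tolerance.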

4513 | A tutorial on hidden Markov models and selected applications in speech recognition
- Rabiner
- 1990
Citation Context: ...ynamic coefficients do not improve system accuracy. 2.3 Acoustic Modelling HMMs have proven to be a powerful means of representing time varying signals, such as speech, as a parametric random process [72, 118]. In ASR, an HMM is used to model the acoustics of each word, syllable or phone to generate the acoustic score p(S|W, M) in equation (2.2). The HMM is a first-order, discrete-time Markov chain, as dep...
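The context above describes the HMM as a first-order, discrete-time Markov chain used to produce the acoustic score. A minimal sketch of the forward recursion for computing p(O|M), using a toy discrete-observation HMM (real ASR systems use continuous Gaussian-mixture output densities; all parameter values here are illustrative):

```python
import numpy as np

# Minimal forward algorithm for a discrete-observation HMM.
def forward(pi, A, B, obs):
    """Return p(obs | model) by the forward recursion."""
    alpha = pi * B[:, obs[0]]          # initialise with start probabilities
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]  # propagate and absorb each observation
    return alpha.sum()

pi = np.array([1.0, 0.0])                  # start in state 0
A = np.array([[0.7, 0.3], [0.0, 1.0]])     # left-to-right topology
B = np.array([[0.9, 0.1], [0.2, 0.8]])     # discrete output probabilities
p = forward(pi, A, B, [0, 0, 1])
```

The recursion sums over all state sequences implicitly, so it matches a brute-force enumeration over paths.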

806 | Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences
- Davis, Mermelstein
- 1980
Citation Context: ...extraction stage, the speech signal captured by the microphone is sampled and digitised into discrete samples over time. A popular feature representation is mel frequency cepstrum coefficients (MFCC) [20], which arise from a homomorphic transform of the short-term spectrum expressed on a mel frequency scale. Figure 2.2 shows how they may be computed. Their use is motivated by both perceptual and perfo...
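The MFCC pipeline this context sketches (framing, short-term power spectrum, mel filterbank, log, DCT) can be written compactly. The window, hop, filter and cepstral counts below are illustrative assumptions, not the exact front-end of any cited system:

```python
import numpy as np

# Sketch of the MFCC pipeline: frame -> |FFT|^2 -> mel filterbank -> log -> DCT.
def mfcc(signal, sr=16000, n_fft=512, n_mels=24, n_ceps=13):
    hop, win = 160, 400                       # 10 ms hop, 25 ms window at 16 kHz
    frames = np.lib.stride_tricks.sliding_window_view(signal, win)[::hop]
    frames = frames * np.hamming(win)
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2
    # Triangular filters spaced evenly on the mel scale
    mel = lambda f: 2595 * np.log10(1 + f / 700)
    imel = lambda m: 700 * (10 ** (m / 2595) - 1)
    edges = imel(np.linspace(0, mel(sr / 2), n_mels + 2))
    bins = np.floor((n_fft + 1) * edges / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    logmel = np.log(power @ fbank.T + 1e-10)
    # DCT-II decorrelates the log filterbank energies into cepstra
    n = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), (2 * n + 1) / (2 * n_mels)))
    return logmel @ dct.T

feats = mfcc(np.random.default_rng(1).standard_normal(16000))
```

One second of audio at these settings yields 98 frames of 13 cepstral coefficients each.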

628 | Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models
- Leggetter, Woodland
- 1995
Citation Context: ...e following sections. The first two may be classified as adaptation techniques, the second two normalisation schemes. 2.5.1 Maximum Likelihood Linear Regression ML linear regression (MLLR) adaptation [46, 89] estimates an affine transformation of the acoustic model parameters. The transformation maximises the likelihood of the available adaptation data. Since the amount of adaptation data is usually limit...
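The affine mean transform that MLLR estimates can be illustrated in a simplified setting: with fixed component alignments and identity covariances, the ML estimate of W = [A b] reduces to a weighted least-squares fit of observations against extended means. Everything here (component count, the "true" transform, noise level) is an illustrative stand-in, not the full MLLR estimation of the cited papers:

```python
import numpy as np

# Simplified mean-MLLR sketch: with known component alignments and identity
# covariances, the ML affine transform is the least-squares solution
# W = (sum_t o_t xi_t^T)(sum_t xi_t xi_t^T)^-1, with xi = [mu; 1].
rng = np.random.default_rng(2)
mus = rng.standard_normal((4, 3))                 # clean model means
A_true, b_true = np.diag([1.1, 0.9, 1.2]), np.array([0.5, -0.3, 0.2])
obs, xi = [], []
for m in range(4):
    for _ in range(50):
        obs.append(A_true @ mus[m] + b_true + 0.01 * rng.standard_normal(3))
        xi.append(np.append(mus[m], 1.0))         # extended mean vector
obs, xi = np.array(obs), np.array(xi)
# Accumulate statistics and solve for W = [A b]
W = (obs.T @ xi) @ np.linalg.inv(xi.T @ xi)
adapted_means = (W @ np.hstack([mus, np.ones((4, 1))]).T).T
```

The recovered W closely matches the transform used to generate the adaptation data.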

614 | Perceptual linear predictive (PLP) analysis of speech
- Hermansky
- 1990
Citation Context: ...fficient [152]. Although MFCC are a widely used speech parameterisation, its optimality has been questioned [63, 65, 67]. Alternatively, perceptual linear prediction (PLP) coefficients have been used [64] giving similar performance to MFCC [69]. An extensive review of speech signal representations can be found in Huang et al. [72] or Gold and Morgan [57]. ...

548 | Switchboard: telephone speech corpus for research and development
- Godfrey, Holliman, et al.
- 1992
Citation Context: ...le the best machine error rates have only advanced from 0.72% to 0.55% over the last decade [155]. For more difficult tasks the difference narrows: for example on telephone conversation transcription [56] the human word error rate is about 4% while state-of-the-art automatic transcription system rates are still over three times worse [18, 32]. The difference between human and machine performance has ...

519 | Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov chains
- Gauvain, Lee
- 1994
Citation Context: ...transforms can be estimated in an unsupervised fashion if the word error rate of the hypothesised transcription is around 20% or less. MLLR is often compared to maximum a posteriori (MAP) adaptation [55]. MAP produces an adapted model set that may be considered a weighted combination of well-trained, but mismatched, prior models and ML parameter estimates from limited matched test adaptation data. It...

507 | RASTA processing of speech
- Hermansky, Morgan
- 1994
Citation Context: ...robust front-ends, front-end compensation and model-based compensation. The first seeks speech features that are immune to noise. While performance may be acceptable in low levels of background noise [67], an inherently robust front-end has yet to be developed that can handle higher and varied noise levels. Hence, the focus of much research has turned to feature enhancement, or cleaning, whereby noise...

495 | Suppression of acoustic noise in speech using spectral subtraction - Boll - 1979

427 | Maximum likelihood linear transformations for HMM-based speech recognition
- Gales
- 1998
Citation Context: ...e following sections. The first two may be classified as adaptation techniques, the second two normalisation schemes. 2.5.1 Maximum Likelihood Linear Regression ML linear regression (MLLR) adaptation [46, 89] estimates an affine transformation of the acoustic model parameters. The transformation maximises the likelihood of the available adaptation data. Since the amount of adaptation data is usually limit...

349 | A Post-Processing System to Yield Reduced Word Error Rates: Recognizer Output Voting Error Reduction
- Fiscus
- 1997
Citation Context: ...ge model using 4- or 5-gram probabilities to yield the final transcription [121] or lead into a third pass using even more powerful cross-adaptation [53] and system combination methods such as ROVER [33] or CNC [31]. Efficient LVCSR decoding techniques with complex acoustic and language models continue to be an area of active research. A brief overview of decoding techniques is given in Aubert [6]. ...

225 | An inequality with applications to statistical estimation for probabilistic functions of Markov processes and to a model for ecology
- Baum, Eagon
- 1967
Citation Context: ...known way to analytically solve for the acoustic model which globally maximises this function [118]. One iterative method to find a local maximum is the Baum-Welch (BW) or forward-backward algorithm [9]. This is an example of the expectation maximisation (EM) algorithm [21]. Instead of maximising L, an auxiliary function Q is optimised which guarantees the log-likelihood will not decrease: L(M̂) − ...

210 | Robust automatic speech recognition with missing and unreliable acoustic data
- Cooke, Green, et al.
- 2001
Citation Context: ...information” paradigm presented in the Algonquin framework [84, 85] is viewed in this work as a model-based compensation approach for the reasons outlined in section 4.4.4. For missing feature theory [19, 119], data imputation with soft data is an observation uncertainty approach. In contrast, data marginalisation can be construed as a limited form of front-end uncertainty decoding, restricted to the spectr...

196 | Semi-tied covariance matrices for hidden Markov models
- Gales
- 1999
Citation Context: ...ngle Gaussian output distribution. More advanced covariance modelling methods [130], that are not as expensive as using full covariances, can give improved results, e.g. semi-tied covariance matrices [41] discussed in section 2.3.4. Later it will be important to distinguish these parameters from those associated with mismatched test data O which are subscripted with o. ...

196 | Tree-based state tying for high accuracy acoustic modelling
- Young, Odell, Woodland
- 1994
Citation Context: ...o share training data. Which states are tied together may be determined using data-driven clustering, however unseen contexts cannot be clustered. Alternatively, state clustering using decision trees [7, 111, 154] built from expert phonetic knowledge avoids this issue. An example decision tree is shown in figure 2.8. ...

192 | Minimum phone error and I-smoothing for improved discriminative training
- Povey, Woodland
Citation Context: ...stems described. A more complex system would typically use some form of feature projection scheme such as HLDA [87] or fMPE [116], advanced covariance modelling such as STC [41], and MMI [141] or MPE [115] training of model parameters—the use of such techniques was not investigated in these experiments. ...

180 | Acoustical and environmental robustness in automatic speech recognition
- Acero
- 1993
Citation Context: ...simplified by combining the various additive and convolutional noise sources into single additive noise, z(τ), and linear channel noise, h(τ), variables. Doing so gives this standard, oft-used model [1, 39, 106] of the noisy acoustic environment in the time domain shown in figure 3.2. The noisy signal is now given by y(τ) = x(τ) ∗ h(τ) + z(τ) (3.2) where y(τ) is the noise corrupted speech and x(τ) the “clean”...
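The standard model y(τ) = x(τ) ∗ h(τ) + z(τ) of equation (3.2) can be simulated directly. The signals and channel impulse response below are illustrative stand-ins, not data from any cited experiment:

```python
import numpy as np

# Simulate the simplified noisy acoustic environment of equation (3.2):
# convolutional channel noise followed by additive background noise.
rng = np.random.default_rng(3)
x = rng.standard_normal(1000)            # stand-in for the "clean" speech x(t)
h = np.array([1.0, 0.4, 0.1])            # linear channel impulse response h(t)
z = 0.1 * rng.standard_normal(1002)      # additive background noise z(t)
y = np.convolve(x, h) + z                # noise-corrupted signal y(t)
```

Full convolution of a length-1000 signal with a length-3 channel gives 1002 corrupted samples.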

176 | Hidden Markov model decomposition of speech and noise
- Varga, Moore
- 1990
Citation Context: ...re have been successful applications. In general, to find the most likely combined state sequence of noise and speech states, for a DBN as shown in figure 4.2, requires a 3-dimensional Viterbi search [143]. The additional computational cost may be avoided if certain assumptions are made. This chapter will discuss various noise robustness techniques and such assumptions within this framework for noise r...

149 | A compact model for speaker-adaptive training - Anastasakos, McDonough, et al. - 1996

144 | Speech recognition by machines and humans
- Lippmann
- 1997
Citation Context: ...all centre applications and desktop personal computer software. However, recognition accuracy is still far from human levels. Humans make mistakes at a rate of less than one hundredth of a percent [97] when recognising strings of digits, while the best machine error rates have only advanced from 0.72% to 0.55% over the last decade [155]. For more difficult tasks the difference narrows: for example ...

137 | Minimum classification error rate methods for speech recognition
- Juang, Chou, et al.
- 1997
Citation Context: ...MWE), uses the Levenshtein string error to compute the loss function making it close to evaluation using word error rate [16, 113] discussed later in section 2.4.3. Minimum classification error (MCE) [75] may be viewed as MBR training where the loss function is zero when the word sequence matches the reference and one otherwise. Hence MCE minimises the string-level error rate by reducing the posterior...

128 | The DARPA 1000-word resource management database for continuous speech recognition
- Price, Fisher, et al.
- 1988
Citation Context: ...systems with advanced acoustic modelling. Hence, more extensive exploration of these robustness techniques was conducted on the 1000-word naval ARPA Resource Management (RM) command and control task [117]. For this work, noise is artificially added at the waveform level from the NATO NOISEX-92 database [144]. The clean RM data was recorded in a sound-isolated room using a head mounted Sennheiser HMD414...

124 | Discriminative training for large vocabulary speech recognition
- Povey
- 2004
Citation Context: ...hypothesis. One form of MBR training, which may be called minimum word error (MWE), uses the Levenshtein string error to compute the loss function making it close to evaluation using word error rate [16, 113] discussed later in section 2.4.3. Minimum classification error (MCE) [75] may be viewed as MBR training where the loss function is zero when the word sequence matches the reference and one otherwise....

122 | The use of context in large vocabulary speech recognition
- Odell
- 1995
Citation Context: ...o share training data. Which states are tied together may be determined using data-driven clustering, however unseen contexts cannot be clustered. Alternatively, state clustering using decision trees [7, 111, 154] built from expert phonetic knowledge avoids this issue. An example decision tree is shown in figure 2.8. ...

118 | Speech recognition in noisy environments: a survey
- Gong
- 1995
Citation Context: ...s to re-train them with data from the new environment. This may be referred to as matched or multipass training. While matched training usually yields the best results in a variety of papers surveyed [47, 58, 151], it is not very practical since large amounts of noisy training data are required and the noise condition may vary. Artificial methods of corrupting the training data have been explored which also yi...

114 | Mean and variance adaptation within the MLLR framework
- Gales
- 1996
Citation Context: ...h an adequate amount of data MAP outperforms MLLR [72]. This is because MAP has greater flexibility to individually update each acoustic model component [72, 152]. An ML model variance transformation [49] of this form may be estimated: Σ_o^(m) = B^(m)-T H^(r_m) B^(m)-1 (2.57), where B^(m) is the Choleski factor of the inverse of the unadapted model variance Σ_s^(m), so Σ_s^(m)-1 = B^(m) B^(m)T (2.58...

112 | Robust continuous speech recognition using parallel model combination
- Gales, Young
- 1996
Citation Context: ...onents can remain unchanged, however anywhere from 25-1000 observations need to be generated per Gaussian in the system [47]. DPMC gave results equivalent to matched systems at levels below 20 dB SNR [52]. However, this iterative estimation is computationally expensive. 4.4.3 Vector Taylor Series Model Compensation As the discussion of PMC shows, deriving a corrupted speech output distribution, given ...

100 | Token passing: a simple conceptual model for connected speech recognition systems - Young, Russell, et al. - 1989

97 | The Lombard reflex and its role on human listeners and automatic speech recognizers
- Junqua
- 1993
Citation Context: ...consonants become distorted [77]. It is well known that recognition performance degrades significantly for stressed speech, such as Lombard, angry or loud speech compared to neutrally produced speech [15, 76], which recognisers are trained on. Attempts to address these effects have been beneficial [14, 132]; however in this work, these effects on speech production will not be directly addressed. In the mo...

96 | Matrix Algebra Useful for Statistics
- Searle
- 1982
Citation Context: ...W_2 W_s^-1 W_1 y + µ^T W_1 W_s^-1 W_2 µ − 2 y^T W_1^T W_s^-1 W_2 µ = (y − µ)^T (W_1^-1 + W_2^-1)^-1 (y − µ) (A.16), using the following identity, A(A + B)^-1 B = B(A + B)^-1 A = (A^-1 + B^-1)^-1, found in Searle [126]. Equations (A.13) and (A.16) combine to form equation (A.12) and therefore (y − x)^T W_1 (y − x) + (x − µ)^T W_2 (x − µ) = (x − x̂)^T W_s (x − x̂) + (y − µ)^T (W_1^-1 + W_2^-1)^-1 (y − µ) (A.17) N(y − x,...
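The Searle identity quoted in this context is easy to verify numerically on random symmetric positive-definite matrices (an illustrative check, not part of the cited derivation):

```python
import numpy as np

# Numerical check of the identity used in the derivation above:
# A (A + B)^-1 B = B (A + B)^-1 A = (A^-1 + B^-1)^-1
rng = np.random.default_rng(4)
M = rng.standard_normal((3, 3))
A = M @ M.T + 3 * np.eye(3)              # symmetric positive definite
N = rng.standard_normal((3, 3))
B = N @ N.T + 3 * np.eye(3)              # symmetric positive definite
inv = np.linalg.inv
lhs1 = A @ inv(A + B) @ B
lhs2 = B @ inv(A + B) @ A
rhs = inv(inv(A) + inv(B))
```

All three expressions agree to machine precision whenever A, B and A + B are invertible.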

93 | Speech Recognition in Noisy Environments
- Moreno
- 1996
Citation Context: ...simplified by combining the various additive and convolutional noise sources into single additive noise, z(τ), and linear channel noise, h(τ), variables. Doing so gives this standard, oft-used model [1, 39, 106] of the noisy acoustic environment in the time domain shown in figure 3.2. The noisy signal is now given by y(τ) = x(τ) ∗ h(τ) + z(τ) (3.2) where y(τ) is the noise corrupted speech and x(τ) the “clean”...

92 | Assessment for automatic speech recognition: II. NOISEX-92: a database and an experiment to study the effect of additive noise on speech recognition systems
- Varga, Steeneken
- 1993
Citation Context: ...ired and the noise condition may vary. Artificial methods of corrupting the training data have been explored which also yield good results. Samples of noise, such as those from the NOISEX-92 database [144], can be added to the clean training data to generate noise-corrupted training data. This provides good results for levels of noise down to 6-10 dB. However, matched training cannot easily address chan...

87 | Model-based techniques for noise robust speech recognition
- Gales
- 1995
Citation Context: ...ize of ∆ = 1 gives coefficients that are simply the difference between the previous and following frame. A large window size of ∆ = 2 gives a more robust estimate of dynamic coefficients. As noted in [39], delta parameters may be considered an approximation to the first derivative of the static parameters; hence in the Continuous-Time approximation, time derivatives of the static coefficients may be u...
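The delta coefficients this context discusses are usually computed with the standard regression formula d_t = Σ_θ θ(c_{t+θ} − c_{t−θ}) / (2 Σ_θ θ²); with window ∆ = 1 this reduces to (c_{t+1} − c_{t−1}) / 2. A sketch with an assumed edge-padding scheme (boundary handling varies between systems):

```python
import numpy as np

# Delta (dynamic) coefficients via the regression formula over a window
# of +/- delta frames; edges are handled by repeating the boundary frame.
def deltas(c, delta=2):
    T = len(c)
    pad = np.pad(c, ((delta, delta), (0, 0)), mode="edge")
    num = sum(th * (pad[delta + th:delta + th + T] - pad[delta - th:delta - th + T])
              for th in range(1, delta + 1))
    return num / (2 * sum(th ** 2 for th in range(1, delta + 1)))

c = np.arange(10, dtype=float)[:, None]   # toy static features with slope 1
d = deltas(c, delta=2)
```

For a linear ramp the interior delta values recover the slope exactly, as the first-derivative interpretation suggests.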

86 | HMM adaptation using vector Taylor series for noisy speech recognition
- Acero, Deng, et al.
Citation Context: ...k* = argmax_k P(k|o_t) (4.17), ŝ_t = o_t + µ̌^(k*) (4.18). SPLICE is not intrinsically tied to stereo data; with a prior clean speech GMM, a corrupted speech GMM may be estimated using VTS compensation [2, 106] and the biases computed from the two GMM. Limiting the update of the feature vector to only a bias form is efficient, however a MLLR-like affine transform would be more accurate as suggested in Deng ...
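The SPLICE-style update in equations (4.17)-(4.18) — pick the most likely GMM component for each observation, then add that component's bias — can be sketched minimally. The GMM parameters and biases below are illustrative stand-ins, not trained SPLICE parameters:

```python
import numpy as np

# Minimal SPLICE-style enhancement sketch: hard component selection
# k* = argmax_k P(k | o_t), then s_hat_t = o_t + bias(k*).
rng = np.random.default_rng(5)
means = np.array([[-2.0], [2.0]])          # corrupted-speech GMM means
var, weights = 1.0, np.array([0.5, 0.5])   # shared variance, equal weights
biases = np.array([[0.5], [-0.5]])         # per-component correction biases
obs = rng.normal(2.0, 1.0, size=(100, 1))  # noisy observations o_t
ll = -0.5 * (obs[:, None, :] - means) ** 2 / var + np.log(weights)[None, :, None]
k_star = ll.sum(axis=2).argmax(axis=1)     # most likely component per frame
s_hat = obs + biases[k_star]               # bias-corrected feature vectors
```

With equal weights and variances the component decision boundary sits midway between the means, so each frame simply picks the nearer component.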

84 | fMPE: Discriminatively trained features for speech recognition
- Povey, Kingsbury, et al.
Citation Context: ...sults, from an uncompensated system, were used during scoring for all the systems described. A more complex system would typically use some form of feature projection scheme such as HLDA [87] or fMPE [116], advanced covariance modelling such as STC [41], and MMI [141] or MPE [115] training of model parameters—the use of such techniques was not investigated in these experiments. ...

82 | Investigation of silicon-auditory models and generalization of linear discriminant analysis for improved speech recognition
- Kumar
- 1997
Citation Context: ...variances are not optimal for ASR. Recent ASR systems [32, 53, 121, 134] have demonstrated the importance of modelling correlations between dimensions by either using full adaptation transforms, HLDA [87], or STC [41] techniques. Thus it is interesting to examine how environmental noise affects these intraframe correlations. Figure 3.5 shows contours of equal probability for full bivariate Gaussian di...

69 | Maximum likelihood estimation for multivariate mixture observations of Markov chains
- Juang, Levinson, et al.
- 1986
Citation Context: ...the ML covariance re-estimate is Σ̂_s^(jm) = Σ_{t=1}^T γ_s,t^(jm) (s_t − µ̂_s^(jm))(s_t − µ̂_s^(jm))^T / Σ_{t=1}^T γ_s,t^(jm) (2.36). Derivations for these solutions can be found in Huang et al. [71, 72]. To compute diagonal variances, as discussed in section 2.3.1, the full covariance is diagonalised: Σ̂_s^(jm) = diag(Σ̂_s,full^(jm)) (2.37). By using a mixture of Gaussians with diagonal covariances, ...

64 | The generation and use of regression class trees for MLLR adaptation - Gales - 1996

63 | Large vocabulary continuous speech recognition using
- Woodland, Odell, et al.
- 1994
Citation Context: ...g C0, plus the first and second differentials. This yields a 39-dimensional feature vector. The WSJ SI284 training data was used to train a clean acoustic model in a similar manner to Woodland et al. [148]. There are 284 speakers from the WSJ0 and WSJ1 corpora yielding 66 hours of speech data. The acoustic models are decision tree clustered state, crossword triphones, with three-emitting states per HMM...

61 | Cluster adaptive training of hidden Markov models - Gales - 2000

60 | An improved approach to the hidden Markov model decomposition of speech and noise
- Gales, Young
- 1992
Citation Context: ...hoice that assumes the sum of two log-normal distributions is approximately log-normal, however it cannot be applied with delta and delta-delta parameters due to the resulting complexity of the forms [50]. Another approximation is the log-add, which may be used to update the component means of the static dimensions: µ_y,i^l(m) = log(exp(µ_x,i^l(m)) + exp(µ_z,i^l)) = µ_x,i^l(m) + log(1 + exp(µ_z,i^l ...
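The log-add approximation in this context combines clean-speech and noise means in the log-spectral domain. A minimal sketch with illustrative mean values (not parameters from any cited system):

```python
import numpy as np

# Log-add combination of static log-spectral means:
# mu_y = log(exp(mu_x) + exp(mu_z)) = mu_x + log(1 + exp(mu_z - mu_x)).
def log_add_means(mu_x, mu_z):
    return mu_x + np.log1p(np.exp(mu_z - mu_x))

mu_x = np.array([2.0, 1.0, 0.0])   # clean speech log-spectral means
mu_z = np.array([0.0, 1.0, 2.0])   # additive noise log-spectral means
mu_y = log_add_means(mu_x, mu_z)
```

The second form with `log1p` is the numerically stable way to evaluate the same quantity, and the result is always at least the larger of the two inputs, reflecting the noise floor.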

59 | Towards understanding spontaneous speech: Word accuracy vs. concept accuracy
- Boros, Eckert, et al.
- 1996
Citation Context: ...racy are sensible evaluation criteria for transcription tasks, for other ASR systems it may not be an optimal guide to performance. For example, dialogue system evaluations may quote concept accuracy [12] or task completion rate. 2.5 Adaptation and Normalisation Despite the amount of data used to train the acoustic models and efforts to produce speaker independent systems, there is still degradation w...

59 | Posterior probability decoding, confidence estimation and system combination
- Evermann, Woodland
Citation Context: ...000 or more words. The number of words is rather arbitrary in the definitions, but they give a sense of task complexity. The optimal word sequence, or sometimes a word lattice [72], confusion network [31], or list of possible transcriptions, is then passed to the application. The application may simply provide transcriptions, where post-processing could be required to add punctuation and capitalisatio...

57 | Speech Synthesis and Recognition
- Holmes, Holmes
- 2001
Citation Context: ...ly used speech parameterisation, its optimality has been questioned [63, 65, 67]. Alternatively, perceptual linear prediction (PLP) coefficients have been used [64] giving similar performance to MFCC [69]. An extensive review of speech signal representations can be found in Huang et al. [72] or Gold and Morgan [57]. 2.2.1 Dynamic Features The set of ...

53 | A Bayesian Estimation Approach for Speech Enhancement Using Hidden Markov Models
- Ephraim
- 1992
Citation Context: ...d noise speech statistics, by using the state rather than global statistics, for use in the enhancement process. Enhancement with auto-regressive, hidden Markov models of speech is studied in Ephraim [28], Logan and Robinson [101], Seymour and Niranjan [128]. As discussed in [29], speech enhancement can be viewed as minimising the average distortion between an estimator of the clean speech vector š_t ...

53 | Analysis and Compensation of Speech under Stress and Noise for Environmental Robustness
- Hansen
- 1996
Citation Context: ..., channel distortions either due to the microphone or network with channel noise added, and finally possible noise at the near end of the speech recognition system. This is summarised in a model from [62] shown in figure 3.1: y(τ) = ((x(τ) + zenv(τ)) ∗ hmic(τ) + zchan(τ)) ∗ hchan(τ) + znear(τ) (3.1). Figure 3.1: Sources of noise and distortion that can e...

52 | Should recognizers have ears
- Hermansky
- 1998
Citation Context: ...sumptions that are tolerable with clean speech, such as the conditional independence of observations, and the lack of explicit duration modelling may result in increased fragility to noise. Hermansky [66] contends that the fragility of ASR in realistic situations is due to excessive attention to spectral structure and poor modelling of the temporal structure of speech signals. A frequent comparison is...

50 | Multi-Style Training for Robust Isolated-Word
- Lippmann, Martin, et al.
- 1987
Citation Context: ...may be applied to remove unwanted, non-linguistic factors, such as speaker differences or the acoustic environment, from being included in the acoustic models [3, 22, 43, 44]. In multistyle training [98] the acoustic model is forced to represent all these factors; a speaker independent model may be considered a multistyle model. Adaptive training instead uses transforms to model the variation from di...

50 | MMIE training of large vocabulary recognition systems. Speech Commun 22(4):303–314
- Valtchev, Odell, et al.
- 1997
Citation Context: ...discriminative training. Discriminative training focuses on estimating model parameters that minimise the error rate. An early form of discriminative training used a maximum mutual information (MMI) [141] criterion. MMI aims to optimise the posterior probability that a model generated a portion of the training utterance—this maximises the mutual information between the training data and the models. Wh...

49 | Robust speech recognition in additive and convolutional noise using parallel model combination
- Gales, Young
- 1995
Citation Context: ...combination (PMC) combines separate noise and speech models to form a corrupted speech model directly for use in the recognition process. It assumes the component posteriors remain unchanged in noise [51]. Therefore only the model component distributions need updating. In non-iterative forms of PMC, each clean speech model component is combined with the noise model via a mismatch function to yield an ...