## Generation and Combination of Complementary Systems for Automatic Speech Recognition (2008)

### BibTeX

```bibtex
@MISC{Breslin08generationand,
  author = {Catherine Breslin},
  title  = {Generation and Combination of Complementary Systems for Automatic Speech Recognition},
  year   = {2008}
}
```

### Abstract

It has been found that using a combination of systems for large vocabulary continuous speech recognition (LVCSR) can outperform the use of a single system. For the combination to yield gains, the individual models must be complementary, i.e. they must make different errors. Previous work in ASR has mainly relied on an ad-hoc approach to finding complementary systems. Multiple systems are built, and those that perform well in combination are selected. The multiple diverse systems can be built in many ways, including the use of different frontends, injecting randomness, altering the model topology or using different training

### Citations

8983 | The nature of statistical learning theory
- Vapnik
- 1995
Citation Context: ...techniques for combining complementary systems using logistic regression as a binary classifier. Logistic regression was used rather than more complicated classifiers such as support vector machines [150], as initial experiments showed it to perform better. Logistic regression is a commonly used classifier for the case when there are two classes. It is a particular form of...

8092 | Maximum likelihood from incomplete data via the EM algorithm
- Dempster, Laird, et al.
- 1977
Citation Context: ...where M̂_k is the current parameter set at iteration k, and M̂_(k+1) is the re-estimated set at iteration k+1. The Baum-Welch algorithm is a form of the Expectation-Maximisation (EM) algorithm [28]. It has two steps, and is described in figure 2.5. The first step, the E-step, computes the auxiliary function, while the second, the M-step, estimates the updated model parameters. The steps are alternated...

2492 | Bagging predictors
- Breiman
- 1996
Citation Context: ...This can be done, for example, by adding noise onto the training data, initialising the parameters randomly, or by selecting random subsets of the training data. The latter method is known as bagging [14]. These are all general methods that can be used regardless of the system being built and the training algorithm. If the specific classifier in question is a decision tree, then another approach to in...
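The bagging procedure described in the snippet, selecting random subsets of the training data by sampling with replacement, can be sketched as follows; `bagging_subsets` is an illustrative helper name, not code from the thesis:

```python
import random

def bagging_subsets(data, n_models, seed=0):
    """Draw one bootstrap sample per ensemble member: each subset is the
    same size as the original data, sampled with replacement, so the
    members see slightly different training sets and make different errors."""
    rng = random.Random(seed)
    return [[rng.choice(data) for _ in data] for _ in range(n_models)]
```

Each subset would then be handed to the same training algorithm to produce one member of the ensemble.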

2308 | A decision-theoretic generalization of online learning and an application to boosting. EuroCOLT - Freund, Schapire - 1995

1631 | Experiments with a new boosting algorithm
- Freund, Schapire
- 1996
Citation Context: ...posterior probability of class membership P(H|o, M^(s)) replaces the hard decision δ(H^(s), H) in the previous section. A standard scheme that makes use of this form of weighted voting is AdaBoost [37], discussed in more detail in section 4.1.2. Majority voting and posterior combination are schemes for selecting the best from a set of hypotheses. Alternatively...

1398 | Random forests
- Breiman
- 2001
Citation Context: ...the tree in a random manner, by randomly choosing a split from the top N, rather than grow the tree by choosing the best split each time. Repeated application of this algorithm builds a random forest [13]. The method works well as the splitting in a decision tree is a locally optimal split, and so very different decision trees can be built by small changes in the algorithm. Injecting randomness has th...
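The randomised split selection described here, choosing a split at random from the top N rather than always taking the best, can be sketched as follows; `choose_split` and its arguments are illustrative, with the scoring of candidate splits assumed to happen elsewhere:

```python
import random

def choose_split(splits, scores, top_n=3, rng=None):
    """Rank candidate splits by score and pick one of the top_n at random.
    Repeating tree growth with this rule yields different, decorrelated
    trees, since each greedy split is only locally optimal anyway."""
    rng = rng or random.Random(0)
    ranked = sorted(zip(splits, scores), key=lambda p: p[1], reverse=True)
    return rng.choice(ranked[:top_n])[0]
```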

1240 | On information and sufficiency
- Kullback, Leibler
- 1951
Citation Context: ...and N(µ_j^(s), Σ_j^(s)) is the distribution of state θ_j in the sth tree. The Kullback-Leibler divergence KL(N^(1), N^(2)) [88] is used as a measure of divergence between Gaussians from the two trees, but any suitable distance metric could be employed. The KL divergence between two Gaussians is KL(N^(1), N^(2)) = ...
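For reference, the KL divergence between two Gaussians (sketched here for the univariate case of the multivariate formula the snippet refers to) is easy to compute directly; the text's symmetrised distance sums both directions, since KL itself is asymmetric:

```python
import math

def kl_gaussian(mu1, var1, mu2, var2):
    """KL(N1 || N2) for univariate Gaussians N(mu, var):
    0.5 * (ln(var2/var1) + (var1 + (mu1-mu2)^2)/var2 - 1)."""
    return 0.5 * (math.log(var2 / var1) + (var1 + (mu1 - mu2) ** 2) / var2 - 1.0)

def symmetric_kl(mu1, var1, mu2, var2):
    """Symmetrised distance KL(N1, N2) + KL(N2, N1), as used for
    comparing state distributions between two decision trees."""
    return kl_gaussian(mu1, var1, mu2, var2) + kl_gaussian(mu2, var2, mu1, var1)
```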

1202 | Binary Codes Capable of Correcting Deletions, Insertions and Reversals
- Levenshtein
- 1966
Citation Context: ...but at the word level rather than the frame level. The alignment of two strings can be done using a dynamic programming algorithm. The Levenshtein, or edit, distance [96] is the number of insertions, substitutions and deletions required to transform one string into another. For example, in figure 2.7, to transform string S1 into string S0 requires one deletion ('a'), ...
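The dynamic-programming edit distance described here can be sketched directly; the same routine works at the word level (as used for hypothesis alignment) by passing lists of words instead of strings:

```python
def levenshtein(s1, s0):
    """Minimum number of insertions, substitutions and deletions needed
    to transform sequence s1 into sequence s0, via the standard
    dynamic-programming recurrence over a (len(s1)+1) x (len(s0)+1) grid."""
    prev = list(range(len(s0) + 1))  # distances from the empty prefix of s1
    for i, a in enumerate(s1, 1):
        cur = [i]
        for j, b in enumerate(s0, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (a != b)))  # substitution or match
        prev = cur
    return prev[-1]
```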

1159 | Information theory, inference and learning algorithms. Cambridge University Press
- MacKay
- 2003
Citation Context: ...approach is to use numerical integration to estimate Z and its gradient, while a second approach is to estimate the gradient using sampling techniques such as Gibbs or Markov chain Monte Carlo sampling [103]. However, these methods can be slow to converge and require many samples. Contrastive divergence training [20, 69] approximates the estimate of the gradient after many MCMC samples with the estimate ...

942 | The EM Algorithm and Extensions - McLachlan, Krishnan - 1996

758 | Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences
- Davis, Mermelstein
- 1980
Citation Context: ...spectrum to boost the energy at higher frequencies. From this short-term spectrum, two popular representations for the speech signal can be extracted. These are mel frequency cepstral coefficients (MFCC) [27] and perceptual linear prediction (PLP) coefficients [66], both described below. Figure 2.2 shows the process of obtaining MFCC [27]. These make use of the mel-scale, given by...

665 | The strength of weak learnability
- Schapire
- 1990
Citation Context: ...to multiclass problems. It takes advantage of the fact that a combination of weak classifiers, i.e. those which perform slightly better than random, can perform as well as a single strong classifier [134]. In its strict definition, boosting defines a training procedure for building a set of weak classifiers, each of which performs slightly better than random, and a final classification for combining th...

587 | Perceptual linear prediction (PLP) analysis of speech
- Hermansky
- 1990
Citation Context: ...short-term spectrum, two popular representations for the speech signal can be extracted. These are mel frequency cepstral coefficients (MFCC) [27] and perceptual linear prediction (PLP) coefficients [66], both described below. Figure 2.2 shows the process of obtaining MFCC [27]. These make use of the mel-scale, given by f_mel = 1127 log(1 + f_Hz / 700). This scale takes account...
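The mel-scale mapping given in the snippet is simple to compute; by construction 1000 Hz maps to roughly 1000 mel, and the scale compresses higher frequencies:

```python
import math

def hz_to_mel(f_hz):
    """Mel scale as given in the text: f_mel = 1127 * ln(1 + f_hz / 700).
    Approximately linear below 1 kHz and logarithmic above."""
    return 1127.0 * math.log(1.0 + f_hz / 700.0)
```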

508 | Training products of experts by minimizing contrastive divergence
- Hinton
- 2002
Citation Context: ...recognition in that expanded space. This is similar to the generalised Viterbi decoding in [151]. An alternative to the mixture model is to use the product of experts framework [45, 69]. Here the likelihoods from the models are multiplied together, effectively forming an intersection of the distributions. Again, considering the simplest case of only two experts and taking the product...

490 | Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov chains
- Gauvain, Lee
- 1994
Citation Context: ...be more compact, i.e. have smaller variances, than the original speaker independent model. This is known as speaker adaptive training (SAT) [3]. Another speaker adaptation approach is MAP adaptation [50], which performs an update of the model parameters, interpolating between a prior estimate of the model parameters and the ML update from the adaptation data. The prior estimate is often the speaker independent...
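The MAP update described here, an interpolation between a prior parameter estimate and the ML estimate from the adaptation data, can be sketched for a single mean; `tau` is the usual prior-weight hyperparameter, and all names here are illustrative rather than the thesis's notation:

```python
def map_mean(prior_mean, ml_mean, n_obs, tau=10.0):
    """MAP-style mean update: with few observations the prior dominates,
    with many observations the ML estimate from the data takes over."""
    return (n_obs * ml_mean + tau * prior_mean) / (n_obs + tau)
```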

489 | Factorial Hidden Markov Models
- Ghahramani, Jordan
- 1998
Citation Context: ...and for an asynchronous combination, the condition to be satisfied is ∫_(R^Td) p(O|M^(1) ··· M^(S)) dO = 1 (3.26), where O = {o_1 ··· o_T}. The factorial HMM [52] is an asynchronous combination of models where the mean of a meta-state is the sum of the means of the individual model states. When each state output distribution is a single Gaussian, the likelihood be...

455 | Objective criteria for the evaluation of clustering methods
- Rand
- 1971
Citation Context: ...s the similarity between the clustering in two decision trees. It is possible to compare the clusterings in decision trees directly using, for example, a cluster similarity measure similar to that in [127]. However, these use many pairwise comparisons between clustered elements and prove expensive in practice when there are many thousands of states, so it is useful to make use of the properties of deci...

432 | An experimental comparison of three methods for constructing ensembles of decision trees
- Dietterich
- 1999
Citation Context: ...ling errors rather than on data which is truly hard to classify. In contrast, bagging and randomisation perform much better in this situation because the randomness overcomes the classification noise [30]. In cases where there are few classifiers, the task is small or the classifiers are simple, it can be possible to simultaneously train an ensemble of co...

416 | Improving generalization with active learning
- Cohn, Atlas, et al.
- 1994
Citation Context: ...could be due to overtraining, outliers, or wrongly labelled data. It is impractical to do an exhaustive search to find the optimal subset of training data, and so recent efforts have used active training [26] to automatically select an optimal subset of data for training. Active training [26] is so called because it alters the learning algorithm from a passive one, which has no control over its input, to on...

406 | Maximum Likelihood Linear Transformations For HMM-Based Speech Recognition
- Gales
- 1998
Citation Context: ...initial speaker independent models. Hence, adaptation techniques must work well with limited data. MLLR is the most common technique for adaptation [42, 46, 95]. MLLR linearly transforms the mean [95] and/or variance [42] of a model to better represent a particular speaker. Transforms are normally tied across a number of model components, using a regression ...

322 | A post-processing system to yield reduced word error rates: Recognizer Output Voting Error Reduction (ROVER)
- Fiscus
- 1997
Citation Context: ...One algorithm which aligns a small number of hypotheses is ROVER, discussed in section 3.3.1.2 below. In this algorithm, the strings are iteratively aligned using the dynamic alignment multiple times [36]. Thus, if there are S systems yielding S hypotheses, the dynamic programming alignment is performed S − 1 times. To align a small number of confusion networks the same process is performed, but the s...
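After the S − 1 alignment passes described here, ROVER's second stage is a per-slot vote over the aligned words. A minimal sketch of that voting stage, with the alignment assumed already done and `''` marking a deletion:

```python
from collections import Counter

def rover_vote(aligned_hyps):
    """Pick the most frequent word in each alignment slot across the S
    aligned hypotheses; an empty string (deletion) that wins the vote
    means no word is output for that slot."""
    output = []
    for slot in zip(*aligned_hyps):
        word, _ = Counter(slot).most_common(1)[0]
        if word:
            output.append(word)
    return output
```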

214 | An Inequality with Applications to Statistical Estimation for Probabilistic Functions of a Markov Process and to a Model for Ecology
- Baum, Eagon
- 1967
Citation Context: ...has matured markedly during this time. This is due in part to the increase in available computing power, and in part to more sophisticated modelling techniques. The introduction of the HMM in the 1970s [8], and a statistical framework for ASR, has proven the most successful approach to date, and is the basis for current state-of-the-art speech recognisers. Initially, the focus of automatic speech recog...

196 | Some statistical issues in the comparison of speech recognition algorithms
- Gillick, Cox
- 1989
Citation Context: ...ng, as this allows the gains from combination to easily be seen. Combination was performed using confusion network combination. For all statistical significance tests, the matched-pairs test was used [54]. Along with the single-pass unadapted framework, decoding was also performed in a multi-pass framework similar to that discussed in section 3.1. This allows for the ...

181 | Semi-tied covariance matrices for hidden Markov models
- Gales
- 1999
Citation Context: ...should have the same covariance. A global HLDA transform which does not reduce the feature dimension is equivalent to a global semi-tied transform [43]. As for LDA, the HLDA transform is applied as in equation 2.12, while the parameters are estimated in an ML fashion [43], according to Â_HLDA = argmax_A { Σ_(t=1)^T Σ_(j=1)^J γ_j(t) ( log |A|^2 − log | ...

179 | Minimum phone error and I-smoothing for improved discriminative training
- Povey, Woodland
- 2002
Citation Context: ...ative training is the tendency to overtrain, and thus generalise poorly to unseen data. For this reason, smoothing to a prior model is often performed as part of the training, to prevent overtraining [124]. Smoothing with a static prior interpolates the parameter estimates with the parameters of a fixed, well-trained model. Smoothing with a dynamic prior interpolates the discriminative parameter estima...

164 | Hidden Markov Model Decomposition of Speech and Noise
- Varga, Moore
- 1990
Citation Context: ...cognition network so that it considers all possible state combinations between the models, and then performs recognition in that expanded space. This is similar to the generalised Viterbi decoding in [151]. An alternative to the mixture model is to use the product of experts framework [45, 69]. Here the likelihoods from the models are multiplied together, effectively forming ...

163 | Maximum Mutual Information Estimation of Hidden Markov Models Parameters for Speech Recognition
- Bahl, Brown, de Souza, et al.
- 1986
Citation Context: ...t to the expected error rate, and allows the objective function in training to be matched with the evaluation criterion. Maximum Mutual Information (MMI) estimation [6, 118, 149] maximises the posterior probability of the correct hypothesis, given a fixed language model. This is equivalent to maximising the mutual information between the models and the acoustic observation se...

145 | A compact model for speaker-adaptive training
- Anastasakos
- 1996
Citation Context: ...e inter-speaker variability. The resulting model set should be more compact, i.e. have smaller variances, than the original speaker independent model. This is known as speaker adaptive training (SAT) [3]. Another speaker adaptation approach is MAP adaptation [50], which performs an update of the model parameters, interpolating between a prior estimate of model parameters and the ML update from the ada...

144 | Products of experts
- Hinton
- 1999
Citation Context: ...a set of hypotheses. Alternatively, individual model distributions can be combined to obtain a single score for each class, for use in further processing stages. Mixtures [11] and products of experts [70] are commonly used for combining scores, where each expert is a probability distribution. The likelihood of an observation given all models, p(o|H, M^(1) ··· M^(S)), can be a weighted sum of likeli...
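The two combination rules contrasted here, a weighted sum of expert likelihoods (mixture) versus a product of them, can be sketched as follows; normalisation of the product is ignored for simplicity:

```python
def mixture_likelihood(likelihoods, weights):
    """Mixture of experts: p(o|H) = sum_s w_s * p(o|H, M_s), a union-like
    combination that assigns mass wherever any expert does."""
    return sum(w * p for w, p in zip(weights, likelihoods))

def product_likelihood(likelihoods):
    """Product of experts (unnormalised): multiplying likelihoods forms an
    intersection-like combination, small unless every expert agrees."""
    out = 1.0
    for p in likelihoods:
        out *= p
    return out
```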

132 | Speech Recognition by Machines and Humans
- Lippmann
- 1997
Citation Context: ...for example around 10% character error rate (CER) on broadcast news Mandarin [48], which is significantly worse than the performance of human transcription on spontaneous speech [97]. ASR technology is now commercially deployed in a variety of applications. For example, Microsoft's Vista operating system allows the user to interact with the computer via speech recognition, while ...

127 | A new ASR approach based on independent processing and recombination of partial frequency bands
- Bourlard, Dupont
Citation Context: ...of the streams of the observation vector are modelled separately. The separate streams can consist of different sources of information, including static and dynamic parameters [163], frequency bands [12, 146] or multiple sources of information such as speech and visual information [145]. The original feature vector may be rewritten as a concatenation of the feature vectors from each of the S streams: o_t = ...

119 | Maximum Likelihood Linear Regression for Speaker Adaptation of Continuous Density HMMs
- Leggetter, Woodland
- 1995
Citation Context: ...initial speaker independent models. Hence, adaptation techniques must work well with limited data. MLLR is the most common technique for adaptation [42, 46, 95]. MLLR linearly transforms the mean [95] and/or variance [42] of a model to better represent a particular speaker. Transforms are normally tied across a number of model components, using a regression ...

116 | Discriminative training for large vocabulary speech recognition
- Povey
- 2003
Citation Context: ...Levenshtein distance, l_lev(W, W_ref), is given by l_lev(W, W_ref) = 0 if W = W_ref, and 1 otherwise, i.e. an insertion, deletion or substitution (2.44). Another popular criterion, minimum phone error (MPE) training [122], fits within the MBR framework. Now the hypotheses H and H_ref are aligned into a set of phone pairs {P^(k), P^(k)_ref}. The loss function is then calculated at the phone level: L(H, H_ref) = Σ_k min l(...

112 | The use of context in large vocabulary speech recognition
- Odell
- 1995
Citation Context: ...of an ASR system, hence reducing the total number of parameters to train. The clustering can be done at many levels, for example at the phone (HMM) level [7], by clustering state output distributions [119] or covariance matrices [84]. [Figure 2...: example phonetic decision tree, with questions such as 'Left unvoiced consonant?', 'Left nasal?' and 'Right context nasal?']

103 | Finding consensus among words: Lattice-based word error minimization
- Mangu, Brill, et al.
- 1999
Citation Context: ...without explicitly having to perform the alignment. An alternative approach is to restrict the evidence space to make the alignment simpler. For example, pinched lattices [32] and confusion networks [105] are more compact representations of multiple hypotheses, and are discussed in more detail in section 2.7. The EM algorithm for optimising the ML criterion u...

101 | Confidence measures for large vocabulary continuous speech recognition
- Wessel, Schlüter, et al.
- 2001
Citation Context: ...entation of the most likely hypotheses, and can easily be used for further processing. One useful step is to convert the acoustic and language model scores in the lattice into posterior probabilities [157]. This can be done using a forward-backward pass, as discussed in section 2.3.2, but at the word level rather than the frame level. The alignment of two strings can be done ...

99 | Large Scale Discriminative Training of Hidden Markov Models for Speech Recognition
- Woodland, Povey
Citation Context: ...Care must be taken to ensure that H_e is a good representation of the most likely competing hypotheses so that the loss function can be calculated accurately. N-best lists [79], lattices [159] or pinched lattices [32] offer a convenient representation. These are discussed in more detail below, in section 2.7. Secondly, the HTK implementation of MPE training [163] optimises the expected pho...

84 | Buckwalter Arabic Morphological Analyzer version 1.0
- Buckwalter
- 2002
Citation Context: ...are missed from the words in transcription, and thus each grapheme has more pronunciations than, say, a word in English. Word pronunciations for Arabic can be generated, often by using a set of rules [19]. This leads to a large number of pronunciations per word: on average 4.3, compared to 1.1 for English. This issue needs to be addressed when building an Arabic system, and suggestions have included g...

81 | On Contrastive Divergence Learning
- Carreira-Perpiñán, Hinton
Citation Context: ...the gradient using sampling techniques such as Gibbs or Markov chain Monte Carlo sampling [103]. However, these methods can be slow to converge and require many samples. Contrastive divergence training [20, 69] approximates the estimate of the gradient after many MCMC samples with the estimate after just one sample. Contrastive divergence can be used for estimating the parameters of products of HMMs [18, 10...

81 | A Word Graph Algorithm for Large Vocabulary Continuous Speech Recognition
- Ortmanns, Ney, et al.
- 1997
Citation Context: ...contains a mapping from words to sub-word units. The system might output a single best hypothesis, a list of the N best hypotheses, or another representation of likely hypotheses such as a word lattice [121]. [Figure 2.1: Speech Recognition System Architecture] The recogniser can only han...

77 | Investigation of silicon auditory models and generalization of linear discriminant analysis for improved speech recognition
- Kumar
- 1997
Citation Context: ...the matrix W^−1 B, where B is the between-class covariance and W is the average of the component variances, or within-class covariance. LDA is commonly used in large vocabulary systems [100, 140]. HLDA [89] is an extension to LDA for heteroscedastic data, which relaxes the assumption of LDA that classes should have the same covariance. A global HLDA transform which does...

75 | FMPE: Discriminatively trained features for speech recognition
- Povey, Kingsbury, et al.
- 2005
Citation Context: ...other discriminative criteria have been optimised using the EBW algorithm discussed in section 2.4.2.4. The MPE criterion has also been used successfully to discriminatively train a feature transform [125] and hence obtain a set of discriminative features. This is known as fMPE. Discriminative criteria typically involve a sum over all possible hypotheses, and so, in practi...

74 | Ensemble Methods
- Dietterich
Citation Context: ...2.4.2.2, can be used to improve the underlying models. Again, sophisticated algorithms can lead to issues with overtraining, and there is a limit to how much improvement they can achieve in practice. [29] suggests three theoretical reasons why an ensemble of classifiers may perform better than just one classifier alone.

70 | Estimating confidence using word lattices
- Kemp, Schaaf
- 1997
Citation Context: ...interpreted as probabilities, they can be mapped to a confidence measure which lies between zero and one. Other measures extracted from a lattice include the hypothesis density and acoustic stability [85]. The former assumes that high confidence regions in the lattice have fewer arcs, while the latter measures the stability of the hypothesis as the weighting between language and acoustic model is chan...

70 | Explicit word error minimization in N-best list rescoring
- Stolcke, König, et al.
- 1997
Citation Context: ...recognition systems is the word error rate (WER). This mismatch leads to suboptimal performance in terms of word error rate, and may be addressed by minimum Bayes' risk decoding [57, 105, 141, 158], which allows the evaluation metric to be included as part of the decoding algorithm. More specifically, MBR decoding aims to find the best word sequence using Ĥ = argmin_(H̃) Σ_H P(H|O, M) L(H, H̃) (2...
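The MBR rule quoted here, the candidate minimising the expected loss Σ_H P(H|O, M) L(H, H̃), can be sketched over an N-best list; with a 0/1 sentence loss it reduces to picking the highest-posterior hypothesis:

```python
def mbr_decode(hypotheses, posteriors, loss):
    """Minimum Bayes' risk decoding over an N-best list: return the
    candidate minimising the expected loss under the posterior over
    hypotheses (evidence space and candidate space are the same list)."""
    def expected_loss(h_tilde):
        return sum(p * loss(h, h_tilde) for h, p in zip(hypotheses, posteriors))
    return min(hypotheses, key=expected_loss)
```

With a word-level loss such as the Levenshtein distance in place of the 0/1 loss, the chosen hypothesis can differ from the MAP one, which is the point of MBR decoding.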

62 | The generation and use of regression class trees for MLLR adaptation
- Gales
- 1996
Citation Context: ...linearly transforms the mean [95] and/or variance [42] of a model to better represent a particular speaker. Transforms are normally tied across a number of model components, using a regression class tree [41]. Regression class trees tie components of HMMs which are close in acoustic space, in order to apply a single transform over a number of components. This has some similarity to decision trees for para...

57 | Minimum Bayes-risk automatic speech recognition. Computer Speech and Language
- Goel, Byrne
- 2000
Citation Context: ...recognition systems is the word error rate (WER). This mismatch leads to suboptimal performance in terms of word error rate, and may be addressed by minimum Bayes' risk decoding [57, 105, 141, 158], which allows the evaluation metric to be included as part of the decoding algorithm. More specifically, MBR decoding aims to find the best word sequence using Ĥ = argmin_(H̃) Σ_H P(H|O, M) L(H, H̃) (2...

55 | Discriminative model combination
- Beyerlein
- 1998
Citation Context: ...combination below, could also be used. Discriminative model combination is a framework for weighted log-linear combination of models in a speech recognition system [9]. This framework generalises the posterior probability to a log-linear distribution. For combination of acoustic model log posteriors at the hypothesis level, the combined posterior becomes P(H|O, M ...

55 | Posterior probability decoding, confidence estimation and system combination
- Evermann, Woodland
- 2000
Citation Context: ...the accuracy of the posterior probabilities obtained from the particular model set. Depending on the task, this bias may need to be accounted for [34, 67]. A fast CN generation algorithm has also been proposed [160]. Sections 2.4.2.2 and 2.6.2 described training and decoding algorithms which require the use of multiple ...

51 | Lightly supervised and unsupervised acoustic model training. Computer Speech and Language
- Lamel, Gauvain, et al.
- 2002
Citation Context: ...ard to obtain a large amount of untranscribed data, particularly for a task like broadcast news. This has led to unsupervised training, where the system is trained on a large amount of unlabelled data [90, 153, 155]. Unsupervised training typically involves automatically transcribing the data and selecting those utterances where the recogniser is confident, to add to the training set. Active and unsupervised trai...