## A Survey of Discriminative and Connectionist Methods for Speech Processing (2002)

### BibTeX

```bibtex
@MISC{Aberdeen02asurvey,
  author = {Douglas Aberdeen},
  title  = {A Survey of Discriminative and Connectionist Methods for Speech Processing},
  year   = {2002}
}
```

### Abstract

Discriminative speech processing techniques attempt to compute the maximum a posteriori probability of some speech event, such as a particular phoneme being spoken, given the observed data. Non-discriminative techniques compute the likelihood of the observed data assuming an event. Non-discriminative methods such as simple HMMs (hidden Markov models) have achieved success despite their lack of discriminative modelling. This survey looks at enhancements to the HMM model which have improved its discrimination ability and hence its overall performance. It also reviews alternative discriminative methods, namely connectionist methods such as ANNs (artificial neural networks). We also draw comparisons between discriminative HMMs and connectionist models, showing that connectionist models can be viewed as a generalisation of discriminative HMMs.

### Citations

900 | An introduction to hidden Markov models
- Rabiner, Juang
- 1986
Citation Context ...articularly those using hybrid approaches may offer significant advantages. Familiarity is assumed with the basics of both ANNs and HMMs. Many introductory texts can be found on these topics, including [11, 39, 28] for HMMs and [32] for ANNs. 2 The Speech Problem: Speech processing can be thought of as the problem of choosing m* = argmax_m P(m|Ou) (1), where Ou = {Ou(1), . . . , Ou(Tu)} is a time sequence of ...
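The MAP decision rule quoted in this context, m* = argmax_m P(m|Ou), can be sketched in a few lines. The model names and posterior values below are invented for illustration:

```python
# Sketch of the MAP decision rule m* = argmax_m P(m | Ou).
# The posteriors below are hypothetical values for three candidate phone models.
posteriors = {"/ae/": 0.21, "/eh/": 0.63, "/ih/": 0.16}

def map_decision(posteriors):
    """Return the model with the highest posterior probability."""
    return max(posteriors, key=posteriors.get)

best = map_decision(posteriors)
print(best)  # -> /eh/
```

In a real recogniser the posteriors would come from the discriminative model itself (e.g. a softmax output layer) rather than a hand-written dictionary.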

483 | Connectionist Speech Recognition: A Hybrid Approach
- Bourlard, Morgan
- 1994
Citation Context ...ow real-time processing. Unfortunately the simplifications deliberately make untrue assumptions about speech [40]. This is not just true of HMMs, since ANN approaches typically make similar assumptions [6]. However, in ANNs we have the ability to relax these assumptions more readily than we do in HMMs. For example, to incorporate dependence on n previous models instead of just 1, we can add O(n) inputs...

221 | Automatic Speech Recognition: The Development of the SPHINX System
- Lee, Mahajan
- 1989
Citation Context ...ty is seen when HMMs move from modelling context-independent phones (61 for the TIMIT corpus) to triphones, where around 5000 models are used even after the unlikely or unhelpful triphones are removed [24, 20]. 3 Discriminative methods for ANN and HMM training: In this section we briefly describe two popular methods for performing discriminative training which can be applied to both ANNs and HMMs. We roughly...

206 | An application of recurrent nets to phone probability estimation
- Robinson
- 1994
Citation Context ...ch output represents a model. Interpreting network outputs as probabilities is explained in Section 4.1. This approach is used in Alphanets [10, 9, 34] and in several RNNs (Recurrent Neural Networks) [45, 55, 12]. Examples of approximating P(Ou|m) with ANNs are rare, since we might expect a single network to share information more efficiently, requiring fewer parameters and consequently less training da...

202 | Discriminative learning for minimum error classification
- Juang, Katagiri
- 1992
Citation Context ...s given in [10]. 3.2 Minimum Classification Error: Minimum Classification Error seeks to minimise exactly what we care about: the empirical error rate. It is introduced and described in a general way in [21], which also compares this criterion to standard error measures such as the mean squared error. A similar measure called Minimum Empirical Error was introduced in [2]. The basic idea is to construct a ...

101 | Review of Neural Networks for Speech Recognition
- Lippmann
- 1989
Citation Context ...ng the overall number of parameters to be trained, and improving the use of training data [49]. • ANNs can relax the Markov assumption by considering multiple frames of data (past and future) at once [27, 6]. It is difficult to do this with HMMs, since it is necessary to minimise the dimensionality of observations and the number of states to allow estimation of the parameters with minimal data. Time derivat...
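The context above describes feeding an ANN multiple frames of data at once. A minimal sketch of that idea is to concatenate each frame with its n past and n future neighbours; the frame values and sizes below are illustrative:

```python
# Sketch: give a frame-level classifier temporal context by concatenating
# n past and n future frames with the current frame.

def stack_frames(frames, n):
    """frames: list of feature vectors (lists). Returns one stacked input
    per frame, padding at the edges by repeating the boundary frame."""
    T = len(frames)
    stacked = []
    for t in range(T):
        window = []
        for k in range(t - n, t + n + 1):
            k = min(max(k, 0), T - 1)  # clamp at utterance boundaries
            window.extend(frames[k])
        stacked.append(window)
    return stacked

frames = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]  # 3 frames of 2 features each
inputs = stack_frames(frames, n=1)
print(len(inputs), len(inputs[0]))  # 3 stacked inputs, each 3*2 = 6 values
```

The stacked vectors grow linearly with the context width, which is exactly the O(n)-extra-inputs cost the survey mentions for relaxing the Markov assumption in ANNs.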

95 | Large-vocabulary speaker-independent continuous speech recognition: The SPHINX system
- Lee
- 1988
Citation Context ...el probabilities. ANNs usually assume a static pattern; however, speech consists of a possibly continuous stream of data broken down into frames of around 10 ms, each with tens to hundreds of features [23]. The key difference between the various connectionist and hybrid inspired approaches is how they deal with the time-varying nature of speech. The natural way HMMs handle time-varying signals is a stro...

85 | Modular Construction of Time-Delay Neural Networks for Speech Recognition
- Waibel
- 1989
Citation Context ...class over a finite period of time. They are trained using a modified form of error back-propagation. Good results were obtained for classifying plosive consonants using TDNNs compared to standard ANNs [53]. They have also been used to approximately determine phone labels to use as discrete HMM symbols in [29]. This system recognised Dutch digits, discriminating between 21 phonemes. Results improved fro...

70 | Global optimization of a neural network–Hidden Markov Model hybrid
- Bengio, Mori, et al.
- 1992
Citation Context ...he training data; for example, the task may involve a restricted set of words, altering the distribution of phones. 5.2 Global Optimization of ANN/HMM Hybrids: An alternative ANN/HMM approach taken by [4] views the ANN as mapping a high-dimensionality set of frame data into a small set of continuous observations to be input to an HMM that estimates observation probabilities using a mixture of Gaussian...

68 | An overview of speaker recognition technology
- Furui
- 1994
Citation Context ...rs from (7) only in whether the correct model m* is included in the summation and the assumption of uniform priors P(mi). MCE is also very similar to the idea of distance normalization discussed in [14]. 3.2.1 Gradient Descent for MCE: In practice it seems more common to use the log form of (9) [25, 43], which results in the following gradient for l(d*(Ou)) with respect to an arbitrary set of param...

62 | Learning complex, extended sequences using the principle of history compression
- Schmidhuber
- 1992
Citation Context ...effect on the classification being made. Unfortunately this method limits the amount of past context the network can be trained to consider to N frames. More complex methods exist to avoid this limitation [45, 47]. In [45] an RNN was used to classify the 61-phone TIMIT database. The network was trained using the cross-entropy criterion as described in Section 4.1. This allowed the 61 outputs to be interpreted ...

53 | Feedforward Neural Network Methodology - Fine - 1999

41 | REMAP: Recursive estimation and maximization of a posteriori probabilities in connectionist speech recognition
- Bourlard, Konig, et al.
- 1995
Citation Context ...on over the observations P(Ou(t)|i). Section 4 noted that an advantage of ANNs over HMMs is their ability to model an arbitrary distribution, non-linear in the inputs. A large body of work including [7, 5, 6, 31] is devoted to this idea. (Figure 4: Using an ANN to generate HMM observation likelihoods.) Essentially the techniques of Section 4.1 are applied to estimate observation likelihoods p(Ou(t)|i), and t...

41 | High-performance connected digit recognition using maximum mutual information estimation
- Normandin, Cardin, et al.
- 1994
Citation Context ...into an HMM update gradient, or HMM training can be run as normal and then gradient descent on the discriminative objective function can be performed as corrective training [36, 19]. Alternatively, [43, 37, 16] discuss methods which extend the Baum-Welch updates to rational objective functions, which are applicable to the objective functions outlined here. All of these methods require the derivative of the ...

34 | Hybrid HMM/ANN Systems for Speech Recognition: Overview and New Research Directions
- Bourlard, Morgan
- 1998
Citation Context ...the face of excellent empirical results from purely HMM approaches. At the current time there seems to be little interest in pure ANN approaches; however, there is interest in hybrid ANN/HMM approaches [50, 17, 7]. In this survey we briefly present basic approaches to discriminative training techniques for HMMs and ANNs. We also compare these approaches, finding strong similarities between them. Then we look at m...

33 | Alpha-nets: a recurrent "neural" network architecture with a hidden Markov model interpretation
- Bridle
- 1990
Citation Context ...ch looks like the computation performed in a linear node of an ANN, except that the weights are dependent on the current observation. 4.5 Alphanets: The concepts of the previous section are extended by [8, 10], resulting in Alphanets, and the work of [34]. Both reach the conclusion that HMMs can be cast exactly as an RNN if we allow multiplication and division units as well as the standard summation units...

30 | Neural network classifiers estimate Bayesian a posteriori probabilities
- Richard, Lippmann
- 1991
Citation Context ...wing simple cost function for the observation Ou(t): J = −log P(m*_t|Ou(t), θ) (15), with gradient ∂J/∂P(m*_t|Ou(t), θ) = −1/P(m*_t|Ou(t), θ). This is the equation for the Normalized-Likelihood cost function [44]. It simply measures the log probability of utterance Ou assuming we know (or can estimate) the correct model m*_t. Minimising this quantity will maximise the posterior probability of the correct mod...
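The Normalized-Likelihood cost (15) in the context above, J = −log P(m*|Ou(t), θ), has the simple gradient ∂J/∂P = −1/P. A small sketch, with an illustrative probability value and a finite-difference check of the analytic gradient:

```python
import math

# Sketch of the Normalized-Likelihood cost J = -log P(m*|Ou(t))
# and its gradient with respect to the correct-model posterior.
def nl_cost(p_correct):
    return -math.log(p_correct)

def nl_grad(p_correct):
    # dJ/dP(m*|Ou(t)) = -1 / P(m*|Ou(t))
    return -1.0 / p_correct

p = 0.8                              # hypothetical posterior estimate
print(nl_cost(p))                    # ~0.223

# Finite-difference check of the analytic gradient (illustrative step size).
eps = 1e-6
fd = (nl_cost(p + eps) - nl_cost(p - eps)) / (2 * eps)
print(abs(fd - nl_grad(p)) < 1e-6)   # True
```

As the survey notes, minimising J pushes the posterior of the correct model towards 1; the gradient blows up as P approaches 0, penalising confidently wrong estimates heavily.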

25 | Hidden Markov models: A guided tour
- Poritz
- 1988
Citation Context ...articularly those using hybrid approaches may offer significant advantages. Familiarity is assumed with the basics of both ANNs and HMMs. Many introductory texts can be found on these topics, including [11, 39, 28] for HMMs and [32] for ANNs. 2 The Speech Problem: Speech processing can be thought of as the problem of choosing m* = argmax_m P(m|Ou) (1), where Ou = {Ou(1), . . . , Ou(Tu)} is a time sequence of ...

23 | A Generalization of the Baum Algorithm to Rational Objective Functions
- Gopalakrishnan, Kanevsky, et al.
- 1989
Citation Context ...into an HMM update gradient, or HMM training can be run as normal and then gradient descent on the discriminative objective function can be performed as corrective training [36, 19]. Alternatively, [43, 37, 16] discuss methods which extend the Baum-Welch updates to rational objective functions, which are applicable to the objective functions outlined here. All of these methods require the derivative of the ...

22 | Estimation of hidden Markov model parameters by minimizing empirical error rate
- Ljolje, Ephraim, et al.
- 1990
Citation Context ...and described in a general way in [21], which also compares this criterion to standard error measures such as the mean squared error. A similar measure called Minimum Empirical Error was introduced in [2]. The basic idea is to construct a distance measure between the probability of the correct choice and the probability of all other choices: d*(Ou) = P(Ou|m*) − [(1/(M−1)) Σ_{mi≠m*} P(Ou|mi)^η]^(1/η) ...
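The MCE distance measure quoted above compares the correct model's likelihood with a soft maximum over its competitors. The quoted formula is truncated, so the placement of the outer 1/η exponent is an assumption here (an η-norm average of the competing likelihoods, which tends to the best competitor as η grows):

```python
# Sketch of an MCE-style distance measure: correct-model likelihood minus
# an eta-norm soft maximum of the competitors. The exact form of the outer
# exponent is an assumption, since the quoted formula is truncated.
def mce_distance(lik, correct, eta=2.0):
    """lik: dict model -> P(Ou|model); correct: key of the true model."""
    others = [p ** eta for m, p in lik.items() if m != correct]
    soft_max = (sum(others) / len(others)) ** (1.0 / eta)
    return lik[correct] - soft_max

lik = {"m1": 0.7, "m2": 0.2, "m3": 0.1}  # hypothetical likelihoods
print(mce_distance(lik, "m1") > 0)       # True: the correct model wins
```

A positive distance corresponds to a correct classification; MCE training then passes this distance through a smooth loss so that the empirical error rate can be minimised by gradient descent.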

22 | Combining hidden Markov models and neural network classifiers
- Niles, Silverman
- 1990
Citation Context ...ut class [30], which is inherently discriminative. Add to this the fact that several authors have shown that it is possible to specify an ANN architecture exactly equivalent to discriminative HMM training [8, 34, 56], and discriminative techniques begin to look, in theory, synonymous with connectionist approaches. ANNs were studied intensively for speech processing in the late 1980s and early 1990s before losing po...

20 | TIMIT Acoustic-Phonetic Continuous Speech Corpus, Linguistic Data Consortium
- Garofolo, et al.
- 1993
Citation Context ...determined using the Viterbi algorithm. 3.3 Results Comparison: Where possible we have provided comparative experimental results for the methods described in this survey, mostly on the TIMIT database [15]. However, due to factors such as varying definitions of accuracy and the varying levels of problem difficulty, the results should not be compared across different sections. A German speech database was u...

14 | On supervised learning from sequential data with applications for speech recognition - Schuster - 1999

13 | An Alphanet Approach to Optimising Input Transformations for Continuous Speech Recognition - Bridle, Doddi - 1991 |

13 | Hidden Markov Models, Maximum Mutual Information, and the Speech Recognition Problem
- Normandin
- 1991
Citation Context ...d here can be incorporated into an HMM update gradient, or HMM training can be run as normal and then gradient descent on the discriminative objective function can be performed as corrective training [36, 19]. Alternatively, [43, 37, 16] discuss methods which extend the Baum-Welch updates to rational objective functions, which are applicable to the objective functions outlined here. All of these methods ...

13 | Discriminative Training for Continuous Speech Recognition
- Reichl, Ruske
- 1995
Citation Context ...HMM training: In this section we briefly describe two popular methods for performing discriminative training which can be applied to both ANNs and HMMs. We roughly follow the notation and structure of [43], which presents both methods in a consistent framework. 3.1 Maximum Mutual Information: The basic idea of MMI estimation is to maximise the extent to which knowing the data helps us to know which model...

13 | Improved neural Network Training of Inter-Word Context Units for Connected Digit recognition
- van Vuuren
- 1998
Citation Context ...ch output represents a model. Interpreting network outputs as probabilities is explained in Section 4.1. This approach is used in Alphanets [10, 9, 34] and in several RNNs (Recurrent Neural Networks) [45, 55, 12]. Examples of approximating P(Ou|m) with ANNs are rare, since we might expect a single network to share information more efficiently, requiring fewer parameters and consequently less training da...

9 | Shared Distribution Hidden Markov Models for Speech Recognition
- Hwang, Huang
- 1993
Citation Context ...ty is seen when HMMs move from modelling context-independent phones (61 for the TIMIT corpus) to triphones, where around 5000 models are used even after the unlikely or unhelpful triphones are removed [24, 20]. 3 Discriminative methods for ANN and HMM training: In this section we briefly describe two popular methods for performing discriminative training which can be applied to both ANNs and HMMs. We roughly...

7 | Connectionist Speaker Normalization and Its Applications to Speech Recognition
- Huang, Lee, et al.
- 1991
Citation Context ...ate symbols for discrete HMMs [22, 3]. • Pre-processing: ANNs can perform arbitrary non-linear transformations of the input. This can perform tasks such as removing noise, or adapting to a new speaker [18, 42, 57]. • Hierarchical Mixtures of Experts: Various expert classifiers, including those discussed already, can be combined through the use of a hierarchy of gating networks [49, 41] trained with the EM algorith...

7 | Psycho-acoustics and speech perception
- Pols
Citation Context ...incorporated into HMM features to provide context; however, the derivatives contain less information than the complete frames. • ANNs can consider categorical inputs, encoding psycho-acoustic features [38] and features from many sources at once, such as visual cues [7]. • ANNs can model arbitrary state durations, unlike HMMs in which durations follow an exponential model. This is important for normalis...

6 | The gradient projection method for the training of hidden Markov models
- Huo, Chan
- 1993
Citation Context ...taken to maintain stochastic constraints. For example, the transition probabilities out of a state must sum to one. This can be achieved by mapping parameters in R to probabilities [34, 8]. In [19] it is pointed out that this method may introduce extra local maxima, which is undesirable since gradient methods only guarantee convergence to one of these local maxima. Alternatively, Lagrange multi...

6 | The CMU Sphinx-3 English broadcast news transcription system
- Seymore, Stanley, et al.
- 1998
Citation Context ...p sizes. On the TIMIT database with 39 phones, [12] demonstrates an RNN (recurrent neural network) system trained with MMI with a frame-by-frame accuracy of 75.1%. This is compared with the CMU Sphinx [51] HMM system, which achieved 73.8%. On a Cantonese digit test set MCE improved results from 82.9% to 90.0%. This system used a small RNN for each digit. The same system applied to English digits resulte...

5 | The Adaptive Time-Delay Neural Network: Characterization and Application to Pattern Recognition, Prediction and Signal Processing
- Lin
- 1994
Citation Context ...discriminating between 21 phonemes. Results improved from 90% to 93% over a HMM with 200 discrete symbols. A drawback of TDNNs is the fixed amount of memory for each node. This is somewhat rectified in [26], where TDNNs are extended to automatically adapt the value of Tl. TDNNs are further reviewed in [27, 17, 9]. 4.3 Recurrent Neural Networks: RNNs avoid the main problem of TDNNs by allowing all previous...

4 | 92¢/MFlop/s, Ultra-Large-Scale Neural-Network Training on a PIII Cluster
- Aberdeen, Baxter, et al.
- 2000
Citation Context ...ave several hundred inputs, including frames for context, 500 to 4000 hidden units, and 61 outputs, requiring on the order of 10^6 parameters [31]. Training such networks provides interesting challenges [1]. Once such a network has been trained, some form of search is needed to compute the most likely phone sequence. (Figure 1: A Time-Delay Neural Network with 7 inputs and 2 frames of memory, 5 hidden ...

4 | Speech recognition using hidden Markov models
- 1990
Citation Context ...articularly those using hybrid approaches may offer significant advantages. Familiarity is assumed with the basics of both ANNs and HMMs. Many introductory texts can be found on these topics, including [11, 39, 28] for HMMs and [32] for ANNs. 2 The Speech Problem: Speech processing can be thought of as the problem of choosing m* = argmax_m P(m|Ou) (1), where Ou = {Ou(1), . . . , Ou(Tu)} is a time sequence of ...

4 | Training mixture density HMMs with SOM and LVQ
- Kurimo
- 1997
Citation Context ...ng units [12]. • Vector Quantization: Learning Vector Quantizers based on Self-Organizing Feature Maps and other ANN approaches can be used to process observations to generate symbols for discrete HMMs [22, 3]. • Pre-processing: ANNs can perform arbitrary non-linear transformations of the input. This can perform tasks such as removing noise, or adapting to a new speaker [18, 42, 57]. • Hierarchical Mixtures...

4 | TDNN labeling for a HMM recognizer
- Ma, Van Compernolle
- 1990
Citation Context ...results were obtained for classifying plosive consonants using TDNNs compared to standard ANNs [53]. They have also been used to approximately determine phone labels to use as discrete HMM symbols in [29]. This system recognised Dutch digits, discriminating between 21 phonemes. Results improved from 90% to 93% over a HMM with 200 discrete symbols. A drawback of TDNNs is the fixed amount of memory for ea...

4 | Maximum likelihood criterion in language modeling
- Ney, Martin
- 1999
Citation Context ...o enumerate; however, some of the better descriptions are found in [43, 40, 54]. MMI techniques can be applied to the language modelling phase of speech systems as well as the low-level signal models [33]. In information theory, mutual information is defined as I(X, Y) = H(X) − H(X|Y) (5), where H(X) = −Σ_{x∈X} P(x) log P(x) is the entropy of the discrete random variable X. Another interpretati...
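Equation (5) in the context above, I(X;Y) = H(X) − H(X|Y), can be computed directly from a small joint distribution. The toy distribution below (two perfectly correlated binary variables) is invented for illustration:

```python
import math

# Sketch of equation (5): I(X;Y) = H(X) - H(X|Y), computed from a joint
# distribution given as a dict {(x, y): p}.
def entropy(dist):
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def mutual_information(joint):
    xs = {x for x, _ in joint}
    ys = {y for _, y in joint}
    px = {x: sum(joint[(x, y)] for y in ys if (x, y) in joint) for x in xs}
    py = {y: sum(joint[(x, y)] for x in xs if (x, y) in joint) for y in ys}
    # H(X|Y) = -sum_{x,y} p(x,y) log2 p(x|y), with p(x|y) = p(x,y)/p(y)
    h_x_given_y = -sum(p * math.log2(p / py[y])
                       for (x, y), p in joint.items() if p > 0)
    return entropy(px) - h_x_given_y

# Perfectly correlated binary variables: knowing Y tells us X exactly,
# so I(X;Y) = H(X) = 1 bit.
joint = {(0, 0): 0.5, (1, 1): 0.5}
print(mutual_information(joint))  # 1.0
```

MMI training maximises this quantity between the observation sequence and the model identity, so that the data is maximally informative about which model produced it.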

3 | Auditory models with Kohonen SOFM and LVQ for speaker-independent phoneme recognition
- Anderson
- 1994
Citation Context ...ng units [12]. • Vector Quantization: Learning Vector Quantizers based on Self-Organizing Feature Maps and other ANN approaches can be used to process observations to generate symbols for discrete HMMs [22, 3]. • Pre-processing: ANNs can perform arbitrary non-linear transformations of the input. This can perform tasks such as removing noise, or adapting to a new speaker [18, 42, 57]. • Hierarchical Mixtures...

3 | Speech Recognition and Understanding: Recent Advances, chapter "Neural Networks or Hidden Markov Models for Automatic Speech Recognition: Is There a Choice?"
- Bridle
- 1992
Citation Context ...Thus maximising mutual information can be re-cast as minimising cross-entropy, which can be thought of as minimising the difference between the distribution of the data, and the data given the model [9]. 3.1.1 Gradient Descent for MMI: Suppose we have some parameterised approximator (or possibly an approximator for each model) which computes P(Ou|m, θ), where θ represents the parameters of the system...

3 | Speech Recognition and Understanding: Recent Advances, chapter "Neural Networks for Continuous Speech Recognition"
- Fallside
- 1992
Citation Context ...ch output represents a model. Interpreting network outputs as probabilities is explained in Section 4.1. This approach is used in Alphanets [10, 9, 34] and in several RNNs (Recurrent Neural Networks) [45, 55, 12]. Examples of approximating P(Ou|m) with ANNs are rare, since we might expect a single network to share information more efficiently, requiring fewer parameters and consequently less training da...

3 | Connectionist and hybrid models for automatic speech recognition
- Haton
- 1997
Citation Context ...the face of excellent empirical results from purely HMM approaches. At the current time there seems to be little interest in pure ANN approaches; however, there is interest in hybrid ANN/HMM approaches [50, 17, 7]. In this survey we briefly present basic approaches to discriminative training techniques for HMMs and ANNs. We also compare these approaches, finding strong similarities between them. Then we look at m...

3 | Big dumb neural nets: a working brute force approach to speech recognition
- Morgan
- 1994
Citation Context ...as probabilities. More specifically, how would we construct a network to compute the posterior probabilities P(mi|Ou)? A standard method for doing this is to use a softmax distribution at the output [45, 31]. Assume that the network is learning to estimate the MAP probability P(mi|Ou); then for each possible model m1, . . . , mM we define P(mi|Ou) = exp(yi) / Σ_{j=1..M} exp(yj). Given an arbitrary cost functio...
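The softmax output layer described in the context above maps raw network outputs y_i to posteriors P(mi|Ou) = exp(yi) / Σ_j exp(yj). A minimal sketch, with hypothetical output activations:

```python
import math

# Sketch of a softmax output layer: raw network outputs y_i are mapped to
# posteriors P(m_i|Ou) = exp(y_i) / sum_j exp(y_j).
def softmax(ys):
    m = max(ys)                          # subtract max for numerical stability
    exps = [math.exp(y - m) for y in ys]
    z = sum(exps)
    return [e / z for e in exps]

ys = [2.0, 1.0, 0.1]                     # hypothetical output activations
ps = softmax(ys)
print(sum(ps))                           # ~1.0: a valid distribution
```

Because the outputs sum to one, they can be interpreted as a posterior distribution over the models, which is what makes a softmax network inherently discriminative.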

3 | Neural Networks: An Introduction (Physics of Neural Networks)
- Müller, Reinhardt, et al.
- 1995
Citation Context ...brid approaches may offer significant advantages. Familiarity is assumed with the basics of both ANNs and HMMs. Many introductory texts can be found on these topics, including [11, 39, 28] for HMMs and [32] for ANNs. 2 The Speech Problem: Speech processing can be thought of as the problem of choosing m* = argmax_m P(m|Ou) (1), where Ou = {Ou(1), . . . , Ou(Tu)} is a time sequence of speech frames asso...

3 | An adaptive gradient-search based algorithm for discriminative training of HMMs
- Nogueiras-Rodríguez, Mariño, et al.
- 1998
Citation Context ...approach of training a single large network to approximate all the probabilities (see Section 4.1). In [25] a single network is trained for each P(Ou|m, θ) (see Section 4.3.1). 3.2.2 MCE for HMMs: In [43, 35] a gradient descent version of MCE estimation is used. Denoting the state of HMM m occupied at time t as q^m_t, the gradient for the state-specific observation densities is ∂l(d*(Ou))/∂P(Ou|j, m) = l(...

3 | The EM algorithm
- Russel
- 1998
Citation Context ...• Hierarchical Mixtures of Experts: Various expert classifiers, including those discussed already, can be combined through the use of a hierarchy of gating networks [49, 41] trained with the EM algorithm [46]. • Predictive Networks: ANNs can be used to predict extra features. For example, they can be trained as autoregressive models given previous observations and the current state [7]. • Language Modellin...

3 | Bi-directional recurrent neural networks for speech recognition
- Schuster
- 1996
Citation Context ...+ c, where c is the number of future frames to consider. An alternative is to extend RNNs to allow all frames, past and present, to be considered. This architecture is called the Bi-Directional RNN [48, 49, 50]. BRNNs have two sets of state vectors, one for the forward time direction and one for the reverse time direction. At time t separate hidden layers compute the next forward and backward state vectors...

3 | Encyclopedia of Electrical and Electronics Engineering, chapter "Neural networks for speech processing"
- Schuster
- 1998
Citation Context ...trained properly, ANNs can directly estimate the discriminative MAP P(m|Ou) criterion (see Section 4.1). • ANN systems can be 2 to 5 times faster than traditional techniques for equivalent performance [49]. • A single ANN can be trained to do the same job as multiple HMMs, decreasing the overall number of parameters to be trained, and improving the use of training data [49]. • ANNs can relax the Markov...

3 | A hybrid ANN-HMM ASR system with NN-based adaptive preprocessing
- Warakagoda
- 1996
Citation Context ...by the class probabilities: P(Ou|m*) / Σ_{i=1..M} P(mi)P(Ou|mi). MMI estimation methods are discussed and applied in too many papers to enumerate; however, some of the better descriptions are found in [43, 40, 54]. MMI techniques can be applied to the language modelling phase of speech systems as well as the low-level signal models [33]. In information theory mutual information is defined as I(X, Y) = H(X) −...
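The quantity quoted above, the correct model's likelihood normalised by the prior-weighted likelihoods of all models, can be sketched directly. The likelihoods and priors below are invented for illustration:

```python
# Sketch of the MMI-style criterion quoted above:
# P(Ou|m*) / sum_i P(m_i) P(Ou|m_i).
def mmi_criterion(lik, priors, correct):
    """lik: dict model -> P(Ou|model); priors: dict model -> P(model)."""
    denom = sum(priors[m] * lik[m] for m in lik)
    return lik[correct] / denom

lik = {"m1": 0.6, "m2": 0.3, "m3": 0.1}          # hypothetical likelihoods
priors = {"m1": 1/3, "m2": 1/3, "m3": 1/3}       # uniform priors
print(mmi_criterion(lik, priors, "m1"))           # ~1.8
```

Maximising this ratio raises the correct model's likelihood while suppressing the competitors in the denominator, which is what makes the MMI criterion discriminative rather than purely maximum-likelihood.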

3 | Digit-specific feature extraction for multi-speaker isolated digit recognition using neural networks
- Zhang, Millar
- 1994
Citation Context ...ate symbols for discrete HMMs [22, 3]. • Pre-processing: ANNs can perform arbitrary non-linear transformations of the input. This can perform tasks such as removing noise, or adapting to a new speaker [18, 42, 57]. • Hierarchical Mixtures of Experts: Various expert classifiers, including those discussed already, can be combined through the use of a hierarchy of gating networks [49, 41] trained with the EM algorith...