## Probability Estimation By Feed-Forward Networks In Continuous Speech Recognition (1991)

Venue: Proceedings of the IEEE Workshop on Neural Networks for Signal Processing

Citations: 7 (3 self)

### BibTeX

@INPROCEEDINGS{Renals91probabilityestimation,
  author    = {Steve Renals and Nelson Morgan},
  title     = {Probability Estimation By Feed-Forward Networks In Continuous Speech Recognition},
  booktitle = {Proceedings of the IEEE Workshop on Neural Networks for Signal Processing},
  year      = {1991},
  pages     = {309--318}
}

### Abstract

We review the use of feed-forward networks as estimators of probability densities in hidden Markov modelling. In this paper we are mostly concerned with radial basis function (RBF) networks. We note the isomorphism of RBF networks to tied mixture density estimators; additionally we note that RBF networks are trained to estimate posteriors rather than the likelihoods estimated by tied mixture density estimators. We show how the neural network training should be modified to resolve this mismatch. We also discuss problems with discriminative training, particularly the problem of dealing with unlabelled training data and the mismatch between model and data priors.

From the paper's introduction: In continuous speech recognition we wish to estimate P(W_1^W | X_1^T, Θ), the posterior probability of a word sequence W_1^W = w_1, ..., w_W given the acoustic evidence X_1^T = x_1, ..., x_T and the parameters of the models used, Θ. This probability canno...
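The introduction breaks off mid-sentence; the standard way this posterior is handled in the HMM literature (not quoted from the paper itself) is the Bayes decomposition:

```latex
P(W_1^W \mid X_1^T, \Theta)
  = \frac{p(X_1^T \mid W_1^W, \Theta)\,P(W_1^W \mid \Theta)}{p(X_1^T \mid \Theta)}
```

Since the denominator does not depend on the word sequence, recognition maximises the numerator: an acoustic-model likelihood times a language-model prior.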

### Citations

436 |
Multivariable functional interpolation and adaptive networks
- Broomhead, Lowe
- 1988

Citation Context: ...t density N_j(x | μ_j, Σ_j) contributes to output PDF f_k(x | q_k, Θ). RADIAL BASIS FUNCTIONS: The radial basis function (RBF) network was originally introduced as a means of function interpolation [16, 10]. A set of K approximating functions f_k(x) is constructed from a set of J basis functions φ_j(x): f_k(x) = Σ_{j=1}^{J} a_kj φ_j(x), 1 ≤ k ≤ K (3). This equation defines an RBF network with J RBFs (hidden unit...
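Equation (3) in this excerpt is a plain linear combination of basis-function activations. A minimal sketch of that forward pass, with my own toy dimensions and a Gaussian choice of basis function (the paper's excerpt does not fix either):

```python
import numpy as np

def gaussian_basis(x, centres, widths):
    """phi_j(x) = exp(-||x - mu_j||^2 / (2 sigma_j^2)) for each of the J centres."""
    d2 = ((x[None, :] - centres) ** 2).sum(axis=1)   # squared distances, shape (J,)
    return np.exp(-d2 / (2.0 * widths ** 2))

def rbf_forward(x, centres, widths, A):
    """Equation (3): f_k(x) = sum_j a_kj phi_j(x), for 1 <= k <= K."""
    phi = gaussian_basis(x, centres, widths)         # hidden activations, shape (J,)
    return A @ phi                                   # K outputs

# Toy setup (illustrative only): J = 4 hidden basis functions, K = 3 outputs, 2-d input.
rng = np.random.default_rng(0)
centres = rng.normal(size=(4, 2))
widths = np.ones(4)
A = rng.normal(size=(3, 4))
f = rbf_forward(np.array([0.5, -0.5]), centres, widths, A)
```

Only the output weights `A` enter linearly; the centres and widths are what the hybrid-training discussion in the paper is about adapting.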

391 |
A maximum likelihood approach to continuous speech recognition
- Bahl, Jelinek, et al.
- 1983

Citation Context: ...also be specified. The transition probabilities and the parameters of the output PDFs are frequently estimated using a maximum likelihood training procedure, the forward-backward algorithm (see e.g. [2]). This procedure is optimal if the true model is in the space of models being searched¹. However, this is not the case for speech recognition. What is desired is not the best possible model of each...

265 |
Radial basis functions for multivariable interpolation: a review
- Powell
- 1987

Citation Context: ...t density N_j(x | μ_j, Σ_j) contributes to output PDF f_k(x | q_k, Θ). RADIAL BASIS FUNCTIONS: The radial basis function (RBF) network was originally introduced as a means of function interpolation [16, 10]. A set of K approximating functions f_k(x) is constructed from a set of J basis functions φ_j(x): f_k(x) = Σ_{j=1}^{J} a_kj φ_j(x), 1 ≤ k ≤ K (3). This equation defines an RBF network with J RBFs (hidden unit...

165 |
Maximum mutual information estimation of hidden Markov model parameters for speech recognition
- Bahl, Brown, et al.
- 1986

Citation Context: ...ble to attempt a global optimisation in which all the parameters of the HMM are optimised simultaneously according to some discriminative criterion. Such an approach was first proposed by Bahl et al. [1], who presented a training scheme for continuous HMMs in which the mutual information between the acoustic evidence and the word sequence was maximised using gradient descent. More recently, Bridle int...

99 |
Links between Markov models and multilayer perceptrons
- Bourlard, Wellekens
- 1990

Citation Context: ...be shown that a "1-from-n" classifier trained using a relative entropy (or a least mean squares) objective function outputs the posterior probabilities P(q_l | x) of each class given the input data [6]. However, the likelihoods P(x | q_l) are required; the prior probabilities p(q_l) are given by the allowable sentence models constructed from the basic HMMs using a phone-structured lexicon and the ...
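The conversion this excerpt describes follows from Bayes' rule: P(x | q_l) = P(q_l | x) p(x) / p(q_l), and since p(x) is the same for every class, dividing the network's posteriors by the class priors gives likelihoods up to a constant, which is all a Viterbi decoder needs. A sketch with hypothetical numbers (the arrays below are invented, not from the paper):

```python
import numpy as np

# Hypothetical network outputs: posteriors P(q_l | x) for L = 3 classes (sum to 1),
# and class priors p(q_l), e.g. relative state frequencies in the training data.
posteriors = np.array([0.7, 0.2, 0.1])
priors = np.array([0.5, 0.3, 0.2])

# Scaled likelihoods P(x | q_l) / p(x): the unknown common factor p(x) cancels
# when states are compared, so only the ratio posterior / prior matters.
scaled_likelihoods = posteriors / priors
best = int(np.argmax(scaled_likelihoods))
```

Note the argmax over scaled likelihoods can differ from the argmax over raw posteriors whenever the priors are non-uniform, which is exactly the model/data prior mismatch the paper discusses.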

98 |
Training stochastic model recognition algorithms as networks can lead to maximum mutual information estimates of parameters
- Bridle
- 1989

Citation Context: ...ed output class (HMM distribution). Bridle has demonstrated that minimising this error function is equivalent to maximising the mutual information between the acoustic evidence and the HMM state sequence [9]. If we wish to interpret the weights as mixture coefficients, then we must ensure that they are non-negative and sum to 1. This may be achieved using a normalised exponential (softmax) transformation...
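The normalised exponential mentioned at the end of this excerpt is a one-liner; a sketch (the max-subtraction is a standard numerical-stability detail, not something the excerpt specifies):

```python
import numpy as np

def softmax(z):
    """Normalised exponential: maps unconstrained weights to values that are
    non-negative and sum to 1, as required of mixture coefficients."""
    e = np.exp(z - z.max())   # subtracting the max leaves the result unchanged
    return e / e.sum()

w = softmax(np.array([2.0, -1.0, 0.5]))
```

Reparameterising the mixture weights this way lets gradient descent run on unconstrained variables while the constraints hold by construction.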

71 |
The Acoustic-Modelling Problem in Automatic Speech Recognition (unpublished PhD thesis)
- Brown
- 1987

Citation Context: ...his computation may be efficiently performed using a dynamic programming algorithm. When used at recognition time this is referred to as Viterbi decoding. ¹ And if some other conditions are satisfied [11]. We have used discriminatively trained classifiers to estimate the output PDFs [5, 14, 17]. It may be shown that a "1-from-n" classifier trained using a relative entropy (or a least mean squares) ob...

68 |
Semi-continuous hidden Markov models for speech recognition
- Huang, Jack
- 1989

Citation Context: ...We survey this problem and discuss some possible solutions. TIED MIXTURE HMM: Tied mixture density (or semi-continuous) HMMs have proven to be powerful PDF estimators in continuous speech recognition [13, 3]. This method may be regarded as intermediate between discrete vector-quantised methods and separate continuous PDF estimates for each state. If a unified formalism for both discrete and continuous HM...
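The tying in this excerpt means every state draws on one shared pool of Gaussians and differs only in its mixing weights. A sketch under my own simplifications (spherical unit-variance Gaussians, toy dimensions; the papers cited do not fix these):

```python
import numpy as np

def gauss_pdf(x, mu, var):
    """Spherical Gaussian density N(x; mu, var*I)."""
    d = x.size
    norm = (2.0 * np.pi * var) ** (-d / 2.0)
    return norm * np.exp(-((x - mu) ** 2).sum() / (2.0 * var))

def tied_mixture_likelihood(x, mus, var, C):
    """f_k(x) = sum_j c_kj N_j(x; mu_j, var): every state k reuses the same
    pool of J Gaussians; only the mixing weights c_kj are state-specific."""
    g = np.array([gauss_pdf(x, mu, var) for mu in mus])   # shared pool, shape (J,)
    return C @ g                                          # one likelihood per state

# Toy pool: J = 3 shared Gaussians, K = 2 states, 2-d observations.
mus = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, 0.5]])
C = np.array([[0.6, 0.3, 0.1],
              [0.1, 0.2, 0.7]])   # rows are mixture weights, each summing to 1
lik = tied_mixture_likelihood(np.array([0.1, 0.0]), mus, 1.0, C)
```

Structurally this is the same computation as the RBF network of equation (3), which is the isomorphism the abstract refers to.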

31 |
ALPHA-NETS: A recurrent `neural' network architecture with a hidden Markov model interpretation
- Bridle
- 1990

Citation Context: ...ontinuous HMMs in which the mutual information between the acoustic evidence and the word sequence was maximised using gradient descent. More recently, Bridle introduced the "alphanet" representation [8] of HMMs, in which the computation of the HMM "forward" probabilities α_jt = P(X_1^t, q(t) = j) is performed by the forward dynamics of a recurrent network. Alphanets may be discriminatively trained ...
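The forward probabilities α_jt in this excerpt obey the standard recursion α_jt = (Σ_i α_{i,t-1} a_ij) b_j(x_t), which is what the alphanet's recurrent dynamics compute. A sketch with an invented two-state toy model (all numbers below are illustrative):

```python
import numpy as np

def forward(pi, A, B):
    """HMM forward pass: alpha[t, j] = P(X_1..X_t, q(t) = j).
    pi: (J,) initial state probabilities; A: (J, J) transitions a_ij;
    B: (T, J) output probabilities b_j(x_t) for the observed sequence."""
    T, J = B.shape
    alpha = np.zeros((T, J))
    alpha[0] = pi * B[0]
    for t in range(1, T):
        # alpha_jt = sum_i alpha_{i,t-1} a_ij, then scale by b_j(x_t)
        alpha[t] = (alpha[t - 1] @ A) * B[t]
    return alpha

pi = np.array([1.0, 0.0])
A = np.array([[0.7, 0.3],
              [0.0, 1.0]])          # left-to-right topology
B = np.array([[0.9, 0.1],
              [0.8, 0.2],
              [0.1, 0.9]])
alpha = forward(pi, A, B)
seq_prob = alpha[-1].sum()          # P(X_1^T) = sum_j alpha_jT
```

Because each step is a matrix product followed by an elementwise scale, the whole pass is differentiable, which is what makes the gradient-descent training the excerpt describes possible.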

30 |
Connectionist Viterbi training: a new hybrid method for continuous speech recognition
- Franzini, Lee, et al.
- 1990

Citation Context: ...labelled data are updated by performing a Viterbi segmentation after each epoch of discriminative training. Such an approach has been referred to as embedded MLP [5] or connectionist Viterbi training [12]. It should be noted that the transition probabilities are still optimised by a maximum likelihood criterion (or the Viterbi approximation to it). It may be proved that performing a Viterbi segmentati...
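The Viterbi segmentation used to re-label frames between training epochs is the max-product counterpart of the forward recursion, with back-pointers to recover the best state sequence. A sketch on an invented two-state toy model (numbers illustrative only):

```python
import numpy as np

def viterbi(pi, A, B):
    """Most likely state sequence for an HMM: delta[t, j] is the probability of
    the best path ending in state j at time t; psi stores back-pointers."""
    T, J = B.shape
    delta = np.zeros((T, J))
    psi = np.zeros((T, J), dtype=int)
    delta[0] = pi * B[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] * A   # (i, j): best path ending in i, stepping to j
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) * B[t]
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):            # trace back-pointers
        path.append(int(psi[t, path[-1]]))
    return path[::-1]

pi = np.array([1.0, 0.0])
A = np.array([[0.7, 0.3],
              [0.0, 1.0]])                   # left-to-right model
B = np.array([[0.9, 0.1],
              [0.8, 0.2],
              [0.1, 0.9]])
labels = viterbi(pi, A, B)                   # frame-level state labels
```

In the embedded scheme the excerpt describes, `labels` would become the classifier's targets for the next discriminative training epoch.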

18 |
A continuous speech recognition system embedding MLP into HMM
- Bourlard, Morgan
- 1990

Citation Context: ...When used at recognition time this is referred to as Viterbi decoding. ¹ And if some other conditions are satisfied [11]. We have used discriminatively trained classifiers to estimate the output PDFs [5, 14, 17]. It may be shown that a "1-from-n" classifier trained using a relative entropy (or a least mean squares) objective function outputs the posterior probabilities P(q_l | x) of each class given the in...

9 |
Continuous speech recognition using PLP analysis with multilayer perceptrons
- Morgan, Hermansky, et al.

Citation Context: ...When used at recognition time this is referred to as Viterbi decoding. ¹ And if some other conditions are satisfied [11]. We have used discriminatively trained classifiers to estimate the output PDFs [5, 14, 17]. It may be shown that a "1-from-n" classifier trained using a relative entropy (or a least mean squares) objective function outputs the posterior probabilities P(q_l | x) of each class given the in...

5 |
On the interaction between true source, training, and testing language models
- Paul, Baker, et al.
- 1991

Citation Context: ...results in a huge number of parameters that would require an unrealistic amount of training data to estimate them significantly. This problem has also been raised in the context of language modelling [15]. Since the ideal theoretical solution is not accessible in practice, it is usually better to dispose of the poor estimate of the priors obtained using the training data, replacing them with "prior" p...

2 |
Tied mixture continuous parameter modeling for continuous speech recognition
- Bellegarda, Nahamoo
- 1990

Citation Context: ...We survey this problem and discuss some possible solutions. TIED MIXTURE HMM: Tied mixture density (or semi-continuous) HMMs have proven to be powerful PDF estimators in continuous speech recognition [13, 3]. This method may be regarded as intermediate between discrete vector-quantised methods and separate continuous PDF estimates for each state. If a unified formalism for both discrete and continuous HM...

1 |
Global optimization of a neural network - hidden Markov model hybrid
- Bengio, De Mori, Flammia, Kompe
- 1990

Citation Context: ...sumes uniform priors rather than those specified by the language model. Initial work in using global optimisation methods for continuous speech recognition has been performed by Bridle [7] and Bengio [4]; both of these involved training the parameters of the HMM by a maximum likelihood process, using the "alphanets" method to optimise the input parameters via some (linear or non-linear) transform. PR...

1 |
An alphanet approach to optimising input transformations for continuous speech recognition
- Bridle, Dodd
- 1991

Citation Context: ...irable as it assumes uniform priors rather than those specified by the language model. Initial work in using global optimisation methods for continuous speech recognition has been performed by Bridle [7] and Bengio [4]; both of these involved training the parameters of the HMM by a maximum likelihood process, using the "alphanets" method to optimise the input parameters via some (linear or non-linear) transform...

1 |
A comparative study of continuous speech recognition using neural networks and hidden Markov models
- Renals, McKelvie, et al.
- 1991

Citation Context: ...When used at recognition time this is referred to as Viterbi decoding. ¹ And if some other conditions are satisfied [11]. We have used discriminatively trained classifiers to estimate the output PDFs [5, 14, 17]. It may be shown that a "1-from-n" classifier trained using a relative entropy (or a least mean squares) objective function outputs the posterior probabilities P(q_l | x) of each class given the in...