## An Application of Recurrent Nets to Phone Probability Estimation (1994)

Venue: IEEE Transactions on Neural Networks

Citations: 207 (8 self)

### BibTeX

```bibtex
@ARTICLE{Robinson94anapplication,
  author  = {Tony Robinson},
  title   = {An Application of Recurrent Nets to Phone Probability Estimation},
  journal = {IEEE Transactions on Neural Networks},
  year    = {1994},
  volume  = {5},
  pages   = {298--305}
}
```

### Abstract

This paper presents an application of recurrent networks for phone probability estimation in large-vocabulary speech recognition. The need for efficient exploitation of context information is discussed.

### Citations

4595 | A tutorial on hidden Markov models and selected applications in speech recognition
- Rabiner
- 1989
Citation Context ...ng the probabilities of phone strings and then searching all possible phone strings for the most probable legal word string. The most popular tool for this task is the Hidden Markov Model (HMM) (e.g. [6, 7]). Individual models are created for each phone and these are concatenated to form word models. Each HMM phone model can be matched to any segment of speech and the likelihood of the model generating ...

3016 | Learning internal representations by error propagation
- Rumelhart, Hinton, et al.
- 1986
Citation Context ..., 3]), however this paper is interested in the kind that map one sequence on to another. This form of recurrent net is potentially very powerful as it is capable of emulating any finite state machine [4]. Specifically, the aim of the network is to perform the mapping from a sequence of frames of parameterised speech to a sequence of phone labels associated with those frames. There are noticeable corr...

1691 | Finding structure in time
- Elman
- 1990
Citation Context ...keep past outputs and also to be realistic in computational requirements. It uses a mixture of unsupervised and supervised learning to form the state vector and is related to the simple recurrent net [31] and the principle of history compression [32]. This form of net was demonstrated for small problems, but has never been tested on larger problems. Of the three algorithms, back-propagation through ti...

487 | Connectionist Speech Recognition: A Hybrid Approach
- Bourlard, Morgan
- 1994
Citation Context ... authors that when used for classification these networks approximate the posterior probability of class occupancy [13, 14, 15, 16, 17]. For a full discussion of this result to speech recognition see [18, 19]. 2.3 Hybrid connectionist / Markov model systems The use of MLPs allows a large window of parameterised speech to be used directly for the estimation of phone class probabilities [20]. Indeed, it can...
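The hybrid approach referenced in this excerpt rests on one conversion: a classifier's outputs estimate the posteriors P(class | input), and dividing those by the class priors gives likelihoods up to a constant factor, which is what an HMM decoder consumes. A minimal sketch of that conversion (the function name and the example numbers are invented for illustration, not taken from the paper):

```python
def scaled_likelihoods(posteriors, priors):
    """Turn estimated posteriors P(class | input) into scaled likelihoods.
    By Bayes' rule, P(input | class) is proportional to
    P(class | input) / P(class), and an HMM decoder only needs
    likelihoods up to a constant factor."""
    return [p / q for p, q in zip(posteriors, priors)]

# a frame where the network strongly favours class 0,
# but class 0 is also very common a priori, so its advantage shrinks
scaled = scaled_likelihoods([0.7, 0.2, 0.1], [0.5, 0.3, 0.2])
```

Note that the division penalises frequent classes: a posterior of 0.7 for a class with prior 0.5 scores lower, relatively, than the raw posterior suggests.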

438 | A Learning Algorithm for Continually Running Fully Recurrent Neural Networks
- Williams, Zipser
- 1989
Citation Context ...l time slots. • The infinite input duration net was proposed to overcome the constraint of finite length sequences, and was also formulated independently by other researchers at about the same time [29, 30]. This method is often called "Real-Time Recurrent Learning", but is too expensive in computation and storage for most problems. • Finally, the state compression net was constructed to make it unnec...

421 | A maximum likelihood approach to continuous speech recognition
- Bahl, Jelinek
- 1983
Citation Context ...ng the probabilities of phone strings and then searching all possible phone strings for the most probable legal word string. The most popular tool for this task is the Hidden Markov Model (HMM) (e.g. [6, 7]). Individual models are created for each phone and these are concatenated to form word models. Each HMM phone model can be matched to any segment of speech and the likelihood of the model generating ...

360 | Increased rates of convergence through learning rate adaptation
- Jacobs
- 1988
Citation Context ...ame unstable if either N or OE were set too high and the best performance was obtained with N set to the smallest value which resulted in convergence. This is similar to the method proposed by Jacobs [36] except that a stochastic gradient signal is used and both the increase and decrease in the scaling factor is geometric (as opposed to an arithmetic increase and geometric decrease). Considerable effo...
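The adaptation scheme described in this excerpt — grow the step size while successive stochastic gradient estimates agree in sign, shrink it when they disagree, both geometrically — can be sketched per weight as follows (the factors 1.1 and 0.5 and the function name are illustrative assumptions; the excerpt does not quote the values used):

```python
def adapt_scale(scale, prev_grad, grad, up=1.1, down=0.5):
    """Geometric step-size adaptation driven by gradient sign agreement.
    Unlike Jacobs' original rule (arithmetic increase, geometric decrease),
    both the increase and the decrease here are geometric, matching the
    variant described in the excerpt."""
    if prev_grad * grad > 0:      # consecutive gradients agree: speed up
        return scale * up
    if prev_grad * grad < 0:      # sign flip suggests overshoot: slow down
        return scale * down
    return scale                  # a zero gradient leaves the scale unchanged
```

Each weight keeps its own `scale`, so directions with consistent gradients accelerate while oscillating directions are damped.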

354 | Interpolated Estimation of Markov Source Parameters from Sparse Data
- Jelinek, Mercer
- 1980
Citation Context ... parameters for each model may be robustly estimated. Clustering and smoothing techniques can enable a reasonable compromise to be made at the expense of model accuracy and storage requirements (e.g. [8, 9]). However, the problem remains of the number of models increasing exponentially with increasing number of contextual variables which limits the applicability of this technique. Acoustic context is ha...

351 | Connectionist Learning Procedures
- Hinton
- 1989
Citation Context ...-1 to +1 and the target values were ±0.8. This was later replaced by the cross-entropy objective function which considers each output to be the estimator of the probability of independent events [40]. The latest development is to replace the set of sigmoidal output non-linearities with the normalised exponential or "softmax" output function [14]. This is a suitable activation function for a one-f...
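The two output configurations this excerpt describes — the normalised exponential ("softmax") output function and a cross-entropy objective for one-from-many classification — fit together as below. This is a minimal generic sketch, not the paper's implementation:

```python
import math

def softmax(activations):
    """Normalised exponential: maps raw output activations to a
    probability distribution over classes."""
    m = max(activations)                      # subtract the max for numerical stability
    exps = [math.exp(a - m) for a in activations]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, target_index):
    """One-from-many cross-entropy: the negative log probability
    assigned to the correct class."""
    return -math.log(probs[target_index])

# the largest activation receives the largest probability,
# and the loss is smallest when the target is that class
probs = softmax([2.0, 1.0, 0.1])
```

Softmax guarantees the outputs sum to one, so each output can be read directly as a class probability estimate, which is what the hybrid decoding scheme requires.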

346 | Phoneme Recognition Using Time-Delay Neural Networks
- Waibel, Hanazawa, et al.
- 1987
Citation Context ...ic information considered. Weight sharing allows encoding of prior knowledge and gives better scaling properties at the expense of imposing restrictions on the diversity of the computations performed [26]. Along with the non-connectionist probability estimation methods, these techniques are restricted to a finite length window on the acoustic data. 2.4 Recurrent nets for phone probability estimation T...

297 | Backpropagation through time: what it does and how to do it
- Werbos
- 1990
Citation Context ...ed to learning sequence mappings of finite duration. The structure is a minor variation on the original recurrent net training algorithm [4] and is now commonly called "Back-Propagation Through Time" [28]. The training procedure is to expand the network in time, i.e. to consider the recurrent net for all time slots as a single very large network with input and output at each time slot and shared weigh...
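The expansion-in-time procedure in this excerpt can be illustrated with a one-unit recurrent net: the forward pass stores the state at every time slot, and a single backward pass traverses the unrolled network, accumulating the gradients of the weights that are shared across all time slots. This is a toy sketch of the technique, not the paper's network:

```python
import math

def bptt_scalar(xs, ys, w, u):
    """Back-propagation through time for h_t = tanh(w*h_{t-1} + u*x_t)
    with squared-error targets ys.  Returns the loss and the gradients
    of the shared recurrent weight w and input weight u."""
    # forward pass: keep every state so the unrolled net can be walked backwards
    hs = [0.0]
    for x in xs:
        hs.append(math.tanh(w * hs[-1] + u * x))
    loss = sum((h - y) ** 2 for h, y in zip(hs[1:], ys))

    # backward pass through the expanded network, one time slot at a time
    dw = du = 0.0
    dh_next = 0.0                             # gradient arriving from the next time slot
    for t in reversed(range(len(xs))):
        dh = 2.0 * (hs[t + 1] - ys[t]) + dh_next
        delta = dh * (1.0 - hs[t + 1] ** 2)   # back through the tanh non-linearity
        dw += delta * hs[t]                   # the SAME w is used at every time slot,
        du += delta * xs[t]                   # so its gradient accumulates over time
        dh_next = delta * w
    return loss, dw, du
```

As the excerpt notes, this costs no more per pass than training a feedforward net of the unrolled depth; the price is storing the state at every time slot.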

295 | Hidden Markov Models for Speech Recognition
- Huang, Ariki, et al.
- 1990
Citation Context ...ic vectors are independent (block diagonal covariance matrix), or that all acoustic parameters are independent (diagonal covariance matrix), but this clearly limits the modelling power available (e.g. [10]). Careful choice of the method used to increase the information content of the acoustic vector is clearly important. Empirically it has been shown that first (and second) order differences taken over...

282 | Neural network classifiers estimate Bayesian a posteriori probabilities
- Richard, Lippmann
- 1991
Citation Context ...layer perceptrons (MLPs) are a suitable candidate as it has been shown by a number of authors that when used for classification these networks approximate the posterior probability of class occupancy [13, 14, 15, 16, 17]. For a full discussion of this result to speech recognition see [18, 19]. 2.3 Hybrid connectionist / Markov model systems The use of MLPs allows a large window of parameterised speech to be used dire...

272 | Speaker-independent Phone Recognition Using Hidden Markov Models
- Lee, Hon
- 1989
Citation Context ...s in parentheses are the evaluation over the smaller "core test set" which only includes sentence prompts not used in the training set. The first HMM results on this task were provided by Lee and Hon [43]. This system used multiple codebooks and right-context HMMs with 39 symbols. Recognition accuracy for this system is shown as entry "SPHINX". Mapping the recurrent net output to an equivalent symbol ...

242 | Probabilistic interpretation of feedforward classification network outputs, with relationships to statistical pattern recognition
- Bridle
- 1990
Citation Context ...layer perceptrons (MLPs) are a suitable candidate as it has been shown by a number of authors that when used for classification these networks approximate the posterior probability of class occupancy [13, 14, 15, 16, 17]. For a full discussion of this result to speech recognition see [18, 19]. 2.3 Hybrid connectionist / Markov model systems The use of MLPs allows a large window of parameterised speech to be used dire...

222 | Automatic Speech Recognition: The Development of the SPHINX System
- Lee
- 1989
Citation Context ... parameters for each model may be robustly estimated. Clustering and smoothing techniques can enable a reasonable compromise to be made at the expense of model accuracy and storage requirements (e.g. [8, 9]). However, the problem remains of the number of models increasing exponentially with increasing number of contextual variables which limits the applicability of this technique. Acoustic context is ha...

209 | A course in phonetics
- Ladefoged
- 1975
Citation Context ...nits is the phone, although diphone, triphone and syllable based approaches are being pursued. The phoneme is a semantic category, it being the smallest unit that is used to distinguish meaning (e.g. [5]). A phone is the acoustic category corresponding to the phoneme. The specific phone that is used in any instance is dependent on contextual variables such as speaking rate. By specifying the pronunci...

192 | Speaker-independent isolated word recognition using dynamic features of speech spectrum
- Furui
- 1986
Citation Context ...nd) order differences taken over a window length of a few frames are a reasonable choice for the parameterisation of acoustic context and yield substantial improvements in speech recognition accuracy [11]. As a result this parameterisation has been widely adopted by the speech recognition community. Difference coefficients are a simple linear function of the acoustic vectors lying within a rectangular...
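Difference (delta) coefficients of the kind this excerpt describes — a simple linear function of the frames inside a rectangular window — are conventionally computed as a regression over ±N neighbouring frames. The sketch below uses scalar features and replicates edge frames; the window length and edge handling are illustrative choices, not details from the paper:

```python
def delta_coefficients(frames, window=2):
    """First-order difference features over a rectangular window of
    +/- `window` frames.  Each delta is a fixed linear combination of
    the neighbouring frames, with edge frames replicated."""
    norm = 2.0 * sum(n * n for n in range(1, window + 1))
    deltas = []
    for t in range(len(frames)):
        acc = 0.0
        for n in range(1, window + 1):
            later = frames[min(t + n, len(frames) - 1)]    # clamp at the ends
            earlier = frames[max(t - n, 0)]
            acc += n * (later - earlier)
        deltas.append(acc / norm)
    return deltas
```

On a constant sequence every delta is zero, and on a linear ramp the interior deltas recover the slope, which is exactly the "rate of change" interpretation that motivates the feature.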

147 | Speech database development: design and analysis of the acoustic-phonetic corpus
- Lamel, Kassel, et al.
- 1986
Citation Context ...at are necessary to this standard implementation to obtain a state-of-the-art recogniser. 4.1 A large task: TIMIT The TIMIT database is the largest phonetically labelled database publicly available [35]. It consists of 420 speakers in the training set and 210 speakers in the test set, and each speaker utters ten sentences of which eight are usable for speaker independent phone recognition. Large spe...

132 | The DARPA 1000-Word Resource Management Database
- Price, Fisher, et al.
- 1988
Citation Context ...%(30.3%) Table 1: Comparison with other TIMIT phone recognisers A standard database for large vocabulary speech recognition in the last few years has been the DARPA 1000 word Resource Management task [48]. The speaker independent part of this database has 109 speakers in the augmented training set and 30 different speakers in each of four test sets. Each speaker utters 20 or 30 sentences, giving a tot...

126 | Gradient-based learning algorithms for recurrent networks and their computational complexity
- Williams, Zipser
- 1995
Citation Context ...or performing phone probability estimation. 1 Introduction The aim of this paper is to describe the application of a recurrent net to phone recognition. There are several forms of recurrent net (e.g. [1, 2, 3]), however this paper is interested in the kind that map one sequence on to another. This form of recurrent net is potentially very powerful as it is capable of emulating any finite state machine [4]....

99 | Links between Markov models and multilayer perceptrons
- Bourlard, Wellekens
- 1990
Citation Context ...layer perceptrons (MLPs) are a suitable candidate as it has been shown by a number of authors that when used for classification these networks approximate the posterior probability of class occupancy [13, 14, 15, 16, 17]. For a full discussion of this result to speech recognition see [18, 19]. 2.3 Hybrid connectionist / Markov model systems The use of MLPs allows a large window of parameterised speech to be used dire...

84 | Linear discriminant analysis for improved large vocabulary continuous speech recognition
- Haeb-Umbach, Ney
- 1992
Citation Context ...g within a rectangular window. Automatic optimisation of the linear function may be achieved using linear discriminant analysis and this has also been shown to yield increased recognition performance [12]. However, long term contextual information such as the speaker dependence of the acoustic realisation of phonemes will not be adequately modelled by a linear transformation to a small subspace. Metho...

68 | Connectionist probability estimators in HMM speech recognition
- Renals, Morgan, et al.
- 1994
Citation Context ... authors that when used for classification these networks approximate the posterior probability of class occupancy [13, 14, 15, 16, 17]. For a full discussion of this result to speech recognition see [18, 19]. 2.3 Hybrid connectionist / Markov model systems The use of MLPs allows a large window of parameterised speech to be used directly for the estimation of phone class probabilities [20]. Indeed, it can...

63 | Learning complex, extended sequences using the principle of history compression
- Schmidhuber
- 1992
Citation Context ...computational requirements. It uses a mixture of unsupervised and supervised learning to form the state vector and is related to the simple recurrent net [31] and the principle of history compression [32]. This form of net was demonstrated for small problems, but has never been tested on larger problems. Of the three algorithms, back-propagation through time was chosen as being the most efficient in s...

61 | Optimization of Backpropagation Algorithm for Training Multilayer Perceptrons
- Schiffmann, Joost, et al.
- 1993
Citation Context ... this training procedure and the result was found to give better performance than the other methods that can be found in the literature. A survey of "speed-up" techniques reached a similar conclusion [37]. However, the parameters quoted above are task dependent and a more robust learning scheme with fewer free parameters would be desirable. 4.4 The selection of acoustic features The acoustic features ...

59 | A Probabilistic Approach to the Understanding and Training of Neural Network Classifiers
- Gish
- 1990

58 | Phonological structures for speech recognition
- Cohen
- 1989
Citation Context ...nto words. There are often many valid phonetic variations on the pronunciation of any word, and this paper uses a pronunciation set developed using the single most probable phone string for each case [49]. A set of Markov models was created from these pronunciations and a word-pair grammar that is supplied with the database. The grammar has a perplexity (average branching factor) of 60. Unlike the ...

56 | Continuous Speech Recognition using Multi-Layer Perceptrons with Hidden Markov Models
- Morgan, Bourlard
- 1990
Citation Context ...gnition see [18, 19]. 2.3 Hybrid connectionist / Markov model systems The use of MLPs allows a large window of parameterised speech to be used directly for the estimation of phone class probabilities [20]. Indeed, it can be seen that any linear transformation may be built into the first layer of a MLP by modifying the weights before the non-linearity. The use of multiple layers allows the independence...

49 | Supervised learning of probability distributions by neural networks
- Baum, Wilczek
- 1988

43 | MMI training for continuous phoneme recognition on the TIMIT database
- Kapadia, Valtchev, et al.
- 1993
Citation Context ...xts and is tabulated as system "htk". Kapadia et al. show that Maximum Mutual Information (MMI) training of HMMs can provide significantly better results than the standard Maximum Likelihood training [45]. System "mmi" uses monophone models only. Digalakis et al. provide a Stochastic Segment Model (SSM) for this task [46]. Results are presented for 61 and 39 symbols under the entries "ssm61" and "ssm3...

37 | A Recurrent Error Propagation Network Speech Recognition System
- Robinson, Fallside
- 1991
Citation Context ...phone. More complex modelling of phone durations is possible but was found to be ineffectual for phone recognition, although significant when a word grammar is imposed on the possible phone sequences [42]. The standard Viterbi algorithm is used to find the maximum likelihood state sequence (e.g. [7]). 5 Phone Recognition Results The phone sequence produced by the recogniser is scored with a standard d...
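The maximum likelihood state sequence mentioned in this excerpt is found by the standard Viterbi recursion: keep the best score into each state at each time slot, record a back-pointer, and trace back at the end. A minimal log-domain sketch (the data layout is an assumption made for illustration):

```python
import math

def viterbi(log_obs, log_trans, log_init):
    """Maximum likelihood state sequence through an HMM, in the log domain.
    log_obs[t][s]  : log P(observation t | state s)
    log_trans[i][j]: log P(next state j | current state i)
    log_init[s]    : log P(initial state s)
    """
    n = len(log_init)
    score = [log_init[s] + log_obs[0][s] for s in range(n)]
    back = []                                  # back-pointers, one list per time slot
    for t in range(1, len(log_obs)):
        pointers, new_score = [], []
        for j in range(n):
            best = max(range(n), key=lambda i: score[i] + log_trans[i][j])
            pointers.append(best)
            new_score.append(score[best] + log_trans[best][j] + log_obs[t][j])
        score = new_score
        back.append(pointers)
    # trace the best path backwards through the stored pointers
    state = max(range(n), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path
```

Working in log probabilities turns the products of the likelihood into sums and avoids numerical underflow over long utterances.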

35 | Static and dynamic error propagation networks with application to speech coding
- Robinson, Fallside
- 1987
Citation Context ...esented at both the phone and word levels, along with a discussion of the work that still needs to be done. 3 Basic theory The form of the recurrent net used here was first described by the author in [27]. This paper took the basic equations for a linear dynamical system and replaced the linear matrix operators with non-linear feedforward networks. After merging computations, the resulting structure i...

33 | Alpha-nets: a recurrent “neural” network architecture with a hidden Markov model interpretation
- Bridle
- 1990
Citation Context ...st model with as many layers as there are frames of speech allocated to the model. Performing gradient ascent in the log likelihood of the model gives standard Maximum Likelihood trained models (e.g. [22]). At the other extreme the phone class probability estimators are trained independently of the HMM transition probabilities. This is similar to Viterbi training of HMMs in that only the most probable...

33 | The HTK tied-state continuous speech recognizer
- Woodland, Young
- 1993
Citation Context ...on dictionary. A comprehensive summary of the latest results on this task can be found in [51]. The results presented here are significantly better than the best monophone HMM system reported to date [52], although not as good as the best triphone based HMM systems. Triphone modelling allows the parameters of a phone model to depend on the two adjacent ... (footnote 1: This set of pronunciations may be found on the ...)

28 | Dynamic recurrent neural networks
- Pearlmutter
- 1990
Citation Context ...or performing phone probability estimation. 1 Introduction The aim of this paper is to describe the application of a recurrent net to phone recognition. There are several forms of recurrent net (e.g. [1, 2, 3]), however this paper is interested in the kind that map one sequence on to another. This form of recurrent net is potentially very powerful as it is capable of emulating any finite state machine [4]....

22 | Combining hidden Markov models and neural network classifiers
- Niles, Silverman
- 1990
Citation Context ...the advantage that discriminative training can be used (e.g. [20]). There are several intermediate positions in which gradient descent techniques can be used for discriminative training of HMMs (e.g. [23, 24, 25]) and posterior state occupancy probabilities can be used as targets for connectionist training. There are also a variety of architectures worth considering for use as connectionist probability estima...

20 | Fast Algorithms for phone classification and recognition using Segment-based Models
- Digalakis, Ostendorf, et al.
- 1992
Citation Context ...ovide significantly better results than the standard Maximum Likelihood training [45]. System "mmi" uses monophone models only. Digalakis et al. provide a Stochastic Segment Model (SSM) for this task [46]. Results are presented for 61 and 39 symbols under the entries "ssm61" and "ssm39" respectively. Ljolje provides a single mixture Gaussian triphone based HMM with durational constraints and trigram p...

18 | A Dynamic Connectionist Model of Phoneme Recognition
- Robinson, Fallside
- 1988
Citation Context ...irect minimisation of the objective function. 4 Application considerations The first application of recurrent networks to the recognition of phones in continuous speech was presented by the author in [34]. This section aims to detail the changes that are necessary to this standard implementation to obtain a state-of-the-art recogniser. 4.1 A large task: TIMIT The TIMIT database is the largest phonetic...

15 | Several improvements to a recurrent error propagation network phone recognition system
- Robinson
- 1991
Citation Context ...t using this compression function under limited storage conditions, but it is not clear whether this is merely due to reducing quantisation noise, or whether the processed input is easier to classify [39]. 4.5 The use of a minimum entropy objective function Originally the least mean squares objective function was used. The range of outputs was -1 to +1 and the target values were ±0.8. This was la...

14 | Connectionist Probability Estimation in the Decipher Speech Recognition System
- Renals, Morgan, et al.
- 1992
Citation Context ...be exploited. Experimenters with connectionist word recognition report that connectionist probability estimators yield better results than the equivalent HMM based on mixtures of Gaussian likelihoods [21]. There are two extremes in approaches to building hybrid connectionist/HMM systems. At one end, a standard HMM can be considered as a connectionist model with as many layers as there are frames of sp...

13 | An Alphanet Approach to Optimising Input Transformations for Continuous Speech Recognition
- Bridle, Doddi
- 1991
Citation Context ...the advantage that discriminative training can be used (e.g. [20]). There are several intermediate positions in which gradient descent techniques can be used for discriminative training of HMMs (e.g. [23, 24, 25]) and posterior state occupancy probabilities can be used as targets for connectionist training. There are also a variety of architectures worth considering for use as connectionist probability estima...

12 | A look at phonetic discrimination using connectionist models with recurrent links. SCIMP working paper 82018, Institute for Defense Analyses
- Kuhn
- 1987
Citation Context ...l time slots. • The infinite input duration net was proposed to overcome the constraint of finite length sequences, and was also formulated independently by other researchers at about the same time [29, 30]. This method is often called "Real-Time Recurrent Learning", but is too expensive in computation and storage for most problems. • Finally, the state compression net was constructed to make it unnec...

8 | Competitive training in hidden Markov models
- Young
- 1990
Citation Context ...the advantage that discriminative training can be used (e.g. [20]). There are several intermediate positions in which gradient descent techniques can be used for discriminative training of HMMs (e.g. [23, 24, 25]) and posterior state occupancy probabilities can be used as targets for connectionist training. There are also a variety of architectures worth considering for use as connectionist probability estima...

8 | A Comparison of Preprocessors for the Cambridge Recurrent Error Propagation Network Speech Recognition System
- Robinson, Fallside
Citation Context ...nction); and a normalised power spectrum from an FFT grouped into 20 mel scale bins. Many other acoustic features have been evaluated on this system including FFT, filterbank and LPC based techniques [38]. The conclusion from these studies is that recurrent nets are reasonably robust to the choice of input representation. The inclusion of the fundamental frequency and degree of voicing do not make a l...

7 | Soft weight-sharing
- Nowlan, Hinton
- 1992
Citation Context ...r understanding of the network states and weights could yield more compact networks and faster training. The use of prior knowledge of good weight values has been shown to yield better generalisation [53]. Many of the ideas developed for HMM based systems are also applicable to this scheme. For example, the use of context dependent models has been shown to increase performance [54, 55]. In conclusion,...

2 | Practical network design and implementation
- Robinson
- 1992
Citation Context ...or performing phone probability estimation. 1 Introduction The aim of this paper is to describe the application of a recurrent net to phone recognition. There are several forms of recurrent net (e.g. [1, 2, 3]), however this paper is interested in the kind that map one sequence on to another. This form of recurrent net is potentially very powerful as it is capable of emulating any finite state machine [4]....

2 | The state space and "ideal input" representations of recurrent networks
- Robinson
- 1992
Citation Context ...there is no obvious "meaning" that can be assigned to their values. From information storage principles all units would be uncorrelated, although in practice a large degree of correlation is observed [33]. • This method takes no more computation per pass than training feedforward networks. The error vector for every output in the sequence is traced to the start of the sequence during the single back...

1 | The use of state tying in continuous speech recognition
- Young, Woodland
- 1993
Citation Context ...INX". Mapping the recurrent net output to an equivalent symbol set gives entry "rn39a". A state-of-the-art standard HMM system is provided by the publicly available HTK system of Young and Woodland [44]. This implementation uses state tying to allow adequate training data to be assigned to rare contexts and is tabulated as system "htk". Kapadia et al. show that Maximum Mutual Information (MMI) train...

1 | New developments in phone recognition using an ergodic hidden Markov model. Technical memorandum TM-11222-910829-12, AT&T Bell Laboratories
- Ljolje
- 1991
Citation Context ... for 61 and 39 symbols under the entries "ssm61" and "ssm39" respectively. Ljolje provides a single mixture Gaussian triphone based HMM with durational constraints and trigram phonotactic constraints [47]. Although again 39 symbols are used, this subset is harder to recognise than the first 39 phone set due to the treatment of stops. The recognition rates for the HMM and recurrent net are given under ...