## Hybrid HMM/ANN Systems for Speech Recognition: Overview and New Research Directions (1998)

### Download Links

- www.idiap.ch
- publications.idiap.ch
- ftp.idiap.ch
- DBLP

### Other Repositories/Bibliography

Venue: Adaptive Processing of Sequences and Data Structures, ser. Lecture Notes in Artificial Intelligence, vol. 1387

Citations: 34 (3 self)

### BibTeX

@INPROCEEDINGS{Bourlard98hybridhmm/ann,
  author    = {Herve Bourlard and Nelson Morgan},
  title     = {Hybrid HMM/ANN Systems for Speech Recognition: Overview and New Research Directions},
  booktitle = {Adaptive Processing of Sequences and Data Structures, ser. Lecture Notes in Artificial Intelligence, vol. 1387},
  year      = {1998},
  pages     = {389--417},
  publisher = {Springer-Verlag}
}

### Citations

4567 | A tutorial on hidden Markov models and selected applications in speech recognition
- Rabiner
- 1989
Citation Context: ...ed by pronunciations from a lexicon), and sentence models consist of concatenations of word models (constrained by a grammar). Theory and methodology for HMMs are described in many sources, including [29]. Briefly, the fundamental equation relevant for this process is a restatement of Bayes' rule as applied to speech recognition: P(M|X, Θ) = p(X|M, Θ)P(M|Θ) / p(X|Θ), in which Θ is the parameter set and ...
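The Bayes-rule restatement quoted in this context can be written out in full. This is a reconstruction under the assumption that Θ denotes the parameter set the snippet mentions:

```latex
P(M \mid X, \Theta) \;=\; \frac{p(X \mid M, \Theta)\, P(M \mid \Theta)}{p(X \mid \Theta)}
```

That is, the posterior probability of word model M given acoustics X is the acoustic likelihood times the language-model prior, normalized by the evidence.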

629 | Perceptual linear predictive (PLP) analysis of speech
- Hermansky
- 1990
Citation Context: ...between 500 and 4000 hidden units that receive input from several hundred acoustic variables (e.g., 9 frames of acoustic context consisting of 12th order Perceptual Linear Prediction coefficients (PLP-12) [15] and log energy, along with their derivatives, or 26 features per frame). The output typically corresponds to simple context-independent acoustic classes such as phones defined for the TIMIT phonetic...

548 | An inequality and associated maximization technique in statistical estimation for probabilistic functions of a Markov process
- Baum
- 1972
Citation Context: ...zation (EM) algorithm (often referred to as the Baum-Welch or forward-backward algorithm) in which the estimators for the data likelihoods conditioned on each word model (p(X|M, Θ)) are iteratively trained [3]. In the case of the Viterbi approximation [equation (3)], the full likelihood is approximated by the likelihood of the most probable path through the states in the models, as given by the DP procedure. T...

483 | Connectionist Speech Recognition: A Hybrid Approach
- Bourlard, Morgan
- 1994
Citation Context: ...ng Bayes' rule. Several authors have shown that the outputs of ANNs used in classification mode can be interpreted as estimates of a posteriori probabilities of output classes conditioned on the input [6,12,32]. The proof given in [32] is repeated here. For continuous-valued acoustic input vectors, the Mean Square Error (MSE) criterion which is usually minimized during ANN training can be expressed as foll...
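The claim quoted in this context, that network outputs trained under an MSE criterion estimate a posteriori class probabilities, can be illustrated numerically. The sketch below is not from the paper: it uses a toy 1-D two-class Gaussian problem and checks that the MSE-optimal predictor of the 0/1 label (approximated by bin-wise label averages, i.e. E[y|x]) matches the analytic Bayes posterior.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (illustrative assumption): two equiprobable classes with
# 1-D Gaussian likelihoods N(-1, 1) and N(+1, 1).
n = 200_000
y = rng.integers(0, 2, size=n)                         # labels in {0, 1}
x = rng.normal(loc=np.where(y == 1, 1.0, -1.0), scale=1.0)

# The minimizer of E[(f(x) - y)^2] is f(x) = E[y | x] = P(y=1 | x).
# Estimate it empirically by averaging labels within narrow bins of x.
bins = np.linspace(-2, 2, 41)
idx = np.digitize(x, bins)
centers, empirical = [], []
for b in range(1, len(bins)):
    mask = idx == b
    if mask.sum() > 500:                               # skip sparse bins
        centers.append(0.5 * (bins[b - 1] + bins[b]))
        empirical.append(y[mask].mean())
centers = np.array(centers)
empirical = np.array(empirical)

# Analytic posterior via Bayes' rule with equal priors:
# P(y=1|x) = N(x;1,1) / (N(x;1,1) + N(x;-1,1)) = sigmoid(2x)
analytic = 1.0 / (1.0 + np.exp(-2.0 * centers))

# The empirical label averages track the true posterior closely
# (maximum deviation typically well below 0.05 at this sample size).
print(float(np.max(np.abs(empirical - analytic))))
```

The same argument carries over to multi-class MSE (or cross-entropy) training, which is why the hybrid approach can divide network outputs by class priors to obtain scaled likelihoods.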

282 | Neural network classifiers estimate Bayesian a posteriori probabilities - Richard, Lippmann - 1991

242 | Probabilistic interpretation of feedforward classification network outputs, with relationships to statistical pattern recognition
- Bridle
- 1990
Citation Context: ...models of HMMs - Connectionist structures can also be used to represent standard HMM-based algorithms. Examples include the Viterbi network [21], which implements a Viterbi decoder, and the Alpha-Net [9], which simulates the forward recurrence of the forward-backward HMM algorithm. - Global optimization through nonlinear transformation - Networks can provide a general nonlinear transformation of the ...

180 | Hidden Markov model decomposition of speech and noise
- Varga, Moore
- 1990
Citation Context: ...b-band recombination at word level using an MLP. ...sub-unit starting points. Hence, an approach such as the two-level dynamic programming is required. Alternatively, a particular form of HMM decomposition [38], referred to as HMM recombination, can also be used [8]. Finally, multiple-pass approaches can be used in which lattices are generated by a simpler system and then rescored by one or more multi-strea...

140 | How do humans process and recognize speech
- Allen
- 1994
Citation Context: ...noise), the whole feature vector is corrupted, and typically the performance of the recognizer is severely impaired. The work of Fletcher and his colleagues (see the insightful review of his work in [1]) suggests that human decoding of the linguistic message is based on decisions within narrow frequency sub-bands that are processed quite independently of each other. Recombination of decisions from t...

136 | A new ASR approach based on independent processing and recombination of partial frequency bands
- Bourlard, Dupont
- 1996
Citation Context: ...h could be the dynamic merging of asynchronous temporal sequences (possibly with different frame rates), such as visual and acoustic inputs. Although work has been done on multi-band speech recognition [8] as well as for ASR based on multiple time scales [10], only the multi-band results will be briefly described here. 5.2 Approach: In the following we briefly present the approach presently used to recomb...
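The sub-band recombination idea referenced in this context can be sketched minimally. The paper's approach recombines band-level results with an MLP; the snippet below substitutes a simpler fixed weighted log-linear merge of hypothetical per-band phone posteriors, just to show the mechanics. All numbers, weights, and class counts are made up for illustration.

```python
import numpy as np

# Hypothetical per-band phone posteriors for one frame: each row is one
# sub-band recognizer's posterior distribution over 4 phone classes.
band_posteriors = np.array([
    [0.70, 0.10, 0.10, 0.10],   # band 1
    [0.55, 0.25, 0.10, 0.10],   # band 2
    [0.20, 0.60, 0.10, 0.10],   # band 3 (e.g. hit by band-limited noise)
])

# Assumed per-band reliability weights. In the multi-band approach an MLP
# learns the recombination; a fixed log-linear rule is a crude stand-in.
w = np.array([0.4, 0.4, 0.2])

log_combined = w @ np.log(band_posteriors)   # weighted sum of log-posteriors
combined = np.exp(log_combined)
combined /= combined.sum()                   # renormalize to a distribution

print(combined.round(3))                     # class 0 wins despite band 3
```

Because the corrupted band gets a small weight, its vote for class 1 does not flip the decision, which is the robustness argument made for multi-band processing.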

108 | The ‘neural’ phonetic typewriter
- Kohonen
- 1988
Citation Context: ...objective is closely linked to the recognition error rate. - Preprocessing - Many researchers have used feature map representations, related to one of the formulations from Kohonen and collaborators [19], to generate feature representations for a speech recognizer. In other designs, researchers have experimented with networks to provide mappings from noisy to clean data [34] or from a new speaker to ...

101 | Review of Neural Networks for Speech Recognition
- Lippmann
- 1989
Citation Context: ...nd the assumption of a particular HMM state [36][20]. - ANN models of HMMs - Connectionist structures can also be used to represent standard HMM-based algorithms. Examples include the Viterbi network [21], which implements a Viterbi decoder, and the Alpha-Net [9], which simulates the forward recurrence of the forward-backward HMM algorithm. - Global optimization through nonlinear transformation - Netw...

89 | The use of a one-stage dynamic programming algorithm for connected word recognition
- Ney
- 1984
Citation Context: ...or instance, vowels are typically shortened in rapid speech, while some consonants may remain nearly the same length. The most common global decoding approach is some form of dynamic programming (DP) [26], in which time-warping of the input against possible speech representations results in the most likely sequence of sound categories to match the input. There are many variations to this process, but i...
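The dynamic-programming decoding mentioned in this context can be illustrated with a minimal Viterbi recursion. This is a generic sketch, not the paper's implementation; the state space, transition matrix, and emission scores are invented for the example.

```python
import numpy as np

def viterbi(log_trans, log_emit):
    """Most likely state path through a small HMM by dynamic programming.

    log_trans: (S, S) log transition probs, log_trans[i, j] = log P(j | i)
    log_emit:  (T, S) log emission scores per frame and state
    Assumes uniform initial state probabilities (a simplifying choice).
    """
    T, S = log_emit.shape
    score = np.full((T, S), -np.inf)
    back = np.zeros((T, S), dtype=int)
    score[0] = log_emit[0] - np.log(S)            # uniform start
    for t in range(1, T):
        cand = score[t - 1][:, None] + log_trans  # (prev, cur) path scores
        back[t] = cand.argmax(axis=0)
        score[t] = cand.max(axis=0) + log_emit[t]
    # Trace back the best path from the best final state
    path = [int(score[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Tiny 2-state example: emissions favor state 0 early, state 1 late.
log_trans = np.log(np.array([[0.8, 0.2], [0.2, 0.8]]))
log_emit = np.log(np.array([[0.9, 0.1], [0.9, 0.1], [0.1, 0.9], [0.1, 0.9]]))
print(viterbi(log_trans, log_emit))   # [0, 0, 1, 1]
```

In a recognizer the states would be phone or sub-phone models and the emission scores would come from the probability estimator (e.g., scaled ANN posteriors in the hybrid approach), but the time-warping recursion is the same.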

78 | Linear predictive hidden Markov models and the speech signal - Poritz - 1982

70 | Global optimization of a neural network–Hidden Markov Model hybrid
- Bengio, Mori, et al.
- 1992
Citation Context: ...linear transformation of the observation vectors for an otherwise standard HMM-based system. This permits a global optimization of the input transformation together with a global training of the HMMs [4]. - Minimum classification error optimization - The Generalized Probabilistic Descent/Minimum Classification Error (GPD/MCE) training method [18] is a general framework for classifier optimization. It is...

67 | Connectionist probability estimators in HMM speech recognition
- Renals, Morgan, et al.
- 1994
Citation Context: ...posterior distribution). We and others have performed numerous experiments that have verified these two points. In some of them, a fixed HMM was used and alternate probability estimators were substituted [25,6,30,22]. When these experiments were controlled for the number of parameters, there have been significant improvements using the approaches described here. Some of this quantitative evidence will be briefly su...

58 | A probabilistic approach to the understanding and training of neural network classifiers
- Gish
- 1990
Citation Context: ...ng Bayes' rule. Several authors have shown that the outputs of ANNs used in classification mode can be interpreted as estimates of a posteriori probabilities of output classes conditioned on the input [6,12,32]. The proof given in [32] is repeated here. For continuous-valued acoustic input vectors, the Mean Square Error (MSE) criterion which is usually minimized during ANN training can be expressed as foll...

41 | REMAP: Recursive estimation and maximization of a posteriori probabilities in connectionist speech recognition
- Bourlard, Konig, et al.
- 1995
Citation Context: ...as recently been further explored yielding: 1. Better discriminant systems and a new hybrid HMM/ANN approach referred to as REMAP (Recursive Estimation and Maximization of A Posteriori probabilities) [7], based on conditional transition probabilities p(q_l | q_k, x_n) estimated by a particular form of ANN. 2. Better understanding of the general hybrid HMM/ANN theory and its relationship to what had been don...

41 | Neural networks for statistical recognition of continuous speech
- Morgan, Bourlard
- 1995
Citation Context: ...posterior distribution). We and others have performed numerous experiments that have verified these two points. In some of them, a fixed HMM was used and alternate probability estimators were substituted [25,6,30,22]. When these experiments were controlled for the number of parameters, there have been significant improvements using the approaches described here. Some of this quantitative evidence will be briefly su...

39 | New discriminative training algorithms based on a generalized probabilistic descent method
- Katagiri, Lee, et al.
- 1991
Citation Context: ...ransformation together with a global training of the HMMs [4]. - Minimum classification error optimization - The Generalized Probabilistic Descent/Minimum Classification Error (GPD/MCE) training method [18] is a general framework for classifier optimization. It is based on the incorporation of a smooth classification error function into a gradient search optimization objective. The optimization objective ...

38 | Generalization and parameter estimation in feedforward nets: Some experiments
- Morgan, Bourlard
- 1990
Citation Context: ...addition to merely halting the training based on performance for an independent validation set, a training procedure can be used in which the learning rate is also adjusted to improve generalization [23]. Specifically, the learning rate is reduced (typically by a factor of 2) when cross-validation indicates that a given rate is no longer useful. Additionally, we have empirically noted that after the firs...
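The cross-validation-driven learning-rate schedule described in this context can be sketched as follows: halve the rate once the improvement on a held-out validation set falls below a threshold. The function name, threshold, and numbers here are illustrative assumptions, not the authors' exact recipe.

```python
def schedule_learning_rate(val_errors, lr0=0.1, factor=0.5, min_gain=0.005):
    """Return the learning rate in effect after each epoch, given the
    cross-validation error observed at the end of that epoch.

    lr0      -- initial learning rate (assumed value)
    factor   -- reduction applied when progress stalls (0.5 = halving)
    min_gain -- minimum CV-error improvement counted as "still useful"
    """
    lrs, lr, prev = [], lr0, None
    for err in val_errors:
        if prev is not None and prev - err < min_gain:
            lr *= factor            # rate no longer useful: reduce it
        lrs.append(lr)
        prev = err
    return lrs

# Validation error stalls after epoch 3, so the rate is halved repeatedly.
print(schedule_learning_rate([0.30, 0.25, 0.21, 0.209, 0.208]))
# [0.1, 0.1, 0.1, 0.05, 0.025]
```

Schedules of this flavor keep the large steps while they still help generalization and shrink them once cross-validation progress stalls, rather than fixing a decay in advance.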

38 | Efficient search using posterior phone probability estimates
- Renals, Hochberg
- 1995
Citation Context: ...d) likelihoods: 1. Recently, it was observed that the availability of posterior probabilities (before division by priors) allowed a more efficient pruning for large vocabulary speech recognition systems [31]. 2. Given the way they are usually computed, the magnitude of the likelihoods depends on the size of the feature space. On the other hand, a posteriori probabilities are independent of the dimension ...

32 | Estimation of global posteriors and forward-backward training of hybrid HMM/ANN systems
- Hennebert, Ris, et al.
- 1997
Citation Context: ...on of ANN targets using a forward-backward recurrence, and the M-step is the ANN training. This is a generalized EM algorithm since the M-step is not exact. As for standard HMM systems, it is shown in [14] that new forward and backward recurrences can be defined in which the ANN outputs are used to compute and maximize P(X|M)/P(X) = P(M|X)/P(M) (equation 19), also yielding global discrimination. 5 Multi-Str...

26 | A Hybrid Segmental Neural Net/Hidden Markov Model System for Continuous Speech Recognition
- Zavaliagkos, Zhao, et al.
- 1994
Citation Context: ...M-based system using Gaussian mixture estimators could be further improved by smoothing with estimates from an ANN. Similar results have been observed in other laboratories as well. 6. In work at BBN [39], the subsystems were combined in a different way (by taking a list of the most likely N sentences as estimated by a pure HMM system, and reordering them based on phonetic segment probabilities as esti...

19 | Using multiple time scales in a multi-stream speech recognition system
- Dupont, Bourlard
- 1997
Citation Context: ...al sequences (possibly with different frame rates), such as visual and acoustic inputs. Although work has been done on multi-band speech recognition [8] as well as for ASR based on multiple time scales [10], only the multi-band results will be briefly described here. 5.2 Approach: In the following we briefly present the approach presently used to recombine several sources of information represented by diffe...

19 | Improvement in connected digit recognition using linear discriminant analysis and mixture densities
- Haeb-Umbach, Geller, et al.
- 1993
Citation Context: ...ically 3-5 frames in total) with Linear Discriminant Analysis (LDA), which finds a linear transformation that maximizes the between-class variance while minimizing the within-class variance (see, e.g., [13]). The neural network can be seen as a generalization of these approaches that permits arbitrary weights and a nonlinear transformation of the input data. - Discrimination: ANNs can easily accommodate...

18 | Multi-lingual assessment of speaker independent large vocabulary speech-recognition systems: The SQALE project
- Steeneken, Leeuwen
- 1995
Citation Context: ...) has been evaluated under both the North American ARPA program and the European LRE SQALE project (20,000-word vocabulary, speaker-independent continuous speech recognition). In the SQALE evaluation [35], the system was found to perform slightly better than any other leading European system and required an order of magnitude less CPU resources to complete the test. Another striking result is that the ...

17 | Large vocabulary recognition using linked predictive neural networks
- Tebelskis
Citation Context: ...stead are trained (according to an MSE criterion) as an autoregressive (AR) model to predict a feature vector given some previous number of feature vectors and the assumption of a particular HMM state [36][20]. - ANN models of HMMs - Connectionist structures can also be used to represent standard HMM-based algorithms. Examples include the Viterbi network [21], which implements a Viterbi decoder, and th...

16 | Recent improvements to the ABBOT large vocabulary csr system
- Hochberg, Renals, et al.
- 1995
Citation Context: ...the US and Europe), we have focused on using a simple Multilayer Perceptron (MLP) that is illustrated in Figure 4, though similar results have been achieved at other labs with structures such as RNNs [16]. It is deceptively simple, consisting of a single large hidden layer, typically with between 500 and 4000 hidden units that receive input from several hundred acoustic variables (e.g., 9 frames of ac...

13 | Stochastic perceptual speech models with durational dependence
- Bilmes, Morgan, et al.
- 1996
Citation Context: ...ults are also reminiscent of earlier experiments in which we showed that the combination of phone models with models trained to emphasize transitions significantly improved robustness to additive noise [5]. 6 Other Connectionist Approaches: This paper has focused on the hybrid HMM/ANN approach, in which some kind of network (typically an MLP, RBF, RNN, or TDNN) trained for classification by an MSE or rel...

9 | Speaker independent isolated word recognizer using dynamic features of speech spectrum
- Furui
- 1986
Citation Context: ...echanism for incorporating acoustic context into the statistical formulation. Of course, ANNs are not the only way to incorporate such context. Many current systems use first and second time derivatives [11,28] computed over a span of a few frames, allowing very limited acoustical context modeling. Some systems transform a context window of a few adjacent frames (typically 3-5 frames in total) with Linear D...

9 | A neural network based, speaker independent, large vocabulary, continuous speech recognition system: the Wernicke project
- Robinson, Almeida, et al.
- 1993
Citation Context: ...d achieve similar performance as tied-mixture estimators using much more detailed models of context and an order of magnitude more parameters. For further discussion about this with more results, see [33]. In [22], similar conclusions were also drawn for quite a different example, connected digit recognition for a standard TI database. In this case, string error for a moderate-sized MLP (about 11000 pa...

8 | Connected digit recognition using connectionist probability estimators and mixture-Gaussian densities
- Lubensky, Asadi, et al.
- 1994
Citation Context: ...posterior distribution). We and others have performed numerous experiments that have verified these two points. In some of them, a fixed HMM was used and alternate probability estimators were substituted [25,6,30,22]. When these experiments were controlled for the number of parameters, there have been significant improvements using the approaches described here. Some of this quantitative evidence will be briefly su...

7 | Connectionist Speaker Normalization and Its Applications To Speech Recognition
- Huang, Lee, et al.
- 1991
Citation Context: ...eature representations for a speech recognizer. In other designs, researchers have experimented with networks to provide mappings from noisy to clean data [34] or from a new speaker to an old speaker [17]. - Postprocessing - As noted earlier, many researchers have used lattice generation (or N-best utterance lists) as an intermediate step in order to test new processing methods without having to embed...

4 | Improving state-of-the-art continuous speech recognition systems using the n-best paradigm with neural networks
- Austin, Zavaliagkos, et al.
- 1992
Citation Context: ...g likelihood-based system. For instance, in one approach, the combination of HMMs with a neural network (referred to as "segmental neural network") provided some improvements over the original system [2]. In that case, an N-best paradigm is used to generate the N-best utterance hypotheses that are then rescored by a neural network taking complete phonetic segments into account. - Finally, there are t...

4 | On Hidden Markov Models in Isolated Word Recognition
- Poritz, Richter
- 1986
Citation Context: ...echanism for incorporating acoustic context into the statistical formulation. Of course, ANNs are not the only way to incorporate such context. Many current systems use first and second time derivatives [11,28] computed over a span of a few frames, allowing very limited acoustical context modeling. Some systems transform a context window of a few adjacent frames (typically 3-5 frames in total) with Linear D...

2 | Speech recognition using hidden control neural network architecture
- Levin
- 1990
Citation Context: ...d are trained (according to an MSE criterion) as an autoregressive (AR) model to predict a feature vector given some previous number of feature vectors and the assumption of a particular HMM state [36][20]. - ANN models of HMMs - Connectionist structures can also be used to represent standard HMM-based algorithms. Examples include the Viterbi network [21], which implements a Viterbi decoder, and the Al...

1 | Big Dumb Neural Nets (BDNN): a working brute force approach to speech recognition
- Morgan
- 1994
Citation Context: ...ing patterns with very few epochs; the resulting networks can be used to estimate emission probabilities for HMMs in large and difficult tasks in continuous speech recognition. This was demonstrated in [24], where we described a 1.6 million-weight network that was trained on 6 million frames of speech from the Wall Street Journal pilot data. This simple estimator was then used to get 16% error on the 50...

1 | A cepstral noise reduction multi-layer network
- Sorenson
Citation Context: ...Kohonen and collaborators [19], to generate feature representations for a speech recognizer. In other designs, researchers have experimented with networks to provide mappings from noisy to clean data [34] or from a new speaker to an old speaker [17]. - Postprocessing - As noted earlier, many researchers have used lattice generation (or N-best utterance lists) as an intermediate step in order to test n...

1 | Modelling asynchrony in speech using elementary single-signal decomposition
- Tomlinson, Russell, et al.
- 1997
Citation Context: ...cently, a similar approach was successfully used to merge acoustic streams with different time-scale properties (e.g., respectively capturing phonetic and syllabic dynamics) [10]. In other experiments [37], it was also shown that a similar approach could also be used to better capture the possible asynchrony between frequency bands. These multi-stream results are also reminiscent of earlier experiments...
