## The application of hidden Markov models in speech recognition

Venue: Foundations and Trends in Signal Processing

Citations: 35 (7 self)

### BibTeX

@INPROCEEDINGS{Gales_theapplication,
    author = {Mark Gales and Steve Young},
    title = {The application of hidden Markov models in speech recognition},
    booktitle = {Foundations and Trends in Signal Processing},
    year = {2007}
}

### Abstract

Hidden Markov Models (HMMs) provide a simple and effective framework for modelling time-varying spectral vector sequences. As a consequence, almost all present-day large vocabulary continuous speech recognition (LVCSR) systems are based on HMMs. Whereas the basic principles underlying HMM-based LVCSR are rather straightforward, the approximations and simplifying assumptions involved in a direct implementation of these principles would result in a system which has poor accuracy and unacceptable sensitivity to changes in operating environment. Thus, the practical application of HMMs in modern systems involves considerable sophistication. The aim of this review is first to present the core architecture of an HMM-based LVCSR system and then to describe the various refinements which are needed to achieve state-of-the-art performance. These refinements include feature projection, improved covariance modelling, discriminative parameter estimation, adaptation and normalisation, noise compensation and multi-pass system combination. The review concludes with a case study of LVCSR for Broadcast News and Conversation transcription in order to illustrate the techniques described.
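As a rough illustration of the decoding step that an HMM-based recogniser builds on, the following minimal Python sketch runs the Viterbi algorithm over a small discrete HMM. It is not taken from the review: the function name `viterbi`, the argument layout, and the two-state toy model are assumptions chosen for illustration, and a practical LVCSR decoder searches a far larger space with pruning.

```python
import math


def viterbi(log_init, log_trans, log_emit):
    """Return (best_path, best_log_prob) for one observation sequence.

    log_init[j]     : log P(first state = j)
    log_trans[i][j] : log P(next state = j | current state = i)
    log_emit[t][j]  : log p(observation at frame t | state j)
    """
    n_states = len(log_init)
    # delta[j] is the best log score of any path ending in state j
    # at the current frame.
    delta = [log_init[j] + log_emit[0][j] for j in range(n_states)]
    backptr = []  # backptr[t-1][j] = best predecessor of state j at frame t

    for t in range(1, len(log_emit)):
        prev = delta
        delta, ptr = [], []
        for j in range(n_states):
            best_i = max(range(n_states),
                         key=lambda i: prev[i] + log_trans[i][j])
            ptr.append(best_i)
            delta.append(prev[best_i] + log_trans[best_i][j] + log_emit[t][j])
        backptr.append(ptr)

    # Trace back from the best final state to recover the state sequence.
    state = max(range(n_states), key=lambda j: delta[j])
    best_log_prob = delta[state]
    path = [state]
    for ptr in reversed(backptr):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path, best_log_prob


# Toy model (hypothetical numbers): two states, three frames.
init = [0.6, 0.4]
trans = [[0.7, 0.3],
         [0.4, 0.6]]
# emit[t][j] is the likelihood of the frame-t observation under state j.
emit = [[0.5, 0.1],
        [0.4, 0.3],
        [0.1, 0.6]]

path, log_prob = viterbi(
    [math.log(p) for p in init],
    [[math.log(p) for p in row] for row in trans],
    [[math.log(p) for p in row] for row in emit],
)
print(path, math.exp(log_prob))
```

Working in the log domain avoids numerical underflow on long utterances, which is why the sketch adds log scores rather than multiplying probabilities.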

### Citations

117 | Speech recognition with dynamic Bayesian networks
- Zweig
- 1998

Citation Context ...n in Figure 3.1. Also shown in Figure 3.1 is an alternative, complementary, graphical representation called a dynamic Bayesian network (DBN) which emphasises the conditional dependencies of the model [17, 200] and which is particularly useful for describing a variety of extensions to the basic HMM structure. In the DBN notation used here, squares denote discrete variables; circles continuous variables; sha...

107 | Token Passing: a Simple Conceptual Model for Connected Speech Recognition Systems
- Young, Russell, et al.
- 1989

Citation Context ... space. To deal with this, a number of different architectural approaches have evolved. For Viterbi decoding, the search space can either be constrained by maintaining multiple hypotheses in parallel [173, 191, 192] or it can be expanded dynamically as the search progresses [7, 69, 130, 132]. Alternatively, a completely different approach can be taken where the breadth-first approach of the Viterbi algorithm is ...

28 | Improved Discriminative Training using Phone Lattices
- Zheng, Stolcke
- 2005

Citation Context ...umber of frames. This can cause generalisation issues. To address this, minimum phone frame error (MPFE) may be used where the phone loss is weighted by the number of frames associated with each frame [198]. This is the same as the Hamming distance described in [166]. It is also possible to base the loss function on the specific task for which the classifier is being built [22]....

16 | Boosting Gaussian mixtures in an LVCSR system
- Zweig
- 2000

Citation Context ... scheme is not directly applicable given the potentially vast number of classes in speech recognition. Despite these issues, there have been a number of applications of boosting to speech recognition [37, 121, 154, 201]. An alternative approach, again based on weighting the data, is to modify the form of the decision tree so that leaves are concentrated in regions where the classifier performs poorly [19]. Although ...

15 | On using MLP features
- Zhu, Chen, et al.

Citation Context ...oc approach to determining the systems to be combined. Initially a number of candidates are constructed. For example: triphone and quinphone models; SAT and GD models; MFCC, PLP, MLP-based posteriors [199] and Gaussianised front-ends; multiple and single pronunciation dictionaries; random decision trees [162]. On a held-out test set, the performance of the combinations is evaluated and the “best” (possi...

11 | A Viterbi algorithm for a trajectory model derived from HMM with explicit relationship between static and dynamic features
- Zen, Tokuda, et al.
- 2004

Citation Context ...ajor issues with both these improvements is that the Viterbi algorithm cannot be directly used with the model to obtain the state sequence. A frame-delayed version of the Viterbi algorithm can be used [196] to find the state sequence. However, this is still more expensive than the standard Viterbi algorithm. HMM trajectory models can also be used for speech recognition [169]. This again makes use of the...

9 | Discriminative cluster adaptive training
- Yu, Gales
- 2006

Citation Context ...tyle approach. Kernelised versions of both EigenVoices [113] and Eigen-MLLR [112] have also been studied. Finally, the use of discriminative criteria to obtain the cluster parameters has been derived [193]. 5.4 Maximum a Posteriori (MAP) Adaptation Rather than hypothesising a form of transformation to represent the differences between speakers, it is possible to use standard statistical approaches to o...

9 | Bayesian adaptive inference and adaptive training
- Yu, Gales
- 2007

Citation Context ...e simply rescored [133]. An alternative use of lattices is to obtain confidence scores, which may then be used for confidence-based MLLR [172]. A different approach using N-best lists was proposed in [118, 194] whereby a separate transform was estimated for each hypothesis and used to rescore only that hypothesis. In [194], this is shown to be a lower-bound approximation where the transform parameters are c...

9 | Discriminatively trained region dependent feature transforms for speech recognition
- Zhang, Matsoukas, et al.
- 2006

Citation Context ...elled [54, 67, 93, 95], rather than using Bayes’ rule as above to obtain the posterior of the correct word sequence. In addition, these discriminative criteria can be used to train feature transforms [138, 197] and model parameter transforms [159] that are dependent on the observations making them time varying. 4.1.1 Maximum Mutual Information One of the first discriminative training criteria to be explored...

3 | Unsupervised Training with Directed Manual Transcription for Recognising Mandarin Broadcast Audio
- Yu, Gales, et al.
- 2007

Citation Context ... can be whilst still obtaining worthwhile gains from discriminative training. This problem can be partly overcome by using the recognition output to guide the selection of data to manually transcribe [195]. 5 Adaptation and Normalisation A fundamental idea in statistical pattern classification is that the training data should adequately represent the test data, otherwise a mismatch will occur and recog...

2 | The use of syntax and multiple alternatives in the VODIS voice operated database inquiry system. Computer Speech and Language
- Young, Russell, et al.
- 1991

Citation Context ... space. To deal with this, a number of different architectural approaches have evolved. For Viterbi decoding, the search space can either be constrained by maintaining multiple hypotheses in parallel [173, 191, 192] or it can be expanded dynamically as the search progresses [7, 69, 130, 132]. Alternatively, a completely different approach can be taken where the breadth-first approach of the Viterbi algorithm is ...