## Context-Dependent Pre-trained Deep Neural Networks for Large Vocabulary Speech Recognition (2012)

Venue: IEEE Transactions on Audio, Speech, and Language Processing

Citations: 78 (35 self)

### BibTeX

@ARTICLE{Dahl12context-dependentpre-trained,
  author = {George E. Dahl and Dong Yu and Li Deng and Alex Acero},
  title = {Context-Dependent Pre-trained Deep Neural Networks for Large Vocabulary Speech Recognition},
  journal = {IEEE Transactions on Audio, Speech, and Language Processing},
  year = {2012}
}

### Abstract

We propose a novel context-dependent (CD) model for large vocabulary speech recognition (LVSR) that leverages recent advances in using deep belief networks for phone recognition. We describe a pre-trained deep neural network hidden Markov model (DNN-HMM) hybrid architecture that trains the DNN to produce a distribution over senones (tied triphone states) as its output. The deep belief network pre-training algorithm is a robust and often helpful way to initialize deep neural networks generatively, which can aid optimization and reduce generalization error. We illustrate the key components of our model, describe the procedure for applying CD-DNN-HMMs to LVSR, and analyze the effects of various modeling choices on performance. Experiments on a challenging business search dataset demonstrate that CD-DNN-HMMs can significantly outperform conventional context-dependent Gaussian mixture model (GMM)-HMMs, with absolute sentence accuracy improvements of 5.8% and 9.2% (relative error reductions of 16.0% and 23.2%) over CD-GMM-HMMs trained using the minimum phone error (MPE) and maximum likelihood (ML) criteria, respectively.

Index Terms: Speech recognition, deep belief network, context-dependent phone, LVSR, DNN-HMM, ANN-HMM
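As a quick back-of-the-envelope check on the abstract's numbers, an absolute accuracy gain together with a relative error reduction implies the baseline sentence-error rate. The sketch below derives those implied baselines; the function name and the derived values are not stated in the excerpt, only implied by its figures.

```python
def implied_baseline_error(abs_accuracy_gain, relative_error_reduction):
    """If sentence accuracy rises by abs_accuracy_gain points and that gain
    equals relative_error_reduction (a fraction) of the baseline error,
    the baseline sentence-error rate (in points) is gain / fraction."""
    return abs_accuracy_gain / relative_error_reduction

# 5.8-point gain at 16.0% relative reduction -> MPE baseline error ~36.25%
mpe_error = implied_baseline_error(5.8, 0.160)
# 9.2-point gain at 23.2% relative reduction -> ML baseline error ~39.66%
ml_error = implied_baseline_error(9.2, 0.232)
```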

### Citations

754 | Learning representations by back-propagating errors - Rumelhart, Hinton, et al. - 1986

Citation Context: ...ights resulting from the unsupervised pre-training algorithm to initialize the weights of a deep, but otherwise standard, feed-forward neural network and then simply use the backpropagation algorithm [61] to fine-tune the network weights with respect to a supervised criterion. Pre-training followed by stochastic gradient descent is our method of choice for training deep neural networks because it ofte...

521 | Training products of experts by minimizing contrastive divergence - Hinton

Citation Context: ...so we are forced to use an approximation. Since RBMs are in the intersection between Boltzmann machines and product of experts models, they can be trained using contrastive divergence as described in [67]. The one-step contrastive divergence approximation for the gradient w.r.t. the visible-hidden weights is

−∂ℓ(θ)/∂w_ij ≈ ⟨v_i h_j⟩_data − ⟨v_i h_j⟩_1 (11)

where ⟨·⟩_1 denotes the expectation over one-step re...
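To make equation (11) concrete, here is a minimal NumPy sketch of the CD-1 weight-gradient estimate for a Bernoulli-Bernoulli RBM. The function name, layer sizes, and batch are illustrative assumptions, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_weight_gradient(v0, W, b, c):
    """One-step contrastive-divergence estimate of <v_i h_j>_data - <v_i h_j>_1
    (equation (11)), averaged over a batch of visible vectors v0."""
    h0 = sigmoid(v0 @ W + c)                       # positive phase: p(h|v0)
    h_samp = (rng.random(h0.shape) < h0).astype(float)  # sample hidden units
    v1 = sigmoid(h_samp @ W.T + b)                 # one-step reconstruction
    h1 = sigmoid(v1 @ W + c)                       # negative phase: p(h|v1)
    return (v0.T @ h0 - v1.T @ h1) / v0.shape[0]

# Hypothetical sizes: 6 visible units, 4 hidden units, batch of 5.
W = 0.01 * rng.standard_normal((6, 4))
b, c = np.zeros(6), np.zeros(4)
v0 = (rng.random((5, 6)) < 0.5).astype(float)
grad = cd1_weight_gradient(v0, W, b, c)
```

In practice this estimate is added (scaled by a learning rate) to `W`; analogous one-step statistics update the biases `b` and `c`.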

462 | Connectionist speech recognition: a hybrid approach - Bourlard, Morgan - 1994

Citation Context: ... estimated from the training set, and p(xt) is independent of the word sequence and thus can be ignored. Although dividing by the prior probability p(qt) (called scaled likelihood estimation by [38], [40], [41]) may not give improved recognition accuracy under some conditions, we have found it to be very important in alleviating the label bias problem, especially when the training utterances contain l...
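The scaled-likelihood trick in this snippet amounts to dividing DNN state posteriors by the state priors before using them as HMM emission scores. A minimal sketch (the function name and the toy numbers are illustrative assumptions):

```python
import numpy as np

def scaled_log_likelihood(log_posterior, log_prior):
    """Since p(x_t|q_t) = p(q_t|x_t) p(x_t) / p(q_t) and p(x_t) does not
    depend on the word sequence, score HMM states with
    log p(q_t|x_t) - log p(q_t)."""
    return log_posterior - log_prior

posteriors = np.array([0.7, 0.2, 0.1])  # hypothetical DNN outputs p(q|x_t)
priors = np.array([0.5, 0.3, 0.2])      # state priors from the training set
scores = scaled_log_likelihood(np.log(posteriors), np.log(priors))
```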

457 | A fast learning algorithm for deep belief nets - Hinton, Osindero, et al.

Citation Context: ...dden layers. The resulting deep belief nets learn a hierarchy of nonlinear feature detectors that can capture complex statistical patterns in data. The deep belief net training algorithm suggested in [24] first initializes the weights of each layer individually in a purely unsupervised way and then fine-tunes the entire network using labeled data. This semi-supervised approach using deep models has p...

348 | Reducing the Dimensionality of Data with Neural Networks - Hinton, Salakhutdinov

Citation Context: ...repeated. The regularization effect from using information in the distribution of inputs can allow highly expressive models to be trained on comparably small quantities of labeled data. Additionally, [34], [33], and others have also reported experimental evidence consistent with pre-training aiding the subsequent optimization, typically performed by stochastic gradient descent. Thus, pre-trained neura...

278 | Hidden Markov Models for Speech Recognition - Huang, Ariki, et al. - 1990

Citation Context: ...ever, we do not take that approach, but instead we try to improve the earlier hybrid approaches by replacing more traditional neural nets with deeper, pre-trained neural nets and by using the senones [48] (tied triphone states) of a GMM-HMM triphone model as the output units of the neural network, in line with state-of-the-art HMM systems. Although this work uses the hybrid approach, as alluded to abo...

248 | Information processing in dynamical systems: Foundations of harmony theory - Smolensky - 1986

Citation Context: ...unable to show that our proposed system performs well because of some sort of avoidance of the potential issues we discuss above. A. Restricted Boltzmann Machines Restricted Boltzmann Machines (RBMs) [66] are a type of undirected graphical model constructed from a layer of binary stochastic hidden units and a layer of stochastic visible units that, for the purposes of this work, will either be Bernoul...

180 | Minimum phone error and I-smoothing for improved discriminative training - Povey, Woodland - 2002

Citation Context: ...in discriminative training (see an overview in [4]; e.g., maximum mutual information (MMI) estimation [5], minimum classification error (MCE) training [6], [7], and minimum phone error (MPE) training [8], [9]), in large-margin techniques (such as large margin estimation [10], [11], large margin hidden Markov model (HMM) [12], large-margin MCE [13]–[16], and boosted MMI [17]), as well as in novel acou...

141 | Tandem connectionist feature extraction for conventional HMM systems - Hermansky, Ellis, et al. - 2000

Citation Context: ...he-art HMM systems. Although this work uses the hybrid approach, as alluded to above, much recent work using neural networks in acoustic modeling uses the so-called TANDEM approach, first proposed in [49]. The TANDEM approach augments the input to a GMM-HMM system with features derived from the suitably transformed output of one or more neural networks, typically trained to produce distributions over ...

131 | Minimum Classification Error Rate Methods for Speech Recognition - Juang, Chou, et al. - 1997

Citation Context: ...]). There have been some notable recent advances in discriminative training (see an overview in [4]; e.g., maximum mutual information (MMI) estimation [5], minimum classification error (MCE) training [6], [7], and minimum phone error (MPE) training [8], [9]), in large-margin techniques (such as large margin estimation [10], [11], large margin hidden Markov model (HMM) [12], large-margin MCE [13]–[16]...

116 | Discriminative training for large vocabulary speech recognition - Povey - 2003

Citation Context: ...scriminative training (see an overview in [4]; e.g., maximum mutual information (MMI) estimation [5], minimum classification error (MCE) training [6], [7], and minimum phone error (MPE) training [8], [9]), in large-margin techniques (such as large margin estimation [10], [11], large margin hidden Markov model (HMM) [12], large-margin MCE [13]–[16], and boosted MMI [17]), as well as in novel acoustic ...

115 | A unified architecture for natural language processing: Deep neural networks with multitask learning - Collobert, Weston - 2008

110 | What is the best multi-stage architecture for object recognition - Jarrett, Kavukcuoglu, et al. - 2009

Citation Context: ...sing labeled data. This semi-supervised approach using deep models has proved effective in a number of applications, including coding and classification for speech, audio, text, and image data ([25]–[29]). These advances triggered interest in developing acoustic models based on pre-trained neural networks and other deep learning techniques for ASR. For example, context-independent pre-trained, deep ne...

109 | Mean and Variance Adaptation within the MLLR Framework (Computer Speech and Language, Vol. 10) - Gales - 1996

Citation Context: ...ining phase. Inspiration for such algorithms may come from the ANN-HMM literature (e.g. [72], [73]) or the many successful adaptation techniques developed in the past decades for GMM-HMMs (e.g., MLLR [74], MAP [75], joint compensation of distortions [76], variable parameter HMMs [77]). Third, the training in this study used the embedded Viterbi algorithm, which is not optimal. We believe additional im...

103 | Exponential family harmoniums with an application to information retrieval - Welling, Rosen-Zvi, et al. - 2005

Citation Context: ...ition the acoustic input is typically represented with real-valued feature vectors. The Gaussian-Bernoulli restricted Boltzmann machine (GRBM) only requires a slight modification of equation (3) (see [68] for a generalization of RBMs to any distribution in the exponential family). The GRBM energy function we use in this work is given by

E(v, h) = (1/2)(v − b)^T (v − b) − c^T h − v^T W h. (12)

Note that ...
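Equation (12) is straightforward to evaluate directly. The sketch below computes the GRBM energy for a tiny hypothetical configuration (the vector sizes and values are illustrative; the unit-variance assumption matches the equation as written):

```python
import numpy as np

def grbm_energy(v, h, W, b, c):
    """Gaussian-Bernoulli RBM energy of equation (12), with unit-variance
    visible units: E(v,h) = 1/2 (v-b)^T (v-b) - c^T h - v^T W h."""
    return 0.5 * (v - b) @ (v - b) - c @ h - v @ W @ h

v = np.array([1.0, 0.0])    # hypothetical real-valued visible vector
h = np.array([1.0])         # one binary hidden unit, switched on
W = np.array([[1.0], [0.0]])
b = np.zeros(2)
c = np.zeros(1)
energy = grbm_energy(v, h, W, b, c)  # 0.5*1 - 0 - 1 = -0.5
```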

102 | Semantic Hashing - Salakhutdinov, Hinton - 2007

90 | Learning deep architectures for AI (Foundations and Trends in Machine Learning) - Bengio

Citation Context: ... that might be capable of learning rich, distributed representations of their input is also based on formal and informal arguments by other researchers in the machine learning community. As argued in [62] and [63], insufficiently deep architectures can require an exponential blow-up in the number of computational elements needed to represent certain functions satisfactorily. Thus one primary motivatio...

83 | Hidden conditional random fields for phone classification - Gunawardana, Mahajan, et al. - 2005

Citation Context: ...CRFs [21], [22], and segmental CRFs [23]). Despite these advances, the elusive goal of human level accuracy in real-world conditions requires continued, vibrant research. Recently, a major advance has been made...

80 | Scaling Learning Algorithms towards AI - Bengio, LeCun - 2007

Citation Context: ...ht be capable of learning rich, distributed representations of their input is also based on formal and informal arguments by other researchers in the machine learning community. As argued in [62] and [63], insufficiently deep architectures can require an exponential blow-up in the number of computational elements needed to represent certain functions satisfactorily. Thus one primary motivation for usi...

79 | Extracting and composing robust features with denoising autoencoders - Vincent, Larochelle, et al. - 2008

Citation Context: ...e especially pronounced in deep autoencoders. Deep belief network pre-training was the first pre-training method to be widely studied, although many other techniques now exist in the literature (e.g. [35]). After [34] showed that deep auto-encoders could be trained effectively using deep belief net pre-training, there was a resurgence of interest in using deeper neural networks for applications. Altho...

67 | Acoustic modeling using deep belief networks - Mohamed, Dahl, et al. - 2012

Citation Context: ...t-independent phone and context class used previously in hybrid architectures. This second difference also distinguishes our work from earlier uses of DNN-HMM hybrids for phone recognition [30]–[32], [59]. Note that [59], which also appears in this issue, is the context-independent version of our approach and builds the foundation for our work. The work in this paper focuses on context-dependent DNN-H...

62 | Connectionist probability estimation in HMM speech recognition - Renals, Morgan - 1992

Citation Context: ...ich show substantial improvements in recognition accuracy for a difficult LVSR task over discriminatively-trained pure CD-GMM-HMM systems. Our work differs from earlier context-dependent ANN-HMMs [42], [41] in two key respects. First, we used deeper, more expressive neural network architectures and thus employed the unsupervised DBN pre-training algorithm to make sure training would be effective. Second...

53 | Continuous speech recognition using multilayer perceptrons with hidden Markov models - Morgan, Bourlard - 1990

Citation Context: ...iscriminative nature, ANN-HMMs have two additional advantages: the training can be performed using the embedded Viterbi algorithm and the decoding is generally quite efficient. Most early work (e.g., [39], [38]) on the hybrid approach used context-independent phone states as labels for ANN training and considered small vocabulary tasks. ANN-HMMs were later extended to model context-dependent phones and...

51 | Large Margin Gaussian Mixture Modeling for Phonetic Classification and Recognition - Sha, Saul

Citation Context: ...ication error (MCE) training [6], [7], and minimum phone error (MPE) training [8], [9]), in large-margin techniques (such as large margin estimation [10], [11], large margin hidden Markov model (HMM) [12], large-margin MCE [13]–[16], and boosted MMI [17]), as well as in novel acoustic models (such as conditional random fields (CRFs) [18]–[20], hidden...

46 | Why does unsupervised pre-training help deep learning - Erhan, Bengio, et al.

Citation Context: ...2] and have achieved very competitive performance. Using pre-training to initialize the weights of a deep neural network has two main potential benefits that have been discussed in the literature. In [33], evidence was presented that is consistent with viewing pre-training as a peculiar sort of data-dependent regularizer whose effect on generalization error does not diminish with more data, even when t...

43 | Deep belief networks for phone recognition - Mohamed, Dahl, et al. - 2009

Citation Context: ...eural networks and other deep learning techniques for ASR. For example, context-independent pre-trained, deep neural network HMM hybrid architectures have recently been proposed for phone recognition [30]–[32] and have achieved very competitive performance. Using pre-training to initialize the weights of a deep neural network has two main potential benefits that have been discussed in the literature. ...

39 | MMI training for continuous phoneme recognition on the TIMIT database - Kapadia, Valtchev, et al. - 1993

Citation Context: ...ags behind human level performance (e.g., [2], [3]). There have been some notable recent advances in discriminative training (see an overview in [4]; e.g., maximum mutual information (MMI) estimation [5], minimum classification error (MCE) training [6], [7], and minimum phone error (MPE) training [8], [9]), in large-margin techniques (such as large margin estimation [10], [11], large margin hidden Ma...

39 | Modeling pixel means and covariance using factorized third-order Boltzmann machines - Ranzato, Hinton - 2010

Citation Context: ... conditional distribution over the input space given the latent state (as diagonal covariance GMMs also do). A more powerful first layer model, namely the mean-covariance restricted Boltzmann machine [84], significantly enhanced the performance of context-independent DNN-HMMs for phone recognition in [32]. We therefore view applying similar models to LVSR as an enticing area of future work...

36 | 3D Object Recognition with Deep Belief Nets - Nair, Hinton - 2009

Citation Context: ...ork using labeled data. This semi-supervised approach using deep models has proved effective in a number of applications, including coding and classification for speech, audio, text, and image data ([25]–[29]). These advances triggered interest in developing acoustic models based on pre-trained neural networks and other deep learning techniques for ASR. For example, context-independent pre-trained, de...

35 | Using MLP features in SRI’s conversational speech recognition system - Zhu, Stolcke, et al. - 2005

34 | Phone recognition with the mean-covariance restricted Boltzmann machine - Dahl, Ranzato, et al. - 2010

Citation Context: ... networks and other deep learning techniques for ASR. For example, context-independent pre-trained, deep neural network HMM hybrid architectures have recently been proposed for phone recognition [30]–[32] and have achieved very competitive performance. Using pre-training to initialize the weights of a deep neural network has two main potential benefits that have been discussed in the literature. In [3...

34 | Continuous speech recognition by connectionist statistical methods - Bourlard, Morgan - 1993

Citation Context: ...osed in the literature (see the comprehensive survey in [37]). Among these techniques, the ones most relevant to this work are those that use the ANNs to estimate the HMM state-posterior probabilities [38]–[45], which have been referred to as ANN-HMM hybrid models in the literature. In these ANN-HMM hybrid architectures, each output unit of the ANN is trained to estimate the posterior probability of a c...

33 | Investigation of full-sequence training of deep belief networks - Mohamed, Yu, et al. - 2010

Citation Context: ...ing an objective function based on the full sequence, as we have already demonstrated on the TIMIT dataset with some success [31]. In addition, we view the treatment of the time dimension of speech by DNN-HMM and GMM-HMMs alike as a very crude way of dealing with the intricate temporal properties of speech. The weaknesses in ho...

32 | Discriminative learning in sequential pattern recognition: a unifying review for optimization-oriented speech recognition - He, Deng, et al.

Citation Context: ...ech recognition (ASR) systems in real usage scenarios lags behind human level performance (e.g., [2], [3]). There have been some notable recent advances in discriminative training (see an overview in [4]; e.g., maximum mutual information (MMI) estimation [5], minimum classification error (MCE) training [6], [7], and minimum phone error (MPE) training [8], [9]), in large-margin techniques (such as lar...

31 | Connectionist speech recognition of broadcast news - Robinson, Cook, et al. - 2002

Citation Context: ...in the literature (see the comprehensive survey in [37]). Among these techniques, the ones most relevant to this work are those that use the ANNs to estimate the HMM state-posterior probabilities [38]–[45], which have been referred to as ANN-HMM hybrid models in the literature. In these ANN-HMM hybrid architectures, each output unit of the ANN is trained to estimate the posterior probability of a contin...

30 | Structured speech modeling - Deng, Yu, et al. - 2006

29 | Estimation of global posteriors and forward-backward training of hybrid HMM/ANN systems - Hennebert, Ris, et al. - 1997

29 | The curse of highly variable functions for local kernel machines - Bengio, Delalleau, et al. - 2006

Citation Context: ...ioning of the input space and use separately parameterized simple models for each region are doomed to have similar generalization issues when trained on rapidly varying functions. In a related vein, [65] also proves an analogous “curse of rapidly-varying functions” for a large class of local kernel machines that include both supervised learning algorithms (e.g., SVMs with Gaussian kernels) and many s...

29 | A segmental CRF approach to large vocabulary continuous speech recognition - Zweig, Nguyen

Citation Context: ...onsuming compared to training CD-GMM-HMMs). Performance on this task was evaluated using sentence accuracy (SA) instead of word accuracy for a variety of reasons. In order to compare our results with [70], we would need to compute sentence accuracy anyway. The average sentence length is 2.1 tokens, so sentences are typically quite short. Also, the users care most about whether they can find the busine...

28 | On Adaptive Decision Rules and Decision Parameter Adaptation for Automatic Speech Recognition - Lee, Huo - 2000

Citation Context: ...e. Inspiration for such algorithms may come from the ANN-HMM literature (e.g. [72], [73]) or the many successful adaptation techniques developed in the past decades for GMM-HMMs (e.g., MLLR [74], MAP [75], joint compensation of distortions [76], variable parameter HMMs [77]). Third, the training in this study used the embedded Viterbi algorithm, which is not optimal. We believe additional improvement ...

27 | Discriminative training for large vocabulary speech recognition using minimum classification error - McDermott, Hazen, et al. - 2007

Citation Context: ...here have been some notable recent advances in discriminative training (see an overview in [4]; e.g., maximum mutual information (MMI) estimation [5], minimum classification error (MCE) training [6], [7], and minimum phone error (MPE) training [8], [9]), in large-margin techniques (such as large margin estimation [10], [11], large margin hidden Markov model (HMM) [12], large-margin MCE [13]–[16], and...

27 | Large-margin minimum classification error training for large-scale speech recognition tasks - Yu, Deng, et al.

Citation Context: ... [6], [7], and minimum phone error (MPE) training [8], [9]), in large-margin techniques (such as large-margin estimation [10], [11], large-margin hidden Markov model (HMM) [12], large-margin MCE [13]–[16], and boosted MMI [17]), as well as in novel acoustic models (such as conditional random...

25 | Probabilistic and bottle-neck features for LVCSR of meetings - Grezl, Karafiat, et al. - 2007

Citation Context: ...ut to a GMM-HMM system with features derived from the suitably transformed output of one or more neural networks, typically trained to produce distributions over monophone targets. In a similar vein, [50] uses features derived from an earlier “bottle-neck” hidden layer instead of using the neural network outputs directly. Many recent papers (e.g. [51]–[54]) train neural networks on LVSR datasets (ofte...

25 | Deep learning via Hessian-free optimization - Martens - 2010

Citation Context: ...e gradient algorithm. Both [56] and [57] investigate why training deep feed-forward neural networks can often be easier with some form of pre-training or a sophisticated optimizer of the sort used in [58]. Since the time of the early hybrid architectures, the vector processing capabilities of modern GPUs and the advent of more effective training algorithms for deep neural nets have made much more powe...

24 | Roles of pre-training and fine-tuning in context-dependent DBN-HMMs for real-world speech recognition - Yu, Deng, et al. - 2010

Citation Context: ...sed fine-tuning phase requires labeled data, we can potentially leverage a large quantity of unlabeled data during pre-training, although this capability is not yet important for our LVSR experiments [69] due to the abundance of weakly supervised data. III. CD-DNN-HMM Hidden Markov models (HMMs) have been the dominant technique for LVSR for at least two decades. An HMM is a generative model in which t...

23 | Live search for mobile: Web services by voice on the cellphone - Acero, Bernstein, et al. - 2008

Citation Context: ...DNN-HMMs can significantly outperform strong discriminatively-trained context-dependent Gaussian mixture model hidden Markov model (CD-GMM-HMM) baselines on the challenging business search dataset of [36], collected under actual usage conditions. To our best knowledge, this is the first time DNN-HMMs, which are formerly only used for phone recognition, are successfully applied to large vocabulary spee...

23 | Speech recognition using neural networks with forward-backward probability generated targets - Yan, Fanty, et al. - 1997

22 | Speech recognition using augmented conditional random fields - Hifny, Renals - 2009

Citation Context: ...in estimation [10], [11], large margin hidden Markov model (HMM) [12], large-margin MCE [13]–[16], and boosted MMI [17]), as well as in novel acoustic models (such as conditional random fields (CRFs) [18]–[20], hidden...

22 | CDNN: a context dependent neural network for continuous speech recognition - Bourlard, Morgan, et al. - 1992

Citation Context: ...ed to mid-vocabulary and some large vocabulary ASR tasks (e.g. in [45], which also employed recurrent neural architectures). However, in earlier work on context-dependent ANN-HMM hybrid architectures [46], the posterior probability of the context-dependent phone was modeled as either

p(s_i, c_j | x_t) = p(s_i | x_t) p(c_j | s_i, x_t) (1)

or

p(s_i, c_j | x_t) = p(c_j | x_t) p(s_i | c_j, x_t), (2)

where x_t is the acoustic observatio...
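Both factorizations (1) and (2) recover the same joint posterior when the marginals and conditionals are consistent. The sketch below checks this on a toy 2-state, 2-context joint table (the numbers are illustrative, not from the paper):

```python
import numpy as np

# Toy joint posterior p(s_i, c_j | x_t) for one frame: rows index states s_i,
# columns index context classes c_j.
joint = np.array([[0.10, 0.20],
                  [0.30, 0.40]])

p_s = joint.sum(axis=1)              # p(s_i | x_t)
p_c = joint.sum(axis=0)              # p(c_j | x_t)
p_c_given_s = joint / p_s[:, None]   # p(c_j | s_i, x_t)
p_s_given_c = joint / p_c[None, :]   # p(s_i | c_j, x_t)

eq1 = p_s[:, None] * p_c_given_s     # equation (1)
eq2 = p_c[None, :] * p_s_given_c     # equation (2)
```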