An Empirical Study of Learning Rates in Deep Neural Networks for Speech Recognition, ICASSP, 2013

by A Senior, G Heigold, M-A Ranzato, K Yang

Results 1 - 10 of 13

Large scale deep neural network acoustic modeling with semi-supervised training data for YouTube video transcription

by Hank Liao, Erik McDermott, Andrew Senior (Google Inc.) - in Workshop on Automatic Speech Recognition and Understanding (ASRU), 2013
"... YouTube is a highly visited video sharing website where over one billion people watch six billion hours of video every month. Im-proving accessibility to these videos for the hard of hearing and for search and indexing purposes is an excellent application of automatic speech recognition. However, Yo ..."
Abstract - Cited by 12 (2 self) - Add to MetaCart
YouTube is a highly visited video sharing website where over one billion people watch six billion hours of video every month. Improving accessibility to these videos for the hard of hearing and for search and indexing purposes is an excellent application of automatic speech recognition. However, YouTube videos are extremely challenging for automatic speech recognition systems. Standard adapted Gaussian Mixture Model (GMM) based acoustic models can have word error rates above 50%, making this one of the most difficult reported tasks. Since 2009 YouTube has provided automatic generation of closed captions for videos detected to have English speech; the service now supports ten different languages. This article describes recent improvements to the original system, in particular the use of owner-uploaded video transcripts to generate additional semi-supervised training data and deep neural network acoustic models with large state inventories. Applying an “island of confidence” filtering heuristic to select useful training segments, and increasing the model size by using 44,526 context dependent states with a low-rank final layer weight matrix approximation, improved performance by about 13% relative compared to previously reported sequence trained DNN results for this task. Index Terms — Large vocabulary speech recognition, deep neural networks, deep learning, audio indexing.

Citation Context

...filterbank energies, 10 to the left and 10 to the right of the center frame to yield an 840 dimensional input vector, computed on 25ms windows with a 10ms step. On the basis of previous experimentation [14] we use an exponentially decaying learning rate schedule whereby the initial learning rate of 0.1 decays by a factor of 10 every 1.5 billion frames. The minibatch size is fixed at 200 and a constant m...
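
The schedule quoted in this snippet (an initial learning rate of 0.1 divided by 10 after every 1.5 billion training frames) amounts to a step-wise exponential decay. Below is a minimal sketch built from the numbers in the snippet, not the authors' implementation.

    def stepwise_exponential_lr(frames_seen, initial_lr=0.1,
                                decay_factor=10.0, decay_interval=1.5e9):
        """Learning rate after `frames_seen` training frames: divide the initial
        rate by `decay_factor` once per completed `decay_interval` of frames.
        The default constants are the values quoted in the citation context."""
        completed_decays = int(frames_seen // decay_interval)
        return initial_lr / (decay_factor ** completed_decays)

    # Example: after 3 billion frames the rate has been divided by 10 twice.
    assert abs(stepwise_exponential_lr(3e9) - 1e-3) < 1e-9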

DEEP MIXTURE DENSITY NETWORKS FOR ACOUSTIC MODELING IN STATISTICAL PARAMETRIC SPEECH SYNTHESIS

by Heiga Zen, Andrew Senior
"... Statistical parametric speech synthesis (SPSS) using deep neural net-works (DNNs) has shown its potential to produce naturally-sounding synthesized speech. However, there are limitations in the current im-plementation of DNN-based acoustic modeling for speech synthesis, such as the unimodal nature o ..."
Abstract - Cited by 10 (1 self) - Add to MetaCart
Statistical parametric speech synthesis (SPSS) using deep neural networks (DNNs) has shown its potential to produce naturally-sounding synthesized speech. However, there are limitations in the current implementation of DNN-based acoustic modeling for speech synthesis, such as the unimodal nature of its objective function and its lack of ability to predict variances. To address these limitations, this paper investigates the use of a mixture density output layer. It can estimate full probability density functions over real-valued output features conditioned on the corresponding input features. Experimental results in objective and subjective evaluations show that the use of the mixture density output layer improves the prediction accuracy of acoustic features and the naturalness of the synthesized speech. Index Terms — Statistical parametric speech synthesis; hidden Markov models; deep neural networks; mixture density networks.

Citation Context

...stochastic gradient descent (SGD)-based back-propagation algorithm was used. To schedule the learning rate of the minibatch stochastic gradient descent (SGD)-based back-propagation algorithm, AdaDec [30] was used. Both input and output features in the training data were normalized; the input features were normalized to have zero-mean unit-variance, whereas the output features were normalized to be w...
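
The mixture density output layer described in the abstract above makes the network predict the parameters of a Gaussian mixture over the output acoustic features rather than a single point estimate. The NumPy sketch below shows the negative log-likelihood of a diagonal-covariance mixture density output in the standard Bishop-style formulation; the exact parameterization used by Zen and Senior may differ in detail.

    import numpy as np

    def mdn_negative_log_likelihood(logits, means, log_sigmas, targets):
        """Average negative log-likelihood of `targets` under a diagonal-covariance
        Gaussian mixture whose parameters are produced by the network.

        logits     : (batch, M)     unnormalized mixture weights
        means      : (batch, M, D)  component means
        log_sigmas : (batch, M, D)  log standard deviations
        targets    : (batch, D)     observed output feature vectors
        """
        log_w = logits - np.logaddexp.reduce(logits, axis=1, keepdims=True)  # log-softmax
        diff = targets[:, None, :] - means
        # log N(t | mu, sigma^2), summed over the D independent output dimensions
        log_gauss = -0.5 * np.sum((diff / np.exp(log_sigmas)) ** 2
                                  + 2.0 * log_sigmas + np.log(2.0 * np.pi), axis=2)
        return -np.logaddexp.reduce(log_w + log_gauss, axis=1).mean()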

MEAN-NORMALIZED STOCHASTIC GRADIENT FOR LARGE-SCALE DEEP LEARNING

by Simon Wiesler, Alexander Richard, Hermann Ney
"... Deep neural networks are typically optimized with stochastic gradi-ent descent (SGD). In this work, we propose a novel second-order stochastic optimization algorithm. The algorithm is based on an-alytic results showing that a non-zero mean of features is harmful for the optimization. We prove conver ..."
Abstract - Cited by 5 (2 self) - Add to MetaCart
Deep neural networks are typically optimized with stochastic gradient descent (SGD). In this work, we propose a novel second-order stochastic optimization algorithm. The algorithm is based on analytic results showing that a non-zero mean of features is harmful for the optimization. We prove convergence of our algorithm in a convex setting. In our experiments we show that our proposed algorithm converges faster than SGD. Further, in contrast to earlier work, our algorithm allows for training models with a factorized structure from scratch. We found this structure to be very useful not only because it accelerates training and decoding, but also because it is a very effective means against overfitting. Combining our proposed optimization algorithm with this model structure, model size can be reduced by a factor of eight and still improvements in recognition error rate are obtained. Additional gains are obtained by improving the Newbob learning rate strategy. Index Terms — deep learning, optimization, speech recognition, LVCSR

Citation Context

...bounded when b is bounded, which proves almost sure convergence of MN-SGD. 2.4. Learning rate strategies A critical aspect for the performance of DNNs is the choice of a learning rate strategy, see e.g. [19]. In our group, we mostly used the popular Newbob learning rate strategy as it is implemented in the Quicknet software [20], i.e., the learning rate is kept fixed as long as the frame classification e...
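
The Newbob strategy referred to here can be stated procedurally: hold the learning rate constant while the held-out (or frame-classification) error still improves clearly, then halve it after every epoch, and stop once the improvement stalls. The sketch below is an illustrative reconstruction; the thresholds and the initial rate are placeholder values, not QuickNet's exact defaults.

    def newbob_learning_rates(cv_errors, initial_lr=0.008,
                              ramp_threshold=0.005, stop_threshold=0.0005):
        """Yield one learning rate per epoch, Newbob style, from a sequence of
        cross-validation error rates (one value per completed epoch).
        All numeric defaults are illustrative placeholders."""
        lr, ramping = initial_lr, False
        yield lr                                     # first epoch uses the initial rate
        for previous, current in zip(cv_errors, cv_errors[1:]):
            improvement = previous - current
            if not ramping and improvement < ramp_threshold:
                ramping = True                       # improvement slowed: start halving
            if ramping:
                if improvement < stop_threshold:
                    return                           # improvement stalled: stop training
                lr *= 0.5
            yield lr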

UNIDIRECTIONAL LONG SHORT-TERM MEMORY RECURRENT NEURAL NETWORK WITH RECURRENT OUTPUT LAYER FOR LOW-LATENCY SPEECH SYNTHESIS

by Heiga Zen
"... Long short-term memory recurrent neural networks (LSTM-RNNs) have been applied to various speech applications including acoustic modeling for statistical parametric speech synthesis. One of the con-cerns for applying them to text-to-speech applications is its effect on latency. To address this conce ..."
Abstract - Cited by 2 (0 self) - Add to MetaCart
Long short-term memory recurrent neural networks (LSTM-RNNs) have been applied to various speech applications including acoustic modeling for statistical parametric speech synthesis. One of the concerns for applying them to text-to-speech applications is their effect on latency. To address this concern, this paper proposes a low-latency, streaming speech synthesis architecture using unidirectional LSTM-RNNs with a recurrent output layer. The use of unidirectional RNN architecture allows frame-synchronous streaming inference of output acoustic features given input linguistic features. The recurrent output layer further encourages smooth transition between acoustic features at consecutive frames. Experimental results in subjective listening tests show that the proposed architecture can synthesize natural sounding speech without requiring utterance-level batch processing. Index Terms — Statistical parametric speech synthesis; recurrent neural networks; long short-term memory; low-latency.

Citation Context

...mean squared error between the target and predicted output features. A GPU implementation of a mini-batch stochastic gradient descent (SGD)-based BP algorithm with AdaDec-based learning rate scheduling [34] and momentum term [35] was used. For training the LSTM-RNNs, a distributed CPU implementation of mini-batch ASGD-based truncated back propagation through time (BPTT) [36] algorithm was used [27]. Tra...

MODELLING ACOUSTIC FEATURE DEPENDENCIES WITH ARTIFICIAL NEURAL NETWORKS: TRAJECTORY-RNADE

by Benigno Uria, Iain Murray, Steve Renals, Cassia Valentini-botinhao, John Bridle
"... Given a transcription, sampling from a good model of acous-tic feature trajectories should result in plausible realizations of an utterance. However, samples from current probabilis-tic speech synthesis systems result in low quality synthetic speech. Henter et al. have demonstrated the need to captu ..."
Abstract - Cited by 1 (1 self) - Add to MetaCart
Given a transcription, sampling from a good model of acoustic feature trajectories should result in plausible realizations of an utterance. However, samples from current probabilistic speech synthesis systems result in low quality synthetic speech. Henter et al. have demonstrated the need to capture the dependencies between acoustic features conditioned on the phonetic labels in order to obtain high quality synthetic speech. These dependencies are often ignored in neural network based acoustic models. We tackle this deficiency by introducing a probabilistic neural network model of acoustic trajectories, trajectory RNADE, able to capture these dependencies. Index Terms — Speech synthesis, artificial neural networks, acoustic modelling, RNADE, trajectory model

Citation Context

...e the following order for the acoustic features: voiced/unvoiced, f0, mel-cepstral features (0 to 59 in that order), band aperiodicities (1 to 25 in that order). Both models were trained using AdaDec [24] for 1000 epochs of 1000 updates each, with minibatches of size 100. The learning rate was initialized to 3×10^-4 and decreased by 3×10^-7 after each epoch. We are mainly interested in measuring th...
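
The global schedule quoted in this snippet is a simple linear decrement: start at 3×10^-4 and subtract 3×10^-7 after every epoch, so the rate is annealed essentially to zero over the 1000 epochs. A minimal sketch of that global schedule follows; the per-parameter AdaDec scaling of [24] that sits on top of it is not shown.

    def linearly_decremented_lr(epochs_completed, initial_lr=3e-4, decrement=3e-7):
        """Global learning rate after `epochs_completed` epochs: start at 3e-4 and
        subtract 3e-7 per epoch (the values quoted in the citation context).
        Clamped at zero so the rate can never go negative."""
        return max(initial_lr - epochs_completed * decrement, 0.0)

    # Example: halfway through the 1000 epochs the rate has dropped to 1.5e-4.
    assert abs(linearly_decremented_lr(500) - 1.5e-4) < 1e-12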

RASR/NN: THE RWTH NEURAL NETWORK TOOLKIT FOR SPEECH RECOGNITION

by Simon Wiesler, Alexander Richard, Pavel Golik, Hermann Ney
"... This paper describes the new release of RASR- the open source version of the well-proven speech recognition toolkit developed and used at RWTH Aachen University. The focus is put on the implementation of the NN module for training neural network acoustic models. We describe code design, configuratio ..."
Abstract - Cited by 1 (0 self) - Add to MetaCart
This paper describes the new release of RASR, the open source version of the well-proven speech recognition toolkit developed and used at RWTH Aachen University. The focus is put on the implementation of the NN module for training neural network acoustic models. We describe code design, configuration, and features of the NN module. The key feature is a high flexibility regarding the network topology, choice of activation functions, training criteria, and optimization algorithm, as well as a built-in support for efficient GPU computing. The evaluation of run-time performance and recognition accuracy is performed, as an example, with a deep neural network as acoustic model in a hybrid NN/HMM system. The results show that RASR achieves a state-of-the-art performance on a real-world large vocabulary task, while offering a complete pipeline for building and applying large scale speech recognition systems. Index Terms — speech recognition, acoustic modeling, neural networks, GPU, open source, RASR

Citation Context

...QuickNet. Newbob keeps the learning rate constant within each epoch. RASR also supports a power schedule that decays with every mini-batch and has been reported to perform slightly better than Newbob [18]. Mostly, we use a modification of Newbob which decays the learning rate less aggressively than the original Newbob [15]. 2.3. Recognition RASR comes with a dynamic decoder that is based on the histor...
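
The power schedule contrasted with Newbob here decays the learning rate a little after every mini-batch instead of once per epoch. A common parameterization of such a schedule is sketched below; the constants are illustrative and the exact form and defaults used in RASR are configuration options not given in this snippet.

    def power_schedule_lr(step, initial_lr=0.01, decay_steps=100000.0, exponent=1.0):
        """Power decay applied after every mini-batch:
            lr(t) = lr(0) * (1 + t / decay_steps) ** (-exponent)
        With exponent 1, the rate is exactly halved after `decay_steps` updates.
        All constants are illustrative, not RASR defaults."""
        return initial_lr * (1.0 + step / decay_steps) ** (-exponent)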

Edinburgh Research Explorer

by unknown authors
"... Peer reviewed version ..."
Abstract - Add to MetaCart
Peer reviewed version

Citation Context

...the learning rates, instead of empirically tuning them. In our internal experiments on a small scale problem, we found that Adadelta performed comparably to the well tuned exponential scheduling approach [36] with momentum. However, the latter is more cumbersome in tuning the hyperparameters. Refer to [28] for more details on the model training of an encoder-decoder. 3. Experimental Setup 3.1. Encoder type...

1-Bit Stochastic Gradient Descent and its Application to Data-Parallel Distributed Training of Speech DNNs

by Frank Seide , Hao Fu , Jasha Droppo , Gang Li , Dong Yu
"... Abstract We show empirically that in SGD training of deep neural networks, one can, at no or nearly no loss of accuracy, quantize the gradients aggressively-to but one bit per value-if the quantization error is carried forward across minibatches (error feedback). This size reduction makes it feasib ..."
Abstract - Add to MetaCart
We show empirically that in SGD training of deep neural networks, one can, at no or nearly no loss of accuracy, quantize the gradients aggressively, to but one bit per value, if the quantization error is carried forward across minibatches (error feedback). This size reduction makes it feasible to parallelize SGD through data-parallelism with fast processors like recent GPUs. We implement data-parallel deterministically distributed SGD by combining this finding with AdaGrad, automatic minibatch-size selection, double buffering, and model parallelism. Unexpectedly, quantization benefits AdaGrad, giving a small accuracy gain. For a typical Switchboard DNN with 46M parameters, we reach computation speeds of 27k frames per second (kfps) when using 2880 samples per minibatch, and 51 kfps with 16k, on a server with 8 K20X GPUs. This corresponds to speed-ups over a single GPU of 3.6 and 6.3, respectively. 7 training passes over 309h of data complete in under 7h. A 160M-parameter model training processes 3300h of data in under 16h on 20 dual-GPU servers, a 10 times speed-up, albeit at a small accuracy loss.

Citation Context

... above which convergence slows notably or fails [19, 8]. Thus, every 24h of data, we process the next ≈45 minutes at different minibatch sizes and pick the largest N that does not hurt convergence, based on training-set frame accuracy. We find that more mature models allow for larger N. So do smaller learning rates, so we also use a gradually decaying learning-rate profile that was determined automatically using frame accuracy on a cross-validation set on an earlier configuration. Lastly, we use AdaGrad, a technique to normalize gradients by their standard deviation over time or recent samples [30, 31]. AdaGrad leads to faster convergence and allows the minibatch size to be increased earlier. Our system can apply AdaGrad at three different places: locally on each node before quantization (which may benefit quantization at the risk of introducing inconsistencies across nodes); during data exchange (risking interference from quantization) after aggregation; and after momentum smoothing (saving memory and fixed cost while reducing the effect of AdaGrad due to peaks being smoothed out). We find AdaGrad responds best to quantized, unsmoothed gradients. To address (c), the fixed cost, and to benefit f...
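
The central idea of the abstract, quantizing each gradient value to a single bit while feeding the quantization error back into the next mini-batch, can be sketched in a few lines of NumPy. This is an illustrative reconstruction under a simplified reconstruction rule (one scale per tensor equal to the mean magnitude); the paper's implementation quantizes column-wise and chooses reconstruction values differently.

    import numpy as np

    def one_bit_quantize(gradient, residual):
        """Quantize `gradient + residual` to one bit per value (its sign) and return
        (quantized_gradient, new_residual).  Carrying the residual forward is the
        error feedback that keeps the aggressive quantization from hurting accuracy.
        The single mean-magnitude scale is a simplification for illustration."""
        corrected = gradient + residual
        scale = np.mean(np.abs(corrected))
        quantized = np.where(corrected >= 0.0, scale, -scale)   # 1 bit of information per value
        new_residual = corrected - quantized                    # quantization error, reused next step
        return quantized, new_residual

    # Usage: each worker keeps one residual array (initialized to zeros) per parameter
    # tensor and updates it every mini-batch before exchanging the quantized gradients.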

A Study of the Recurrent Neural Network Encoder-Decoder for Large Vocabulary Speech Recognition

by Liang Lu, Xingxing Zhang, Kyunghyun Cho, Steve Renals
"... Deep neural networks have advanced the state-of-the-art in automatic speech recognition, when combined with hidden Markov models (HMMs). Recently there has been interest in using systems based on recurrent neural networks (RNNs) to perform sequence modelling directly, without the require-ment of an ..."
Abstract - Add to MetaCart
Deep neural networks have advanced the state-of-the-art in automatic speech recognition, when combined with hidden Markov models (HMMs). Recently there has been interest in using systems based on recurrent neural networks (RNNs) to perform sequence modelling directly, without the requirement of an HMM superstructure. In this paper, we study the RNN encoder-decoder approach for large vocabulary end-to-end speech recognition, whereby an encoder transforms a sequence of acoustic vectors into a sequence of feature representations, from which a decoder recovers a sequence of words. We investigated this approach on the Switchboard corpus using a training set of around 300 hours of transcribed audio data. Without the use of an explicit language model or pronunciation lexicon, we achieved promising recognition accuracy, demonstrating that this approach warrants further investigation. Index Terms: end-to-end speech recognition, deep neural networks, recurrent neural networks, encoder-decoder.

Citation Context

...the learning rates, instead of empirically tuning them. In our internal experiments on a small scale problem, we found that Adadelta performed comparably to the well tuned exponential scheduling approach [36] with momentum. However, the latter is more cumbersome in tuning the hyperparameters. Refer to [28] for more details on the model training of an encoder-decoder. 3. Experimental Setup 3.1. Encoder type...
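
Adadelta, which this snippet compares with the exponential schedule of [36], removes the global learning rate by keeping decaying averages of squared gradients and squared parameter updates. A minimal NumPy sketch of the standard update follows; rho and eps are the usual hyperparameters, not values reported in this paper.

    import numpy as np

    def adadelta_step(params, grad, state, rho=0.95, eps=1e-6):
        """One Adadelta update.  `state` holds the running averages E[g^2] and
        E[dx^2], both initialized to zero arrays shaped like `params`."""
        state["Eg2"] = rho * state["Eg2"] + (1.0 - rho) * grad ** 2
        delta = -np.sqrt(state["Edx2"] + eps) / np.sqrt(state["Eg2"] + eps) * grad
        state["Edx2"] = rho * state["Edx2"] + (1.0 - rho) * delta ** 2
        return params + delta

    # Usage:
    #   state = {"Eg2": np.zeros_like(w), "Edx2": np.zeros_like(w)}
    #   w = adadelta_step(w, gradient, state)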

Learning Step Size Controllers for Robust Neural Network Training

by Christian Daniel (TU Darmstadt), Jonathan Taylor, Sebastian Nowozin
"... This paper investigates algorithms to automatically adapt the learning rate of neural networks (NNs). Start-ing with stochastic gradient descent, a large variety of learning methods has been proposed for the NN setting. However, these methods are usually sensitive to the ini-tial learning rate which ..."
Abstract - Add to MetaCart
This paper investigates algorithms to automatically adapt the learning rate of neural networks (NNs). Starting with stochastic gradient descent, a large variety of learning methods has been proposed for the NN setting. However, these methods are usually sensitive to the initial learning rate which has to be chosen by the experimenter. We investigate several features and show how an adaptive controller can adjust the learning rate without prior knowledge of the learning problem at hand.

Citation Context

... initial learning rate either manually, or, for example, through Bayesian Optimization (Snoek, Larochelle, and Adams 2012). Subsequently, the learning rate is decreased following a predefined scheme (Senior et al. 2013). Possible schemes include the ‘waterfall’ scheme (Senior et al. 2013), which keeps η constant for multiple steps and then reduces it by large amounts, as well as the exponential scheme (Sutton 1992)...
