Results 1 - 10
of
13
Large scale deep neural network acoustic modeling with semi-supervised training data for youtube video transcription
- in Workshop on Automatic Speech Recognition and Understanding (ASRU
, 2013
"... YouTube is a highly visited video sharing website where over one billion people watch six billion hours of video every month. Im-proving accessibility to these videos for the hard of hearing and for search and indexing purposes is an excellent application of automatic speech recognition. However, Yo ..."
Abstract
-
Cited by 12 (2 self)
- Add to MetaCart
(Show Context)
YouTube is a highly visited video sharing website where over one billion people watch six billion hours of video every month. Im-proving accessibility to these videos for the hard of hearing and for search and indexing purposes is an excellent application of automatic speech recognition. However, YouTube videos are extremely chal-lenging for automatic speech recognition systems. Standard adapted Gaussian Mixture Model (GMM) based acoustic models can have word error rates above 50%, making this one of the most difficult reported tasks. Since 2009 YouTube has provided automatic gener-ation of closed captions for videos detected to have English speech; the service now supports ten different languages. This article de-scribes recent improvements to the original system, in particular the use of owner-uploaded video transcripts to generate additional semi-supervised training data and deep neural networks acoustic models with large state inventories. Applying an “island of confidence ” fil-tering heuristic to select useful training segments, and increasing the model size by using 44,526 context dependent states with a low-rank final layer weight matrix approximation, improved performance by about 13 % relative compared to previously reported sequence trained DNN results for this task. Index Terms — Large vocabulary speech recognition, deep neu-ral networks, deep learning, audio indexing. 1.
DEEP MIXTURE DENSITY NETWORKS FOR ACOUSTIC MODELING IN STATISTICAL PARAMETRIC SPEECH SYNTHESIS
"... Statistical parametric speech synthesis (SPSS) using deep neural net-works (DNNs) has shown its potential to produce naturally-sounding synthesized speech. However, there are limitations in the current im-plementation of DNN-based acoustic modeling for speech synthesis, such as the unimodal nature o ..."
Abstract
-
Cited by 10 (1 self)
- Add to MetaCart
(Show Context)
Statistical parametric speech synthesis (SPSS) using deep neural net-works (DNNs) has shown its potential to produce naturally-sounding synthesized speech. However, there are limitations in the current im-plementation of DNN-based acoustic modeling for speech synthesis, such as the unimodal nature of its objective function and its lack of ability to predict variances. To address these limitations, this paper investigates the use of a mixture density output layer. It can esti-mate full probability density functions over real-valued output fea-tures conditioned on the corresponding input features. Experimental results in objective and subjective evaluations show that the use of the mixture density output layer improves the prediction accuracy of acoustic features and the naturalness of the synthesized speech. Index Terms — Statistical parametric speech synthesis; hidden Markov models; deep neural networks; mixture density networks; 1.
MEAN-NORMALIZED STOCHASTIC GRADIENT FOR LARGE-SCALE DEEP LEARNING
"... Deep neural networks are typically optimized with stochastic gradi-ent descent (SGD). In this work, we propose a novel second-order stochastic optimization algorithm. The algorithm is based on an-alytic results showing that a non-zero mean of features is harmful for the optimization. We prove conver ..."
Abstract
-
Cited by 5 (2 self)
- Add to MetaCart
(Show Context)
Deep neural networks are typically optimized with stochastic gradi-ent descent (SGD). In this work, we propose a novel second-order stochastic optimization algorithm. The algorithm is based on an-alytic results showing that a non-zero mean of features is harmful for the optimization. We prove convergence of our algorithm in a convex setting. In our experiments we show that our proposed algo-rithm converges faster than SGD. Further, in contrast to earlier work, our algorithm allows for training models with a factorized structure from scratch. We found this structure to be very useful not only be-cause it accelerates training and decoding, but also because it is a very effective means against overfitting. Combining our proposed optimization algorithm with this model structure, model size can be reduced by a factor of eight and still improvements in recognition error rate are obtained. Additional gains are obtained by improving the Newbob learning rate strategy. Index Terms — deep learning, optimization, speech recognition, LVCSR 1.
UNIDIRECTIONAL LONG SHORT-TERMMEMORY RECURRENT NEURAL NETWORK WITH RECURRENT OUTPUT LAYER FOR LOW-LATENCY SPEECH SYNTHESIS
"... Long short-term memory recurrent neural networks (LSTM-RNNs) have been applied to various speech applications including acoustic modeling for statistical parametric speech synthesis. One of the con-cerns for applying them to text-to-speech applications is its effect on latency. To address this conce ..."
Abstract
-
Cited by 2 (0 self)
- Add to MetaCart
(Show Context)
Long short-term memory recurrent neural networks (LSTM-RNNs) have been applied to various speech applications including acoustic modeling for statistical parametric speech synthesis. One of the con-cerns for applying them to text-to-speech applications is its effect on latency. To address this concern, this paper proposes a low-latency, streaming speech synthesis architecture using unidirectional LSTM-RNNs with a recurrent output layer. The use of unidirectional RNN architecture allows frame-synchronous streaming inference of out-put acoustic features given input linguistic features. The recurrent output layer further encourages smooth transition between acoustic features at consecutive frames. Experimental results in subjective listening tests show that the proposed architecture can synthesize natural sounding speech without requiring utterance-level batch pro-cessing. Index Terms — Statistical parametric speech synthesis; recur-rent neural networks; long short-term memory; low-latency; 1.
MODELLING ACOUSTIC FEATURE DEPENDENCIES WITH ARTIFICIAL NEURAL NETWORKS: TRAJECTORY-RNADE
"... Given a transcription, sampling from a good model of acous-tic feature trajectories should result in plausible realizations of an utterance. However, samples from current probabilis-tic speech synthesis systems result in low quality synthetic speech. Henter et al. have demonstrated the need to captu ..."
Abstract
-
Cited by 1 (1 self)
- Add to MetaCart
(Show Context)
Given a transcription, sampling from a good model of acous-tic feature trajectories should result in plausible realizations of an utterance. However, samples from current probabilis-tic speech synthesis systems result in low quality synthetic speech. Henter et al. have demonstrated the need to capture the dependencies between acoustic features conditioned on the phonetic labels in order to obtain high quality synthetic speech. These dependencies are often ignored in neural network based acoustic models. We tackle this deficiency by introducing a probabilistic neural network model of acoustic trajectories, trajectory RNADE, able to capture these dependencies. Index Terms — Speech synthesis, artificial neural net-works, acoustic modelling, RNADE, trajectory model
RASR/NN: THE RWTH NEURAL NETWORK TOOLKIT FOR SPEECH RECOGNITION
"... This paper describes the new release of RASR- the open source version of the well-proven speech recognition toolkit developed and used at RWTH Aachen University. The focus is put on the implementation of the NN module for training neural network acoustic models. We describe code design, configuratio ..."
Abstract
-
Cited by 1 (0 self)
- Add to MetaCart
(Show Context)
This paper describes the new release of RASR- the open source version of the well-proven speech recognition toolkit developed and used at RWTH Aachen University. The focus is put on the implementation of the NN module for training neural network acoustic models. We describe code design, configuration, and features of the NN module. The key feature is a high flexibility regarding the network topology, choice of activation functions, training criteria, and opti-mization algorithm, as well as a built-in support for efficient GPU computing. The evaluation of run-time performance and recognition accuracy is performed exemplary with a deep neural network as acoustic model in a hybrid NN/HMM sys-tem. The results show that RASR achieves a state-of-the-art performance on a real-world large vocabulary task, while offering a complete pipeline for building and applying large scale speech recognition systems. Index Terms — speech recognition, acoustic modeling, neural networks, GPU, open source, RASR 1.
-Bit Stochastic Gradient Descent and its Application to Data-Parallel Distributed Training of Speech DNNs
"... Abstract We show empirically that in SGD training of deep neural networks, one can, at no or nearly no loss of accuracy, quantize the gradients aggressively-to but one bit per value-if the quantization error is carried forward across minibatches (error feedback). This size reduction makes it feasib ..."
Abstract
- Add to MetaCart
(Show Context)
Abstract We show empirically that in SGD training of deep neural networks, one can, at no or nearly no loss of accuracy, quantize the gradients aggressively-to but one bit per value-if the quantization error is carried forward across minibatches (error feedback). This size reduction makes it feasible to parallelize SGD through data-parallelism with fast processors like recent GPUs. We implement data-parallel deterministically distributed SGD by combining this finding with AdaGrad, automatic minibatch-size selection, double buffering, and model parallelism. Unexpectedly, quantization benefits AdaGrad, giving a small accuracy gain. For a typical Switchboard DNN with 46M parameters, we reach computation speeds of 27k frames per second (kfps) when using 2880 samples per minibatch, and 51kfps with 16k, on a server with 8 K20X GPUs. This corresponds to speed-ups over a single GPU of 3.6 and 6.3, respectively. 7 training passes over 309h of data complete in under 7h. A 160M-parameter model training processes 3300h of data in under 16h on 20 dual-GPU servers-a 10 times speed-up-albeit at a small accuracy loss.
A Study of the Recurrent Neural Network Encoder-Decoder for Large Vocabulary Speech Recognition
"... Deep neural networks have advanced the state-of-the-art in automatic speech recognition, when combined with hidden Markov models (HMMs). Recently there has been interest in using systems based on recurrent neural networks (RNNs) to perform sequence modelling directly, without the require-ment of an ..."
Abstract
- Add to MetaCart
(Show Context)
Deep neural networks have advanced the state-of-the-art in automatic speech recognition, when combined with hidden Markov models (HMMs). Recently there has been interest in using systems based on recurrent neural networks (RNNs) to perform sequence modelling directly, without the require-ment of an HMM superstructure. In this paper, we study the RNN encoder-decoder approach for large vocabulary end-to-end speech recognition, whereby an encoder transforms a se-quence of acoustic vectors into a sequence of feature represen-tations, from which a decoder recovers a sequence of words. We investigated this approach on the Switchboard corpus us-ing a training set of around 300 hours of transcribed audio data. Without the use of an explicit language model or pronunciation lexicon, we achieved promising recognition accuracy, demon-strating that this approach warrants further investigation. Index Terms: end-to-end speech recognition, deep neural net-works, recurrent neural networks, encoder-decoder. 1.
Learning Step Size Controllers for Robust Neural Network Training
"... This paper investigates algorithms to automatically adapt the learning rate of neural networks (NNs). Start-ing with stochastic gradient descent, a large variety of learning methods has been proposed for the NN setting. However, these methods are usually sensitive to the ini-tial learning rate which ..."
Abstract
- Add to MetaCart
(Show Context)
This paper investigates algorithms to automatically adapt the learning rate of neural networks (NNs). Start-ing with stochastic gradient descent, a large variety of learning methods has been proposed for the NN setting. However, these methods are usually sensitive to the ini-tial learning rate which has to be chosen by the exper-imenter. We investigate several features and show how an adaptive controller can adjust the learning rate with-out prior knowledge of the learning problem at hand.