Results 1-10 of 17
Deep Neural Networks for Acoustic Modeling in Speech Recognition
Cited by 272 (47 self)
Most current speech recognition systems use hidden Markov models (HMMs) to deal with the temporal variability of speech and Gaussian mixture models to determine how well each state of each HMM fits a frame or a short window of frames of coefficients that represents the acoustic input. An alternative way to evaluate the fit is to use a feedforward neural network that takes several frames of coefficients as input and produces posterior probabilities over HMM states as output. Deep neural networks with many hidden layers that are trained using new methods have been shown to outperform Gaussian mixture models on a variety of speech recognition benchmarks, sometimes by a large margin. This paper provides an overview of this progress and represents the shared views of four research groups who have had recent successes in using deep neural networks for acoustic modeling in speech recognition.
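The feedforward alternative described in this abstract can be illustrated with a minimal sketch: several consecutive frames are stacked into one input vector, passed through logistic hidden layers, and a softmax output gives a posterior over HMM states. The layer sizes, the context width, the random weights, and all function names here are illustrative assumptions, not values from the paper.

```python
import math
import random

def stack_frames(frames, context):
    """Concatenate each frame with `context` neighbours on each side.
    Edges are padded by repeating the first or last frame."""
    n = len(frames)
    stacked = []
    for t in range(n):
        window = []
        for k in range(t - context, t + context + 1):
            window.extend(frames[min(max(k, 0), n - 1)])
        stacked.append(window)
    return stacked

def dense(x, weights, biases):
    """Affine map: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]

def softmax(z):
    m = max(z)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def state_posteriors(x, layers):
    """Logistic hidden layers followed by a softmax over HMM states."""
    h = x
    for weights, biases in layers[:-1]:
        h = [1.0 / (1.0 + math.exp(-a)) for a in dense(h, weights, biases)]
    weights, biases = layers[-1]
    return softmax(dense(h, weights, biases))

def random_layer(rng, n_in, n_out):
    """Untrained layer with small random weights (hypothetical initialisation)."""
    weights = [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

rng = random.Random(0)
# Hypothetical input: 20 frames of 13 coefficients each.
frames = [[rng.gauss(0.0, 1.0) for _ in range(13)] for _ in range(20)]
inputs = stack_frames(frames, context=5)  # 11 frames per network input
layers = [random_layer(rng, 11 * 13, 32),
          random_layer(rng, 32, 32),
          random_layer(rng, 32, 8)]       # 8 hypothetical HMM states
posteriors = [state_posteriors(x, layers) for x in inputs]
```

With untrained weights the posteriors are meaningless; the sketch only shows the data flow from stacked frames to a distribution over states.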
Large scale distributed deep networks
Proceedings of NIPS, 2012
Cited by 107 (12 self)
Recent work in unsupervised feature learning and deep learning has shown that being able to train large models can dramatically improve performance. In this paper, we consider the problem of training a deep network with billions of parameters using tens of thousands of CPU cores. We have developed a software framework called DistBelief that can utilize computing clusters with thousands of machines to train large models. Within this framework, we have developed two algorithms for large-scale distributed training: (i) Downpour SGD, an asynchronous stochastic gradient descent procedure supporting a large number of model replicas, and (ii) Sandblaster, a framework that supports a variety of distributed batch optimization procedures, including a distributed implementation of L-BFGS. Downpour SGD and Sandblaster L-BFGS both increase the scale and speed of deep network training. We have successfully used our system to train a deep network that is 30x larger than previously reported in the literature and that achieves state-of-the-art performance on ImageNet, a visual object recognition task with 16 million images and 21k categories. We show that these same techniques dramatically accelerate the training of a more modestly sized deep network for a commercial speech recognition service. Although we focus on and report performance of these methods as applied to training large neural networks, the underlying algorithms are applicable to any gradient-based machine learning algorithm.
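The idea of asynchronous SGD with several model replicas can be sketched in a single process. This is a toy simulation under stated assumptions, not DistBelief: the `ParameterServer` class, the round-robin schedule standing in for true concurrency, the one-parameter regression task, and the learning rate are all hypothetical. The point it shows is that each replica computes its gradient from a stale copy of the parameters, because other replicas have updated the server since that copy was fetched.

```python
import random

class ParameterServer:
    """Holds the shared parameter and applies pushed gradients (toy version)."""
    def __init__(self, w, lr):
        self.w = w
        self.lr = lr

    def fetch(self):
        return self.w

    def push(self, grad):
        self.w -= self.lr * grad

def gradient(w, batch):
    """Gradient of mean squared error for the model y = w * x."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)

rng = random.Random(0)
# Noise-free toy data with true slope 3.0, split into one shard per replica.
data = [(x, 3.0 * x) for x in (rng.uniform(-1.0, 1.0) for _ in range(400))]
shards = [data[i::4] for i in range(4)]

server = ParameterServer(w=0.0, lr=0.1)
local_w = [server.fetch() for _ in shards]  # each replica's private copy

for step in range(200):
    r = step % len(shards)                    # replicas take turns; real ones run concurrently
    batch = rng.sample(shards[r], 10)
    server.push(gradient(local_w[r], batch))  # gradient uses the replica's stale copy
    local_w[r] = server.fetch()               # refresh only after pushing
```

In this schedule every gradient is three server updates out of date, yet `server.w` still approaches 3.0, which is the behaviour that makes the asynchronous scheme usable in practice.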
Application of Pretrained Deep Neural Networks to Large Vocabulary Conversational Speech Recognition
2012
Improving wideband speech recognition using mixed-bandwidth training data in CD-DNN-HMM
IEEE SLT, 2012
Cited by 13 (6 self)
Context-dependent deep neural network hidden Markov model (CD-DNN-HMM) is a recently proposed acoustic model that significantly outperformed Gaussian mixture model (GMM)-HMM systems in many large vocabulary speech recognition (LVSR) tasks. In this paper we present our strategy of using mixed-bandwidth training data to improve wideband speech recognition accuracy in the CD-DNN-HMM framework. We show that DNNs provide the flexibility of using arbitrary features. By using the Mel-scale log-filter bank features we not only achieve higher recognition accuracy than using MFCCs, but also can formulate the mixed-bandwidth training problem as a missing feature problem, in which several feature dimensions have no value when narrowband speech is presented. This treatment makes training CD-DNN-HMMs with mixed-bandwidth data an easy task since no bandwidth extension is needed. Our experiments on voice search data indicate that the proposed solution not only provides higher recognition accuracy for the wideband speech but also allows the same CD-DNN-HMM to recognize mixed-bandwidth speech. By exploiting mixed-bandwidth training data, the CD-DNN-HMM outperforms the fMPE+BMMI trained GMM-HMM, which cannot benefit from using narrowband data, by 18.4%. Index Terms: deep neural network, log filter bank, CD-DNN-HMM, wideband, narrowband, mixed-bandwidth
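The missing-feature formulation can be sketched as follows. A narrowband utterance only populates the lower filter-bank bins, so its feature vector is padded up to the wideband dimensionality and then fed to the same network as wideband data. The abstract does not say what value the missing dimensions receive; filling with a constant (zero by default) is an assumption of this sketch, as are the bin counts and the function name `pad_narrowband`.

```python
def pad_narrowband(features, n_wideband_bins, fill=0.0):
    """Extend a narrowband log-filter-bank vector to the wideband length.
    The upper bins carry no information and receive `fill` (an assumption)."""
    missing = n_wideband_bins - len(features)
    if missing < 0:
        raise ValueError("feature vector is longer than the wideband layout")
    return list(features) + [fill] * missing

# Hypothetical layout: 8 wideband bins, of which narrowband speech covers 5.
N_WIDEBAND = 8
wideband_frame = [0.9, 1.1, 0.7, 0.4, 0.3, 0.2, 0.1, 0.05]
narrowband_frame = [0.8, 1.0, 0.6, 0.5, 0.2]

# Both kinds of frame end up with the same dimensionality, so one network
# can be trained on the mixed batch without any bandwidth extension step.
batch = [pad_narrowband(f, N_WIDEBAND) for f in (wideband_frame, narrowband_frame)]
```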
An empirical study of learning rates in deep neural networks for speech recognition
Proc. ICASSP
Cited by 13 (3 self)
Recent deep neural network systems for large vocabulary speech recognition are trained with minibatch stochastic gradient descent but use a variety of learning rate scheduling schemes. We investigate several of these schemes, particularly AdaGrad. Based on our analysis of its limitations, we propose a new variant, 'AdaDec', that decouples long-term learning-rate scheduling from per-parameter learning rate variation. AdaDec was found to result in higher frame accuracies than other methods. Overall, careful choice of learning rate schemes leads to faster convergence and lower word error rates. Index Terms: deep neural networks, large vocabulary speech recognition, Voice Search, learning rate, AdaGrad, AdaDec.
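For readers unfamiliar with the baseline, here is a minimal sketch of the standard AdaGrad update, in which each parameter's step is the base rate divided by the square root of that parameter's accumulated squared gradients. It does not implement AdaDec, whose exact form is not given in this abstract. The objective function, the base rate, and the function name are illustrative assumptions.

```python
import math

def adagrad_step(params, grads, accum, base_lr=0.1, eps=1e-8):
    """One in-place AdaGrad update over parallel lists of floats."""
    for i, g in enumerate(grads):
        accum[i] += g * g  # running sum of squared gradients per parameter
        params[i] -= base_lr * g / (math.sqrt(accum[i]) + eps)
    return params, accum

def toy_gradient(w):
    """Gradient of f(w) = (w0 - 1)^2 + 10 * (w1 + 2)^2 (hypothetical objective)."""
    return [2.0 * (w[0] - 1.0), 20.0 * (w[1] + 2.0)]

params = [0.0, 0.0]
accum = [0.0, 0.0]
first_step, _ = adagrad_step(list(params), toy_gradient(params), list(accum))

for _ in range(500):
    adagrad_step(params, toy_gradient(params), accum)
```

On the first update the two gradients differ by a factor of 20, yet both parameters move by about `base_lr`, which is the per-parameter normalisation at work. Because the accumulator only grows, the effective rate also shrinks over time, so the long-term schedule and the per-parameter scaling are coupled in plain AdaGrad.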
The Case for Onloading Continuous High-Datarate Perception to the Phone
Cited by 10 (3 self)
Much has been said recently on off-loading computations from the phone. In particular, workloads such as speech and visual recognition that involve models based on "big data" are thought to be prime candidates for cloud processing. We posit that the next few years will see the arrival of mobile usages that require continuous processing of audio and video data from wearable devices. We argue that these usages are unlikely to flourish unless substantial computation is moved back on to the phone. We outline possible solutions to the problems inherent in such a move. We advocate a close partnership between perception and systems researchers to realize these usages.
Exploiting linear structure within convolutional networks for efficient evaluation
arXiv:1404.0736, 2014
Cited by 8 (0 self)
We present techniques for speeding up the test-time evaluation of large convolutional networks, designed for object recognition tasks. These models deliver impressive accuracy, but each image evaluation requires millions of floating point operations, making their deployment on smartphones and Internet-scale clusters problematic. The computation is dominated by the convolution operations in the lower layers of the model. We exploit the redundancy present within the convolutional filters to derive approximations that significantly reduce the required computation. Using large state-of-the-art models, we demonstrate speedups of convolutional layers on both CPU and GPU by a factor of 2×, while keeping the accuracy within 1% of the original model.
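The abstract does not spell out the approximations used, so the following is only an illustration of the general principle that linear structure in a filter reduces computation, not the paper's method. It uses the simplest case: a filter that is exactly rank 1 (an outer product of a column vector and a row vector) can be applied as two one-dimensional passes. The image, the filter values, and the function names are hypothetical.

```python
def correlate2d_valid(image, kernel):
    """Plain 2-D sliding-window filter ('valid' region, no kernel flip)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[i + a][j + b] * kernel[a][b]
                 for a in range(kh) for b in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]

def correlate2d_separable(image, col, row):
    """Apply a rank-1 filter outer(col, row) as a vertical then a horizontal pass."""
    vertical = correlate2d_valid(image, [[c] for c in col])  # kh x 1 kernel
    return correlate2d_valid(vertical, [row])                # 1 x kw kernel

col = [1, 2, 1]
row = [1, 0, -1]
full_kernel = [[c * r for r in row] for c in col]  # the equivalent 3 x 3 filter

image = [[(3 * i + 5 * j) % 7 for j in range(6)] for i in range(5)]
direct = correlate2d_valid(image, full_kernel)
separable = correlate2d_separable(image, col, row)
```

For a k by k filter the direct form costs k*k multiplications per output value, while the two-pass form costs roughly 2*k. Real filters are rarely exactly rank 1, which is why an approximation, and the accuracy trade-off the abstract reports, comes into play.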
Multiframe Deep Neural Networks for Acoustic Modeling
Cited by 5 (0 self)
Deep neural networks have been shown to perform very well as acoustic models for automatic speech recognition. Compared to Gaussian mixtures, however, they tend to be computationally very expensive, making them challenging to use in real-time applications. One key advantage of such neural networks is their ability to learn from very long observation windows of up to 400 ms. Given this very long temporal context, it is tempting to wonder whether one can run neural networks at a lower frame rate than the typical 10 ms, and whether there might be computational benefits to doing so. This paper describes a method of tying the neural network parameters over time which achieves comparable performance to the typical frame-synchronous model, while achieving up to a 4X reduction in the computational cost of the neural network activations. Index Terms: deep neural networks, acoustic modeling
Deep Learning with Limited Numerical Precision
Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
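Stochastic rounding can be sketched in a few lines: a value is rounded up to the next fixed-point grid point with probability equal to its fractional distance above the lower grid point, and rounded down otherwise, so the result is unbiased in expectation. The split between integer and fractional bits, the saturation behaviour, and the function name are assumptions of this sketch rather than details taken from the paper.

```python
import math
import random

def stochastic_round(x, frac_bits, word_bits=16, rng=random):
    """Round x onto a signed fixed-point grid with `frac_bits` fractional bits.
    Rounds up with probability equal to the fractional remainder."""
    scale = 1 << frac_bits
    scaled = x * scale
    lower = math.floor(scaled)
    if rng.random() < scaled - lower:  # remainder is in [0, 1)
        lower += 1
    # Saturate to the signed range of the word (an assumed overflow policy).
    max_int = (1 << (word_bits - 1)) - 1
    min_int = -(1 << (word_bits - 1))
    return min(max(lower, min_int), max_int) / scale

rng = random.Random(0)
# With 2 fractional bits the grid spacing is 0.25, so 0.3 lies between 0.25 and 0.5.
samples = [stochastic_round(0.3, frac_bits=2, rng=rng) for _ in range(20000)]
mean_value = sum(samples) / len(samples)
```

Round-to-nearest would map 0.3 to 0.25 every time, losing the remainder; here the average of the rounded values stays close to 0.3, which is why small gradient updates are not systematically discarded.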
Fixed Point Quantization of Deep Convolutional Networks
In recent years increasingly complex architectures for deep convolution networks (DCNs) have been proposed to boost the performance on image recognition tasks. However, the gains in performance have come at a cost of substantial increase in computation and model storage resources. Fixed point implementation of DCNs has the potential to alleviate some of these complexities and facilitate potential deployment on embedded hardware. In this paper, we propose a quantizer design for fixed point implementation of DCNs. We formulate and solve an optimization problem to identify optimal fixed point bit-width allocation across DCN layers. Our experiments show that, in comparison to equal bit-width settings, the fixed point DCNs with optimized bit-width allocation offer a more than 20% reduction in the model size without any loss in accuracy on the CIFAR-10 benchmark. We also demonstrate that fine-tuning can further enhance the accuracy of fixed point DCNs beyond that of the original floating point model. In doing so, we report a new state-of-the-art fixed point performance of 6.78% error rate on the CIFAR-10 benchmark.