Results 1–10 of 17
Deep Neural Networks for Acoustic Modeling in Speech Recognition
"... Most current speech recognition systems use hidden Markov models (HMMs) to deal with the temporal variability of speech and Gaussian mixture models to determine how well each state of each HMM fits a frame or a short window of frames of coefficients that represents the acoustic input. An alternative ..."
Abstract

Cited by 272 (47 self)
 Add to MetaCart
(Show Context)
Most current speech recognition systems use hidden Markov models (HMMs) to deal with the temporal variability of speech and Gaussian mixture models to determine how well each state of each HMM fits a frame or a short window of frames of coefficients that represents the acoustic input. An alternative way to evaluate the fit is to use a feedforward neural network that takes several frames of coefficients as input and produces posterior probabilities over HMM states as output. Deep neural networks with many hidden layers that are trained using new methods have been shown to outperform Gaussian mixture models on a variety of speech recognition benchmarks, sometimes by a large margin. This paper provides an overview of this progress and represents the shared views of four research groups who have had recent successes in using deep neural networks for acoustic modeling in speech recognition.
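The hybrid pipeline this abstract describes has a simple core: the network emits posteriors p(state | acoustics), while the HMM decoder wants likelihoods, so the posteriors are divided by the state priors. Below is a minimal sketch of that scoring step (function and argument names are illustrative, not from the paper):

```python
import numpy as np

def hybrid_acoustic_scores(posteriors, state_priors, floor=1e-8):
    """Convert DNN state posteriors p(s|x) into scaled likelihoods
    p(x|s) proportional to p(s|x) / p(s) for use inside an HMM decoder.
    posteriors: (frames, states) softmax outputs of the network.
    state_priors: (states,) relative frequency of each HMM state
    in the training alignments."""
    log_post = np.log(np.maximum(posteriors, floor))
    log_prior = np.log(np.maximum(state_priors, floor))
    # Broadcasting subtracts the prior from every frame's posterior row.
    return log_post - log_prior  # (frames, states) log scaled likelihoods
```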
Large scale distributed deep networks
 in Proceedings of NIPS, 2012
"... Abstract Recent work in unsupervised feature learning and deep learning has shown that being able to train large models can dramatically improve performance. In this paper, we consider the problem of training a deep network with billions of parameters using tens of thousands of CPU cores. We have d ..."
Abstract

Cited by 107 (12 self)
 Add to MetaCart
(Show Context)
Recent work in unsupervised feature learning and deep learning has shown that being able to train large models can dramatically improve performance. In this paper, we consider the problem of training a deep network with billions of parameters using tens of thousands of CPU cores. We have developed a software framework called DistBelief that can utilize computing clusters with thousands of machines to train large models. Within this framework, we have developed two algorithms for large-scale distributed training: (i) Downpour SGD, an asynchronous stochastic gradient descent procedure supporting a large number of model replicas, and (ii) Sandblaster, a framework that supports a variety of distributed batch optimization procedures, including a distributed implementation of L-BFGS. Downpour SGD and Sandblaster L-BFGS both increase the scale and speed of deep network training. We have successfully used our system to train a deep network 30x larger than previously reported in the literature, achieving state-of-the-art performance on ImageNet, a visual object recognition task with 16 million images and 21k categories. We show that these same techniques dramatically accelerate the training of a more modestly sized deep network for a commercial speech recognition service. Although we focus on and report performance of these methods as applied to training large neural networks, the underlying algorithms are applicable to any gradient-based machine learning algorithm.
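As a rough illustration of the Downpour SGD pattern described above, the toy sketch below simulates a parameter server with asynchronous replicas that fetch possibly stale parameters and push gradients without coordination. All class and function names here are assumptions for illustration, not the DistBelief API:

```python
import numpy as np

class ParameterServer:
    """Toy central store: replicas pull parameters and push gradients."""
    def __init__(self, dim, lr=0.01):
        self.params = np.zeros(dim)
        self.lr = lr

    def fetch(self):
        return self.params.copy()        # replica gets a (possibly stale) snapshot

    def push(self, grad):
        self.params -= self.lr * grad    # applied asynchronously, no global lock

def replica_step(server, data_shard, grad_fn, fetch_every=5):
    """One replica's loop: compute gradients on its own shard and only
    refresh parameters periodically, tolerating staleness (hypothetical
    grad_fn maps (params, batch) -> gradient)."""
    params = server.fetch()
    for step, batch in enumerate(data_shard):
        if step % fetch_every == 0:
            params = server.fetch()
        server.push(grad_fn(params, batch))
```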
Application of Pretrained Deep Neural Networks to Large Vocabulary Conversational Speech Recognition
, 2012
"... ..."
(Show Context)
Improving wideband speech recognition using mixed-bandwidth training data in CD-DNN-HMM
 in IEEE SLT, 2012
"... Contextdependent deep neural network hidden Markov model (CDDNNHMM) is a recently proposed acoustic model that significantly outperformed Gaussian mixture model (GMM)HMM systems in many large vocabulary speech recognition (LVSR) tasks. In this paper we present our strategy of using mixedbandwidt ..."
Abstract

Cited by 13 (6 self)
 Add to MetaCart
(Show Context)
The context-dependent deep neural network hidden Markov model (CD-DNN-HMM) is a recently proposed acoustic model that has significantly outperformed Gaussian mixture model (GMM)-HMM systems in many large vocabulary speech recognition (LVSR) tasks. In this paper we present our strategy of using mixed-bandwidth training data to improve wideband speech recognition accuracy in the CD-DNN-HMM framework. We show that DNNs provide the flexibility of using arbitrary features. By using Mel-scale log filter-bank features we not only achieve higher recognition accuracy than with MFCCs, but can also formulate the mixed-bandwidth training problem as a missing-feature problem, in which several feature dimensions have no value when narrowband speech is presented. This treatment makes training CD-DNN-HMMs with mixed-bandwidth data an easy task, since no bandwidth extension is needed. Our experiments on voice search data indicate that the proposed solution not only provides higher recognition accuracy for wideband speech but also allows the same CD-DNN-HMM to recognize mixed-bandwidth speech. By exploiting mixed-bandwidth training data, the CD-DNN-HMM outperforms the fMPE+BMMI-trained GMM-HMM, which cannot benefit from using narrowband data, by 18.4%. Index Terms — deep neural network, log filter bank, CD-DNN-HMM, wideband, narrowband, mixed-bandwidth
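A minimal sketch of the missing-feature treatment the abstract describes: narrowband utterances simply leave the mel filter banks above 4 kHz without a value, so wideband and narrowband data can feed the same network. The bank counts below are assumptions for illustration, not the paper's exact configuration:

```python
import numpy as np

N_BANKS = 40     # total mel filter banks for wideband (16 kHz) speech (assumed)
N_NARROW = 29    # banks covering 0-4 kHz, all a narrowband signal provides (assumed)

def mixed_bandwidth_features(log_fbank, is_narrowband):
    """log_fbank: (frames, n_available_banks) log filter-bank energies.
    Returns (frames, N_BANKS): narrowband frames leave the upper banks
    'missing', here filled with zeros, so one CD-DNN-HMM consumes both."""
    frames = log_fbank.shape[0]
    feats = np.zeros((frames, N_BANKS))
    if is_narrowband:
        feats[:, :N_NARROW] = log_fbank[:, :N_NARROW]  # 4-8 kHz banks stay empty
    else:
        feats[:] = log_fbank
    return feats
```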
An empirical study of learning rates in deep neural networks for speech recognition
 in Proc. ICASSP
"... Recent deep neural network systems for large vocabulary speech recognition are trained with minibatch stochastic gradient descent but use a variety of learning rate scheduling schemes. We investigate several of these schemes, particularly AdaGrad. Based on our analysis of its limitations, we propos ..."
Abstract

Cited by 13 (3 self)
 Add to MetaCart
(Show Context)
Recent deep neural network systems for large vocabulary speech recognition are trained with mini-batch stochastic gradient descent but use a variety of learning rate scheduling schemes. We investigate several of these schemes, particularly AdaGrad. Based on our analysis of its limitations, we propose a new variant, ‘AdaDec’, that decouples long-term learning-rate scheduling from per-parameter learning rate variation. AdaDec was found to result in higher frame accuracies than other methods. Overall, careful choice of learning rate schemes leads to faster convergence and lower word error rates. Index Terms — Deep neural networks, large vocabulary speech recognition, Voice Search, learning rate, AdaGrad, AdaDec.
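For concreteness, here is standard AdaGrad alongside a hedged sketch of an AdaDec-style update that separates a long-term global schedule from the per-parameter term, as the abstract describes. The decay constants and the exact decayed-accumulator form are assumptions, not reproduced from the paper:

```python
import numpy as np

def adagrad_update(params, grad, accum, lr=0.01, eps=1e-8):
    """Standard AdaGrad: per-parameter scaling by the lifetime sum of
    squared gradients, which also forces the effective rate to decay."""
    accum += grad ** 2
    params -= lr * grad / (np.sqrt(accum) + eps)
    return params, accum

def adadec_style_update(params, grad, accum, step, lr0=0.01,
                        gamma=0.999, decay_half_life=50_000, eps=1e-8):
    """Sketch of the decoupling idea (constants are assumptions): a decayed
    accumulator handles per-parameter variation, while a separate global
    schedule controls the long-term learning-rate decay."""
    accum = gamma * accum + grad ** 2                   # per-parameter part
    global_lr = lr0 * 0.5 ** (step / decay_half_life)   # long-term schedule
    params -= global_lr * grad / (np.sqrt(accum) + eps)
    return params, accum
```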
The Case for Onloading Continuous High-Datarate Perception to the Phone
"... Much has been said recently on offloading computations from the phone. In particular, workloads such as speech and visual recognition that involve models based on “big data ” are thought to be prime candidates for cloud processing. We posit that the next few years will see the arrival of mobile usa ..."
Abstract

Cited by 10 (3 self)
 Add to MetaCart
(Show Context)
Much has been said recently on offloading computations from the phone. In particular, workloads such as speech and visual recognition that involve models based on “big data” are thought to be prime candidates for cloud processing. We posit that the next few years will see the arrival of mobile usages that require continuous processing of audio and video data from wearable devices. We argue that these usages are unlikely to flourish unless substantial computation is moved back on to the phone. We outline possible solutions to the problems inherent in such a move. We advocate a close partnership between perception and systems researchers to realize these usages.
Exploiting linear structure within convolutional networks for efficient evaluation
 arXiv:1404.0736, 2014
"... We present techniques for speeding up the testtime evaluation of large convolutional networks, designed for object recognition tasks. These models deliver impressive accuracy, but each image evaluation requires millions of floating point operations, making their deployment on smartphones and Inter ..."
Abstract

Cited by 8 (0 self)
 Add to MetaCart
(Show Context)
We present techniques for speeding up the test-time evaluation of large convolutional networks designed for object recognition tasks. These models deliver impressive accuracy, but each image evaluation requires millions of floating point operations, making their deployment on smartphones and Internet-scale clusters problematic. The computation is dominated by the convolution operations in the lower layers of the model. We exploit the redundancy present within the convolutional filters to derive approximations that significantly reduce the required computation. Using large state-of-the-art models, we demonstrate speedups of convolutional layers on both CPU and GPU by a factor of 2×, while keeping the accuracy within 1% of the original model.
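A generic version of the low-rank idea: factor a layer's weight matrix with a truncated SVD so that applying it costs rank*(m+n) multiplies instead of m*n. This is a hedged sketch; the paper's actual approximations operate on 4-D convolutional filter tensors and are more elaborate:

```python
import numpy as np

def low_rank_factorize(W, rank):
    """Approximate W (m x n) as A @ B with A: (m x rank), B: (rank x n).
    Applying A @ (B @ x) replaces one m*n matrix-vector product with two
    cheaper ones when rank << min(m, n)."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * s[:rank]   # absorb singular values into the left factor
    B = Vt[:rank, :]
    return A, B

# Usage: measure how much accuracy the compression costs on one layer.
W = np.random.randn(512, 1024)
A, B = low_rank_factorize(W, rank=64)
rel_err = np.linalg.norm(W - A @ B) / np.linalg.norm(W)
print(f"relative approximation error: {rel_err:.3f}")
```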
MULTIFRAME DEEP NEURAL NETWORKS FOR ACOUSTIC MODELING
"... Deep neural networks have been shown to perform very well as acoustic models for automatic speech recognition. Compared to Gaussian mixtures however, they tend to be very expensive computationally, making them challenging to use in realtime applications. One key advantage of such neural networks is ..."
Abstract

Cited by 5 (0 self)
 Add to MetaCart
(Show Context)
Deep neural networks have been shown to perform very well as acoustic models for automatic speech recognition. Compared to Gaussian mixtures, however, they tend to be very expensive computationally, making them challenging to use in real-time applications. One key advantage of such neural networks is their ability to learn from very long observation windows, up to 400 ms. Given this very long temporal context, it is tempting to wonder whether one can run neural networks at a lower frame rate than the typical 10 ms, and whether there might be computational benefits to doing so. This paper describes a method of tying the neural network parameters over time which achieves comparable performance to the typical frame-synchronous model, while achieving up to a 4X reduction in the computational cost of the neural network activations. Index Terms — deep neural networks, acoustic modeling
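A hedged sketch of the lower-frame-rate idea in the abstract: evaluate the network once per K frames over a wide context window and let it emit a block of K posterior vectors jointly, rather than one forward pass per 10 ms frame. The `dnn_forward` callable, window sizes, and K below are illustrative assumptions, not the paper's parameter-tying scheme in detail:

```python
import numpy as np

def multiframe_posteriors(frames, dnn_forward, left=20, right=20, K=4):
    """frames: (T, dim) acoustic features.
    dnn_forward: hypothetical network mapping one stacked context window
    to a (K, n_states) block of posteriors for frames t .. t+K-1.
    The network therefore runs once per K frames instead of per frame."""
    T, dim = frames.shape
    # Edge-pad so every evaluation sees a fixed-size window.
    padded = np.pad(frames, ((left, right + K), (0, 0)), mode="edge")
    blocks = []
    for t in range(0, T, K):
        window = padded[t : t + left + right + K].reshape(-1)
        blocks.append(dnn_forward(window))       # joint (K, n_states) block
    return np.concatenate(blocks, axis=0)[:T]
```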
Deep Learning with Limited Numerical Precision
"... Training of largescale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of lowprecision fixedpoint computations, we observe the rounding ..."
Abstract
 Add to MetaCart
(Show Context)
Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited-precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe that the rounding scheme plays a crucial role in determining the network’s behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
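Stochastic rounding itself is compact: round up with probability equal to the fractional distance to the next grid point, so the quantizer is unbiased in expectation (E[round(x)] = x). A sketch follows; the 16-bit Q-format split into integer and fractional bits is an assumption for illustration:

```python
import numpy as np

def stochastic_round_fixed_point(x, il=3, fl=12, rng=np.random.default_rng()):
    """Round ndarray x to a signed fixed-point grid with il integer bits,
    fl fractional bits, and one sign bit (il + fl + 1 = 16 total here)."""
    scale = 2.0 ** fl
    scaled = x * scale
    floor = np.floor(scaled)
    prob_up = scaled - floor                     # fractional part in [0, 1)
    rounded = floor + (rng.random(x.shape) < prob_up)   # unbiased coin flip up
    limit = 2.0 ** (il + fl) - 1                 # saturate to representable range
    return np.clip(rounded, -limit - 1, limit) / scale
```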
Fixed Point Quantization of Deep Convolutional Networks
"... Abstract In recent years increasingly complex architectures for deep convolution networks (DCNs) have been proposed to boost the performance on image recognition tasks. However, the gains in performance have come at a cost of substantial increase in computation and model storage resources. Fixed po ..."
Abstract
 Add to MetaCart
(Show Context)
In recent years, increasingly complex architectures for deep convolutional networks (DCNs) have been proposed to boost performance on image recognition tasks. However, the gains in performance have come at the cost of a substantial increase in computation and model storage resources. Fixed-point implementation of DCNs has the potential to alleviate some of these complexities and facilitate deployment on embedded hardware. In this paper, we propose a quantizer design for fixed-point implementation of DCNs. We formulate and solve an optimization problem to identify the optimal fixed-point bit-width allocation across DCN layers. Our experiments show that, in comparison to equal bit-width settings, fixed-point DCNs with optimized bit-width allocation offer a >20% reduction in model size without any loss in accuracy on the CIFAR-10 benchmark. We also demonstrate that fine-tuning can further enhance the accuracy of fixed-point DCNs beyond that of the original floating-point model. In doing so, we report a new state-of-the-art fixed-point performance of 6.78% error rate on the CIFAR-10 benchmark.
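A hedged sketch of uniform fixed-point weight quantization with a per-layer bit-width, in the spirit of the abstract; the paper's optimization for choosing the allocation is not reproduced, and the layer names and bit-widths below are placeholders:

```python
import numpy as np

def quantize_uniform(w, bits):
    """Quantize weights to a signed fixed-point grid with `bits` total bits,
    picking the fractional length to cover this layer's dynamic range."""
    max_abs = np.max(np.abs(w)) + 1e-12
    fl = bits - 1 - int(np.ceil(np.log2(max_abs)))  # fractional length
    scale = 2.0 ** fl
    q = np.round(w * scale)
    limit = 2 ** (bits - 1) - 1                     # saturate to signed range
    return np.clip(q, -limit - 1, limit) / scale

# Per-layer allocation (placeholder values, not an optimized allocation):
layer_bits = {"conv1": 8, "conv2": 6, "fc": 4}
quantized = {name: quantize_uniform(np.random.randn(64, 64), b)
             for name, b in layer_bits.items()}
```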