Context-Dependent Pre-Trained Deep Neural Networks for Large-Vocabulary Speech Recognition
IEEE Transactions on Audio, Speech, and Language Processing, 2012
Cited by 85 (35 self)
We propose a novel context-dependent (CD) model for large-vocabulary speech recognition (LVSR) that leverages recent advances in using deep belief networks for phone recognition. We describe a pre-trained deep neural network hidden Markov model (DNN-HMM) hybrid architecture that trains the DNN to produce a distribution over senones (tied triphone states) as its output. The deep belief network pre-training algorithm is a robust and often helpful way to initialize deep neural networks generatively that can aid in optimization and reduce generalization error. We illustrate the key components of our model, describe the procedure for applying CD-DNN-HMMs to LVSR, and analyze the effects of various modeling choices on performance. Experiments on a challenging business search dataset demonstrate that CD-DNN-HMMs can significantly outperform the conventional context-dependent Gaussian mixture model (GMM)-HMMs, with an absolute sentence accuracy improvement of 5.8% and 9.2% (or relative error reduction of 16.0% and 23.2%) over the CD-GMM-HMMs trained using the minimum phone error rate (MPE) and maximum likelihood (ML) criteria, respectively. Index Terms: speech recognition, deep belief network, context-dependent phone, LVSR, DNN-HMM, ANN-HMM.
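The absolute and relative figures quoted in this abstract can be cross-checked with a little arithmetic. The sketch below derives the baseline sentence error rates that the two pairs of numbers imply; these derived values are not stated in the abstract, and the function name is purely illustrative.

```python
def implied_baseline_error(absolute_gain_pct, relative_reduction_pct):
    """Baseline error rate (in %) implied by an absolute gain and a relative error reduction.

    relative_reduction = absolute_gain / baseline_error, hence
    baseline_error = absolute_gain / relative_reduction.
    """
    return absolute_gain_pct / (relative_reduction_pct / 100.0)


# Figures quoted in the abstract: (absolute gain in %, relative error reduction in %).
mpe_error = implied_baseline_error(5.8, 16.0)   # baseline trained with the MPE criterion
ml_error = implied_baseline_error(9.2, 23.2)    # baseline trained with the ML criterion

print(f"implied MPE baseline sentence error: {mpe_error:.2f}%")
print(f"implied ML baseline sentence error:  {ml_error:.2f}%")
```

Under this reading, both baselines sit between 36% and 40% sentence error, which is consistent with the two relative reductions being reported against different starting points.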
Deep Neural Networks for Acoustic Modeling in Speech Recognition
Cited by 43 (18 self)
Most current speech recognition systems use hidden Markov models (HMMs) to deal with the temporal variability of speech and Gaussian mixture models to determine how well each state of each HMM fits a frame or a short window of frames of coefficients that represents the acoustic input. An alternative way to evaluate the fit is to use a feedforward neural network that takes several frames of coefficients as input and produces posterior probabilities over HMM states as output. Deep neural networks that have many hidden layers and are trained using new methods have been shown to outperform Gaussian mixture models on a variety of speech recognition benchmarks, sometimes by a large margin. This paper provides an overview of this progress and represents the shared views of four research groups who have had recent successes in using deep neural networks for acoustic modeling in speech recognition.
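The input/output contract described here (several stacked frames in, a posterior distribution over HMM states out) can be sketched in a few lines. This is a toy illustration only: the layer sizes, the sigmoid hidden units, the random untrained weights, and all function names are assumptions, not details taken from the paper.

```python
import math
import random


def stack_frames(frames, t, context):
    """Concatenate frames t-context..t+context, repeating edge frames at the boundaries."""
    window = []
    for k in range(t - context, t + context + 1):
        k = min(max(k, 0), len(frames) - 1)
        window.extend(frames[k])
    return window


def affine(weights, bias, x):
    """Compute W x + b for a weight matrix stored as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def softmax(z):
    """Numerically stable softmax."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def state_posteriors(layers, x):
    """Sigmoid hidden layers followed by a softmax over HMM states."""
    for weights, bias in layers[:-1]:
        x = [1.0 / (1.0 + math.exp(-v)) for v in affine(weights, bias, x)]
    weights, bias = layers[-1]
    return softmax(affine(weights, bias, x))


def random_layer(rng, n_in, n_out):
    """Small random weights and zero biases (untrained, for shape checking only)."""
    weights = [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


rng = random.Random(0)
n_coeffs, context, n_hidden, n_states = 3, 2, 8, 4   # toy sizes, chosen arbitrarily
frames = [[rng.gauss(0.0, 1.0) for _ in range(n_coeffs)] for _ in range(10)]
n_in = n_coeffs * (2 * context + 1)
layers = [random_layer(rng, n_in, n_hidden),
          random_layer(rng, n_hidden, n_hidden),
          random_layer(rng, n_hidden, n_states)]
posteriors = state_posteriors(layers, stack_frames(frames, 0, context))
print(posteriors)
```

The output is one probability per HMM state for the centre frame; in a hybrid system these posteriors would stand in for the per-state fit that the Gaussian mixture models otherwise provide.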
Deep Sparse Rectifier Neural Networks
Cited by 10 (3 self)
While logistic sigmoid neurons are more biologically plausible than hyperbolic tangent neurons, the latter work better for training multi-layer neural networks. This paper shows that rectifying neurons are an even better model of biological neurons and yield equal or better performance than hyperbolic tangent networks in spite of the hard non-linearity and non-differentiability at zero, creating sparse representations with true zeros, which seem remarkably suitable for naturally sparse data. Even though they can take advantage of semi-supervised setups with extra unlabeled data, deep rectifier networks can reach their best performance without requiring any unsupervised pre-training on purely supervised tasks with large labeled datasets. Hence, these results can be seen as a new milestone in the attempts at understanding the difficulty in training deep but purely supervised neural networks, and closing the performance gap between neural networks learnt with and without unsupervised pre-training.
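The phrase "sparse representations with true zeros" is easy to see numerically. The sketch below applies a rectifier, max(0, x), and a hyperbolic tangent to the same random pre-activations and counts exact zeros; the Gaussian inputs and the helper names are assumptions made for illustration.

```python
import math
import random


def rectifier(x):
    """Rectifying nonlinearity: passes positive inputs, outputs an exact zero otherwise."""
    return max(0.0, x)


def fraction_exact_zeros(values):
    """Share of activations that are exactly 0.0."""
    return sum(1 for v in values if v == 0.0) / len(values)


rng = random.Random(0)
pre_activations = [rng.gauss(0.0, 1.0) for _ in range(10_000)]
relu_out = [rectifier(x) for x in pre_activations]
tanh_out = [math.tanh(x) for x in pre_activations]

print("rectifier zeros:", fraction_exact_zeros(relu_out))
print("tanh zeros:     ", fraction_exact_zeros(tanh_out))
```

With zero-mean inputs roughly half of the rectifier outputs are exactly zero, whereas tanh outputs are merely small for small inputs and never switch off entirely.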
Representation Learning: A Review and New Perspectives
2012
Cited by 10 (0 self)
The success of machine learning algorithms generally depends on data representation, and we hypothesize that this is because different representations can entangle and hide more or less the different explanatory factors of variation behind the data. Although specific domain knowledge can be used to help design representations, learning with generic priors can also be used, and the quest for AI is motivating the design of more powerful representation-learning algorithms implementing such priors. This paper reviews recent work in the area of unsupervised feature learning and joint training of deep learning, covering advances in probabilistic models, autoencoders, manifold learning, and deep architectures. This motivates longer-term unanswered questions about the appropriate objectives for learning good representations, for computing representations (i.e., inference), and the geometrical connections between representation learning, density estimation and manifold learning.
Improved preconditioner for Hessian free optimization
In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011
Cited by 5 (0 self)
We investigate the use of Hessian Free optimization for learning deep autoencoders. One of the critical components in that algorithm is the choice of the preconditioner. We argue in this paper that the Jacobi preconditioner leads to faster optimization and we show how it can be accurately and efficiently estimated using a randomized algorithm.
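A Jacobi preconditioner needs only the diagonal of the curvature matrix, and that diagonal can be estimated from matrix-vector products alone. The sketch below uses one standard randomized estimator, averaging v * (A v) over random sign vectors v; whether this matches the estimator used in the paper is an assumption, since the abstract does not spell it out, and the small matrix and the damping constant are made up for illustration.

```python
import random


def matvec(a, v):
    """Dense matrix-vector product; stands in for a Hessian-vector product routine."""
    return [sum(aij * vj for aij, vj in zip(row, v)) for row in a]


def estimate_diagonal(apply_matrix, dim, n_samples, rng):
    """Estimate diag(A) as the average of v * (A v) over random +/-1 vectors v."""
    estimate = [0.0] * dim
    for _ in range(n_samples):
        v = [rng.choice((-1.0, 1.0)) for _ in range(dim)]
        av = apply_matrix(v)
        for i in range(dim):
            estimate[i] += v[i] * av[i] / n_samples
    return estimate


def jacobi_precondition(gradient, diagonal, damping=1e-3):
    """Scale each gradient coordinate by the inverse of its (damped) diagonal entry."""
    return [g / (d + damping) for g, d in zip(gradient, diagonal)]


# Toy symmetric positive-definite matrix (hypothetical curvature matrix).
A = [[4.0, 1.0, 0.5],
     [1.0, 3.0, 0.2],
     [0.5, 0.2, 2.0]]

rng = random.Random(0)
diag_estimate = estimate_diagonal(lambda v: matvec(A, v), dim=3, n_samples=4000, rng=rng)
print("estimated diagonal:", diag_estimate)
print("preconditioned gradient:", jacobi_precondition([1.0, 1.0, 1.0], diag_estimate))
```

The off-diagonal terms enter each sample with a random sign and average out, so the estimate tightens as the number of samples grows while the matrix itself is never formed explicitly.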
Deep Learning Made Easier by Linear Transformations in Perceptrons
Cited by 3 (2 self)
We transform the outputs of each hidden neuron in a multilayer perceptron network to be zero mean and zero slope, and use separate shortcut connections to model the linear dependencies instead. This transformation aims at separating the problems of learning the linear and nonlinear parts of the whole input-output mapping, which has many benefits. We study the theoretical properties of the transformations by noting that they make the Fisher information matrix closer to a diagonal matrix, and thus the standard gradient closer to the natural gradient. We experimentally confirm the usefulness of the transformations by noting that they make basic stochastic gradient learning competitive with state-of-the-art learning algorithms in speed, and that they seem also to help find solutions that generalize better. The experiments include both classification of handwritten digits with a 3-layer network and learning a low-dimensional representation for images by using a 6-layer autoencoder network. The transformations were beneficial in all cases, with and without regularization.
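The "zero mean and zero slope" condition can be made concrete for a single hidden unit. The sketch below assumes a tanh unit and adds a linear correction alpha*z + beta whose coefficients are fitted on a batch of pre-activations; the tanh choice, the batch-statistics fit, and the function names are illustrative assumptions, and the shortcut connections that would carry the removed linear part are not shown.

```python
import math
import random


def fit_linear_correction(pre_activations):
    """Choose alpha, beta so that tanh(z) + alpha*z + beta has zero mean and zero mean slope."""
    n = len(pre_activations)
    mean_slope = sum(1.0 - math.tanh(z) ** 2 for z in pre_activations) / n
    alpha = -mean_slope                      # cancels the average derivative
    beta = -sum(math.tanh(z) + alpha * z for z in pre_activations) / n  # cancels the mean
    return alpha, beta


def transformed_unit(z, alpha, beta):
    """Hidden-unit output after the linear correction."""
    return math.tanh(z) + alpha * z + beta


def transformed_slope(z, alpha):
    """Derivative of the corrected output with respect to z."""
    return 1.0 - math.tanh(z) ** 2 + alpha


rng = random.Random(0)
batch = [rng.gauss(0.5, 1.0) for _ in range(1000)]   # toy pre-activations
alpha, beta = fit_linear_correction(batch)

mean_output = sum(transformed_unit(z, alpha, beta) for z in batch) / len(batch)
mean_slope = sum(transformed_slope(z, alpha) for z in batch) / len(batch)
print(alpha, beta, mean_output, mean_slope)
```

After the correction both batch averages vanish up to rounding error, so the unit carries only the nonlinear part of the mapping on average.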
Training Recurrent Neural Networks
2013
Cited by 3 (0 self)
Recurrent Neural Networks (RNNs) are powerful sequence models that were believed to be difficult to train, and as a result they were rarely used in machine learning applications. This thesis presents methods that overcome the difficulty of training RNNs, and applications of RNNs to challenging problems. We first describe a new probabilistic sequence model that combines Restricted Boltzmann Machines and RNNs. The new model is more powerful than similar models while being less difficult to train. Next, we present a new variant of the Hessian-free (HF) optimizer and show that it can train RNNs on tasks that have extreme long-range temporal dependencies, which were previously considered to be impossibly hard. We then apply HF to character-level language modelling and get excellent results. We also apply HF to optimal control and obtain RNN control laws that can successfully operate under conditions of delayed feedback and unknown disturbances. Finally, we describe a random parameter initialization scheme that allows gradient descent with momentum to train RNNs on problems with long-term dependencies. This directly contradicts widespread beliefs about the inability of first-order methods to do so, and suggests that previous attempts at training RNNs failed partly due to flaws in the random initialization.
Training Deep and Recurrent Networks with Hessian-Free Optimization
2012
Cited by 1 (1 self)
Hessian-Free optimization (HF) is an approach for unconstrained minimization of real-valued smooth objective functions. Like standard Newton's method, it uses local quadratic approximations to generate update proposals. It belongs to the broad class of approximate Newton methods that are practical for problems
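The idea of generating an update proposal from a local quadratic approximation without ever forming the Hessian can be sketched with conjugate gradient, which needs only Hessian-vector products. The listing's abstract is cut off before it describes the method in detail, so the sketch below is an assumption-laden illustration of the general approach on a toy two-dimensional quadratic, not the authors' algorithm; the matrix, the vector b, and the function names are hypothetical.

```python
def dot(u, v):
    """Inner product of two vectors stored as lists."""
    return sum(a * b for a, b in zip(u, v))


def conjugate_gradient(hess_vec, gradient, n_iters=50, tol=1e-10):
    """Approximately minimize q(p) = g.p + 0.5 p.H p (solve H p = -g) using only products H v."""
    p = [0.0] * len(gradient)
    r = [-g for g in gradient]          # residual -g - H p at p = 0
    d = list(r)                         # first search direction
    rr = dot(r, r)
    for _ in range(n_iters):
        if rr < tol:
            break
        hd = hess_vec(d)
        step = rr / dot(d, hd)
        p = [pi + step * di for pi, di in zip(p, d)]
        r = [ri - step * hdi for ri, hdi in zip(r, hd)]
        rr_new = dot(r, r)
        d = [ri + (rr_new / rr) * di for ri, di in zip(r, d)]
        rr = rr_new
    return p


# Toy objective f(x) = 0.5 x.H x - b.x with a fixed positive-definite H.
H = [[3.0, 1.0],
     [1.0, 2.0]]
b = [1.0, 1.0]


def hess_vec(v):
    """Hessian-vector product; here a plain matrix product, in practice computed implicitly."""
    return [dot(row, v) for row in H]


x0 = [0.0, 0.0]
gradient_at_x0 = [gx - bi for gx, bi in zip(hess_vec(x0), b)]
proposal = conjugate_gradient(hess_vec, gradient_at_x0)
print("update proposal:", proposal)
```

Because the toy objective is exactly quadratic, the proposal is the full Newton step; on a real network the quadratic model is only local, which is why such proposals are normally damped or truncated.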
Learning Two-Layer Contractive Encodings
Unsupervised learning of feature hierarchies is often a good initialization for supervised training of deep architectures. In existing deep learning methods, these feature hierarchies are built layer by layer in a greedy fashion using autoencoders or restricted Boltzmann machines. Both yield encoders, which compute linear projections followed by a smooth thresholding function. In this work, we demonstrate that these encoders fail to find stable features when the required computation is in the exclusive-or class. To overcome this limitation, we propose a two-layer encoder which is not restricted in the type of features it can learn. The proposed encoder can be regularized by an extension of previous work on contractive regularization. We demonstrate the advantages of two-layer encoders qualitatively, as well as on commonly used benchmark datasets.
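The exclusive-or limitation of a single "linear projection plus threshold" encoder is a classical fact that a brute-force check makes tangible. The sketch below simplifies the setting by using hard thresholds instead of smooth ones, a coarse weight grid, and hand-set weights for the two-layer case; none of this reproduces the paper's contractive encoder, and all names are illustrative.

```python
from itertools import product

XOR = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}


def threshold_unit(w1, w2, b, x1, x2):
    """Linear projection followed by a hard threshold."""
    return 1 if w1 * x1 + w2 * x2 + b > 0 else 0


def single_unit_solves_xor(grid):
    """Search a weight grid for one linear-threshold unit that reproduces XOR."""
    for w1, w2, b in product(grid, repeat=3):
        if all(threshold_unit(w1, w2, b, *x) == y for x, y in XOR.items()):
            return True
    return False


def two_layer_xor(x1, x2):
    """Two hidden threshold units (OR and AND) feeding one output unit."""
    h_or = threshold_unit(1.0, 1.0, -0.5, x1, x2)
    h_and = threshold_unit(1.0, 1.0, -1.5, x1, x2)
    return threshold_unit(1.0, -1.0, -0.5, h_or, h_and)


grid = [k / 2.0 for k in range(-8, 9)]   # weights and bias in [-4, 4], step 0.5
found_single = single_unit_solves_xor(grid)
two_layer_ok = all(two_layer_xor(*x) == y for x, y in XOR.items())
print("single unit solves XOR:", found_single)
print("two-layer unit solves XOR:", two_layer_ok)
```

No setting on the grid works for the single unit because XOR is not linearly separable, while one extra layer is enough to represent it.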
On the importance of initialization and momentum in deep learning
Deep and recurrent neural networks (DNNs and RNNs respectively) are powerful models that were considered to be almost impossible to train using stochastic gradient descent with momentum. In this paper, we show that when stochastic gradient descent with momentum uses a well-designed random initialization and a particular type of slowly increasing schedule for the momentum parameter, it can train both DNNs and RNNs (on datasets with long-term dependencies) to levels of performance that were previously achievable only with Hessian-Free optimization. We find that both the initialization and the momentum are crucial since poorly initialized networks cannot be trained with momentum and well-initialized networks perform markedly worse when the momentum is absent or poorly tuned. Our success training these models suggests that previous attempts to train deep and recurrent neural networks from random initializations have likely failed due to poor initialization schemes. Furthermore, carefully tuned momentum methods suffice for dealing with the curvature issues in deep and recurrent network training objectives without the need for sophisticated second-order methods.
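A "slowly increasing schedule for the momentum parameter" can be sketched on a one-dimensional toy problem. The schedule shape and its constants below (start at 0.5, step up every 250 updates, cap at 0.99), the learning rate, and the quadratic objective are all illustrative assumptions; the abstract does not give the schedule, and the initialization scheme it emphasizes is not shown here.

```python
def momentum_schedule(step, mu_max=0.99, period=250):
    """Momentum that starts at 0.5 and rises slowly toward mu_max (assumed schedule shape)."""
    return min(mu_max, 1.0 - 0.5 / (step // period + 1))


def sgd_with_momentum(grad, x0, learning_rate, n_steps):
    """Classical momentum: velocity accumulates past gradients, position follows velocity."""
    x, velocity = x0, 0.0
    for step in range(n_steps):
        mu = momentum_schedule(step)
        velocity = mu * velocity - learning_rate * grad(x)
        x = x + velocity
    return x


# Toy low-curvature objective f(x) = 0.5 * 0.01 * x**2, whose gradient is 0.01 * x.
x_final = sgd_with_momentum(lambda x: 0.01 * x, x0=5.0, learning_rate=0.1, n_steps=2000)
print("momentum at steps 0, 250, 1000:",
      momentum_schedule(0), momentum_schedule(250), momentum_schedule(1000))
print("final x:", x_final)
```

In a low-curvature direction like this one, a larger momentum multiplies the effective step size by roughly 1 / (1 - mu), which is the usual intuition for why momentum helps with the curvature issues the abstract mentions.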