Results 21 - 30 of 254
Rectifier nonlinearities improve neural network acoustic models
- in ICML Workshop on Deep Learning for Audio, Speech and Language Processing, 2013
Cited by 23 (3 self)
Abstract: Deep neural network acoustic models produce substantial gains in large vocabulary continuous speech recognition systems. Emerging work with rectified linear (ReL) hidden units demonstrates additional gains in final system performance relative to more commonly used sigmoidal nonlinearities. In this work, we explore the use of deep rectifier networks as acoustic models for the 300 hour Switchboard conversational speech recognition task. Using simple training procedures without pretraining, networks with rectifier nonlinearities produce 2% absolute reductions in word error rates over their sigmoidal counterparts. We analyze hidden layer representations to quantify differences in how ReL units encode inputs as compared to sigmoidal units. Finally, we evaluate a variant of the ReL unit with a gradient more amenable to optimization in an attempt to further improve deep rectifier networks.
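A minimal plain-Python sketch of the nonlinearities being compared; the leaky negative slope standing in for the paper's modified ReL variant is an illustrative assumption, not the paper's exact formulation:

```python
import math

def sigmoid(z):
    # Conventional sigmoidal unit: saturates for large |z|, so gradients vanish.
    return 1.0 / (1.0 + math.exp(-z))

def relu(z):
    # Rectified linear (ReL) unit: identity for positive input, zero otherwise.
    return max(0.0, z)

def leaky_relu(z, alpha=0.01):
    # Leaky variant: a small slope keeps a nonzero gradient for z < 0;
    # alpha = 0.01 is an illustrative choice, not the paper's setting.
    return z if z > 0 else alpha * z

print([relu(z) for z in (-2.0, 0.0, 3.0)])        # [0.0, 0.0, 3.0]
print([leaky_relu(z) for z in (-2.0, 0.0, 3.0)])  # [-0.02, 0.0, 3.0]
```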
Adaptation of context-dependent deep neural networks for automatic speech recognition
- in Proc. SLT’12, 2012
Cited by 21 (6 self)
Abstract: In this paper, we evaluate the effectiveness of adaptation methods for context-dependent deep-neural-network hidden Markov models (CD-DNN-HMMs) for automatic speech recognition. We investigate the affine transformation and several of its variants for adapting the top hidden layer. We compare the affine transformations against direct adaptation of the softmax layer weights. The feature-space discriminative linear regression (fDLR) method with the affine transformations on the input layer is also evaluated. On a large vocabulary speech recognition task, a stochastic gradient ascent implementation of the fDLR and the top hidden layer adaptation is shown to reduce word error rates (WERs) by 17% and 14%, respectively, compared to the baseline DNN performances. With a batch update implementation, the softmax layer adaptation technique reduces WERs by 10%. We observe that using bias shift performs as well as doing scaling plus bias shift.
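The top-hidden-layer adaptation described here amounts to inserting a trainable affine transform h' = A·h + b above the frozen network; a toy sketch in which the matrix sizes and values are hypothetical:

```python
def affine(A, b, h):
    # Apply an affine transform h' = A*h + b to hidden vector h
    # (plain-Python matrix-vector product).
    return [sum(a_ij * h_j for a_ij, h_j in zip(row, h)) + b_i
            for row, b_i in zip(A, b)]

# Top hidden activation from the speaker-independent, frozen network.
h = [0.5, -1.0, 2.0]

# Adaptation parameters: initialized to identity plus zero bias and then
# trained on each speaker's adaptation data; at initialization the
# transform is a no-op, so the unadapted network is recovered exactly.
A = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0],
     [0.0, 0.0, 1.0]]
b = [0.0, 0.0, 0.0]

print(affine(A, b, h))  # [0.5, -1.0, 2.0]
```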
Extracting deep bottleneck features using stacked auto-encoders
- in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2013; Proc. of the First Workshop on Speech, Language and Audio in Multimedia (SLAM)
Cited by 20 (16 self)
Abstract: In this work, a novel training scheme for generating bottleneck features from deep neural networks is proposed. A stack of denoising auto-encoders is first trained in a layer-wise, unsupervised manner. Afterwards, the bottleneck layer and an additional layer are added and the whole network is fine-tuned to predict target phoneme states. We perform experiments on a Cantonese conversational telephone speech corpus and find that increasing the number of auto-encoders in the network produces more useful features, but requires pre-training, especially when little training data is available. Using additional unlabeled data solely for pre-training yields further gains. Evaluations on larger datasets and on different system setups demonstrate the general applicability of our approach. In terms of word error rate, relative improvements of 9.2% (Cantonese, ML training), 9.3% (Tagalog, BMMI-SAT training), 12% (Tagalog, confusion network combinations with MFCCs), and 8.7% (Switchboard) are achieved.
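The layer-wise stacking scheme can be sketched as follows; the actual auto-encoder training (reconstructing the clean input from the corrupted one) is elided, and the random weights stand in for trained ones:

```python
import random

def corrupt(x, p, rng):
    # Masking noise: zero each input dimension with probability p,
    # as in a denoising auto-encoder.
    return [0.0 if rng.random() < p else xi for xi in x]

def encode(W, x):
    # Linear encoder h = W*x. A real DAE adds a nonlinearity and trains W
    # to reconstruct the clean input; training is elided in this sketch.
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in W]

def stack_encoders(data, sizes, p=0.3, seed=0):
    # Greedy layer-wise stacking: layer k is (notionally) trained on the
    # corrupted codes of layer k-1, then its clean codes feed layer k+1.
    rng = random.Random(seed)
    codes = data
    for size in sizes:
        dim = len(codes[0])
        W = [[rng.gauss(0, 0.1) for _ in range(dim)] for _ in range(size)]
        _ = [encode(W, corrupt(x, p, rng)) for x in codes]  # training input
        codes = [encode(W, x) for x in codes]               # clean codes up
    return codes

feats = [[1.0, 0.0, 2.0, -1.0]] * 3
print(len(stack_encoders(feats, sizes=[3, 2])[0]))  # 2
```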
Deep convolutional neural networks using heterogeneous pooling for trading-off acoustic invariance with phonetic confusion
- in Proc. ICASSP, 2013
Cited by 19 (9 self)
Abstract: We develop and present a novel deep convolutional neural network architecture, where heterogeneous pooling is used to provide constrained frequency-shift invariance in the speech spectrogram while minimizing speech-class confusion induced by such invariance. The design of the pooling layer is guided by domain knowledge about how speech classes would change when formant frequencies are modified. The convolution and heterogeneous pooling layers are followed by a fully connected multi-layer neural network to form a deep architecture interfaced to an HMM for continuous speech recognition. During training, all layers of this entire deep net are regularized using a variant of the “dropout” technique. Experimental evaluation demonstrates the effectiveness of both heterogeneous pooling and dropout regularization. On the TIMIT phonetic recognition task, we have achieved an 18.7% phone error rate, the lowest reported in the literature on this standard task with a single system and with no use of information about speaker identity. Preliminary experiments on large vocabulary speech recognition in a voice search task also show error rate reduction using heterogeneous pooling in the deep convolutional neural network.
Index Terms — convolution, heterogeneous pooling, deep, neural network, invariance, discrimination, formants
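Heterogeneous pooling, as opposed to one uniform pooling size, can be illustrated with per-map pool sizes along the frequency axis; the sizes below are illustrative, not the paper's configuration:

```python
def max_pool_1d(xs, size):
    # Non-overlapping max pooling along the frequency axis.
    return [max(xs[i:i + size]) for i in range(0, len(xs) - size + 1, size)]

def heterogeneous_pool(feature_maps, pool_sizes):
    # Heterogeneous pooling: each convolutional feature map gets its own
    # pooling size, trading frequency-shift invariance (large pools)
    # against phonetic discrimination (small pools).
    return [max_pool_1d(fm, s) for fm, s in zip(feature_maps, pool_sizes)]

maps = [[0.1, 0.9, 0.3, 0.7, 0.2, 0.8],   # feature map pooled with size 2
        [0.5, 0.4, 0.6, 0.1, 0.9, 0.2]]   # feature map pooled with size 3
print(heterogeneous_pool(maps, pool_sizes=[2, 3]))
# [[0.9, 0.7, 0.8], [0.6, 0.9]]
```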
Modeling spectral envelopes using restricted Boltzmann machines for statistical parametric speech synthesis
- in Proc. ICASSP, 2013
Cited by 19 (2 self)
Abstract: This paper presents a new spectral modeling method for statistical parametric speech synthesis. In contrast to the conventional methods in which high-level spectral parameters, such as mel-cepstra or line spectral pairs, are adopted as the features for hidden Markov model (HMM) based parametric speech synthesis, our new method directly models the distribution of the lower-level, untransformed or raw spectral envelopes. Instead of using single Gaussian distributions, we adopt restricted Boltzmann machines (RBM) to represent the distribution of the spectral envelopes at each HMM state. We anticipate these will give superior performance in modeling the joint distribution of high-dimensional stochastic vectors. The spectral parameters are derived from the spectral envelope corresponding to the estimated mode of each context-dependent RBM and act as the Gaussian mean vector in the parameter generation procedure at synthesis time. Our experimental results show that the RBM is able to model the distribution of the spectral envelopes with better accuracy and generalization ability than the Gaussian mixture model. As a result, our proposed method can significantly improve the naturalness of the conventional HMM-based speech synthesis system using mel-cepstra.
Index Terms — Speech synthesis, hidden Markov model, restricted Boltzmann machine, spectral envelope
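One way to read "the estimated mode of each context-dependent RBM" is as a mean-field fixed point; a toy sketch under that assumption, with made-up parameters standing in for a trained Gaussian-Bernoulli RBM:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def rbm_mode(W, b, c, v0, iters=50):
    # Mean-field iteration for a Gaussian-Bernoulli RBM with unit visible
    # variance: alternate h = sigmoid(c + W*v) and v = b + W^T*h. The
    # fixed point approximates the distribution's mode, which replaces
    # the Gaussian mean vector at synthesis time. Parameters here are
    # toy values, not a trained model.
    v = list(v0)
    for _ in range(iters):
        h = [sigmoid(ci + sum(wij * vj for wij, vj in zip(row, v)))
             for row, ci in zip(W, c)]
        v = [bi + sum(W[i][j] * h[i] for i in range(len(h)))
             for j, bi in enumerate(b)]
    return v

# One hidden unit, two visible "spectral bins"; prints the converged mode.
mode = rbm_mode(W=[[0.5, -0.5]], b=[0.0, 0.0], c=[0.0], v0=[1.0, -1.0])
print([round(v, 4) for v in mode])
```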
DeepWalk: Online learning of social representations
- arXiv preprint arXiv:1403.6652, 2014
Cited by 19 (2 self)
Abstract: We present DeepWalk, a novel approach for learning latent representations of vertices in a network. These latent representations encode social relations in a continuous vector space, which is easily exploited by statistical models. DeepWalk generalizes recent advancements in language modeling and unsupervised feature learning (or deep learning) from sequences of words to graphs. DeepWalk uses local information obtained from truncated random walks to learn latent representations by treating walks as the equivalent of sentences. We demonstrate DeepWalk’s latent representations on several multi-label network classification tasks for social networks such as BlogCatalog, Flickr, and YouTube. Our results show that DeepWalk outperforms challenging baselines which are allowed a global view of the network, especially in the presence of missing information. DeepWalk’s representations can provide F1 scores up to 10% higher than competing methods when labeled data is sparse. In some experiments, DeepWalk’s representations are able to outperform all baseline methods while using 60% less training data. DeepWalk is also scalable. It is an online learning algorithm which builds useful incremental results, and is trivially parallelizable. These qualities make it suitable for a broad class of real-world applications such as network classification and anomaly detection.
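The walk-generation half of DeepWalk is easy to sketch; the skip-gram training that turns these "sentences" into embeddings is elided:

```python
import random

def random_walk(graph, start, length, rng):
    # One truncated random walk; the node sequence plays the role of a
    # sentence fed to a word2vec-style skip-gram model (training elided).
    walk = [start]
    while len(walk) < length:
        neighbors = graph[walk[-1]]
        if not neighbors:
            break  # dead end: truncate the walk early
        walk.append(rng.choice(neighbors))
    return walk

def build_corpus(graph, walks_per_node=2, length=5, seed=0):
    # DeepWalk's corpus: several walks started from every vertex,
    # with the vertex order shuffled on each pass.
    rng = random.Random(seed)
    corpus = []
    for _ in range(walks_per_node):
        nodes = list(graph)
        rng.shuffle(nodes)
        corpus.extend(random_walk(graph, n, length, rng) for n in nodes)
    return corpus

g = {"a": ["b", "c"], "b": ["a"], "c": ["a"]}
corpus = build_corpus(g)
print(len(corpus), len(corpus[0]))  # 6 5
```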
Multilingual acoustic models using distributed deep neural networks
- in Proc. ICASSP, 2013
Cited by 18 (2 self)
Abstract: Today’s speech recognition technology is mature enough to be useful for many practical applications. In this context, it is of paramount importance to train accurate acoustic models for many languages within given resource constraints such as data, processing power, and time. Multilingual training has the potential to solve the data issue and close the performance gap between resource-rich and resource-scarce languages. Neural networks lend themselves naturally to parameter sharing across languages, and distributed implementations have made it feasible to train large networks. In this paper, we present experimental results for cross- and multi-lingual network training of eleven Romance languages on 10k hours of data in total. The average relative gains over the monolingual baselines are 4%/2% (data-scarce/data-rich languages) for cross- and 7%/2% for multi-lingual training. However, the additional gain from jointly training the languages on all data comes at an increased training time of roughly four weeks, compared to two weeks (monolingual) and one week (cross-lingual).
Index Terms — Speech recognition, parameter sharing, deep neural networks, multilingual training, distributed neural networks
Investigation of recurrent-neural-network architectures and learning methods for spoken language understanding
- in Proc. INTERSPEECH, 2013
Cited by 18 (3 self)
Abstract: One of the key problems in spoken language understanding (SLU) is the task of slot filling. In light of the recent success of applying deep neural network technologies in domain detection and intent identification, we carried out an in-depth investigation on the use of recurrent neural networks for the more difficult task of slot filling involving sequence discrimination. In this work, we implemented and compared several important recurrent-neural-network architectures, including the Elman-type and Jordan-type recurrent networks and their variants. To make the results easy to reproduce and compare, we implemented these networks on the common Theano neural network toolkit, and evaluated them on the ATIS benchmark. We also compared our results to a conditional random fields (CRF) baseline. Our results show that on this task, both types of recurrent networks outperform the CRF baseline substantially, and a bi-directional Jordan-type network that takes into account both past and future dependencies among slots works best, outperforming a CRF-based baseline by 14% in relative error reduction.
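The Elman-type recurrence compared here can be written in a few lines; the dimensions and weights below are toy values, and the slot-label softmax on top is omitted:

```python
import math

def elman_step(Wx, Wh, x, h_prev):
    # One Elman-type recurrence: the new hidden state mixes the current
    # input with the previous hidden state, h_t = tanh(Wx*x_t + Wh*h_{t-1}).
    # A Jordan-type network would feed back the previous *output* instead.
    return [math.tanh(
        sum(a * b for a, b in zip(rx, x)) +
        sum(a * b for a, b in zip(rh, h_prev)))
        for rx, rh in zip(Wx, Wh)]

# Toy dimensions: 2-dim inputs, 2-dim hidden state; weights are illustrative.
Wx = [[0.5, 0.0], [0.0, 0.5]]
Wh = [[0.1, 0.0], [0.0, 0.1]]
h = [0.0, 0.0]
for x in ([1.0, 0.0], [0.0, 1.0]):   # a 2-frame input "sentence"
    h = elman_step(Wx, Wh, x, h)
print([round(v, 3) for v in h])
```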
Deep Maxout Networks for Low-Resource Speech Recognition
- in Proc. Automatic Speech Recognition and Understanding (ASRU)
Cited by 18 (11 self)
Abstract: As a feed-forward architecture, the recently proposed maxout networks integrate dropout naturally and show state-of-the-art results on various computer vision datasets. This paper investigates the application of deep maxout networks (DMNs) to large vocabulary continuous speech recognition (LVCSR) tasks. Our focus is on the particular advantage of DMNs under low-resource conditions with limited transcribed speech. We extend DMNs to hybrid and bottleneck feature systems, and explore optimal network structures (number of maxout layers, pooling strategy, etc.) for both setups. On the newly released Babel corpus, behaviors of DMNs are extensively studied under different levels of data availability. Experiments show that DMNs improve low-resource speech recognition significantly. Moreover, DMNs introduce sparsity to their hidden activations and thus can act as sparse feature extractors.
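A maxout hidden unit is just the max over a small group of affine pieces; a sketch with hypothetical weights:

```python
def linear(W, b, x):
    # One affine "piece" z = W*x + b.
    return sum(w * xi for w, xi in zip(W, x)) + b

def maxout_unit(pieces, x):
    # A maxout hidden unit: the max over k affine pieces of the input.
    # It is piecewise-linear and non-saturating, which is why dropout
    # combines with it so naturally; the ReLU arises as the special
    # case max(w*x + b, 0).
    return max(linear(W, b, x) for W, b in pieces)

x = [1.0, -2.0]
pieces = [([0.5, 0.1], 0.0),   # piece 1 (illustrative weights)
          ([-0.3, 0.4], 0.2)]  # piece 2
print(maxout_unit(pieces, x))  # 0.3
```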
MULTILINGUAL TRAINING OF DEEP NEURAL NETWORKS
Cited by 17 (5 self)
Abstract: We investigate multilingual modeling in the context of a deep neural network (DNN) – hidden Markov model (HMM) hybrid, where the DNN outputs are used as the HMM state likelihoods. By viewing neural networks as a cascade of feature extractors followed by a logistic regression classifier, we hypothesise that the hidden layers, which act as feature extractors, will be transferable between languages. As a corollary, we propose that training the hidden layers on multiple languages makes them more suitable for such cross-lingual transfer. We experimentally confirm these hypotheses on the GlobalPhone corpus using seven languages from three different language families: Germanic, Romance, and Slavic. The experiments demonstrate substantial improvements over a monolingual DNN-HMM hybrid baseline, and hint at avenues of further exploration.
Index Terms — Speech recognition, deep learning, neural networks, multilingual modeling
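The hypothesis here, shared hidden layers feeding per-language classifiers, can be sketched as follows; the feature extractor and all weights are illustrative stand-ins, not a trained model:

```python
import math

def softmax(zs):
    # Numerically stable softmax (the "logistic regression classifier").
    m = max(zs)
    exps = [math.exp(z - m) for z in zs]
    s = sum(exps)
    return [e / s for e in exps]

def shared_hidden(x):
    # Stand-in for the multilingual hidden-layer stack (normally a trained
    # DNN shared across languages); here just an elementwise squashing.
    return [math.tanh(v) for v in x]

def classify(x, softmax_weights):
    # Per-language softmax head on top of the shared feature extractor.
    h = shared_hidden(x)
    zs = [sum(w * hi for w, hi in zip(row, h)) for row in softmax_weights]
    return softmax(zs)

# One output layer per language; only these heads differ across languages.
heads = {"german": [[1.0, 0.0], [0.0, 1.0]],
         "czech":  [[0.5, 0.5], [1.0, -1.0]]}
x = [0.2, -0.1]
posteriors = {lang: classify(x, W) for lang, W in heads.items()}
print(sorted(posteriors))  # ['czech', 'german']
```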