
Recent advances in deep learning for speech research at Microsoft (2013)

by L. Deng, J. Li, J.-T. Huang
Venue: Proc. ICASSP

Results 1 - 10 of 23

Deep convolutional neural networks using heterogeneous pooling for trading-off acoustic invariance with phonetic confusion, ICASSP

by Li Deng, Ossama Abdel-Hamid, Dong Yu, 2013
"... We develop and present a novel deep convolutional neural network architecture, where heterogeneous pooling is used to provide constrained frequency-shift invariance in the speech spectrogram while minimizing speech-class confusion induced by such invariance. The design of the pooling layer is guided ..."
Abstract - Cited by 19 (9 self)
We develop and present a novel deep convolutional neural network architecture, where heterogeneous pooling is used to provide constrained frequency-shift invariance in the speech spectrogram while minimizing speech-class confusion induced by such invariance. The design of the pooling layer is guided by domain knowledge about how speech classes would change when formant frequencies are modified. The convolution and heterogeneous-pooling layers are followed by a fully connected multi-layer neural network to form a deep architecture interfaced to an HMM for continuous speech recognition. During training, all layers of this entire deep net are regularized using a variant of the “dropout” technique. Experimental evaluation demonstrates the effectiveness of both heterogeneous pooling and dropout regularization. On the TIMIT phonetic recognition task, we have achieved an 18.7% phone error rate, the lowest reported in the literature on this standard task with a single system and with no use of information about speaker identity. Preliminary experiments on large vocabulary speech recognition in a voice search task also show error rate reductions using heterogeneous pooling in the deep convolutional neural network. Index Terms: convolution, heterogeneous pooling, deep, neural network, invariance, discrimination, formants
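As a concrete illustration of the pooling design described above, here is a minimal PyTorch sketch of heterogeneous pooling; the group size, pooling widths, and kernel shape are assumptions for illustration, not the paper's actual configuration.

```python
# Heterogeneous pooling sketch: different feature-map groups are
# max-pooled along frequency with different pooling sizes, giving
# varying degrees of frequency-shift invariance per group.
import torch
import torch.nn as nn
import torch.nn.functional as F

class HeterogeneousPoolingCNN(nn.Module):
    def __init__(self, n_maps_per_group=32, pool_sizes=(1, 2, 4, 6)):
        super().__init__()
        self.pool_sizes = pool_sizes
        self.n_maps_per_group = n_maps_per_group
        out_maps = n_maps_per_group * len(pool_sizes)
        # Convolve along the frequency axis of a (batch, 1, freq, time) spectrogram.
        self.conv = nn.Conv2d(1, out_maps, kernel_size=(8, 1))

    def forward(self, x):
        h = F.relu(self.conv(x))                        # (B, C, F', T)
        groups = h.split(self.n_maps_per_group, dim=1)  # one group per pool size
        pooled = [F.max_pool2d(g, kernel_size=(p, 1))   # pool only along frequency
                  for g, p in zip(groups, self.pool_sizes)]
        # Flatten each group and concatenate for the fully connected layers.
        return torch.cat([p.flatten(1) for p in pooled], dim=1)
```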

Citation Context

...emerging technology that has recently demonstrated dramatic success in speech feature extraction and recognition, scaling very well from small [8][26][27][28] to medium [3][4][19][36] and to large [2][6][17][21][34][32][37] tasks. (For recent reviews on the use of neural networks in speech recognition, see [29][17].) Some related DNN architectures have also demonstrated effectiveness in speech under...

Exploring convolutional neural network structures and optimization techniques for speech recognition

by Ossama Abdel-Hamid, Li Deng, Dong Yu - in Interspeech-2013, 2013
"... Abstract Recently, convolutional neural networks (CNNs) have been shown to outperform the standard fully connected deep neural networks within the hybrid deep neural network / hidden Markov model (DNN/HMM) framework on the phone recognition task. In this paper, we extend the earlier basic form of t ..."
Abstract - Cited by 14 (1 self)
Recently, convolutional neural networks (CNNs) have been shown to outperform standard fully connected deep neural networks within the hybrid deep neural network / hidden Markov model (DNN/HMM) framework on the phone recognition task. In this paper, we extend the earlier basic form of the CNN and explore it in multiple ways. We first investigate several CNN architectures, including full and limited weight sharing, convolution along the frequency and time axes, and stacking of several convolution layers. We then develop a novel weighted softmax pooling layer so that the pooling size can be learned automatically. Further, we evaluate the effect of CNN pretraining, which is achieved by using a convolutional version of the RBM. We show that all CNN architectures we have investigated outperform the earlier basic form of the DNN on both the phone recognition and large vocabulary speech recognition tasks. The architecture with limited weight sharing provides additional gains over the full weight sharing architecture. The softmax pooling layer performs as well as the best CNN with a manually tuned fixed pooling size, and has potential for further improvement. Finally, we show that CNN pretraining produces significantly better results on a large vocabulary speech recognition task.
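The weighted softmax pooling idea can be sketched compactly. The following is a rough PyTorch illustration, not the authors' exact formulation: each position in a pooling window gets a learnable weight, and the pooled value is the softmax-weighted sum, so the effective pooling size is learned rather than hand-tuned.

```python
# Weighted softmax pooling sketch over the frequency axis.
import torch
import torch.nn as nn

class WeightedSoftmaxPool(nn.Module):
    def __init__(self, window: int):
        super().__init__()
        self.window = window
        self.logits = nn.Parameter(torch.zeros(window))  # learned per-position weights

    def forward(self, x):  # x: (batch, channels, freq)
        b, c, f = x.shape
        # Cut the frequency axis into non-overlapping windows.
        x = x[..., : f - f % self.window].reshape(b, c, -1, self.window)
        w = torch.softmax(self.logits, dim=0)            # convex combination
        return (x * w).sum(dim=-1)                       # (batch, channels, n_windows)
```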

Citation Context

...stricted Boltzmann Machine 1. Introduction Recently, deep neural network hidden Markov model (DNN/HMM) hybrid systems achieved remarkable performance in many large vocabulary speech recognition tasks [1, 2, 3, 4, 5, 6, 7]. This is attributed to the improved modeling power of the DNN that enables it to map complex patterns into class labels or posterior probabilities. This modeling power stems from the deep-layered str...

New types of deep neural network learning for speech recognition and related applications: An overview

by Li Deng, Geoffrey Hinton, Brian Kingsbury - in Proc. Int. Conf. Acoust., Speech, Signal Process., 2013
"... In this paper, we provide an overview of the invited and contributed papers presented at the special session at ICASSP-2013, entitled “New Types of Deep Neural Network Learning for Speech Recognition and Related Applications, ” as organized by the authors. We also describe the historical context in ..."
Abstract - Cited by 11 (4 self)
In this paper, we provide an overview of the invited and contributed papers presented at the special session at ICASSP-2013, entitled “New Types of Deep Neural Network Learning for Speech Recognition and Related Applications,” as organized by the authors. We also describe the historical context in which acoustic models based on deep neural networks have been developed. The technical overview of the papers presented in our special session is organized into five ways of improving deep learning methods: (1) better optimization; (2) better types of neural activation function and better network architectures; (3) better ways to determine the myriad hyper-parameters of deep neural networks; (4) more appropriate ways to preprocess speech for deep neural networks; and (5) ways of leveraging multiple languages or dialects that are more easily achieved with deep neural networks than with Gaussian mixture models. Index Terms: deep neural network, convolutional neural network, recurrent neural network, optimization, spectrogram features, multitask, multilingual, speech recognition, music processing

Citation Context

...of the vocal tract. In an important paper, Abdel-Hamid et al. [1] demonstrated that convolution across frequency was very effective for TIMIT. More recent work described in the papers from Microsoft [12][2][13] shows that designing the convolution and pooling layers to properly trade off between invariance to the vocal tract length and discrimination among speech sounds together with the “dropout” te...

Deep Neural Network Approach for the Dialog State Tracking Challenge

by Matthew Henderson, Blaise Thomson, Steve Young
"... While belief tracking is known to be important in allowing statistical dialog systems to manage dialogs in a highly robust manner, until recently little attention has been given to analysing the behaviour of belief tracking techniques. The Dialogue State Tracking Challenge has allowed for such an an ..."
Abstract - Cited by 10 (5 self)
While belief tracking is known to be important in allowing statistical dialog systems to manage dialogs in a highly robust manner, until recently little attention has been given to analysing the behaviour of belief tracking techniques. The Dialogue State Tracking Challenge has allowed for such an analysis, comparing multiple belief tracking approaches on a shared task. Recent success in using deep learning for speech research motivates the Deep Neural Network approach presented here. The model parameters can be learnt by directly maximising the likelihood of the training data. The paper explores some aspects of the training, and the resulting tracker is found to perform competitively, particularly on a corpus of dialogs from a system not found in the training data.
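A minimal sketch of the core idea, with all dimensions and layer choices assumed: a single shared network scores a feature vector for each candidate slot value, and a softmax over the variable-size candidate set yields the output distribution, so the same network handles an arbitrary number of values.

```python
# Tied value scorer sketch: one network, any number of candidates.
import torch
import torch.nn as nn

class TiedValueScorer(nn.Module):
    def __init__(self, n_features: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, hidden), nn.Sigmoid(),
            nn.Linear(hidden, 1),
        )

    def forward(self, value_feats):  # (n_candidates, n_features)
        scores = self.net(value_feats).squeeze(-1)
        # Distribution over the candidate values; training can maximize
        # the likelihood (cross-entropy) of the correct value's index.
        return torch.softmax(scores, dim=0)
```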

Citation Context

...en-us/events/dstc/ with fewer hidden layers. Recent developments in speech research have shown promising results using deep learning, motivating its use in the context of dialog (Hinton et al., 2012; Li et al., 2013). This paper presents a technique which solves the task of outputting a sequence of probability distributions over an arbitrary number of possible values using a single neural network, by learning ti...

Singular Value Decomposition Based Low-Footprint Speaker Adaptation and Personalization for Deep Neural Network

by Jian Xue, Jinyu Li, Dong Yu, Mike Seltzer, Yifan Gong - in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, 2014
"... ABSTRACT The large number of parameters in deep neural networks (DNN) for automatic speech recognition (ASR) makes speaker adaptation very challenging. It also limits the use of speaker personalization due to the huge storage cost in large-scale deployments. In this paper we address DNN adaptation ..."
Abstract - Cited by 10 (3 self)
The large number of parameters in deep neural networks (DNN) for automatic speech recognition (ASR) makes speaker adaptation very challenging. It also limits the use of speaker personalization due to the huge storage cost in large-scale deployments. In this paper we address DNN adaptation and personalization issues by presenting two methods based on the singular value decomposition (SVD). The first method uses an SVD to replace the weight matrix of a speaker-independent DNN by the product of two low-rank matrices. Adaptation is then performed by updating a square matrix inserted between the two low-rank matrices. In the second method, we adapt the full weight matrix but only store the delta matrix, the difference between the original and adapted weight matrices. We decrease the footprint of the adapted model by storing a reduced-rank version of the delta matrix via an SVD. The proposed methods were evaluated on a short message dictation task. Experimental results show that we can obtain accuracy improvements similar to the previously proposed Kullback-Leibler divergence (KLD) regularized method with far fewer parameters, requiring only 0.89% of the original model storage.
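The first of the two methods lends itself to a short illustration. Below is a minimal numpy sketch of the SVD bottleneck; the layer dimensions and rank are illustrative assumptions.

```python
# SVD bottleneck sketch: factor a weight matrix, adapt only a small
# square matrix inserted between the two low-rank factors.
import numpy as np

def svd_bottleneck(W: np.ndarray, k: int):
    """Replace W (m x n) by A (m x k) and B (k x n) with W ~= A @ B."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :k] * s[:k]  # absorb singular values into the left factor
    B = Vt[:k, :]
    return A, B

# Speaker-independent layer y = W @ x becomes y = A @ (S @ (B @ x)),
# where S starts as the k x k identity and is the only matrix updated
# during adaptation (k**2 parameters instead of m*n).
W = np.random.randn(2048, 2048)  # stand-in for a trained DNN layer
A, B = svd_bottleneck(W, k=128)
S = np.eye(128)                  # adapted and stored per speaker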

Citation Context

...Index Terms: deep neural network, speaker adaptation, speaker personalization, singular value decomposition. 1. INTRODUCTION Recent progress in deep learning has attracted a lot of interest in automatic speech recognition (ASR) [1][2][3][4][5][6]. The discovery of the strong modeling capabilities of deep neural networks (DNN) and the availability of high-speed hardware has made it feasible to train huge networks with tens of millions of parameters. In the framework of context-dependent DNN hidden Markov models (CD-DNN-HMM) [1], the conventional Gaussian mixture model (GMM) is replaced by a DNN to evaluate the senone log-likelihood. Besides CD-DNN-HMMs, a DNN can also be used to provide bottleneck features for a GMM-HMM system [7][8]. In both applications of a DNN in ASR, significant accuracy improvements were achieved. Howe...

Do deep nets really need to be deep?

by Lei Jimmy Ba, Rich Caruana - in Advances in Neural Information Processing Systems, 2014
"... Currently, deep neural networks are the state of the art on problems such as speech recognition and computer vision. In this paper we empirically demonstrate that shallow feed-forward nets can learn the complex functions previously learned by deep nets and achieve accuracies previously only achievab ..."
Abstract - Cited by 9 (0 self)
Currently, deep neural networks are the state of the art on problems such as speech recognition and computer vision. In this paper we empirically demonstrate that shallow feed-forward nets can learn the complex functions previously learned by deep nets and achieve accuracies previously only achievable with deep models. Moreover, in some cases the shallow neural nets can learn these deep functions using the same number of parameters as the original deep models. On the TIMIT phoneme recognition and CIFAR-10 image recognition tasks, shallow nets can be trained that perform similarly to complex, well-engineered, deeper convolutional architectures.
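The mimic setup the paper builds on can be sketched briefly. In this simplified illustration (architectures, dimensions, and training-loop details assumed), the shallow "student" regresses the deep "teacher" network's logits with a squared-error loss instead of training on hard labels.

```python
# Model-compression sketch: shallow student regresses deep teacher logits.
import torch
import torch.nn as nn

teacher = nn.Sequential(  # stand-in for an already-trained deep net
    nn.Linear(400, 512), nn.ReLU(),
    nn.Linear(512, 512), nn.ReLU(),
    nn.Linear(512, 183),
)
student = nn.Sequential(nn.Linear(400, 2048), nn.ReLU(), nn.Linear(2048, 183))
opt = torch.optim.SGD(student.parameters(), lr=1e-3)

x = torch.randn(256, 400)          # a batch of acoustic feature vectors
with torch.no_grad():
    target_logits = teacher(x)     # unnormalized log-probabilities as targets
loss = ((student(x) - target_logits) ** 2).mean()  # L2 on logits, not labels
loss.backward()
opt.step()
```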

Citation Context

...eural net consisting of a convolutional layer and max-pooling layer followed by three hidden layers containing 2000 ReLU units [2]. The CNN was trained using the same convolutional architecture as in [6]. We also formed an ensemble of nine CNN models, ECNN. The accuracy of DNN, CNN, and ECNN on the final test set are shown in Table 1. The error rate of the convolutional deep net (CNN) is about 2.1% b...

Eye gaze for spoken language understanding in multi-modal conversational interactions, ICMI

by Dilek Hakkani-Tür, Malcolm Slaney, Asli Celikyilmaz, Larry Heck, 2014
"... When humans converse with each other, they naturally amal-gamate information from multiple modalities (i.e., speech, gestures, speech prosody, facial expressions, and eye gaze). This paper focuses on eye gaze and its combination with speech. We develop a model that resolves references to vi-sual (sc ..."
Abstract - Cited by 3 (3 self)
When humans converse with each other, they naturally amalgamate information from multiple modalities (i.e., speech, gestures, speech prosody, facial expressions, and eye gaze). This paper focuses on eye gaze and its combination with speech. We develop a model that resolves references to visual (screen) elements in a conversational web browsing system. The system detects eye gaze, recognizes speech, and then interprets the user’s browsing intent (e.g., click on a specific element) through a combination of spoken language understanding and eye gaze tracking. We experiment with multi-turn interactions collected in a wizard-of-Oz scenario where users are asked to perform several web-browsing tasks. We compare several gaze features and evaluate their effectiveness when combined with speech-based lexical features. The resulting multi-modal system not only increases user intent (turn) accuracy by 17%, but also resolves the referring expression ambiguity commonly observed in dialog systems with a 10% increase in F-measure.

Citation Context

...n. There are in total 175,113 candidate links on the web pages visited by the users (an average of 301.4 per click turn). We use a state-of-the-art large vocabulary ASR system in our experiments [5]. The acoustic models incorporate the latest advances in context-dependent deep neural networks (DNN) for estimating senone likelihoods. The language model (LM) is a general-purpose backoff 4-gram mod...

The Relation of Eye Gaze and Face Pose: Potential Impact on Speech Recognition

by Malcolm Slaney, Andreas Stolcke, Dilek Hakkani-Tür
"... We are interested in using context to improve speech recog-nition and speech understanding. Knowing what the user is attending to visually helps us predict their utterances and thus makes speech recognition easier. Eye gaze is one way to access this signal, but is often unavailable (or expensive to ..."
Abstract - Cited by 3 (2 self)
We are interested in using context to improve speech recognition and speech understanding. Knowing what the user is attending to visually helps us predict their utterances and thus makes speech recognition easier. Eye gaze is one way to access this signal, but is often unavailable (or expensive to gather) at longer distances. In this paper we look at joint eye-gaze and facial-pose information while users perform a speech reading task. We hypothesize, and verify experimentally, that the eyes lead, and then the face follows. Face pose might not be as fast, or as accurate a signal of visual attention as eye gaze, but based on experiments correlating eye gaze with speech recognition, we conclude that face pose provides useful information to bias a recognizer toward higher accuracy.

Citation Context

...mation is not general, but was deemed sufficiently accurate for this paper’s purposes. 2.3 Automatic Speech Recognition We use a state-of-the-art large vocabulary speech recognizer in our experiments [2, 6]. The acoustic models incorporate the latest advances in context-dependent deep neural networks (DNN) for estimating senone likelihoods. The language model (LM) is a general-purpose backoff 4-gram mod...

Deep Segmental Neural Networks for Speech Recognition

by Ossama Abdel-Hamid, Li Deng, Dong Yu, Hui Jiang
"... Hybrid systems which integrate the deep neural network (DNN) and hidden Markov model (HMM) have recently achieved re-markable performance in many large vocabulary speech recog-nition tasks. These systems, however, remain to rely on the HMM and assume the acoustic scores for the (windowed) frames are ..."
Abstract - Cited by 2 (0 self)
Hybrid systems which integrate the deep neural network (DNN) and hidden Markov model (HMM) have recently achieved remarkable performance in many large vocabulary speech recognition tasks. These systems, however, still rely on the HMM and assume that the acoustic scores for the (windowed) frames are independent given the state, suffering from the same difficulty as the previous GMM-HMM systems. In this paper, we propose the deep segmental neural network (DSNN), a segmental model that uses DNNs to estimate the acoustic scores of phonemic or sub-phonemic segments with variable lengths. This allows the DSNN to represent each segment as a single unit, in which frames are made dependent on each other. We describe the architecture of the DSNN, as well as its learning and decoding algorithms. Our evaluation experiments demonstrate that the DSNN can outperform the DNN/HMM hybrid systems and two existing segmental models, including the segmental conditional random field and the shallow segmental neural network.
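The segmental decoding step can be illustrated with a toy dynamic program. In this sketch the segment scorer and maximum segment length are placeholder assumptions; in the DSNN the scores for variable-length segments would come from the neural network.

```python
# Segmental decoding sketch: DP over all segmentations, where each
# variable-length segment receives a single score as one unit.
import numpy as np

def decode(frames, score_segment, max_len=10):
    """frames: (T, D) features; score_segment(frames[s:t]) -> float."""
    T = len(frames)
    best = np.full(T + 1, -np.inf)
    best[0], back = 0.0, [0] * (T + 1)
    for t in range(1, T + 1):
        for s in range(max(0, t - max_len), t):
            cand = best[s] + score_segment(frames[s:t])
            if cand > best[t]:
                best[t], back[t] = cand, s
    # Recover segment boundaries from the backpointers.
    bounds, t = [], T
    while t > 0:
        bounds.append((back[t], t))
        t = back[t]
    return bounds[::-1], best[T]

# e.g. score_segment = lambda seg: -np.var(seg)  # placeholder scorer
```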

Citation Context

...mental Neural Network 1. Introduction Recently, deep-neural-network hidden Markov model (DNN/HMM) hybrid systems have achieved remarkable performance in many large vocabulary speech recognition tasks [1, 2, 3, 4, 5, 6, 7, 8]. These DNN/HMM hybrid systems, however, estimate the observation likelihood score for each (windowed) frame independently, and rely on a separate HMM to connect these scores to form the overall score...

Gaze enhanced speech recognition

by Malcolm Slaney, Rahul Rajan, Andreas Stolcke, Partha Parthasarathy - in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014
"... ABSTRACT This work demonstrates through simulations and experimental work the potential of eye-gaze data to improve speech-recognition results. Multimodal interfaces, where users see information on a display and use their voice to control an interaction, are of growing importance as mobile phones a ..."
Abstract - Cited by 2 (2 self)
This work demonstrates through simulations and experimental work the potential of eye-gaze data to improve speech-recognition results. Multimodal interfaces, where users see information on a display and use their voice to control an interaction, are of growing importance as mobile phones and tablets grow in popularity. We demonstrate an improvement in speech-recognition performance, as measured by word error rate, by rescoring the output from a large-vocabulary speech-recognition system. We use eye-gaze data as a spotlight and collect bigram word statistics near where the user looks in time and space. We see a 25% relative reduction in the word error rate over a generic language model, and approximately a 10% reduction in errors over a strong, page-specific baseline language model.
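The rescoring idea admits a simplified sketch. Here the gaze-proximity weighting and the linear interpolation are assumptions for illustration, not the paper's exact method: bigram counts from on-screen text are weighted by how close the words are to the user's gaze, and the resulting spotlight model is interpolated with a generic language model.

```python
# Gaze-spotlight bigram LM sketch with linear interpolation.
from collections import Counter

def gaze_bigram_probs(words, gaze_weights):
    """words: on-screen tokens; gaze_weights: gaze proximity per token."""
    counts, totals = Counter(), Counter()
    for w1, w2, g in zip(words, words[1:], gaze_weights):
        if g <= 0:
            continue          # skip bigrams the user never looked near
        counts[(w1, w2)] += g  # weight the bigram by gaze proximity
        totals[w1] += g
    return {bg: c / totals[bg[0]] for bg, c in counts.items()}

def rescore(bigram, generic_prob, gaze_probs, lam=0.5):
    # Interpolate the gaze-spotlight LM with the generic backoff LM.
    return lam * gaze_probs.get(bigram, 0.0) + (1 - lam) * generic_prob
```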

Citation Context

...blue, a full 20 seconds before this utterance, to the color red at the end of the utterance. 3. RECOGNITION EXPERIMENTS We use a state-of-the-art large vocabulary speech recognizer in our experiments [6]. The acoustic models incorporate the latest advances in context-dependent deep neural networks (DNN) for estimating senone likelihoods. The language model (LM) is a general-purpose backoff 4-gram mod...
