Results 1 - 10 of 23
Deep convolutional neural networks using heterogeneous pooling for trading-off acoustic invariance with phonetic confusion, in ICASSP, 2013
"... We develop and present a novel deep convolutional neural network architecture, where heterogeneous pooling is used to provide constrained frequency-shift invariance in the speech spectrogram while minimizing speech-class confusion induced by such invariance. The design of the pooling layer is guided ..."
Cited by 19 (9 self)
We develop and present a novel deep convolutional neural network architecture, where heterogeneous pooling is used to provide constrained frequency-shift invariance in the speech spectrogram while minimizing speech-class confusion induced by such invariance. The design of the pooling layer is guided by domain knowledge about how speech classes would change when formant frequencies are modified. The convolution and heterogeneous-pooling layers are followed by a fully connected multi-layer neural network to form a deep architecture interfaced to an HMM for continuous speech recognition. During training, all layers of this entire deep net are regularized using a variant of the “dropout” technique. Experimental evaluation demonstrates the effectiveness of both heterogeneous pooling and dropout regularization. On the TIMIT phonetic recognition task, we have achieved an 18.7% phone error rate, the lowest reported in the literature on this standard task with a single system and with no use of information about speaker identity. Preliminary experiments on large vocabulary speech recognition in a voice search task also show error rate reduction using heterogeneous pooling in the deep convolutional neural network. Index Terms: convolution, heterogeneous pooling, deep, neural network, invariance, discrimination, formants
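The heterogeneous pooling idea, pooling different groups of convolutional feature maps along the frequency axis with different pooling sizes, can be sketched in a few lines. The grouping, pooling sizes, and use of max pooling below are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

def heterogeneous_max_pool(feature_maps, pool_sizes):
    """Max-pool each group of convolutional feature maps along the frequency
    axis with its own pooling size (heterogeneous pooling).

    feature_maps : list of arrays, one per group, shaped (n_maps, n_freq_bands)
    pool_sizes   : list of ints, pooling size for each group
    """
    pooled = []
    for maps, p in zip(feature_maps, pool_sizes):
        n_maps, n_bands = maps.shape
        n_out = n_bands // p  # drop trailing bands that do not fill a window
        windows = maps[:, :n_out * p].reshape(n_maps, n_out, p)
        pooled.append(windows.max(axis=2))
    return pooled

# Illustrative use: three groups of maps pooled with sizes 1, 3 and 5,
# giving increasing frequency-shift invariance from group to group.
rng = np.random.default_rng(0)
groups = [rng.standard_normal((8, 30)) for _ in range(3)]
out = heterogeneous_max_pool(groups, pool_sizes=[1, 3, 5])
print([o.shape for o in out])   # [(8, 30), (8, 10), (8, 6)]
```

Larger pooling sizes buy more frequency-shift invariance at the cost of potentially confusing phone classes whose formant positions differ, which is the trade-off the paper is tuning.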
Exploring convolutional neural network structures and optimization techniques for speech recognition, in Interspeech, 2013
"... Abstract Recently, convolutional neural networks (CNNs) have been shown to outperform the standard fully connected deep neural networks within the hybrid deep neural network / hidden Markov model (DNN/HMM) framework on the phone recognition task. In this paper, we extend the earlier basic form of t ..."
Cited by 14 (1 self)
Recently, convolutional neural networks (CNNs) have been shown to outperform the standard fully connected deep neural networks within the hybrid deep neural network / hidden Markov model (DNN/HMM) framework on the phone recognition task. In this paper, we extend the earlier basic form of the CNN and explore it in multiple ways. We first investigate several CNN architectures, including full and limited weight sharing, convolution along the frequency and time axes, and stacking of several convolution layers. We then develop a novel weighted softmax pooling layer so that the pooling size can be learned automatically. Further, we evaluate the effect of CNN pretraining, which is achieved by using a convolutional version of the RBM. We show that all the CNN architectures we investigated outperform the earlier basic form of the DNN on both phone recognition and large vocabulary speech recognition tasks. The architecture with limited weight sharing provides additional gains over the full weight sharing architecture. The softmax pooling layer performs as well as the best CNN with a manually tuned fixed pooling size, and has potential for further improvement. Finally, we show that CNN pretraining produces significantly better results on a large vocabulary speech recognition task.
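Of the architectural variants mentioned, full versus limited weight sharing along the frequency axis is the easiest to sketch. The section layout and filter sizes below are illustrative assumptions; the paper's exact configuration (and its weighted softmax pooling layer) is not reproduced here.

```python
import numpy as np

def conv1d_full_ws(x, filt):
    """Full weight sharing: one filter slides over the whole frequency axis."""
    k = len(filt)
    return np.array([x[i:i + k] @ filt for i in range(len(x) - k + 1)])

def conv1d_limited_ws(x, filters, section_size):
    """Limited weight sharing: the frequency axis is split into sections and
    each section is convolved with its own filter (section layout here is an
    illustrative choice)."""
    k = len(filters[0])
    outputs = []
    for s, filt in enumerate(filters):
        sec = x[s * section_size:(s + 1) * section_size]
        outputs.append(np.array([sec[i:i + k] @ filt
                                 for i in range(len(sec) - k + 1)]))
    return outputs

rng = np.random.default_rng(0)
bands = rng.standard_normal(40)                      # one frame of 40 log-mel bands
full = conv1d_full_ws(bands, rng.standard_normal(8))
limited = conv1d_limited_ws(bands, [rng.standard_normal(8) for _ in range(4)],
                            section_size=10)
print(full.shape, [o.shape for o in limited])        # (33,) [(3,), (3,), (3,), (3,)]
```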
New types of deep neural network learning for speech recognition and related applications: An overview, in Proc. Int. Conf. Acoust., Speech, Signal Process., 2013
"... In this paper, we provide an overview of the invited and contributed papers presented at the special session at ICASSP-2013, entitled “New Types of Deep Neural Network Learning for Speech Recognition and Related Applications, ” as organized by the authors. We also describe the historical context in ..."
Cited by 11 (4 self)
In this paper, we provide an overview of the invited and contributed papers presented at the special session at ICASSP-2013, entitled “New Types of Deep Neural Network Learning for Speech Recognition and Related Applications,” as organized by the authors. We also describe the historical context in which acoustic models based on deep neural networks have been developed. The technical overview of the papers presented in our special session is organized into five ways of improving deep learning methods: (1) better optimization; (2) better types of neural activation function and better network architectures; (3) better ways to determine the myriad hyper-parameters of deep neural networks; (4) more appropriate ways to preprocess speech for deep neural networks; and (5) ways of leveraging multiple languages or dialects that are more easily achieved with deep neural networks than with Gaussian mixture models. Index Terms: deep neural network, convolutional neural network, recurrent neural network, optimization, spectrogram features, multitask, multilingual, speech recognition, music processing
Deep Neural Network Approach for the Dialog State Tracking Challenge
"... While belief tracking is known to be important in allowing statistical dialog systems to manage dialogs in a highly robust manner, until recently little attention has been given to analysing the behaviour of belief tracking techniques. The Dialogue State Tracking Challenge has allowed for such an an ..."
Cited by 10 (5 self)
While belief tracking is known to be important in allowing statistical dialog systems to manage dialogs in a highly robust manner, until recently little attention has been given to analysing the behaviour of belief tracking techniques. The Dialogue State Tracking Challenge has allowed for such an analysis, comparing multiple belief tracking approaches on a shared task. Recent success in using deep learning for speech research motivates the Deep Neural Network approach presented here. The model parameters can be learnt by directly maximising the likelihood of the training data. The paper explores some aspects of the training, and the resulting tracker is found to perform competitively, particularly on a corpus of dialogs from a system not found in the training data.
Singular Value Decomposition Based Low-Footprint Speaker Adaptation and Personalization for Deep Neural Network, in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, 2014
"... ABSTRACT The large number of parameters in deep neural networks (DNN) for automatic speech recognition (ASR) makes speaker adaptation very challenging. It also limits the use of speaker personalization due to the huge storage cost in large-scale deployments. In this paper we address DNN adaptation ..."
Cited by 10 (3 self)
The large number of parameters in deep neural networks (DNN) for automatic speech recognition (ASR) makes speaker adaptation very challenging. It also limits the use of speaker personalization due to the huge storage cost in large-scale deployments. In this paper we address DNN adaptation and personalization issues by presenting two methods based on the singular value decomposition (SVD). The first method uses an SVD to replace the weight matrix of a speaker-independent DNN with the product of two low-rank matrices. Adaptation is then performed by updating a square matrix inserted between the two low-rank matrices. In the second method, we adapt the full weight matrix but only store the delta matrix, the difference between the original and adapted weight matrices. We decrease the footprint of the adapted model by storing a reduced-rank version of the delta matrix via an SVD. The proposed methods were evaluated on a short message dictation task. Experimental results show that we can obtain accuracy improvements similar to those of the previously proposed Kullback-Leibler divergence (KLD) regularized method with far fewer parameters, requiring only 0.89% of the original model storage.
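Both SVD-based ideas in the abstract can be sketched directly with numpy. Matrix sizes and ranks below are illustrative, and in practice the bottleneck matrix S and the low-rank delta factors would be estimated by backpropagation on adaptation data rather than left at these initial values.

```python
import numpy as np

def svd_bottleneck(W, k):
    """Method 1: factor the speaker-independent weight matrix W (m x n) into
    two low-rank matrices A (m x k) and B (k x n) with a truncated SVD, and
    insert a small square matrix S (k x k, initialised to identity) between
    them.  Speaker adaptation then updates only S."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :k]
    B = np.diag(s[:k]) @ Vt[:k, :]
    S = np.eye(k)                       # per-speaker parameters, adapted later
    return A, S, B

def low_rank_delta(W_si, W_adapted, k):
    """Method 2: adapt the full matrix, then store only a rank-k approximation
    of the delta (adapted minus speaker-independent weights) as two factors."""
    U, s, Vt = np.linalg.svd(W_adapted - W_si, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]), Vt[:k, :]

rng = np.random.default_rng(0)
W = rng.standard_normal((512, 512))
A, S, B = svd_bottleneck(W, k=64)
print(A.shape, S.shape, B.shape)            # (512, 64) (64, 64) (64, 512)
print(np.allclose(A @ S @ B, A @ B))        # True: S starts as the identity

P, Q = low_rank_delta(W, W + 0.01 * rng.standard_normal(W.shape), k=16)
print(P.shape, Q.shape)                     # (512, 16) (16, 512)
```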
Do deep nets really need to be deep? in Advances in Neural Information Processing Systems, 2014
"... Currently, deep neural networks are the state of the art on problems such as speech recognition and computer vision. In this paper we empirically demonstrate that shallow feed-forward nets can learn the complex functions previously learned by deep nets and achieve accuracies previously only achievab ..."
Cited by 9 (0 self)
Currently, deep neural networks are the state of the art on problems such as speech recognition and computer vision. In this paper we empirically demonstrate that shallow feed-forward nets can learn the complex functions previously learned by deep nets and achieve accuracies previously only achievable with deep models. Moreover, in some cases the shallow neural nets can learn these deep functions using the same number of parameters as the original deep models. On the TIMIT phoneme recognition and CIFAR-10 image recognition tasks, shallow nets can be trained that perform similarly to complex, well-engineered, deeper convolutional architectures.
Eye gaze for spoken language understanding in multi-modal conversational interactions, in ICMI, 2014
"... When humans converse with each other, they naturally amal-gamate information from multiple modalities (i.e., speech, gestures, speech prosody, facial expressions, and eye gaze). This paper focuses on eye gaze and its combination with speech. We develop a model that resolves references to vi-sual (sc ..."
Cited by 3 (3 self)
When humans converse with each other, they naturally amalgamate information from multiple modalities (i.e., speech, gestures, speech prosody, facial expressions, and eye gaze). This paper focuses on eye gaze and its combination with speech. We develop a model that resolves references to visual (screen) elements in a conversational web browsing system. The system detects eye gaze, recognizes speech, and then interprets the user’s browsing intent (e.g., click on a specific element) through a combination of spoken language understanding and eye gaze tracking. We experiment with multi-turn interactions collected in a Wizard-of-Oz scenario where users are asked to perform several web-browsing tasks. We compare several gaze features and evaluate their effectiveness when combined with speech-based lexical features. The resulting multi-modal system not only increases user intent (turn) accuracy by 17%, but also resolves the referring expression ambiguity commonly observed in dialog systems, with a 10% increase in F-measure.
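As an illustration of combining gaze with lexical evidence for reference resolution, the sketch below scores each on-screen element by word overlap with the utterance plus gaze proximity. The feature choices, weights, and distance kernel are assumptions made for illustration, not the paper's model.

```python
import numpy as np

def resolve_reference(utterance, elements, fixation, w_lex=1.0, w_gaze=1.0):
    """Score each on-screen element by combining a lexical-overlap feature with
    a gaze-proximity feature, and return the most likely referent.

    elements : list of dicts with 'text' and screen position ('x', 'y')
    fixation : (x, y) of the user's most recent gaze fixation
    """
    words = set(utterance.lower().split())
    scores = []
    for el in elements:
        lex = len(words & set(el["text"].lower().split()))           # word overlap
        dist = np.hypot(el["x"] - fixation[0], el["y"] - fixation[1])
        gaze = np.exp(-dist / 200.0)                                  # nearer = higher
        scores.append(w_lex * lex + w_gaze * gaze)
    return elements[int(np.argmax(scores))]

elements = [
    {"text": "Sign in", "x": 900, "y": 40},
    {"text": "Cheap flights to Boston", "x": 300, "y": 400},
]
print(resolve_reference("click the flights link", elements, fixation=(320, 390))["text"])
```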
The Relation of Eye Gaze and Face Pose: Potential Impact on Speech Recognition
"... We are interested in using context to improve speech recog-nition and speech understanding. Knowing what the user is attending to visually helps us predict their utterances and thus makes speech recognition easier. Eye gaze is one way to access this signal, but is often unavailable (or expensive to ..."
Cited by 3 (2 self)
We are interested in using context to improve speech recognition and speech understanding. Knowing what the user is attending to visually helps us predict their utterances and thus makes speech recognition easier. Eye gaze is one way to access this signal, but is often unavailable (or expensive to gather) at longer distances. In this paper we look at joint eye-gaze and facial-pose information while users perform a speech reading task. We hypothesize, and verify experimentally, that the eyes lead, and then the face follows. Face pose might not be as fast or as accurate a signal of visual attention as eye gaze, but based on experiments correlating eye gaze with speech recognition, we conclude that face pose provides useful information to bias a recognizer toward higher accuracy.
Deep Segmental Neural Networks for Speech Recognition
"... Hybrid systems which integrate the deep neural network (DNN) and hidden Markov model (HMM) have recently achieved re-markable performance in many large vocabulary speech recog-nition tasks. These systems, however, remain to rely on the HMM and assume the acoustic scores for the (windowed) frames are ..."
Cited by 2 (0 self)
Hybrid systems which integrate the deep neural network (DNN) and hidden Markov model (HMM) have recently achieved remarkable performance in many large vocabulary speech recognition tasks. These systems, however, still rely on the HMM and assume that the acoustic scores for the (windowed) frames are independent given the state, suffering from the same difficulty as the earlier GMM-HMM systems. In this paper, we propose the deep segmental neural network (DSNN), a segmental model that uses DNNs to estimate the acoustic scores of phonemic or sub-phonemic segments with variable lengths. This allows the DSNN to represent each segment as a single unit, in which frames are made dependent on each other. We describe the architecture of the DSNN, as well as its learning and decoding algorithms. Our evaluation experiments demonstrate that the DSNN can outperform DNN/HMM hybrid systems and two existing segmental models, the segmental conditional random field and the shallow segmental neural network.
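A minimal sketch of a segmental scorer in the spirit described above: a variable-length segment of frames is mapped to a fixed-size vector (here by uniformly sampling a few frames, an assumed representation) and scored as a single unit by a small feed-forward net. The actual DSNN architecture, segment representation, and training differ in detail.

```python
import numpy as np

def segment_features(frames, n_samples=5):
    """Map a variable-length segment (n_frames x dim) to a fixed-size vector
    by sampling n_samples frames uniformly in time and concatenating them."""
    n_frames, dim = frames.shape
    idx = np.linspace(0, n_frames - 1, n_samples).round().astype(int)
    return frames[idx].reshape(-1)                    # (n_samples * dim,)

def dnn_segment_scores(seg_vec, weights, biases):
    """Score a whole segment with a small feed-forward net; the output is one
    log-score per phonemic unit (log-softmax over units)."""
    h = seg_vec
    for W, b in zip(weights[:-1], biases[:-1]):
        h = np.maximum(0.0, h @ W + b)                # ReLU hidden layers
    logits = h @ weights[-1] + biases[-1]
    return logits - logits.max() - np.log(np.exp(logits - logits.max()).sum())

rng = np.random.default_rng(0)
dim, n_units = 40, 48
segment = rng.standard_normal((rng.integers(3, 20), dim))   # variable-length segment
x = segment_features(segment)
Ws = [rng.standard_normal((5 * dim, 256)) * 0.01, rng.standard_normal((256, n_units)) * 0.01]
bs = [np.zeros(256), np.zeros(n_units)]
print(dnn_segment_scores(x, Ws, bs).shape)                   # (48,)
```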
Gaze enhanced speech recognition, in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014
"... ABSTRACT This work demonstrates through simulations and experimental work the potential of eye-gaze data to improve speech-recognition results. Multimodal interfaces, where users see information on a display and use their voice to control an interaction, are of growing importance as mobile phones a ..."
Cited by 2 (2 self)
This work demonstrates through simulations and experimental work the potential of eye-gaze data to improve speech-recognition results. Multimodal interfaces, where users see information on a display and use their voice to control an interaction, are of growing importance as mobile phones and tablets grow in popularity. We demonstrate an improvement in speech-recognition performance, as measured by word error rate, by rescoring the output from a large-vocabulary speech-recognition system. We use eye-gaze data as a spotlight and collect bigram word statistics near where the user looks in time and space. We see a 25% relative reduction in the word-error rate over a generic language model, and approximately a 10% reduction in errors over a strong, page-specific baseline language model.
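The rescoring recipe described above can be sketched as follows: collect bigram counts from on-screen words near gaze fixations, then interpolate a smoothed gaze-biased bigram score with the generic language-model score when re-ranking the recognizer's hypotheses. The spotlight radius, smoothing, and interpolation weight are illustrative assumptions rather than the paper's settings.

```python
import math
from collections import Counter

def gaze_bigram_counts(screen_words, fixations, radius=100.0):
    """Collect bigram and unigram counts from on-screen words whose position is
    within `radius` pixels of any gaze fixation (the gaze 'spotlight').

    screen_words : list of (word, x, y) in reading order
    fixations    : list of (x, y) gaze points
    """
    near = [w for (w, x, y) in screen_words
            if any((x - fx) ** 2 + (y - fy) ** 2 <= radius ** 2 for fx, fy in fixations)]
    return Counter(zip(near, near[1:])), Counter(near)

def rescore(hypotheses, bigrams, unigrams, generic_logprob, lam=0.5, alpha=0.1):
    """Rescore ASR hypotheses by interpolating a generic LM score with a
    smoothed gaze-biased bigram score; returns the best-scoring hypothesis."""
    vocab = max(len(unigrams), 1)

    def gaze_logprob(words):
        lp = 0.0
        for w1, w2 in zip(words, words[1:]):
            lp += math.log((bigrams[(w1, w2)] + alpha) / (unigrams[w1] + alpha * vocab))
        return lp

    scored = [(lam * generic_logprob[h] + (1 - lam) * gaze_logprob(h.split()), h)
              for h in hypotheses]
    return max(scored)[1]

# Illustrative use with made-up screen content and an ASR n-best list.
screen = [("cheap", 100, 200), ("flights", 160, 200), ("to", 220, 200), ("boston", 260, 200)]
counts_bi, counts_uni = gaze_bigram_counts(screen, fixations=[(150, 205)], radius=120)
nbest = ["cheap flights to boston", "sheep flights to boston"]
generic = {nbest[0]: -20.0, nbest[1]: -19.5}
print(rescore(nbest, counts_bi, counts_uni, generic))   # picks "cheap flights to boston"
```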