Results 1 - 10 of 28
RECENT ADVANCES IN DEEP LEARNING FOR SPEECH RESEARCH AT MICROSOFT
"... Deep learning is becoming a mainstream technology for speech recognition at industrial scale. In this paper, we provide an overview of the work by Microsoft speech researchers since 2009 in this area, focusing on more recent advances which shed light to the basic capabilities and limitations of the ..."
Abstract
-
Cited by 23 (10 self)
- Add to MetaCart
(Show Context)
Abstract: Deep learning is becoming a mainstream technology for speech recognition at industrial scale. In this paper, we provide an overview of the work by Microsoft speech researchers since 2009 in this area, focusing on more recent advances which shed light on the basic capabilities and limitations of the current deep learning technology. We organize this overview along the feature-domain and model-domain dimensions according to the conventional approach to analyzing speech systems. Selected experimental results, including speech recognition and related applications such as spoken dialogue and language modeling, are presented to demonstrate and analyze the strengths and weaknesses of the techniques described in the paper. Potential improvements of these techniques and future research directions are discussed. Index Terms — deep learning, neural network, multilingual, speech recognition, spectral features, convolution, dialogue
Towards speaker adaptive training of deep neural network acoustic models
to appear in Proc. Interspeech, 2014
"... ABSTRACT Speaker adaptive training (SAT) is a well studied technique for Gaussian mixture acoustic models (GMMs). Recently we proposed to perform SAT for deep neural networks (DNNs), with speaker i-vectors applied in feature learning. The resulting SAT-DNN models significantly outperform DNNs on wo ..."
Abstract
-
Cited by 9 (7 self)
- Add to MetaCart
(Show Context)
Abstract: Speaker adaptive training (SAT) is a well-studied technique for Gaussian mixture acoustic models (GMMs). Recently we proposed to perform SAT for deep neural networks (DNNs), with speaker i-vectors applied in feature learning. The resulting SAT-DNN models significantly outperform DNNs on word error rates (WERs). In this paper, we present different methods to further improve and extend SAT-DNN. First, we conduct detailed analysis to investigate i-vector extractor training and flexible feature fusion. Second, the SAT-DNN approach is extended to improve tasks including bottleneck feature (BNF) generation, convolutional neural network (CNN) acoustic modeling and multilingual DNN-based feature extraction. Third, for transcribing multimedia data, we enrich the i-vector representation with global speaker attributes (age, gender, etc.) obtained automatically from the video signal. On a collection of instructional videos, incorporation of the additional visual features is observed to boost the recognition accuracy of SAT-DNN.
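The basic mechanism behind i-vector-based SAT can be illustrated with a short sketch: the same per-speaker i-vector is appended to every acoustic frame before the network input. This is a generic illustration, not the authors' implementation; the dimensions, the tanh nonlinearity, and the random weights are placeholder assumptions.

    import numpy as np

    def augment_with_ivector(frames, ivector):
        """Append the same speaker i-vector to every acoustic frame.

        frames  : (num_frames, feat_dim) acoustic features, e.g. log-Mel filterbanks
        ivector : (ivec_dim,) speaker identity vector
        returns : (num_frames, feat_dim + ivec_dim) speaker-adapted network input
        """
        tiled = np.tile(ivector, (frames.shape[0], 1))
        return np.concatenate([frames, tiled], axis=1)

    # Toy example: 100 frames of 40-dim features plus a 100-dim i-vector.
    rng = np.random.default_rng(0)
    frames = rng.normal(size=(100, 40))
    ivector = rng.normal(size=100)

    x = augment_with_ivector(frames, ivector)        # (100, 140)

    # One hidden layer of a DNN consuming the augmented input (placeholder weights).
    W, b = rng.normal(size=(140, 512)) * 0.01, np.zeros(512)
    hidden = np.tanh(x @ W + b)
    print(hidden.shape)                              # (100, 512)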
Distributed Learning of Multilingual DNN Feature Extractors using GPUs
in Proc. Interspeech, 2014
"... Abstract Multilingual deep neural networks (DNNs) can act as deep feature extractors and have been applied successfully to crosslanguage acoustic modeling. Learning these feature extractors becomes an expensive task, because of the enlarged multilingual training data and the sequential nature of st ..."
Abstract
-
Cited by 8 (5 self)
- Add to MetaCart
Abstract: Multilingual deep neural networks (DNNs) can act as deep feature extractors and have been applied successfully to cross-language acoustic modeling. Learning these feature extractors becomes an expensive task, because of the enlarged multilingual training data and the sequential nature of stochastic gradient descent (SGD). This paper investigates strategies to accelerate the learning process over multiple GPU cards. We propose the DistModel and DistLang frameworks, which distribute feature extractor learning by models and languages respectively. The time-synchronous DistModel has the nice property of tolerating infrequent model averaging. With 3 GPUs, DistModel achieves a 2.6× speed-up and causes no loss in word error rate. When using DistLang, we observe better acceleration but worse recognition performance. Further evaluations are conducted to scale DistModel to more languages and GPU cards.
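The DistModel idea reduces, at each synchronization point, to an element-wise average of the replicas' parameters; each GPU then resumes SGD from the averaged model on its own data shard. The sketch below is a hedged, single-process illustration with placeholder parameters; the actual multi-GPU training loop and averaging interval from the paper are not reproduced.

    import numpy as np

    def average_models(models):
        """Average the parameters of several model replicas (DistModel-style).

        models : list of dicts mapping parameter name -> numpy array
        returns: a single dict with the element-wise mean of each parameter
        """
        averaged = {}
        for name in models[0]:
            averaged[name] = np.mean([m[name] for m in models], axis=0)
        return averaged

    # Toy example: three "GPU" replicas of a one-layer model that have drifted apart.
    rng = np.random.default_rng(1)
    replicas = [{"W": rng.normal(size=(40, 10)), "b": rng.normal(size=10)}
                for _ in range(3)]

    synced = average_models(replicas)
    # Each replica would now be reset to `synced` and continue SGD on its shard
    # until the next averaging point.
    print(synced["W"].shape, synced["b"].shape)      # (40, 10) (10,)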
Improving language-universal feature extraction with deep maxout and convolutional neural networks
to appear in Proc. Interspeech, 2014
"... Abstract When deployed in automated speech recognition (ASR), deep neural networks (DNNs) can be treated as a complex feature extractor plus a simple linear classifier. Previous work has investigated the utility of multilingual DNNs acting as language-universal feature extractors (LUFEs). In this p ..."
Abstract
-
Cited by 6 (5 self)
- Add to MetaCart
(Show Context)
Abstract: When deployed in automatic speech recognition (ASR), deep neural networks (DNNs) can be treated as a complex feature extractor plus a simple linear classifier. Previous work has investigated the utility of multilingual DNNs acting as language-universal feature extractors (LUFEs). In this paper, we explore different strategies to further improve LUFEs. First, we replace the standard sigmoid nonlinearity with the recently proposed maxout units. The resulting maxout LUFEs have the nice property of generating sparse feature representations. Second, the convolutional neural network (CNN) architecture is applied to obtain a more invariant feature space. We evaluate the performance of LUFEs on a cross-language ASR task. Each of the proposed techniques results in word error rate reductions compared with the existing DNN-based LUFEs. Combining the two methods brings additional improvement on the target language.
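A maxout unit computes several linear pieces per output and keeps only the largest, replacing the sigmoid nonlinearity mentioned above. Below is a minimal sketch of one maxout layer with made-up dimensions and random weights, not the authors' configuration.

    import numpy as np

    def maxout_layer(x, W, b, pool_size):
        """Maxout activation: max over `pool_size` linear pieces per output unit.

        x : (batch, in_dim) input features
        W : (in_dim, out_dim * pool_size) weights for all linear pieces
        b : (out_dim * pool_size,) biases
        """
        z = x @ W + b                                # (batch, out_dim * pool_size)
        z = z.reshape(x.shape[0], -1, pool_size)     # (batch, out_dim, pool_size)
        return z.max(axis=2)                         # keep the largest piece per unit

    # Toy example: 360-dim spliced input, 512 maxout units, 2 pieces per unit.
    rng = np.random.default_rng(2)
    x = rng.normal(size=(8, 360))
    W = rng.normal(size=(360, 512 * 2)) * 0.01
    b = np.zeros(512 * 2)
    h = maxout_layer(x, W, b, pool_size=2)
    print(h.shape)                                   # (8, 512)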
NEURAL NETWORKS FOR DISTANT SPEECH RECOGNITION
"... Distant conversational speech recognition is challenging ow-ing to the presence of multiple, overlapping talkers, additional non-speech acoustic sources, and the effects of reverberation. In this paper we review work on distant speech recognition, with an emphasis on approaches which combine multich ..."
Abstract
-
Cited by 5 (1 self)
- Add to MetaCart
(Show Context)
Abstract: Distant conversational speech recognition is challenging owing to the presence of multiple, overlapping talkers, additional non-speech acoustic sources, and the effects of reverberation. In this paper we review work on distant speech recognition, with an emphasis on approaches which combine multichannel signal processing with acoustic modelling, and investigate the use of hybrid neural network / hidden Markov model acoustic models for distant speech recognition of meetings recorded using microphone arrays. In particular we investigate the use of convolutional and fully-connected neural networks with different activation functions (sigmoid, rectified linear, and maxout). We performed experiments on the AMI and ICSI meeting corpora, with results indicating that neural network models are capable of significant improvements in accuracy compared with discriminatively trained Gaussian mixture models. Index Terms — convolutional neural networks, distant speech recognition, rectifier unit, maxout networks, beamforming, meetings, AMI corpus, ICSI corpus
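The multichannel front end alluded to here typically beamforms the microphone-array signals before acoustic modelling (beamforming appears in the index terms). A minimal delay-and-sum sketch follows, assuming the per-microphone steering delays are already known integers; it is illustrative only and does not reproduce the specific array processing used in the paper.

    import numpy as np

    def delay_and_sum(channels, delays):
        """Minimal delay-and-sum beamformer.

        channels : (num_mics, num_samples) time-domain microphone-array signals
        delays   : per-microphone integer sample delays aligning the target speaker
        returns  : (num_samples,) enhanced single-channel signal
        """
        # np.roll wraps around at the edges; acceptable for this toy illustration.
        aligned = [np.roll(ch, -d) for ch, d in zip(channels, delays)]
        return np.mean(aligned, axis=0)

    # Toy example: 8 microphones, 1 second at 16 kHz, hypothetical steering delays.
    rng = np.random.default_rng(6)
    mics = rng.normal(size=(8, 16000))
    delays = [0, 2, 4, 6, 8, 10, 12, 14]             # placeholder sample delays

    enhanced = delay_and_sum(mics, delays)
    print(enhanced.shape)                            # (16000,)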
deep convolutional nets and robust features for reverberation-robust speech recognition
in Proc. of SLT, 2014
"... ABSTRACT While human listeners can understand speech in reverberant conditions, indicating that the auditory system is robust to such degradations, reverberation leads to high word error rates for automatic speech recognition (ASR) systems. In this work, we present robust acoustic features motivate ..."
Abstract
-
Cited by 2 (2 self)
- Add to MetaCart
(Show Context)
Abstract: While human listeners can understand speech in reverberant conditions, indicating that the auditory system is robust to such degradations, reverberation leads to high word error rates for automatic speech recognition (ASR) systems. In this work, we present robust acoustic features motivated by human speech perception for use in a convolutional deep neural network (CDNN)-based acoustic model for recognizing continuous speech in reverberant conditions. Using a single-feature system trained with the single-channel data distributed through the REVERB 2014 challenge on ASR in reverberant conditions, we show a substantial relative reduction in word error rate (WER) compared to conventional filterbank energy-based features for single-channel simulated and real reverberation conditions. The reduction is more pronounced when multiple features and systems are combined. The proposed system outperforms the best system reported in the REVERB 2014 challenge on the single-channel full-batch processing task.
Stochastic language generation in dialogue using recurrent neural networks with convolutional sentence reranking
In Proceedings of SIGdial, Association for Computational Linguistics, 2015
"... The natural language generation (NLG) component of a spoken dialogue system (SDS) usually needs a substantial amount of handcrafting or a well-labeled dataset to be trained on. These limitations add sig-nificantly to development costs and make cross-domain, multi-lingual dialogue sys-tems intractabl ..."
Abstract
-
Cited by 2 (2 self)
- Add to MetaCart
(Show Context)
Abstract: The natural language generation (NLG) component of a spoken dialogue system (SDS) usually needs a substantial amount of handcrafting or a well-labeled dataset to be trained on. These limitations add significantly to development costs and make cross-domain, multilingual dialogue systems intractable. Moreover, human languages are context-aware. The most natural response should be directly learned from data rather than depending on predefined syntaxes or rules. This paper presents a statistical language generator based on a joint recurrent and convolutional neural network structure which can be trained on dialogue act-utterance pairs without any semantic alignments or predefined grammar trees. Objective metrics suggest that this new model outperforms previous methods under the same experimental conditions. Results of an evaluation by human judges indicate that it produces not only high-quality but also linguistically varied utterances which are preferred over n-gram and rule-based systems.
Application of convolutional neural networks to speaker recognition in noisy conditions
In Fifteenth Annual Conference of the International Speech Communication Association, 2014
"... Abstract This paper applies a convolutional neural network (CNN) trained for automatic speech recognition (ASR) to the task of speaker identification (SID). In the CNN/i-vector front end, the sufficient statistics are collected based on the outputs of the CNN as opposed to the traditional universal ..."
Abstract
-
Cited by 2 (0 self)
- Add to MetaCart
(Show Context)
Abstract: This paper applies a convolutional neural network (CNN) trained for automatic speech recognition (ASR) to the task of speaker identification (SID). In the CNN/i-vector front end, the sufficient statistics are collected based on the outputs of the CNN as opposed to the traditional universal background model (UBM). Evaluated on heavily degraded speech data, the CNN/i-vector front end provides performance comparable to the UBM/i-vector baseline. The combination of these approaches, however, is shown to provide a 26% improvement in miss rate, considerably outperforming the fusion of two different features in the traditional UBM/i-vector approach. An analysis of the language- and channel-dependency of the CNN/i-vector approach is also provided to highlight future research directions.
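The statistics collection referred to above is the standard zeroth- and first-order accumulation used in i-vector extraction; in the CNN/i-vector front end the per-frame posteriors simply come from the CNN's senone outputs rather than UBM components. A toy sketch of that accumulation step, with random placeholder posteriors and features:

    import numpy as np

    def sufficient_stats(posteriors, features):
        """Zeroth- and first-order statistics for i-vector extraction.

        posteriors : (num_frames, num_classes) per-frame posteriors
                     (e.g. CNN senone posteriors instead of UBM posteriors)
        features   : (num_frames, feat_dim) acoustic features
        returns    : N (num_classes,) occupation counts,
                     F (num_classes, feat_dim) posterior-weighted feature sums
        """
        N = posteriors.sum(axis=0)
        F = posteriors.T @ features
        return N, F

    # Toy example: 200 frames, 50 "senones", 40-dim features.
    rng = np.random.default_rng(3)
    logits = rng.normal(size=(200, 50))
    post = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)  # softmax
    feats = rng.normal(size=(200, 40))

    N, F = sufficient_stats(post, feats)
    print(N.shape, F.shape)                          # (50,) (50, 40)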
Language ID-based training of multilingual stacked bottleneck features
in Proc. Interspeech, 2014
"... Abstract In this paper, we explore multilingual feature-level data sharing via Deep Neural Network (DNN) stacked bottleneck features. Given a set of available source languages, we apply language identification to pick the language most similar to the target language, for more efficient use of multi ..."
Abstract
-
Cited by 2 (1 self)
- Add to MetaCart
(Show Context)
Abstract: In this paper, we explore multilingual feature-level data sharing via Deep Neural Network (DNN) stacked bottleneck features. Given a set of available source languages, we apply language identification to pick the language most similar to the target language, for more efficient use of multilingual resources. Our experiments with IARPA-Babel languages show that bottleneck features trained on the most similar source language perform better than those trained on all available source languages. Further analysis suggests that only data similar to the target language is useful for multilingual training.
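The selection step itself amounts to scoring the target-language data with a language-identification model over the candidate source languages and training on the highest-scoring one. A hedged sketch, with hypothetical language names and random scores standing in for real LID posteriors:

    import numpy as np

    def pick_source_language(lid_posteriors, source_languages):
        """Pick the source language with the highest average LID posterior
        on the target-language data.

        lid_posteriors   : (num_utterances, num_source_languages) posteriors from
                           a language-ID model applied to target-language audio
        source_languages : list of language names, one per posterior column
        """
        avg = lid_posteriors.mean(axis=0)
        return source_languages[int(np.argmax(avg))]

    # Toy example with placeholder languages and synthetic posteriors.
    langs = ["language_a", "language_b", "language_c", "language_d"]
    rng = np.random.default_rng(4)
    post = rng.dirichlet(alpha=[1.0, 3.0, 1.0, 1.0], size=500)

    print(pick_source_language(post, langs))         # most similar source language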
On speaker adaptation of long short-term memory recurrent neural networks
in Sixteenth Annual Conference of the International Speech Communication Association (INTERSPEECH), ISCA, 2015 (to appear)
"... Long Short-Term Memory (LSTM) is a recurrent neural net-work (RNN) architecture specializing in modeling long-range temporal dynamics. On acoustic modeling tasks, LSTM-RNNs have shown better performance than DNNs and conventional RNNs. In this paper, we conduct an extensive study on speaker adaptati ..."
Abstract
-
Cited by 2 (1 self)
- Add to MetaCart
(Show Context)
Abstract: Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture specializing in modeling long-range temporal dynamics. On acoustic modeling tasks, LSTM-RNNs have shown better performance than DNNs and conventional RNNs. In this paper, we conduct an extensive study on speaker adaptation of LSTM-RNNs. Speaker adaptation helps to reduce the mismatch between acoustic models and testing speakers. We have two main goals for this study. First, on a benchmark dataset, the existing DNN adaptation techniques are evaluated on the adaptation of LSTM-RNNs. We observe that LSTM-RNNs can be effectively adapted by using a speaker-adaptive (SA) front-end, or by inserting speaker-dependent (SD) layers. Second, we propose two adaptation approaches that implement the SD-layer-insertion idea specifically for LSTM-RNNs. Using these approaches, speaker adaptation improves word error rates by 3-4% relative over a strong LSTM-RNN baseline. This improvement is enlarged to 6-7% if we exploit SA features for further adaptation. Index Terms: Long Short-Term Memory, recurrent neural network, acoustic modeling, speaker adaptation
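The SD-layer-insertion idea can be pictured as a small per-speaker linear transform placed between the shared recurrent stack and the output layer, with only that transform updated on the speaker's adaptation data. The sketch below illustrates the insertion and the frozen/trainable split with placeholder shapes; it is not the authors' method in detail, and the LSTM stack itself is abstracted away as a generic feature extractor.

    import numpy as np

    class SpeakerDependentLayer:
        """A per-speaker linear transform inserted above a shared (frozen) LSTM stack."""

        def __init__(self, dim, rng):
            # Initialise close to identity so the unadapted model is unchanged.
            self.W = np.eye(dim) + 0.01 * rng.normal(size=(dim, dim))
            self.b = np.zeros(dim)

        def forward(self, h):
            # h: (num_frames, dim) hidden states from the shared LSTM layers
            return h @ self.W + self.b

    # Toy usage: the shared LSTM output is faked with random hidden states.
    rng = np.random.default_rng(5)
    shared_hidden = rng.normal(size=(300, 256))        # frozen, speaker-independent part

    sd_layer = SpeakerDependentLayer(dim=256, rng=rng)  # only this is trained per speaker
    adapted_hidden = sd_layer.forward(shared_hidden)    # fed to the (frozen) output layer
    print(adapted_hidden.shape)                         # (300, 256)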