Results 1 - 10 of 29
RECENT ADVANCES IN DEEP LEARNING FOR SPEECH RESEARCH AT MICROSOFT
"... Deep learning is becoming a mainstream technology for speech recognition at industrial scale. In this paper, we provide an overview of the work by Microsoft speech researchers since 2009 in this area, focusing on more recent advances which shed light to the basic capabilities and limitations of the ..."
Abstract
-
Cited by 23 (10 self)
- Add to MetaCart
Deep learning is becoming a mainstream technology for speech recognition at industrial scale. In this paper, we provide an overview of the work by Microsoft speech researchers since 2009 in this area, focusing on more recent advances which shed light on the basic capabilities and limitations of the current deep learning technology. We organize this overview along the feature-domain and model-domain dimensions according to the conventional approach to analyzing speech systems. Selected experimental results, including speech recognition and related applications such as spoken dialogue and language modeling, are presented to demonstrate and analyze the strengths and weaknesses of the techniques described in the paper. Potential improvements of these techniques and future research directions are discussed. Index Terms — deep learning, neural network, multilingual, speech recognition, spectral features, convolution, dialogue
Multilingual acoustic models using distributed deep neural networks
- ICASSP
, 2013
"... Today’s speech recognition technology is mature enough to be useful for many practical applications. In this context, it is of paramount importance to train accurate acoustic models for many languages within given resource constraints such as data, processing power, and time. Multilingual training h ..."
Abstract
-
Cited by 18 (2 self)
- Add to MetaCart
(Show Context)
Today’s speech recognition technology is mature enough to be useful for many practical applications. In this context, it is of paramount importance to train accurate acoustic models for many languages within given resource constraints such as data, processing power, and time. Multilingual training has the potential to solve the data issue and close the performance gap between resource-rich and resource-scarce languages. Neural networks lend themselves naturally to parameter sharing across languages, and distributed implementations have made it feasible to train large networks. In this paper, we present experimental results for cross- and multi-lingual network training of eleven Romance languages on 10k hours of data in total. The average relative gains over the monolingual baselines are 4%/2% (data-scarce/data-rich languages) for cross-lingual and 7%/2% for multi-lingual training. However, the additional gain from jointly training the languages on all data comes at an increased training time of roughly four weeks, compared to two weeks (monolingual) and one week (cross-lingual). Index Terms — Speech recognition, parameter sharing, deep neural networks, multilingual training, distributed neural networks
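As a rough illustration of the parameter sharing this abstract describes, the PyTorch sketch below shares all hidden layers across languages and gives each language its own softmax output layer. Layer sizes, the language list, and the senone counts are invented for illustration; this is a minimal sketch, not the paper's distributed implementation.

```python
import torch
import torch.nn as nn

class SharedMultilingualDNN(nn.Module):
    def __init__(self, feat_dim, hidden_dim, senones_per_lang):
        super().__init__()
        # Hidden stack shared across languages: trained on pooled data.
        self.shared = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim), nn.Sigmoid(),
            nn.Linear(hidden_dim, hidden_dim), nn.Sigmoid(),
        )
        # One output layer per language (language-specific senone sets).
        self.heads = nn.ModuleDict({
            lang: nn.Linear(hidden_dim, n)
            for lang, n in senones_per_lang.items()
        })

    def forward(self, x, lang):
        return self.heads[lang](self.shared(x))

# Hypothetical dimensions and languages.
model = SharedMultilingualDNN(feat_dim=40, hidden_dim=1024,
                              senones_per_lang={"fr": 3000, "es": 2500})
logits = model(torch.randn(8, 40), lang="fr")  # batch of 8 frames
```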
Deep Maxout Networks for Low-Resource Speech Recognition
- Proc. Automatic Speech Recognition and Understanding (ASRU)
"... ABSTRACT As a feed-forward architecture, the recently proposed maxout networks integrate dropout naturally and show stateof-the-art results on various computer vision datasets. This paper investigates the application of deep maxout networks (DMNs) to large vocabulary continuous speech recognition ( ..."
Abstract
-
Cited by 18 (11 self)
- Add to MetaCart
(Show Context)
As a feed-forward architecture, the recently proposed maxout networks integrate dropout naturally and show state-of-the-art results on various computer vision datasets. This paper investigates the application of deep maxout networks (DMNs) to large vocabulary continuous speech recognition (LVCSR) tasks. Our focus is on the particular advantage of DMNs under low-resource conditions with limited transcribed speech. We extend DMNs to hybrid and bottleneck feature systems, and explore optimal network structures (number of maxout layers, pooling strategy, etc.) for both setups. On the newly released Babel corpus, behaviors of DMNs are extensively studied under different levels of data availability. Experiments show that DMNs improve low-resource speech recognition significantly. Moreover, DMNs introduce sparsity to their hidden activations and thus can act as sparse feature extractors.
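For readers unfamiliar with maxout units, a minimal PyTorch sketch of the basic mechanism: each output unit takes the maximum over k linear pieces, and dropout composes naturally with it. The dimensions and k=2 below are invented for illustration.

```python
import torch
import torch.nn as nn

class Maxout(nn.Module):
    """Maxout layer: each output unit is the max over k linear pieces."""
    def __init__(self, in_dim, out_dim, k=2):
        super().__init__()
        self.out_dim, self.k = out_dim, k
        self.linear = nn.Linear(in_dim, out_dim * k)

    def forward(self, x):
        z = self.linear(x)                    # (batch, out_dim * k)
        z = z.view(-1, self.out_dim, self.k)  # group the k pieces
        return z.max(dim=2).values            # piecewise-linear max

# Dropout pairs naturally with maxout, as the abstract notes.
layer = nn.Sequential(Maxout(440, 512, k=2), nn.Dropout(p=0.5))
h = layer(torch.randn(16, 440))  # e.g. 11 spliced 40-dim frames
```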
New types of deep neural network learning for speech recognition and related applications: An overview
- in Proc. Int. Conf. Acoust., Speech, Signal Process
, 2013
"... In this paper, we provide an overview of the invited and contributed papers presented at the special session at ICASSP-2013, entitled “New Types of Deep Neural Network Learning for Speech Recognition and Related Applications, ” as organized by the authors. We also describe the historical context in ..."
Abstract
-
Cited by 11 (4 self)
- Add to MetaCart
(Show Context)
In this paper, we provide an overview of the invited and contributed papers presented at the special session at ICASSP-2013, entitled “New Types of Deep Neural Network Learning for Speech Recognition and Related Applications,” as organized by the authors. We also describe the historical context in which acoustic models based on deep neural networks have been developed. The technical overview of the papers presented in our special session is organized into five ways of improving deep learning methods: (1) better optimization; (2) better types of neural activation function and better network architectures; (3) better ways to determine the myriad hyper-parameters of deep neural networks; (4) more appropriate ways to preprocess speech for deep neural networks; and (5) ways of leveraging multiple languages or dialects that are more easily achieved with deep neural networks than with Gaussian mixture models. Index Terms — deep neural network, convolutional neural network, recurrent neural network, optimization, spectrogram features, multitask, multilingual, speech recognition, music processing
Towards speaker adaptive training of deep neural network acoustic models
- Proc. Interspeech
, 2014
"... ABSTRACT Speaker adaptive training (SAT) is a well studied technique for Gaussian mixture acoustic models (GMMs). Recently we proposed to perform SAT for deep neural networks (DNNs), with speaker i-vectors applied in feature learning. The resulting SAT-DNN models significantly outperform DNNs on wo ..."
Abstract
-
Cited by 9 (7 self)
- Add to MetaCart
(Show Context)
Speaker adaptive training (SAT) is a well-studied technique for Gaussian mixture acoustic models (GMMs). Recently we proposed to perform SAT for deep neural networks (DNNs), with speaker i-vectors applied in feature learning. The resulting SAT-DNN models significantly outperform DNNs on word error rates (WERs). In this paper, we present different methods to further improve and extend SAT-DNN. First, we conduct detailed analysis to investigate i-vector extractor training and flexible feature fusion. Second, the SAT-DNN approach is extended to improve tasks including bottleneck feature (BNF) generation, convolutional neural network (CNN) acoustic modeling and multilingual DNN-based feature extraction. Third, for transcribing multimedia data, we enrich the i-vector representation with global speaker attributes (age, gender, etc.) obtained automatically from the video signal. On a collection of instructional videos, incorporation of the additional visual features is observed to boost the recognition accuracy of SAT-DNN.
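The core idea of making the network speaker-aware via i-vectors can be shown with a simplified sketch: below, a per-speaker i-vector is tiled and appended to every acoustic frame. Note that the paper's SAT-DNN applies i-vectors through learned feature transforms, which this sketch does not reproduce; all dimensions are hypothetical.

```python
import numpy as np

def append_ivector(frames, ivector):
    """Tile a per-speaker i-vector and append it to every acoustic frame,
    giving the DNN a speaker-aware input.
    frames: (T, feat_dim) filterbank features; ivector: (ivec_dim,)."""
    tiled = np.tile(ivector, (frames.shape[0], 1))  # (T, ivec_dim)
    return np.hstack([frames, tiled])               # (T, feat_dim + ivec_dim)

# Hypothetical dimensions: 40-dim features, 100-dim i-vector.
x = append_ivector(np.random.randn(500, 40), np.random.randn(100))
assert x.shape == (500, 140)
```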
Distributed Learning of Multilingual DNN Feature Extractors using GPUs
- in Proc. Interspeech
, 2014
"... Abstract Multilingual deep neural networks (DNNs) can act as deep feature extractors and have been applied successfully to crosslanguage acoustic modeling. Learning these feature extractors becomes an expensive task, because of the enlarged multilingual training data and the sequential nature of st ..."
Abstract
-
Cited by 8 (5 self)
- Add to MetaCart
(Show Context)
Multilingual deep neural networks (DNNs) can act as deep feature extractors and have been applied successfully to cross-language acoustic modeling. Learning these feature extractors becomes an expensive task, because of the enlarged multilingual training data and the sequential nature of stochastic gradient descent (SGD). This paper investigates strategies to accelerate the learning process over multiple GPU cards. We propose the DistModel and DistLang frameworks which distribute feature extractor learning by models and languages respectively. The time-synchronous DistModel has the nice property of tolerating infrequent model averaging. With 3 GPUs, DistModel achieves 2.6× speed-up and causes no loss on word error rates. When using DistLang, we observe better acceleration but worse recognition performance. Further evaluations are conducted to scale DistModel to more languages and GPU cards.
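A minimal sketch of the model-averaging step at the heart of a DistModel-style scheme: each GPU holds a replica trained on its own data shard, and replicas are periodically synchronized by averaging parameters. The PyTorch function below shows only that synchronization step; the averaging schedule and data partitioning from the paper are not modeled here.

```python
import copy
import torch

def average_models(replicas):
    """Average parameters across model replicas (one per GPU) and
    broadcast the result back; sketch of the synchronization step only."""
    avg = copy.deepcopy(replicas[0])
    with torch.no_grad():
        for name, p in avg.named_parameters():
            stacked = torch.stack(
                [dict(r.named_parameters())[name] for r in replicas])
            p.copy_(stacked.mean(dim=0))
    # Every replica resumes SGD from the averaged weights.
    for r in replicas:
        r.load_state_dict(avg.state_dict())
    return replicas
```

Because DistModel tolerates infrequent averaging, this step can run only every few minibatches, keeping communication cost low relative to computation.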
Improving language-universal feature extraction with deep maxout and convolutional neural networks
- Proc. Interspeech
, 2014
"... Abstract When deployed in automated speech recognition (ASR), deep neural networks (DNNs) can be treated as a complex feature extractor plus a simple linear classifier. Previous work has investigated the utility of multilingual DNNs acting as language-universal feature extractors (LUFEs). In this p ..."
Abstract
-
Cited by 6 (5 self)
- Add to MetaCart
(Show Context)
When deployed in automatic speech recognition (ASR), deep neural networks (DNNs) can be treated as a complex feature extractor plus a simple linear classifier. Previous work has investigated the utility of multilingual DNNs acting as language-universal feature extractors (LUFEs). In this paper, we explore different strategies to further improve LUFEs. First, we replace the standard sigmoid nonlinearity with the recently proposed maxout units. The resulting maxout LUFEs have the nice property of generating sparse feature representations. Second, the convolutional neural network (CNN) architecture is applied to obtain a more invariant feature space. We evaluate the performance of LUFEs on a cross-language ASR task. Each of the proposed techniques results in word error rate reduction compared with the existing DNN-based LUFEs. Combining the two methods together brings additional improvement on the target language.
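The LUFE usage pattern this abstract builds on can be sketched as follows: a multilingually trained DNN is truncated and its bottleneck activations are taken as features for a new target language. The sigmoid stack and the 42-dimensional bottleneck below are stand-ins; the paper's contribution is to swap in maxout units and CNN layers, which this baseline sketch omits.

```python
import torch
import torch.nn as nn

# Stand-in for a trained multilingual stack; real weights would come
# from multilingual training, with language-specific heads removed.
lufe = nn.Sequential(
    nn.Linear(40, 1024), nn.Sigmoid(),
    nn.Linear(1024, 1024), nn.Sigmoid(),
    nn.Linear(1024, 42),  # hypothetical 42-dim bottleneck
)

def extract_features(frames):
    """Run target-language frames through the LUFE and keep the
    bottleneck activations as language-universal features."""
    with torch.no_grad():
        return lufe(frames)

feats = extract_features(torch.randn(300, 40))  # (T, 42) features
```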
Investigation of maxout networks for speech recognition
- in Proc. IEEE ICASSP
, 2014
"... We explore the use of maxout neuron in various aspects of acous-tic modelling for large vocabulary speech recognition systems; in-cluding low-resource scenario and multilingual knowledge transfers. Through the experiments on voice search and short message dicta-tion datasets, we found that maxout ne ..."
Abstract
-
Cited by 5 (5 self)
- Add to MetaCart
(Show Context)
We explore the use of maxout neurons in various aspects of acoustic modelling for large vocabulary speech recognition systems, including low-resource scenarios and multilingual knowledge transfers. Through experiments on voice search and short message dictation datasets, we found that maxout networks are around three times faster to train and offer lower or comparable word error rates on several tasks, when compared to networks with the logistic nonlinearity. We also present a detailed study of the maxout unit's internal behaviour, suggesting the use of different nonlinearities in different layers. Index Terms — deep neural networks, maxout networks, multi-task learning, low-resource speech recognition
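A hypothetical sketch of the "different nonlinearities in different layers" suggestion: the small PyTorch model below uses a maxout unit in the lower layer and a logistic unit higher up. Sizes, depth, and the specific layer assignment are invented; the paper's analysis, not this particular arrangement, should guide real choices.

```python
import torch
import torch.nn as nn

def maxout(z, k=2):
    """Max over k linear pieces; z has shape (batch, units * k)."""
    return z.view(z.size(0), -1, k).amax(dim=2)

class MixedNonlinearityDNN(nn.Module):
    """Acoustic model mixing nonlinearities across layers (sketch)."""
    def __init__(self):
        super().__init__()
        self.l1 = nn.Linear(440, 1024 * 2)  # feeds a k=2 maxout
        self.l2 = nn.Linear(1024, 1024)     # logistic layer above it
        self.out = nn.Linear(1024, 3000)    # senone logits

    def forward(self, x):
        h = maxout(self.l1(x), k=2)
        h = torch.sigmoid(self.l2(h))
        return self.out(h)

logits = MixedNonlinearityDNN()(torch.randn(8, 440))
```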
Acoustic and Lexical Resource Constrained ASR using Language-Independent Acoustic Model and Language-Dependent Probabilistic Lexical Model
- Idiap
, 2014
"... Abstract One of the key challenges involved in building statistical automatic speech recognition (ASR) systems is modeling the relationship between subword units or "lexical units" and acoustic feature observations. To model this relationship two types of resources are needed, namely, aco ..."
Abstract
-
Cited by 4 (2 self)
- Add to MetaCart
(Show Context)
One of the key challenges involved in building statistical automatic speech recognition (ASR) systems is modeling the relationship between subword units or "lexical units" and acoustic feature observations. To model this relationship, two types of resources are needed, namely, acoustic resources, i.e., speech data with word-level transcriptions, and lexical resources, where each word is transcribed in terms of subword units. Standard ASR systems typically use phonemes or phones as subword units. However, not all languages have well-developed acoustic and phonetic lexical resources. In this paper, we show that the relationship between lexical units and acoustic features can be factored into two parts through a latent variable, namely, an acoustic model and a lexical model. In the acoustic model the relationship between latent variables and acoustic features is modeled, while in the lexical model a probabilistic relationship between latent variables and lexical units is modeled. We elucidate that in standard hidden Markov model based ASR systems, the relationship between lexical units and latent variables is one-to-one and the lexical model is deterministic. Through a literature survey we show that this deterministic lexical modeling imposes the need for well-developed acoustic and lexical resources from the target language or domain to build an ASR system. We then propose an approach that addresses both acoustic and phonetic lexical resource constraints in ASR system development. In the proposed approach, latent variables are multilingual phones and lexical units are graphemes of the target language or domain. We show that the acoustic model can be trained on domain-independent or language-independent resources, and the lexical model that models a probabilistic relationship between graphemes and multilingual phones can be trained on a relatively small amount of transcribed speech data from the target domain or language. The potential and the efficacy of the proposed approach are demonstrated through experiments and comparisons with other approaches on three different ASR tasks: non-native and accented speech recognition, rapid development of an ASR system for a new language, and development of an ASR system for a minority language.
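The factorization this abstract describes can be written as p(l | x_t) = sum_d p(l | d) p(d | x_t), with latent variables d (multilingual phones) linking acoustic frames x_t to lexical units l (graphemes). A small numpy sketch with random stand-in distributions and invented dimensions:

```python
import numpy as np

# Hypothetical sizes: D multilingual phones (latent variables),
# L target-language graphemes (lexical units), T frames.
D, L, T = 50, 30, 200

# Acoustic model: per-frame posteriors p(d | x_t), e.g. from a
# language-independent DNN. Random stand-in here.
acoustic_post = np.random.dirichlet(np.ones(D), size=T)  # (T, D)

# Lexical model: p(l | d), a probabilistic grapheme-phone map that can
# be trained on little target-language data. Random stand-in here.
lexical_model = np.random.dirichlet(np.ones(L), size=D)  # (D, L)

# Factorized relationship: p(l | x_t) = sum_d p(l | d) * p(d | x_t)
grapheme_post = acoustic_post @ lexical_model            # (T, L)
assert np.allclose(grapheme_post.sum(axis=1), 1.0)
```

In a standard HMM system, lexical_model would be a deterministic 0/1 matrix; the proposed approach makes it a learned probability table.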
Multilingual deep neural network based acoustic modeling for rapid language adaptation
- in Proc. ICASSP, IEEE
, 2014
"... This paper presents a study on multilingual deep neural net-work (DNN) based acoustic modeling and its application to new languages. We investigate the effect of phone merging on multilingual DNN in context of rapid language adapta-tion. Moreover, the combination of multilingual DNNs with Kullback–L ..."
Abstract
-
Cited by 2 (1 self)
- Add to MetaCart
This paper presents a study on multilingual deep neural network (DNN) based acoustic modeling and its application to new languages. We investigate the effect of phone merging on multilingual DNNs in the context of rapid language adaptation. Moreover, the combination of multilingual DNNs with Kullback–Leibler divergence based acoustic modeling (KL-HMM) is explored. Using ten different languages from the Globalphone database, our studies reveal that cross-lingual acoustic model transfer through multilingual DNNs is superior to unsupervised RBM pre-training and greedy layer-wise supervised training. We also found that KL-HMM based decoding consistently outperforms conventional hybrid decoding, especially in low-resource scenarios. Furthermore, the experiments indicate that multilingual DNN training benefits equally from simple phoneset concatenation and manually derived universal phonesets. Index Terms — Multilingual DNN, phone merging, rapid language adaptation, KL-HMM
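For context on KL-HMM decoding, a minimal sketch of its local score: the Kullback–Leibler divergence between an HMM state's trained categorical distribution over (multilingual) phones and the DNN posterior at frame t, with lower values indicating a better match. The directionality shown is one common variant; the dimensions and inputs below are stand-ins.

```python
import numpy as np

def kl_local_score(state_dist, frame_post, eps=1e-10):
    """KL(y_s || z_t) between an HMM state's categorical distribution
    y_s and the DNN posterior z_t for one frame; eps avoids log(0)."""
    y = state_dist + eps
    z = frame_post + eps
    return float(np.sum(y * np.log(y / z)))

# Hypothetical 50-phone posteriors.
y_s = np.random.dirichlet(np.ones(50))  # trained state distribution
z_t = np.random.dirichlet(np.ones(50))  # DNN output for frame t
score = kl_local_score(y_s, z_t)        # plugs into Viterbi decoding
```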