Cross-language knowledge transfer using multilingual deep neural network with shared hidden layers (2013)

by J Huang, J Li, D Yu, L Deng, Y Gong
Venue: Proc. ICASSP

Results 1 - 10 of 29

RECENT ADVANCES IN DEEP LEARNING FOR SPEECH RESEARCH AT MICROSOFT

by Li Deng, Jinyu Li, Jui-ting Huang, Kaisheng Yao, Dong Yu, Frank Seide, Michael L. Seltzer, Geoff Zweig, Xiaodong He, Jason Williams, Yifan Gong, Alex Acero
"... Deep learning is becoming a mainstream technology for speech recognition at industrial scale. In this paper, we provide an overview of the work by Microsoft speech researchers since 2009 in this area, focusing on more recent advances which shed light to the basic capabilities and limitations of the ..."
Abstract - Cited by 23 (10 self) - Add to MetaCart
Deep learning is becoming a mainstream technology for speech recognition at industrial scale. In this paper, we provide an overview of the work by Microsoft speech researchers since 2009 in this area, focusing on more recent advances which shed light on the basic capabilities and limitations of the current deep learning technology. We organize this overview along the feature-domain and model-domain dimensions according to the conventional approach to analyzing speech systems. Selected experimental results, including speech recognition and related applications such as spoken dialogue and language modeling, are presented to demonstrate and analyze the strengths and weaknesses of the techniques described in the paper. Potential improvements of these techniques and future research directions are discussed. Index Terms — deep learning, neural network, multilingual, speech recognition, spectral features, convolution, dialogue

Multilingual acoustic models using distributed deep neural networks, ICASSP

by G. Heigold, V. Vanhoucke, A. Senior, P. Nguyen, M. Ranzato, M. Devin, J. Dean , 2013
"... Today’s speech recognition technology is mature enough to be useful for many practical applications. In this context, it is of paramount importance to train accurate acoustic models for many languages within given resource constraints such as data, processing power, and time. Multilingual training h ..."
Abstract - Cited by 18 (2 self) - Add to MetaCart
Today’s speech recognition technology is mature enough to be useful for many practical applications. In this context, it is of paramount importance to train accurate acoustic models for many languages within given resource constraints such as data, processing power, and time. Multilingual training has the potential to solve the data issue and close the performance gap between resource-rich and resource-scarce languages. Neural networks lend themselves naturally to parameter sharing across languages, and distributed implementations have made it feasible to train large networks. In this paper, we present experimental results for cross- and multi-lingual network training of eleven Romance languages on 10k hours of data in total. The average relative gains over the monolingual baselines are 4%/2% (data-scarce/data-rich languages) for cross- and 7%/2% for multi-lingual training. However, the additional gain from jointly training the languages on all data comes at an increased training time of roughly four weeks, compared to two weeks (monolingual) and one week (crosslingual). Index Terms — Speech recognition, parameter sharing, deep neural networks, multilingual training, distributed neural networks

Citation Context

... from another language by using model adaptation [13, 14], a tandem approach [15, 16, 17, 18, 19], a phone mapping [20], unsupervised pre-training [21], initialization with an existing neural network [22, 19, 23], or building a language-universal acoustic model based on a shared phone set [24]. Also, multilingual recognition including language identification is beyond the scope of this paper [20]. The multili...
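Among the transfer strategies listed in this context, initializing a target-language network from an existing source-language network is the most direct. Below is a minimal NumPy sketch of that idea; the function names, layer sizes, and senone counts are hypothetical placeholders, not values from any cited paper.

    import numpy as np

    rng = np.random.default_rng(0)

    def init_layer(n_in, n_out):
        # Fan-in-scaled Gaussian initialization (illustrative only).
        return rng.normal(0.0, 1.0 / np.sqrt(n_in), (n_in, n_out)), np.zeros(n_out)

    # Source-language DNN: input -> two hidden layers -> senone softmax.
    source = {
        "h1": init_layer(440, 1024),    # e.g. 11 stacked frames x 40 filterbanks
        "h2": init_layer(1024, 1024),
        "out": init_layer(1024, 3000),  # hypothetical source senone count
    }
    # ... train `source` on source-language data here ...

    # Cross-lingual transfer by initialization: reuse the hidden layers and
    # replace only the output layer, since the target language has its own
    # senone inventory (hypothetical 2000 states).
    target = {
        "h1": source["h1"],
        "h2": source["h2"],
        "out": init_layer(1024, 2000),
    }
    # ... fine-tune `target` on the (smaller) target-language data ...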

Deep Maxout Networks for Low-Resource Speech Recognition

by Yajie Miao, Florian Metze, Shourabh Rawat - Proc. Automatic Speech Recognition and Understanding (ASRU)
"... ABSTRACT As a feed-forward architecture, the recently proposed maxout networks integrate dropout naturally and show stateof-the-art results on various computer vision datasets. This paper investigates the application of deep maxout networks (DMNs) to large vocabulary continuous speech recognition ( ..."
Abstract - Cited by 18 (11 self) - Add to MetaCart
ABSTRACT As a feed-forward architecture, the recently proposed maxout networks integrate dropout naturally and show state-of-the-art results on various computer vision datasets. This paper investigates the application of deep maxout networks (DMNs) to large vocabulary continuous speech recognition (LVCSR) tasks. Our focus is on the particular advantage of DMNs under low-resource conditions with limited transcribed speech. We extend DMNs to hybrid and bottleneck feature systems, and explore optimal network structures (number of maxout layers, pooling strategy, etc.) for both setups. On the newly released Babel corpus, behaviors of DMNs are extensively studied under different levels of data availability. Experiments show that DMNs improve low-resource speech recognition significantly. Moreover, DMNs introduce sparsity to their hidden activations and thus can act as sparse feature extractors.

Citation Context

...tage of DMNs is to naturally introduce sparsity in the learned representations. Since our focus is on low-resource tasks, we study feature extraction in the context of crosslingual speech recognition [12]. Our goal is to improve speech recognition on LimitedLP Tagalog, with the presence of auxiliary languages including LimitedLP Cantonese, Turkish and Pashto also from the Babel corpus. To achieve this...
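For reference, the maxout unit this abstract builds on: each hidden unit takes the maximum over k affine "pieces", which integrates naturally with dropout and tends to yield sparse-looking activations. A minimal NumPy sketch under hypothetical dimensions (not the configuration tuned in the paper):

    import numpy as np

    rng = np.random.default_rng(1)

    def maxout_layer(x, W, b):
        # x: (batch, n_in); W: (k, n_in, n_out) holds k linear "pieces"
        # per output unit; b: (k, n_out). Returns the element-wise max
        # over the k pieces, shape (batch, n_out).
        z = np.einsum("bi,kio->bko", x, W) + b   # (batch, k, n_out)
        return z.max(axis=1)

    # Hypothetical sizes: 2 pieces per unit, 440 inputs, 512 outputs.
    k, n_in, n_out = 2, 440, 512
    W = rng.normal(0.0, 0.01, (k, n_in, n_out))
    b = np.zeros((k, n_out))
    h = maxout_layer(rng.normal(size=(8, n_in)), W, b)
    print(h.shape)   # (8, 512)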

New types of deep neural network learning for speech recognition and related applications: An overview

by Li Deng, Geoffrey Hinton, Brian Kingsbury - in Proc. Int. Conf. Acoust., Speech, Signal Process., 2013
"... In this paper, we provide an overview of the invited and contributed papers presented at the special session at ICASSP-2013, entitled “New Types of Deep Neural Network Learning for Speech Recognition and Related Applications, ” as organized by the authors. We also describe the historical context in ..."
Abstract - Cited by 11 (4 self) - Add to MetaCart
In this paper, we provide an overview of the invited and contributed papers presented at the special session at ICASSP-2013, entitled “New Types of Deep Neural Network Learning for Speech Recognition and Related Applications,” as organized by the authors. We also describe the historical context in which acoustic models based on deep neural networks have been developed. The technical overview of the papers presented in our special session is organized into five ways of improving deep learning methods: (1) better optimization; (2) better types of neural activation function and better network architectures; (3) better ways to determine the myriad hyper-parameters of deep neural networks; (4) more appropriate ways to preprocess speech for deep neural networks; and (5) ways of leveraging multiple languages or dialects that are more easily achieved with deep neural networks than with Gaussian mixture models. Index Terms — deep neural network, convolutional neural network, recurrent neural network, optimization, spectrogram features, multitask, multilingual, speech recognition, music processing

Citation Context

...y as much as DNNs from being trained on multiple languages simultaneously or from being trained on one language and then modified for another language (e.g. [37]). Both the Microsoft paper [12] (also [29]) and the Google paper [22] elaborate such a new capability, sharing the same example of multilingual speech recognition. In Figure 1, the multi-task learning accomplished by DNN is shown for two scen...

Towards speaker adaptive training of deep neural network acoustic models

by Yajie Miao, Lu Jiang, Justin Chiu, Hao Zhang, Florian Metze - Proc. Interspeech, 2014
"... ABSTRACT Speaker adaptive training (SAT) is a well studied technique for Gaussian mixture acoustic models (GMMs). Recently we proposed to perform SAT for deep neural networks (DNNs), with speaker i-vectors applied in feature learning. The resulting SAT-DNN models significantly outperform DNNs on wo ..."
Abstract - Cited by 9 (7 self) - Add to MetaCart
ABSTRACT Speaker adaptive training (SAT) is a well-studied technique for Gaussian mixture acoustic models (GMMs). Recently we proposed to perform SAT for deep neural networks (DNNs), with speaker i-vectors applied in feature learning. The resulting SAT-DNN models significantly outperform DNNs on word error rates (WERs). In this paper, we present different methods to further improve and extend SAT-DNN. First, we conduct detailed analysis to investigate i-vector extractor training and flexible feature fusion. Second, the SAT-DNN approach is extended to improve tasks including bottleneck feature (BNF) generation, convolutional neural network (CNN) acoustic modeling and multilingual DNN-based feature extraction. Third, for transcribing multimedia data, we enrich the i-vector representation with global speaker attributes (age, gender, etc.) obtained automatically from the video signal. On a collection of instructional videos, incorporation of the additional visual features is observed to boost the recognition accuracy of SAT-DNN.

Citation Context

... SAT-CNN models are found to perform better than their baselines relatively by 48%. Moreover, DNNs trained with multilingual data have served successfully as deep feature extractors on a new language [14]. For more invariant feature representations, we extend SAT-DNN to the learning of multilingual feature extractors and develop two strategies to train iVecNN over multiple languages. Cross-language ex...
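The SAT-DNN recipe referenced here conditions the network on speaker identity by feeding a fixed-length i-vector alongside every acoustic frame. The sketch below shows only that feature-fusion step, with hypothetical dimensions; the cited work additionally trains an adaptation network (iVecNN) on top of the i-vectors.

    import numpy as np

    def append_ivector(frames, ivector):
        # frames: (n_frames, feat_dim) acoustic features for one speaker;
        # ivector: (ivec_dim,) fixed-length speaker representation.
        # Returns (n_frames, feat_dim + ivec_dim) as DNN input.
        tiled = np.tile(ivector, (frames.shape[0], 1))
        return np.concatenate([frames, tiled], axis=1)

    # Hypothetical: 40-dim filterbank frames, 100-dim i-vector.
    rng = np.random.default_rng(2)
    dnn_input = append_ivector(rng.normal(size=(300, 40)),
                               rng.normal(size=(100,)))
    print(dnn_input.shape)   # (300, 140)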

Distributed Learning of Multilingual DNN Feature Extractors using GPUs

by Yajie Miao, Hao Zhang, Florian Metze - in Proc. Interspeech, 2014
"... Abstract Multilingual deep neural networks (DNNs) can act as deep feature extractors and have been applied successfully to crosslanguage acoustic modeling. Learning these feature extractors becomes an expensive task, because of the enlarged multilingual training data and the sequential nature of st ..."
Abstract - Cited by 8 (5 self) - Add to MetaCart
Abstract Multilingual deep neural networks (DNNs) can act as deep feature extractors and have been applied successfully to crosslanguage acoustic modeling. Learning these feature extractors becomes an expensive task, because of the enlarged multilingual training data and the sequential nature of stochastic gradient descent (SGD). This paper investigates strategies to accelerate the learning process over multiple GPU cards. We propose the DistModel and DistLang frameworks, which distribute feature extractor learning by models and languages respectively. The time-synchronous DistModel has the nice property of tolerating infrequent model averaging. With 3 GPUs, DistModel achieves a 2.6× speed-up and causes no loss in word error rate. When using DistLang, we observe better acceleration but worse recognition performance. Further evaluations are conducted to scale DistModel to more languages and GPU cards.

Citation Context

...ed to perform classification with respect to HMM states. Following this idea, multilingual DNNs are trained collaboratively over the source languages, with their hidden layers shared across languages [9]. On the target language, these shared layers are taken as a feature extractor which is intrinsically language-independent. Previous work [5, 9] has reported the effectiveness of aforementioned featur...
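The DistModel framework described in the abstract above trains one replica of the feature extractor per GPU and periodically averages their parameters, and the reported finding is that this averaging can be infrequent without hurting word error rate. A toy sketch of the synchronization step, with hypothetical replica contents:

    import numpy as np

    def average_models(replicas):
        # replicas: list of dicts mapping parameter name -> ndarray.
        # After the call, every replica holds the averaged parameters.
        for name in replicas[0]:
            mean = np.mean([r[name] for r in replicas], axis=0)
            for r in replicas:
                r[name] = mean.copy()

    # Hypothetical: 3 GPU replicas of a single weight matrix.
    rng = np.random.default_rng(3)
    replicas = [{"W": rng.normal(size=(4, 4))} for _ in range(3)]
    # ... each replica takes SGD steps on its own data shard ...
    average_models(replicas)   # sync point, run only every N mini-batches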

Improving language-universal feature extraction with deep maxout and convolutional neural networks

by Yajie Miao, Florian Metze - Proc. Interspeech, 2014
"... Abstract When deployed in automated speech recognition (ASR), deep neural networks (DNNs) can be treated as a complex feature extractor plus a simple linear classifier. Previous work has investigated the utility of multilingual DNNs acting as language-universal feature extractors (LUFEs). In this p ..."
Abstract - Cited by 6 (5 self) - Add to MetaCart
Abstract When deployed in automated speech recognition (ASR), deep neural networks (DNNs) can be treated as a complex feature extractor plus a simple linear classifier. Previous work has investigated the utility of multilingual DNNs acting as language-universal feature extractors (LUFEs). In this paper, we explore different strategies to further improve LUFEs. First, we replace the standard sigmoid nonlinearity with the recently proposed maxout units. The resulting maxout LUFEs have the nice property of generating sparse feature representations. Second, the convolutional neural network (CNN) architecture is applied to obtain a more invariant feature space. We evaluate the performance of LUFEs on a cross-language ASR task. Each of the proposed techniques results in word error rate reduction compared with the existing DNN-based LUFEs. Combining the two methods together brings additional improvement on the target language.

Citation Context

...in [3] that the effectiveness of DNNs comes largely from the invariance of the representations to variability such as speakers, environments and channels. Following this feature learning formulation, [4] trains DNNs over multiple languages, with the hidden layers shared across languages. These shared layers are taken as a language-universal feature extractor (LUFE) [4]. Given a new language, acoustic ...
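For concreteness, the shared-hidden-layer design this context describes: all training languages share one hidden stack, each language owns a softmax head over its own senones, and after training the shared stack alone is kept as the language-universal feature extractor. A forward-pass-only NumPy sketch; the layer sizes, language names, and senone counts are hypothetical.

    import numpy as np

    rng = np.random.default_rng(4)

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def softmax(z):
        e = np.exp(z - z.max(axis=1, keepdims=True))
        return e / e.sum(axis=1, keepdims=True)

    # Shared hidden stack (hypothetical: 440 -> 1024 -> 1024).
    shared = [(rng.normal(0.0, 0.01, (440, 1024)), np.zeros(1024)),
              (rng.normal(0.0, 0.01, (1024, 1024)), np.zeros(1024))]

    # One softmax head per training language (hypothetical senone counts).
    heads = {"cantonese": (rng.normal(0.0, 0.01, (1024, 3000)), np.zeros(3000)),
             "turkish":   (rng.normal(0.0, 0.01, (1024, 2500)), np.zeros(2500))}

    def extract_features(x):
        # The LUFE: shared layers only, no language-specific head.
        for W, b in shared:
            x = sigmoid(x @ W + b)
        return x

    def forward(x, language):
        # Training-time pass for a mini-batch of one language.
        W, b = heads[language]
        return softmax(extract_features(x) @ W + b)

    posteriors = forward(rng.normal(size=(8, 440)), "cantonese")
    features = extract_features(rng.normal(size=(8, 440)))  # new language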

Investigation of maxout networks for speech recognition

by Pawel Swietojanski, Jinyu Li, Jui-ting Huang - in Proc. IEEE ICASSP, 2014
"... We explore the use of maxout neuron in various aspects of acous-tic modelling for large vocabulary speech recognition systems; in-cluding low-resource scenario and multilingual knowledge transfers. Through the experiments on voice search and short message dicta-tion datasets, we found that maxout ne ..."
Abstract - Cited by 5 (5 self) - Add to MetaCart
We explore the use of maxout neurons in various aspects of acoustic modelling for large vocabulary speech recognition systems, including low-resource scenarios and multilingual knowledge transfer. Through experiments on voice search and short message dictation datasets, we found that maxout networks are around three times faster to train and offer lower or comparable word error rates on several tasks, when compared to networks with logistic nonlinearity. We also present a detailed study of the maxout unit's internal behaviour, suggesting the use of different nonlinearities in different layers. Index Terms — deep neural networks, maxout networks, multi-task learning, low-resource speech recognition

Citation Context

...on [4], ii) conversational-style large vocabulary speech recognition (LVSR) systems [5, 6, 7, 8], iii) noise robust applications [9], iv) various aspects of multi- and cross-lingual learning schemes [10, 11, 12, 13, 14] and v) distant and multichannel LVSR of meetings [15]. All of the above examples share similar feed-forward multi-layer network architectures where each hidden layer implements a linear affine operat...

Acoustic and Lexical Resource Constrained ASR using Language-Independent Acoustic Model and Language-Dependent Probabilistic Lexical Model, Idiap

by Ramya Rasipuram, Mathew Magimai-Doss, 2014
"... Abstract One of the key challenges involved in building statistical automatic speech recognition (ASR) systems is modeling the relationship between subword units or "lexical units" and acoustic feature observations. To model this relationship two types of resources are needed, namely, aco ..."
Abstract - Cited by 4 (2 self) - Add to MetaCart
Abstract One of the key challenges involved in building statistical automatic speech recognition (ASR) systems is modeling the relationship between subword units or "lexical units" and acoustic feature observations. To model this relationship, two types of resources are needed, namely, acoustic resources, i.e., speech data with word-level transcriptions, and lexical resources, where each word is transcribed in terms of subword units. Standard ASR systems typically use phonemes or phones as subword units. However, not all languages have well-developed acoustic and phonetic lexical resources. In this paper, we show that the relationship between lexical units and acoustic features can be factored into two parts through a latent variable, namely, an acoustic model and a lexical model. In the acoustic model the relationship between latent variables and acoustic features is modeled, while in the lexical model a probabilistic relationship between latent variables and lexical units is modeled. We elucidate that in standard hidden Markov model based ASR systems, the relationship between lexical units and latent variables is one-to-one and the lexical model is deterministic. Through a literature survey we show that this deterministic lexical modeling imposes the need for well-developed acoustic and lexical resources from the target language or domain to build an ASR system. We then propose an approach that addresses both acoustic and phonetic lexical resource constraints in ASR system development. In the proposed approach, latent variables are multilingual phones and lexical units are graphemes of the target language or domain. We show that the acoustic model can be trained on domain-independent or language-independent resources, and the lexical model that models a probabilistic relationship between graphemes and multilingual phones can be trained on a relatively small amount of transcribed speech data from the target domain or language. The potential and the efficacy of the proposed approach are demonstrated through experiments and comparisons with other approaches on three different ASR tasks: non-native and accented speech recognition, rapid development of an ASR system for a new language, and development of an ASR system for a minority language.

Citation Context

...e language are modeled. The present paper focuses on the first problem. To model the relationship between lexical units and acoustic features, transcribed speech data and a phonetic lexicon are required. While this is not an issue for resource rich languages, it is challenging for under-resourced languages and domains that may not have such resources (Besacier et al., 2014). In the literature, the lack of transcribed speech data has been typically addressed through multilingual and crosslingual approaches (Kohler, 1998; Schultz and Waibel, 2001; Burget et al., 2010; Swietojanski et al., 2012; Huang et al., 2013). In these approaches, first the relationship between lexical units and acoustic feature observations is learned on domain- or language-independent data and later adapted on target language or domain data. If the phonetic lexicon in the target language is not available, then the use of alternate subword units such as graphemes has been explored (Schukat-Talamazzini et al., 1993; Kanthak and Ney, 2002; Killer et al., 2003; Dines and Magimai-Doss, 2007; Ko and Mak, 2014). However, the lack of both acoustic and lexical resources has rarely been studied in the past (Stuker, 2008b; Stuker, 2008a)...
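One simple way to read the factorization this abstract describes: the score of a grapheme-based lexical unit given acoustics marginalizes over latent multilingual phones, with the acoustic model supplying p(phone | acoustics) and the probabilistic lexical model supplying p(lexical unit | phone). The numbers below are made up, and the cited work's KL-HMM formulation scores states differently; this sketch only illustrates the two-part factoring.

    import numpy as np

    # Acoustic model output for one frame: p(phone | acoustics),
    # over 3 hypothetical latent multilingual phone classes.
    p_phone_given_x = np.array([0.7, 0.2, 0.1])

    # Probabilistic lexical model: row i is p(grapheme unit | phone i),
    # over 2 hypothetical grapheme units. Values are made up.
    p_graph_given_phone = np.array([[0.9, 0.1],
                                    [0.4, 0.6],
                                    [0.2, 0.8]])

    # Marginalize over the latent phones:
    #   p(g | x) = sum_phi p(g | phi) * p(phi | x)
    p_graph_given_x = p_phone_given_x @ p_graph_given_phone
    print(p_graph_given_x)   # [0.73 0.27]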

Multilingual deep neural network based acoustic modeling for rapid language adaptation

by Ngoc Thang Vu, David Imseng, Daniel Povey, Petr Motlicek, Tanja Schultz, Hervé Bourlard - in ICASSP, IEEE, 2014
"... This paper presents a study on multilingual deep neural net-work (DNN) based acoustic modeling and its application to new languages. We investigate the effect of phone merging on multilingual DNN in context of rapid language adapta-tion. Moreover, the combination of multilingual DNNs with Kullback–L ..."
Abstract - Cited by 2 (1 self) - Add to MetaCart
This paper presents a study on multilingual deep neural network (DNN) based acoustic modeling and its application to new languages. We investigate the effect of phone merging on multilingual DNNs in the context of rapid language adaptation. Moreover, the combination of multilingual DNNs with Kullback–Leibler divergence based acoustic modeling (KL-HMM) is explored. Using ten different languages from the Globalphone database, our studies reveal that crosslingual acoustic model transfer through multilingual DNNs is superior to unsupervised RBM pre-training and greedy layer-wise supervised training. We also found that KL-HMM based decoding consistently outperforms conventional hybrid decoding, especially in low-resource scenarios. Furthermore, the experiments indicate that multilingual DNN training benefits equally from simple phoneset concatenation and manually derived universal phonesets. Index Terms — Multilingual DNN, phone merging, rapid language adaptation, KL-HMM
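As a reference for the KL-HMM decoding mentioned above: each HMM state stores a categorical distribution over the DNN's phone posteriors, and the local emission cost of a frame is a Kullback–Leibler divergence between the state's distribution and the network's posterior for that frame (lower is better). A minimal sketch; the divergence direction and its combination with the usual HMM costs are simplified relative to the cited work.

    import numpy as np

    def kl_divergence(p, q, eps=1e-10):
        # KL(p || q) for two categorical distributions.
        p = np.clip(p, eps, 1.0)
        q = np.clip(q, eps, 1.0)
        return float(np.sum(p * np.log(p / q)))

    # Hypothetical: a state's categorical over 3 phone classes,
    # and one frame's DNN posterior over the same classes.
    state_dist = np.array([0.8, 0.15, 0.05])
    frame_posterior = np.array([0.6, 0.3, 0.1])

    # Local emission cost for this (state, frame) pair during decoding.
    print(kl_divergence(state_dist, frame_posterior))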