A deep architecture with bilinear modeling of hidden representations: Applications to phonetic recognition (2012)

by B Hutchinson, L Deng, D Yu
Venue: Proceedings of ICASSP

Results 1 - 9 of 9

Deep Neural Networks for Acoustic Modeling in Speech Recognition

by Geoffrey Hinton, Li Deng, Dong Yu, George Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Vincent Vanhoucke, Patrick Nguyen, Tara Sainath, Brian Kingsbury
Abstract - Cited by 272 (47 self)
Most current speech recognition systems use hidden Markov models (HMMs) to deal with the temporal variability of speech and Gaussian mixture models to determine how well each state of each HMM fits a frame or a short window of frames of coefficients that represents the acoustic input. An alternative way to evaluate the fit is to use a feedforward neural network that takes several frames of coefficients as input and produces posterior probabilities over HMM states as output. Deep neural networks that have many hidden layers and are trained using new methods have been shown to outperform Gaussian mixture models on a variety of speech recognition benchmarks, sometimes by a large margin. This paper provides an overview of this progress and represents the shared views of four research groups who have had recent successes in using deep neural networks for acoustic modeling in speech recognition.
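The input/output contract described in this abstract (several frames of coefficients in, posterior probabilities over HMM states out) can be illustrated with a minimal sketch. It is not taken from the paper: the function names `stack_frames` and `state_posteriors` are hypothetical, edge frames are repeated at utterance boundaries as an assumption, and a single linear layer followed by a softmax stands in for the full deep network.

```python
import math

def stack_frames(frames, t, context):
    """Concatenate frames t-context..t+context into one input vector.

    Assumption: frames at the utterance edges are repeated as padding.
    """
    last = len(frames) - 1
    window = []
    for offset in range(-context, context + 1):
        idx = min(max(t + offset, 0), last)  # clamp index into the utterance
        window.extend(frames[idx])
    return window

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def state_posteriors(weights, biases, x):
    """One linear layer plus softmax; a stand-in for the deep network."""
    scores = [sum(w * xi for w, xi in zip(row, x)) + b
              for row, b in zip(weights, biases)]
    return softmax(scores)

# Three one-dimensional frames, a context of one frame on each side.
frames = [[0.0], [1.0], [2.0]]
x = stack_frames(frames, 0, 1)
posteriors = state_posteriors([[0.0] * 3, [0.0] * 3], [0.0, 0.0], x)
```

With all-zero weights the two hypothetical states receive equal posterior mass; in a real system the weights would be learned and the window would cover many more coefficients per frame.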

Citation Context

...tween one and two orders of magnitude, but the fine-tuning stage remains a serious bottleneck and more effective ways of parallelizing training are needed. Some recent attempts are described in [52], [57]. Most DBN-DNN acoustic models are fine-tuned by applying stochastic gradient descent with momentum to small mini-batches of training cases. More sophisticated optimization methods that can be used on...

Use of kernel deep convex networks and end-to-end learning for spoken language understanding

by Li Deng, Gokhan Tur, Xiaodong He, Dilek Hakkani-tur - IEEE SLT, 2012
Abstract - Cited by 16 (9 self)
We present our recent and ongoing work on applying deep learning techniques to spoken language understanding (SLU) problems. The previously developed deep convex network (DCN) is extended to its kernel version (K-DCN) where the number of hidden units in each DCN layer approaches infinity using the kernel trick. We report experimental results demonstrating dramatic error reduction achieved by the K-DCN over both the Boosting-based baseline and the DCN on a domain classification task of SLU, especially when a highly correlated set of features extracted from search query click logs is used. Not only can DCN and K-DCN be used as a domain or intent classifier for SLU, they can also be used as local, discriminative feature extractors for the slot filling task of SLU. The interface of K-DCN to slot filling systems via the softmax function is presented. Finally, we outline an end-to-end learning strategy for training the softmax parameters (and potentially all DCN and K-DCN parameters) where the learning objective can take any performance measure (e.g. the F-measure) for the full SLU system. Index Terms: kernel learning, deep learning, spoken language understanding, domain detection, slot filling

Machine Learning Paradigms for Speech Recognition: An Overview

by Li Deng, Xiao Li, 2013
Abstract - Cited by 11 (1 self)
Automatic Speech Recognition (ASR) has historically been a driving force behind many machine learning (ML) techniques, including the ubiquitously used hidden Markov model, discriminative learning, structured sequence learning, Bayesian learning, and adaptive learning. Moreover, ML can and occasionally does use ASR as a large-scale, realistic application to rigorously test the effectiveness of a given technique, and to inspire new problems arising from the inherently sequential and dynamic nature of speech. On the other hand, even though ASR is available commercially for some applications, it is largely an unsolved problem—for almost all applications, the performance of ASR is not on par with human performance. New insight from modern ML methodology shows great promise to advance the state-of-the-art in ASR technology. This overview article provides readers with an overview of modern ML techniques as utilized in the current and as relevant to future ASR research and systems. The intent is to foster further cross-pollination between the ML and ASR communities than has occurred in the past. The article is organized according to the major ML paradigms that are either popular already or have potential for making significant contributions to ASR technology. The paradigms presented and elaborated in this overview include: generative and discriminative learning; supervised, unsupervised, semi-supervised, and active learning; adaptive and multi-task learning; and Bayesian learning. These learning paradigms are motivated and discussed in the context of ASR technology and applications. We finally present and analyze recent developments of deep learning and learning with sparse representations, focusing on their direct relevance to advancing ASR technology.

The deep tensor neural network with applications to large vocabulary speech recognition

by Dong Yu - IEEE Audio, Speech, Lang. Process., 2013
Abstract - Cited by 10 (8 self)
The recently proposed context-dependent deep neural network hidden Markov models (CD-DNN-HMMs) have proved highly promising for large vocabulary speech recognition. In this paper, we develop a more advanced type of DNN, which we call the deep tensor neural network (DTNN). The DTNN extends the conventional DNN by replacing one or more of its layers with a double-projection (DP) layer, in which each input vector is projected into two nonlinear subspaces, and a tensor layer, in which two subspace projections interact with each other and jointly predict the next layer in the deep architecture. In addition, we describe an approach to map the tensor layers to the conventional sigmoid layers so that the former can be treated and trained in a similar way to the latter. With this mapping we can consider a DTNN as the DNN augmented with DP layers, so that not only can the BP learning algorithm of DTNNs be cleanly derived, but new types of DTNNs can also be more easily developed. Evaluation on Switchboard tasks indicates that DTNNs can outperform the already high-performing DNNs with 4–5% and 3% relative word error reduction, respectively, using 30-hr and 309-hr training sets. Index Terms: Automatic speech recognition, CD-DNN-HMM, large vocabulary, tensor deep neural networks.
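One plausible reading of the mapping from tensor layers to conventional sigmoid layers is that a three-way interaction between two projections equals an ordinary weighted sum over their flattened outer (Kronecker) product. The sketch below checks that identity numerically. It is an illustration under that assumption, not the paper's implementation; the names `tensor_layer`, `kron`, `unfold`, and `sigmoid_layer` are hypothetical, and biases are omitted.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def tensor_layer(U, h1, h2):
    """v[k] = sigmoid(sum over i, j of U[i][j][k] * h1[i] * h2[j])."""
    n_out = len(U[0][0])
    return [sigmoid(sum(U[i][j][k] * h1[i] * h2[j]
                        for i in range(len(h1))
                        for j in range(len(h2))))
            for k in range(n_out)]

def kron(h1, h2):
    """Flattened outer product: entry i*len(h2)+j holds h1[i]*h2[j]."""
    return [a * b for a in h1 for b in h2]

def unfold(U):
    """Reshape U[i][j][k] into a matrix W with W[k][i*J+j] = U[i][j][k]."""
    n_out = len(U[0][0])
    return [[U[i][j][k] for i in range(len(U)) for j in range(len(U[0]))]
            for k in range(n_out)]

def sigmoid_layer(W, x):
    """Conventional sigmoid layer: sigmoid of each row of W dotted with x."""
    return [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in W]

# Illustrative values only: a 2 x 2 x 2 tensor and two 2-dimensional projections.
U = [[[0.1, -0.2], [0.3, 0.4]], [[-0.5, 0.6], [0.7, -0.8]]]
h1, h2 = [0.2, 0.9], [0.5, 0.1]
direct = tensor_layer(U, h1, h2)
mapped = sigmoid_layer(unfold(U), kron(h1, h2))
```

Both routes produce the same activations, which is why the tensor layer can reuse the training machinery of a conventional layer once its input is taken to be the flattened product.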

Citation Context

...mbinations. The DTNN that we will present in this paper, however, is a deep network and it predicts the upper layer directly through the tensor connections as shown in (2) in Section III. More recent work [20] replaced the single sigmoid hidden layer with a tensor layer in a deep network where blocks of shallow networks are used to construct the stacking deep architecture and each block in the stacking net...

Large vocabulary speech recognition using deep tensor neural networks

by Dong Yu, Li Deng, Frank Seide - in Proc. Interspeech ’12
Abstract - Cited by 7 (3 self)
Recently, we proposed and developed the context-dependent deep neural network hidden Markov models (CD-DNN-HMMs) for large vocabulary speech recognition and achieved highly promising recognition results including over one third fewer word errors than the discriminatively trained, conventional HMM-based systems on the 300hr Switchboard benchmark task. In this paper, we extend DNNs to deep tensor neural networks (DTNNs) in which one or more layers are double-projection and tensor layers. The basic idea of the DTNN comes from our realization that many factors interact with each other to predict the output. To represent these interactions, we project the input to two nonlinear subspaces through the double-projection layer and model the interactions between these two subspaces and the output neurons through a tensor with three-way connections. Evaluation on the 30hr Switchboard task indicates that DTNNs can outperform DNNs with a similar number of parameters with 5% relative word error reduction. Index Terms: automatic speech recognition, tensor deep neural networks, CD-DNN-HMM, large vocabulary

Citation Context

...ctors and labels. Yu, Chen, and Deng [14] extended the gated softmax layer to DNNs and also proposed a tensor-based architecture that uses separately predicted gating factors. Hutchinson, Deng and Yu [13] replaced the single sigmoid hidden layer with a tensor layer in the stacking networks. The DTNN proposed in this work is different from all the above prior art in that it uses double-projection laye...

Discriminative Features via Generalized Eigenvectors

by Nikos Karampatziakis, Paul Mineiro
Abstract - Cited by 4 (1 self)
Representing examples in a way that is compatible with the underlying classifier can greatly enhance the performance of a learning system. In this paper we investigate scalable techniques for inducing discriminative features by taking advantage of simple second order structure in the data. We focus on multiclass classification and show that features extracted from the generalized eigenvectors of the class conditional second moments lead to classifiers with excellent empirical performance. Moreover, these features have attractive theoretical properties, such as inducing representations that are invariant to linear transformations of the input. We evaluate classifiers built from these features on three different tasks, obtaining state-of-the-art results.

Citation Context

... audio. Such a classifier can be composed with standard sequence modeling techniques to produce an overall solution, which has made the multiclass problem a subject of research (Hinton et al., 2012b; Hutchinson et al., 2012). In this experiment we focus exclusively on the multiclass problem. We use a standard preprocessing of TIMIT as our initial representation (Hutchinson et al., 2012). Specifically the speech is conve...

Parallel training of deep stacking networks

by Li Deng, Brian Hutchinson, Dong Yu - in Interspeech, 2012
Abstract - Cited by 2 (0 self)
The Deep Stacking Network (DSN) is a special type of deep architecture developed to enable and benefit from parallel learning of its model parameters on large CPU clusters. As a prospective key component of future speech recognizers, the architectural design of the DSN and its parallel training endow the DSN with scalability over a vast amount of training data. In this paper, we present our first parallel implementation of the DSN training algorithm. Particularly, we show the tradeoff between the time/memory saving via training parallelism and the associated cost arising from inter-CPU communication. Further, in phone classification experiments, we demonstrate a significantly lowered error rate using parallel full-batch training distributed over a CPU cluster, compared with sequential minibatch training implemented in a single CPU machine under otherwise identical experimental conditions and as exploited prior to the work reported in this paper. Index Terms: parallel and distributed computing, deep stacking networks, full-batch training, phone classification

Citation Context

...duction Since the birth of deep learning around 2006 [10][2][14], deep models with various types have recently been developed and successfully evaluated for a number of speech processing applications [3][4][7][11][15]. Among these models, the Deep Stacking Network (DSN), presented recently in [5][6], is particularly attractive due to the potential of using parallel computing for learning its weights....

CHAPTER 1.2 DEEP DISCRIMINATIVE AND GENERATIVE MODELS FOR PATTERN RECOGNITION

by Li Deng, Navdeep Jaitly
Abstract
In this chapter we describe deep generative and discriminative models as they have been applied to speech recognition and related pattern recognition problems. The former models describe the distribution of data or the joint distribution of data and the corresponding targets, whereas the latter models describe the distribution of targets conditioned on data. Both models are characterized as being 'deep' as they use layers of latent or hidden variables. Understanding and exploiting tradeoffs between deep generative and discriminative models is a fascinating area of research and it forms the background of this chapter. We focus on speech recognition but our analysis is applicable to other domains. We suggest ways in which deep generative models can be beneficially integrated with deep discriminative models based on their respective strengths. We also examine the recent advances in end-to-end optimization, a hallmark of deep learning that differentiates it from most standard pattern recognition practices.

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE SPECIAL ISSUE IN LEARNING DEEP A

by Brian Hutchinson, Li Deng, Dong Yu
Abstract
A novel deep architecture, the Tensor Deep Stacking Network (T-DSN), is presented. The T-DSN consists of multiple, stacked blocks, where each block contains a bilinear mapping from two hidden layers to the output layer, using a weight tensor to incorporate higher-order statistics of the hidden binary ([0, 1]) features. A learning algorithm for the T-DSN’s weight matrices and tensors is developed and described, in which the main parameter estimation burden is shifted to a convex sub-problem with a closed-form solution. Using an efficient and scalable parallel implementation for CPU clusters, we train sets of T-DSNs in three popular tasks in increasing order of data size: handwritten digit recognition using MNIST (60k), isolated state/phone classification and continuous phone recognition using TIMIT (1.1m), and isolated phone classification using WSJ0 (5.2m). Experimental results in all three tasks demonstrate the effectiveness of the T-DSN and the associated learning methods in a consistent manner. In particular, a sufficient depth of the T-DSN, a symmetry in the two-hidden-layer structure of each T-DSN block, our model parameter learning algorithm, and a softmax layer on top of the T-DSN are all shown to have contributed to the low error rates observed in the experiments for all three tasks.
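The forward pass of a single block, as the abstract describes it (two hidden layers combined through a weight tensor into the output), can be sketched as follows. This is a minimal illustration rather than the authors' code: the names `hidden_layer`, `bilinear_output`, and `tdsn_block` are hypothetical, sigmoid hidden units and the absence of bias terms are assumptions, and the closed-form learning step and the stacking of blocks are left out.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def hidden_layer(W, x):
    """Sigmoid hidden layer; W holds one row of input weights per hidden unit."""
    return [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in W]

def bilinear_output(T, h1, h2):
    """y[k] = sum over i, j of T[i][j][k] * h1[i] * h2[j] (linear in T)."""
    n_out = len(T[0][0])
    y = [0.0] * n_out
    for i, a in enumerate(h1):
        for j, b in enumerate(h2):
            for k in range(n_out):
                y[k] += T[i][j][k] * a * b
    return y

def tdsn_block(W1, W2, T, x):
    """One block: two hidden layers computed from x, then a bilinear map to the output."""
    h1 = hidden_layer(W1, x)
    h2 = hidden_layer(W2, x)
    return bilinear_output(T, h1, h2)

# Illustrative values: 2 inputs, 2 units in each hidden layer, 1 output.
W1 = [[0.0, 0.0], [0.0, 0.0]]
W2 = [[0.0, 0.0], [0.0, 0.0]]
T = [[[1.0], [1.0]], [[1.0], [1.0]]]
y = tdsn_block(W1, W2, T, [0.3, -0.7])
```

Because the output is linear in the tensor `T` once the hidden activations are fixed, fitting `T` to targets with a squared-error loss is a least-squares problem, which is consistent with the convex sub-problem mentioned in the abstract.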

Citation Context

...ot been fully edited. Content may change prior to final publication. IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE SPECIAL ISSUE IN LEARNING DEEP ARCHITECTURES, IEEE TPAMI, 2012 2 in [8]. This paper significantly expands the work and contains comprehensive experimental results plus details of the learning algorithm and its implementation. One major motivation for developing the recen...
