Accelerated parallelizable neural networks learning algorithms for speech recognition (2010)

by D. Yu and L. Deng
Venue: Proc. Interspeech, 2011

Results 1 - 6 of 6

A deep architecture with bilinear modeling of hidden representations: Applications to phonetic recognition

by Brian Hutchinson, Li Deng, Dong Yu - Proc. ICASSP, 2012
"... We develop and describe a novel deep architecture, the Tensor Deep Stacking Network (T-DSN), where multiple blocks are stacked one on top of another and where a bilinear mapping from hidden representations to the output in each block is used to incorporate higherorder statistics of the input feature ..."
Abstract - Cited by 9 (8 self) - Add to MetaCart
We develop and describe a novel deep architecture, the Tensor Deep Stacking Network (T-DSN), where multiple blocks are stacked one on top of another and where a bilinear mapping from hidden representations to the output in each block is used to incorporate higher-order statistics of the input features. A learning algorithm for the T-DSN is presented, in which the main parameter estimation burden is shifted to a convex sub-problem with a closed-form solution. Using an efficient and scalable parallel implementation, we train a T-DSN to discriminate standard three-state monophones in the TIMIT database. The T-DSN outperforms an alternative pretrained Deep Neural Network (DNN) architecture in frame-level classification (both state and phone) and in the cross-entropy measure. For continuous phonetic recognition, the T-DSN performs equivalently to a DNN but without the need for a hard-to-scale, sequential fine-tuning step. Index Terms — deep learning, higher-order statistics, tensors, stacking model, phonetic classification and recognition
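The convex sub-problem with a closed-form solution mentioned in the abstract is the estimation of the upper-layer weights given fixed hidden representations and targets. A minimal NumPy sketch of that least-squares step follows; the function name, the ridge term, and the toy data are illustrative assumptions, not the paper's own code.

```python
import numpy as np

def upper_layer_weights(H, T, reg=1e-4):
    """Closed-form least-squares solution for the upper-layer weights U.

    H   : hidden representations, shape (num_hidden, num_samples)
    T   : target (label) matrix,  shape (num_classes, num_samples)
    reg : small ridge term for numerical stability (illustrative value)

    Solves min_U ||U^T H - T||_F^2, giving U = (H H^T + reg*I)^{-1} H T^T.
    """
    num_hidden = H.shape[0]
    gram = H @ H.T + reg * np.eye(num_hidden)
    return np.linalg.solve(gram, H @ T.T)      # shape (num_hidden, num_classes)

# Toy usage: 50 hidden units, 10 classes, 1000 frames of random data.
rng = np.random.default_rng(0)
H = rng.random((50, 1000))
T = np.eye(10)[:, rng.integers(0, 10, 1000)]   # one-hot targets
U = upper_layer_weights(H, T)
predictions = U.T @ H                          # class scores, shape (10, 1000)
```

Because this step is a plain linear solve, it is the part of training that scales and parallelizes most easily, which is the property the stacking architecture is designed to exploit.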

Citation Context

...higher layers the input data may be concatenated with one or more output representations from the previous layers. The lower-layer weight matrix W can be optimized using an accelerated gradient descent [7, 8] algorithm to minimize the squared error f = ‖U^T H − Y‖_F. Embedding the solution of Eq. 1 into the objective and deriving the gradient, we obtain ∇_W f = 2X [H^T ◦ (1 − H^T) ◦ [H^†(HT^T)(TH^†) − T^T(TH^†)]] ...
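For readers tracing the gradient expression quoted above, the sketch below spells out one way to evaluate it in NumPy, assuming sigmoid hidden units and using numpy.linalg.pinv for H†; the function name, shapes, and the use of T for the target matrix are illustrative, not taken from the paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lower_layer_gradient(W, X, T):
    """Evaluate the quoted gradient of the squared-error objective w.r.t.
    the lower-layer weights W (illustrative sketch, not the authors' code).

    X : input data,          shape (num_inputs, num_samples)
    W : lower-layer weights, shape (num_inputs, num_hidden)
    T : target matrix,       shape (num_classes, num_samples)
    """
    H = sigmoid(W.T @ X)                 # hidden representations, (h, N)
    H_pinv = np.linalg.pinv(H)           # H† = H^T (H H^T)^{-1}, shape (N, h)
    TH = T @ H_pinv                      # T H†, shape (num_classes, h)
    inner = H_pinv @ (H @ T.T) @ TH - T.T @ TH
    return 2.0 * X @ (H.T * (1.0 - H.T) * inner)   # same shape as W
```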

Deep stacking networks for information retrieval

by Li Deng, Xiaodong He, Jianfeng Gao
"... Deep stacking networks (DSN) are a special type of deep model equipped with parallel and scalable learning. We report successful applications of DSN to an information retrieval (IR) task pertaining to relevance prediction for sponsor search after careful regularization methods are incorporated to th ..."
Abstract - Cited by 4 (0 self) - Add to MetaCart
Deep stacking networks (DSN) are a special type of deep model equipped with parallel and scalable learning. We report successful applications of the DSN to an information retrieval (IR) task, relevance prediction for sponsored search, after careful regularization methods are incorporated into the previous DSN methods developed for speech and image classification tasks. The DSN-based system significantly outperforms the LambdaRank-based system, which represents a recent state of the art for IR, in normalized discounted cumulative gain (NDCG) measures, despite the use of mean squared error as the DSN's training objective. We demonstrate a desirable monotonic correlation between NDCG and classification rate over a wide range of IR quality. The weaker correlation and flatter relationship in the high-IR-quality region suggest the need for developing new learning objectives and optimization methods. Index Terms — deep stacking network, information retrieval, document ranking
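Since the results above are reported in normalized discounted cumulative gain, a short sketch of the standard NDCG@k definition (not the paper's own evaluation code) may help readers unfamiliar with the measure.

```python
import numpy as np

def dcg_at_k(relevances, k):
    """Discounted cumulative gain of a ranked list, truncated at rank k."""
    rel = np.asarray(relevances, dtype=float)[:k]
    discounts = np.log2(np.arange(2, rel.size + 2))
    return np.sum((2.0 ** rel - 1.0) / discounts)

def ndcg_at_k(relevances, k):
    """NDCG: DCG of the predicted ranking divided by the ideal DCG."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Relevance labels of documents in the order the ranker returned them.
print(ndcg_at_k([3, 2, 3, 0, 1, 2], k=6))   # ≈ 0.95 for this toy ranking
```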

Citation Context

...presented in the above formulation if x and h are augmented with ones.) 3.2 Module-bound fine tuning: The weight matrices of the DSN in each module can be further learned using batch-mode gradient descent [18]. The computation of the error gradient makes use of Eq. (3) and proceeds by ∇_W E = 2X [H^T ◦ (1 − H^T) ◦ [H^†(HT^T)(TH^†) − T^T(TH^†)]] (4), where (·)^† denotes the pseudo-inv...
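As a rough illustration of the module-bound fine tuning described above (batch-mode gradient steps on the lower-layer weights, followed by a closed-form re-estimate of the upper-layer weights), one could sketch the loop as follows; the step size, iteration count, and ridge term are invented for illustration and reuse the same gradient and closed-form expressions as the sketches earlier on this page.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fine_tune_module(W, X, T, step_size=1e-3, num_iters=100, reg=1e-4):
    """Illustrative batch-mode fine tuning of one DSN module:
    full-batch gradient steps on the lower-layer weights W, then a
    closed-form re-estimate of the upper-layer weights U."""
    for _ in range(num_iters):
        H = sigmoid(W.T @ X)               # hidden layer, shape (h, N)
        H_pinv = np.linalg.pinv(H)         # H† = H^T (H H^T)^{-1}
        TH = T @ H_pinv
        inner = H_pinv @ (H @ T.T) @ TH - T.T @ TH
        grad = 2.0 * X @ (H.T * (1.0 - H.T) * inner)
        W = W - step_size * grad           # full-batch gradient step
    H = sigmoid(W.T @ X)
    U = np.linalg.solve(H @ H.T + reg * np.eye(H.shape[0]), H @ T.T)
    return W, U
```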

An overview of deep-structured learning for information processing

by Li Deng - in Proc. Asia-Pacific Signal and Information Processing Association (APSIPA) Annual Summit and Conference, 2011
"... Abstract — In this paper, I will introduce to the APSIPA audience an emerging area of machine learning, deep-structured learning. It refers to a class of machine learning techniques, developed mostly since 2006, where many layers of information processing stages in hierarchical architectures are exp ..."
Abstract - Cited by 3 (0 self) - Add to MetaCart
In this paper, I will introduce to the APSIPA audience an emerging area of machine learning, deep-structured learning. It refers to a class of machine learning techniques, developed mostly since 2006, where many layers of information processing stages in hierarchical architectures are exploited for pattern classification and for unsupervised feature learning. First, the brief history of deep learning is discussed. Then, I develop a classificatory scheme to analyze and summarize major work reported in the deep learning literature. Using this scheme, I provide a taxonomy-oriented survey of the existing deep architectures and categorize them into three types: generative, discriminative, and hybrid. Two prime deep architectures, one hybrid and one discriminative, are presented in detail. Finally, selected applications of deep learning are reviewed in broad areas of information processing including audio/speech, image/video, multimodality, language modeling, natural language processing, and information retrieval.

Citation Context

...literature. For example, the deep-structured CRF, which stacks many layers of CRFs, has been successfully used in the tasks of language identification (Yu, Wang, Karam, and Deng, 2010), phone recognition (Yu and Deng, 2010), sequential labeling in natural language processing (Yu, Wang, and Deng, 2010), and confidence calibration in speech recognition (Yu, Li, and Deng, 2010). As another example, in (Saon and Chien, 20...

Parallel training of deep stacking networks

by Li Deng, Brian Hutchinson, Dong Yu - in Proc. Interspeech, 2012
"... The Deep Stacking Network (DSN) is a special type of deep architecture developed to enable and benefit from parallel learning of its model parameters on large CPU clusters. As a prospective key component of future speech recognizers, the architectural design of the DSN and its parallel training endo ..."
Abstract - Cited by 2 (0 self) - Add to MetaCart
The Deep Stacking Network (DSN) is a special type of deep architecture developed to enable and benefit from parallel learning of its model parameters on large CPU clusters. As a prospective key component of future speech recognizers, the architectural design of the DSN and its parallel training endow the DSN with scalability over a vast amount of training data. In this paper, we present our first parallel implementation of the DSN training algorithm. In particular, we show the tradeoff between the time/memory saving gained via training parallelism and the associated cost arising from inter-CPU communication. Further, in phone classification experiments, we demonstrate a significantly lower error rate using parallel full-batch training distributed over a CPU cluster, compared with sequential mini-batch training on a single CPU machine under otherwise identical experimental conditions, as exploited prior to the work reported in this paper. Index Terms: parallel and distributed computing, deep stacking networks, full-batch training, phone classification
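One ingredient of such parallel full-batch training is that the matrices needed for the closed-form upper-layer solution, H H^T and H T^T, are sums over training samples, so each CPU can accumulate its own shard's contribution and only the small accumulated matrices need to be exchanged. A hedged sketch of that idea follows; the plain Python loop stands in for the actual inter-CPU communication, and all names and sizes are illustrative.

```python
import numpy as np

def parallel_upper_layer_weights(H_shards, T_shards, reg=1e-4):
    """Data-parallel closed-form step for the DSN upper-layer weights
    (illustrative sketch).  Each iteration of the loop corresponds to
    work that a separate CPU could do on its own data shard; only the
    small Gram and cross matrices would need to be communicated."""
    num_hidden = H_shards[0].shape[0]
    num_classes = T_shards[0].shape[0]
    gram = reg * np.eye(num_hidden)
    cross = np.zeros((num_hidden, num_classes))
    for H_i, T_i in zip(H_shards, T_shards):
        gram += H_i @ H_i.T          # local (num_hidden x num_hidden) statistic
        cross += H_i @ T_i.T         # local (num_hidden x num_classes) statistic
    return np.linalg.solve(gram, cross)

# Toy usage: a full batch of 4000 frames split into 4 shards of 1000.
rng = np.random.default_rng(0)
H_shards = [rng.random((50, 1000)) for _ in range(4)]
T_shards = [np.eye(10)[:, rng.integers(0, 10, 1000)] for _ in range(4)]
U = parallel_upper_layer_weights(H_shards, T_shards)   # shape (50, 10)
```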

Citation Context

...the algorithms for learning its weight parameters, emphasizing its parallel learning capability, and presented experimental evidence for its effectiveness in speech and image classification tasks [5][6][17]. In this paper, we build upon and extend the earlier work by describing our new parallel implementation of the DSN learning algorithm. In particular, we show how the gradient is computed in CPU clust...

Fig. 1. Example T-DSN architecture with two complete blocks.

Fig. 2. Equivalent architecture to the bottom (blue) block of the T-DSN in Fig. 1, where the tensor is unfolded into a large matrix.

Tensor Deep Stacking Networks (accepted for publication in IEEE Transactions on Pattern Analysis and Machine Intelligence, Special Issue on Learning Deep Architectures)

by Brian Hutchinson, Li Deng, Dong Yu
"... Abstract—A novel deep architecture, the Tensor Deep Stacking Network (T-DSN), is presented. The T-DSN consists of multiple, stacked blocks, where each block contains a bilinear mapping from two hidden layers to the output layer, using a weight tensor to incorporate higher-order statistics of the hid ..."
Abstract - Add to MetaCart
A novel deep architecture, the Tensor Deep Stacking Network (T-DSN), is presented. The T-DSN consists of multiple stacked blocks, where each block contains a bilinear mapping from two hidden layers to the output layer, using a weight tensor to incorporate higher-order statistics of the hidden binary ([0, 1]) features. A learning algorithm for the T-DSN's weight matrices and tensors is developed and described, in which the main parameter estimation burden is shifted to a convex sub-problem with a closed-form solution. Using an efficient and scalable parallel implementation for CPU clusters, we train sets of T-DSNs on three popular tasks in increasing order of data size: handwritten digit recognition using MNIST (60k), isolated state/phone classification and continuous phone recognition using TIMIT (1.1m), and isolated phone classification using WSJ0 (5.2m). Experimental results in all three tasks demonstrate the effectiveness of the T-DSN and the associated learning methods in a consistent manner. In particular, a sufficient depth of the T-DSN, the symmetry of the two-hidden-layer structure in each T-DSN block, our model parameter learning algorithm, and a softmax layer on top of the T-DSN are all shown to have contributed to the low error rates observed in the experiments for all three tasks.
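To make the bilinear mapping concrete: each block produces two hidden representations, and the class scores come from contracting a weight tensor with their outer product, which is equivalent to multiplying an unfolded ("large matrix") version of the tensor with the Kronecker product of the two hidden vectors, as in the Fig. 2 caption above. The sketch below illustrates that equivalence; all names and sizes are invented for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tdsn_block(x, W1, W2, U):
    """Forward pass of one illustrative T-DSN block.

    x  : input vector,            shape (num_inputs,)
    W1 : weights of hidden set 1, shape (num_inputs, h1)
    W2 : weights of hidden set 2, shape (num_inputs, h2)
    U  : weight tensor,           shape (h1, h2, num_classes)
    """
    h1 = sigmoid(W1.T @ x)
    h2 = sigmoid(W2.T @ x)
    # Bilinear prediction: contract U with the outer product h1 h2^T.
    y_tensor = np.einsum('i,j,ijk->k', h1, h2, U)
    # Equivalent "large matrix" view: unfold U and use the Kronecker product.
    y_unfold = U.reshape(-1, U.shape[2]).T @ np.kron(h1, h2)
    assert np.allclose(y_tensor, y_unfold)       # the two views agree
    return y_tensor

# Toy usage with random parameters.
rng = np.random.default_rng(0)
scores = tdsn_block(rng.random(40), rng.random((40, 8)),
                    rng.random((40, 8)), rng.random((8, 8, 10)))
```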

Citation Context

... are concatenated with one or more output representations (typically y) from the previous blocks. The lower-layer weight matrix W in a DSN block can be optimized using an accelerated gradient descent [14] algorithm to minimize the squared error objective in Eqn. 1. Embedding the solution of Eqn. 2 into the objective and deriving the gradient, we obtain ∇_W f = X [H^T ◦ (1 − H^T) ◦ Θ] (3), where ...
