Efficient and effective algorithms for training single-hidden-layer neural networks (2012)

by D Yu, L Deng
Venue: Pattern Recognition Letters

Results 1 - 8 of 8

A deep architecture with bilinear modeling of hidden representations: Applications to phonetic recognition

by Brian Hutchinson, Li Deng, Dong Yu - Proc. ICASSP, 2012
"... We develop and describe a novel deep architecture, the Tensor Deep Stacking Network (T-DSN), where multiple blocks are stacked one on top of another and where a bilinear mapping from hidden representations to the output in each block is used to incorporate higherorder statistics of the input feature ..."
Abstract - Cited by 9 (8 self) - Add to MetaCart
We develop and describe a novel deep architecture, the Tensor Deep Stacking Network (T-DSN), where multiple blocks are stacked one on top of another and where a bilinear mapping from hidden representations to the output in each block is used to incorporate higher-order statistics of the input features. A learning algorithm for the T-DSN is presented, in which the main parameter estimation burden is shifted to a convex sub-problem with a closed-form solution. Using an efficient and scalable parallel implementation, we train a T-DSN to discriminate standard three-state monophones in the TIMIT database. The T-DSN outperforms an alternative pretrained Deep Neural Network (DNN) architecture in frame-level classification (both state and phone) and in the cross-entropy measure. For continuous phonetic recognition, the T-DSN performs equivalently to a DNN but without the need for a hard-to-scale, sequential fine-tuning step. Index Terms: deep learning, higher-order statistics, tensors, stacking model, phonetic classification and recognition
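
The bilinear upper layer and the convex sub-problem mentioned in this abstract can be made concrete with a small sketch. The Python/NumPy code below is a hedged reading of a single T-DSN block, not the authors' implementation: the function name, layer sizes, the regularizer mu, and the random lower-layer initialization are all assumptions, and in the paper the lower-layer weights are themselves optimized rather than left random.

```python
import numpy as np

def tdsn_block(X, T, h1=32, h2=32, mu=1e-3, seed=0):
    """Illustrative single T-DSN block: two sigmoid hidden representations
    are combined bilinearly via their per-sample outer product, and the
    upper-layer weights come from the convex, closed-form (ridge)
    least-squares sub-problem.  X: d x n inputs, T: c x n targets (assumed)."""
    rng = np.random.default_rng(seed)
    d, n = X.shape
    W1 = rng.standard_normal((d, h1)) * 0.1   # lower-layer weights (random here;
    W2 = rng.standard_normal((d, h2)) * 0.1   # tuned by gradient descent in the paper)
    H1 = 1.0 / (1.0 + np.exp(-W1.T @ X))      # h1 x n
    H2 = 1.0 / (1.0 + np.exp(-W2.T @ X))      # h2 x n
    # Bilinear ("unfolded tensor") features: outer product of the two hidden
    # vectors for each sample, flattened into an (h1*h2) x n matrix.
    B = np.einsum('in,jn->ijn', H1, H2).reshape(h1 * h2, n)
    # Closed-form upper-layer solution of the convex sub-problem.
    U = np.linalg.solve(B @ B.T + mu * np.eye(h1 * h2), B @ T.T)
    return U.T @ B                            # block output, classes x n
```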

Citation Context

...gher layers the input data may be concatenated with one or more output representations from the previous layers. The lower-layer weight matrix W can be optimized using an accelerated gradient descent [7, 8] algorithm to minimize the squared error f = ‖U^T H − Y‖_F. Embedding the solution of Eq. 1 into the objective and deriving the gradient, we obtain ∇_W f = 2X[H^T ∘ (1 − H^T) ∘ [H†(HT^T)(TH†) − T^T(TH†)]] ...
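
For concreteness, the gradient quoted in this context can be transcribed into a short NumPy sketch. This is only an illustrative reading of the displayed formula (shapes and names are assumptions; the quoted text writes Y for the targets, T is used throughout here), with the upper-layer matrix U eliminated through the pseudo-inverse as described.

```python
import numpy as np

def lower_layer_gradient(X, H, T):
    """Sketch of nabla_W f = 2 X [ H^T o (1 - H^T) o
    [ H†(H T^T)(T H†) - T^T(T H†) ] ], with U replaced by its
    least-squares solution.
    X: d x n inputs, H: h x n sigmoid outputs, T: c x n targets (assumed)."""
    H_dag = np.linalg.pinv(H)                           # H† = H^T (H H^T)^{-1}, n x h
    T_Hdag = T @ H_dag                                  # c x h
    inner = H_dag @ (H @ T.T) @ T_Hdag - T.T @ T_Hdag   # n x h
    return 2.0 * X @ (H.T * (1.0 - H.T) * inner)        # gradient w.r.t. W, d x h
```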

Accelerated parallelizable neural network learning algorithm for speech recognition

by Dong Yu, Li Deng - in Proc. Interspeech, 2011
"... We describe a set of novel, batch-mode algorithms we developed recently as one key component in scalable, deep neural network based speech recognition. The essence of these algorithms is to structure the singlehidden-layer neural network so that the upper-layer’s weights can be written as a determin ..."
Abstract - Cited by 6 (5 self) - Add to MetaCart
We describe a set of novel, batch-mode algorithms we developed recently as one key component in scalable, deep neural network based speech recognition. The essence of these algorithms is to structure the single-hidden-layer neural network so that the upper layer's weights can be written as a deterministic function of the lower layer's weights. This structure is effectively exploited during training by plugging the deterministic function into the least-square-error objective function while calculating the gradients. Accelerating techniques are further exploited to make the weight updates move along the most promising directions. The experiments on TIMIT frame-level phone and phone-state classification show strong results. In particular, the error rate is strictly monotonically dropping as the minibatch size increases. This demonstrates the potential of the proposed batch-mode algorithms for large-scale speech recognition since they are easily parallelizable across computers. Index Terms: neural network, scalability, structure, constraints, FISTA acceleration, optimization, pseudoinverse, weighted LSE, phone state classification, speech recognition, deep learning
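
The structural trick described in this abstract, writing the upper-layer weights as a deterministic function of the lower-layer weights, amounts to solving the linear output layer in closed form whenever the hidden layer changes. A minimal NumPy sketch follows; all names and shapes are assumptions for illustration, not the authors' code.

```python
import numpy as np

def upper_from_lower(W, X, T):
    """Given lower-layer weights W, the least-squares-optimal upper-layer
    weights are the deterministic function U(W) = (H H^T)^{-1} H T^T,
    where H = sigmoid(W^T X).
    X: d x n inputs, T: c x n targets (assumed shapes)."""
    H = 1.0 / (1.0 + np.exp(-W.T @ X))   # hidden outputs, h x n
    U = np.linalg.pinv(H).T @ T.T        # equals (H H^T)^{-1} H T^T for full row rank H
    return U, H

# Training then only updates W; U is recomputed from W at every step, which is
# what lets the gradient with respect to W absorb the (eliminated) upper layer.
```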

Citation Context

... accelerated algorithm performs the best, achieving recognition accuracy of 98.9% when the deep belief network (DBN) pretraining algorithm is used to initialize the lower-level neural network weights [13]. This is slightly better than the 98.8% accuracy of the DBN reported in [10] but with a small fraction of the training time. The DBN pretraining helped here since the objective function is non-convex...

Using Deep Stacking Network to Improve Structured Compressed Sensing with Multiple Measurement Vectors

by Hamid Palangi, Rabab Ward, Li Deng
"... ABSTRACT We study the MMV (Multiple Measurement Vectors) compressive sensing setting with a specific sparse structured support. The locations of the non-zero rows in the sparse matrix are not known. All that is known is that the locations of the non-zero rows have probabilities that vary from one g ..."
Abstract - Cited by 2 (2 self) - Add to MetaCart
We study the MMV (Multiple Measurement Vectors) compressive sensing setting with a specific sparse structured support. The locations of the non-zero rows in the sparse matrix are not known. All that is known is that the locations of the non-zero rows have probabilities that vary from one group of rows to another. We propose two novel greedy algorithms for the exact recovery of the sparse matrix in this structured MMV compressive sensing problem. The first algorithm models the matrix sparse structure using a shallow non-linear neural network. The input of this network is the residual matrix after the prediction and the output is the sparse matrix to be recovered. The second algorithm improves the shallow neural network prediction by using the stacking operation to form a deep stacking network. Experimental evaluation demonstrates the superior performance of both new algorithms over existing MMV methods. Among all, the algorithm using the deep stacking network for modelling the structure in MMV compressive sensing performs the best.

Citation Context

...tating the greedy algorithm. With linear activation functions for the hidden layer and output layer neurons, (7) becomes a simple linear relationship. A more efficient choice for fh(.) is a non-linear function like the sigmoid function 1/(1 + e^−t). However, with a non-linear fh(.), solving (8) becomes a non-trivial task. Existing algorithms in the neural networks literature for finding W1 and W2, such as backpropagation and conjugate gradient backpropagation, are very slow and inefficient when fh(.) is non-linear. Since we use one hidden layer only, there are more efficient ways, like the one in [13], to calculate the weights. In [13], the fact that W2 is dependent on W1 is taken into account when calculating the gradient. This leads to the following expression of the gradient of the energy function E = ‖St − S‖_F^2: ∂E/∂W1 = 2R[H^T ∘ (1 − H)^T ∘ [H†(H[St]^T)(St H†) − [St]^T(St H†)]] (9), where H† = H^T(HH^T)^−1, ∘ is the Hadamard product operator, and H is the output of the hidden layer: H = 1/(1 + exp(−W1 x_in)) (10), where x_in = vec(R). Then we should search in the opposite direction of the gradient, i.e., W1^(k+1) = W1^k − ρ ∂E/∂W1^k (11), where ρ is the learning rate. After updating W1 using (11), W2 is ca...

Learning Input and Recurrent Weight Matrices in Echo State Networks

by Hamid Palangi, Li Deng, Rabab K Ward
"... Abstract The traditional echo state network (ESN) is a special type of a temporally deep model, the recurrent network (RNN), which carefully designs the recurrent matrix and fixes both the recurrent and input matrices in the RNN. The ESN also adopts the linear output (or readout) units to simplify ..."
Abstract - Cited by 2 (2 self) - Add to MetaCart
The traditional echo state network (ESN) is a special type of a temporally deep model, the recurrent neural network (RNN), which carefully designs the recurrent matrix and fixes both the recurrent and input matrices in the RNN. The ESN also adopts linear output (or readout) units to simplify the learning of the only output matrix in the RNN. In this paper, we devise a special technique that takes advantage of the linearity of the output units in the ESN to learn the input and recurrent matrices, which was not done in earlier ESNs due to the well-known difficulty of their learning. Compared with the technique of BackProp Through Time (BPTT) for learning general RNNs, our proposed technique makes use of the linearity of the output units to provide constraints among the various matrices in the RNN, enabling the computation of the gradients as the learning signal in an analytical form instead of by recursion as in BPTT. Experimental results on phone state classification show that learning either or both the input and recurrent matrices in the ESN is superior to the traditional ESN without learning them, especially when longer time steps are used in analytically computing the gradients.
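
As background for the technique this abstract describes, here is a minimal sketch of the traditional ESN baseline it starts from: fixed input and recurrent matrices (rescaled for the echo state property) and a linear readout learned in closed form. The reservoir size, scaling, regularizer, sigmoid state update, and all names are assumptions of this sketch; the paper's contribution, learning W_in and W_rec analytically, is not shown here.

```python
import numpy as np

def esn_linear_readout(X_seq, T, n_res=200, spectral_radius=0.9, mu=1e-6, seed=0):
    """Traditional ESN sketch: W_in and W_rec are fixed, only the linear
    readout U is learned, here by ridge regression.
    X_seq: d x n_steps input sequence, T: c x n_steps targets (assumed)."""
    rng = np.random.default_rng(seed)
    d, n_steps = X_seq.shape
    W_in = rng.uniform(-0.1, 0.1, (n_res, d))
    W_rec = rng.standard_normal((n_res, n_res))
    W_rec *= spectral_radius / np.max(np.abs(np.linalg.eigvals(W_rec)))  # echo state property
    H = np.zeros((n_res, n_steps))
    h = np.zeros(n_res)
    for t in range(n_steps):                      # recurrent state update
        h = 1.0 / (1.0 + np.exp(-(W_in @ X_seq[:, t] + W_rec @ h)))
        H[:, t] = h
    # Linear readout solved in closed form: the linearity the paper exploits
    # to also derive analytical gradients for W_in and W_rec.
    U = np.linalg.solve(H @ H.T + mu * np.eye(n_res), H @ T.T)
    return U.T @ H                                # predictions, classes x n_steps
```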

Citation Context

...X1[H1^T ∘ (1 − H1^T) W_rec^T ∘ H2^T ∘ (1 − H2^T) ∘ S] + X2[H2^T ∘ (1 − H2^T) ∘ S] (16), where ∘ is element-wise multiplication. Note that this is one of the main differences with the work presented in [18, 19], where there is no temporal connection in the single-layer network and hence no time dependency is considered. 3.2 Case 2: The gradient calculated in Case 1 is not accurate because the dependency bet...

Recurrent deep stacking networks for sequence classification

by Hamid Palangi, Li Deng, Rabab K Ward - in IEEE China Summit and International Conference on Signal and Information Processing (ChinaSIP), 2014
"... ABSTRACT Deep Stacking Networks (DSNs) are constructed by stacking shallow feed-forward neural networks on top of each other using concatenated features derived from the lower modules of the DSN and the raw input data. DSNs do not have recurrent connections, making them less effective to model and ..."
Abstract - Cited by 1 (1 self) - Add to MetaCart
Deep Stacking Networks (DSNs) are constructed by stacking shallow feed-forward neural networks on top of each other, using concatenated features derived from the lower modules of the DSN and the raw input data. DSNs do not have recurrent connections, making them less effective at modeling and classifying input data with temporal dependencies. In this paper, we embed recurrent connections into the DSN, giving rise to Recurrent Deep Stacking Networks (R-DSNs). Each module of the R-DSN consists of a special form of recurrent neural network. Generalizing from the earlier DSN, the use of linearity in the output units of the R-DSN enables us to derive a closed form for computing the gradient of the cost function with respect to all network matrices without backpropagating errors. Each module in the R-DSN is initialized with an echo state network, where the input and recurrent weights are fixed to have the echo state property. Then all connection weights within the module are fine-tuned using batch-mode gradient descent, where the gradient takes an analytical form. Experiments are performed on the TIMIT dataset for frame-level phone state classification with 183 classes. The results show that the R-DSN gives higher classification accuracy than a single recurrent neural network without stacking.

Citation Context

... C_i] (15), where n is the number of time steps and C_i = [H_i^T ∘ (1 − H_i^T) W_rec^T] ∘ C_{i+1} for i = 1, ..., n − 1, with C_n = H_n^T ∘ (1 − H_n^T) ∘ A (16). To calculate A using (14), H_n and T_n are used. After calculating the gradient of the cost function w.r.t. W, the input weights W are updated using the following update equation: W_{i+1} = W_i − α ∂E/∂W_i + β(W_i − W_{i−1}) (17), with β = m_old/m_new and m_new = (1 + √(1 + 4 m_old^2))/2 (18), where α is the step size and the initial value for m_old and m_new is 1. The third term in (17) helps the algorithm converge faster and is based on the FISTA algorithm proposed in [18] and used in [19] and [17]. 3.2. Learning Recurrent Weights: To learn the recurrent weights, the gradient of the cost function w.r.t. W_rec should be calculated: ∂E/∂W_rec = Σ_{i=1}^{n} W_rec^{n−i} H_{i−1} C_i (19), where H_0 includes the initial hidden states, C_n = H_n^T ∘ (1 − H_n^T) ∘ A, and C_i = H_i^T ∘ (1 − H_i^T) ∘ C_{i+1} (20), and A is calculated using (14) based on H_n and T_n. Only the non-zero entries of the sparse matrix W_rec are updated using (17) and the gradient calculated in (19). To make sure that the network has the echo state property after each epoch, the entries of W_rec are renormalized such that the maximum eigenvalue of W_r...
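
The accelerated update in (17) and (18) of this context is simple enough to state in a few lines of NumPy. The sketch below follows the equations as quoted (function and variable names are assumptions); in the paper it is applied to the input weights and to the non-zero entries of W_rec.

```python
import numpy as np

def accelerated_step(W, W_prev, grad, alpha, m_old):
    """One update following (17)-(18) above:
    W_{i+1} = W_i - alpha * dE/dW_i + beta * (W_i - W_{i-1}),
    beta = m_old / m_new,  m_new = (1 + sqrt(1 + 4 m_old^2)) / 2.
    m_old starts at 1; the momentum term gives the FISTA-style speed-up."""
    m_new = (1.0 + np.sqrt(1.0 + 4.0 * m_old ** 2)) / 2.0
    beta = m_old / m_new
    return W - alpha * grad + beta * (W - W_prev), m_new
```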

Convolutional deep stacking networks for distributed compressive sensing

by Hamid Palangi, Rabab Ward, Li Deng - Signal Processing, 2016
"... a b s t r a c t This paper addresses the reconstruction of sparse vectors in the Multiple Measurement Vectors (MMV) problem in compressive sensing, where the sparse vectors are correlated. This problem has so far been studied using model based and Bayesian methods. In this paper, we propose a deep ..."
Abstract - Cited by 1 (1 self) - Add to MetaCart
This paper addresses the reconstruction of sparse vectors in the Multiple Measurement Vectors (MMV) problem in compressive sensing, where the sparse vectors are correlated. This problem has so far been studied using model based and Bayesian methods. In this paper, we propose a deep learning approach that relies on a Convolutional Deep Stacking Network (CDSN) to capture the dependency among the different channels. To reconstruct the sparse vectors, we propose a greedy method that exploits the information captured by CDSN. The proposed method encodes the sparse vectors using random measurements (as done usually in compressive sensing). Experiments using a real world image dataset show that the proposed method outperforms the traditional MMV solver, i.e., Simultaneous Orthogonal Matching Pursuit (SOMP), as well as three of the Bayesian methods proposed for solving the MMV compressive sensing problem. We also show that the proposed method is almost as fast as greedy methods. The good performance of the proposed method depends on the availability of training data (as is the case in all deep learning methods). The training data, e.g., different images of the same class or signals with similar sparsity patterns, are usually available for many applications.

Citation Context

... [Fig. 2: the curve of the FISTA coefficients in (16) with respect to the epoch number.] W_2^(l) = argmin_{W_2^(l)} (1/2)‖[W_2^(l)]^T H^(l) − T‖_2^2 + μ‖W_2^(l)‖_2^2 (13), which results in W_2^(l) = [μI + H^(l)[H^(l)]^T]^{-1} H^(l) T^T (14), where I is the identity matrix. To find W_1^(l), for each layer of CDSN we use the stochastic gradient descent method. To calculate the gradient of the cost function with respect to W_1^(l), given the fact that W_2^(l) and H^(l) depend on W_1^(l), it can be shown [34] that the gradient of the cost function in (11) with respect to W_1^(l) is: ∂E/∂W_1^(l) = 2 Z^(l)[[H^(l)]^T ∘ (1 − [H^(l)]^T) ∘ [[H^(l)]†(H^(l) T^T)(T[H^(l)]†) − T^T(T[H^(l)]†)]] (15), where Z^(l) is a matrix whose columns are z^(l) in (10) corresponding to different training samples in the training set and ∘ is the Hadamard product operator. Using the gradient information from past iterations can help to improve the convergence speed in convex optimization problems ...
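
The per-layer closed-form solve in (13) and (14) quoted above is ordinary ridge regression. A minimal NumPy sketch follows; the function name, the value of mu, and the assumed shapes are illustrative assumptions.

```python
import numpy as np

def layer_output_weights(H, T, mu=1e-2):
    """Closed-form solution of (13)-(14) above:
    W2 = (mu * I + H H^T)^{-1} H T^T,
    with H the hidden-unit outputs (units x samples) and T the targets
    (targets x samples); shapes assumed for illustration."""
    k = H.shape[0]
    return np.linalg.solve(mu * np.eye(k) + H @ H.T, H @ T.T)
```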

Fig. 1. Example T-DSN architecture with two complete blocks.

Abstract
Fig. 2. Equivalent architecture to the bottom (blue) block of T-DSN in Fig. 1 where the tensor is unfolded into a large matrix.

Fast, Simple and Accurate Handwritten Digit Classification by Training Shallow Neural Network Classifiers with the ‘Extreme Learning Machine’ Algorithm

by Mark D. McDonnell, Migel D. Tissera, Tony Vladusich, André van Schaik, Jonathan Tapson
"... Recent advances in training deep (multi-layer) architectures have inspired a renaissance in neural network use. For example, deep convolutional networks are becoming the default option for difficult tasks on large datasets, such as image and speech recognition. However, here we show that error rates ..."
Abstract - Add to MetaCart
Recent advances in training deep (multi-layer) architectures have inspired a renaissance in neural network use. For example, deep convolutional networks are becoming the default option for difficult tasks on large datasets, such as image and speech recognition. However, here we show that error rates below 1% on the MNIST handwritten digit benchmark can be replicated with shallow non-convolutional neural networks. This is achieved by training such networks using the ‘Extreme Learning Machine’ (ELM) approach, which also enables a very rapid training time (~10 minutes). Adding distortions, as is common practice for MNIST, reduces error rates even further. Our methods are also shown to be capable of achieving less than 5.5% error rates on the NORB image database. To achieve these results, we introduce several enhancements to the standard ELM algorithm, which individually and in combination can significantly improve performance. The main innovation is to ensure each hidden unit operates only on a randomly sized and positioned patch of each image. This form of random ‘receptive field’ sampling of the input ensures the input weight matrix is sparse, with about 90% of weights equal to zero. Furthermore, combining our methods with ...
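
The random ‘receptive field’ idea in this abstract can be sketched in a few lines of NumPy: each hidden unit's random input weights are restricted to a random image patch, making the input weight matrix sparse, and only the output weights are trained by regularized least squares. The patch-size limits, the regularizer, the tanh nonlinearity, and all names below are assumptions of this sketch, not the authors' exact recipe.

```python
import numpy as np

def rf_elm_train(X, T, n_hidden=1000, img_shape=(28, 28), min_patch=6, mu=1e-3, seed=0):
    """Receptive-field ELM sketch: random, patch-masked input weights (mostly
    zero), a fixed nonlinear hidden layer, and output weights solved by
    ridge-regularized least squares.
    X: samples x pixels (rows are flattened images), T: samples x classes."""
    rng = np.random.default_rng(seed)
    h, w = img_shape
    W_in = np.zeros((X.shape[1], n_hidden))
    for j in range(n_hidden):
        ph = rng.integers(min_patch, h + 1)              # random patch height
        pw = rng.integers(min_patch, w + 1)              # random patch width
        r0 = rng.integers(0, h - ph + 1)                 # random patch position
        c0 = rng.integers(0, w - pw + 1)
        mask = np.zeros((h, w), dtype=bool)
        mask[r0:r0 + ph, c0:c0 + pw] = True
        W_in[mask.ravel(), j] = rng.standard_normal(ph * pw)  # weights only inside the patch
    H = np.tanh(X @ W_in)                                # fixed random hidden layer
    W_out = np.linalg.solve(H.T @ H + mu * np.eye(n_hidden), H.T @ T)
    return W_in, W_out                                   # predict: np.tanh(X @ W_in) @ W_out
```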

Citation Context

... 784–1000–10: 6.05% [15]; C-ELM*: 5% [21]; CIW-ELM, 784–1000–10: 3.55% [15]; ELM, 784–7840–10: 2.75% [27]; ELM, 784–unknown–10: 2.61% [20]; CIW-ELM, 784–7000–10: 1.52% [15]; ELM+backpropagation, 784–2048–10: 1.45% [23]; Deep ELM, 784–700–15000–10: 0.97% [20]; ELM & backprop RF-(CIW & C)-ELM, 784–(2 × 6400)–20–500–10: 0.91% (1.04%), this report; SLFN ELM RF-ELM, 784–15000–10: 1.36% (1.48%), this report; CIW-ELM, 784–15000–10: 1...
