Results 1 - 5 of 5
Acoustic Modeling with Deep Neural Networks Using Raw Time Signal for LVCSR
Abstract - Cited by 7 (0 self)
In this paper we investigate how much feature extraction is required by a deep neural network (DNN) based acoustic model for automatic speech recognition (ASR). We decompose the feature extraction pipeline of a state-of-the-art ASR system step by step and evaluate acoustic models trained on standard MFCC features, critical band energies (CRBE), the FFT magnitude spectrum, and even the raw time signal. The focus is on the raw time signal as input, i.e. zero feature extraction prior to DNN training. Notably, the gap in recognition accuracy between MFCC and the raw time signal shrinks markedly once we switch from the sigmoid activation function to rectified linear units, offering a real alternative to standard signal processing. Analysis of the first-layer weights reveals that the DNN can discover multiple band-pass filters in the time domain. We therefore try to improve the raw-time-signal system by initializing the first hidden layer weights with impulse responses of an audiologically motivated filter bank. Inspired by the multi-resolution analysis layer learned automatically from the raw time signal, we also train the DNN on a combination of multiple short-term features. This illustrates how the DNN can learn from the small differences between MFCC, PLP and Gammatone features, suggesting that it is useful to present the DNN with different views of the underlying audio.
Index Terms: acoustic modeling, raw signal, neural networks
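The filter-bank initialization described in this abstract can be sketched in a few lines. Everything below (sampling rate, filter order, ERB-style bandwidth formula, number of filters, linear centre-frequency spacing) is an illustrative assumption, not the authors' exact configuration:

```python
import numpy as np

def gammatone(f_c, fs=16000, n=4, duration=0.025):
    """4th-order gammatone impulse response centred at f_c Hz (illustrative)."""
    t = np.arange(int(duration * fs)) / fs
    b = 1.019 * (24.7 + 0.108 * f_c)  # ERB-based bandwidth, a common choice
    h = t ** (n - 1) * np.exp(-2 * np.pi * b * t) * np.cos(2 * np.pi * f_c * t)
    return h / np.max(np.abs(h))

# One weight row per filter: 40 centre frequencies, linearly spaced here
# for brevity (a log/ERB spacing would be more faithful to the audiology).
centres = np.linspace(100, 7000, 40)
W0 = np.stack([gammatone(f) for f in centres])  # first-layer init, (40, 400)

# The magnitude spectrum of each row is a band-pass response, matching the
# structure the abstract says the DNN discovers on its own from raw input.
spectra = np.abs(np.fft.rfft(W0, axis=1))
peak_bins = spectra.argmax(axis=1)
assert np.all(np.diff(peak_bins) >= 0)  # peaks rise with centre frequency
```

In the paper's setting, `W0` would seed the first hidden layer before training continues on the raw waveform; here it only demonstrates that the initialization is a bank of band-pass filters.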
Optimizing neural networks with kronecker-factored approximate curvature
, 2015
Abstract - Cited by 4 (2 self)
We propose an efficient method for approximating natural gradient descent in neural networks which we call Kronecker-factored Approximate Curvature (K-FAC). K-FAC is based on an efficiently invertible approximation of a neural network's Fisher information matrix which is neither diagonal nor low-rank, and in some cases is completely non-sparse. It is derived by approximating various large blocks of the Fisher (corresponding to entire layers) as the Kronecker product of two much smaller matrices. While only several times more expensive to compute than the plain stochastic gradient, the updates produced by K-FAC make much more progress on the objective, resulting in an algorithm that can be much faster than stochastic gradient descent with momentum in practice. And unlike some previously proposed approximate natural-gradient/Newton methods which use high-quality non-diagonal curvature matrices (such as Hessian-free optimization), K-FAC works very well in highly stochastic optimization regimes. This is because the cost of storing and inverting K-FAC's approximation to the curvature matrix does not depend on the amount of data used to estimate it, a feature typically associated only with diagonal or low-rank approximations to the curvature matrix.
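The core Kronecker trick can be illustrated numerically. For one fully connected layer, the Fisher block is approximated as A ⊗ G, where A is the second moment of the layer's inputs and G that of the back-propagated gradients; the identity (A ⊗ G)⁻¹ vec(V) = vec(G⁻¹ V A⁻¹) then replaces one huge inverse with two small ones. The shapes, damping constant, and variable names below are illustrative, not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out = 5, 3

a = rng.normal(size=(100, n_in))   # layer inputs over a batch
g = rng.normal(size=(100, n_out))  # back-propagated output gradients

A = a.T @ a / len(a) + 1e-3 * np.eye(n_in)   # input second moment (damped)
G = g.T @ g / len(g) + 1e-3 * np.eye(n_out)  # gradient second moment (damped)

V = rng.normal(size=(n_out, n_in))  # raw gradient of the weight matrix

# K-FAC preconditioned update: G^{-1} V A^{-1}, with cost cubic in the
# *layer* dimensions rather than in n_in * n_out for the full block.
update_kfac = np.linalg.solve(G, V) @ np.linalg.inv(A)

# Check against the explicit Kronecker inverse applied to vec(V)
# (column-major vec, matching (A ⊗ G) vec(X) = vec(G X A^T)).
F = np.kron(A, G)
update_full = np.linalg.solve(F, V.flatten(order="F")).reshape(
    (n_out, n_in), order="F")
assert np.allclose(update_kfac, update_full)
```

This is exactly why, as the abstract notes, the cost of storing and inverting the approximation does not grow with the amount of data: only the two small factors are ever inverted.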
RASR/NN: THE RWTH NEURAL NETWORK TOOLKIT FOR SPEECH RECOGNITION
Abstract - Cited by 1 (0 self)
This paper describes the new release of RASR, the open source version of the well-proven speech recognition toolkit developed and used at RWTH Aachen University. The focus is on the implementation of the NN module for training neural network acoustic models. We describe the code design, configuration, and features of the NN module. Its key feature is high flexibility regarding the network topology, choice of activation functions, training criteria, and optimization algorithm, as well as built-in support for efficient GPU computing. Run-time performance and recognition accuracy are evaluated by way of example with a deep neural network acoustic model in a hybrid NN/HMM system. The results show that RASR achieves state-of-the-art performance on a real-world large vocabulary task, while offering a complete pipeline for building and applying large scale speech recognition systems.
Index Terms: speech recognition, acoustic modeling, neural networks, GPU, open source, RASR
UNSUPERVISED ADAPTATION OF A DENOISING AUTOENCODER BY BAYESIAN FEATURE ENHANCEMENT FOR REVERBERANT ASR UNDER MISMATCH CONDITIONS
Abstract
Parametric Bayesian Feature Enhancement (BFE) and a data-driven Denoising Autoencoder (DA) both bring performance gains in severe single-channel speech recognition conditions. The former can be adjusted to different conditions by an appropriate parameter setting, while the latter needs to be trained on conditions similar to those expected at decoding time, making it vulnerable to a mismatch between training and test conditions. We use a DNN backend and study reverberant ASR under three types of mismatch: different room reverberation times, different speaker-to-microphone distances, and the difference between artificially reverberated data and recordings in a reverberant environment. We show that under these mismatch conditions BFE can provide the targets for a DA. This unsupervised adaptation yields a performance gain over the direct use of BFE and even compensates for the mismatch between real and simulated reverberant data.
Index Terms: robust speech recognition, deep neural networks, feature enhancement, denoising autoencoder
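The adaptation scheme in this abstract amounts to supervising a denoising autoencoder with BFE output on unlabelled adaptation data: reverberant features in, BFE-enhanced features of the same utterances as targets, no transcriptions needed. A minimal numpy sketch, with purely illustrative shapes, hyper-parameters, and random stand-ins for the features:

```python
import numpy as np

rng = np.random.default_rng(0)
d, h, n = 20, 32, 500

clean = rng.normal(size=(n, d))                 # stand-in for BFE-enhanced targets
noisy = clean + 0.5 * rng.normal(size=(n, d))   # stand-in for reverberant input

# One-hidden-layer denoising autoencoder.
W1 = rng.normal(scale=0.1, size=(d, h)); b1 = np.zeros(h)
W2 = rng.normal(scale=0.1, size=(h, d)); b2 = np.zeros(d)

def forward(x):
    z = np.tanh(x @ W1 + b1)
    return z, z @ W2 + b2

lr = 0.01
mse0 = np.mean((forward(noisy)[1] - clean) ** 2)
for _ in range(200):                  # plain batch gradient descent on MSE
    z, y = forward(noisy)
    e = (y - clean) / n               # dL/dy for the mean squared error
    dW2 = z.T @ e; db2 = e.sum(0)
    dz = e @ W2.T * (1 - z ** 2)      # back-prop through tanh
    dW1 = noisy.T @ dz; db1 = dz.sum(0)
    W2 -= lr * dW2; b2 -= lr * db2
    W1 -= lr * dW1; b1 -= lr * db1
mse1 = np.mean((forward(noisy)[1] - clean) ** 2)
assert mse1 < mse0                    # the DA moves toward the BFE targets
```

In the paper's pipeline the trained DA would then enhance test features before the DNN backend; here the loop only shows the unsupervised target construction.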
Deep Neural Networks with Multistate Activation Functions
, 2015
Abstract
We propose multistate activation functions (MSAFs) for deep neural networks (DNNs). These MSAFs are new kinds of activation functions capable of representing more than two states, including the N-order MSAFs and the symmetrical MSAF. DNNs with these MSAFs can be trained via conventional Stochastic Gradient Descent (SGD) as well as mean-normalised SGD. We also discuss how these MSAFs perform on classification problems. Experimental results on the TIMIT corpus reveal that, on speech recognition tasks, DNNs with MSAFs outperform conventional DNNs, achieving a relative improvement of 5.60% in phoneme error rate. Further experiments reveal that mean-normalised SGD facilitates the training of DNNs with MSAFs, especially with large training sets. The models can also be trained directly, without pretraining, when the training set is sufficiently large, which yields a considerable relative improvement of 5.82% in word error rate.
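The abstract does not spell out the functional form of an MSAF. One plausible reading of "an activation representing more than two states" is a sum of shifted logistic sigmoids, which plateaus at the integer levels 0..N; the definition below is purely illustrative, not necessarily the authors' exact formulation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def msaf(x, N=2, spacing=4.0):
    """Illustrative N-order multistate activation: N stacked sigmoid steps."""
    return sum(sigmoid(x - k * spacing) for k in range(N))

x = np.linspace(-10, 15, 1000)
y = msaf(x, N=2)

assert np.all(np.diff(y) > 0)                          # monotone, like a sigmoid
assert abs(msaf(np.array([-50.0]))[0] - 0.0) < 1e-6    # lowest state ≈ 0
assert abs(msaf(np.array([50.0]))[0] - 2.0) < 1e-6     # highest state ≈ N
assert abs(msaf(np.array([2.0]))[0] - 1.0) < 0.2       # middle plateau ≈ 1
```

Because each unit can settle on any of N+1 levels rather than just two, a layer of such units can encode more states per neuron, which is the property the abstract credits for the error-rate gains.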