Results 1 - 10 of 242
Deep Neural Networks for Acoustic Modeling in Speech Recognition
"... Most current speech recognition systems use hidden Markov models (HMMs) to deal with the temporal variability of speech and Gaussian mixture models to determine how well each state of each HMM fits a frame or a short window of frames of coefficients that represents the acoustic input. An alternative ..."
Cited by 272 (47 self)
Most current speech recognition systems use hidden Markov models (HMMs) to deal with the temporal variability of speech and Gaussian mixture models to determine how well each state of each HMM fits a frame or a short window of frames of coefficients that represents the acoustic input. An alternative way to evaluate the fit is to use a feedforward neural network that takes several frames of coefficients as input and produces posterior probabilities over HMM states as output. Deep neural networks with many hidden layers that are trained using new methods have been shown to outperform Gaussian mixture models on a variety of speech recognition benchmarks, sometimes by a large margin. This paper provides an overview of this progress and represents the shared views of four research groups who have had recent successes in using deep neural networks for acoustic modeling in speech recognition.
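To make the hybrid setup concrete, here is a minimal Python sketch of the idea the abstract describes: a feedforward network maps a window of stacked acoustic frames to posterior probabilities over HMM states. All sizes (40 coefficients per frame, 5 frames of context per side, 1000 states, two 512-unit ReLU layers) are illustrative assumptions, not values taken from the paper, and the random weights stand in for a trained network.

import numpy as np

rng = np.random.default_rng(0)

NUM_COEFFS = 40      # e.g. log mel-filterbank coefficients per frame (assumed)
CONTEXT = 5          # frames of context on each side (assumed)
NUM_STATES = 1000    # number of HMM states the net predicts (assumed)

def stack_frames(feats, context=CONTEXT):
    """Concatenate each frame with its neighbours to form the network input."""
    padded = np.pad(feats, ((context, context), (0, 0)), mode="edge")
    return np.hstack([padded[i:i + len(feats)] for i in range(2 * context + 1)])

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# Randomly initialised weights stand in for a trained network.
dims = [NUM_COEFFS * (2 * CONTEXT + 1), 512, 512, NUM_STATES]
weights = [rng.normal(0, 0.01, (m, n)) for m, n in zip(dims[:-1], dims[1:])]
biases = [np.zeros(n) for n in dims[1:]]

def state_posteriors(feats):
    """Forward pass: stacked frames in, per-frame HMM-state posteriors out."""
    h = stack_frames(feats)
    for W, b in zip(weights[:-1], biases[:-1]):
        h = np.maximum(h @ W + b, 0.0)          # ReLU hidden layers (assumed)
    return softmax(h @ weights[-1] + biases[-1])

utterance = rng.normal(size=(300, NUM_COEFFS))   # 300 frames of fake features
post = state_posteriors(utterance)               # shape (300, NUM_STATES)
assert np.allclose(post.sum(axis=1), 1.0)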
Context-Dependent Pre-trained Deep Neural Networks for Large Vocabulary Speech Recognition
IEEE Transactions on Audio, Speech, and Language Processing, 2012
"... We propose a novel context-dependent (CD) model for large vocabulary speech recognition (LVSR) that leverages recent advances in using deep belief networks for phone recognition. We describe a pretrained deep neural network hidden Markov model (DNN-HMM) hybrid architecture that trains the DNN to pr ..."
Cited by 254 (50 self)
We propose a novel context-dependent (CD) model for large vocabulary speech recognition (LVSR) that leverages recent advances in using deep belief networks for phone recognition. We describe a pretrained deep neural network hidden Markov model (DNN-HMM) hybrid architecture that trains the DNN to produce a distribution over senones (tied triphone states) as its output. The deep belief network pre-training algorithm is a robust and often helpful way to generatively initialize deep neural networks, which can aid optimization and reduce generalization error. We illustrate the key components of our model, describe the procedure for applying CD-DNN-HMMs to LVSR, and analyze the effects of various modeling choices on performance. Experiments on a challenging business search dataset demonstrate that CD-DNN-HMMs can significantly outperform conventional context-dependent Gaussian mixture model (GMM)-HMMs, with absolute sentence accuracy improvements of 5.8% and 9.2% (relative error reductions of 16.0% and 23.2%) over CD-GMM-HMMs trained using the minimum phone error (MPE) and maximum likelihood (ML) criteria, respectively.
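Such hybrid systems need a bridge between the DNN's senone posteriors and the HMM decoder's likelihoods. The usual bridge, which this sketch assumes rather than takes from the paper, is Bayes' rule: p(x|s) is proportional to p(s|x) / p(s), so the decoder uses "scaled likelihoods" obtained by dividing out the senone priors. The prior counts and floor value below are illustrative.

import numpy as np

def scaled_log_likelihoods(log_posteriors, log_priors, floor=-20.0):
    """log p(x|s) up to a constant: subtract the senone prior in log domain."""
    return np.maximum(log_posteriors, floor) - log_priors

# Toy example: 3 senones, priors estimated from counts in a forced
# alignment (hypothetical numbers).
counts = np.array([500.0, 300.0, 200.0])
log_priors = np.log(counts / counts.sum())

log_post = np.log(np.array([[0.7, 0.2, 0.1],
                            [0.1, 0.8, 0.1]]))   # per-frame DNN outputs
print(scaled_log_likelihoods(log_post, log_priors))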
Locating Singing Voice Segments within Music Signals
2001
"... A sung vocal line is the prominent feature of much popular music. It would be useful to reliably locate the portions of a musical track during which the vocals are present, both as a ‘signature’ of the piece and as a precursor to automatic recognition of lyrics. Here, we approach this problem by usi ..."
Cited by 77 (6 self)
A sung vocal line is the prominent feature of much popular music. It would be useful to reliably locate the portions of a musical track during which the vocals are present, both as a ‘signature’ of the piece and as a precursor to automatic recognition of lyrics. Here, we approach this problem by using the acoustic classifier of a speech recognizer as a detector for speech-like sounds. Although singing (including a musical background) is a relatively poor match to an acoustic model trained on normal speech, we propose various statistics of the classifier’s output in order to discriminate singing from instrumental accompaniment. A simple HMM allows us to find a best labeling sequence for this uncertain data. On a test set of forty 15-second excerpts of randomly selected music, our classifier achieved around 80% classification accuracy at the frame level. The utility of different features, and our plans for eventual lyrics recognition, are discussed.
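The following is a minimal sketch, my reconstruction rather than the paper's exact recipe, of that pipeline: summarize the recognizer's per-frame phone posteriors with a statistic (entropy is one plausible choice), then smooth the frame-level vocal/instrumental decision with a 2-state HMM via Viterbi. The class-conditional means, transition probabilities, and fake posteriors are all made up for illustration.

import numpy as np

def frame_entropy(post):
    """Entropy of each frame's phone posterior distribution."""
    p = np.clip(post, 1e-10, 1.0)
    return -(p * np.log(p)).sum(axis=1)

def viterbi_2state(obs_loglik, log_trans, log_init):
    """Best state sequence for a 2-state HMM (0 = instrumental, 1 = vocal)."""
    T = len(obs_loglik)
    delta = np.zeros((T, 2))
    back = np.zeros((T, 2), dtype=int)
    delta[0] = log_init + obs_loglik[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_trans   # scores[i, j]: i -> j
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + obs_loglik[t]
    path = np.zeros(T, dtype=int)
    path[-1] = delta[-1].argmax()
    for t in range(T - 2, -1, -1):
        path[t] = back[t + 1, path[t + 1]]
    return path

rng = np.random.default_rng(1)
post = rng.dirichlet(np.ones(40), size=200)       # fake phone posteriors
H = frame_entropy(post)
# Toy per-frame log-likelihoods of the entropy under each class, as if from
# two Gaussians fit on labelled excerpts (the means 3.2 and 3.6 are made up).
obs_loglik = np.stack([-(H - 3.2) ** 2, -(H - 3.6) ** 2], axis=1)
sticky = np.log(np.array([[0.99, 0.01], [0.01, 0.99]]))  # favour long runs
labels = viterbi_2state(obs_loglik, sticky, np.log([0.5, 0.5]))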
Improving Word Accuracy with Gabor Feature Extraction
In Proc. ICSLP, 2002
"... A novel type of feature extraction for automatic speech recognition is investigated. Two-dimensional Gabor functions, with varying extents and tuned to different rates and directions of spectro-temporal modulation, are applied as filters to a spectro-temporal representation provided by mel spectra. ..."
Cited by 54 (3 self)
A novel type of feature extraction for automatic speech recognition is investigated. Two-dimensional Gabor functions, with varying extents and tuned to different rates and directions of spectro-temporal modulation, are applied as filters to a spectro-temporal representation provided by mel spectra. The use of these functions is motivated by findings in neurophysiology and psychoacoustics. Data-driven parameter selection was used to obtain Gabor feature sets, the performance of which is evaluated on the Aurora 2 and 3 datasets both on their own and in combination with the Qualcomm-OGI-ICSI Aurora proposal. The Gabor features consistently provide performance improvements.
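Below is a minimal sketch of the core operation the abstract describes: build a two-dimensional Gabor function (a plane wave tuned to a temporal and a spectral modulation rate, under a Gaussian envelope) and filter a mel spectrogram with it. The spectrogram dimensions and all filter parameters are illustrative assumptions; the paper selects such parameters in a data-driven way.

import numpy as np
from scipy.signal import fftconvolve

def gabor_2d(size_t, size_f, omega_t, omega_f, sigma_t, sigma_f):
    """Complex 2D Gabor filter: complex exponential times Gaussian envelope."""
    t = np.arange(size_t) - size_t // 2
    f = np.arange(size_f) - size_f // 2
    T, F = np.meshgrid(t, f, indexing="ij")
    envelope = np.exp(-0.5 * ((T / sigma_t) ** 2 + (F / sigma_f) ** 2))
    carrier = np.exp(1j * (omega_t * T + omega_f * F))
    return envelope * carrier

# Fake mel spectrogram: 300 frames x 23 mel bands (band count assumed).
mel_spec = np.random.default_rng(0).normal(size=(300, 23))
g = gabor_2d(size_t=25, size_f=9, omega_t=0.4, omega_f=0.7,
             sigma_t=6.0, sigma_f=2.0)
# One Gabor feature stream: magnitude of the complex filter response.
feature = np.abs(fftconvolve(mel_spec, g, mode="same"))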
Tandem Acoustic Modeling In Large-Vocabulary Recognition
In Proc. ICASSP, 2001
"... In the tandem approach to modeling the acoustic signal, a neural-net preprocessor is first discriminatively trained to estimate posterior probabilities across a phone set. These are then used as feature inputs for a conventional hidden Markov model (HMM) based speech recognizer, which relearns the a ..."
Cited by 44 (2 self)
In the tandem approach to modeling the acoustic signal, a neural-net preprocessor is first discriminatively trained to estimate posterior probabilities across a phone set. These are then used as feature inputs for a conventional hidden Markov model (HMM) based speech recognizer, which relearns the associations to subword units. In this paper, we apply the tandem approach to the data provided for the first Speech in Noisy Environments (SPINE1) evaluation conducted by the Naval Research Laboratory (NRL) in August 2000. In our previous experience with the ETSI Aurora noisy digits (a small-vocabulary, high-noise task) the tandem approach achieved error-rate reductions of over 50% relative to the HMM baseline. For SPINE1, a larger task involving more spontaneous speech, we find that, when context-independent models are used, the tandem features continue to result in large reductions in word-error rates relative to those achieved by systems using standard MFC or PLP features. However, these ...
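Published tandem systems typically insert a log (to Gaussianize the peaky posteriors) and a decorrelating KLT/PCA transform between the neural net and the GMM-HMM front end; the sketch below assumes that recipe. The phone-set size and number of retained dimensions are illustrative.

import numpy as np

def tandem_features(posteriors, n_keep=24):
    """Log + PCA-decorrelate phone posteriors for use as HMM input features."""
    logp = np.log(np.clip(posteriors, 1e-8, 1.0))
    centered = logp - logp.mean(axis=0)
    cov = np.cov(centered, rowvar=False)          # eigenbasis of the
    eigvals, eigvecs = np.linalg.eigh(cov)        # covariance = KLT
    order = np.argsort(eigvals)[::-1][:n_keep]
    return centered @ eigvecs[:, order]

rng = np.random.default_rng(0)
post = rng.dirichlet(np.ones(40), size=500)   # fake posteriors over 40 phones
feats = tandem_features(post)                 # shape (500, 24)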
The IBM 2004 conversational telephony system for rich transcription
In Proc. ICASSP, 2005
"... This paper describes the technical advances in IBM’s conversational telephony submission to the DARPA-sponsored 2004 Rich Transcription evaluation (RT-04). These advances include a system architecture based on cross-adaptation; a new form of feature-based MPE training; the use of a full-scale discri ..."
Cited by 37 (3 self)
This paper describes the technical advances in IBM’s conversational telephony submission to the DARPA-sponsored 2004 Rich Transcription evaluation (RT-04). These advances include a system architecture based on cross-adaptation; a new form of feature-based MPE training; the use of a full-scale, discriminatively trained, full-covariance Gaussian system; the use of septaphone cross-word acoustic context in static decoding graphs; and the incorporation of 2100 hours of training data in every system component. These advances reduced the error rate on the 2003 test set by approximately 21% relative to the best-performing system in the previous year’s evaluation, and produced the best results on the RT-04 current and progress CTS data.
Articulatory feature-based methods for acoustic and audio-visual speech recognition: 2006 JHU summer workshop final report
Johns Hopkins University Center for Language and Speech Processing, 2007
"... We report on investigations, conducted at the 2006 JHU Summer Workshop, of the use of articulatory features in automatic speech recognition. We explore the use of articulatory features for both observation and pronunciation modeling, and for both audio-only and audio-visual speech recognition. In th ..."
Cited by 33 (11 self)
We report on investigations, conducted at the 2006 JHU Summer Workshop, of the use of articulatory features in automatic speech recognition. We explore the use of articulatory features for both observation and pronunciation modeling, and for both audio-only and audio-visual speech recognition. In the area of observation modeling, we use the outputs of a set of multilayer perceptron articulatory feature classifiers (1) directly, in an extension of hybrid HMM/ANN models, and (2) as part of the observation vector in a standard Gaussian mixture-based model, an extension of the now popular “tandem” approach. In the area of pronunciation modeling, we explore models consisting of multiple hidden streams of states, each corresponding to a different articulatory feature and having soft synchrony constraints, for both audio-only and audio-visual speech recognition. Our models are implemented as dynamic Bayesian networks, and our ...
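A minimal sketch of the second observation-modelling variant above: posteriors from several articulatory-feature (AF) classifiers are appended to the standard acoustic feature vector before Gaussian mixture modelling. The AF inventory, class counts, and random posteriors below are illustrative stand-ins, not the workshop's actual configuration.

import numpy as np

rng = np.random.default_rng(0)
T = 200                                   # frames
mfcc = rng.normal(size=(T, 39))           # standard 39-dim MFCC+deltas

# Hypothetical per-frame posteriors from three MLP AF classifiers.
af_posteriors = {
    "voicing": rng.dirichlet(np.ones(3), size=T),  # voiced/voiceless/silence
    "place":   rng.dirichlet(np.ones(8), size=T),
    "manner":  rng.dirichlet(np.ones(6), size=T),
}

def augment_observations(base, af_streams):
    """Concatenate log AF posteriors onto the base observation vector."""
    logs = [np.log(np.clip(p, 1e-8, 1.0)) for p in af_streams.values()]
    return np.hstack([base] + logs)

obs = augment_observations(mfcc, af_posteriors)   # shape (200, 39+3+8+6)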
Cross-domain and cross-language portability of acoustic features estimated by multilayer perceptrons
In Proc. ICASSP, 2006
"... Recent results with phone-posterior acoustic features estimated by multilayer perceptrons (MLPs) have shown that such features can effectively improve the accuracy of state-of-the-art large vocabulary speech recognition systems. MLP features are trained discriminatively to perform phone classificati ..."
Cited by 33 (2 self)
Recent results with phone-posterior acoustic features estimated by multilayer perceptrons (MLPs) have shown that such features can effectively improve the accuracy of state-of-the-art large vocabulary speech recognition systems. MLP features are trained discriminatively to perform phone classification and are therefore, like acoustic models, tuned to a particular language and application domain. In this paper we investigate how portable such features are across domains and languages. We show that even without retraining, English-trained MLP features can provide a significant boost to recognition accuracy in new domains within the same language, as well as in entirely different languages such as Mandarin and Arabic. We also show the effectiveness of feature-level adaptation in porting MLP features to new domains.
Understanding How Deep Belief Networks Perform Acoustic Modelling
"... Deep Belief Networks (DBNs) are a very competitive alternative to Gaussian mixture models for relating states of a hidden Markov model to frames of coefficients derived from the acoustic input. They are competitive for three reasons: DBNs can be fine-tuned as neural networks; DBNs have many non-line ..."
Cited by 27 (3 self)
Deep Belief Networks (DBNs) are a very competitive alternative to Gaussian mixture models for relating states of a hidden Markov model to frames of coefficients derived from the acoustic input. They are competitive for three reasons: DBNs can be fine-tuned as neural networks; DBNs have many non-linear hidden layers; and DBNs are generatively pre-trained. This paper illustrates how each of these three aspects contributes to the DBN’s good recognition performance, using both phone recognition results on the TIMIT corpus and a dimensionality-reduced visualization of the relationships between the feature vectors learned by the DBNs that preserves the similarity structure of the feature vectors at multiple scales. The same two methods are also used to investigate the most suitable type of input representation for a DBN. Index Terms: deep belief networks, neural networks, acoustic modeling
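For the generative pre-training ingredient, here is a minimal sketch of one contrastive-divergence (CD-1) update for a binary restricted Boltzmann machine, the building block that is stacked to form a DBN. Layer sizes and the learning rate are illustrative; real acoustic systems use a Gaussian-Bernoulli RBM for the first, real-valued input layer rather than the all-binary one shown here.

import numpy as np

rng = np.random.default_rng(0)
n_vis, n_hid, lr = 100, 64, 0.01
W = rng.normal(0, 0.01, (n_vis, n_hid))
b_vis, b_hid = np.zeros(n_vis), np.zeros(n_hid)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0):
    """One CD-1 step on a batch of binary visible vectors v0."""
    global W, b_vis, b_hid
    ph0 = sigmoid(v0 @ W + b_hid)                     # positive phase
    h0 = (rng.random(ph0.shape) < ph0).astype(float)  # sample hidden units
    pv1 = sigmoid(h0 @ W.T + b_vis)                   # reconstruction
    ph1 = sigmoid(pv1 @ W + b_hid)                    # negative phase
    W += lr * (v0.T @ ph0 - pv1.T @ ph1) / len(v0)
    b_vis += lr * (v0 - pv1).mean(axis=0)
    b_hid += lr * (ph0 - ph1).mean(axis=0)

batch = (rng.random((32, n_vis)) < 0.5).astype(float)  # fake binary data
cd1_update(batch)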
Localized spectro-temporal features for automatic speech recognition
In Proc. Eurospeech, 2003
"... Recent results from physiological and psychoacoustic studies indicate that spectrally and temporally localized time-frequency envelope patterns form a relevant basis of auditory perception. This motivates new approaches to feature extraction for automatic speech recognition (ASR) which utilize two-d ..."
Cited by 25 (0 self)
Recent results from physiological and psychoacoustic studies indicate that spectrally and temporally localized time-frequency envelope patterns form a relevant basis of auditory perception. This motivates new approaches to feature extraction for automatic speech recognition (ASR) which utilize two-dimensional spectro-temporal modulation filters. The paper provides a motivation and a brief overview of the work related to Localized Spectro-Temporal Features (LSTF). It further focuses on the Gabor feature approach, where a feature selection scheme is applied to automatically obtain a suitable set of Gabor-type features for a given task. The optimized feature sets are examined in ASR experiments with respect to robustness, and their statistical properties are analyzed.
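As a rough illustration of a data-driven selection loop in the spirit described above, the sketch below greedily grows a Gabor feature set, keeping a candidate only while it improves a validation score. The candidate pool and the scoring function are made-up stand-ins; the paper's actual selection criterion is not reproduced here.

import numpy as np

def greedy_select(candidates, score_fn, max_features=10):
    """Forward selection: add the best-scoring candidate at each step."""
    selected = []
    best_score = -np.inf
    while len(selected) < max_features:
        gains = [(score_fn(selected + [c]), c)
                 for c in candidates if c not in selected]
        if not gains:
            break
        top_score, top = max(gains, key=lambda g: g[0])
        if top_score <= best_score:       # no remaining candidate helps
            break
        selected.append(top)
        best_score = top_score
    return selected

# Toy use: candidates are (omega_t, omega_f) parameter pairs, and the
# hypothetical score rewards high temporal rates with a size penalty.
cands = [(wt, wf) for wt in (0.2, 0.4, 0.8) for wf in (0.3, 0.7)]
chosen = greedy_select(cands, lambda s: sum(w for w, _ in s) - 0.5 * len(s))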