CiteSeerX
Tandem connectionist feature extraction for conventional HMM systems

by Hynek Hermansky, Daniel P. W. Ellis, Sangita Sharma
Results 1 - 10 of 242
Deep Neural Networks for Acoustic Modeling in Speech Recognition

by Geoffrey Hinton, Li Deng, Dong Yu, George Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Vincent Vanhoucke, Patrick Nguyen, Tara Sainath, Brian Kingsbury
Abstract - Cited by 272 (47 self)
Most current speech recognition systems use hidden Markov models (HMMs) to deal with the temporal variability of speech and Gaussian mixture models to determine how well each state of each HMM fits a frame or a short window of frames of coefficients that represents the acoustic input. An alternative way to evaluate the fit is to use a feedforward neural network that takes several frames of coefficients as input and produces posterior probabilities over HMM states as output. Deep neural networks with many hidden layers that are trained using new methods have been shown to outperform Gaussian mixture models on a variety of speech recognition benchmarks, sometimes by a large margin. This paper provides an overview of this progress and represents the shared views of four research groups who have had recent successes in using deep neural networks for acoustic modeling in speech recognition.
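The alternative the abstract describes can be sketched in a few lines. This is an illustrative toy only (random weights, made-up layer sizes, not any group's actual model): a feedforward network maps a context window of acoustic frames to a posterior distribution over HMM states.

```python
import numpy as np

# Toy sketch of the hybrid NN/HMM idea above: a feedforward net takes a
# window of frames and emits posterior probabilities over HMM states.
# All sizes and weights are illustrative assumptions.
rng = np.random.default_rng(0)

n_ctx, n_coeffs = 11, 13     # context frames x coefficients per frame (assumed)
n_hidden, n_states = 64, 40  # hidden units, HMM states (assumed)

W1 = rng.standard_normal((n_ctx * n_coeffs, n_hidden)) * 0.1
b1 = np.zeros(n_hidden)
W2 = rng.standard_normal((n_hidden, n_states)) * 0.1
b2 = np.zeros(n_states)

def state_posteriors(window):
    """window: (n_ctx, n_coeffs) array -> posterior distribution over HMM states."""
    h = np.tanh(window.reshape(-1) @ W1 + b1)
    logits = h @ W2 + b2
    e = np.exp(logits - logits.max())   # numerically stable softmax
    return e / e.sum()

p = state_posteriors(rng.standard_normal((n_ctx, n_coeffs)))
```

The softmax output sums to one over the state inventory, which is what lets the HMM treat it as a per-frame posterior.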

Citation Context

... complementarity of the AE-BN and baseline systems. Instead of replacing the coefficients usually modeled by GMMs, neural networks can also be used to provide additional features for the GMM to model [8], [9], [63]. DBN-DNNs have recently been shown to be very effective in such tandem systems. On the Aurora2 test set, pretraining decreased WERs by more than one third for speech with signal-to-noise l...

Context-Dependent Pre-trained Deep Neural Networks for Large Vocabulary Speech Recognition

by George E. Dahl, Dong Yu, Li Deng, Alex Acero - IEEE Transactions on Audio, Speech, and Language Processing, 2012
Abstract - Cited by 254 (50 self)
We propose a novel context-dependent (CD) model for large vocabulary speech recognition (LVSR) that leverages recent advances in using deep belief networks for phone recognition. We describe a pretrained deep neural network hidden Markov model (DNN-HMM) hybrid architecture that trains the DNN to produce a distribution over senones (tied triphone states) as its output. The deep belief network pre-training algorithm is a robust and often helpful way to initialize deep neural networks generatively that can aid in optimization and reduce generalization error. We illustrate the key components of our model, describe the procedure for applying CD-DNN-HMMs to LVSR, and analyze the effects of various modeling choices on performance. Experiments on a challenging business search dataset demonstrate that CD-DNN-HMMs can significantly outperform the conventional context-dependent Gaussian mixture model (GMM)-HMMs, with an absolute sentence accuracy improvement of 5.8% and 9.2% (or relative error reduction of 16.0% and 23.2%) over the CD-GMM-HMMs trained using the minimum phone error rate (MPE) and maximum likelihood (ML) criteria, respectively.
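A standard step in hybrid DNN-HMM decoding (general to such systems, not specific to this paper's numbers) is converting the network's senone posteriors p(s|o) into scaled likelihoods p(o|s) ∝ p(s|o) / p(s) before the HMM search, dividing out the senone priors estimated from training-alignment counts. All values below are invented for illustration.

```python
import numpy as np

# Hedged sketch: posterior-to-scaled-likelihood conversion used in
# hybrid DNN-HMM decoding. Posteriors and priors are made-up numbers.
posteriors = np.array([0.7, 0.2, 0.1])   # p(s|o) from the DNN (illustrative)
priors = np.array([0.5, 0.3, 0.2])       # p(s) from training counts (illustrative)

# log p(o|s) up to a constant: log posterior minus log prior
scaled_loglik = np.log(posteriors) - np.log(priors)
```

Dividing by the priors reweights frequent senones downward, so the best-scoring state under the HMM need not be the one with the highest raw posterior.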

Locating Singing Voice Segments within Music Signals

by Adam L. Berenzweig, Daniel P. W. Ellis, 2001
Abstract - Cited by 77 (6 self)
A sung vocal line is the prominent feature of much popular music. It would be useful to reliably locate the portions of a musical track during which the vocals are present, both as a ‘signature’ of the piece and as a precursor to automatic recognition of lyrics. Here, we approach this problem by using the acoustic classifier of a speech recognizer as a detector for speech-like sounds. Although singing (including a musical background) is a relatively poor match to an acoustic model trained on normal speech, we propose various statistics of the classifier’s output in order to discriminate singing from instrumental accompaniment. A simple HMM allows us to find a best labeling sequence for this uncertain data. On a test set of forty 15 second excerpts of randomly-selected music, our classifier achieved around 80% classification accuracy at the frame level. The utility of different features, and our plans for eventual lyrics recognition, are discussed.
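The "simple HMM" labeling step the abstract mentions can be illustrated with a two-state (vocal / instrumental) Viterbi pass that smooths noisy frame-level scores. All probabilities below are invented, not taken from the paper.

```python
import numpy as np

# Illustrative two-state Viterbi smoothing: state 0 = vocal, 1 = instrumental.
# Emission scores, transitions, and priors are made-up values.
log_emit = np.log(np.array([
    [0.8, 0.2],   # frame strongly favors vocal
    [0.6, 0.4],
    [0.3, 0.7],   # frame favors instrumental
    [0.4, 0.6],
]))
log_trans = np.log(np.array([[0.9, 0.1], [0.1, 0.9]]))  # "sticky" states
log_init = np.log(np.array([0.5, 0.5]))

delta = log_init + log_emit[0]
back = []
for t in range(1, len(log_emit)):
    scores = delta[:, None] + log_trans          # scores[i, j]: i -> j
    back.append(scores.argmax(axis=0))           # best predecessor per state
    delta = scores.max(axis=0) + log_emit[t]

path = [int(delta.argmax())]
for bp in reversed(back):                        # backtrack
    path.append(int(bp[path[-1]]))
path.reverse()
```

With these sticky transitions the isolated instrumental-leaning frames are overridden and the whole excerpt is labeled vocal, which is exactly the smoothing effect the HMM provides over uncertain frame scores.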

Citation Context

...s well or better. At this point, the system resembles the ‘tandem acoustic models’ (PPFs used as inputs to a Gaussian-mixture-model recognizer) that we have recently been using for speech recognition [6]. Our best performing singing segmenter is a tandem connection of a neural-net discriminatory speech model, followed by a high-dimensional Gaussian distribution model for each of the two classes, foll...

Improving Word Accuracy with Gabor Feature Extraction

by Michael Kleinschmidt, David Gelbart - Proc. ICSLP, 2002
Abstract - Cited by 54 (3 self)
A novel type of feature extraction for automatic speech recognition is investigated. Two-dimensional Gabor functions, with varying extents and tuned to different rates and directions of spectro-temporal modulation, are applied as filters to a spectro-temporal representation provided by mel spectra. The use of these functions is motivated by findings in neurophysiology and psychoacoustics. Data-driven parameter selection was used to obtain Gabor feature sets, the performance of which is evaluated on the Aurora 2 and 3 datasets both on their own and in combination with the Qualcomm-OGI-ICSI Aurora proposal. The Gabor features consistently provide performance improvements.
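The two-dimensional Gabor functions described above can be sketched directly: a complex sinusoid tuned to a temporal and a spectral modulation frequency under a Gaussian envelope, correlated with a spectro-temporal patch of a mel spectrogram. All parameter values here are illustrative, not the paper's selected set.

```python
import numpy as np

# Sketch of one 2-D Gabor filter of the kind applied to mel spectra.
# Filter extents and modulation frequencies are assumed, not from the paper.
def gabor_2d(n_t=9, n_f=9, omega_t=0.3, omega_f=0.5, sigma_t=2.0, sigma_f=2.0):
    t = np.arange(n_t) - n_t // 2              # time axis, centered
    f = np.arange(n_f) - n_f // 2              # frequency axis, centered
    T, F = np.meshgrid(t, f, indexing="ij")
    envelope = np.exp(-(T**2) / (2 * sigma_t**2) - (F**2) / (2 * sigma_f**2))
    carrier = np.exp(1j * (omega_t * T + omega_f * F))
    return envelope * carrier

# A Gabor feature is the filter's response to a spectro-temporal patch:
patch = np.random.default_rng(1).random((9, 9))    # stand-in mel-spectrogram patch
feature = np.abs(np.sum(gabor_2d() * patch))
```

Varying omega_t and omega_f tunes the filter to different rates and directions of spectro-temporal modulation, which is the degree of freedom the paper's data-driven selection searches over.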

Citation Context

...pendicular to its direction of modulation. The Gabor features approach is evaluated within the Aurora experimental framework [10] using a) the Tandem recognition system [11] and d) a combination of it with the Qualcomm-ICSI-OGI QIO-NoTRAPS system, which is described in [12]. Variants of that are b) and c): the Gabor Tandem system as a single stream combined with noise ro...

Tandem Acoustic Modeling In Large-Vocabulary Recognition

by Daniel Ellis, Rita Singh, Sunil Sivadas - in Proc. ICASSP-2001, 2001
Abstract - Cited by 44 (2 self)
In the tandem approach to modeling the acoustic signal, a neural-net preprocessor is first discriminatively trained to estimate posterior probabilities across a phone set. These are then used as feature inputs for a conventional hidden Markov model (HMM) based speech recognizer, which relearns the associations to subword units. In this paper, we apply the tandem approach to the data provided for the first Speech in Noisy Environments (SPINE1) evaluation conducted by the Naval Research Laboratory (NRL) in August 2000. In our previous experience with the ETSI Aurora noisy digits (a small-vocabulary, high-noise task) the tandem approach achieved error-rate reductions of over 50% relative to the HMM baseline. For SPINE1, a larger task involving more spontaneous speech, we find that, when context-independent models are used, the tandem features continue to result in large reductions in word-error rates relative to those achieved by systems using standard MFC or PLP features. However, these ...
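The tandem feature path summarized above can be sketched end to end. This is a minimal sketch under assumed details (log compression of the posteriors and a PCA/KLT decorrelation, as in the tandem papers); the downstream GMM-HMM recognizer is not shown.

```python
import numpy as np

# Tandem feature sketch: NN phone posteriors -> log -> decorrelate -> features.
# Posteriors here are random Dirichlet draws standing in for NN outputs.
rng = np.random.default_rng(0)
posteriors = rng.dirichlet(np.ones(40), size=500)   # 500 frames, 40 phones

logp = np.log(posteriors)         # log compresses the heavily skewed posteriors
logp -= logp.mean(axis=0)         # center before decorrelation
cov = logp.T @ logp / len(logp)
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1][:24]        # keep top 24 components (assumed)
tandem_features = logp @ eigvecs[:, order]    # inputs to a conventional GMM-HMM
```

The decorrelation matters because the diagonal-covariance Gaussians typically used in the HMM back end model decorrelated dimensions far better than raw, highly correlated posterior vectors.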

Citation Context

...f combining the advantages of both, and several groups have pursued variants of this theme [4, 5]. We recently developed a particularly simple variant, which we have termed ‘tandem acoustic modeling’ [6], in which an NN classifier is first trained to estimate context-independent phone posterior probabilities. The probability vectors are then treated as normal feature vectors and used as the input for...

The IBM 2004 conversational telephony system for rich transcription

by Hagen Soltau, Brian Kingsbury, Lidia Mangu, Daniel Povey, George Saon, Geoffrey Zweig - in Proc. ICASSP ’05, 2005
Abstract - Cited by 37 (3 self)
This paper describes the technical advances in IBM’s conversational telephony submission to the DARPA-sponsored 2004 Rich Transcription evaluation (RT-04). These advances include a system architecture based on cross-adaptation; a new form of feature-based MPE training; the use of a full-scale discriminatively trained full covariance Gaussian system; the use of septaphone cross-word acoustic context in static decoding graphs; and the incorporation of 2100 hours of training data in every system component. These advances reduced the error rate by approximately 21% relative, on the 2003 test set, over the best-performing system in last year’s evaluation, and produced the best results on the RT-04 current and progress CTS data.

Citation Context

...ix (note that the features are added to the LDA+MLLT features so this is a reasonable initialization). The method of fMPE may be compared with past work using neural-net posteriors as feature vectors [9]. However, previous methods either use only transformed posteriors as features, or concatenate posterior-derived features and standard recognition features. The new method differs in its high dimensio...

Articulatory feature-based methods for acoustic and audio-visual speech recognition: 2006 JHU summer workshop final report

by Karen Livescu, Özgür Çetin, Simon King, Chris Bartels, Nash Borges, Arthur Kantor, Partha Lal, Lisa Yung, Ari Bezman, Bronwyn Woods, et al. - JOHNS HOPKINS UNIVERSITY CENTER FOR, 2007
Abstract - Cited by 33 (11 self)
We report on investigations, conducted at the 2006 JHU Summer Workshop, of the use of articulatory features in automatic speech recognition. We explore the use of articulatory features for both observation and pronunciation modeling, and for both audio-only and audio-visual speech recognition. In the area of observation modeling, we use the outputs of a set of multilayer perceptron articulatory feature classifiers (1) directly, in an extension of hybrid HMM/ANN models, and (2) as part of the observation vector in a standard Gaussian mixture-based model, an extension of the now popular “tandem ” approach. In the area of pronunciation modeling, we explore models consisting of multiple hidden streams of states, each corresponding to a different articulatory feature and having soft synchrony constraints, for both audio-only and audio-visual speech recognition. Our models are implemented as dynamic Bayesian networks, and our

Citation Context

...e the outputs of multilayer perceptron (MLP) AF classifiers in two ways: to estimate p(o|q) (a “hybrid” approach [9]); and as part of the observation vector after post-processing (a “tandem” approach [10]). We investigate “embedded training” of the MLPs, in which training data is aligned using an AF-based recognizer and the MLPs are retrained [11]. For pronunciation modeling, we test a model consistin...

Cross-domain and cross-language portability of acoustic features estimated by multilayer perceptrons

by Andreas Stolcke, Mei-yuh Hwang, Xin Lei, Nelson Morgan, Dimitra Vergyri - In Proceedings of ICASSP, 2006
Abstract - Cited by 33 (2 self)
Recent results with phone-posterior acoustic features estimated by multilayer perceptrons (MLPs) have shown that such features can effectively improve the accuracy of state-of-the-art large vocabulary speech recognition systems. MLP features are trained discriminatively to perform phone classification and are therefore, like acoustic models, tuned to a particular language and application domain. In this paper we investigate how portable such features are across domains and languages. We show that even without retraining, English-trained MLP features can provide a significant boost to recognition accuracy in new domains within the same language, as well as in entirely different languages such as Mandarin and Arabic. We also show the effectiveness of feature-level adaptation in porting MLP features to new domains.

Citation Context

...n closely related to the recognition task, or at least to train the front end’s parameters according to such a criterion. This was achieved in the Tandem approach to hybrid connectionist/HMM modeling [2], based on prior work in neural-network-based acoustic modeling [3]. The Tandem approach consists of training a multilayer perceptron (MLP) to perform phone posterior estimation at the frame level, bas...

Understanding How Deep Belief Networks Perform Acoustic Modelling

by Abdel-rahman Mohamed, Geoffrey Hinton, Gerald Penn
Abstract - Cited by 27 (3 self)
Deep Belief Networks (DBNs) are a very competitive alternative to Gaussian mixture models for relating states of a hidden Markov model to frames of coefficients derived from the acoustic input. They are competitive for three reasons: DBNs can be fine-tuned as neural networks; DBNs have many non-linear hidden layers; and DBNs are generatively pre-trained. This paper illustrates how each of these three aspects contributes to the DBN’s good recognition performance using both phone recognition performance on the TIMIT corpus and a dimensionally reduced visualization of the relationships between the feature vectors learned by the DBNs that preserves the similarity structure of the feature vectors at multiple scales. The same two methods are also used to investigate the most suitable type of input representation for a DBN. Index Terms — Deep belief networks, neural networks, acoustic modeling
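The generative pre-training ingredient discussed above can be sketched at its smallest scale: one contrastive-divergence (CD-1) update for a binary restricted Boltzmann machine, the building block stacked to form a DBN. Dimensions and data are illustrative, and bias terms are omitted for brevity.

```python
import numpy as np

# Hedged sketch of one CD-1 weight update for a tiny binary RBM
# (biases omitted; sizes and data are made up).
rng = np.random.default_rng(0)
n_vis, n_hid, lr = 6, 4, 0.1
W = rng.standard_normal((n_vis, n_hid)) * 0.01

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

v0 = (rng.random(n_vis) < 0.5).astype(float)   # a binary "data" vector
ph0 = sigmoid(v0 @ W)                          # hidden unit probabilities
h0 = (rng.random(n_hid) < ph0).astype(float)   # sampled hidden states
v1 = sigmoid(W @ h0)                           # reconstruction (probabilities)
ph1 = sigmoid(v1 @ W)                          # hidden probs for reconstruction

# CD-1 gradient: data correlations minus reconstruction correlations
W += lr * (np.outer(v0, ph0) - np.outer(v1, ph1))
```

Stacking such layers, each trained on the hidden activations of the one below, yields the generatively pre-trained network that is then fine-tuned discriminatively as a DNN.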

Citation Context

...structure of speech signals, with each HMM state using a Gaussian mixture model (GMM) to model some type of spectral representation of the sound wave. Some ASR systems use feedforward neural networks [1, 2]. DBNs [3] were proposed for acoustic modeling in speech recognition [4] because they have a higher modeling capacity per parameter than GMMs and they also have a fairly efficient training procedure t...

Localized spectro-temporal features for automatic speech recognition

by Michael Kleinschmidt - Proc. Eurospeech, 2003
Abstract - Cited by 25 (0 self)
Recent results from physiological and psychoacoustic studies indicate that spectrally and temporally localized time-frequency envelope patterns form a relevant basis of auditory perception. This motivates new approaches to feature extraction for automatic speech recognition (ASR) which utilize two-dimensional spectro-temporal modulation filters. The paper provides a motivation and a brief overview on the work related to Localized Spectro-Temporal Features (LSTF). It further focuses on the Gabor feature approach, where a feature selection scheme is applied to automatically obtain a suitable set of Gabor-type features for a given task. The optimized feature sets are examined in ASR experiments with respect to robustness and their statistical properties are analyzed.

Citation Context

...ng temporal context on the order of 10 to 100 ms. Depending on the system, this is part of the back end as in the connectionist approach [21] or part of the feature extraction as in the Tandem system [22]. The main problem of LSTF is the large number of possible parameter combinations. This issue may be solved implicitly by automatic learning in neural networks with a spectrogram input and a long time...

© 2007-2019 The Pennsylvania State University