A segmental CRF approach to large vocabulary continuous speech recognition (2009)

by G Zweig, P Nguyen
Venue: in Proc. ASRU

Results 1 - 10 of 14

Context-Dependent Pre-trained Deep Neural Networks for Large Vocabulary Speech Recognition

by George E. Dahl, Dong Yu, Li Deng, Alex Acero - IEEE Transactions on Audio, Speech, and Language Processing, 2012
Cited by 9 (1 self)
We propose a novel context-dependent (CD) model for large vocabulary speech recognition (LVSR) that leverages recent advances in using deep belief networks for phone recognition. We describe a pre-trained deep neural network hidden Markov model (DNN-HMM) hybrid architecture that trains the DNN to produce a distribution over senones (tied triphone states) as its output. The deep belief network pre-training algorithm is a robust and often helpful way to initialize deep neural networks generatively that can aid in optimization and reduce generalization error. We illustrate the key components of our model, describe the procedure for applying CD-DNN-HMMs to LVSR, and analyze the effects of various modeling choices on performance. Experiments on a challenging business search dataset demonstrate that CD-DNN-HMMs can significantly outperform the conventional context-dependent Gaussian mixture model (GMM)-HMMs, with an absolute sentence accuracy improvement of 5.8% and 9.2% (or relative error reduction of 16.0% and 23.2%) over the CD-GMM-HMMs trained using the minimum phone error rate (MPE) and maximum likelihood (ML) criteria, respectively. Index Terms: speech recognition, deep belief network, context-dependent phone, LVSR, DNN-HMM, ANN-HMM
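The absolute and relative figures quoted in this abstract are linked by simple arithmetic: a relative error reduction is the absolute gain divided by the baseline error rate. The sketch below back-computes the baseline sentence error rate that each pair of figures implies. The function names are hypothetical, and the implied baselines are derived from the rounded numbers in the abstract, not values stated on this page.

```python
def implied_baseline_error(abs_gain_pts: float, rel_reduction_pct: float) -> float:
    """Baseline error rate (in percent) implied by an absolute gain in
    accuracy (percentage points) and a relative error reduction (percent)."""
    if rel_reduction_pct <= 0:
        raise ValueError("relative reduction must be positive")
    return abs_gain_pts / (rel_reduction_pct / 100.0)


def relative_error_reduction(baseline_err_pct: float, new_err_pct: float) -> float:
    """Relative error reduction in percent, given two error rates in percent."""
    return 100.0 * (baseline_err_pct - new_err_pct) / baseline_err_pct


if __name__ == "__main__":
    # (absolute gain, relative reduction) pairs as quoted in the abstract
    for label, gain, rel in [("MPE", 5.8, 16.0), ("ML", 9.2, 23.2)]:
        base = implied_baseline_error(gain, rel)
        print(f"{label}: implied baseline sentence error of about {base:.1f}%")
```

Because the quoted percentages are rounded, the implied baselines are approximate.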

SCARF: A Segmental Conditional Random Field Toolkit for Speech Recognition

by Geoffrey Zweig, Patrick Nguyen
Cited by 4 (2 self)
This paper describes a new toolkit, SCARF, for doing speech recognition with segmental conditional random fields. It is designed to allow for the integration of numerous, possibly redundant segment-level acoustic features, along with a complete language model, in a coherent speech recognition framework. SCARF performs a segmental analysis, where each segment corresponds to a word, thus allowing for the incorporation of acoustic features defined at the phoneme, multi-phone, syllable and word level. SCARF is designed to make it especially convenient to use acoustic detection events as input, such as the detection of energy bursts, phonemes, or other events. Language modeling is done by associating each state in the SCRF with a state in an underlying n-gram language model, and SCARF supports the joint and discriminative training of language model and acoustic model parameters. SCARF is available for download from

Structured log linear models for noise robust speech recognition

by S.-X. (Austin) Zhang, M. J. F. Gales - Signal Processing Letters, IEEE, 2010
Cited by 4 (3 self)
The use of discriminative models for structured classification tasks, such as automatic speech recognition, is becoming increasingly popular. The major contribution of this work is a large margin structured log-linear model for noise robust continuous ASR. An important aspect of log-linear models is the form of the features. The features used in our structured log-linear model are derived from generative kernels. This provides an elegant way of combining generative and discriminative models to handle time-varying data. Additionally, since the features are based on the generative models, model-based compensation can be easily performed for noise robustness. Third, the designed joint feature space can be decomposed at the arc level. This allows efficient decoding and training with lattices, which is important for any larger vocabulary extensions. Previous work in this area is extended in two important directions. First, instead of using CML training, which is commonly used for discriminative models, this paper describes efficient large margin training for sentence-level log-linear models based on lattices. Depending on the nature of the joint feature space and labels, we have proved that this form of model is closely related to structured SVMs and multiclass SVMs. Second, efficient lattice-based classification of continuous data is also performed, incorporating a joint feature space. This novel model combines generative kernels, discriminative models, efficient lattice-based large margin training and model-based noise compensation. It is evaluated on a noise-corrupted continuous digit task, AURORA 2. Results on AURORA 2 demonstrate that modelling the structure information yields significant improvements.

Structured Support Vector Machines for Noise Robust Continuous Speech Recognition

by Shi-xiong Zhang, M. J. F. Gales
Cited by 2 (2 self)
The use of discriminative models is an interesting alternative to generative models for speech recognition. This paper examines one form of these models, structured support vector machines (SVMs), for noise robust speech recognition. One important aspect of structured SVMs is the form of the joint feature space. In this work features based on generative models are used, which allows model-based compensation schemes to be applied to yield robust joint features. However, these features require the segmentation of frames into words, or subwords, to be specified. In previous work this segmentation was obtained using generative models. Here the segmentations are refined using the parameters of the structured SVM. A Viterbi-like scheme for obtaining “optimal” segmentations, and modifications to the training algorithm to allow them to be efficiently used, are described. The performance of the approach is evaluated on a noise-corrupted continuous digit task: AURORA 2. Index Terms: speech recognition, structural SVMs, optimal alignment, large margin, log linear model

LARGE VOCABULARY CONTINUOUS SPEECH RECOGNITION WITH CONTEXT-DEPENDENT DBN-HMMS

by George E. Dahl, Dong Yu, Li Deng, Alex Acero
Cited by 1 (1 self)
The context-independent deep belief network (DBN) hidden Markov model (HMM) hybrid architecture has recently achieved promising results for phone recognition. In this work, we propose a context-dependent DBN-HMM system that dramatically outperforms strong Gaussian mixture model (GMM)-HMM baselines on a challenging, large vocabulary, spontaneous speech recognition dataset from the Bing mobile voice search task. Our system achieves absolute sentence accuracy improvements of 5.8% and 9.2% over GMM-HMMs trained using the minimum phone error rate (MPE) and maximum likelihood (ML) criteria, respectively, which translate to relative error reductions of 16.0% and 23.2%. Index Terms: speech recognition, deep belief network, context-dependent phone, LVCSR, DBN-HMM

INTEGRATING META-INFORMATION INTO EXEMPLAR-BASED SPEECH RECOGNITION WITH SEGMENTAL CONDITIONAL RANDOM FIELDS

by Kris Demuynck, Dirk Van Compernolle, Patrick Nguyen, Geoffrey Zweig
Cited by 1 (1 self)
Exemplar-based recognition systems are characterized by the fact that, instead of abstracting large amounts of data into compact models, they store the observed data enriched with some annotations and infer on-the-fly from the data by finding those exemplars that resemble the input speech best. One advantage of exemplar-based systems is that, next to deriving what the current phone or word is, one can easily derive a wealth of meta-information concerning the chunk of audio under investigation. In this work we harvest meta-information from the set of best matching exemplars that is thought to be relevant for the recognition, such as word boundary predictions and speaker entropy. Integrating this meta-information into the recognition framework using segmental conditional random fields reduced the WER of the exemplar-based system on the WSJ Nov92 20k task from 8.2% to 7.6%. Adding the HMM-score and multiple HMM phone detectors as features further reduced the error rate to 6.6%.

FROM FLAT DIRECT MODELS TO SEGMENTAL CRF MODELS

by Geoffrey Zweig, Patrick Nguyen
Cited by 1 (0 self)
This paper summarizes recent work at Microsoft on the development of novel direct models. The key characteristic of our approaches is the use of long-span segment level features that relate acoustic properties directly to words. In this approach, the frame-level Markov assumption is replaced by the segment level Markov property, allowing us to extract long-span features. A key issue we address is the definition of generalizable features which allow us to model unseen words. We review two recently developed models that have this property: Flat Direct Models (FDMs), and Segmental CRFs (SCRFs). The first operates in a log-linear framework, and uses utterance level features. The second is also a log-linear model, but defines features at the word-segment level. We present new experimental results comparing the two approaches. We find that both show consistent improvements over a baseline system, and that the extra context available to the FDM enables slightly better performance in a rescoring context. This gain comes at the expense of applicability to first pass decoding, for which the SCRF is better suited.

Continuous Speech Recognition with a TF-IDF Acoustic Model

by Geoffrey Zweig, Patrick Nguyen, Alex Acero
Information retrieval methods are frequently used for indexing and retrieving spoken documents, and more recently have been proposed for voice-search amongst a pre-defined set of business entries. In this paper, we show that these methods can be used in an even more fundamental way, as the core component in a continuous speech recognizer. Speech is initially processed and represented as a sequence of discrete symbols, specifically phoneme or multi-phone units. Recognition then operates on this sequence. The recognizer is segment-based, and the acoustic score for labeling a segment with a word is based on the TF-IDF similarity between the subword units detected in the segment, and those typically seen in association with the word. We present promising results on both a voice search task and the Wall Street Journal task. The development of this method brings us one step closer to being able to do speech recognition based on the detection of sub-word audio attributes.
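The segment scoring idea in this abstract can be illustrated with a small sketch: each word is represented by the subword units typically seen with it, and a hypothesized segment is scored against each word by TF-IDF similarity. Everything below is an assumption for illustration, not the authors' implementation: the toy phone sequences, the function names, the smoothed IDF formula, and the use of cosine similarity are not specified in the abstract.

```python
import math
from collections import Counter


def idf_weights(word_profiles):
    """Smoothed inverse document frequency, treating each word's unit
    profile as one 'document' (the smoothing choice is an assumption)."""
    n = len(word_profiles)
    df = Counter()
    for units in word_profiles.values():
        df.update(set(units))
    return {u: math.log((1 + n) / (1 + d)) + 1.0 for u, d in df.items()}


def tfidf_vector(units, idf):
    """Sparse TF-IDF vector; units never seen in any profile get weight 0."""
    tf = Counter(units)
    return {u: count * idf.get(u, 0.0) for u, count in tf.items()}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(u, 0.0) for u, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def score_segment(segment_units, word_profiles):
    """Score one segment against every word in the (toy) vocabulary."""
    idf = idf_weights(word_profiles)
    seg_vec = tfidf_vector(segment_units, idf)
    return {
        word: cosine(seg_vec, tfidf_vector(units, idf))
        for word, units in word_profiles.items()
    }


# Hypothetical phone-level profiles for a three-word vocabulary.
WORD_PROFILES = {
    "seattle": ["s", "iy", "ae", "t", "ax", "l"],
    "settle": ["s", "eh", "t", "ax", "l"],
    "pizza": ["p", "iy", "t", "s", "ax"],
}

if __name__ == "__main__":
    detected = ["s", "iy", "ae", "t", "l"]  # units detected in one segment
    scores = score_segment(detected, WORD_PROFILES)
    print(max(scores, key=scores.get), scores)
```

In a segment-based recognizer, scores of this kind would be one acoustic feature among others; the search over segmentations and the language model are outside this sketch.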

DISCRIMINATIVE TEMPLATE EXTRACTION FOR DIRECT MODELING

by Patrick Nguyen, Geoffrey Zweig
This paper addresses the problem of developing appropriate features for use in direct modeling approaches to speech recognition, such as those based on Maximum Entropy models or Segmental Conditional Random Fields. We propose a feature based on the detection of word-level templates which are discriminatively chosen based on a mutual information criterion. The templates for a word are derived directly from the MFCC feature vectors, based on self-similarity across examples. No pronunciation dictionary is used, and the resulting templates match closely to in-class examples and distantly to out-of-class examples. We utilize template detection events as input to a segmental CRF speech recognizer. We evaluate the entire scheme on a voice search task. The results show that the use of discriminative template based word detector streams improves the speech recognizer’s performance over the baseline HMM results.

unknown title

by unknown authors
This paper summarizes the 2010 CLSP Summer Workshop on speech recognition at Johns Hopkins University. The key theme of the workshop was to improve on state-of-the-art speech recognition systems by using Segmental Conditional Random Fields (SCRFs) to integrate multiple types of information. This approach uses a state-of-the-art baseline as a springboard from which to add a suite of novel features including ones derived from acoustic templates, deep neural net phoneme detections, duration models, modulation features, and whole word point-process models. The SCRF framework is able to appropriately weight these different information sources to produce significant gains on both the Broadcast News and Wall Street Journal tasks.