
Size matters: An empirical study of neural network training for large vocabulary continuous speech recognition (1999)

by D Ellis, N Morgan
Venue: in Proc. ICASSP
Results 1 - 8 of 8

Using MLP features in SRI’s conversational speech recognition system

by Qifeng Zhu, Andreas Stolcke, Barry Y. Chen, Nelson Morgan - in Proc. Interspeech, 2005
Cited by 21 (4 self)
We describe the development of a speech recognition system for conversational telephone speech (CTS) that incorporates acoustic features estimated by multilayer perceptrons (MLP). The acoustic features are based on frame-level phone posterior probabilities, obtained by merging two different MLP estimators, one based on PLP-Tandem features, the other based on hidden activation TRAPs (HATs) features. This paper focuses on the challenges arising when incorporating these nonstandard features into a full-scale speech-to-text (STT) system, as used by SRI in the Fall 2004 DARPA STT evaluations. First, we developed a series of time-saving techniques for training feature MLPs on 1800 hours of speech. Second, we investigated which components of a multipass, multi-front-end recognition system are most profitably augmented with MLP features for best overall performance. The final system achieved a 2% absolute (10% relative) WER reduction over a comparable baseline system that did not include Tandem/HATs MLP features.

Perceptually Inspired Signal-processing Strategies for Robust Speech Recognition in Reverberant Environments

by Brian E. D. Kingsbury, 1998
Cited by 12 (0 self)
Natural, hands-free interaction with computers is currently one of the great unfulfilled promises of automatic speech recognition (ASR), in part because ASR systems cannot reliably recognize speech under everyday, reverberant conditions that pose no problems for most human listeners. The specific properties of the auditory representation of speech likely contribute to reliable human speech recognition under such conditions. This dissertation explores the use of perceptually inspired signal-processing strategies -- critical-band-like frequency analysis, an emphasis on slow changes in the spectral structure of the speech signal, adaptation, integration of phonetic information over syllabic durations, and use of multiple signal representations for...

Incorporating tandem/HATs MLP features into SRI’s conversational speech recognition system

by Qifeng Zhu, Andreas Stolcke, Barry Y. Chen, Nelson Morgan - in Proc. DARPA RT Workshop, 2004
Cited by 5 (1 self)
We describe the development of a speech recognition system for conversational telephone speech (CTS) that incorporates acoustic features estimated by multilayer perceptrons (MLPs). The acoustic features are based on frame-level phone posterior probabilities, obtained by merging two different MLP estimators, one based on PLP-Tandem features, the other based on hidden activation TRAPs (HATs) features. These features had previously been shown to give significant accuracy improvements for CTS recognition when used with modest amounts of training data and relatively simple recognition architectures. This paper focuses on the challenges arising when incorporating these nonstandard features into a full-scale speech-to-text (STT) system, as used by SRI in the Fall 2004 DARPA STT evaluations. First, we developed a series of time-saving techniques for training feature MLPs on 1500 hours of speech. Second, we investigated which components of a multipass, multi-front-end recognition system are most profitably augmented with MLP features for best overall performance. The final system achieved a 2% absolute (10% relative) WER reduction over a comparable baseline system that did not include Tandem/HATs MLP features.

Hybrid Connectionist-Structural Acoustical Modeling In The Atros System

by M. J. Castro, F. Casacuberta - In Proc. Eurospeech'99, 1999
Cited by 2 (2 self)
In this paper, we introduce several hybrid connectionist-structural acoustic models for context-independent phone-like units in the ATROS recognition system. The structural part of the acoustic models has been modeled with Markov chains, and a multilayer perceptron (or a committee of multilayer perceptrons) is used to estimate the emission probabilities of the Markov chains. We compare the recognition performance attained by these models with the performance obtained by classical continuous density hidden Markov models on a semantically restricted task. 1 Introduction: Acoustic-phonetic decoding for continuous speech recognition is an open problem in speech research, because the final performance of an automatic speech recognition system greatly depends on the acoustic modeling quality. Hidden Markov models (HMMs) of phone-like units are the most popular option for modeling speech sounds. Under the statistical framework [1], the problem of speech recognition is to search for a word string ...

Combined Speech And Speaker Recognition With Speaker-Adapted Connectionist Models

by Dominique Genoud, Dan Ellis, Nelson Morgan - In Automatic Speech Recognition and Understanding Workshop, 1999
One approach to speaker adaptation for the neural-network acoustic models of a hybrid connectionist-HMM speech recognizer is to adapt a speaker-independent network by performing a small amount of additional training using data from the target speaker, giving an acoustic model specifically tuned to that speaker. This adapted model might be useful for speaker recognition too, especially since state-of-the-art speaker recognition typically performs a speech-recognition labelling of the input speech as a first stage. However, in order to exploit the discriminant nature of the neural nets, it is better to train a single model to discriminate both between the different phone classes (as in conventional speech recognition) and between the target speaker and the `rest of the world' (a common approach to speaker recognition). We present the results of using such an approach for a set of 12 speakers selected from the DARPA/NIST Broadcast News corpus. The speaker-adapted nets showed a 17% relativ...

Multi-Stream Speech Recognition: Ready For Prime Time?

by Adam Janin, Dan Ellis, Nelson Morgan - Proc. Eurospeech-99, 1999
Multi-stream and multi-band methods can improve the accuracy of speech recognition systems without overly increasing the complexity. However, they cannot be applied blindly. In this paper, we review our experience applying multi-stream and multi-band methods to the Broadcast News corpus. We found that multi-stream systems using different acoustic front-ends provide a significant improvement over single-stream systems. However, despite the fact that they have been successful on smaller tasks, we have not yet been able to show any improvement using multi-band methods. We report various insights gained from the experience in applying these methods in a large-vocabulary task.

Programmable Neurocomputing

by Krste Asanović
This article first reviews the most significant neurocomputer architectures and discusses their design and use, before concluding with predictions of future trends.

INCORPORATING TANDEM/HATS MLP FEATURES INTO SRI’S CONVERSATIONAL SPEECH RECOGNITION SYSTEM

by unknown authors
We describe the development of a speech recognition system for conversational telephone speech (CTS) that incorporates acoustic features estimated by multilayer perceptrons (MLPs). The acoustic features are based on frame-level phone posterior probabilities, obtained by merging two different MLP estimators, one based on PLP-Tandem features, the other based on hidden activation TRAPs (HATs) features. These features had previously been shown to give significant accuracy improvements for CTS recognition when used with modest amounts of training data and relatively simple recognition architectures. This paper focuses on the challenges arising when incorporating these nonstandard features into a full-scale speech-to-text (STT) system, as used by SRI in the Fall 2004 DARPA STT evaluations. First, we developed a series of time-saving techniques for training feature MLPs on 1500 hours of speech. Second, we investigated which components of a multipass, multi-front-end recognition system are most profitably augmented with MLP features for best overall performance. The final system achieved a 2% absolute (10% relative) WER reduction over a comparable baseline system that did not include Tandem/HATs MLP features.