Use of word level side information to improve speech recognition (2000)

by Dimitra Vergyri
Venue:Proceedings ICASSP
Results 1 - 5 of 5

LANDMARK-BASED SPEECH RECOGNITION: REPORT OF THE 2004 Johns Hopkins Summer Workshop

by Mark Hasegawa-Johnson, James Baker, Steven Greenberg, Katrin Kirchhoff, Jennifer Muller, Kemal Sönmez, Sarah Borys, Ken Chen, Amit Juneja, Karen Livescu, Srividya Mohan, Emily Coogan, Tianyu Wang, 2005
Cited by 14 (1 self)

Prosodic knowledge sources for automatic speech recognition

by Dimitra Vergyri, Andreas Stolcke, Venkata R. R. Gadde, Luciana Ferrer, Elizabeth Shriberg - in Proceedings of the IEEE Conference on Acoustics, Speech, and Signal Processing , 2003
Cited by 13 (5 self)
In this work, different prosodic knowledge sources are integrated into a state-of-the-art large vocabulary speech recognition system. Prosody manifests itself on different levels in the speech signal: within the words as a change in phone durations and pitch, in between the words as a variation in the pause length, and beyond the words, correlating with higher linguistic structures and nonlexical phenomena. We investigate three models, each exploiting a different level of prosodic information, in rescoring N-best hypotheses according to how well recognized words correspond to prosodic features of the utterance. Experiments on the Switchboard corpus show word accuracy improvements with each prosodic knowledge source. A further improvement is observed with the combination of all models, demonstrating that they each capture somewhat different prosodic characteristics of the speech signal.

DBN based multi-stream models for audio-visual speech recognition

by John N. Gowdy, Amarnag Subramanya, Chris Bartels, Jeff Bilmes - in Proc. ICASSP , 2004
Cited by 11 (3 self)
In this paper, we propose a model based on Dynamic Bayesian Networks (DBNs) to integrate information from multiple audio and visual streams. We also compare the DBN based system (implemented using the Graphical Model Toolkit (GMTK)) with a classical HMM (implemented in the Hidden Markov Model Toolkit (HTK)) for both the single and two stream integration problems. We also propose a new model (mixed integration) to integrate information from three or more streams derived from different modalities and compare the new model’s performance with that of a synchronous integration scheme. A new technique to estimate stream confidence measures for the integration of three or more streams is also developed and implemented. Results from our implementation using the Clemson University Audio Visual Experiments (CUAVE) database indicate an absolute improvement of about in word accuracy in the -4 to 10 dB average case when making use of two audio and one video streams for the mixed integration models over the synchronous models.

Minimum Risk Acoustic Clustering for Multilingual Acoustic Model combination

by Dimitra Vergyri, Stavros Tsakalidis, William Byrne - Proc. of International Conference on Spoken Language Processing , 2000
Cited by 9 (2 self)
In this paper we describe procedures for combining multiple acoustic models, obtained using training corpora from different languages, in order to improve ASR performance in languages for which large amounts of training data are not available. We treat these models as multiple sources of information whose scores are combined in a log-linear model to compute the hypothesis likelihood. The model combination can either be performed in a static way, with constant combination weights, or in a dynamic way, with parameters that can vary for different segments of a hypothesis. The aim is to optimize the parameters so as to achieve minimum word error rate. In order to achieve robust parameter estimation in the dynamic combination case, the parameters are defined to be piecewise constant on different phonetic classes that form a partition of the space of hypothesis segments. The partition is defined, using phonological knowledge, on segments that correspond to hypothesized phones. We examine different ways to define such a partition, including an automatic approach that gives a binary tree structured partition which tries to achieve the minimum WER with the minimum number of classes.
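The difference between static and dynamic log-linear combination described in this abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the segment representation, the function names `static_score` and `dynamic_score`, the class labels, and all weights and scores are hypothetical values chosen only for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical representation: each hypothesis segment carries a phonetic
# class label and one log score per acoustic model.
Segment = Tuple[str, List[float]]


def static_score(segments: List[Segment], weights: List[float]) -> float:
    """Log-linear score with one constant weight per model."""
    return sum(w * s for _, scores in segments for w, s in zip(weights, scores))


def dynamic_score(segments: List[Segment],
                  class_weights: Dict[str, List[float]]) -> float:
    """Log-linear score whose weights are piecewise constant per phonetic class."""
    total = 0.0
    for phone_class, scores in segments:
        weights = class_weights[phone_class]  # weights depend on the segment's class
        total += sum(w * s for w, s in zip(weights, scores))
    return total


# Toy hypothesis with two segments scored by two models (made-up numbers).
hyp = [("vowel", [-2.0, -3.0]), ("consonant", [-1.0, -4.0])]
static = static_score(hyp, [0.5, 0.5])
dynamic = dynamic_score(hyp, {"vowel": [1.0, 0.0], "consonant": [0.0, 1.0]})
```

In this sketch the weights are given by hand; the abstract describes estimating them so as to minimize word error rate, which is not shown here.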

New Features based on Multiple Word Graphs for Utterance Verification

by Alberto Sanchis, Alfons Juan, Enrique Vidal - ICSLP, 2004
The goal of Utterance Verification is to estimate a confidence measure which helps detect words in the hypothesized sentence that are likely to have been misrecognized. Word graphs have been extensively employed for directly estimating the confidence measure and for extracting important predictor features. In all these cases, a single word graph, obtained through the recognition process, is used. In this paper we propose the use of multiple word graphs to compute new features. The experimental study shows that these proposed features outperform those computed on a single word graph and other well-known predictor features. Moreover, the combination of the proposed features with other kinds of features provides improvements in the verification accuracy.