Arc Minimization in Finite State Decoding Graphs with Cross-Word Acoustic Context (2002)

by G. Zweig, G. Saon, F. Yvon
Venue: In Proc. ICSLP’02
Results 1 - 4 of 4

Automatic Recognition of Spontaneous Speech for Access to Multilingual Oral History Archives

by William Byrne, David Doermann, Martin Franz, Samuel Gustman, Dagobert Soergel, Todd Ward, Wei-jing Zhu - IEEE Transactions on Speech and Audio Processing, 2004
Cited by 20 (6 self)
Much is known about the design of automated systems to search broadcast news, but it has only recently become possible to apply similar techniques to large collections of spontaneous speech. This paper presents initial results from experiments with speech recognition, topic segmentation, topic categorization, and named entity detection using a large collection of recorded oral histories. The work leverages a massive manual annotation effort on 10,000 h of spontaneous speech to evaluate the degree to which automatic speech recognition (ASR)-based segmentation and categorization techniques can be adapted to approximate decisions made by human annotators. ASR word error rates near 40% were achieved for both English and Czech for heavily accented, emotional and elderly spontaneous speech based on 65–84 h of transcribed speech. Topical segmentation based on shifts in the recognized English vocabulary resulted in 80% agreement with

An architecture for rapid decoding of large vocabulary conversational speech

by George Saon, Geoffrey Zweig, Brian Kingsbury, Lidia Mangu, Upendra Chaudhari - in Eurospeech-2003, 2003
Cited by 12 (7 self)
This paper addresses the question of how to design a large vocabulary recognition system so that it can simultaneously handle a sophisticated language model, perform state-of-the-art speaker adaptation, and run in one times real time (1 RT). The architecture we propose is based on classical HMM Viterbi decoding, but uses an extremely fast initial speaker-independent decoding to estimate VTL warp factors, feature-space and model-space MLLR transformations that are used in a final speaker-adapted decoding. We present results on past Switchboard evaluation data that indicate that this strategy compares favorably to published unlimited-time systems (running in several hundred times real-time). Coincidentally, this is the system that IBM fielded in the 2003 EARS Rich Transcription evaluation.

Advances in speech transcriptions at IBM under the DARPA EARS program

by Stanley F. Chen, Brian Kingsbury, Lidia Mangu, Daniel Povey, George Saon, Hagen Soltau, Geoffrey Zweig - IEEE Transactions on Audio, Speech, and Language Processing, accepted for publication, 2000
Cited by 3 (1 self)
This paper describes the technical and system building advances made in IBM’s speech recognition technology over the course of the Defense Advanced Research Projects Agency (DARPA) Effective Affordable Reusable Speech-to-Text (EARS) program. At a technical level, these advances include the development of a new form of feature-based minimum phone error training (fMPE), the use of large-scale discriminatively trained full-covariance Gaussian models, the use of septaphone acoustic context in static decoding graphs, and improvements in basic decoding algorithms. At a system building level, the advances include a system architecture based on cross-adaptation and the incorporation of 2100 h of training data in every system component. We present results on English conversational telephony test data from the 2003 and 2004 NIST evaluations. The combination of technical advances and an order of magnitude more training data in 2004 reduced the error rate on the 2003 test set by approximately 21% relative (from 20.4% to 16.1%) over the most accurate system in the 2003 evaluation and produced the most accurate results on the 2004 test sets in every speed category. Index Terms: Discriminative training, Effective Affordable Reusable Speech-to-Text (EARS), finite-state transducer, full
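
The "approximately 21% relative" figure in this abstract follows from the two error rates it quotes. A minimal sketch of that arithmetic is given here; the function name relative_reduction is a hypothetical helper chosen for illustration and does not come from the paper.

```python
def relative_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction of the baseline error rate removed by the new system.

    Hypothetical helper for illustration; both arguments are error
    rates expressed in percent.
    """
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Error rates quoted in the abstract for the 2003 test set.
absolute_drop = 20.4 - 16.1
relative_drop = relative_reduction(20.4, 16.1)

# prints "absolute: 4.3 points, relative: 21.1%"
print(f"absolute: {absolute_drop:.1f} points, relative: {relative_drop:.1%}")
```

The absolute drop is 4.3 percentage points; dividing by the 20.4% baseline gives the relative reduction of roughly 21%.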

Direct Construction of Compact Context-Dependency Transducers From Data

by David Rybach, Michael Riley
This paper describes a new method for building compact context-dependency transducers for finite-state transducer-based ASR decoders. Instead of the conventional phonetic decision-tree growing followed by FST compilation, this approach incorporates the phonetic context splitting directly into the transducer construction. The objective function of the split optimization is augmented with a regularization term that measures the number of transducer states introduced by a split. We give results on a large spoken-query task for various n-phone orders and other phonetic features that show this method can greatly reduce the size of the resulting context-dependency transducer with no significant impact on recognition accuracy. This permits using context sizes and features that might otherwise be unmanageable. Index Terms: WFST, LVCSR