Automatic Recognition of Spontaneous Speech for Access to Multilingual Oral History Archives
- IEEE Transactions on Speech and Audio Processing, 2004
Cited by 20 (6 self)
Abstract: Much is known about the design of automated systems to search broadcast news, but it has only recently become possible to apply similar techniques to large collections of spontaneous speech. This paper presents initial results from experiments with speech recognition, topic segmentation, topic categorization, and named entity detection using a large collection of recorded oral histories. The work leverages a massive manual annotation effort on 10,000 h of spontaneous speech to evaluate the degree to which automatic speech recognition (ASR)-based segmentation and categorization techniques can be adapted to approximate decisions made by human annotators. ASR word error rates near 40% were achieved for both English and Czech for heavily accented, emotional, and elderly spontaneous speech based on 65–84 h of transcribed speech. Topical segmentation based on shifts in the recognized English vocabulary resulted in 80% agreement with
An architecture for rapid decoding of large vocabulary conversational speech
- in Eurospeech-2003, 2003
Cited by 12 (7 self)
This paper addresses the question of how to design a large vocabulary recognition system so that it can simultaneously handle a sophisticated language model, perform state-of-the-art speaker adaptation, and run in one times real time (1 RT). The architecture we propose is based on classical HMM Viterbi decoding, but uses an extremely fast initial speaker-independent decoding to estimate VTL warp factors, feature-space and model-space MLLR transformations that are used in a final speaker-adapted decoding. We present results on past Switchboard evaluation data that indicate that this strategy compares favorably to published unlimited-time systems (running in several hundred times real-time). Coincidentally, this is the system that IBM fielded in the 2003 EARS Rich Transcription evaluation.
Advances in speech transcriptions at IBM under the DARPA EARS program
- IEEE Transactions on Audio, Speech, and Language Processing, accepted for publication, 2000
Cited by 3 (1 self)
Abstract: This paper describes the technical and system building advances made in IBM’s speech recognition technology over the course of the Defense Advanced Research Projects Agency (DARPA) Effective Affordable Reusable Speech-to-Text (EARS) program. At a technical level, these advances include the development of a new form of feature-based minimum phone error training (fMPE), the use of large-scale discriminatively trained full-covariance Gaussian models, the use of septaphone acoustic context in static decoding graphs, and improvements in basic decoding algorithms. At a system building level, the advances include a system architecture based on cross-adaptation and the incorporation of 2100 h of training data in every system component. We present results on English conversational telephony test data from the 2003 and 2004 NIST evaluations. The combination of technical advances and an order of magnitude more training data in 2004 reduced the error rate on the 2003 test set by approximately 21% relative, from 20.4% to 16.1%, over the most accurate system in the 2003 evaluation and produced the most accurate results on the 2004 test sets in every speed category. Index Terms: Discriminative training, Effective Affordable Reusable Speech-to-Text (EARS), finite-state transducer, full
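The relative improvement quoted in this abstract can be checked with a few lines of arithmetic. The sketch below is only an illustration: the function name `relative_reduction` is hypothetical, and the two error rates are the figures given in the abstract.

```python
def relative_reduction(baseline_rate: float, new_rate: float) -> float:
    """Return the relative error-rate reduction as a fraction of the baseline."""
    if baseline_rate <= 0:
        raise ValueError("baseline_rate must be positive")
    return (baseline_rate - new_rate) / baseline_rate


# Error rates on the 2003 test set, as quoted in the abstract (percent).
reduction = relative_reduction(20.4, 16.1)
print(f"relative reduction: {reduction:.1%}")
```

The absolute gain is 4.3 points; dividing by the 20.4% baseline gives the roughly 21% relative figure reported.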
Direct Construction of Compact Context-Dependency Transducers From Data
This paper describes a new method for building compact context-dependency transducers for finite-state transducer-based ASR decoders. Instead of the conventional phonetic decision-tree growing followed by FST compilation, this approach incorporates the phonetic context splitting directly into the transducer construction. The objective function of the split optimization is augmented with a regularization term that measures the number of transducer states introduced by a split. We give results on a large spoken-query task for various n-phone orders and other phonetic features that show this method can greatly reduce the size of the resulting context-dependency transducer with no significant impact on recognition accuracy. This permits using context sizes and features that might otherwise be unmanageable. Index Terms: WFST, LVCSR
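The idea of penalizing a split by the number of transducer states it introduces can be illustrated with a toy scoring function. This is a minimal sketch under stated assumptions, not the paper's formulation: the abstract does not give the form of the regularization term, so the linear penalty, the weight `alpha`, the `CandidateSplit` type, and the example numbers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CandidateSplit:
    name: str               # hypothetical label for the phonetic question
    likelihood_gain: float  # gain in the usual split objective
    new_states: int         # transducer states the split would add


def regularized_score(split: CandidateSplit, alpha: float) -> float:
    """Assumed form: likelihood gain minus a linear penalty on added states."""
    return split.likelihood_gain - alpha * split.new_states


def best_split(candidates: list[CandidateSplit], alpha: float) -> CandidateSplit:
    """Pick the candidate with the highest regularized score."""
    return max(candidates, key=lambda s: regularized_score(s, alpha))


# Made-up candidates for illustration only.
candidates = [
    CandidateSplit("left-context-vowel", likelihood_gain=120.0, new_states=40),
    CandidateSplit("right-context-nasal", likelihood_gain=100.0, new_states=5),
]

print(best_split(candidates, alpha=0.0).name)  # no penalty: largest gain wins
print(best_split(candidates, alpha=1.0).name)  # with penalty: cheaper split wins
```

With `alpha` set to zero the selection reduces to ordinary gain-based splitting; raising it trades a small loss in gain for a smaller transducer, which is the trade-off the abstract describes.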

