Development of the 2003 CU-HTK Conversational Telephone Speech Transcription System
- In Proc. ICASSP, 2004
Abstract - Cited by 11 (4 self)
This paper describes the development of the 2003 CU-HTK large vocabulary speech recognition system for Conversational Telephone Speech (CTS). The system was designed based on a multipass, multi-branch structure where the output of all branches is combined using system combination. A number of advanced modelling techniques such as Speaker Adaptive Training, Heteroscedastic Linear Discriminant Analysis, Minimum Phone Error estimation and specially constructed Single Pronunciation dictionaries were employed. The effectiveness of each of these techniques and their potential contribution to the result of system combination was evaluated in the framework of a state-of-the-art LVCSR system with sophisticated adaptation. The final 2003 CU-HTK CTS system constructed from some of these models is described and its performance on the DARPA/NIST 2003 Rich Transcription (RT-03) evaluation test set is discussed.
Recent Advances in Broadcast News Transcription
- In Proc. IEEE ASRU Workshop, 2003
Abstract - Cited by 10 (3 self)
This paper describes recent advances in the CU-HTK Broadcast News English (BN-E) transcription system and its performance in the DARPA/NIST Rich Transcription 2003 Speech-to-Text (RT03) evaluation. Heteroscedastic linear discriminant analysis (HLDA) and discriminative training, which were previously developed in the context of the recognition of conversational telephone speech, have been successfully applied to the BN-E task for the first time. A number of new features have also been added. These include gender-dependent (GD) discriminative training and modified discriminative training using lattice re-generation and combination. On the 2003 evaluation set, the system gave an overall word error rate of 10.7% in less than 10 times real time (10RT).
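The two figures quoted in this abstract, word error rate and the real-time factor, can be illustrated with a minimal sketch. It assumes the conventional definitions: word error rate as the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words, and real-time factor as processing time divided by audio duration. The function names and the example sentences are hypothetical and are not taken from the paper or from the official NIST scoring tools, which also apply text normalisation that is omitted here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Processing time per second of audio; below 10 means 'less than 10xRT'."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


if __name__ == "__main__":
    # Made-up example: one substitution and one deletion in six words.
    wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
    print(f"WER: {wer:.1%}")
    # Made-up timings: 50 minutes of processing for 6 minutes of audio.
    print(f"RTF: {real_time_factor(3000.0, 360.0):.2f}")
```

Under these definitions, a system meets a 10 times real-time budget when `real_time_factor` returns a value below 10 for the whole test set.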
Design of Fast LVCSR Systems
- 2003
Abstract - Cited by 8 (4 self)
This paper describes the development of fast (less than 10 times real-time) large vocabulary continuous speech recognition (LVCSR) systems based on technology developed for unlimited runtime systems assembled for participation in recent DARPA/NIST LVCSR evaluations. A general system structure for 10 times real-time systems is proposed and two specific systems that have been built for Broadcast News (BN) and Conversational Telephone Speech (CTS) recognition are described. The systems were evaluated in the DARPA/NIST April 2003 Rich Transcription evaluation. Results are reported and contrasted with unlimited runtime systems and previous fast systems.
Pronunciation Change in Conversational Speech and
- 2003
Abstract
Pronunciations in spontaneous speech differ significantly from citation form and pronunciation modeling for automatic speech recognition has received considerable attention in the last few years. Most methods describe alternate pronunciations of a word using multiple entries in a dictionary or using a network of phones, assuming implicitly that a deviation from the canonical pronunciation results in a "complete" change as described by the alternate pronunciation. We investigate this implicit assumption about pronunciation change in conversational speech and demonstrate here that in most cases, the change is only partial; a phone is not completely deleted or substituted by another phone but is modified only partially. Evidence supporting this conclusion comes from the three-way analysis of features extracted from the acoustic signal for use in a speech recognition system, canonical pronunciations from a dictionary, and careful phonetic transcriptions produced by human labelers. Most often, when a deviation from the canonical pronunciation is marked, neither the canonical nor the manually labeled phones represent the actual acoustics adequately. Further analysis of the manual phonetic transcription reveals a significant number (>20%) of instances where even human labelers disagree on the identity of the surface-form. In light of this evidence, two methods are suggested for accommodating such partial pronunciation change in the automatic recognition of spontaneous speech and experimental results are presented for each method.
Automatic determination of sub-word units for automatic speech recognition
- 2008
Abstract
Current automatic speech recognition (ASR) research is focused on recognition of continuous, spontaneous speech. Spontaneous speech contains a lot of variability in the way words are pronounced, and canonical pronunciations of each word are not true to the variation that is seen in real data.
Two of the components of an ASR system are acoustic models and pronunciation models. The variation within spontaneous speech must be accounted for by these components. Phones, or context-dependent phones, are typically used as the base sub-word unit, and one acoustic model is trained for each sub-word unit. Pronunciation modelling largely takes place in a dictionary, which relates words to sequences of phones. Acoustic modelling and pronunciation modelling overlap, and the two are not clearly separable in modelling pronunciation variation. Techniques that find pronunciation variants in the data and then reflect these in the dictionary have not provided the expected gains in recognition.
An alternative approach to modelling pronunciations in terms of phones is to derive units automatically: using data-driven methods to determine an inventory of sub-word units, their acoustic models, and their relationship to words. This thesis presents a method for the automatic derivation of a sub-word unit inventory, whose main components are:
1. automatic and simultaneous generation of a sub-word unit inventory and acoustic model set, using an ergodic hidden Markov model whose complexity is controlled using the Bayesian Information Criterion;
2. automatic generation of probabilistic dictionaries using joint multigrams.
The prerequisites of this approach are fewer than in previous work on unit derivation; notably, the timings of word boundaries are not required here. The approach is language independent since it is entirely data-driven and no linguistic information is required. The dictionary generation method outperforms a supervised method using phonetic data. The automatically derived units and dictionary perform reasonably on a small spontaneous speech task, although not yet outperforming phones.
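The abstract above mentions controlling model complexity with the Bayesian Information Criterion. As a rough illustration only, the sketch below uses the standard textbook form, BIC = -2 ln L + k ln n, and picks the candidate with the lowest score. The exact criterion used in the thesis (for example, any penalty weighting) is not given in the abstract, so this form is an assumption, and the function names and the toy log-likelihoods and parameter counts are hypothetical.

```python
import math


def bic(log_likelihood: float, num_params: int, num_samples: int) -> float:
    """Standard BIC: lower is better. Penalises parameters by ln(n) each."""
    if num_samples <= 0:
        raise ValueError("num_samples must be positive")
    return -2.0 * log_likelihood + num_params * math.log(num_samples)


def select_by_bic(candidates: dict, num_samples: int) -> str:
    """Return the name of the candidate model with the lowest BIC.

    `candidates` maps a model name to (log_likelihood, num_params).
    """
    return min(
        candidates,
        key=lambda name: bic(*candidates[name], num_samples),
    )


if __name__ == "__main__":
    # Made-up numbers: larger unit inventories fit the data better
    # but pay a larger complexity penalty.
    toy_models = {
        "10 units": (-6000.0, 100),
        "20 units": (-5000.0, 200),
        "40 units": (-4950.0, 400),
    }
    for name, (loglik, k) in toy_models.items():
        print(name, round(bic(loglik, k, 1000), 1))
    print("selected:", select_by_bic(toy_models, 1000))
```

The design point is the trade-off: a larger inventory is accepted only when its gain in log-likelihood outweighs the ln(n) penalty per added parameter.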

