Context-Dependent Classes in a Hybrid Recurrent Network-HMM Speech Recognition System
In Advances in Neural Information Processing Systems, 1995. Cited by 37 (7 self).
A method for incorporating context-dependent phone classes in a connectionist-HMM hybrid speech recognition system is introduced. A modular approach is adopted, where single-layer networks discriminate between different context classes given the phone class and the acoustic data. The context networks are combined with a context-independent (CI) network to generate context-dependent (CD) phone probability estimates. Experiments show an average reduction in word error rate of 16% and 13% from the CI system on ARPA 5,000 word and SQALE 20,000 word tasks respectively. Due to improved modelling, the decoding speed of the CD system is more than twice as fast as the CI system.
INTRODUCTION
The Abbot hybrid connectionist-HMM system performed competitively with many conventional hidden Markov model (HMM) systems in the 1994 ARPA evaluations of speech recognition systems (Hochberg, Cook, Renals, Robinson & Schechtman 1995). This hybrid framework is attractive because it is compact, having far f...
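Illustrative sketch (not taken from the paper): the modular combination described in the abstract can be read as a product of two posteriors, P(context, phone | x) = P(context | phone, x) · P(phone | x). The Python below assumes the CI posteriors are already available and stands in for the per-phone context networks with raw scores passed through a softmax; the names `cd_posteriors`, `ci_posteriors` and `context_scores` are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of raw scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cd_posteriors(ci_posteriors, context_scores):
    """Combine CI phone posteriors with per-phone context-class scores.

    ci_posteriors:  {phone: P(phone | x)}, assumed to come from a CI network.
    context_scores: {phone: [raw score per context class]}, assumed to come
                    from hypothetical per-phone context networks.
    Returns {(phone, context_index): P(context, phone | x)}.
    """
    joint = {}
    for phone, p_phone in ci_posteriors.items():
        p_context = softmax(context_scores[phone])  # P(context | phone, x)
        for c, p_c in enumerate(p_context):
            joint[(phone, c)] = p_c * p_phone
    return joint

# Made-up numbers for two phones with three and two context classes.
ci = {"aa": 0.7, "iy": 0.3}
scores = {"aa": [2.0, 0.0, -1.0], "iy": [0.5, 0.5]}
cd = cd_posteriors(ci, scores)
```

Because each context distribution sums to one, the CD estimates for a phone sum back to that phone's CI posterior, so the combined estimates still form a distribution over (phone, context) pairs.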
Dynamic Pronunciation Models for Automatic Speech Recognition
1999. Cited by 33 (2 self).
As of this writing, the automatic recognition of spontaneous speech by computer is fraught with errors; many systems transcribe one out of every three to five words incorrectly, whereas humans can transcribe spontaneous speech with one error in twenty words or better. This high error rate is due in part to the poor modeling of pronunciations within spontaneous speech. This dissertation examines how pronunciations vary in this speaking style, and how speaking rate and word predictability can be used to predict when greater pronunciation variation can be expected. It includes an investigation of the relationship between speaking rate, word predictability, pronunciations, and errors made by speech recognition systems. The results of these studies suggest that for spontaneous speech, it may be appropriate to build models for syllables and words that can dynamically change the pronunciations used in the speech recognizer based on the extended context (including surrounding words, phones, speaking rate, etc.). Implementation of new pronunciation models automatically derived from data within the ICSI speech recognition system has shown a 4-5% relative improvement on the Broadcast News recognition task. Roughly two thirds of these gains can be attributed to static baseform improvements; adding the ability to dynamically adjust pronunciations within the recognizer provides the other third of the improvement. The Broadcast News task also allows for comparison of performance on different styles of speech: the new pronunciation models do not help for pre-planned speech, but they provide a significant gain for spontaneous speech. Not only do the automatically learned pronunciation models capture some of the linguistic variation due to the speaking style, but they also represent vari...
Trainable Speech Synthesis
1996. Cited by 31 (3 self).
... addressed through improvements to its transcription, clustering, and segmentation capabilities. The LP synthesis scheme was replaced by a TD-PSOLA scheme which synthesised speech by concatenating waveform segments selected to represent each clustered state. The final system produced speech which, though in a monotone, was natural sounding, remarkably fluent, and highly intelligible. The segmental intelligibility was measured using the Modified Rhyme Test, and a 5.0% error rate obtained. The speech produced by the system mimicked the voice of the speaker used to record the training database. The system could be retrained on a new voice in less than 48 hours, and has been successfully trained on four voices.
Acknowledgements
There are a very large number of people to thank in connection with this work. I shall begin at the beginning, by thanking my original supervisor, the late Professor Frank Fallside. To him I am deeply grateful, both for letting me join the CUED Speec
Dynamic Programming Search for Continuous Speech Recognition
1999. Cited by 30 (0 self).
Initially introduced in the late 1960s and early 1970s, dynamic programming algorithms have become increasingly popular in automatic speech recognition. There are two reasons why this has occurred: First, the dynamic programming strategy can be combined with a very efficient and practical pruning strategy so that very large search spaces can be handled. Second, the dynamic programming strategy has turned out to be extremely flexible in adapting to new requirements. Examples of such requirements are the lexical tree organization of the pronunciation lexicon and the generation of a word graph instead of the single best sentence. In this paper, we attempt to systematically review the use of dynamic programming search strategies for small-vocabulary and large-vocabulary continuous speech recognition. The following methods are described in detail: search using a linear lexicon, search using a lexical tree, language-model look-ahead and word graph generation.
1 Introduction
Search strategie...
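Illustrative sketch (not the paper's algorithm): the pairing of dynamic programming with pruning mentioned in the abstract can be shown with a small time-synchronous Viterbi search over HMM states that drops hypotheses falling too far below the current best score. The function name `viterbi_beam`, the dictionary-based model layout and the toy numbers are assumptions made for this example; lexical trees, language-model look-ahead and word graphs are not modelled.

```python
import math

def viterbi_beam(obs_loglik, log_trans, log_init, beam=10.0):
    """Time-synchronous Viterbi search with beam pruning.

    obs_loglik: list over time of {state: log p(x_t | state)}
    log_trans:  {prev_state: {next_state: log transition probability}}
    log_init:   {state: log initial probability}
    beam:       hypotheses more than `beam` below the best score are dropped
    Returns (best_log_score, best_state_sequence).
    """
    # Active hypotheses: state -> (score, path so far)
    active = {s: (lp + obs_loglik[0][s], [s])
              for s, lp in log_init.items() if s in obs_loglik[0]}
    for frame in obs_loglik[1:]:
        # Pruning step: keep only hypotheses within the beam of the best one.
        best = max(score for score, _ in active.values())
        survivors = {s: h for s, h in active.items() if h[0] >= best - beam}
        # Dynamic programming step: keep the best predecessor per next state.
        nxt = {}
        for s, (score, path) in survivors.items():
            for t, lt in log_trans.get(s, {}).items():
                if t not in frame:
                    continue
                cand = score + lt + frame[t]
                if t not in nxt or cand > nxt[t][0]:
                    nxt[t] = (cand, path + [t])
        active = nxt
    final_state = max(active, key=lambda s: active[s][0])
    return active[final_state]

# Toy two-state left-to-right model with made-up probabilities.
log = math.log
log_init = {"a": log(1.0)}
log_trans = {"a": {"a": log(0.5), "b": log(0.5)}, "b": {"b": log(1.0)}}
obs = [
    {"a": log(0.9), "b": log(0.1)},
    {"a": log(0.8), "b": log(0.2)},
    {"a": log(0.1), "b": log(0.9)},
]
score, path = viterbi_beam(obs, log_trans, log_init)
```

A narrower `beam` trades search effort for the risk of pruning the path that would eventually have won; the sketch assumes at least one hypothesis survives every frame.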
Uncertainty decoding for noise robust speech recognition
In Proc. Interspeech, 2004. Cited by 26 (8 self).
This dissertation is the result of my own work and includes nothing which is the outcome of work done in collaboration. It has not been submitted in whole or in part for a degree at any other university. Some of the work has been published previously in conference proceedings
Frame Discrimination Training Of HMMs For Large Vocabulary Speech Recognition
Proc. ICASSP’99, 1999. Cited by 20 (4 self).
This paper describes the application of a discriminative HMM parameter estimation technique called Frame Discrimination (FD) to medium and large vocabulary continuous speech recognition. Previous work has shown that FD training can give better results than maximum mutual information (MMI) training for small tasks. The use of FD for much larger tasks required the development of a technique to be able to rapidly find the most likely set of Gaussians for each frame in the system. Experiments on the Resource Management and North American Business tasks show that FD training can give comparable improvements to MMI, but is less computationally intensive.
1. INTRODUCTION
Previous research has shown that the accuracy of a speech recognition system trained using Maximum Likelihood Estimation (MLE) can often be improved further using discriminative training. All such techniques normally give much greater improvements in recognition accuracy on the training data than on the test set except wh...
State-Based Gaussian Selection In Large Vocabulary Continuous Speech Recognition Using HMMs
1998. Cited by 19 (1 self).
This paper investigates the use of Gaussian Selection (GS) to increase the speed of a large vocabulary speech recognition system. Typically 30-70% of the computational time of a continuous density HMM-based speech recogniser is spent calculating probabilities. The aim of GS is to reduce this load by selecting the subset of Gaussian component likelihoods that should be computed given a particular input vector. This paper examines new techniques for obtaining "good" Gaussian subsets or "shortlists". All the new schemes make use of state information, specifically which state each of the Gaussian components belongs to. In this way a maximum number of Gaussian components per state may be specified, hence reducing the size of the shortlist. The first technique introduced is a simple extension of the standard GS method, which uses this state information. Then, more complex schemes based on maximising the likelihood of the training data are proposed. These new approaches are compared with the standard GS scheme on a large vocabulary speech recognition task. On this task, the use of state information reduced the percentage of Gaussians computed to 10-15%, compared with 20-30% for the standard GS scheme, with little degradation in performance.
Footnotes:
1. M.J.F. Gales is now at the IBM T.J. Watson Research Center, Yorktown Heights, NY 10598, USA.
2. K.M. Knill is now at Nuance Communications, 1380 Willow Rd, Menlo Park, CA 94025, USA.
List of Tables
1. Change in the average forced alignment likelihood of the ARPA 1994 H1 development data for SGS and SBGS systems, compared to the standard no GS system.
2. Recognition performance of the standard no GS, SGS and SBGS systems on the ARPA 1994 H...
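Illustrative sketch (not the paper's method): the abstract says that knowing which state each Gaussian belongs to lets a maximum number of components per state be imposed on the shortlist. The Python below shows only that capping idea. It assumes each candidate Gaussian already carries a closeness measure to the region the input vector falls in; the function name `state_capped_shortlist`, the tuple layout and the numbers are hypothetical, and the likelihood-based schemes mentioned in the abstract are not reproduced.

```python
from collections import defaultdict

def state_capped_shortlist(candidates, max_per_state):
    """Keep at most `max_per_state` Gaussians per HMM state.

    candidates: list of (gaussian_id, state_id, distance) tuples, where
                `distance` is a hypothetical closeness measure between the
                Gaussian and the region containing the input vector
                (smaller means closer).
    Returns the retained gaussian_ids, closest first within each state.
    """
    by_state = defaultdict(list)
    for gid, state, dist in candidates:
        by_state[state].append((dist, gid))
    shortlist = []
    for state in sorted(by_state):
        closest = sorted(by_state[state])[:max_per_state]
        shortlist.extend(gid for _, gid in closest)
    return shortlist

# Made-up candidates: three Gaussians in state s1, two in state s2.
candidates = [
    ("g1", "s1", 0.4), ("g2", "s1", 0.1), ("g3", "s1", 0.9),
    ("g4", "s2", 0.3), ("g5", "s2", 0.2),
]
shortlist = state_capped_shortlist(candidates, max_per_state=2)
```

Only the likelihoods of the Gaussians on the shortlist would then be computed exactly; how the remaining components are handled is left open here.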
The 1997 HTK Broadcast News Transcription System
1998. Cited by 19 (5 self).
This paper presents the recent development of the HTK broadcast news transcription system. Previously we have used data type specific modelling based on adapted Wall Street Journal trained HMMs. However, we are now using data for which no manual preclassification or segmentation is available and therefore automatic techniques are required and compatible acoustic modelling strategies must be adopted. A number of recognition experiments are presented that compare data-type specific and non-specific models; differing amounts of training data; the use of gender-dependent modelling and the effects of automatic data-type classification. Based on these experiments, the HTK system for the 1997 broadcast news evaluation was designed. A detailed description of this system is given which includes a class-based language modelling component. The complete system yields an overall word error rate of 22.0% on the 1996 unpartitioned broadcast news development test data and just 15.8% on the 1997 evalua...
A hidden Markov-model-based trainable speech synthesizer
1999. Cited by 19 (0 self).
This paper presents a new approach to speech synthesis in which a set of cross-word decision-tree state-clustered context-dependent hidden Markov models are used to define a set of subphone units to be used in a concatenation synthesizer. The models, trees, waveform segments and other parameters representing each clustered state are obtained completely automatically through training on a 1 hour single-speaker continuous-speech database. During synthesis the required utterance, specified as a string of words of known phonetic pronunciation, is generated as a sequence of these clustered states using a TD-PSOLA waveform concatenation synthesizer. The system produces speech which, though in a monotone, is both natural sounding and highly intelligible. A Modified Rhyme Test conducted to measure segmental intelligibility yielded a 5.0% error rate. The speech produced by the system mimics the voice of the speaker used to record the training database. The system can be retrained on...
The CU-HTK March 2000 Hub5E Transcription System
2000. Cited by 18 (1 self).
This paper describes the Cambridge University HTK (CU-HTK) system developed for the NIST March 2000 evaluation of English conversational telephone speech transcription (Hub5E). A range of new features have been added to the HTK system used in the 1998 Hub5 evaluation, and the changes taken together have resulted in an 11% relative decrease in word error rate on the 1998 evaluation test set. Major changes include the use of maximum mutual information estimation in training as well as conventional maximum likelihood estimation; the use of a full variance transform for adaptation; the inclusion of unigram pronunciation probabilities; and word-level posterior probability estimation using confusion networks for use in minimum word error rate decoding, confidence score estimation and system combination. On the March 2000 Hub5 evaluation set the CU-HTK system gave an overall word error rate of 25.4%, which was the best performance by a statistically significant margin. This paper describes th...
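Illustrative sketch (not from the paper): several abstracts in this listing report word error rates and relative reductions. The Python below shows how both quantities are conventionally computed, using word-level edit distance for the error rate. The function names and the example sentences and rates are made up for illustration, and the reference transcript is assumed to be non-empty.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between the reference words seen so
    # far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # deletion of a reference word
                cur[j - 1] + 1,          # insertion of a hypothesis word
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = cur
    return prev[-1] / len(ref)

def relative_reduction(baseline_wer, new_wer):
    """Fraction of the baseline error rate removed by the new system."""
    return (baseline_wer - new_wer) / baseline_wer

# One substitution ("sat" -> "sit") and one deletion ("the") in six words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
# Made-up rates: going from 40% to 36% WER is a 10% relative reduction.
gain = relative_reduction(0.40, 0.36)
```

A relative reduction is always measured against the baseline rate, so the same absolute change counts for more when the baseline error rate is low.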

