Use of syllable nuclei locations to improve ASR
- in Proc. of IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Kyoto, 2007
- Cited by 4 (2 self)
This work presents the use of dynamic Bayesian networks (DBNs) to jointly estimate word position and word identity in an automatic speech recognition system. In particular, we have augmented a standard Hidden Markov Model (HMM) with counts and locations of syllable nuclei. Three experiments are presented here. The first uses oracle syllable counts, the second uses oracle syllable nuclei locations, and the third uses estimated (non-oracle) syllable nuclei locations. All results are presented on the 10- and 500-word tasks of the SVitchboard corpus. The oracle experiments give relative improvements ranging from 7.0% to 37.2%. When using estimated syllable nuclei, a relative improvement of 3.1% is obtained on the 10-word task. Index Terms: Automatic speech recognition, dynamic Bayesian networks, syllables, speaking rate
Speech Variation and the Use of Distance Metrics on the Articulatory Feature Space
- ITRW Workshop on Speech Recognition and Intrinsic Variation, 2006
- Cited by 2 (2 self)
This paper describes ongoing research on the relation between variation in speech in the articulatory-acoustic domain and the variation as represented in the symbolic domain. More specifically, we address variation in speech as represented by articulatory features, and the description of variation in phone annotation and segmentation. Variation in speech is quantified by using distance metrics defined on the space spanned by articulatory features. We show a very good correspondence between the locations of events in the articulatory feature trajectories on the one hand, and the phone boundary locations as defined by manual segmentation on the other. This indicates that the asynchronous articulatory representation captures at least the information in the phone-level segmentation.
Using Syllable Nuclei Locations to Improve Automatic Speech Recognition in the Presence of Burst Noise
In this work we combine a conventional phone-based automatic speech recognizer with a classifier that detects syllable locations. This is done using a dynamic Bayesian network. Using oracle syllable detections we achieve a 17% relative reduction in word error rate on the 500-word task of the SVitchboard corpus. Using estimated locations we achieve a 2.1% relative reduction, which is significant at the 0.02 level. The improvement in the estimated case comes from reducing insertions caused by burst noise. Index Terms: Automatic speech recognition, dynamic Bayesian networks, syllables, speaking rate
Applications of Virtual-Evidence based Speech Recognizer Training
We present two applications of our previously proposed virtual-evidence (VE) based speech recognizer training algorithm [1, 2]. The first relates to two-pass training, where segmentations obtained during the first pass are used as VE to train the subsequent pass. We use the TIMIT phone and SVitchboard continuous speech recognition tasks to demonstrate the benefits of using VE based training in two-pass systems. The second application involves making use of functions that can incorporate prior domain knowledge to generate VE-scores. Here, in the case of TIMIT phone recognition, we show that using the proposed function to generate VE-scores yields a relative error rate reduction of about 6% over the baseline.
MODELLING THE PREPAUSAL LENGTHENING EFFECT FOR SPEECH RECOGNITION: A DYNAMIC BAYESIAN NETWORK APPROACH
In speech, the unit preceding a pause tends to lengthen. This work presents the use of a dynamic Bayesian network to model this prepausal lengthening effect for robust speech recognition. Specifically, we introduce two distributions to model inter-state transitions in prepausal and non-prepausal words, respectively. The selection of the transition distribution depends on a random variable whose value is influenced by whether a pause will appear between the current and the following word. Two experiments are presented here. The first considers pauses hypothesised during speech decoding. The second employs an extra component for speech/non-speech determination. By modelling the prepausal lengthening effect we achieve a 5.5% relative reduction in word error rate on the 500-word task of the SVitchboard corpus. Index Terms: Prepausal lengthening, duration, prosody, robust speech recognition, dynamic Bayesian networks
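
Several of the abstracts above report a "relative reduction" in word error rate (WER). A minimal sketch of how such a figure is conventionally computed follows, assuming the usual definition (baseline minus new, divided by baseline). The function name and the WER values are hypothetical and are not taken from any of the listed papers.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Return the relative WER reduction in percent.

    Assumes the conventional definition:
        100 * (baseline_wer - new_wer) / baseline_wer
    Both arguments are absolute word error rates in percent.
    """
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# Hypothetical absolute WERs, chosen only for illustration.
baseline = 50.0   # baseline system, absolute WER in percent
improved = 47.25  # augmented system, absolute WER in percent

print(f"{relative_wer_reduction(baseline, improved):.1f}% relative reduction")
```

With these illustrative values, an absolute drop of 2.75 points corresponds to a 5.5% relative reduction. A relative figure therefore cannot be compared across tasks without knowing each baseline.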

