Results 1 - 4 of 4
Applying Vocal Tract Length Normalization to Meeting Recordings
2005
Cited by 3 (1 self)
Vocal Tract Length Normalisation (VTLN) is a commonly used technique to normalise for inter-speaker variability. It is based on the speaker-specific warping of the frequency axis, parameterised by a scalar warp factor. This factor is typically estimated using maximum likelihood. We discuss how VTLN may be applied to multiparty conversations, reporting a substantial decrease in word error rate in experiments using the ICSI meetings corpus. We investigate the behaviour of the VTLN warping factor and show that a stable estimate is not obtained. Instead it appears to be influenced by the context of the meeting, in particular the current conversational partner. These results are consistent with predictions made by the psycholinguistic interactive alignment account of dialogue, when applied at the acoustic and phonological levels.
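As an illustration of the warp factor and its maximum-likelihood estimation described in the abstract above, the following is a minimal sketch, not the authors' implementation. It assumes a piecewise-linear warping function and a simple grid search; the names `warp_frequency`, `estimate_warp_factor`, `toy_score`, the cutoff ratio, and the grid range are all hypothetical choices made for this example. In a real system the scoring function would return the acoustic-model log-likelihood of a speaker's data after warping with a given factor.

```python
def warp_frequency(f, alpha, f_nyquist=8000.0, cut_ratio=0.8):
    """Piecewise-linear frequency warp (one common VTLN form, assumed here).

    Below the cutoff the axis is scaled by alpha; above it, a second linear
    segment maps the Nyquist frequency onto itself.
    """
    f_cut = cut_ratio * f_nyquist
    if f <= f_cut:
        return alpha * f
    slope = (f_nyquist - alpha * f_cut) / (f_nyquist - f_cut)
    return alpha * f_cut + slope * (f - f_cut)


def estimate_warp_factor(score_fn, lo=0.88, hi=1.12, step=0.02):
    """Grid search: return the warp factor with the highest score.

    score_fn is a hypothetical callable standing in for the log-likelihood
    of the warped features under the acoustic model.
    """
    n_steps = int(round((hi - lo) / step))
    grid = [round(lo + i * step, 2) for i in range(n_steps + 1)]
    return max(grid, key=score_fn)


# Toy stand-in for a log-likelihood that peaks at alpha = 1.04.
def toy_score(alpha):
    return -(alpha - 1.04) ** 2


best_alpha = estimate_warp_factor(toy_score)
```

Because the estimate is recomputed from whatever data is scored, running such a search separately per meeting or per conversational segment is one way the instability reported in the abstract could be observed.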
Mental State Detection of Dialogue System Users via Spoken Language
ISCA & IEEE Workshop on Spontaneous Speech Processing and Recognition, 2003
Cited by 2 (1 self)
This paper presents an approach to simulate the mental activities of children during their interaction with computers through their spoken language. The mental activities are categorized into three states: confidence, confusion and frustration. Two knowledge sources are used in the detection. One is prosody, which indicates the utterance type and the user's attitude. The other is embedded key words/phrases, which help interpret the utterances. Moreover, it is found that children's speech exhibits acoustic characteristics very different from those of adults' speech. Given the uniqueness of children's speech, this paper applies a vocal-tract-length-normalization (VTLN)-based technique to compensate for both inter-speaker and intra-speaker variability in children's speech. The detected key words/phrases are then integrated with prosodic information as the cues for the MAP decision of mental states. Tests on a set of 50 utterances collected from the project experiment showed a classification accuracy of 74%.
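The MAP decision mentioned in the abstract above can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not say how the prosodic and key-word cues are combined, so the sketch assumes they are conditionally independent given the state, and every probability below is made up for the example. The function name `map_decision` is hypothetical.

```python
import math

STATES = ("confidence", "confusion", "frustration")


def map_decision(priors, prosody_lik, keyword_lik):
    """Return the state maximising prior * P(prosody|state) * P(keywords|state).

    Log-probabilities are summed to avoid numerical underflow. The
    conditional-independence assumption is ours, not the paper's.
    """
    def log_posterior(state):
        return (math.log(priors[state])
                + math.log(prosody_lik[state])
                + math.log(keyword_lik[state]))
    return max(STATES, key=log_posterior)


# Made-up numbers, purely for illustration.
priors = {"confidence": 1 / 3, "confusion": 1 / 3, "frustration": 1 / 3}
prosody_lik = {"confidence": 0.2, "confusion": 0.5, "frustration": 0.3}
keyword_lik = {"confidence": 0.1, "confusion": 0.6, "frustration": 0.3}

decided_state = map_decision(priors, prosody_lik, keyword_lik)
```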
Speaker normalization with respect to F_0: a perceptual approach
2003
Cited by 1 (0 self)
A speaker normalization scheme that uses explicit knowledge of acoustic phonetics is presented. The scheme warps the frequency axis linearly in critical band rate with respect to the fundamental frequency F_0. It thus allows immediate adaptation to a new speaker, which is an advantage over commonly used schemes. Variants with different values of F_0 and different parameters have been evaluated on several tasks of SpeechDat(II). The results show significant performance improvements on three tasks with monophone models; the most prominent result is a reduction in WER of 44.5% for an isolated digit task. However, the results achieved with tied triphone models are very modest. It is argued that the normalization scheme may still be correct but that the MFCC feature extraction erases its effect. Evidence is given for the need for a new feature extraction method that locates spectral peaks and ignores irrelevant portions of the spectrum.
Handling Phonetic Context and Speaker Variation in a Structure-Based Speech Recognizer
Recently we have developed a novel type of structure-based speech recognizer, which uses a parameterized, non-recursive "hidden" trajectory model of vocal tract resonances (VTR) or formants to capture the dynamic structure of long-range speech coarticulation and reduction. The underlying model of this recognizer carries out bi-directional FIR filtering on the piecewise constant sequences of the VTR targets. In this paper, we elaborate on two key aspects of the model. First, the phonetic context controls the movement direction and thus the formation of the VTR trajectories. This provides "structured" context dependency for speech acoustics without using the context-dependent parameters required by HMMs. Second, VTR targets, as the key context-independent parameters of the model, vary across speakers. We describe an effective target-value normalization algorithm that can be applied to both training and unknown test speakers. We report experimental results demonstrating the effectiveness of the normalization algorithm in the context of structure-based speech recognition. We also provide a computational analysis of the HTM-based speech decoder.
Index Terms: hidden trajectory model, phonetic contexts, normalization, vocal tract resonance, targets
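To make the bi-directional FIR filtering of piecewise-constant targets concrete, here is a minimal sketch under assumptions of our own. The abstract does not give the filter coefficients, so a symmetric, exponentially decaying kernel (weights proportional to gamma^|k|) is assumed; the edge handling (repeating the first and last target), the parameter values, and the function name `bidirectional_fir` are likewise hypothetical.

```python
def bidirectional_fir(targets, gamma=0.6, half_width=3):
    """Smooth a piecewise-constant target sequence with a symmetric FIR filter.

    Each output frame is a normalised weighted average of targets from
    half_width frames in the past to half_width frames in the future, so
    both preceding and following phones pull on the trajectory.
    """
    offsets = range(-half_width, half_width + 1)
    weights = [gamma ** abs(k) for k in offsets]
    norm = sum(weights)
    n = len(targets)
    trajectory = []
    for t in range(n):
        acc = 0.0
        for k, w in zip(offsets, weights):
            idx = min(max(t + k, 0), n - 1)  # repeat edge targets (assumption)
            acc += w * targets[idx]
        trajectory.append(acc / norm)
    return trajectory


# Two phone segments with made-up targets (in Hz), five frames each.
targets = [500.0] * 5 + [1500.0] * 5
trajectory = bidirectional_fir(targets)
```

In this toy example the output starts moving toward the second target before the segment boundary is reached, which is the kind of anticipatory, context-driven movement the abstract attributes to the model without context-dependent parameters.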

