Vocal tract length normalization for large vocabulary continuous speech recognition (1997)

by P Zhan, A Waibel

Results 1 - 4 of 4

Applying Vocal Tract Length Normalization to Meeting Recordings

by Giulia Garau, Steve Renals, Thomas Hain, 2005
Cited by 3 (1 self)
Vocal Tract Length Normalisation (VTLN) is a commonly used technique to normalise for inter-speaker variability. It is based on the speaker-specific warping of the frequency axis, parameterised by a scalar warp factor. This factor is typically estimated using maximum likelihood. We discuss how VTLN may be applied to multiparty conversations, reporting a substantial decrease in word error rate in experiments using the ICSI meetings corpus. We investigate the behaviour of the VTLN warping factor and show that a stable estimate is not obtained. Instead it appears to be influenced by the context of the meeting, in particular the current conversational partner. These results are consistent with predictions made by the psycholinguistic interactive alignment account of dialogue, when applied at the acoustic and phonological levels.
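
As a rough illustration of the warp-factor estimation this abstract describes, the following minimal sketch applies a piecewise-linear frequency warp and picks the warp factor from a grid by maximum likelihood. It is not the authors' implementation: the function names, the 0.88 to 1.12 grid, the 85% knee, and the toy Gaussian score over spectral peaks are all assumptions made for the example. In a real recogniser the likelihood would come from the acoustic model evaluated on warped features.

```python
def warp_frequency(f, alpha, f_max=8000.0, knee=0.85):
    """Piecewise-linear warp (assumed form): scale by alpha below the knee,
    then follow a straight segment so that f_max still maps to f_max."""
    f_c = knee * f_max
    if f <= f_c:
        return alpha * f
    slope = (f_max - alpha * f_c) / (f_max - f_c)
    return alpha * f_c + slope * (f - f_c)


def log_likelihood(peaks, reference, sigma=50.0):
    """Toy stand-in for an acoustic model: independent Gaussians centred on
    the reference peaks (additive constants dropped)."""
    return -0.5 * sum(((p - r) / sigma) ** 2 for p, r in zip(peaks, reference))


def estimate_warp(peaks, reference, grid):
    """Grid search: return the warp factor whose warped peaks score highest."""
    def score(alpha):
        warped = [warp_frequency(p, alpha) for p in peaks]
        return log_likelihood(warped, reference)
    return max(grid, key=score)


# Hypothetical data: a speaker whose spectral peaks sit 10% above the reference.
GRID = [round(0.88 + 0.02 * i, 2) for i in range(13)]  # 0.88, 0.90, ..., 1.12
reference = [500.0, 1500.0, 2500.0]
speaker = [1.1 * r for r in reference]
best_alpha = estimate_warp(speaker, reference, GRID)
print(best_alpha)
```

On this toy data the search selects 0.90, the grid point nearest 1/1.1. Re-running the estimate per utterance or per conversational partner, rather than once per speaker, is one way to observe the kind of instability in the warp factor that the abstract reports.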

Mental State Detection of Dialogue System Users via Spoken Language

by Tong Zhang, Mark Hasegawa-Johnson, Stephen E. Levinson - ISCA & IEEE Workshop on Spontaneous Speech Processing and Recognition, 2003
Cited by 2 (1 self)
This paper presents an approach to simulate the mental activities of children during their interaction with computers through their spoken language. The mental activities are categorized into three states: confidence, confusion and frustration. Two knowledge sources are used in the detection. One is prosody, which indicates utterance type and the user's attitude. The other is embedded key words/phrases, which help interpret the utterances. Moreover, it is found that children's speech exhibits very different acoustic characteristics from adults' speech. Given the uniqueness of children's speech, this paper applies a vocal-tract-length-normalization (VTLN)-based technique to compensate for both inter-speaker and intra-speaker variability in children's speech. The detected key words/phrases are then integrated with prosodic information as the cues for the MAP decision of mental states. Tests on a set of 50 utterances collected from the project experiment showed a classification accuracy of 74%.

Speaker normalization with respect to F_0: a perceptual approach

by Ulrike Glavitsch (Eidgenössische Technische Hochschule Zürich), 2003
Cited by 1 (0 self)
A speaker normalization scheme that uses explicit knowledge of acoustic phonetics is presented. The scheme warps the frequency axis linearly in critical band rate with respect to the fundamental frequency F_0. It thus allows an immediate adaptation to a new speaker, which is an advantage over commonly used schemes. Variants with different values of F_0 and different parameters have been evaluated on several tasks of SpeechDat(II). The results show significant performance improvements on three tasks with monophone models; the most prominent result is a reduction in WER of 44.5% for an isolated digit task. However, the results achieved with tied triphone models are very modest. It is argued that the normalization scheme may still be correct but that the MFCC feature extraction erases its effect. Evidence is given for the need for a new feature extraction method that locates spectral peaks and ignores irrelevant portions of the spectrum.

Handling Phonetic Context and Speaker Variation in a Structure-Based Speech Recognizer

by Dong Yu, Li Deng, Alex Acero
Recently we have developed a novel type of structure-based speech recognizer, which uses a parameterized, non-recursive "hidden" trajectory model of vocal tract resonances (VTR) or formants to capture the dynamic structure of long-range speech coarticulation and reduction. The underlying model of this recognizer carries out bi-directional FIR filtering on the piecewise constant sequences of the VTR targets. In this paper, we elaborate on two key aspects of the model. First, the phonetic context controls the movement direction and thus the formation of the VTR trajectories. This provides "structured" context dependency for speech acoustics without using context-dependent parameters as required by HMMs. Second, VTR targets, as the key context-independent parameters of the model, vary across speakers. We describe an effective target-value normalization algorithm that can be applied to both training and unknown test speakers. We report experimental results demonstrating the effectiveness of the normalization algorithm in the context of structure-based speech recognition. We also provide computational analysis of the HTM-based speech decoder.
Index Terms: hidden trajectory model, phonetic contexts, normalization, vocal tract resonance, targets
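
The bi-directional FIR filtering mentioned in this abstract can be pictured with a small sketch. The abstract does not give the filter coefficients, so the symmetric, exponentially decaying kernel below, its `gamma` and `half_width` parameters, and the edge handling are assumptions chosen only to show the idea: each output frame is a weighted average of past and future targets, so a piecewise-constant target sequence becomes a smooth trajectory.

```python
def bidirectional_fir(targets, gamma=0.6, half_width=3):
    """Smooth a piecewise-constant target sequence with a symmetric
    (non-causal) FIR kernel proportional to gamma ** |k|, normalised to
    unit gain. Kernel shape and parameters are illustrative assumptions."""
    kernel = [gamma ** abs(k) for k in range(-half_width, half_width + 1)]
    total = sum(kernel)
    kernel = [w / total for w in kernel]

    n = len(targets)
    trajectory = []
    for t in range(n):
        acc = 0.0
        for j, w in enumerate(kernel):
            # Clamp indices so that edge targets are repeated at the boundaries.
            idx = min(max(t + j - half_width, 0), n - 1)
            acc += w * targets[idx]
        trajectory.append(acc)
    return trajectory


# Hypothetical resonance targets (Hz) for two consecutive phone segments.
targets = [500.0] * 5 + [1500.0] * 5
trajectory = bidirectional_fir(targets)
print([round(v, 1) for v in trajectory])
```

Because the kernel looks both backward and forward, the trajectory starts moving toward the second target before the segment boundary and keeps moving after it, which is how a context-independent target can still yield context-dependent acoustics.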