Speech Formant Frequency And Bandwidth Tracking Using Multiband Energy Demodulation
- J. Acoust. Soc. Amer., 1996
Cited by 12 (3 self)
In this paper, the AM-FM modulation model and a multiband analysis/demodulation scheme are applied to speech formant frequency and bandwidth tracking. Filtering is performed by a bank of Gabor bandpass filters. Each band is demodulated to amplitude envelope and instantaneous frequency signals using the energy separation algorithm. Short-time formant frequency and bandwidth estimates are obtained from the instantaneous amplitude and frequency signals and their merits are presented. The estimates are used to determine the formant locations and bandwidths. Performance and computational issues (frequency domain implementation) are discussed. Overall, the multiband demodulation approach to formant tracking is easy to implement, provides accurate formant frequency and realistic bandwidth estimates, and performs well in the presence of nasalization. 1. INTRODUCTION Formant tracking is an old problem that has received much attention lately, mainly because of the deficiencies of the well esta...
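As a rough illustration of the demodulation step, the sketch below applies one common discrete energy separation variant (DESA-2, built on the Teager-Kaiser energy operator) to a single band signal. The abstract does not say which variant the paper uses, so the choice of DESA-2 and the function names `teager` and `desa2` are assumptions made here for illustration only.

```python
import math

def teager(x):
    """Teager-Kaiser energy psi[n] = x[n]^2 - x[n-1]*x[n+1] for interior samples."""
    return [x[n] ** 2 - x[n - 1] * x[n + 1] for n in range(1, len(x) - 1)]

def desa2(x):
    """DESA-2 estimates of instantaneous frequency (rad/sample) and amplitude.

    Assumes a narrowband signal with frequency below pi/2 rad/sample.
    Entry j of each output corresponds to input sample n = j + 2.
    """
    # Symmetric difference y[n] = x[n+1] - x[n-1]; y[k] corresponds to n = k + 1.
    y = [x[n + 1] - x[n - 1] for n in range(1, len(x) - 1)]
    psi_x = teager(x)  # psi_x[k] corresponds to n = k + 1
    psi_y = teager(y)  # psi_y[j] corresponds to n = j + 2
    freqs, amps = [], []
    for j, py in enumerate(psi_y):
        px = psi_x[j + 1]  # align both energies on sample n = j + 2
        if px <= 0.0 or py <= 0.0:
            # The estimate is undefined when either energy is non-positive.
            freqs.append(float("nan"))
            amps.append(float("nan"))
            continue
        arg = max(-1.0, min(1.0, 1.0 - py / (2.0 * px)))
        freqs.append(0.5 * math.acos(arg))
        amps.append(2.0 * px / math.sqrt(py))
    return freqs, amps
```

For a pure cosine `A*cos(W*n + phi)` the two energies are constant and the estimates recover `W` and `A` exactly, up to rounding. On real speech bands the raw estimates are noisy and are usually smoothed before further use.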
A Structured Speech Model with Continuous Hidden Dynamics and Prediction-Residual Training for Tracking Vocal Tract Resonances
- Proc. ICASSP, 2004
Cited by 10 (4 self)
A novel approach is developed for efficient and accurate tracking of vocal tract resonances, which are natural frequencies of the resonator from larynx to lips, in fluent speech. The tracking algorithm is based on a version of the structured speech model consisting of continuous-valued hidden dynamics and a piecewise-linearized prediction function from resonance frequencies and bandwidths to LPC cepstra. We present details of the piecewise linearization design process and an adaptive training technique for the parameters that characterize the prediction residuals. An iterative tracking algorithm is described and evaluated that embeds both the prediction-residual training and the piecewise linearization design in an adaptive Kalman filtering framework. Experiments on tracking vocal tract resonances in Switchboard speech data demonstrate high accuracy in the results, as well as the effectiveness of residual training embedded in the algorithm. Our approach differs from traditional formant trackers in that it provides meaningful results even during consonantal closures when the supra-laryngeal source may cause no spectral prominences in speech acoustics.
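The Kalman filtering idea can be illustrated with a much simpler stand-in than the adaptive, piecewise-linearized tracker the abstract describes. The sketch below assumes a scalar random-walk state observed directly in noise; the function name `kalman_track` and the variances `q`, `r` and `p0` are hypothetical values chosen for illustration, not taken from the paper.

```python
def kalman_track(observations, q=100.0, r=2500.0, p0=1e4):
    """Scalar Kalman filter with a random-walk state model.

    observations: noisy per-frame frequency measurements (Hz).
    q: process noise variance, r: observation noise variance,
    p0: initial state variance. All three are illustrative assumptions.
    """
    x = observations[0]  # initialise the state at the first measurement
    p = p0
    smoothed = []
    for z in observations:
        p = p + q              # predict: the state is assumed to drift slowly
        k = p / (p + r)        # Kalman gain
        x = x + k * (z - x)    # update with the innovation
        p = (1.0 - k) * p
        smoothed.append(x)
    return smoothed
```

The tracker in the abstract differs in that its observations are LPC cepstra, related to the hidden resonances through a nonlinear prediction function, with residual parameters trained adaptively.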
Formant Tracking Using Segmental Phonemic Information
- in Proc. Eur. Conf. Speech Communication and Technology (Eurospeech), 1999
Cited by 5 (2 self)
A new formant tracking algorithm using phoneme-dependent nominal formant values is tested. The algorithm consists of three phases: (1) analysis, (2) segmentation, and (3) formant tracking. In the analysis phase, formant candidates are obtained by solving for the roots of the linear prediction polynomial. In the segmentation phase, the input text is converted into a sequence of phonemic symbols. Then the sequence is time-aligned with the speech utterance. Finally, a set of formant candidates that are close to the nominal formant estimates while satisfying the continuity constraints is chosen. The new algorithm significantly reduces the formant tracking error rate (3.62%) compared with a formant tracking algorithm using only continuity constraints (13.04%). We will also discuss how to further reduce the tracking error rate. INTRODUCTION In the Bell Labs' Text-To-Speech (TTS) system [1], a limited number of acoustic units is stored in the inventory table. Therefore, it is important to be able to...
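The analysis phase can be sketched as a conversion from linear-prediction poles to candidate frequencies and bandwidths, using the standard relations f = angle(z)·fs/(2π) and B = -ln|z|·fs/π. The code assumes the polynomial roots have already been computed elsewhere (for example with a numerical root finder). The function names and the 400 Hz bandwidth cutoff are illustrative assumptions, not values from the paper.

```python
import cmath
import math

def pole_to_formant(pole, sample_rate):
    """Map a complex pole of the LP filter to (frequency_hz, bandwidth_hz)."""
    freq = cmath.phase(pole) * sample_rate / (2.0 * math.pi)
    bandwidth = -math.log(abs(pole)) * sample_rate / math.pi
    return freq, bandwidth

def formant_candidates(poles, sample_rate, max_bandwidth=400.0):
    """Return candidates sorted by frequency.

    Keeps one pole from each conjugate pair (positive imaginary part) and
    drops very broad resonances; the cutoff is a hypothetical choice.
    """
    candidates = []
    for pole in poles:
        if pole.imag <= 0.0:
            continue
        freq, bandwidth = pole_to_formant(pole, sample_rate)
        if bandwidth < max_bandwidth:
            candidates.append((freq, bandwidth))
    return sorted(candidates)
```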
A database of vocal tract resonance trajectories for research in speech processing
- Proc. ICASSP, 2006
Cited by 4 (1 self)
While vocal tract resonances (VTRs, or formants that are defined as such resonances) are known to play a critical role in human speech perception and in computer speech processing, there has been a lack of standard databases needed for the quantitative evaluation of automatic VTR extraction techniques. We report in this paper on our recent effort to create a publicly available database of the first three VTR frequency trajectories. The database contains a representative subset of the TIMIT corpus with respect to speaker, gender, dialect and phonetic context, with a total of 538 sentences. A Matlab-based labeling tool is developed, with high-resolution wideband spectrograms displayed to assist in visual identification of VTR frequency values, which are then recorded via mouse clicks and local spline interpolation. Special attention is paid to VTR values during consonant-to-vowel (CV) and vowel-to-consonant (VC) transitions, and to speech segments with vocal tract anti-resonances. Using this database, we quantitatively assess two common automatic VTR tracking techniques in terms of their average tracking errors analyzed within each of the six major broad phonetic classes as well as during CV and VC transitions. The potential use of the VTR database for research in several areas of speech processing is discussed.
Characterising Formant Trajectories By Tracking Vocal Tract Resonances
- 1995
Cited by 3 (3 self)
This article presents a characterisation of formant trajectories based on the tracking of each resonance of the vocal tract. Thanks to an original method called nomograms with decoupled cavities, the optimal constriction locations of the area functions of ten prototypical French vowels are given, together with the main affiliations of each formant: the formants affiliated with the back cavity are noted as R1, R3, ... while the formants affiliated with the front one are noted as R2, R4, ... The typology of the focal points (convergence of formants due to a change of affiliation between formants) is confirmed by an extensive analysis of natural vowel-vowel transitions. An original vowel space (R1-R2) is then proposed which maximises the distances between vowels and favours the effective tracking of the vocal tract resonances by assuming an active filtering of R3. A simple normalisation of the formant space highlights the promising performance of a speaker-independent vowel identification system based on such an adaptive filtering. Articulatory and perceptual experiments provide evidence converging towards an effective control of an R1-R3 relation independent of R2.
Acoustic-Feature-Based Frequency Warping for Speaker Normalization
- 1998
Cited by 2 (0 self)
PUBLISHED AS
- 2003
Cited by 1 (0 self)
State-of-the-art automatic speech recognition (ASR) techniques are typically based on hidden Markov models (HMMs) for the modeling of temporal sequences of feature vectors extracted from the speech
Formant Tracking Using Context-Dependent Phonemic Information
Cited by 1 (0 self)
A new formant-tracking algorithm using phoneme information is proposed. Conventional formant-tracking algorithms obtain formant tracks by analyzing the acoustic speech signal using continuity constraints without any additional information. The formant-tracking error rate of the conventional methods is reportedly in the range of 10%–20%. In this paper, we show that if text or phoneme transcription of speech utterances is available, the error rate can be significantly reduced. The basic idea behind this approach is that, given the phoneme identity, formant-tracking algorithms have a better idea of where to look for formants. The algorithm consists of three phases: 1) analysis, 2) segmentation and alignment, and 3) formant tracking by the Viterbi search algorithm. In the analysis phase, formant candidates are obtained for each analysis frame by solving the linear prediction polynomial. In the segmentation and alignment phase,
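The third phase can be sketched as a small Viterbi (dynamic programming) search over per-frame candidates for a single formant. The cost combines distance to a phoneme-dependent nominal value with a frame-to-frame continuity penalty. The absolute-difference costs, the weights, and the name `track_formant` are assumptions for illustration; the paper's actual cost function is not given in the abstract.

```python
def track_formant(candidates, nominal, w_nominal=1.0, w_continuity=1.0):
    """Choose one candidate per frame by minimising a cumulative cost.

    candidates: list of non-empty per-frame lists of frequencies (Hz).
    nominal: per-frame nominal frequency (Hz) from the aligned phonemes.
    """
    # Cumulative cost of the best path ending at each candidate of frame 0.
    prev_cost = [w_nominal * abs(c - nominal[0]) for c in candidates[0]]
    backpointers = []
    for t in range(1, len(candidates)):
        prev = candidates[t - 1]
        cur_cost, cur_back = [], []
        for c in candidates[t]:
            # Transition cost from every candidate of the previous frame.
            totals = [prev_cost[j] + w_continuity * abs(c - prev[j])
                      for j in range(len(prev))]
            best_j = min(range(len(prev)), key=totals.__getitem__)
            cur_cost.append(totals[best_j] + w_nominal * abs(c - nominal[t]))
            cur_back.append(best_j)
        backpointers.append(cur_back)
        prev_cost = cur_cost
    # Backtrack from the cheapest final candidate.
    i = min(range(len(prev_cost)), key=prev_cost.__getitem__)
    path = [i]
    for cur_back in reversed(backpointers):
        i = cur_back[i]
        path.append(i)
    path.reverse()
    return [candidates[t][i] for t, i in enumerate(path)]
```

With `w_nominal=0` the search reduces to a continuity-only tracker, which is the kind of baseline the abstract compares against.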
Pitch-based Gender Identification with Two-stage Classification
Cited by 1 (0 self)
In this paper, we address the speech-based gender identification problem. Mel-Frequency Cepstral Coefficients (MFCC) of voice samples are typically used as the features for gender identification. However, MFCC-based classification incurs high complexity. This paper proposes a novel pitch-based gender identification system with a two-stage classifier to ensure accurate identification and low complexity. The first stage of the classifier identifies and labels all the speakers whose pitch clearly indicates the gender of the speaker; the complexity of this stage is very low since only a threshold-based decision rule on a scalar (i.e., pitch) is used. The ambiguous voice samples from all the other speakers (which cannot be classified with high accuracy by the first stage, and can be regarded as suspicious speakers or difficult cases) are forwarded to the second stage for finer examination; the second stage of our classifier uses a Gaussian Mixture Model (GMM) to accurately separate voice samples based on gender. Experimental results show that our system is speech language/content independent, microphone independent, and robust against noisy recording conditions. Our system is extremely accurate, with a probability of correct classification of 98.65%, and very efficient, with about 5 seconds required for feature extraction and classification.
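The two-stage decision flow can be sketched as a cheap pitch threshold test followed by a fallback classifier for ambiguous cases. The thresholds of 140 Hz and 190 Hz are placeholder values, and `second_stage` is a hypothetical callable standing in for the GMM; neither is taken from the paper.

```python
def classify_gender(pitch_hz, features, second_stage, low=140.0, high=190.0):
    """Return (label, stage_used).

    Stage 1 labels speakers whose pitch is clearly low or clearly high.
    Stage 2 hands ambiguous cases to `second_stage`, a callable that maps
    a feature vector to a label (a GMM in the paper, any model here).
    """
    if pitch_hz < low:
        return "male", 1
    if pitch_hz > high:
        return "female", 1
    return second_stage(features), 2
```

Only samples that fall between the two thresholds pay the cost of the second-stage model, which is where the complexity saving described in the abstract comes from.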

