A Structured Speech Model with Continuous Hidden Dynamics and Prediction-Residual Training for Tracking Vocal Tract Resonances
Proc. ICASSP, 2004
Cited by 10 (4 self)
A novel approach is developed for efficient and accurate tracking of vocal tract resonances, which are natural frequencies of the resonator from larynx to lips, in fluent speech. The tracking algorithm is based on a version of the structured speech model consisting of continuous-valued hidden dynamics and a piecewise-linearized prediction function from resonance frequencies and bandwidths to LPC cepstra. We present details of the piecewise linearization design process and an adaptive training technique for the parameters that characterize the prediction residuals. An iterative tracking algorithm is described and evaluated that embeds both the prediction-residual training and the piecewise linearization design in an adaptive Kalman filtering framework. Experiments on tracking vocal tract resonances in Switchboard speech data demonstrate high accuracy in the results, as well as the effectiveness of residual training embedded in the algorithm. Our approach differs from traditional formant trackers in that it provides meaningful results even during consonantal closures when the supra-laryngeal source may cause no spectral prominences in speech acoustics.
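The abstract above does not give the prediction function, so the following is only a minimal sketch under stated assumptions. It uses the standard all-pole relation between one resonance (frequency f, bandwidth b) and its LPC cepstrum, c_n = (2/n) exp(-pi n b / fs) cos(2 pi n f / fs), linearizes it around the current estimate, and applies one Kalman update per frame. The scalar state (a single resonance frequency with a fixed bandwidth), the random-walk dynamics, the noise variances `q` and `r`, and the names `lpc_cepstrum`, `cepstrum_slope`, and `kalman_step` are all illustrative choices, not the paper's design. The `residual_mean` argument stands in for the trained prediction-residual mean and is left at zero here.

```python
import math

def lpc_cepstrum(freq_hz, bw_hz, order, fs):
    """LPC cepstrum c_1..c_order contributed by one resonance (pole pair)."""
    return [
        (2.0 / n) * math.exp(-math.pi * n * bw_hz / fs)
        * math.cos(2.0 * math.pi * n * freq_hz / fs)
        for n in range(1, order + 1)
    ]

def cepstrum_slope(freq_hz, bw_hz, order, fs):
    """Derivative of each c_n with respect to the resonance frequency."""
    return [
        -(4.0 * math.pi / fs) * math.exp(-math.pi * n * bw_hz / fs)
        * math.sin(2.0 * math.pi * n * freq_hz / fs)
        for n in range(1, order + 1)
    ]

def kalman_step(x, p, obs, bw_hz, fs, q, r, residual_mean=None):
    """One predict/update cycle for a scalar frequency state.

    Assumptions (hypothetical): random-walk state with variance q,
    independent observation noise with variance r per cepstral coefficient.
    """
    order = len(obs)
    mu = residual_mean or [0.0] * order
    p_pred = p + q                                # predict: random walk
    pred = lpc_cepstrum(x, bw_hz, order, fs)      # nonlinear prediction
    h = cepstrum_slope(x, bw_hz, order, fs)       # local linearization
    # Information-form update, valid for a scalar state with diagonal noise.
    p_post = 1.0 / (1.0 / p_pred + sum(hj * hj for hj in h) / r)
    innov = [y - c - m for y, c, m in zip(obs, pred, mu)]
    x_post = x + p_post * sum(hj * e for hj, e in zip(h, innov)) / r
    return x_post, p_post

# Toy run: noise-free cepstra of a 500 Hz resonance, tracker started at 450 Hz.
fs, bw, order = 8000.0, 100.0, 8
observed = lpc_cepstrum(500.0, bw, order, fs)
x, p = 450.0, 100.0 ** 2
for _ in range(10):
    x, p = kalman_step(x, p, observed, bw, fs, q=25.0, r=1e-4)
```

Re-linearizing around the latest estimate at every frame is what makes the update behave like a locally linear approximation of the nonlinear mapping; a piecewise-linear table, as the abstract describes, would precompute those slopes per region instead.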
Formant Tracking Using Segmental Phonemic Information
Proc. Eur. Conf. Speech Communication and Technology (Eurospeech), 1999
Cited by 5 (2 self)
A new formant tracking algorithm using phoneme-dependent nominal formant values is tested. The algorithm consists of three phases: (1) analysis, (2) segmentation, and (3) formant tracking. In the analysis phase, formant candidates are obtained by solving for the roots of the linear prediction polynomial. In the segmentation phase, the input text is converted into a sequence of phonemic symbols, and the sequence is then time-aligned with the speech utterance. Finally, a set of formant candidates is chosen that is close to the nominal formant estimates while satisfying the continuity constraints. The new algorithm significantly reduces the formant tracking error rate (3.62%) compared with a formant tracking algorithm using only continuity constraints (13.04%). We will also discuss how to further reduce the tracking error rate.
INTRODUCTION In the Bell Labs' Text-To-Speech (TTS) system [1], a limited number of acoustic units is stored in the inventory table. Therefore, it is important to be able to...
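The third phase described above (choosing candidates that stay close to the nominal values while remaining continuous) can be sketched as a small dynamic-programming search. This is an illustration under assumptions, not the paper's implementation: it tracks a single formant, uses absolute differences in Hz for both costs, and the function name `track_formant` and the weights `w_nominal` and `w_continuity` are hypothetical.

```python
def track_formant(candidates, nominal, w_nominal=1.0, w_continuity=1.0):
    """Pick one candidate per frame by dynamic programming.

    candidates[t] lists the candidate frequencies (Hz) of frame t;
    nominal[t] is the nominal frequency of the phoneme aligned to frame t.
    """
    # cost[t][k]: cheapest total cost of any path ending at candidate k of frame t
    cost = [[w_nominal * abs(c - nominal[0]) for c in candidates[0]]]
    back = [[None] * len(candidates[0])]
    for t in range(1, len(candidates)):
        row_cost, row_back = [], []
        for c in candidates[t]:
            local = w_nominal * abs(c - nominal[t])
            # Best predecessor: previous path cost plus the continuity penalty.
            best_k = min(
                range(len(candidates[t - 1])),
                key=lambda k: cost[t - 1][k]
                + w_continuity * abs(c - candidates[t - 1][k]),
            )
            jump = w_continuity * abs(c - candidates[t - 1][best_k])
            row_cost.append(cost[t - 1][best_k] + jump + local)
            row_back.append(best_k)
        cost.append(row_cost)
        back.append(row_back)
    # Trace back from the cheapest final candidate.
    k = min(range(len(cost[-1])), key=lambda j: cost[-1][j])
    track = []
    for t in range(len(candidates) - 1, -1, -1):
        track.append(candidates[t][k])
        k = back[t][k]
    track.reverse()
    return track

# Toy input: two candidates per frame, nominal first formant of 500 Hz.
frames = [[480.0, 1500.0], [520.0, 1450.0], [1480.0, 510.0]]
track = track_formant(frames, nominal=[500.0, 500.0, 500.0])
```

Setting `w_nominal` to zero reduces the search to a continuity-only tracker, which is the kind of baseline the abstract compares against.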
A database of vocal tract resonance trajectories for research in speech processing
Proc. ICASSP, 2006
Cited by 4 (1 self)
While vocal tract resonances (VTRs, or formants that are defined as such resonances) are known to play a critical role in human speech perception and in computer speech processing, there has been a lack of standard databases needed for the quantitative evaluation of automatic VTR extraction techniques. We report in this paper on our recent effort to create a publicly available database of the first three VTR frequency trajectories. The database contains a representative subset of the TIMIT corpus with respect to speaker, gender, dialect and phonetic context, with a total of 538 sentences. A Matlab-based labeling tool is developed, with high-resolution wideband spectrograms displayed to assist in visual identification of VTR frequency values, which are then recorded via mouse clicks and local spline interpolation. Special attention is paid to VTR values during consonant-to-vowel (CV) and vowel-to-consonant (VC) transitions, and to speech segments with vocal tract anti-resonances. Using this database, we quantitatively assess two common automatic VTR tracking techniques in terms of their average tracking errors analyzed within each of the six major broad phonetic classes as well as during CV and VC transitions. The potential use of the VTR database for research in several areas of speech processing is discussed.
Corpus-based unit selection for natural-sounding speech synthesis
2003
Cited by 2 (0 self)
Speech synthesis is an automatic encoding process carried out by machine through which symbols conveying linguistic information are converted into an acoustic waveform. In the past decade or so, a recent trend toward a non-parametric, corpus-based approach has focused on using real human speech as source material for producing novel natural-sounding speech. This work proposes a communication-theoretic formulation in which unit selection is a noisy channel through which an input sequence of symbols passes and an output sequence, possibly corrupted due to the coverage limits of the corpus, emerges. The penalty of approximation is quantified by substitution and concatenation costs which grade what unit contexts are interchangeable and where concatenations are not perceivable. These costs are semi-automatically derived from data and are found to agree with acoustic-phonetic knowledge.
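As a rough illustration of how substitution and concatenation costs drive unit selection, the sketch below searches for the unit sequence with the lowest total cost. It is a generic Viterbi search under assumptions: the abstract does not specify the search procedure or the cost functions, so the unit representation (phone, right context, recording id, position), the cost values 0.6 and 1.0, and the names `select_units`, `toy_substitution`, and `toy_concatenation` are all hypothetical.

```python
def select_units(targets, candidates, substitution_cost, concatenation_cost):
    """Return (units, total_cost) minimizing substitution + concatenation costs."""
    # best[i][k] = (cheapest cost of a path ending at candidates[i][k], backpointer)
    best = [[(substitution_cost(targets[0], u), None) for u in candidates[0]]]
    for i in range(1, len(targets)):
        row = []
        for u in candidates[i]:
            cost, back = min(
                (best[i - 1][k][0] + concatenation_cost(prev, u), k)
                for k, prev in enumerate(candidates[i - 1])
            )
            row.append((cost + substitution_cost(targets[i], u), back))
        best.append(row)
    k = min(range(len(best[-1])), key=lambda j: best[-1][j][0])
    total = best[-1][k][0]
    path = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(candidates[i][k])
        k = best[i][k][1]
    path.reverse()
    return path, total

def toy_substitution(target, unit):
    # Penalize a unit whose right context differs from the requested one.
    return 0.0 if unit[1] == target[1] else 0.6

def toy_concatenation(left, right):
    # Units adjacent in the same recording join without an audible seam.
    same_recording = left[2] == right[2] and left[3] + 1 == right[3]
    return 0.0 if same_recording else 1.0

# Targets are (phone, right context); units add (recording id, position).
targets = [("k", "a"), ("a", "t"), ("t", "#")]
candidates = [
    [("k", "a", 1, 0), ("k", "o", 2, 0)],
    [("a", "t", 1, 1), ("a", "t", 3, 5)],
    [("t", "#", 3, 6), ("t", "s", 1, 2)],
]
units, total = select_units(targets, candidates, toy_substitution, toy_concatenation)
```

In this toy inventory the search prefers three contiguous units from recording 1, accepting one context mismatch rather than paying for a join between recordings.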
Robust, N-Best Formant Tracking
Cited by 2 (0 self)
We describe a robust, N-best formant tracker. The two-stage algorithm initially finds single formants or parts thereof. In the second stage, a robust dynamic programming search with a wild card mechanism is employed to find the N best consistent interpretations of the initial formant information. The selection of the correct formant tracks is delayed until after the phonetic search, thus overcoming the lack of robustness of traditional formant trackers by delaying the final decision until after phonemic classification.
1. INTRODUCTION We are building a knowledge-based, segmental speech recognition system. Such systems have traditionally used cepstral, spectral or related features as a basis for segmentation and classification [1, 2]. However, from our experience with spectrogram reading, we know that formants are the single most important source of evidence for the classification of phonetic segments. Formants (especially their relative positioning) are the primary indicator for the cl...
A Gaussian Mixture Model Spectral Representation for Speech Recognition
Cited by 1 (0 self)
Most modern speech recognition systems use either Mel-frequency cepstral coefficients or perceptual linear prediction as acoustic features. Recently, there has been some interest in alternative speech parameterisations based on using formant features. Formants are the resonant frequencies in the vocal tract which form the characteristic shape of the speech spectrum. However, formants are difficult to reliably and robustly estimate from the speech signal and in some cases may not be clearly present. Rather than estimating the resonant frequencies, formant-like features can be used instead. Formant-like features use the characteristics of the spectral peaks to represent the spectrum. In this work, novel features are developed based on estimating a Gaussian mixture model (GMM) from the speech spectrum. This approach has previously been used successfully as a speech codec. The EM algorithm is used to estimate the parameters of the GMM. The extracted parameters (the means, standard deviations and component weights) can be related to the formant locations, bandwidths and magnitudes. As the features directly represent the linear spectrum, it is possible to apply techniques for vocal tract length normalisation and additive noise
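One way to read "estimating a GMM from the speech spectrum" is to treat the normalized magnitude spectrum as a histogram over frequency and run weighted EM on it. The sketch below does that for a synthetic two-peak spectrum. It is an assumption-laden illustration rather than the author's procedure: the function name `fit_spectrum_gmm`, the initial values, the iteration count, and the standard-deviation floor `min_std` are all hypothetical.

```python
import math

def fit_spectrum_gmm(freqs, magnitudes, means, stds, n_iter=50, min_std=1.0):
    """Weighted EM for a 1-D GMM, using spectral magnitudes as bin masses."""
    total = sum(magnitudes)
    mass = [m / total for m in magnitudes]
    k = len(means)
    means, stds = list(means), list(stds)
    weights = [1.0 / k] * k
    for _ in range(n_iter):
        acc_w, acc_f, acc_f2 = [0.0] * k, [0.0] * k, [0.0] * k
        # E-step: split each bin's mass among components by responsibility.
        for f, m in zip(freqs, mass):
            dens = [
                w * math.exp(-0.5 * ((f - mu) / s) ** 2) / (s * math.sqrt(2.0 * math.pi))
                for w, mu, s in zip(weights, means, stds)
            ]
            norm = sum(dens)
            if norm <= 0.0 or m <= 0.0:
                continue  # bin carries no mass or lies far from every component
            for j in range(k):
                g = m * dens[j] / norm
                acc_w[j] += g
                acc_f[j] += g * f
                acc_f2[j] += g * f * f
        # M-step: re-estimate weight, mean and standard deviation.
        assigned = sum(acc_w)
        for j in range(k):
            if acc_w[j] <= 0.0:
                continue
            weights[j] = acc_w[j] / assigned
            means[j] = acc_f[j] / acc_w[j]
            var = acc_f2[j] / acc_w[j] - means[j] ** 2
            stds[j] = max(math.sqrt(max(var, 0.0)), min_std)
    return weights, means, stds

# Synthetic spectrum: peaks at 500 Hz and 1500 Hz on a 25 Hz grid.
freqs = [25.0 * i for i in range(161)]
spectrum = [
    1.0 * math.exp(-0.5 * ((f - 500.0) / 100.0) ** 2)
    + 0.5 * math.exp(-0.5 * ((f - 1500.0) / 150.0) ** 2)
    for f in freqs
]
weights, means, stds = fit_spectrum_gmm(freqs, spectrum, [400.0, 1800.0], [300.0, 300.0])
```

On this input the fitted means play the role of peak locations, the standard deviations that of bandwidths, and the weights that of relative peak energy, mirroring the correspondence the abstract draws.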
PUBLISHED AS
2003
Cited by 1 (0 self)
for his love and his continuous support in good and bad times throughout this thesis. To Laura Lou, for her smiles and the energy they gave me when I needed it most. To my parents, for their perspective about the relative importance of a thesis and other things in life.
State-of-the-art automatic speech recognition (ASR) techniques are typically based on hidden Markov models (HMMs) for the modeling of temporal sequences of feature vectors extracted from the speech
Formant Tracking Using Context-Dependent Phonemic Information
Cited by 1 (0 self)
A new formant-tracking algorithm using phoneme information is proposed. Conventional formant-tracking algorithms obtain formant tracks by analyzing the acoustic speech signal using continuity constraints without any additional information. The formant-tracking error rate of the conventional methods is reportedly in the range of 10%–20%. In this paper, we show that if a text or phoneme transcription of the speech utterances is available, the error rate can be significantly reduced. The basic idea behind this approach is that, given the phoneme identity, formant-tracking algorithms have a better clue of where to look for formants. The algorithm consists of three phases: 1) analysis, 2) segmentation and alignment, and 3) formant tracking by the Viterbi search algorithm. In the analysis phase, formant candidates are obtained for each analysis frame by solving the linear prediction polynomial. In the segmentation and alignment phase,

