Results 1 - 8 of 8
From HMM's to Segment Models: A Unified View of Stochastic Modeling for Speech Recognition
1996
Near-Miss Modeling: A Segment-Based Approach to Speech Recognition
1998
Cited by 15 (0 self)
Currently, most approaches to speech recognition are frame-based in that they represent speech as a temporal sequence of feature vectors. Although these approaches have been successful, they cannot easily incorporate complex modeling strategies that may further improve speech recognition performance. In contrast, segment-based approaches represent speech as a temporal graph of feature vectors and facilitate the incorporation of a wide range of modeling strategies. However, difficulties in segment-based recognition have impeded the realization of potential advantages in modeling. This thesis ...
A Model For Efficient Formant Estimation
In ICASSP-96, 1996
Cited by 7 (0 self)
This paper presents a new method for estimating formant frequencies. The formant model is based on a digital resonator. Each resonator represents a segment of the short-time power spectrum. The complete spectrum is modeled by a set of digital resonators connected in parallel. An algorithm based on dynamic programming produces both the model parameters and segment boundaries that optimally match the spectrum. The main results of this paper are: 1) Modeling formants by digital resonators allows a reliable estimation of formant frequencies. 2) Digital resonators can be used efficiently in connection with dynamic programming. 3) A recognition test with formant frequencies results in a string error rate of 4.8% on the adult corpus of the TI digit string database. 1. INTRODUCTION An efficient and compact representation of the time-varying characteristics of speech offers potential benefits for speech recognition. Therefore a variety of approaches such as formant tracking [7, 4, 10], ar...
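As a rough illustration of the resonator idea in the abstract above, the sketch below evaluates the power spectrum of a single two-pole digital resonator and locates its peak. The abstract does not give the paper's exact resonator form, so the textbook two-pole transfer function used here, along with the names `resonator_power` and `peak_frequency` and all numeric values, are assumptions for illustration only.

```python
import cmath
import math


def resonator_power(freq_hz, center_hz, bandwidth_hz, sample_rate_hz):
    """Power response |H|^2 of an assumed two-pole resonator at freq_hz.

    H(z) = 1 / (1 - 2 r cos(theta) z^-1 + r^2 z^-2), where the pole radius r
    is set by the bandwidth and the pole angle theta by the center frequency.
    """
    r = math.exp(-math.pi * bandwidth_hz / sample_rate_hz)
    theta = 2.0 * math.pi * center_hz / sample_rate_hz
    omega = 2.0 * math.pi * freq_hz / sample_rate_hz
    z_inv = cmath.exp(-1j * omega)
    denom = 1.0 - 2.0 * r * math.cos(theta) * z_inv + (r * r) * z_inv * z_inv
    return 1.0 / abs(denom) ** 2


def peak_frequency(center_hz, bandwidth_hz, sample_rate_hz, step_hz=10.0):
    """Grid-search the frequency (0 to Nyquist) with the largest response."""
    n_steps = int((sample_rate_hz / 2.0) / step_hz)
    grid = [i * step_hz for i in range(n_steps + 1)]
    return max(
        grid,
        key=lambda f: resonator_power(f, center_hz, bandwidth_hz, sample_rate_hz),
    )


if __name__ == "__main__":
    # Hypothetical first-formant-like setting: 500 Hz center, 80 Hz bandwidth.
    print(peak_frequency(500.0, 80.0, 8000.0))
```

For a narrow bandwidth the response peaks very close to the nominal center frequency, which is what makes a resonator a convenient stand-in for one formant within a spectral segment.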
Stop consonant classification by dynamic formant trajectory
In ICSLP, Jeju Island, Korea, 2004
Cited by 5 (4 self)
LPC analysis is one of the most powerful techniques in speech analysis. Spectral zeros during consonant or consonant-vowel transition regions introduce difficulties in estimating LPC parameters. In this paper, we propose to estimate formant frequencies from the LPC model by MUSIC (Multiple Signal Classification) and ESPRIT (Estimation of Signal Parameters via Rotational Invariance Techniques). Formant candidates estimated by LS (Least Squares), MUSIC and ESPRIT are combined to find an optimal solution. The effectiveness of this algorithm is verified on a place classification task for stop consonants. 1. OVERVIEW Classification of stop consonants remains one of the most challenging problems in speech recognition. Halberstadt (1998) [3] reported classification of phones in the TIMIT database using heterogeneous
The Stochastic Segment Model for Continuous Speech Recognition
In Proceedings of the 25th Asilomar Conference on Signals, Systems and Computers, 1991
Cited by 5 (1 self)
A new direction in speech recognition via statistical methods is to move from frame-based models, such as Hidden Markov Models (HMMs), to segment-based models that provide a better framework for modeling the dynamics of the speech production mechanism. The Stochastic Segment Model (SSM) is a joint model for a sequence of observations, which provides explicit modeling of time correlation as well as a formalism for incorporating segmental features. In this work, the focus is on modeling time correlation within a segment. We consider three Gaussian model variations based on different assumptions about the form of statistical dependency, including a Gauss-Markov model, a dynamical system model and a target state model, all of which can be formulated in terms of the dynamical system model. Evaluation of the different modeling assumptions is in terms of both phoneme classification performance and the predictive power of linear models. 1 Introduction Most of the existing speaker-independent ...
Explicit N-Best Formant Features for Segment-Based Speech Recognition
1996
This thesis investigates the use of explicit speech knowledge in computer speech recognition. Speech knowledge is generally expressed in terms of acoustic events occurring near phonetic segment boundaries and the location, shape and dynamics of formant trajectories. This suggests the creation of a segment-based recognition framework and the use of explicit formant features in a flexible integration scheme to ultimately improve the phonetic recognition accuracy. We describe a segmentation algorithm that produces a lattice of segment hypotheses, each with an associated broad phonetic identity. We build a single phonetic segment classifier along with separate vowel/semi-vowel and consonant classifiers based on traditional cepstral features, paying attention to reducing the mismatch between training and deployment conditions. We develop a robust, N-best formant tracking algorithm that generates a list of up to N consistent formant interpretations. The use of the N-best feature paradigm is based on the observation that there are generally only a handful of reasonable interpretations of the given formant information. Instead of finding the best formant interpretation through the use of a global cost function that includes energy maximization and smoothness terms, we delay the selection of the correct formant interpretation until after the segment classification and phonetic search. We use the formant interpretations to extract features for a vowel/semi-vowel segment classifier. The formant trajectories are approximated either by three line segments or by a third-order Legendre polynomial. We show that together with formant amplitude, formant bandwidth, pitch, and segment durations we can produce a classifier of comparable performance to a cepstral-based classifier. We further demonstrate the potential of the N-best classification paradigm and show that a combination of formant and cepstral features further improves the classification accuracy.
Finally, the validity of the overall approach (a segment-based framework, separate classifiers for vowels and consonants, and explicit formant features) is verified by phonetic recognition experiments.
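The abstract above mentions approximating a formant trajectory by a third-order Legendre polynomial. A minimal sketch of that kind of fit follows, assuming a uniformly sampled trajectory mapped onto [-1, 1] and an ordinary least-squares criterion; the thesis may use a different sampling or fitting procedure, and the function names here are hypothetical.

```python
def legendre_basis(x):
    """Legendre polynomials P0..P3 evaluated at x in [-1, 1]."""
    return [1.0, x, 0.5 * (3.0 * x * x - 1.0), 0.5 * (5.0 * x ** 3 - 3.0 * x)]


def solve_linear(a, b):
    """Solve a small dense system by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda row_idx: abs(m[row_idx][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for row_idx in range(col + 1, n):
            factor = m[row_idx][col] / m[col][col]
            for c in range(col, n + 1):
                m[row_idx][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        residual = m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = residual / m[i][i]
    return x


def fit_legendre3(track_hz):
    """Least-squares Legendre coefficients (orders 0..3) for a formant track.

    Assumes at least four uniformly spaced samples; time is rescaled to [-1, 1].
    """
    n = len(track_hz)
    xs = [-1.0 + 2.0 * i / (n - 1) for i in range(n)]
    basis = [legendre_basis(x) for x in xs]
    gram = [[sum(b[i] * b[j] for b in basis) for j in range(4)] for i in range(4)]
    rhs = [sum(b[i] * y for b, y in zip(basis, track_hz)) for i in range(4)]
    return solve_linear(gram, rhs)


if __name__ == "__main__":
    # Hypothetical rising formant track over 11 frames, in Hz.
    track = [400.0 + 20.0 * i for i in range(11)]
    print(fit_legendre3(track))
```

The four coefficients give a fixed-length description of a variable-length trajectory (roughly its mean, slope, curvature and a cubic term), which is the kind of compact feature a segment classifier can consume.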
Robust Estimation of Stochastic Segment Models for Word Recognition
1990
In this work, we develop robust estimation techniques for a continuous-word recognition system using the Stochastic Segment Model (SSM). This work is done under the N-best rescoring formalism, where a less complex system than the SSM is used to generate candidate hypotheses which are then rescored and reranked by the SSM. Components of the system that are the focus of this work include estimation of weights for score combination and robust parameter estimation using clustering techniques to model context. In particular, we develop several agglomerative and divisive clustering techniques for multivariate Gaussian distributions, which we use to cluster triphone models. This leads to better estimates with fewer parameters, resulting in a reduction in word error and in storage/computation costs compared with unclustered triphones. We also implement an SSM system based on microsegments which combines mixture modeling with trajectory modeling and examine the tradeoffs involved between the allocation ...
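The N-best rescoring formalism described in the abstract above can be sketched as a weighted combination of per-hypothesis scores followed by a re-sort. This is only an illustration: the score names (`hmm`, `ssm`), the weights, and the hypotheses are made up, and the work itself estimates the combination weights rather than fixing them by hand.

```python
def rerank(hypotheses, weights):
    """Sort N-best hypotheses by a weighted sum of their log scores.

    hypotheses: list of dicts with a "text" field and a "scores" dict.
    weights: dict mapping score name to its (assumed, hand-set) weight.
    """
    def combined(hyp):
        return sum(weights[name] * hyp["scores"][name] for name in weights)

    return sorted(hypotheses, key=combined, reverse=True)


if __name__ == "__main__":
    # Hypothetical 2-best list: a first-pass score plus a rescoring-model score.
    n_best = [
        {"text": "hypothesis A", "scores": {"hmm": -10.0, "ssm": -12.0}},
        {"text": "hypothesis B", "scores": {"hmm": -11.0, "ssm": -9.0}},
    ]
    best = rerank(n_best, {"hmm": 1.0, "ssm": 1.0})[0]
    print(best["text"])
```

With equal weights the second-pass score overturns the first-pass ranking in this toy list; setting the `ssm` weight to zero recovers the original order.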
Multivariate-State Hidden Markov Models For Simultaneous Transcription Of Phones And Formants
A multivariate-state HMM (an HMM with a vector state variable) can be used to find jointly optimal phonetic and formant transcriptions of an utterance. The complexity of searching a multivariate state space using the Baum-Welch algorithm is substantial, but may be significantly reduced if the formant frequencies are assumed to be conditionally independent given knowledge of the phone. Operating with a known phonetic transcription, the multivariate-state model can provide a maximum a posteriori formant trajectory, complete with confidence limits on each of the formant frequency measurements. The model can also be used as a phonetic classifier by adding the probabilities of all possible formant trajectories. A test system is described which requires only nine trainable parameters per formant per phonetic state: five parameters to model formant transitions, and four to model spectral observations. Further simplifications were achieved through parameter tying. 1. INTRODUCTION This article prop...

