From HMM's to Segment Models: A Unified View of Stochastic Modeling for Speech Recognition
, 1996
Heterogeneous Acoustic Measurement And Multiple Classifiers For Speech Recognition
, 1998
Cited by 29 (1 self)
The acoustic-phonetic modeling component of most current speech recognition systems calculates a small set of homogeneous frame-based measurements at a single, fixed time-frequency resolution. This thesis presents evidence indicating that recognition performance can be significantly improved through a contrasting approach using more detailed and more diverse acoustic measurements, which we refer to as heterogeneous measurements.
Production Models As A Structural Basis For Automatic Speech Recognition
, 1996
Cited by 21 (1 self)
We postulate in this paper that highly structured speech production models will have much to contribute to the ultimate success of speech recognition in view of the weaknesses of the theoretical foundation underpinning current technology. These weaknesses are analyzed in terms of phonological modeling and of phonetic-interface modeling. We conclude by suggesting that many of the advantages to be gained from interaction between speech production and speech recognition communities will develop from integrating models from the production community with the probabilistic analysis-by-synthesis strategy currently used by the technology community. Résumé: In this article, we propose that speech production models will contribute greatly to the eventual success of automatic recognition models, which are currently limited by the weaknesses in the theoretical basis of current technology. We analyze these weaknesses at the level of phonological models and ...
The Use of Distinctive Features for Automatic Speech Recognition
, 1991
Cited by 11 (0 self)
One of the most critical and yet unsolved problems in phonetic recognition is the transformation of the continuous speech signal to a discrete representation for accessing words in the lexicon. In order to find an efficient description of speech for recognition tasks, our research investigates the use of distinctive features. Distinctive features are a small set of linguistic units which have the potential advantage of enabling us to describe contextual and coarticulatory variations in speech more parsimoniously and thus make more effective use of available training data.
Phonetically Motivated Acoustic Parameters For Continuous Speech Recognition Using Artificial Neural Networks
, 1992
Cited by 8 (2 self)
In the framework of an ANN/HMM hybrid system for phone recognition, three specialized ANNs were designed and evaluated. One of these ANNs detects the manner of articulation. The other two ANNs describe the speech signal in terms of place of articulation. One of these is used for plosive and nasal classification, and the other one is used for fricative classification. The design of these networks was inspired by acoustic-phonetic knowledge. Input parameters, ANN topology, and desired output representation have been optimized for the specific task of each network. A main advantage of ANNs over statistical classifiers like HMMs is seen in the possibility of using a large unconstrained feature set, which can be set up to contain all necessary information rather than to fulfill statistical constraints. Experiments are reported for the TIMIT database. 1 Introduction State-of-the-art acoustic-phonetic decoders for speaker-independent continuous speech recognition are based on a statistic...
The Stochastic Segment Model for Continuous Speech Recognition
In Proceedings of the 25th Asilomar Conference on Signals, Systems and Computers
, 1991
Cited by 5 (1 self)
A new direction in speech recognition via statistical methods is to move from frame-based models, such as Hidden Markov Models (HMMs), to segment-based models that provide a better framework for modeling the dynamics of the speech production mechanism. The Stochastic Segment Model (SSM) is a joint model for a sequence of observations, which provides explicit modeling of time correlation as well as a formalism for incorporating segmental features. In this work, the focus is on modeling time correlation within a segment. We consider three Gaussian model variations based on different assumptions about the form of statistical dependency, including a Gauss-Markov model, a dynamical system model and a target state model, all of which can be formulated in terms of the dynamical system model. Evaluation of the different modeling assumptions is in terms of both phoneme classification performance and the predictive power of linear models. 1 Introduction Most of the existing speaker-independent ...
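To make the idea of modeling time correlation within a segment concrete, here is a minimal sketch of a scalar, first-order Gauss-Markov segment likelihood. It is an illustration under stated assumptions, not the formulation used in the paper: the names `gauss_markov_loglik`, `mu`, `a`, `q`, and `var0` are hypothetical, and real systems work with feature vectors rather than scalars.

```python
import math


def log_normal(x, mean, var):
    """Log density of a scalar Gaussian with the given mean and variance."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def gauss_markov_loglik(frames, mu, a, q, var0):
    """Log-likelihood of a scalar segment under a first-order Gauss-Markov model.

    Assumed (hypothetical) parameterisation:
      x_1               ~ N(mu, var0)
      x_t | x_{t-1}     ~ N(mu + a * (x_{t-1} - mu), q)
    Setting a = 0 and q = var0 recovers independent, identically
    distributed frames, i.e. no time correlation within the segment.
    """
    if not frames:
        return 0.0
    total = log_normal(frames[0], mu, var0)
    for prev, cur in zip(frames, frames[1:]):
        predicted = mu + a * (prev - mu)  # one-step linear prediction
        total += log_normal(cur, predicted, q)
    return total
```

With `a = 0` the score reduces to a sum of independent frame scores, so comparing the two settings on a smooth trajectory shows what the correlation term contributes.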
Explicit N-Best Formant Features for Segment-Based Speech Recognition
, 1996
This thesis investigates the use of explicit speech knowledge in computer speech-recognition. Speech knowledge is generally expressed in terms of acoustic events occurring near phonetic segment boundaries and the location, shape and dynamics of formant trajectories. This suggests the creation of a segment-based recognition framework and the use of explicit formant features in a flexible integration scheme to ultimately improve the phonetic recognition accuracy. We describe a segmentation algorithm that produces a lattice of segment hypotheses, each with an associated broad phonetic identity. We build a single phonetic segment classifier along with separate vowel/semi-vowel and consonant classifiers based on traditional cepstral features, paying attention to reducing the mismatch between training and deployment conditions. We develop a robust, N-best formant tracking algorithm that generates a list of up to N consistent formant interpretations. The use of the N-best feature paradigm is based on the observation that there are generally only a handful of reasonable interpretations of the given formant information. Instead of finding the best formant interpretation through the use of a global cost function that includes energy maximization and smoothness terms, we delay the selection of the correct formant interpretation until after the segment classification and phonetic search. We use the formant interpretations to extract features for a vowel/semi-vowel segment classifier. The formant trajectories are approximated either by three line segments or by a third-order Legendre polynomial. We show that together with formant amplitude, formant bandwidth, pitch, and segment durations we can produce a classifier of comparable performance to a cepstral-based classifier. We further demonstrate the potential of the N-best classification paradigm and show that a combination of formant and cepstral features further improves the classification accuracy.
Finally, the validity of the overall approach, which combines a segment-based framework, separate classifiers for vowels and consonants, and explicit formant features, is verified by phonetic recognition experiments.
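The Legendre approximation of a formant trajectory mentioned in this abstract can be sketched as a small least-squares fit. This is only an illustration: it assumes "third-order" means an expansion in the Legendre polynomials P0 through P3, that the track is sampled at evenly spaced frames, and the function names (`legendre_basis`, `solve`, `legendre_features`) are hypothetical rather than taken from the thesis.

```python
def legendre_basis(x):
    """Legendre polynomials P0..P3 evaluated at x in [-1, 1]."""
    return [1.0, x, 0.5 * (3.0 * x * x - 1.0), 0.5 * (5.0 * x ** 3 - 3.0 * x)]


def solve(a, b):
    """Solve a small linear system by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):  # back substitution
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def legendre_features(track):
    """Least-squares coefficients of P0..P3 for one formant track (one value per frame)."""
    n = len(track)
    if n < 4:
        raise ValueError("need at least 4 frames to fit a third-order polynomial")
    # Map frame indices onto [-1, 1], the natural domain of Legendre polynomials.
    xs = [-1.0 + 2.0 * i / (n - 1) for i in range(n)]
    rows = [legendre_basis(x) for x in xs]
    # Normal equations: (B^T B) c = B^T y.
    gram = [[sum(r[i] * r[j] for r in rows) for j in range(4)] for i in range(4)]
    rhs = [sum(r[i] * y for r, y in zip(rows, track)) for i in range(4)]
    return solve(gram, rhs)
```

The four returned coefficients roughly capture the mean level, slope, curvature, and asymmetry of the track, which is one plausible reason a low-order expansion can serve as a compact segment feature.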
Speech Perception Using . . . The BeBe System
, 1997
We define a new approach to speech recognition based on auditory perception and modeled after the human brain's tendency to automatically categorize speech sounds [House 1962; Liberman 1957]. As background, today's speech recognition systems are knowledge-driven since they require the existence of word and syntax-level knowledge to identify a word from the sound. In contrast, our system uses no higher-level knowledge. Its architecture consists of competing parallel detectors which in real time identify phonemes in the waveform. Each detector, which is a simple algorithm, continuously samples the sound and reports the degree to which the samples contain its designated phoneme. The phoneme detector with the highest precedence and the greatest certainty above a minimal threshold prevails and its phoneme is added to an output queue. In preliminary experiments, four such detectors were tested and they properly identified 83-100% of their designated phonemes in both discrete and continuous speech, independent of the speaker, suggesting that an overall system which incorporates our approach would be much more robust and flexible than traditional systems.
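The arbitration among competing detectors described in this abstract can be sketched as follows. The abstract does not say how precedence and certainty are combined, so this sketch assumes that precedence is compared first and certainty breaks ties among detectors above the threshold. The `Report` type, the `arbitrate` and `run` functions, and the default threshold of 0.5 are all hypothetical.

```python
from collections import deque
from typing import List, NamedTuple, Optional, Sequence


class Report(NamedTuple):
    phoneme: str      # phoneme this detector is designated to find
    precedence: int   # hypothetical ranking; larger value wins
    certainty: float  # reported degree of match, assumed to lie in [0, 1]


def arbitrate(reports: Sequence[Report], threshold: float) -> Optional[Report]:
    """Pick the prevailing report for one analysis window, or None if none qualifies."""
    eligible = [r for r in reports if r.certainty > threshold]
    if not eligible:
        return None
    # Assumed ordering: precedence first, certainty as the tie-breaker.
    return max(eligible, key=lambda r: (r.precedence, r.certainty))


def run(windows: Sequence[Sequence[Report]], threshold: float = 0.5) -> List[str]:
    """Feed successive windows of detector reports through the arbiter."""
    queue = deque()  # output queue of recognized phonemes
    for reports in windows:
        winner = arbitrate(reports, threshold)
        if winner is not None:
            queue.append(winner.phoneme)
    return list(queue)
```

A window in which no detector clears the threshold contributes nothing to the output queue, which matches the "minimal threshold" condition in the abstract.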

