Results 1 - 10 of 18
From HMM's to Segment Models: A Unified View of Stochastic Modeling for Speech Recognition
, 1996
Near-Miss Modeling: A Segment-Based Approach to Speech Recognition
, 1998
Cited by 18 (0 self)
Currently, most approaches to speech recognition are frame-based in that they represent speech as a temporal sequence of feature vectors. Although these approaches have been successful, they cannot easily incorporate complex modeling strategies that may further improve speech recognition performance. In contrast, segment-based approaches represent speech as a temporal graph of feature vectors and facilitate the incorporation of a wide range of modeling strategies. However, difficulties in segment-based recognition have impeded the realization of potential advantages in modeling. This thesis ...
Ligature modeling for online cursive script recognition
 IEEE Trans. Pattern Anal. Mach. Intell
, 1997
Cited by 15 (6 self)
Abstract: Online recognition of cursive words is a difficult task owing to variable shape and ambiguous letter boundaries. The approach proposed in this paper is based on hidden Markov modeling of letters and of the inter-letter patterns, called ligatures, that occur in cursive script. For each letter and each ligature we create one HMM that models the temporal and spatial variability of handwriting. By networking the two kinds of HMMs, we can design a network model for all words or composite characters. The network incorporates grammatical and structural constraints as knowledge sources so that it can better capture the characteristics of handwriting. Given the network, recognition is formulated as finding the most likely path from the start node to the end node. A dynamic programming-based search for the optimal input-network alignment performs character recognition and letter segmentation simultaneously and efficiently. Experiments on Korean characters showed correct recognition of up to 93.3 percent on unconstrained samples. The approach has also been compared with several other HMM-based recognition schemes to characterize it.
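The most-likely-path search described in the abstract above can be illustrated with a generic Viterbi sketch. This is not the paper's network or its trained models: the two states ("letter", "ligature"), the observation symbols ("curve", "stroke"), and all probabilities below are hypothetical values chosen only to show how dynamic programming recovers the best state sequence, and with it a segmentation.

```python
import math

def viterbi(obs, states, log_start, log_trans, log_emit):
    """Return (best_log_prob, best_path) for an observation sequence."""
    # best[s] = log-probability of the best path that ends in state s
    best = {s: log_start[s] + log_emit[s][obs[0]] for s in states}
    back = []  # one dict per later frame: state -> best predecessor
    for o in obs[1:]:
        prev, best, ptr = best, {}, {}
        for s in states:
            p = max(states, key=lambda r: prev[r] + log_trans[r][s])
            best[s] = prev[p] + log_trans[p][s] + log_emit[s][o]
            ptr[s] = p
        back.append(ptr)
    # Trace the back-pointers from the best final state.
    last = max(states, key=lambda s: best[s])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    path.reverse()
    return best[last], path

def logs(table):
    """Convert a nested dict of probabilities to log-probabilities."""
    return {k: {j: math.log(v) for j, v in row.items()} for k, row in table.items()}

# Hypothetical toy model (all numbers are assumptions for illustration).
states = ["letter", "ligature"]
log_start = {"letter": math.log(0.9), "ligature": math.log(0.1)}
log_trans = logs({"letter": {"letter": 0.6, "ligature": 0.4},
                  "ligature": {"letter": 0.9, "ligature": 0.1}})
log_emit = logs({"letter": {"curve": 0.8, "stroke": 0.2},
                 "ligature": {"curve": 0.1, "stroke": 0.9}})

score, path = viterbi(["curve", "stroke", "curve"], states,
                      log_start, log_trans, log_emit)
```

Because the best path labels every frame with a state, the points where the label changes give the letter boundaries, which is the sense in which recognition and segmentation are obtained together.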
A Model For Efficient Formant Estimation
 in ICASSP96
, 1996
Cited by 12 (1 self)
This paper presents a new method for estimating formant frequencies. The formant model is based on a digital resonator. Each resonator represents a segment of the short-time power spectrum. The complete spectrum is modeled by a set of digital resonators connected in parallel. An algorithm based on dynamic programming produces both the model parameters and segment boundaries that optimally match the spectrum. The main results of this paper are: 1) Modeling formants by digital resonators allows a reliable estimation of formant frequencies. 2) Digital resonators can be used efficiently in connection with dynamic programming. 3) A recognition test with formant frequencies results in a string error rate of 4.8% on the adult corpus of the TI digit string database. 1. INTRODUCTION An efficient and compact representation of the time varying characteristics of speech offers potential benefits for speech recognition. Therefore a variety of approaches such as formant tracking [7, 4, 10], ar...
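As a rough illustration of what a digital resonator contributes to a spectrum model, the sketch below evaluates the power spectrum of a standard two-pole resonator and of several such resonators added in parallel. The transfer function, the bandwidth-to-pole-radius mapping, and the example frequencies are textbook assumptions, not the formulation or parameters used in the paper, and the dynamic programming search over segment boundaries is not shown.

```python
import cmath
import math

def resonator_response(freq_hz, center_hz, bandwidth_hz, fs_hz):
    """Complex frequency response of a two-pole digital resonator."""
    r = math.exp(-math.pi * bandwidth_hz / fs_hz)       # pole radius from bandwidth
    theta = 2.0 * math.pi * center_hz / fs_hz           # pole angle from centre frequency
    z_inv = cmath.exp(-2j * math.pi * freq_hz / fs_hz)  # z^-1 on the unit circle
    return 1.0 / (1.0 - 2.0 * r * math.cos(theta) * z_inv + r * r * z_inv * z_inv)

def model_power(freq_hz, resonators, fs_hz):
    """Power spectrum of resonators in parallel (their responses add)."""
    total = sum(resonator_response(freq_hz, c, b, fs_hz) for c, b in resonators)
    return abs(total) ** 2

FS = 8000.0                               # assumed sampling rate in Hz
grid = [10.0 * k for k in range(1, 400)]  # 10 Hz to 3990 Hz

# A single resonator peaks close to its centre frequency.
single = [(500.0, 80.0)]                  # (centre Hz, bandwidth Hz), hypothetical
peak_hz = max(grid, key=lambda f: model_power(f, single, FS))

# A parallel bank, e.g. three formant-like resonators (hypothetical values).
bank = [(500.0, 80.0), (1500.0, 100.0), (2500.0, 120.0)]
bank_spectrum = [model_power(f, bank, FS) for f in grid]
```

Fitting such a model to a measured short-time spectrum then amounts to choosing centre frequencies and bandwidths whose combined spectrum matches the data.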
Speaker-Independent Digit Recognition Using a Neural Network with Time-Delayed Connections
, 1992
Cited by 8 (1 self)
The capability of a small neural network to perform speaker-independent recognition of spoken digits in connected speech has been investigated. The network uses time delays to organize rapidly changing outputs of symbol detectors over the time scale of a word. The network is data-driven and unclocked. To achieve useful accuracy in a speaker-independent setting, many new ideas and procedures were developed. These include improving the feature detectors, self-recognition of word ends, reduction in network size, and dividing speakers into natural classes. Quantitative experiments based on Texas Instruments (TI) digit databases are described.
Improvements in the stochastic segment model for phoneme recognition
 Proceedings of DARPA Workshop on Speech and Natural Language
, 1989
Cited by 8 (1 self)
The heart of a speech recognition system is the acoustic model of subword units (e.g., phonemes). In this work we discuss refinements of the stochastic segment model, an alternative to hidden Markov models for representation of the acoustic variability of phonemes. We concentrate on mechanisms for better modelling time correlation of features across an entire segment. Results are presented for speaker-independent phoneme classification in continuous speech based on the TIMIT database.
The Stochastic Segment Model for Continuous Speech Recognition
 In Proceedings of the 25th Asilomar Conference on Signals, Systems and Computers
, 1991
Cited by 6 (1 self)
A new direction in speech recognition via statistical methods is to move from frame-based models, such as Hidden Markov Models (HMMs), to segment-based models that provide a better framework for modeling the dynamics of the speech production mechanism. The Stochastic Segment Model (SSM) is a joint model for a sequence of observations, which provides explicit modeling of time correlation as well as a formalism for incorporating segmental features. In this work, the focus is on modeling time correlation within a segment. We consider three Gaussian model variations based on different assumptions about the form of statistical dependency, including a Gauss-Markov model, a dynamical system model and a target state model, all of which can be formulated in terms of the dynamical system model. Evaluation of the different modeling assumptions is in terms of both phoneme classification performance and the predictive power of linear models. 1 Introduction Most of the existing speaker-independent ...
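To make the idea of modeling time correlation within a segment concrete, here is a minimal scalar sketch of a first-order Gauss-Markov likelihood next to an independent-frame Gaussian likelihood. The scalar features, the parameter names (a, b, q, mean0, var0), and the example values are assumptions for illustration; the paper works with Gaussian models over feature vectors, and its exact parameterization is not given in the abstract.

```python
import math

def log_normal(x, mean, var):
    """Log-density of a scalar Gaussian."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

def gauss_markov_loglik(xs, mean0, var0, a, b, q):
    """Log-likelihood under x_0 ~ N(mean0, var0), x_t = a*x_{t-1} + b + w_t, w_t ~ N(0, q)."""
    total = log_normal(xs[0], mean0, var0)
    for prev, cur in zip(xs, xs[1:]):
        # Each frame is predicted linearly from the previous frame.
        total += log_normal(cur, a * prev + b, q)
    return total

def independent_loglik(xs, mean, var):
    """Log-likelihood when frames are treated as independent draws."""
    return sum(log_normal(x, mean, var) for x in xs)

# Hypothetical smooth feature trajectory within one segment.
xs = [0.0, 0.1, 0.2, 0.3]
gm = gauss_markov_loglik(xs, mean0=0.0, var0=1.0, a=1.0, b=0.1, q=0.01)

# With a = 0 the dependence on the previous frame vanishes and the
# Gauss-Markov model reduces to the independent-frame model.
reduced = gauss_markov_loglik(xs, mean0=0.15, var0=1.0, a=0.0, b=0.15, q=1.0)
indep = independent_loglik(xs, mean=0.15, var=1.0)
```

The reduction to the independent-frame case at a = 0 shows that the frame-independence assumption is a special case of the correlated model rather than a separate formalism.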
Stop consonant classification by dynamic formant trajectory
 in ICSLP, Jeju Island, Korea
, 2004
Cited by 5 (4 self)
LPC analysis is one of the most powerful techniques in speech analysis. Spectral zeros during consonant or consonant-vowel transition regions introduce difficulties in estimating LPC parameters. In this paper, we propose to estimate formant frequencies from the LPC model by MUSIC (Multiple Signal Classification) and ESPRIT (Estimation of Signal Parameters via Rotational Invariance Techniques). Formant candidates estimated by LS (Least Squares), MUSIC and ESPRIT are combined to find an optimal solution. The effectiveness of this algorithm is verified on a place classification task for stop consonants. 1. OVERVIEW Classification of stop consonants remains one of the most challenging problems in speech recognition. Halberstadt (1998) [3] reported classification of phones in the TIMIT database using heterogeneous
Multivariate-State Hidden Markov Models For Simultaneous Transcription Of Phones And Formants
Cited by 1 (0 self)
A multivariate-state HMM (an HMM with a vector state variable) can be used to find jointly optimal phonetic and formant transcriptions of an utterance. The complexity of searching a multivariate state space using the Baum-Welch algorithm is substantial, but may be significantly reduced if the formant frequencies are assumed to be conditionally independent given knowledge of the phone. Operating with a known phonetic transcription, the multivariate-state model can provide a maximum a posteriori formant trajectory, complete with confidence limits on each of the formant frequency measurements. The model can also be used as a phonetic classifier by adding the probabilities of all possible formant trajectories. A test system is described which requires only nine trainable parameters per formant per phonetic state: five parameters to model formant transitions, and four to model spectral observations. Further simplifications were achieved through parameter tying. 1. INTRODUCTION This article prop...
Timing models for prosody and cross-word coarticulation in connected speech
 In Proc. of the
, 1989
Cited by 1 (0 self)
Gauging durations of acoustic intervals is useful for recognizing the phrasing and stress pattern of an utterance. It aids in the recognition of segments that are differentiated by duration, and it can improve segment recognition in general because knowing the stress and phrasing reduces the vocabulary search space. However, models of speech timing that compute acoustic segment lengths cannot capture spectral dynamics, and they rapidly become unwieldy in connected speech, where many effects interact to determine interval durations. I will review two results from recent work on articulatory dynamics that suggest a more workable alternative. Browman and Goldstein have developed a general model of the timing of articulatory gestures. Using this model they can describe many assimilations and apparent deletions of segments at word boundaries in terms of simple manipulations of intergestural timing, an account which should be useful for predicting the lenition pattern and for interpreting the resulting spectra in order to recover the underlying form. Beckman, Edwards, and Fletcher have applied Browman and Goldstein's model in examining articulatory correlates of global tempo decrease, phrase-final position, and sentence accent. Their data show that these three different lengthening effects are functionally distinct and suggest that the kinematics of formant transitions and amplitude curves can be used for distinguishing among the effects to parse the prosodic organization of an utterance.