A Probabilistic Framework For Segment-Based Speech Recognition
2003
Cited by 108 (33 self)
Most current speech recognizers use an observation space based on a temporal sequence of measurements extracted from fixed-length "frames" (e.g., Mel-cepstra). Given a hypothetical word or sub-word sequence, the acoustic likelihood computation always involves all observation frames, though the mapping between individual frames and internal recognizer states will depend on the hypothesized segmentation. There is another type of recognizer whose observation space is better represented as a network, or graph, where each arc in the graph corresponds to a hypothesized variable-length segment that is represented by a fixed-dimensional "feature". In such feature-based recognizers, each hypothesized segmentation will correspond to a segment sequence, or path, through the overall segment-graph that is associated with a subset of all possible feature vectors in the total observation space. In this work we examine a maximum a posteriori decoding strategy for feature-based recognizers and develop a normalization criterion useful for a segment-based Viterbi or A* search. Experiments are reported for both phonetic and word recognition tasks.
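The segment-graph search described in this abstract can be illustrated with a minimal Python sketch. It is not the paper's decoder: the function name `best_segmentation`, the arc dictionary layout, and the toy scores are all hypothetical, there is no language model, and the arc scores are simply assumed to be already normalized so that paths with different numbers of segments are comparable (the paper's actual normalization criterion is not reproduced here).

```python
NEG_INF = float("-inf")


def best_segmentation(num_boundaries, arcs):
    """Viterbi search over a segment graph.

    num_boundaries: number of time boundaries, indexed 0 .. num_boundaries - 1.
    arcs: dict mapping (start, end) with start < end to a dict of
          label -> log score for that hypothesized segment (assumed normalized).
    Returns (best total log score, list of (start, end, label)).
    """
    best = [NEG_INF] * num_boundaries   # best score of any path reaching each boundary
    back = [None] * num_boundaries      # back-pointer: (previous boundary, label)
    best[0] = 0.0

    # Boundaries are visited in time order, so every predecessor is final
    # before the arcs ending at the current boundary are examined.
    for end in range(1, num_boundaries):
        for (start, arc_end), label_scores in arcs.items():
            if arc_end != end or best[start] == NEG_INF:
                continue
            label, score = max(label_scores.items(), key=lambda kv: kv[1])
            if best[start] + score > best[end]:
                best[end] = best[start] + score
                back[end] = (start, label)

    last = num_boundaries - 1
    if best[last] == NEG_INF:
        return NEG_INF, []              # no complete path through the graph

    # Trace the back-pointers from the final boundary to boundary 0.
    path = []
    node = last
    while node != 0:
        start, label = back[node]
        path.append((start, node, label))
        node = start
    path.reverse()
    return best[last], path


# Toy graph with four boundaries and competing segmentations (made-up scores).
toy_arcs = {
    (0, 1): {"s": -1.0, "f": -2.0},
    (0, 2): {"s": -2.5},
    (1, 2): {"ih": -3.0},
    (1, 3): {"iy": -1.5},
    (2, 3): {"iy": -0.5},
}
score, segments = best_segmentation(4, toy_arcs)
```

In the toy graph, the path that uses two segments (0 to 1, then 1 to 3) beats both the alternative two-segment path and the three-segment path, which shows how each path covers a different subset of the hypothesized segments.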
Support vector machines for speech recognition
Proceedings of the International Conference on Spoken Language Processing, 1998
Cited by 47 (2 self)
Statistical techniques based on hidden Markov models (HMMs) with Gaussian emission densities have dominated the signal processing and pattern recognition literature for the past 20 years. However, HMMs trained using maximum likelihood techniques suffer from an inability to learn discriminative information and are prone to overfitting and over-parameterization. Recent work in machine learning has focused on models, such as the support vector machine (SVM), that automatically control generalization and parameterization as part of the overall optimization process. In this paper, we show that SVMs provide a significant improvement in performance on a static pattern classification task based on the Deterding vowel data. We also describe an application of SVMs to large vocabulary speech recognition, and demonstrate an improvement in error rate on a continuous alphadigit task (OGI Alphadigits) and a large vocabulary conversational speech task (Switchboard). Issues related to the development and optimization of an SVM/HMM hybrid system are discussed.
Lexical Modeling Of Non-Native Speech For Automatic Speech Recognition
2000
Cited by 24 (1 self)
This paper examines the recognition of non-native speech in JUPITER, a speaker-independent, spontaneous-speech conversational system. Because the non-native speech in this domain is limited and varied, speaker- and accent-specific methods are impractical. We therefore chose to model all of the non-native data with a single model. In particular, this paper describes an attempt to better model non-native lexical patterns. These patterns are incorporated by applying context-independent phonetic confusion rules, whose probabilities are estimated from training data. Using this approach, the word error rate on a non-native test set is reduced from 20.9% to 18.8%.
1. INTRODUCTION
Speech recognition accuracy has been observed to be drastically lower for non-native speakers of the target language than for native speakers [3, 13, 14]. Research on both non-native accent modeling and dialect-specific modeling shows that large gains in performance can be achieved when the acoustics [1, 9, 14] and ...
Pronunciation Modeling Using a Finite-State Transducer Representation
In Proc. ISCA Tutorial and Research Workshop on Pronunciation Modeling and Lexicon Adaptation, 2002
Cited by 23 (5 self)
The MIT SUMMIT speech recognition system models pronunciation using a phonemic baseform dictionary along with rewrite rules for modeling phonological variation and multi-word reductions. Each pronunciation component is encoded within a finite-state transducer (FST) representation whose transition weights can be probabilistically trained using a modified EM algorithm for finite-state networks. This paper explains the modeling approach we use and the details of its realization. We demonstrate the benefits and weaknesses of the approach both conceptually and empirically using the recognizer for our JUPITER weather information system. Our experiments demonstrate that the use of phonological rewrite rules within our system reduces word error rates by between 4% and 8% over different test sets when compared against a system using no phonological rewrite rules.
A Segment-Based Audio-Visual Speech Recognizer: Data Collection, Development, and Initial Experiments
In Proc. ICMI, 2004
Cited by 13 (6 self)
This paper presents the development and evaluation of a speaker-independent audio-visual speech recognition (AVSR) system that utilizes a segment-based modeling strategy. To support this research, we have collected a new video corpus, called Audio-Visual TIMIT (AV-TIMIT), which consists of 4 total hours of read speech collected from 223 different speakers. This new corpus was used to evaluate our new AVSR system which incorporates a novel audio-visual integration scheme using segment-constrained Hidden Markov Models (HMMs). Preliminary experiments have demonstrated improvements in phonetic recognition performance when incorporating visual information into the speech recognition process.
Lexical Stress Modeling for Improved Speech Recognition of Spontaneous Telephone Speech in the JUPITER Domain
2001
Cited by 9 (0 self)
This paper examines an approach that uses lexical stress models to improve speech recognition performance on spontaneous telephone speech. We analyzed the correlation of various pitch, energy, and duration measurements with lexical stress on a large corpus of spontaneous utterances, and identified the most informative features of stress using classification experiments. We incorporated the stress models into the recognizer's first-pass Viterbi search and obtained modest but statistically significant improvements over state-of-the-art real-time performance in the JUPITER weather information domain [1].
Training Of Finite-State Transducers and its Application to Pronunciation Modeling
2002
Cited by 7 (3 self)
[Finite-state transducers (FSTs) have been shown] to be useful for a number of applications in speech and language processing. FST operations such as composition, determinization, and minimization make manipulating FSTs very simple. In this paper, we present a method to learn weights for arbitrary FSTs using the EM algorithm. We show that this FST EM algorithm is able to learn pronunciation weights that improve the word error rate for a spontaneous speech recognition task.
Discriminative training of Acoustic Models in a Segment-Based Speech Recognizer
2000
Cited by 1 (0 self)
This thesis explores the use of discriminative training to improve acoustic modeling in a segment-based speech recognizer. In contrast with the more commonly used Maximum Likelihood training, discriminative training considers the likelihoods of competing classes when determining the parameters for a given class's model. Thus, discriminative training works directly to minimize the number of errors made in the recognition of the training data.
Infrastructure Development for Integration of Lip Reading into the SUMMIT Speech Recognizer
2003
This thesis describes a method for augmenting an audio-only speech recognizer with visual lip-reading information, in order to improve the performance and robustness of the recognizer. The speech recognizer's variable-length audio segments are resolved with the fixed-length video frames using segment-constrained hidden Markov modeling. A Viterbi search over the per-segment hidden Markov model resolves the variable asynchrony between the audio and video streams. The two streams are combined according to a relative weighting scheme, which is determined by optimizing on a held-out data set. Although a full audio-visual system has not yet been implemented, this thesis describes the infrastructure that has been developed to accommodate integration with a visual lip-reading module that will be completed in the near future.
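The relative weighting of the two streams, tuned on held-out data, can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not give the combination rule, so a linear interpolation of per-class log-likelihoods with a single audio weight is assumed here, and the function names, the candidate grid, and the toy scores are hypothetical.

```python
def combine_scores(audio_ll, video_ll, audio_weight):
    """Interpolate per-class log-likelihoods from the audio and video streams.

    Assumption: score = w * audio + (1 - w) * video, with 0 <= w <= 1.
    """
    return {
        label: audio_weight * audio_ll[label] + (1.0 - audio_weight) * video_ll[label]
        for label in audio_ll
    }


def classify(audio_ll, video_ll, audio_weight):
    """Return the label with the highest combined score."""
    scores = combine_scores(audio_ll, video_ll, audio_weight)
    return max(scores, key=scores.get)


def tune_weight(held_out, candidates):
    """Grid-search the audio weight that maximizes held-out accuracy.

    held_out: list of (audio_ll, video_ll, true_label) tuples.
    candidates: iterable of weights to try; ties keep the earliest candidate.
    """
    best_weight, best_accuracy = None, -1.0
    for weight in candidates:
        correct = sum(classify(a, v, weight) == y for a, v, y in held_out)
        accuracy = correct / len(held_out)
        if accuracy > best_accuracy:
            best_weight, best_accuracy = weight, accuracy
    return best_weight, best_accuracy


# Toy held-out set with made-up log-likelihoods for two classes.
held_out = [
    ({"p": -1.0, "b": -2.0}, {"p": -3.0, "b": -1.0}, "p"),
    ({"p": -2.0, "b": -1.5}, {"p": -0.5, "b": -3.0}, "p"),
]
best_weight, best_accuracy = tune_weight(held_out, [0.0, 0.25, 0.5, 0.75, 1.0])
```

In this toy set neither stream alone classifies both examples correctly, while an intermediate weight does, which is the situation a held-out search is meant to detect.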
EVALUATION OF SOFT SEGMENT MODELING ON A CONTEXT INDEPENDENT PHONEME CLASSIFICATION SYSTEM
2004
The geometric distribution of state durations is one of the main performance-limiting assumptions of hidden Markov modeling of speech signals. Stochastic segment models in general, and segmental HMMs in particular, partly overcome this deficiency at the cost of more complexity in both the training and recognition phases. In addition to this assumption, the gradual temporal changes of speech statistics have not been modeled in HMMs. In this paper, a new duration modeling approach is presented. The main idea of the model is to consider the effect of adjacent segments on the probability density function estimation and evaluation of each acoustic segment. This idea not only makes the model robust against ...
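As a brief worked note on the geometric duration assumption mentioned in this abstract: in a standard HMM, a state with self-transition probability a_ii (notation assumed here, not taken from the paper) is occupied for exactly d consecutive frames with the probability below, so the most likely duration is always a single frame and the probability decays monotonically with d.

```latex
P(d) = a_{ii}^{\,d-1}\,(1 - a_{ii}), \qquad d = 1, 2, \dots
\qquad
\mathbb{E}[d] = \frac{1}{1 - a_{ii}}
```

For example, a_ii = 0.8 gives an expected duration of 5 frames, yet P(1) = 0.2 is still the largest single value, which is a poor match for phone segments whose durations cluster around a typical length.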

