Results 1 - 10 of 13
A Probabilistic Framework For Segment-Based Speech Recognition
, 2003
Cited by 108 (33 self)
Most current speech recognizers use an observation space based on a temporal sequence of measurements extracted from fixed-length "frames" (e.g., Mel-cepstra). Given a hypothetical word or sub-word sequence, the acoustic likelihood computation always involves all observation frames, though the mapping between individual frames and internal recognizer states will depend on the hypothesized segmentation. There is another type of recognizer whose observation space is better represented as a network, or graph, where each arc in the graph corresponds to a hypothesized variable-length segment that is represented by a fixed-dimensional "feature". In such feature-based recognizers, each hypothesized segmentation will correspond to a segment sequence, or path, through the overall segment-graph that is associated with a subset of all possible feature vectors in the total observation space. In this work we examine a maximum a posteriori decoding strategy for feature-based recognizers and develop a normalization criterion useful for a segment-based Viterbi or A* search. Experiments are reported for both phonetic and word recognition tasks.
A Probabilistic Framework For Feature-Based Speech Recognition
, 1996
Cited by 101 (24 self)
Most current speech recognizers use an observation space which is based on a temporal sequence of "frames" (e.g., Mel-cepstra). There is another class of recognizer which further processes these frames to produce a segment-based network, and represents each segment by fixed-dimensional "features." In such feature-based recognizers the observation space takes the form of a temporal network of feature vectors, so that a single segmentation of an utterance will use a subset of all possible feature vectors. In this work we examine a maximum a posteriori decoding strategy for feature-based recognizers and develop a normalization criterion useful for a segment-based Viterbi or A* search. We report experimental results for the task of phonetic recognition on the TIMIT corpus where we achieved context-independent and context-dependent (using diphones) results on the core test set of 64.1% and 69.5% respectively.
High Performance Speaker-Independent Phone Recognition Using CDHMM
- In Proc. Eurospeech
, 1993
Cited by 41 (11 self)
In this paper we report high phone accuracies on three corpora: WSJ0, BREF and TIMIT. The main characteristics of the phone recognizer are: high dimensional feature vector (48), context- and gender-dependent phone models with duration distribution, continuous density HMM with Gaussian mixtures, and n-gram probabilities for the phonotactic constraints. These models are trained on speech data that have either phonetic or orthographic transcriptions using maximum likelihood and maximum a posteriori estimation techniques. On the WSJ0 corpus with a 46 phone set we obtain phone accuracies of 72.4% and 74.4% using 500 and 1600 CD phone units, respectively. Accuracy on BREF with 35 phones is as high as 78.7% with only 428 CD phone units. On TIMIT using the 61 phone symbols and only 500 CD phone units, we obtain a phone accuracy of 67.2%, which corresponds to 73.4% when the recognizer output is mapped to the commonly used 39 phone set. Making reference to our work on large vocabulary CSR, we show that ...
Automatic Generation Of Detailed Pronunciation Lexicons
, 1995
Cited by 36 (3 self)
We explore different ways of "spelling" a word in a speech recognizer's lexicon and how to obtain those spellings. In particular, we compare using as the source of sub-word units for which we build acoustic models (1) a coarse phonemic representation, (2) a single, fine phonetic realization, and (3) multiple phonetic realizations with associated likelihoods. We describe how we obtain these different pronunciations from text-to-speech systems and from procedures that build decision trees trained on phonetically-labeled corpora. We evaluate these methods applied to speech recognition with the DARPA Resource Management (RM) and the North American Business News (NAB) tasks. For the RM task (with perplexity 60 grammar), we obtain 93.4% word accuracy using phonemic pronunciations, 94.1% using a single phonetic pronunciation per word, and 96.3% using multiple phonetic pronunciations per word with associated likelihoods. For the NAB task (with 60K vocabulary and 34M 1-5 grams), we obtain 87.3% word accuracy with phonemic pronunciations and 90.0% using multiple phonetic pronunciations
Near-Miss Modeling: A Segment-Based Approach to Speech Recognition
, 1998
Cited by 15 (0 self)
Currently, most approaches to speech recognition are frame-based in that they represent speech as a temporal sequence of feature vectors. Although these approaches have been successful, they cannot easily incorporate complex modeling strategies that may further improve speech recognition performance. In contrast, segment-based approaches represent speech as a temporal graph of feature vectors and facilitate the incorporation of a wide range of modeling strategies. However, difficulties in segment-based recognition have impeded the realization of potential advantages in modeling. This thesis
Phone Clustering Using The Bhattacharyya Distance
- In Proceedings of the International Conference on Spoken Language Processing
, 1996
Cited by 14 (3 self)
In this paper we study using the classification-based Bhattacharyya distance measure to guide biphone clustering. The Bhattacharyya distance is a theoretical distance measure between two Gaussian distributions which is equivalent to an upper bound on the optimal Bayesian classification error probability. It also has the desirable properties of being computationally simple and extensible to more Gaussian mixtures. Using the Bhattacharyya distance measure in a data-driven approach together with a novel 2-Level Agglomerative Hierarchical Biphone Clustering algorithm, generalized left/right biphones (BGBs) are derived. A neural-net based phone recognizer trained on the BGBs is found to have better frame-level phone recognition than one trained on generalized biphones (BCGBs) derived from a set of commonly used broad categories. We further evaluate the new BGBs on an isolated-word recognition task of perplexity 40 and obtain a 16.2% error reduction over the broad-category generalized biphones (BCGBs) and a 41.8% error reduction over the monophones.
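As a rough illustration of the distance measure this abstract refers to, the sketch below computes the standard closed-form Bhattacharyya distance between two Gaussians and the corresponding bound on the two-class Bayes error. It is not taken from the paper: the diagonal-covariance restriction, the function names, and the equal-prior default are assumptions made here to keep the example short.

```python
import math


def bhattacharyya_distance(mean1, var1, mean2, var2):
    """Bhattacharyya distance between two diagonal-covariance Gaussians.

    Assumption for this sketch: covariances are diagonal, so each argument
    is a plain list of per-dimension means or variances and the distance
    is a sum of independent one-dimensional terms.
    """
    if not (len(mean1) == len(var1) == len(mean2) == len(var2)):
        raise ValueError("all inputs must have the same dimensionality")
    distance = 0.0
    for m1, v1, m2, v2 in zip(mean1, var1, mean2, var2):
        avg_var = 0.5 * (v1 + v2)
        # Term driven by the separation of the means.
        distance += 0.125 * (m1 - m2) ** 2 / avg_var
        # Term driven by the mismatch of the variances.
        distance += 0.5 * math.log(avg_var / math.sqrt(v1 * v2))
    return distance


def bayes_error_bound(distance, prior1=0.5):
    """Upper bound on the two-class Bayes error: sqrt(P1 * P2) * exp(-D_B)."""
    return math.sqrt(prior1 * (1.0 - prior1)) * math.exp(-distance)


if __name__ == "__main__":
    d = bhattacharyya_distance([0.0, 0.0], [1.0, 1.0], [2.0, 0.0], [1.0, 1.0])
    print(d, bayes_error_bound(d))
```

A larger distance gives a smaller error bound, which is why such a measure can serve as a merge criterion in clustering: pairs of units with a small distance are the hardest to tell apart. How the paper's two-level clustering algorithm actually uses the distance is not described in the abstract, so that part is not sketched here.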
Context-Dependent Modeling in a Segment-Based Speech Recognition System
- S.M. thesis, MIT
, 1997
Cited by 5 (0 self)
The goal of this thesis is to explore various strategies for incorporating contextual information into a segment-based speech recognition system, while maintaining computational costs at a level acceptable for implementation in a real-time system. The latter is achieved by using context-independent models in the search, while context-dependent models are reserved for re-scoring the hypotheses proposed by the context-independent system. Within this framework, several types of context-dependent sub-word units were evaluated, including word-dependent, biphone, and triphone units. In each case, deleted interpolation was used to compensate for the lack of training data for the models. Other types of context-dependent modeling, such as context-dependent boundary modeling and "offset" modeling, were also used successfully in the re-scoring pass. The evaluation of the system was performed using the Resource Management task. Context-dependent segment models were able to reduce the error rate of t...
Context-sensitive hidden Markov models for modeling long-range dependencies in symbol sequences
- IEEE Trans. Signal Processing
, 2006
Cited by 5 (3 self)
The hidden Markov model (HMM) has been widely used in signal processing and digital communication applications. It is well-known for its efficiency in modeling short-term dependencies between adjacent symbols. However, it cannot be used for modeling long-range interactions between symbols that are distant from each other. In this paper, we introduce the concept of context-sensitive HMM. The proposed model is capable of modeling strong pairwise correlations between distant symbols. Based on this model, we propose dynamic programming algorithms that can be used for finding the optimal state sequence and for computing the probability of an observed symbol string. Furthermore, we also introduce a parameter re-estimation algorithm, which can be used for optimizing the model parameters based on the given training sequences.
A Back-off Discriminative Acoustic Model for Automatic Speech Recognition
Cited by 2 (2 self)
In this paper we propose a back-off discriminative acoustic model for Automatic Speech Recognition (ASR). We use a set of broad phonetic classes to divide the classification problem originating from context-dependent modeling into a set of sub-problems. By appropriately combining the scores from classifiers designed for the sub-problems, we can guarantee that the back-off acoustic score for different context-dependent units will be different. The back-off model can be combined with discriminative training algorithms to further improve the performance. Experimental results on a large vocabulary lecture transcription task show that the proposed back-off discriminative acoustic model has more than a 2.0% absolute word error rate reduction compared to a clustering-based acoustic model. Index Terms: context-dependent acoustic modeling, back-off acoustic models, discriminative training,
Joint work with
, 1996
Text and speech processing: hard problems Theory of automata Appropriate level of abstraction

