Speaker Adaptation Using Constrained Estimation of Gaussian Mixtures
IEEE Transactions on Speech and Audio Processing, 1995
Cited by 65 (2 self)
A recent trend in automatic speech recognition systems is the use of continuous mixture-density hidden Markov models (HMMs). Despite the good recognition performance that these systems achieve on average in large vocabulary applications, there is a large variability in performance across speakers. Performance degrades dramatically when the user is radically different from the training population. A popular technique that can improve the performance and robustness of a speech recognition system is adapting speech models to the speaker, and more generally to the channel and the task. In continuous mixture-density HMMs the number of component densities is typically very large, and it may not be feasible to acquire a sufficient amount of adaptation data for robust maximum-likelihood estimates. To solve this problem, we propose a constrained estimation technique for Gaussian mixture densities. The algorithm is evaluated on the large-vocabulary Wall Street Journal corpus for both ...
The SRI March 2000 Hub-5 conversational speech transcription system
In Proceedings of the NIST Speech Transcription Workshop, 2000
Cited by 26 (6 self)
We describe SRI’s large vocabulary conversational speech recognition system as used in the March 2000 NIST Hub-5E evaluation. The system performs four recognition passes: (1) bigram recognition with phone-loop-adapted, within-word triphone acoustic models, (2) lattice generation with transcription-mode-adapted models, (3) trigram lattice recognition with adapted cross-word triphone models, and (4) N-best rescoring and reranking with various additional knowledge sources. The system incorporates two new kinds of acoustic model: triphone models conditioned on speaking rate, and an explicit joint model of within-word phone durations. We also obtained an unusually large improvement from modeling cross-word pronunciation variants in “multiword” vocabulary items. The language model (LM) was enhanced with an “anti-LM” representing acoustically confusable word sequences. Finally, we applied a generalized ROVER algorithm to combine the N-best hypotheses from several systems based on different acoustic models.
Using Self-Organizing Maps and Learning Vector Quantization for Mixture Density Hidden Markov Models
1997
Cited by 19 (8 self)
This work presents experiments to recognize pattern sequences using hidden Markov models (HMMs). The pattern sequences in the experiments are computed from speech signals and the recognition task is to decode the corresponding phoneme sequences. The training of the HMMs of the phonemes using the collected speech samples is a difficult task because of the natural variation in the speech. Two neural computing paradigms, the Self-Organizing Map (SOM) and the Learning Vector Quantization (LVQ), are used in the experiments to improve the recognition performance of the models. An HMM consists of sequential states which are trained to model the feature changes in the signal produced during the modeled process. The output densities applied in this work are mixtures of Gaussian density functions. SOMs are applied to initialize and train the mixtures to give a smooth and faithful presentation of the feature vector space defined by the corresponding training samples. The SOM maps similar feature vect...
Automatic Scoring of Pronunciation Quality
Speech Communication, 1999
Cited by 15 (4 self)
We present a paradigm for the automatic assessment of pronunciation quality by machine.
Training Data Clustering For Improved Speech Recognition
In Proceedings of EUROSPEECH, 1995
Cited by 11 (3 self)
We present an approach to cluster the training data for automatic speech recognition (ASR). A relative-entropy-based distance metric between training data clusters is defined. This metric is used to hierarchically cluster the training data. The metric can also be used to select the closest training data clusters given a small amount of data from the test speaker. The selected clusters are then used to estimate a set of hidden Markov models (HMMs) for recognizing the speech from the test speaker. We present preliminary experimental results of the clustering algorithm and its application to ASR.
1 Introduction
While progress in ASR has been encouraging, it has become increasingly clear that ASR systems must perform well in the presence of mismatches between the training and testing environments. ASR systems trained in one environment often perform poorly in a new environment due to mismatches between the training and testing conditions. Common sources of mismatches include different tran...
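The clustering idea in this abstract can be illustrated with a minimal sketch. It assumes each training cluster is summarized by a single diagonal-covariance Gaussian and uses the symmetric relative entropy between those Gaussians as the distance; the abstract does not give the paper's exact metric, so this stand-in and the function names `sym_kl`, `fit`, `cluster`, and `closest` are hypothetical.

```python
def sym_kl(g1, g2):
    """Symmetric relative entropy between two diagonal Gaussians.

    Each Gaussian is a (mean, variance) pair of equal-length lists.
    KL(1||2) + KL(2||1) is summed per dimension; the log terms cancel.
    """
    (m1, v1), (m2, v2) = g1, g2
    total = 0.0
    for a, s, b, t in zip(m1, v1, m2, v2):
        total += 0.5 * (s / t + t / s - 2.0)
        total += 0.5 * (a - b) ** 2 * (1.0 / s + 1.0 / t)
    return total


def fit(points, floor=1e-3):
    """Fit one diagonal Gaussian to a list of feature vectors (assumed model)."""
    n, dim = len(points), len(points[0])
    mean = [sum(p[i] for p in points) / n for i in range(dim)]
    var = [max(sum((p[i] - mean[i]) ** 2 for p in points) / n, floor)
           for i in range(dim)]
    return mean, var


def cluster(groups, target):
    """Bottom-up clustering: merge the closest pair until `target` groups remain."""
    groups = [list(g) for g in groups]
    while len(groups) > target:
        models = [fit(g) for g in groups]
        _, i, j = min((sym_kl(models[i], models[j]), i, j)
                      for i in range(len(groups))
                      for j in range(i + 1, len(groups)))
        groups[i] = groups[i] + groups.pop(j)  # j > i, so index i is unaffected
    return groups


def closest(groups, test_points):
    """Index of the training cluster nearest to a small test-speaker sample."""
    probe = fit(test_points)
    return min(range(len(groups)), key=lambda k: sym_kl(fit(groups[k]), probe))


# Toy one-dimensional "speakers": two near 0 and two near 10.
speakers = [[[0.0], [1.0]], [[0.5], [1.5]], [[10.0], [11.0]], [[10.5], [11.5]]]
merged = cluster(speakers, 2)
nearest = closest(merged, [[9.0], [10.0]])
```

In this toy run the two low-valued speakers are merged into one cluster and the two high-valued speakers into the other, and `closest` picks the high-valued cluster for the test sample.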
Maximum-likelihood stochastic-transformation adaptation of hidden Markov models
IEEE Trans. on Speech Audio Processing, 1999
Cited by 8 (0 self)
The recognition accuracy in recent large vocabulary automatic speech recognition (ASR) systems is highly related to the existing mismatch between the training and testing sets. For example, dialect differences across the training and testing speakers result in a significant degradation in recognition performance. Some popular adaptation approaches improve the recognition performance of speech recognizers based on hidden Markov models with continuous mixture densities by using linear transformations to adapt the means, and possibly the covariances, of the mixture Gaussians. The linear assumption, however, is too restrictive, and in this paper we propose a novel adaptation technique that adapts the means and, optionally, the covariances of the mixture Gaussians by using multiple stochastic transformations. We perform both speaker and dialect adaptation experiments, and we show that our method significantly improves the recognition accuracy and the robustness of our system. The experiments are carried out with SRI’s DECIPHER™ speech recognition system. Index Terms: Speaker adaptation, speech recognition, robust recognition.
A study of multilingual speech recognition
In Proc. European Conf. on Speech Communication and Technology, 1997
Cited by 8 (0 self)
This paper describes our work in developing multilingual (Swedish and English) speech recognition systems in the ATIS domain. The acoustic component of the multilingual systems is realized through sharing Gaussian codebooks across Swedish and English allophones. The language model (LM) components are constructed by training a statistical bigram model, with a common backoff node, on bilingual texts, and by combining two monolingual LMs into a probabilistic finite state grammar. This system uses a single decoder for Swedish and English sentences, and is capable of recognizing sentences with words from both languages. Preliminary experiments show that sharing acoustic models across the two languages has not resulted in improved performance, while sharing a backoff node at the LM component provides flexibility and ease in recognizing bilingual sentences at the expense of a slight increase in word error rate in some cases. As a by-product, the bilingual decoder also achieves good performance on language identification (LID).
Improving And Predicting Performance Of Statistical Language Models In Sparse Domains
1998
Cited by 7 (1 self)
Standard statistical language models, or n-gram models, which represent the probability of word sequences, suffer from sparse-data problems in tasks where large amounts of domain-specific text are not available. This thesis focuses on improving the estimation of domain-dependent n-gram models by using out-of-domain text data. Previous approaches for estimating language models from multi-domain data have not accounted for the characteristic variations of style and content across domains. In contrast, this thesis introduces two approaches that compensate for multi-domain differences, both representing "style" by part-of-speech (POS) sequences and "content" by the particular choice of words. First, data from multiple domains is combined using similarity weighting schemes that discriminate for content and style relevance prior to pooling multi-domain text. Second, n-gram distributions from multiple domains are combined via a POS-dependent n-gram framework that separately compensates for word and POS usage differences. Two variations are explored: explicitly transforming the out-of-domain distribution before combining with an in-domain model, and separately estimating components of the POS-dependent n-gram model using multi-domain data. Finally, measures to analyze and predict recognition performance of language models are also investigated, resulting in an algorithm for predicting performance differences associated with localized changes in language models given a recognition system.
Discriminative Mixture Weight Estimation For Large Gaussian Mixture Models
1999
Cited by 6 (0 self)
This paper describes a new approach to acoustic modeling for large vocabulary continuous speech recognition (LVCSR) systems. Each phone is modeled with a large Gaussian mixture model (GMM) whose context-dependent mixture weights are estimated with a sentence-level discriminative training criterion. The estimation problem is cast in a neural network framework, which enables the incorporation of the appropriate constraints on the mixture weight vectors, and allows a straightforward training procedure based on steepest descent. Experiments conducted on the Callhome-English and Switchboard databases show a significant improvement of the acoustic model performance, and a somewhat lesser improvement with the combined acoustic and language models.
1. INTRODUCTION
Many factors contribute to the relatively high error rates observed in LVCSR systems (e.g. diversity of speaking styles, pronunciation variants, variable degrees of articulation, noises, channel effects). By enlarging the set ...
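One common way to keep mixture weights positive and summing to one under gradient steps is a softmax parameterization, sketched below. This is an illustration only: the objective here is the plain average log-likelihood with fixed component likelihoods, used as a stand-in because the abstract does not spell out the sentence-level discriminative criterion, and `softmax` and `train_weights` are hypothetical names.

```python
import math


def softmax(a):
    """Map unconstrained activations to weights that are positive and sum to 1."""
    top = max(a)
    e = [math.exp(x - top) for x in a]
    z = sum(e)
    return [x / z for x in e]


def train_weights(comp_lik, steps=200, lr=0.5):
    """Gradient ascent on the average log-likelihood of the mixture weights.

    comp_lik[t][k] is the likelihood of frame t under Gaussian k; the
    Gaussians themselves are held fixed (an assumption of this sketch).
    """
    k = len(comp_lik[0])
    a = [0.0] * k  # unconstrained activations, start from uniform weights
    for _ in range(steps):
        w = softmax(a)
        grad = [0.0] * k
        for row in comp_lik:
            denom = sum(wj * pj for wj, pj in zip(w, row))
            for j in range(k):
                # d log p(frame) / d a_j = posterior of component j minus w_j
                grad[j] += w[j] * row[j] / denom - w[j]
        a = [aj + lr * g / len(comp_lik) for aj, g in zip(a, grad)]
    return softmax(a)


# Toy data: three frames favour component 0, one frame favours component 1.
frames = [[0.9, 0.1]] * 3 + [[0.1, 0.9]]
weights = train_weights(frames)
```

Because the steps are taken on the activations rather than on the weights, no projection back onto the constraint set is needed after each update.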
On-Line Adaptation Of Hidden Markov Models Using Incremental Estimation Algorithms
IEEE Trans. Speech Audio Processing, 1999
Cited by 6 (0 self)
The mismatch that frequently occurs between the training and testing conditions of an automatic speech recognizer can be efficiently reduced by adapting the parameters of the recognizer to the testing conditions. The maximum likelihood adaptation algorithms for continuous-density hidden-Markov-model (HMM) based speech recognizers are fast, in the sense that a small amount of data is required for adaptation. They are, however, based on reestimating the model parameters using the batch version of the expectation-maximization (EM) algorithm. The multiple iterations required for the EM algorithm to converge make these adaptation schemes computationally expensive and not suitable for on-line applications, since multiple passes through the adaptation data are required. In this paper we show how incremental versions of the EM and the segmental k-means algorithm can be used to improve the convergence of these adaptation methods so that they can be used in on-line applications.
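The contrast between batch and incremental EM can be illustrated with a small sketch. It is not the paper's adaptation algorithm: it re-estimates only the means of a one-dimensional Gaussian mixture with equal weights and a shared fixed variance, and the names `responsibilities` and `incremental_em_means` are hypothetical. The model is refreshed after every observation by swapping that observation's old contribution to the sufficient statistics for its new one.

```python
import math


def responsibilities(x, means, var):
    """Posterior of each component for sample x (equal weights, shared variance)."""
    logs = [-0.5 * (x - m) ** 2 / var for m in means]
    top = max(logs)  # subtract the maximum to avoid underflow
    lik = [math.exp(v - top) for v in logs]
    z = sum(lik)
    return [v / z for v in lik]


def incremental_em_means(data, init_means, var=1.0, passes=1):
    """Incremental EM for the means of a 1-D Gaussian mixture."""
    means = list(init_means)
    k = len(means)
    # One initial E-step with the starting model fills the sufficient statistics.
    resp = [responsibilities(x, means, var) for x in data]
    s0 = [sum(r[j] for r in resp) for j in range(k)]  # soft counts
    s1 = [sum(r[j] * x for r, x in zip(resp, data)) for j in range(k)]  # weighted sums
    for _ in range(passes):
        for i, x in enumerate(data):
            new = responsibilities(x, means, var)
            for j in range(k):
                # Replace sample i's old contribution, then refresh the mean at once.
                s0[j] += new[j] - resp[i][j]
                s1[j] += (new[j] - resp[i][j]) * x
                means[j] = s1[j] / s0[j]
            resp[i] = new
    return means


# Toy adaptation data drawn around -2 and +2; the starting means are -1 and +1.
samples = [-2.1, -1.9, -2.0, 2.0, 1.9, 2.1]
adapted = incremental_em_means(samples, [-1.0, 1.0], passes=5)
```

Since the means change after each sample instead of once per sweep, later samples in the same pass are already scored against an improved model, which is the property that makes this style of update attractive for on-line use.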

