Results 1 - 10
of
32
Structural Maximum A Posteriori Linear Regression For Fast HMM Adaptation
, 2000
"... Transformation-based model adaptation techniques like maximum likelihood linear regression (MLLR) rely on an accurate selection of the number of transformations for a given amount of adaptation data. If too many transformations are used, the transformation parameters may be poorly estimated, can ove ..."
Abstract
-
Cited by 54 (5 self)
- Add to MetaCart
Transformation-based model adaptation techniques like maximum likelihood linear regression (MLLR) rely on an accurate selection of the number of transformations for a given amount of adaptation data. If too many transformations are used, the transformation parameters may be poorly estimated, can overfit the adaptation data, and offer poor generalization. On the other hand, if the number of transformations is too small, the adapted models can only provide a moderate improvement over the baseline models. An adaptation approach should therefore be flexible in order to estimate reliably a large number of transformations when the amount of adaptation data is large, and a small number of transformations when only a few adaptation utterances are available. In this work, we show that a significant improvement can be obtained over MLLR with dynamic regression classes, first by replacing the maximum likelihood estimation criterion by a maximum a posteriori criterion, then by introducing a tree-structure for the prior densities of the transformations. The effectiveness of the proposed approach is illustrated on the Spoke3 1993 test set of the WSJ task. Using the same regression classes as MLLR, it is shown that the proposed approach reduces the risk of overfitting and exploit the adaptation data much more efficiently than MLLR, leading to a significant reduction of the word error rate with as little as one adaptation utterance.
Speaker Clustering And Transformation For Speaker Adaptation In Large-Vocabulary Speech Recognition Systems
- IEEE TRANS. SPEECH AND SIGNAL PROCESSING
, 1995
"... A speaker adaptation strategy is described that is based on finding a subset of speakers, from the training set, who are acoustically close to the test speaker, and using only the data from these speakers (rather than the complete training corpus) to re-estimate the system parameters. Further, a li ..."
Abstract
-
Cited by 27 (3 self)
- Add to MetaCart
A speaker adaptation strategy is described that is based on finding a subset of speakers, from the training set, who are acoustically close to the test speaker, and using only the data from these speakers (rather than the complete training corpus) to re-estimate the system parameters. Further, a linear transformation is computed for every one of the selected training speakers to better map the training speaker's data to the test speaker's acoustic space. Finally, the system parameters (Gaussian means) are re-estimated specifically for the test speaker using the transformed data from the selected training speakers. Experiments showed that this scheme is capable of reducing the error rate by 10-15% with the use of as little as 3 sentences of adaptation data.
On adaptive decision rules and decision parameter adaptation for automatic speech recognition
- Proc. IEEE
, 2000
"... Recent advances in automatic speech recognition are accomplished by designing a plug-in maximum a posteriori decision rule such that the forms of the acoustic and language model distributions are specified and the parameters of the assumed distributions are estimated from a collection of speech and ..."
Abstract
-
Cited by 16 (3 self)
- Add to MetaCart
Recent advances in automatic speech recognition are accomplished by designing a plug-in maximum a posteriori decision rule such that the forms of the acoustic and language model distributions are specified and the parameters of the assumed distributions are estimated from a collection of speech and language training corpora. Maximum-likelihood point estimation is by far the most prevailing training method. However, due to the problems of unknown speech distributions, sparse training data, high spectral and temporal variabilities in speech, and possible mismatch between training and testing conditions, a dynamic training strategy is needed. To cope with the changing speakers and speaking conditions in real operational conditions for high-performance speech recognition, such paradigms incorporate a small amount of speaker and environment specific adaptation data into the training process. Bayesian adaptive learning is an optimal way to combine
Maximum-likelihood stochastic-transformation adaptation of hidden Markov models
- IEEE Trans. on Speech Audio Processing
, 1999
"... Abstract—The recognition accuracy in recent large vocabulary automatic speech recognition (ASR) systems is highly related to the existing mismatch between the training and testing sets. For example, dialect differences across the training and testing speakers result to a significant degradation in r ..."
Abstract
-
Cited by 8 (0 self)
- Add to MetaCart
Abstract—The recognition accuracy in recent large vocabulary automatic speech recognition (ASR) systems is highly related to the existing mismatch between the training and testing sets. For example, dialect differences across the training and testing speakers result to a significant degradation in recognition performance. Some popular adaptation approaches improve the recognition performance of speech recognizers based on hidden Markov models with continuous mixture densities by using linear transformations to adapt the means, and possibly the covariances of the mixture Gaussians. The linear assumption, however, is too restrictive, and in this paper we propose a novel adaptation technique that adapts the means and, optionally, the covariances of the mixture Gaussians by using multiple stochastic transformations. We perform both speaker and dialect adaptation experiments, and we show that our method significantly improves the recognition accuracy and the robustness of our system. The experiments are carried out with SRI’s DECIPHER TM speech recognition system. Index Terms—Speaker adaptation, speech recognition, robust recognition. I.
The SRI EduSpeak System: Recognition and pronunciation scoring for language learning
- PROC. OF INTEGRATING SPEECH TECHNOLOGY IN LANGUAGE LEARNING
"... The EduSpeak system is a software development toolkit that enables developers of interactive language education software to use state-of-the-art speech recognition and pronunciation scoring technology. We first report results on the application of adaptation techniques to recognize both native and ..."
Abstract
-
Cited by 7 (0 self)
- Add to MetaCart
The EduSpeak system is a software development toolkit that enables developers of interactive language education software to use state-of-the-art speech recognition and pronunciation scoring technology. We first report results on the application of adaptation techniques to recognize both native and nonnative speech in a speaker-independent manner. We discuss our pronunciation scoring paradigm and show experimental results in the form of correlations between the pronunciation quality estimators included in the toolkit and grades given by human listeners. We review phone-level pronunciation estimation schemes and describe the phone-level mispronunciation detection functionality that we have incorporated in the toolkit. Finally, we mention some of the EduSpeak toolkit system features that facilitate the creation and deployment of computer-assisted language learning (CALL) applications.
Statistical Modelling in Continuous Speech Recognition (CSR)
- IN CONFERENCE ON UNCERTAINTY IN ARTIFICIAL INTELLIGENCE
, 2001
"... Automatic continuous speech recognition (CSR) is sufficiently ..."
Abstract
-
Cited by 7 (1 self)
- Add to MetaCart
Automatic continuous speech recognition (CSR) is sufficiently
The Use of Speaker Correlation Information for Automatic Speech Recognition
, 1998
"... This dissertation addresses the independence of observations assumption whichis typically made by today's automatic speech recognition systems. This assumption ignores within-speaker correlations which are known to exist. The assumption clearly damages the recognition ability of standard speaker in ..."
Abstract
-
Cited by 7 (3 self)
- Add to MetaCart
This dissertation addresses the independence of observations assumption whichis typically made by today's automatic speech recognition systems. This assumption ignores within-speaker correlations which are known to exist. The assumption clearly damages the recognition ability of standard speaker independent systems, as can seen by the severe drop in performance exhibited by systems between their speaker dependent mode and their speaker independent mode. The typical solution to this problem is to apply speaker adaptation to the models of the speaker independent system. This approach is examined in this thesis with the explicit goal of improving the rapid adaptation capabilities of the system by incorporating within-speaker correlation information into the adaptation process. This is achieved through the creation of an adaptation technique called referencespeaker weighting and in the development of a speaker clustering technique called speaker cluster weighting. However, speaker adaptation is just one way in which the independence assumption can be attacked. This dissertation also introduces a novel speech recognition technique called consistency modeling. This technique utilizes a priori knowledge about the within-speaker correlations which exist between di#erent phonetic events for the purpose of incorporating speaker constraintinto a speech recognition system without explicitly applying speaker adaptation. These new techniques are implemented within a segment-based speech recognition system and evaluation results are reported on the DARPA Resource Management recognition task.
Online Bayesian tree-structured transformation of HMMs with optimal model selection for speaker adaptation
- IEEE Trans. Speech and Audio Proc
, 2001
"... Abstract—This paper presents a new recursive Bayesian learning approach for transformation parameter estimation in speaker adaptation. Our goal is to incrementally transform or adapt a set of hidden Markov model (HMM) parameters for a new speaker and gain large performance improvement from a small a ..."
Abstract
-
Cited by 6 (2 self)
- Add to MetaCart
Abstract—This paper presents a new recursive Bayesian learning approach for transformation parameter estimation in speaker adaptation. Our goal is to incrementally transform or adapt a set of hidden Markov model (HMM) parameters for a new speaker and gain large performance improvement from a small amount of adaptation data. By constructing a clustering tree of HMM Gaussian mixture components, the linear regression (LR) or affine transformation parameters for HMM Gaussian mixture components are dynamically searched. An online Bayesian learning technique is proposed for recursive maximum a posteriori (MAP) estimation of LR and affine transformation parameters. This technique has the advantages of being able to accommodate flexible forms of transformation functions as well as a priori probability density functions (pdfs). To balance between model complexity and goodness of fit to adaptation data, a dynamic programming algorithm is developed for selecting models using a Bayesian variant of the “minimum description length ” (MDL) principle. Speaker adaptation experiments with a 26-letter English alphabet vocabulary were conducted, and the results confirmed effectiveness of the online learning framework. Index Terms—Affine transformation, Bayesian model selection, hidden Markov models (HMMs), linear regression (LR), model
Improved Modeling and Efficiency for Automatic Transcription of Broadcast News
, 2000
"... Over the last few years, the DARPA-sponsored Hub4 continuous speech recognition evaluations have pushed speech recognition technology for the very interesting and difficult task of automatically transcribing broadcast news. In this paper, we report on our research and progress on this problem. We fo ..."
Abstract
-
Cited by 5 (0 self)
- Add to MetaCart
Over the last few years, the DARPA-sponsored Hub4 continuous speech recognition evaluations have pushed speech recognition technology for the very interesting and difficult task of automatically transcribing broadcast news. In this paper, we report on our research and progress on this problem. We focus on individual techniques we developed, rather than on descriptions of our evaluation systems. We provide comparative experimental results showing the improvements obtained with the novel approaches we developed. 1 Introduction In recent years there has been increasing interest in developing large-vocabulary continuous speech recognition (LVCSR) systems for speech found in real sources. Broadcast news, in particular, has been the testbed for the DARPA-sponsored Hub4 continuous speech recognition (CSR) evaluations over the last few years, and represents a significant challenge to speech recognition researchers. Many interesting problems are associated with the automatic recognition of b...
Detection of OOV Words Using Generalized Word Models and a Semantic Class Language Model
, 2001
"... This paper describes an approach to detect out-of-vocabulary words in spontaneous speech using a language model built on semantic categories and a new type of generalized word models consisting of a mixture of specific and general acoustic units. We demonstrate the construction of the generalized wo ..."
Abstract
-
Cited by 5 (0 self)
- Add to MetaCart
This paper describes an approach to detect out-of-vocabulary words in spontaneous speech using a language model built on semantic categories and a new type of generalized word models consisting of a mixture of specific and general acoustic units. We demonstrate the construction of the generalized word models as replacements for surnames in a German spontaneous travel planning task GSST [1]. We show that the use of our generalized word models improves recognition accuracy in cases where out-of-vocabulary words appear and does not lead to a degradation of the overall recognition accuracy. In our experiments we measured recall and precision rates of OOV-detection which are close to their theoretic optimum. Furthermore, we compared the effect of using cross-word-triphones vs. using context-independent cross-word models. We show that when using generalized word models with cross-word-triphones, the expected number of consequential errors following an OOV word can be reduced significantly by 37%.

