Results 1 - 10 of 27
Maximum Likelihood Linear Transformations for HMM-Based Speech Recognition
- Computer Speech and Language
, 1998
Cited by 275 (44 self)
This paper examines the application of linear transformations for speaker and environmental adaptation in an HMM-based speech recognition system. In particular, transformations that are trained in a maximum likelihood sense on adaptation data are investigated. Other than in the form of a simple bias, strict linear feature-space transformations are inappropriate in this case. Hence, only model-based linear transforms are considered. The paper compares the two possible forms of model-based transforms: (i) unconstrained, where any combination of mean and variance transform may be used, and (ii) constrained, which requires the variance transform to have the same form as the mean transform (sometimes referred to as feature-space transforms). Re-estimation formulae for all appropriate cases of transform are given. This includes a new and efficient "full" variance transform and the extension of the constrained model-space transform from the simple diagonal case to the full or block-diagonal case. The constrained and unconstrained transforms are evaluated in terms of computational cost, recognition time efficiency, and use for speaker adaptive training. The recognition performance of the two model-space transforms on a large vocabulary speech recognition task using incremental adaptation is investigated. In addition, initial experiments using the constrained model-space transform for speaker adaptive training are detailed. (Author footnote: the author is now at the IBM T.J. Watson Research Center, Yorktown Heights, NY 10598, USA.)
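The abstract above notes that constrained model-space transforms are sometimes referred to as feature-space transforms. The following is a minimal numerical sketch of why, not the paper's implementation. It assumes the commonly used convention in which the adapted mean is `A_model @ mu - b_model` and the adapted covariance is `A_model @ sigma @ A_model.T`. All variable names and the random test values are hypothetical.

```python
import numpy as np

def log_gauss(x, mean, cov):
    """Log density of a full-covariance Gaussian evaluated at x."""
    d = x.shape[0]
    diff = x - mean
    _, logdet = np.linalg.slogdet(cov)
    maha = diff @ np.linalg.solve(cov, diff)
    return -0.5 * (d * np.log(2.0 * np.pi) + logdet + maha)

rng = np.random.default_rng(0)
d = 3
mu = rng.normal(size=d)
L = rng.normal(size=(d, d))
sigma = L @ L.T + d * np.eye(d)                      # symmetric positive definite
A_model = rng.normal(size=(d, d)) + 3.0 * np.eye(d)  # hypothetical transform
b_model = rng.normal(size=d)
o = rng.normal(size=d)                               # one observation vector

# Model-space view: the same matrix transforms the mean and the covariance.
mu_hat = A_model @ mu - b_model
sigma_hat = A_model @ sigma @ A_model.T
ll_model = log_gauss(o, mu_hat, sigma_hat)

# Feature-space view: transform the observation instead, keep the original
# Gaussian, and add the log-determinant (Jacobian) term.
A_feat = np.linalg.inv(A_model)
b_feat = A_feat @ b_model
_, logdet_A = np.linalg.slogdet(A_feat)
ll_feature = log_gauss(A_feat @ o + b_feat, mu, sigma) + logdet_A
```

Under these assumptions the two log-likelihoods agree, so the transform can be applied to the observations once rather than to every Gaussian in the model set.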
Mean and Variance Adaptation within the MLLR Framework
- Computer Speech & Language
, 1996
Cited by 80 (15 self)
One of the key issues for adaptation algorithms is to modify a large number of parameters with only a small amount of adaptation data. Speaker adaptation techniques try to obtain near speaker dependent (SD) performance with only small amounts of speaker specific data, and are often based on initial speaker independent (SI) recognition systems. Some of these speaker adaptation techniques may also be applied to the task of adaptation to a new acoustic environment. In this case an SI recognition system trained in, typically, a clean acoustic environment is adapted to operate in a new, noise-corrupted, acoustic environment. This paper examines the Maximum Likelihood Linear Regression (MLLR) adaptation technique. MLLR estimates linear transformations for groups of model parameters to maximise the likelihood of the adaptation data. Previously, MLLR has been applied to the mean parameters in mixture Gaussian HMM systems. In this paper MLLR is extended to also update the Gaussian variances and re-estimation formulae are derived for these variance transforms. MLLR with variance compensation is evaluated on several large vocabulary recognition tasks. The use of mean and variance MLLR adaptation was found to give an additional 2% to 7% decrease in word error rate over mean-only MLLR adaptation.
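As an illustration of the mean-only case described above, here is a small sketch of applying one shared MLLR transform to a group of Gaussian means. It uses the standard form in which each adapted mean is `W @ [1, mu]`, with `W = [b, A]`. The function name `apply_mllr_mean` and the numbers are hypothetical, and estimating `W` from adaptation data and the variance transform are both left out.

```python
import numpy as np

def apply_mllr_mean(W, means):
    """Apply one shared d x (d+1) transform W to every mean in a group.

    means has shape (n, d); each row is extended to [1, mu] before the product.
    """
    ones = np.ones((means.shape[0], 1))
    extended = np.hstack([ones, means])
    return extended @ W.T

# Hypothetical transform: W = [b, A], so the adapted mean is A @ mu + b.
A = np.array([[1.1, 0.0],
              [0.2, 0.9]])
b = np.array([0.5, -0.3])
W = np.hstack([b[:, None], A])

# Three Gaussian means assumed to share this transform.
means = np.array([[0.0, 0.0],
                  [1.0, 2.0],
                  [-1.0, 0.5]])
adapted = apply_mllr_mean(W, means)
```

Because every mean in the group shares `W`, a Gaussian that is never observed in the adaptation data is still moved, which is how the technique updates many parameters from little data.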
The Generation And Use Of Regression Class Trees For MLLR Adaptation
, 1996
Cited by 51 (8 self)
Maximum likelihood linear regression (MLLR) is an adaptation technique suitable for both speaker and environmental model-based adaptation. The models are adapted using a set of linear transformations, estimated in a maximum likelihood fashion from the available adaptation data. As these transformations can capture general relationships between the original model set and the current speaker, or new acoustic environment, they can be effective in adapting all the HMM distributions with limited adaptation data. Two important decisions that must be made are (i) how to cluster components together, such that they all have a similar transformation matrix, and (ii) how many transformation matrices to generate for a given block of adaptation data. This paper addresses both problems. Firstly it describes two optimal clustering techniques, in the sense of maximising the likelihood of the adaptation data. The first assigns each component to one of the regression classes. This may be used to generat...
Cluster Adaptive Training Of Hidden Markov Models
- IEEE Transactions on Speech and Audio Processing
, 1999
Cited by 36 (11 self)
When performing speaker adaptation there are two conflicting requirements. First the transform must be powerful enough to represent the speaker. Second the transform must be quickly and easily estimated for any particular speaker. The most popular adaptation schemes have used many parameters to adapt the models to be representative of an individual speaker. This limits how rapidly the models may be adapted to a new speaker or acoustic environment. This paper examines an adaptation scheme requiring very few parameters, cluster adaptive training (CAT). CAT may be viewed as a simple extension to speaker clustering. Rather than selecting a single cluster as representative of a particular speaker, a linear interpolation of all the cluster means is used as the mean of the particular speaker. This scheme naturally falls into an adaptive training framework. Maximum likelihood estimates of the interpolation weights are given. Furthermore, simple re-estimation formulae for cluster means, represented both explicitly and by sets of transforms of some canonical mean, are given. On a speaker-independent task CAT reduced the word error rate using very little adaptation data. In addition when combined with other adaptation schemes it gave a 5% reduction in word error rate over adapting a speaker-independent model set.
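The interpolation idea in the abstract above can be sketched in a few lines. This is only an illustration of the contrast between picking one cluster and interpolating all of them. The function name `cat_speaker_mean`, the cluster means and the weights are hypothetical. The abstract does not say how the weights are constrained, and the maximum likelihood weight estimation is not shown.

```python
import numpy as np

def cat_speaker_mean(cluster_means, weights):
    """Interpolate P cluster means, shape (P, d), with P speaker weights."""
    return weights @ cluster_means

# Hypothetical means of one Gaussian component in three speaker clusters.
cluster_means = np.array([[0.0, 1.0],
                          [2.0, 3.0],
                          [4.0, -1.0]])

# Speaker clustering: a one-hot weight vector picks a single cluster.
hard_weights = np.array([0.0, 1.0, 0.0])
hard_mean = cat_speaker_mean(cluster_means, hard_weights)

# CAT-style interpolation: every cluster contributes to the speaker mean.
soft_weights = np.array([0.2, 0.5, 0.3])
soft_mean = cat_speaker_mean(cluster_means, soft_weights)
```

In this sketch a speaker is described by one weight per cluster (three numbers here), which is consistent with the abstract's point that the scheme needs very few parameters.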
Linear Gaussian models for speech recognition
- CAMBRIDGE UNIVERSITY
, 2004
Cited by 10 (0 self)
Currently the most popular acoustic model for speech recognition is the hidden Markov model (HMM). However, HMMs are based on a series of assumptions some of which are known to be poor. In particular, the assumption that successive speech frames are conditionally independent given the discrete state that generated them is not a good assumption for speech recognition. State space models may be used to address some shortcomings of this assumption. State space models are based on a continuous state vector evolving through time according to a state evo-
Unsupervised discriminative adaptation using discriminative mapping transforms
- IN PROC. ICASSP, LAS VEGAS, NV
, 2008
Cited by 7 (5 self)
The most commonly used approaches to speaker adaptation are based on linear transforms, as these can be robustly estimated using limited adaptation data. Although significant gains can be obtained using discriminative criteria for training acoustic models, maximum likelihood (ML) estimated transforms are used for unsupervised adaptation. This is because discriminatively trained transforms are highly sensitive to errors in the adaptation hypothesis. This paper describes a new framework for estimating transforms that are discriminative in nature, but are less sensitive to this hypothesis issue. A discriminative, speaker-independent, mapping transformation is estimated during training. This transform is obtained after a speaker-specific ML-estimated transform has been applied. During recognition an ML speaker-specific transform is found and the speaker-independent discriminative mapping transform then applied. This allows a transform which is discriminative in nature to be indirectly estimated, whilst only requiring an ML speaker-specific transform to be found during recognition. The scheme is evaluated on an English conversational telephone speech task, where it significantly outperforms both standard ML and discriminatively trained transforms.
Use of Speech Recognition in Computer-assisted Language Learning
, 1999
Cited by 7 (0 self)
inear Model Combination and Model Merging. These algorithms are based on the assumption that the mother-tongue of a non-native speaker is known. The basic idea underlying most findings of this thesis is that non-native speech can be modeled with a mixture of sounds of a speaker's native language and the target language. The newly developed speaker adaptation algorithms combine the acoustic models of the source and target language of a non-native speaker. The algorithms only differ with regard to the details of how the model sets are combined. A database of non-native English was recorded for the purpose of testing these adaptation algorithms. This database mostly consists of utterances of Japanese and Latin-American Spanish accented English. The recordings were transcribed by trained phoneticians to obtain transcriptions corresponding to the actual phoneme sequence uttered by the student as opposed to canonical transcriptions obtained from a standard
A MAP-Like Weighting Scheme for MLLR Speaker Adaptation
, 1999
Cited by 6 (2 self)
This paper presents an approach for fast, unsupervised, online MLLR speaker adaptation using two MAP-like weighting schemes, a static and a dynamic one. While for the standard MLLR approach several sentences are necessary before a reliable estimation of the transformations is possible, the weighted approach shows good results even if adaptation is conducted after only a few short utterances. Experimental results show that using the static approach can improve the word error rate by approx. 27% if adaptation is conducted after every 4 utterances (single words or short phrases). Using the dynamic approach, results can be improved by 28%. The most important advantage of the dynamic weight is that it is rather insensitive with respect to the initial weight whereas for the static approach it is very critical which initial weight to choose. Moreover, useful values for the weights in the static case depend very much on the corpus. If the standard MLLR approach is used, even a drastic increase in sentence error rate can be observed for these small amounts of adaptation data.
Adaptive Training for Large Vocabulary Continuous Speech Recognition
, 2006
Cited by 6 (2 self)
In recent years, there has been a trend towards training large vocabulary continuous speech recognition (LVCSR) systems on a large amount of found data. Found data is recorded from spontaneous speech without careful control of the recording acoustic conditions, for example, conversational telephone speech. Hence, it typically has greater variability in terms of speaker and acoustic conditions than specially collected data. Thus, in addition to the desired speech variability required to discriminate between words, it also includes various non-speech variabilities, for example, the change of speakers or acoustic environments. The standard approach to handle this type of data is to train hidden Markov models (HMMs) on the whole data set as if all data comes from a single acoustic condition. This is referred to as multi-style training, for example speaker-independent training. Effectively, the non-speech variabilities are ignored. Though good performance has been obtained with multi-style systems, these systems account for all variabilities. Improvement may be obtained if the two types of variabilities in the found data are modelled separately. Adaptive training has been proposed for this purpose. In contrast to multi-style training, a set of transforms is used to represent the non-speech variabilities. A canonical
Inter-Class MLLR for Speaker Adaptation
- Proc. of IEEE International Conference on Acoustics, Speech, and Signal Processing
, 2000
Cited by 5 (2 self)
This paper examines the use of interdependencies of parameter classes in transformation-based speaker adaptation algorithms such as maximum likelihood linear regression (MLLR). In transformation-based adaptation, increasing the number of transformation classes can provide more detailed information for adaptation, but at the expense of greater estimation error with small amounts of data. In this paper we introduce a new procedure, inter-class MLLR, which utilizes the relationship between different classes to achieve both detailed and reliable transformation-based adaptation using limited data. In this method, the inter-class relation is given by a linear regression which is estimated from training data. In experiments using non-native English speakers from the Spoke 3 data in the 1994 DARPA Wall Street Journal evaluation, inter-class MLLR provided a relative reduction in word error rates of 11.3% compared to conventional MLLR.

