Results 1–10 of 53
Maximum Likelihood Linear Transformations for HMM-Based Speech Recognition
Computer Speech and Language, 1998
"... This paper examines the application of linear transformations for speaker and environmental adaptation in an HMMbased speech recognition system. In particular, transformations that are trained in a maximum likelihood sense on adaptation data are investigated. Other than in the form of a simple bias ..."
Abstract

Cited by 408 (56 self)
 Add to MetaCart
This paper examines the application of linear transformations for speaker and environmental adaptation in an HMM-based speech recognition system. In particular, transformations that are trained in a maximum likelihood sense on adaptation data are investigated. Other than in the form of a simple bias, strict linear feature-space transformations are inappropriate in this case. Hence, only model-based linear transforms are considered. The paper compares the two possible forms of model-based transforms: (i) unconstrained, where any combination of mean and variance transform may be used, and (ii) constrained, which requires the variance transform to have the same form as the mean transform (sometimes referred to as feature-space transforms). Re-estimation formulae for all appropriate cases of transform are given. This includes a new and efficient "full" variance transform and the extension of the constrained model-space transform from the simple diagonal case to the full or block-diagonal case. The constrained and unconstrained transforms are evaluated in terms of computational cost, recognition time efficiency, and use for speaker adaptive training. The recognition performance of the two model-space transforms on a large vocabulary speech recognition task using incremental adaptation is investigated. In addition, initial experiments using the constrained model-space transform for speaker adaptive training are detailed. (The author is now at the IBM T.J. Watson Research Center, Yorktown Heights, NY 10598, USA.)
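As a rough sketch of the two transform families the abstract contrasts (the notation below is assumed for illustration, not quoted from the paper): an unconstrained model-space transform adapts a Gaussian's mean and covariance with separate transforms, while the constrained form ties them, which makes it equivalent to a transform of the feature space:

```latex
% Unconstrained model-space transform: independent mean and variance transforms
\hat{\mu} = A\mu + b, \qquad \hat{\Sigma} = H \Sigma H^{\top}
% Constrained model-space transform: one matrix acts on both
\hat{\mu} = A\mu + b, \qquad \hat{\Sigma} = A \Sigma A^{\top}
% which is equivalent to transforming the observations (feature-space view):
\hat{o}(t) = A^{-1}\bigl(o(t) - b\bigr)
```

The feature-space equivalence of the constrained form is what makes it attractive for speaker adaptive training, since the models themselves can be left untouched at recognition time.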
Semi-Tied Covariance Matrices for Hidden Markov Models
IEEE Transactions on Speech and Audio Processing, 1999
"... There is normally a simple choice made in the form of the covariance matrix to be used with continuousdensity HMMs. Either a diagonal covariance matrix is used, with the underlying assumption that elements of the feature vector are independent, or a full or blockdiagonal matrix is used, where all ..."
Abstract

Cited by 181 (27 self)
 Add to MetaCart
There is normally a simple choice made in the form of the covariance matrix to be used with continuous-density HMMs. Either a diagonal covariance matrix is used, with the underlying assumption that elements of the feature vector are independent, or a full or block-diagonal matrix is used, where all or some of the correlations are explicitly modelled. Unfortunately, when using full or block-diagonal covariance matrices there tends to be a dramatic increase in the number of parameters per Gaussian component, limiting the number of components which may be robustly estimated. This paper introduces a new form of covariance matrix which allows a few "full" covariance matrices to be shared over many distributions, whilst each distribution maintains its own "diagonal" covariance matrix. In contrast to other schemes which have hypothesised a similar form, this technique fits within the standard maximum-likelihood criterion used for training HMMs. The new form of covariance matrix is evaluated on a large-vocabulary speech recognition task. In initial experiments the performance of the standard system was achieved using approximately half the number of parameters. Moreover, a 10% reduction in word error rate compared to a standard system can be achieved with less than a 1% increase in the number of parameters and little increase in recognition time.
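The sharing scheme described in the abstract can be written compactly (symbols assumed here for illustration): each Gaussian component m keeps its own diagonal covariance, while a non-diagonal "semi-tied" transform H is shared across all components in a class r:

```latex
% Effective covariance of component m belonging to tying class r:
\Sigma_m \;=\; H^{(r)} \, \Sigma_m^{\mathrm{diag}} \, H^{(r)\top}
```

Since one H^{(r)} is shared over many components, the parameter cost of modelling correlations is amortised, which is how the abstract's "half the parameters for the same performance" result becomes possible.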
Mean and Variance Adaptation within the MLLR Framework
Computer Speech & Language, 1996
"... One of the key issues for adaptation algorithms is to modify a large number of parameters with only a small amount of adaptation data. Speaker adaptation techniques try to obtain near speaker dependent (SD) performance with only small amounts of speaker specific data, and are often based on initi ..."
Abstract

Cited by 109 (15 self)
 Add to MetaCart
One of the key issues for adaptation algorithms is to modify a large number of parameters with only a small amount of adaptation data. Speaker adaptation techniques try to obtain near speaker dependent (SD) performance with only small amounts of speaker-specific data, and are often based on initial speaker independent (SI) recognition systems. Some of these speaker adaptation techniques may also be applied to the task of adaptation to a new acoustic environment. In this case an SI recognition system trained in, typically, a clean acoustic environment is adapted to operate in a new, noise-corrupted acoustic environment. This paper examines the Maximum Likelihood Linear Regression (MLLR) adaptation technique. MLLR estimates linear transformations for groups of model parameters to maximise the likelihood of the adaptation data. Previously, MLLR has been applied to the mean parameters in mixture Gaussian HMM systems. In this paper MLLR is extended to also update the Gaussian variances, and re-estimation formulae are derived for these variance transforms. MLLR with variance compensation is evaluated on several large vocabulary recognition tasks. The use of mean and variance MLLR adaptation was found to give an additional 2% to 7% decrease in word error rate over mean-only MLLR adaptation.
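One common way to write the variance transform this abstract refers to (notation assumed, not quoted from the paper): with C_m a Choleski factor of the original covariance, a shared linear transform H is estimated in a maximum likelihood fashion and applied inside the factorisation, so the adapted covariance stays positive definite:

```latex
% Original covariance factorised as
\Sigma_m = C_m C_m^{\top}
% Adapted covariance, with H shared across a regression class:
\hat{\Sigma}_m = C_m \, H \, C_m^{\top}
```

The mean transform is the usual MLLR form, \(\hat{\mu}_m = A\mu_m + b\), and the abstract's 2% to 7% gain comes from applying both together rather than the mean transform alone.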
Ney: Confidence Measures for Large Vocabulary Continuous Speech Recognition
IEEE Trans. on Speech and Audio Processing, 2001
"... Abstract—In this paper, we present several confidence measures for large vocabulary continuous speech recognition. We propose to estimate the confidence of a hypothesized word directly as its posterior probability, given all acoustic observations of the utterance. These probabilities are computed on ..."
Abstract

Cited by 102 (14 self)
 Add to MetaCart
In this paper, we present several confidence measures for large vocabulary continuous speech recognition. We propose to estimate the confidence of a hypothesized word directly as its posterior probability, given all acoustic observations of the utterance. These probabilities are computed on word graphs using a forward–backward algorithm. We also study the estimation of posterior probabilities on N-best lists instead of word graphs and compare both algorithms in detail. In addition, we compare the posterior probabilities with two alternative confidence measures, i.e., the acoustic stability and the hypothesis density. We present experimental results on five different corpora: the Dutch ARISE 1k evaluation corpus, the German Verbmobil '98 7k evaluation corpus, the English North American Business '94 20k and 64k development corpora, and the English Broadcast News '96 65k evaluation corpus. We show that the posterior probabilities computed on word graphs outperform all other confidence measures. The relative reduction in confidence error rate ranges between 19% and 35% compared to the baseline confidence error rate. Index Terms—Confidence measures, forward–backward algorithm, N-best lists, posterior probabilities, speech recognition, word graphs.
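The word-posterior idea can be sketched as follows (notation assumed for illustration): the confidence of a word hypothesis w with start time s and end time e is the total probability of all word-graph paths passing through that hypothesis, normalised by the probability of all paths:

```latex
p\bigl([w; s, e] \,\big|\, x_1^T\bigr)
  \;=\; \frac{1}{p(x_1^T)} \sum_{\substack{W \text{ through} \\ [w;\,s,\,e]}} p\bigl(x_1^T, W\bigr)
```

Summing over all paths directly is intractable, which is why the paper computes these sums with forward (\(\alpha\)) and backward (\(\beta\)) scores over the word graph, exactly as in the forward–backward algorithm for HMMs.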
Analysis and Synthesis of Intonation using the Tilt Model
 Journal of the Acoustical Society of America
"... This paper introduces the tilt intonational model and describes how this model can be used to automatically analyse and synthesize intonation. In the model, intonation is represented as a linear sequence of events, which can be pitch accents or boundary tones. Each event is characterised by continuo ..."
Abstract

Cited by 93 (3 self)
 Add to MetaCart
This paper introduces the Tilt intonational model and describes how this model can be used to automatically analyse and synthesize intonation. In the model, intonation is represented as a linear sequence of events, which can be pitch accents or boundary tones. Each event is characterised by continuous parameters representing amplitude, duration and tilt (a measure of the shape of the event). The paper describes an event detector, in effect an intonational recognition system, which produces a transcription of an utterance's intonation. The features and parameters of the event detector are discussed and performance figures are shown on a variety of read and spontaneous speaker-independent conversational speech databases. Given the event locations, algorithms are described which produce an automatic analysis of each event in terms of the Tilt parameters. Synthesis algorithms are also presented which generate F0 contours from Tilt representations. The accuracy of these is shown by comparing...
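The tilt parameter mentioned in the abstract is commonly written as a combination of the rise and fall portions of an event; the form below is the widely cited definition and the symbols are assumptions for illustration, not quoted from this paper:

```latex
% A_r, A_f: rise and fall amplitudes; D_r, D_f: rise and fall durations
\mathrm{tilt} \;=\; \frac{|A_r| - |A_f|}{2\,(|A_r| + |A_f|)}
              \;+\; \frac{D_r - D_f}{2\,(D_r + D_f)}
```

A value near +1 describes a pure rise, near -1 a pure fall, and values in between describe rise–fall shapes, which is what lets a single continuous parameter stand in for the event's shape.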
The Generation and Use of Regression Class Trees for MLLR Adaptation, 1996
"... Maximum likelihood linear regression (MLLR) is an adaptation technique suitable for both speaker and environmental modelbased adaptation. The models are adapted using a set of linear transformations, estimated in a maximum likelihood fashion from the available adaptation data. As these transformati ..."
Abstract

Cited by 62 (8 self)
 Add to MetaCart
Maximum likelihood linear regression (MLLR) is an adaptation technique suitable for both speaker and environmental modelbased adaptation. The models are adapted using a set of linear transformations, estimated in a maximum likelihood fashion from the available adaptation data. As these transformations can capture general relationships between the original model set and the current speaker, or new acoustic environment, they can be effective in adapting all the HMM distributions with limited adaptation data. Two important decisions that must be made are (i) how to cluster components together, such that they all have a similar transformation matrix, and (ii) how many transformation matrices to generate for a given block of adaptation data. This paper addresses both problems. Firstly it describes two optimal clustering techniques, in the sense of maximising the likelihood of the adaptation data. The first assigns each component to one of the regression classes. This may be used to generat...
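The second problem the abstract raises, how many transforms to generate for a given block of adaptation data, is typically answered by descending the regression class tree only while there is enough data below. The following is a hedged Python sketch of one such policy, not the paper's exact procedure; the node names, occupancy counts, and threshold are invented for illustration:

```python
# Sketch: pick which regression-tree nodes get their own MLLR transform.
# A node keeps a single shared transform for its whole subtree unless every
# child has enough adaptation-data occupancy to support its own transform.

def select_transform_nodes(node, threshold):
    """Return names of tree nodes that receive their own transform."""
    children = node.get("children", [])
    # Stop descending at leaves, or when any child lacks sufficient data.
    if not children or any(c["occupancy"] < threshold for c in children):
        return [node["name"]]  # one transform shared by this subtree
    result = []
    for c in children:
        result.extend(select_transform_nodes(c, threshold))
    return result

# Illustrative tree: occupancies are frame counts from the adaptation data.
tree = {
    "name": "global", "occupancy": 1000.0,
    "children": [
        {"name": "speech", "occupancy": 900.0,
         "children": [
             {"name": "vowels", "occupancy": 600.0, "children": []},
             {"name": "consonants", "occupancy": 300.0, "children": []},
         ]},
        {"name": "silence", "occupancy": 100.0, "children": []},
    ],
}

# Little data for silence: everything backs off to one global transform.
print(select_transform_nodes(tree, threshold=200.0))  # → ['global']
# More generous threshold: three class-specific transforms are generated.
print(select_transform_nodes(tree, threshold=50.0))
```

With more adaptation data the same tree yields more, increasingly specific transforms, which is exactly the data-driven trade-off the abstract describes.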
MMIE training of large vocabulary recognition systems, 1997
"... This paper describes a framework for optimising the structure and parameters of a continuous density HMMbased large Z. vocabulary recognition system using the Maximum Mutual Information Estimation MMIE criterion. To reduce the computational complexity of the MMIE training algorithm, confusable seg ..."
Abstract

Cited by 49 (3 self)
 Add to MetaCart
This paper describes a framework for optimising the structure and parameters of a continuous-density HMM-based large vocabulary recognition system using the Maximum Mutual Information Estimation (MMIE) criterion. To reduce the computational complexity of the MMIE training algorithm, confusable segments of speech are identified and stored as word lattices of alternative utterance hypotheses. An iterative mixture-splitting procedure is also employed to adjust the number of mixture components in each state during training, such that the optimal balance between the number of parameters and the available training data is achieved. Experiments are presented on various test sets from the Wall Street Journal database using up to 66 hours of acoustic training data. These demonstrate that the use of lattices makes MMIE training practicable for very complex recognition systems and large training sets. Furthermore, the experimental results show that MMIE optimisation of system structure and param...
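For reference, the MMIE criterion maximises the mutual information between the training transcriptions and the acoustics; it is usually written as below (notation assumed for illustration). The denominator sum over all competing word sequences is what the paper approximates with word lattices:

```latex
% R training utterances, observations O_r, reference transcription w_r,
% composite model M_w for word sequence w, language model prior P(w):
\mathcal{F}_{\mathrm{MMIE}}(\lambda)
  \;=\; \sum_{r=1}^{R} \log
        \frac{p_{\lambda}(\mathcal{O}_r \mid \mathcal{M}_{w_r})\, P(w_r)}
             {\sum_{\hat{w}} p_{\lambda}(\mathcal{O}_r \mid \mathcal{M}_{\hat{w}})\, P(\hat{w})}
```

Unlike maximum likelihood training, the denominator makes every update depend on the competing hypotheses, which is why restricting it to lattice paths is such a large computational saving.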
Error-responsive feedback mechanisms for speech recognizers, 1997
"... This thesis is about modeling, analyzing, and predicting errorful behavior in large vocabulary continuous speech recognition systems. Because today's stateoftheart recognizers are not designed to be situated naturally in an error feedback loop, they are illpositioned for inclusion in multimodal ..."
Abstract

Cited by 47 (4 self)
 Add to MetaCart
This thesis is about modeling, analyzing, and predicting errorful behavior in large vocabulary continuous speech recognition systems. Because today's state-of-the-art recognizers are not designed to be situated naturally in an error feedback loop, they are ill-positioned for inclusion in multimodal interfaces, multimedia databases, and other interesting applications. I make improvements to the current approach to predicting and analyzing error behaviors, which is currently based only on the measurement of word error rate. The speech recognizer's functionality is extended to include confidence annotations, which are "meta-level" markings that indicate how certain the recognizer is that it has decoded its input correctly. This is accomplished by feeding externally defined error conditions back to the recognizer. Error feedback enables the construction of statistical models that map measurements of the recognizer's internal states and behaviors to externally defined error conditions.
Variance Compensation within the MLLR Framework for Robust Speech Recognition and Speaker Adaptation, 1996
"... This paper investigates the use of maximum likelihood linear regression (MLLR) for both speaker and environment adaptation. MLLR transforms the mean and variance parameters of a set of HMMs. In this paper a number of different types of linear transformations of the variances are examined including f ..."
Abstract

Cited by 44 (8 self)
 Add to MetaCart
This paper investigates the use of maximum likelihood linear regression (MLLR) for both speaker and environment adaptation. MLLR transforms the mean and variance parameters of a set of HMMs. In this paper a number of different types of linear transformations of the variances are examined, including full, block-diagonal, and diagonal transformation matrices. Experiments on large vocabulary speaker independent data sets are described. On all the data sets examined, the use of MLLR mean and variance compensation reduced the error rate compared to mean-only compensation. Furthermore, the use of a block-diagonal or full transformation of the variances on the clean data task showed slight improvements over the diagonal case. However, when some environmental mismatch was present there was no difference in performance between using multiple diagonal variance transformations and a more complex single variance transform.