Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models (1995)

by C J Leggetter, P C Woodland
Venue: Computer Speech and Language
Results 1 - 10 of 819

Maximum Likelihood Linear Transformations for HMM-Based Speech Recognition

by M.J.F. Gales - COMPUTER SPEECH AND LANGUAGE, 1998
Abstract - Cited by 570 (68 self)
This paper examines the application of linear transformations for speaker and environmental adaptation in an HMM-based speech recognition system. In particular, transformations that are trained in a maximum likelihood sense on adaptation data are investigated. Other than in the form of a simple bias, strict linear feature-space transformations are inappropriate in this case. Hence, only model-based linear transforms are considered. The paper compares the two possible forms of model-based transforms: (i) unconstrained, where any combination of mean and variance transform may be used, and (ii) constrained, which requires the variance transform to have the same form as the mean transform (sometimes referred to as feature-space transforms). Re-estimation formulae for all appropriate cases of transform are given. This includes a new and efficient "full" variance transform and the extension of the constrained model-space transform from the simple diagonal case to the full or block-diagonal case. The constrained and unconstrained transforms are evaluated in terms of computational cost, recognition time efficiency, and use for speaker adaptive training. The recognition performance of the two model-space transforms on a large vocabulary speech recognition task using incremental adaptation is investigated. In addition, initial experiments using the constrained model-space transform for speaker adaptive training are detailed.

Citation Context

...on Research Center, Yorktown Heights, NY 10598, USA 1 Introduction In recent years there has been a vast amount of work done on estimating and applying linear transformations to HMM-based recognisers [2, 4, 13, 17]. Though not the only possible model adaptation scheme, for example maximum a-posteriori adaptation [10] may be used, linear transforms have been shown to be a powerful tool for both speaker and envir...
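As a concrete illustration of the model-based mean transform this paper builds on, MLLR replaces each Gaussian mean µ with W ξ, where ξ = [1, µᵀ]ᵀ is the extended mean vector and W = [b A] collects a bias and a square matrix shared by all Gaussians in a regression class. A minimal plain-Python sketch; the transform values below are invented for illustration, not estimated from adaptation data:

```python
def mllr_adapt_mean(W, mu):
    """Apply an n x (n+1) MLLR transform W to a length-n mean vector mu.

    The adapted mean is W @ [1, mu]: column 0 of W is the bias b,
    and the remaining columns form the square matrix A.
    """
    xi = [1.0] + list(mu)  # extended mean vector [1, mu]
    return [sum(w * x for w, x in zip(row, xi)) for row in W]

# Example: A = identity, bias b = (0.5, -0.5) -- illustrative values only.
W = [[0.5, 1.0, 0.0],
     [-0.5, 0.0, 1.0]]
print(mllr_adapt_mean(W, [2.0, 3.0]))  # prints [2.5, 2.5]
```

In practice W is estimated in a maximum likelihood sense from the adaptation data via the re-estimation formulae the paper derives; the sketch only shows how an estimated transform is applied.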

Semi-tied covariance matrices for hidden Markov models

by M J F Gales - IEEE Trans. Speech and Audio Processing, 1999
Abstract - Cited by 262 (35 self)
Abstract not found

Citation Context

.... Thus, when using equation 22, the inverse of G(ri) may not exist. There are two solutions to this problem, similar to those used to ensure robustness in maximum likelihood linear regression (MLLR) [15]. The first is to use block-diagonal transformations, thus dramatically reducing the chance of non-full-rank matrices. Furthermore it decreases both the computational load (it is cheaper to invert three ...

Statistical parametric speech synthesis

by Alan W Black, Heiga Zen, Keiichi Tokuda - in Proc. ICASSP, 2007
Abstract - Cited by 179 (18 self)
This paper gives a general overview of techniques in statistical parametric speech synthesis. One of the instances of these techniques, called HMM-based generation synthesis (or simply HMM-based synthesis), has recently been shown to be very effective in generating acceptable speech synthesis. This paper also contrasts these techniques with the more conventional unit selection technology that has dominated speech synthesis over the last ten years. Advantages and disadvantages of statistical parametric synthesis are highlighted as well as identifying where we expect the key developments to appear in the immediate future. Index Terms — Speech synthesis, hidden Markov models 1. BACKGROUND With the increase in power and resources of computer technology, building natural sounding synthetic voices has progressed from a

Recent advances in the automatic recognition of audiovisual speech

by Gerasimos Potamianos, Chalapathy Neti, Guillaume Gravier, Ashutosh Garg, Andrew W. Senior - Proceedings of the IEEE
Abstract - Cited by 172 (16 self)
Abstract — Visual speech information from the speaker’s mouth region has been successfully shown to improve noise robustness of automatic speech recognizers, thus promising to extend their usability into the human computer interface. In this paper, we review the main components of audio-visual automatic speech recognition and present novel contributions in two main areas: First, the visual front end design, based on a cascade of linear image transforms of an appropriate video region-of-interest, and subsequently, audio-visual speech integration. On the later topic, we discuss new work on feature and decision fusion combination, the modeling of audio-visual speech asynchrony, and incorporating modality reliability estimates to the bimodal recognition process. We also briefly touch upon the issue of audio-visual speaker adaptation. We apply our algorithms to three multi-subject bimodal databases, ranging from small- to large-vocabulary recognition tasks, recorded at both visually controlled and challenging environments. Our experiments demonstrate that the visual modality improves automatic speech recognition over all conditions and data considered, however less so for visually challenging environments and large vocabulary tasks. Index Terms — Audio-visual speech recognition, speechreading, visual feature extraction, audio-visual fusion, hidden Markov model, multi-stream HMM, product HMM, reliability estimation, adaptation, audio-visual databases. I.

Citation Context

... the parameters of an HMM that has been originally trained in a speaker-independent fashion and/or under different conditions. Two popular such methods are maximum likelihood linear regression (MLLR) [111] and maximum a posteriori (MAP) adaptation [110]. Other adaptation techniques proceed to transform the extracted features instead, so that they are better modeled by the available HMMs with no ret...

An overview of text-independent speaker recognition: from features to supervectors

by Tomi Kinnunen, Haizhou Li, 2009
Abstract - Cited by 156 (37 self)
This paper gives an overview of automatic speaker recognition technology, with an emphasis on text-independent recognition. Speaker recognition has been studied actively for several decades. We give an overview of both the classical and the state-of-the-art methods. We start with the fundamentals of automatic speaker recognition, concerning feature extraction and speaker modeling. We elaborate advanced computational techniques to address robustness and session variability. The recent progress from vectors towards supervectors opens up a new area of exploration and represents a technology trend. We also provide an overview of this recent development and discuss the evaluation methodology of speaker recognition systems. We conclude the paper with discussion on future directions.

The Kaldi speech recognition toolkit

by Daniel Povey, Arnab Ghoshal, Gilles Boulianne, Lukáš Burget, Ondřej Glembek, Nagendra Goel, Mirko Hannemann, Petr Motlíček, Yanmin Qian, Petr Schwarz, Jan Silovský, Georg Stemmer, Karel Veselý - Proc. ASRU, 2011
Abstract - Cited by 147 (16 self)
Abstract-We describe the design of Kaldi, a free, open-source toolkit for speech recognition research. Kaldi provides a speech recognition system based on finite-state transducers (using the freely available OpenFst), together with detailed documentation and scripts for building complete recognition systems. Kaldi is written in C++, and the core library supports modeling of arbitrary phonetic-context sizes, acoustic modeling with subspace Gaussian mixture models (SGMM) as well as standard Gaussian mixture models, together with all commonly used linear and affine transforms. Kaldi is released under the Apache License v2.0, which is highly nonrestrictive, making it suitable for a wide community of users.

Citation Context

...states, and allows the user to prespecify tying of the p.d.f.’s in different HMM states. D. Speaker adaptation We support both model-space adaptation using maximum likelihood linear regression (MLLR) [8] and feature-space adaptation using feature-space MLLR (fMLLR), also known as constrained MLLR [9]. For both MLLR and fMLLR, multiple transforms can be estimated using a regression tree [10]. When a s...
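The feature-space variant mentioned here (fMLLR, or constrained MLLR) can be applied directly to the incoming feature vectors rather than to the model, since forcing the mean and variance to share one transform is equivalent to an affine map of the features. A hypothetical sketch in plain Python (this is not Kaldi code, and the transform values are invented for illustration):

```python
def fmllr_transform(A, b, x):
    """Constrained MLLR applied in feature space: x_hat = A @ x + b.

    Because the mean and variance transforms are constrained to be the
    same, adapting the model is equivalent to this affine map applied
    to every feature vector x, leaving the acoustic model untouched.
    """
    return [sum(a * xj for a, xj in zip(row, x)) + bi
            for row, bi in zip(A, b)]

# Illustrative 2-D transform: scale by 2, shift by (1, -1).
A = [[2.0, 0.0],
     [0.0, 2.0]]
b = [1.0, -1.0]
print(fmllr_transform(A, b, [0.5, 1.5]))  # prints [2.0, 2.0]
```

This feature-side formulation is what makes fMLLR attractive for decoding and for speaker adaptive training: the transform can be composed into the front end once per speaker instead of rewriting every Gaussian.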

Mean and Variance Adaptation within the MLLR Framework

by M.J.F. Gales, P.C. Woodland - Computer Speech & Language, 1996
Abstract - Cited by 145 (15 self)
One of the key issues for adaptation algorithms is to modify a large number of parameters with only a small amount of adaptation data. Speaker adaptation techniques try to obtain near speaker dependent (SD) performance with only small amounts of speaker specific data, and are often based on initial speaker independent (SI) recognition systems. Some of these speaker adaptation techniques may also be applied to the task of adaptation to a new acoustic environment. In this case a SI recognition system trained in, typically, a clean acoustic environment is adapted to operate in a new, noise-corrupted, acoustic environment. This paper examines the Maximum Likelihood Linear Regression (MLLR) adaptation technique. MLLR estimates linear transformations for groups of model parameters to maximise the likelihood of the adaptation data. Previously, MLLR has been applied to the mean parameters in mixture Gaussian HMM systems. In this paper MLLR is extended to also update the Gaussian variances and re-estimation formulae are derived for these variance transforms. MLLR with variance compensation is evaluated on several large vocabulary recognition tasks. The use of mean and variance MLLR adaptation was found to give an additional 2% to 7% decrease in word error rate over mean-only MLLR adaptation.
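In the simplest (diagonal) case, the variance extension described in this abstract scales each Gaussian variance by a learned factor alongside the usual mean transform. A toy sketch under that diagonal assumption; the transform values are invented for illustration, not the paper's estimation formulae:

```python
def mllr_adapt_diag(W, H_diag, mu, var):
    """Jointly adapt a diagonal-covariance Gaussian:
    mean    -> W @ [1, mu]   (MLLR mean transform, W is n x (n+1))
    var_i   -> h_i * var_i   (diagonal variance transform H)
    """
    xi = [1.0] + list(mu)
    new_mu = [sum(w * x for w, x in zip(row, xi)) for row in W]
    new_var = [h * v for h, v in zip(H_diag, var)]
    return new_mu, new_var

W = [[0.0, 1.0, 0.0],   # identity mean transform, zero bias -- illustrative
     [0.0, 0.0, 1.0]]
mu, var = mllr_adapt_diag(W, [1.5, 0.5], [1.0, 2.0], [4.0, 4.0])
print(mu, var)  # prints [1.0, 2.0] [6.0, 2.0]
```

The point of the sketch is the structure of the update: both transforms are tied across all Gaussians in a regression class, so a handful of parameters (W and H) adapts the whole model from a small amount of speaker-specific data.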

The LIMSI Broadcast News Transcription System

by Jean-luc Gauvain, Lori Lamel, Gilles Adda - Speech Communication, 2002
Abstract - Cited by 131 (12 self)
This paper reports on activities at LIMSI over the last few years directed at the transcription of broadcast news data. We describe our development work in moving from laboratory read speech data to real-world or `found' speech data in preparation for the ARPA Nov96, Nov97 and Nov98 evaluations. Two main problems needed to be addressed to deal with the continuous flow of inhomogeneous data. These concern the varied acoustic nature of the signal (signal quality, environmental and transmission noise, music) and different linguistic styles (prepared and spontaneous speech on a wide range of topics, spoken by a large variety of speakers).

Citation Context

...ta using single Gaussian state models, penalized by the number of tied-states. Unsupervised acoustic model adaptation (both means and variances) is performed for each cluster using the MLLR technique [32] after each decoding pass. The mean vectors are adapted using a single block-diagonal regression matrix (where a block is used for each parameter stream, i.e. cepstrum, delta-cepstrum and delta-delta ...

Audio-visual automatic speech recognition: An overview.

by G Potamianos, C Neti, J Luettin, I Matthews - In Issues in Visual and Audio-Visual Speech Processing, 2004
Abstract - Cited by 110 (0 self)
Abstract not found

Citation Context

...aptation is traditionally used in practical audio-only ASR systems to improve speaker-independent system performance, when little data from a speaker of interest are available (Gauvain and Lee, 1994; Leggetter and Woodland, 1995; Neumeyer et al., 1995; Anastasakos et al., 1997; Gales, 1999). Adaptation is also of interest across tasks or environments. In the audio-visual ASR domain, adaptation is of great importance, since a...

Video-Based Face Recognition Using Adaptive Hidden Markov Models

by Xiaoming Liu, Tsuhan Chen, 2003
Abstract - Cited by 106 (3 self)
While traditional face recognition is typically based on still images, face recognition from video sequences has become popular recently. In this paper, we propose to use adaptive Hidden Markov Models (HMM) to perform videobased face recognition. During the training process, the statistics of training video sequences of each subject, and the temporal dynamics, are learned by an HMM. During the recognition process, the temporal characteristics of the test video sequence are analyzed over time by the HMM corresponding to each subject. The likelihood scores provided by the HMMs are compared, and the highest score provides the identity of the test video sequence. Furthermore, with unsupervised learning, each HMM is adapted with the test video sequence, which results in better modeling over time. Based on extensive experiments with various databases, we show that the proposed algorithm provides better performance than using majority voting of image-based recognition results.

Citation Context

...imes worse than that of speaker-dependent systems. However, the latter requires large amounts of training data from the designated speaker. To address this issue, the concept of speaker adaptation [16][17] has been introduced, where a small amount of data from the specific speaker are used to modify the speaker-independent system and improve its performance. Similarly, in the vision community, Liu et al...


Developed at and hosted by The College of Information Sciences and Technology

© 2007-2019 The Pennsylvania State University