Maximum Likelihood Linear Transformations for HMM-Based Speech Recognition
- Computer Speech and Language, 1998
"... This paper examines the application of linear transformations for speaker and environmental adaptation in an HMM-based speech recognition system. In particular, transformations that are trained in a maximum likelihood sense on adaptation data are investigated. Other than in the form of a simple bias ..."
Abstract
-
Cited by 570 (68 self)
- Add to MetaCart
(Show Context)
This paper examines the application of linear transformations for speaker and environmental adaptation in an HMM-based speech recognition system. In particular, transformations that are trained in a maximum likelihood sense on adaptation data are investigated. Other than in the form of a simple bias, strict linear feature-space transformations are inappropriate in this case. Hence, only model-based linear transforms are considered. The paper compares the two possible forms of model-based transforms: (i) unconstrained, where any combination of mean and variance transform may be used, and (ii) constrained, which requires the variance transform to have the same form as the mean transform (sometimes referred to as feature-space transforms). Re-estimation formulae for all appropriate cases of transform are given. This includes a new and efficient "full" variance transform and the extension of the constrained model-space transform from the simple diagonal case to the full or block-diagonal case. The constrained and unconstrained transforms are evaluated in terms of computational cost, recognition time efficiency, and use for speaker adaptive training. The recognition performance of the two model-space transforms on a large vocabulary speech recognition task using incremental adaptation is investigated. In addition, initial experiments using the constrained model-space transform for speaker adaptive training are detailed.
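The two transform families this abstract compares can be sketched numerically. Below is a minimal NumPy illustration with toy dimensions and randomly generated parameters (not the paper's re-estimation formulae): an unconstrained transform pairs a mean transform (A, b) with an independent variance transform H, while the constrained variant reuses A for the covariance, which makes it equivalent to a feature-space transform plus a Jacobian term.

```python
import numpy as np

def gauss_logpdf(x, mu, sigma):
    """Log-density of a full-covariance Gaussian."""
    d = len(mu)
    diff = x - mu
    _, logdet = np.linalg.slogdet(sigma)
    return -0.5 * (d * np.log(2 * np.pi) + logdet
                   + diff @ np.linalg.solve(sigma, diff))

rng = np.random.default_rng(0)
d = 3
mu = rng.normal(size=d)
L = rng.normal(size=(d, d))
sigma = L @ L.T + d * np.eye(d)                 # SPD covariance
A = np.eye(d) + 0.1 * rng.normal(size=(d, d))   # mean transform matrix
b = rng.normal(size=d)                          # bias
x = rng.normal(size=d)                          # an observation

# Unconstrained: mean and variance transforms chosen independently
mu_u = A @ mu + b
H = np.diag(rng.uniform(0.5, 1.5, size=d))      # e.g. a diagonal variance transform
sigma_u = H @ sigma @ H.T
ll_u = gauss_logpdf(x, mu_u, sigma_u)

# Constrained: the variance transform shares the mean transform's matrix
mu_c = A @ mu + b
sigma_c = A @ sigma @ A.T

# The constrained transform is equivalent to a feature-space transform:
# N(x; A mu + b, A Sigma A^T) = |det A|^{-1} N(A^{-1}(x - b); mu, Sigma)
lhs = gauss_logpdf(x, mu_c, sigma_c)
x_t = np.linalg.solve(A, x - b)
rhs = gauss_logpdf(x_t, mu, sigma) - np.linalg.slogdet(A)[1]
print(np.allclose(lhs, rhs))
```

The equivalence in the last lines is why constrained model-space transforms are also called feature-space transforms: the features can be transformed once and the models left untouched, which is what makes them attractive for speaker adaptive training.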
Semi-tied covariance matrices for hidden Markov models
- IEEE Trans. Speech and Audio Processing, 1999
"... ..."
(Show Context)
Statistical parametric speech synthesis
- Proc. ICASSP, 2007
"... This paper gives a general overview of techniques in statistical parametric speech synthesis. One of the instances of these techniques, called HMM-based generation synthesis (or simply HMM-based synthesis), has recently been shown to be very effective in generating acceptable speech synthesis. This ..."
Abstract
-
Cited by 179 (18 self)
- Add to MetaCart
This paper gives a general overview of techniques in statistical parametric speech synthesis. One of the instances of these techniques, called HMM-based generation synthesis (or simply HMM-based synthesis), has recently been shown to be very effective in generating acceptable synthetic speech. This paper also contrasts these techniques with the more conventional unit-selection technology that has dominated speech synthesis over the last ten years. Advantages and disadvantages of statistical parametric synthesis are highlighted, as well as where we expect the key developments to appear in the immediate future. Index Terms — Speech synthesis, hidden Markov models
Recent advances in the automatic recognition of audiovisual speech
- Proceedings of the IEEE
"... Abstract — Visual speech information from the speaker’s mouth region has been successfully shown to improve noise robustness of automatic speech recognizers, thus promising to extend their usability into the human computer interface. In this paper, we review the main components of audio-visual autom ..."
Abstract
-
Cited by 172 (16 self)
- Add to MetaCart
(Show Context)
Visual speech information from the speaker's mouth region has been shown to improve the noise robustness of automatic speech recognizers, thus promising to extend their usability in the human-computer interface. In this paper, we review the main components of audio-visual automatic speech recognition and present novel contributions in two main areas: first, the visual front-end design, based on a cascade of linear image transforms of an appropriate video region of interest, and subsequently, audio-visual speech integration. On the latter topic, we discuss new work on feature and decision fusion combination, the modeling of audio-visual speech asynchrony, and incorporating modality reliability estimates into the bimodal recognition process. We also briefly touch upon the issue of audio-visual speaker adaptation. We apply our algorithms to three multi-subject bimodal databases, ranging from small- to large-vocabulary recognition tasks, recorded in both visually controlled and challenging environments. Our experiments demonstrate that the visual modality improves automatic speech recognition over all conditions and data considered, though less so in visually challenging environments and on large-vocabulary tasks. Index Terms — Audio-visual speech recognition, speechreading, visual feature extraction, audio-visual fusion, hidden Markov model, multi-stream HMM, product HMM, reliability estimation, adaptation, audio-visual databases.
An overview of text-independent speaker recognition: from features to supervectors
, 2009
"... This paper gives an overview of automatic speaker recognition technology, with an emphasis on text-independent recognition. Speaker recognition has been studied actively for several decades. We give an overview of both the classical and the state-of-the-art methods. We start with the fundamentals of ..."
Abstract
-
Cited by 156 (37 self)
- Add to MetaCart
This paper gives an overview of automatic speaker recognition technology, with an emphasis on text-independent recognition. Speaker recognition has been studied actively for several decades. We give an overview of both the classical and the state-of-the-art methods. We start with the fundamentals of automatic speaker recognition, concerning feature extraction and speaker modeling. We then elaborate on advanced computational techniques that address robustness and session variability. The recent progress from vectors towards supervectors opens up a new area of exploration and represents a technology trend. We also provide an overview of this recent development and discuss the evaluation methodology of speaker recognition systems. We conclude the paper with a discussion of future directions.
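The move from frame-level feature vectors to supervectors mentioned in this abstract can be illustrated with a toy example. The sketch below uses one common recipe, relevance-MAP adaptation of a small universal background model's means, rather than any specific method from the paper; all parameters are hypothetical. The adapted component means are stacked into one fixed-length supervector representing the whole utterance.

```python
import numpy as np

rng = np.random.default_rng(1)
n_comp, d = 4, 2                       # toy UBM: 4 Gaussians over 2-dim features
ubm_means = rng.normal(size=(n_comp, d))
ubm_vars = np.ones((n_comp, d))        # diagonal covariances
ubm_w = np.full(n_comp, 1.0 / n_comp)  # component weights

def log_gauss_diag(X, means, variances):
    # (T, n_comp) log-densities under diagonal-covariance Gaussians
    diff = X[:, None, :] - means[None, :, :]
    return -0.5 * (np.log(2 * np.pi * variances).sum(-1)
                   + (diff ** 2 / variances).sum(-1))

def map_adapt_supervector(X, r=16.0):
    # Posterior responsibility of each UBM component for each frame
    logp = log_gauss_diag(X, ubm_means, ubm_vars) + np.log(ubm_w)
    logp -= logp.max(axis=1, keepdims=True)
    gamma = np.exp(logp)
    gamma /= gamma.sum(axis=1, keepdims=True)
    n = gamma.sum(axis=0)                              # soft counts per component
    ex = gamma.T @ X / np.maximum(n, 1e-10)[:, None]   # first-order statistics
    alpha = (n / (n + r))[:, None]                     # relevance factor r
    adapted = alpha * ex + (1 - alpha) * ubm_means     # MAP mean update
    return adapted.reshape(-1)                         # stacked means: supervector

X = rng.normal(size=(50, d)) + 0.5     # 50 frames from one utterance
sv = map_adapt_supervector(X)
print(sv.shape)                        # fixed length n_comp * d, regardless of T
```

Because the supervector's length is independent of utterance duration, utterances can be compared directly with kernels or classifiers, which is the practical appeal of the supervector representation.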
The Kaldi speech recognition toolkit
- Proc. ASRU, 2011
"... Abstract-We describe the design of Kaldi, a free, open-source toolkit for speech recognition research. Kaldi provides a speech recognition system based on finite-state transducers (using the freely available OpenFst), together with detailed documentation and scripts for building complete recognitio ..."
Abstract
-
Cited by 147 (16 self)
- Add to MetaCart
(Show Context)
We describe the design of Kaldi, a free, open-source toolkit for speech recognition research. Kaldi provides a speech recognition system based on finite-state transducers (using the freely available OpenFst), together with detailed documentation and scripts for building complete recognition systems. Kaldi is written in C++, and the core library supports modeling of arbitrary phonetic-context sizes, acoustic modeling with subspace Gaussian mixture models (SGMMs) as well as standard Gaussian mixture models, together with all commonly used linear and affine transforms. Kaldi is released under the Apache License v2.0, which is highly nonrestrictive, making it suitable for a wide community of users.
Mean and Variance Adaptation within the MLLR Framework
- Computer Speech &amp; Language, 1996
"... One of the key issues for adaptation algorithms is to modify a large number of parameters with only a small amount of adaptation data. Speaker adaptation techniques try to obtain near speaker dependent (SD) performance with only small amounts of speaker specific data, and are often based on initi ..."
Abstract
-
Cited by 145 (15 self)
- Add to MetaCart
One of the key issues for adaptation algorithms is to modify a large number of parameters with only a small amount of adaptation data. Speaker adaptation techniques try to obtain near speaker-dependent (SD) performance with only small amounts of speaker-specific data, and are often based on initial speaker-independent (SI) recognition systems. Some of these speaker adaptation techniques may also be applied to the task of adaptation to a new acoustic environment. In this case an SI recognition system trained in, typically, a clean acoustic environment is adapted to operate in a new, noise-corrupted acoustic environment. This paper examines the Maximum Likelihood Linear Regression (MLLR) adaptation technique. MLLR estimates linear transformations for groups of model parameters to maximise the likelihood of the adaptation data. Previously, MLLR has been applied to the mean parameters in mixture Gaussian HMM systems. In this paper MLLR is extended to also update the Gaussian variances, and re-estimation formulae are derived for these variance transforms. MLLR with variance compensation is evaluated on several large vocabulary recognition tasks. The use of mean and variance MLLR adaptation was found to give an additional 2% to 7% decrease in word error rate over mean-only MLLR adaptation.
The LIMSI Broadcast News Transcription System
- Speech Communication, 2002
"... This paper reports on activites at LIMSI over the last few years directed at the transcription of broadcast news data. We describe our development work in moving from laboratory read speech data to real-world or `found' speech data in preparation for the ARPA Nov96, Nov97 and Nov98 evaluatio ..."
Abstract
-
Cited by 131 (12 self)
- Add to MetaCart
(Show Context)
This paper reports on activities at LIMSI over the last few years directed at the transcription of broadcast news data. We describe our development work in moving from laboratory read speech data to real-world or `found' speech data in preparation for the ARPA Nov96, Nov97 and Nov98 evaluations. Two main problems needed to be addressed to deal with the continuous flow of inhomogeneous data. These concern the varied acoustic nature of the signal (signal quality, environmental and transmission noise, music) and the different linguistic styles (prepared and spontaneous speech on a wide range of topics, spoken by a large variety of speakers).
Audio-visual automatic speech recognition: An overview
- In Issues in Visual and Audio-Visual Speech Processing, 2004
"... ..."
(Show Context)
Video-Based Face Recognition Using Adaptive Hidden Markov Models
, 2003
"... While traditional face recognition is typically based on still images, face recognition from video sequences has become popular recently. In this paper, we propose to use adaptive Hidden Markov Models (HMM) to perform videobased face recognition. During the training process, the statistics of traini ..."
Abstract
-
Cited by 106 (3 self)
- Add to MetaCart
(Show Context)
While traditional face recognition is typically based on still images, face recognition from video sequences has become popular recently. In this paper, we propose to use adaptive Hidden Markov Models (HMMs) to perform video-based face recognition. During the training process, the statistics of the training video sequences of each subject, and the temporal dynamics, are learned by an HMM. During the recognition process, the temporal characteristics of the test video sequence are analyzed over time by the HMM corresponding to each subject. The likelihood scores provided by the HMMs are compared, and the highest score gives the identity of the test video sequence. Furthermore, with unsupervised learning, each HMM is adapted to the test video sequence, which results in better modeling over time. Based on extensive experiments with various databases, we show that the proposed algorithm provides better performance than majority voting over image-based recognition results.
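The scoring step this abstract describes, evaluating the test sequence under each subject's HMM and picking the highest likelihood, can be sketched with discrete-observation HMMs. All parameters below are hypothetical toy values, and the scaled forward algorithm stands in for whatever observation model the paper actually uses.

```python
import numpy as np

def forward_loglik(obs, pi, A, B):
    """Log-likelihood of a discrete observation sequence under an HMM
    (forward algorithm with per-step scaling for numerical stability)."""
    alpha = pi * B[:, obs[0]]
    ll = np.log(alpha.sum())
    alpha /= alpha.sum()
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
        ll += np.log(alpha.sum())
        alpha /= alpha.sum()
    return ll

# Two toy subject HMMs (hypothetical parameters): subject 1 has peaked
# transition/emission distributions, subject 2 is near-uniform.
pi = np.array([0.6, 0.4])
A1 = np.array([[0.9, 0.1], [0.2, 0.8]])
B1 = np.array([[0.8, 0.2], [0.3, 0.7]])
A2 = np.array([[0.5, 0.5], [0.5, 0.5]])
B2 = np.array([[0.5, 0.5], [0.5, 0.5]])

obs = np.array([0, 0, 1, 0, 0])        # vector-quantized test video sequence

scores = {"subject_1": forward_loglik(obs, pi, A1, B1),
          "subject_2": forward_loglik(obs, pi, A2, B2)}
identity = max(scores, key=scores.get)  # highest likelihood wins
print(identity)
```

The adaptation idea in the abstract would then re-estimate the winning subject's HMM parameters on the test sequence itself before the next recognition pass, gradually specializing each model to its subject's recent appearance.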