Connectionist Speaker Normalization And Adaptation
- in Eurospeech
, 1995
Cited by 4 (0 self)
In a speaker-independent, large-vocabulary continuous speech recognition system, recognition accuracy varies considerably from speaker to speaker, and performance may be significantly degraded for outlier speakers such as nonnative talkers. In this paper, we explore supervised speaker adaptation and normalization in the MLP component of a hybrid hidden Markov model/multilayer perceptron version of SRI's DECIPHER™ speech recognition system. Normalization is implemented through an additional transformation network that preprocesses the cepstral input to the MLP. Adaptation is accomplished through incremental retraining of the MLP weights on adaptation data. Our approach combines both adaptation and normalization in a single, consistent manner, works with limited adaptation data, and is text-independent. We show significant improvement in recognition accuracy.
Enhancements to Transformation-Based Speaker Adaptation: Principal Component and Inter-Class Maximum Likelihood Linear Regression
, 2000
Cited by 4 (1 self)
In this thesis we improve speech recognition accuracy by obtaining better estimation of linear transformation functions with a small amount of adaptation data in speaker adaptation. The major contribution of this thesis is the development of two new adaptation algorithms to improve maximum likelihood linear regression (MLLR). The first one is called principal component MLLR (PC-MLLR), and it reduces the variance of the estimate of the MLLR matrix using principal component analysis. The second one is called inter-class MLLR, and it utilizes relationships among different transformation functions to achieve more reliable estimates of MLLR parameters across multiple classes. The main idea of PC-MLLR is that if we estimate the MLLR matrix in the eigendomain, the variances of the components of the estimates are inversely proportional to their eigenvalues. Therefore we can select more reliable components to reduce the variances of the resulting estimates and to improve speech recognition accuracy. PC-MLLR eliminates highly variable components and chooses the principal components corresponding to the largest eigenvalues. If all the components are used, PC-MLLR becomes the same as conventional MLLR. Choosing fewer principal components increases the bias of the estimates, which can reduce recognition accuracy. To compensate for this problem, we developed weighted principal component MLLR (WPC-MLLR). Instead of eliminating some of the components, all the components in WPC-MLLR are used after applying weights that minimize the mean square error. The component corresponding to a larger eigenvalue has a larger weight than the component corresponding to a smaller eigenvalue. As more adaptation data become available, the benefits from these methods may become smaller because ...
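The selection and weighting ideas in this abstract can be illustrated with a small sketch. It is not the thesis's algorithm: it assumes each eigendomain component estimate is unbiased with variance `noise_var / eigenvalue`, and it uses a generic scalar shrinkage weight that minimizes mean square error under that assumption. All function names and the `noise_var` parameter are hypothetical.

```python
# Illustrative sketch only; the thesis's actual estimation formulas are not
# given in the abstract. Assumption: component j of the estimate has
# variance noise_var / eigenvalues[j].

def component_variances(eigenvalues, noise_var=1.0):
    """Variance of each component estimate, inversely proportional to its eigenvalue."""
    return [noise_var / lam for lam in eigenvalues]

def select_principal(estimates, eigenvalues, k):
    """PC-style selection: keep the k components with the largest eigenvalues, zero the rest."""
    order = sorted(range(len(eigenvalues)), key=lambda j: eigenvalues[j], reverse=True)
    keep = set(order[:k])
    return [c if j in keep else 0.0 for j, c in enumerate(estimates)]

def mse_weights(components, eigenvalues, noise_var=1.0):
    """Weighted variant: w = c^2 / (c^2 + variance) minimizes E[(w * c_hat - c)^2]
    for an unbiased c_hat. The true value c is unknown in practice, so a real
    system would have to approximate it."""
    return [c * c / (c * c + noise_var / lam) for c, lam in zip(components, eigenvalues)]

eigenvalues = [4.0, 1.0, 0.25]
variances = component_variances(eigenvalues)
truncated = select_principal([1.0, 2.0, 3.0], eigenvalues, k=2)
weights = mse_weights([1.0, 1.0, 1.0], eigenvalues)
```

With equal component magnitudes, the weights decrease as the eigenvalue decreases, matching the abstract's statement that larger eigenvalues receive larger weights. Setting `k` to the number of components leaves the estimate unchanged, which corresponds to conventional MLLR.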
Automatic Speech Recognition for second language learning: How and why it actually works
- In Proceedings of the International Congress of Phonetic Sciences
, 2003
Cited by 4 (0 self)
In this paper, we examine various studies and reviews on the usability of Automatic Speech Recognition (ASR) technology as a tool to train pronunciation in the second language (L2). We show that part of the criticism that has been addressed to this technology is not warranted, being rather the result of limited familiarity with ASR technology and with broader Computer Assisted Language Learning (CALL) courseware design matters. In our analysis we also consider actual problems of state-of-the-art ASR technology, with a view to indicating how ASR can be employed to develop courseware that is both pedagogically sound and reliable.
Rapid Speaker Adaptation for Neural Network Speech Recognizers
, 1997
Cited by 3 (0 self)
1 Introduction
1.1 Thesis Outline
2 Speech Recognition with Neural Networks
2.1 The Speech Recognition Problem
2.2 Hybrid Systems
2.2.1 Architecture
2.2.2 Evaluation and Training
3 Review of Adaptation Literature
3.1 Speaker Adaptation/Normalization
3.1.1 Speaker Categorization Approaches
3.1.2 Data/Feature Transformation Approaches ...
Noise-Resistant Feature Extraction And Model Training For Robust Speech Recognition
- in Proceedings of the 1996 DARPA CSR Workshop
, 1996
Cited by 3 (1 self)
In this paper we report on our recent work on noise-robust feature extraction and model training to alleviate the mismatch caused by different microphones and ambient room noise in the context of the 1995 DARPA-sponsored H3 benchmark test, which used the unlimited-vocabulary North American Business News (NABN) database. We present a novel noise-robust feature extraction algorithm that is a combination of our previously developed minimum mean square error (MMSE) log-energy estimation algorithm and the probabilistic optimum filtering (POF) algorithm. We also studied an approach based on training the automatic speech recognition (ASR) system with previously collected noisy speech. While both the above approaches gave significant improvements, it was found that combining them gave the best results. We also report on a new part-of-speech (POS) language model that makes it possible to train robust POS language models that incorporate longer contexts than is possible with word-based language ...
Generating Training Data for Medical Dictations
- In Proceedings of the NAACL
, 2001
Cited by 2 (0 self)
In automatic speech recognition (ASR) enabled applications for medical dictations, corpora of literal transcriptions of speech are critical for training both speaker-independent and speaker-adapted acoustic models. Obtaining these transcriptions is both costly and time-consuming. Non-literal transcriptions, on the other hand, are easy to obtain because they are generated in the normal course of a medical transcription operation. This paper presents a method of automatically generating texts that can take the place of literal transcriptions for training acoustic and language models. ATRS is an automatic transcription reconstruction system that can produce near-literal transcriptions with almost no human labor. We will show that (i) adapted acoustic models trained on ATRS data perform as well as or better than adapted acoustic models trained on literal transcriptions (as measured by recognition accuracy) and (ii) language models trained on ATRS data have lower perplexity than language models trained on non-literal data.
Rate-Dependent Acoustic Modeling For Large Vocabulary Conversational Speech Recognition
- In Proceedings NIST Speech Transcription Workshop
, 2000
Cited by 2 (1 self)
Variations in rate of speech (ROS) produce changes in both spectral features and word pronunciations that affect automatic speech recognition (ASR) systems. To deal with these ROS effects, we propose to use parallel, rate-specific acoustic models: one for fast speech, the other for slow speech. Rate switching is permitted at word boundaries, to allow modeling of within-sentence speech rate variation, which is common in conversational speech. Due to the parallel structure of rate-specific models and the maximum likelihood decoding method, we do not need high-quality ROS estimation before recognition, which is usually hard to achieve. In this paper, we evaluate our approach on a large-vocabulary conversational speech recognition (LVCSR) task over the telephone, with several minimal pair comparisons based on different baseline systems. Experiments show that on a development set for the 2000 Hub-5 evaluation, introducing word-level ROS-dependent models results in a 1.9% absolute win over a bas...
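Why parallel rate-specific models remove the need for a separate ROS estimate can be sketched as a two-state Viterbi search over words. This is a toy illustration under stated assumptions, not the paper's decoder: each word is assumed to come with a log-likelihood under a fast and a slow model, and the `switch_penalty` parameter and the function name `best_rate_path` are hypothetical.

```python
# Toy sketch: choose a rate (fast or slow) per word by maximum likelihood,
# allowing the rate to switch only at word boundaries.

def best_rate_path(word_scores, switch_penalty=0.0):
    """word_scores: list of (logp_fast, logp_slow) tuples, one per word.
    Returns (best total log score, list of chosen rates)."""
    rates = ("fast", "slow")
    if not word_scores:
        return 0.0, []
    best = list(word_scores[0])   # best score ending in each rate
    back = []                     # backpointers, one pair per later word
    for scores in word_scores[1:]:
        new_best, ptr = [], []
        for r in range(2):
            # Score of arriving in rate r from each previous rate p.
            cands = [best[p] - (switch_penalty if p != r else 0.0) for p in range(2)]
            p_star = 0 if cands[0] >= cands[1] else 1
            new_best.append(cands[p_star] + scores[r])
            ptr.append(p_star)
        best = new_best
        back.append(ptr)
    r = 0 if best[0] >= best[1] else 1
    total = best[r]
    path = [r]
    for ptr in reversed(back):    # trace the best path backwards
        r = ptr[r]
        path.append(r)
    path.reverse()
    return total, [rates[i] for i in path]

scores = [(-1.0, -2.0), (-3.0, -1.0), (-1.0, -4.0)]
free_total, free_path = best_rate_path(scores)
sticky_total, sticky_path = best_rate_path(scores, switch_penalty=5.0)
```

With no penalty the search picks the better-scoring rate for each word independently; a large penalty makes the path stay in one rate for the whole utterance.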
Maximum Likelihood Stochastic Transformation Adaptation for Medium and Small Data Sets
, 2001
Cited by 2 (0 self)
Speaker adaptation is recognized as an essential part of today's large-vocabulary automatic speech recognition systems. A family of techniques that has been extensively applied for limited adaptation data is transformation-based adaptation.
Robust Automatic Speech Recognition With Unreliable Data
, 1999
Cited by 2 (0 self)
Theoretical and practical issues of some of the problems in robust automatic speech recognition (ASR) and some of the techniques that address them are presented in this report. The problem of the robustness of ASR in real-life (as opposed to laboratory) conditions is paramount to the widespread deployment of speech-enabled products. The report reviews techniques used so far for robust ASR, ranging from simple spectrum subtraction to various types of model adaptation. A possible connection of robust ASR with computational auditory scene analysis (CASA), methods for local signal-to-noise ratio (SNR) estimation, and classification/scoring with on-line adapted statistical models is discussed. The main focus is on the techniques that would allow for incorporation of CASA and local SNR estimates (used as methods for speech/non-speech separation) into the present prevailing stochastic pattern matching paradigms: hidden Markov models (HMM) and artificial neural networks (ANN). Th...
Joint Maximum A Posteriori Estimation of Transformation and Hidden Markov . . .
- IEEE Transactions on Speech and Audio Processing
, 2001
Cited by 2 (0 self)
Model adaptation techniques can usually be divided into indirect and direct approaches. On one hand, indirect or transformation-based techniques assume that a general transformation shared amongst different acoustic units is applied to clusters of model parameters. Such approaches (e.g. MLLR) are quite efficient when the amount of adaptation data is limited, but have poor asymptotic properties as the amount of adaptation data increases. On the other hand, direct adaptation approaches, like maximum a posteriori (MAP) estimation, have nice asymptotic properties but provide only a moderate improvement when the amount of adaptation data is small. In this work, we jointly optimize a direct and an indirect adaptation to take advantage of both approaches. Contrary to published approaches where direct and indirect adaptation are performed one after the other with a very loose interaction and no joint estimation criterion, we propose to estimate an MLLR-like transformation as well as the HMM mean vectors simultaneously, using a MAP estimation criterion. The optimal interaction between the direct and indirect adaptation associated with the prior knowledge provided by the MAP criterion leads to improvement over MLLR and MAP for all sizes of adaptation data evaluated.
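The contrast between direct and indirect adaptation drawn in this abstract can be sketched in one dimension. This is a minimal illustration of the two separate textbook techniques, not the paper's joint estimator. It assumes scalar Gaussian means, hard frame-to-Gaussian assignments, and equal variances, and the names `map_mean`, `fit_shared_affine`, and the prior weight `tau` are hypothetical.

```python
# Minimal 1-D sketch of direct (MAP) versus indirect (shared affine) adaptation.

def map_mean(prior_mean, frames, tau=10.0):
    """Direct adaptation: MAP update of one Gaussian mean.
    Interpolates between the prior mean and the data, weighted by tau."""
    return (tau * prior_mean + sum(frames)) / (tau + len(frames))

def fit_shared_affine(means, frames_per_gaussian):
    """Indirect adaptation: fit one transform x ~ a * mean + b shared by all
    Gaussians, by least squares over every observed frame."""
    pairs = [(m, x) for m, frames in zip(means, frames_per_gaussian) for x in frames]
    n = len(pairs)
    mean_m = sum(m for m, _ in pairs) / n
    mean_x = sum(x for _, x in pairs) / n
    var = sum((m - mean_m) ** 2 for m, _ in pairs)
    if var == 0.0:
        # Only one distinct mean was observed: fall back to a pure bias.
        return 1.0, mean_x - mean_m
    cov = sum((m - mean_m) * (x - mean_x) for m, x in pairs)
    a = cov / var
    return a, mean_x - a * mean_m

# MAP moves slowly with little data and converges to the data mean with a lot.
few = map_mean(0.0, [1.0] * 10)
many = map_mean(0.0, [1.0] * 1000)

# The shared transform also moves a Gaussian that received no adaptation data.
means = [0.0, 1.0, 2.0]
a, b = fit_shared_affine(means, [[0.5, 0.5], [2.5, 2.5], []])
adapted_unseen = a * means[2] + b
```

The MAP update only touches Gaussians that have data, which is why it helps little with small adaptation sets, whereas the shared transform adapts every mean but cannot model speaker-specific detail once its few parameters are well estimated.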

