Results 1 - 10 of 27
Speaker recognition with session variability normalization based on MLLR adaptation transforms
- IEEE TRANS. AUDIO, 2007
Cited by 15 (4 self)
We present a new modeling approach for speaker recognition that uses the maximum-likelihood linear regression (MLLR) adaptation transforms employed by a speech recognition system as features for support vector machine (SVM) speaker models. This approach is attractive because, unlike standard frame-based cepstral speaker recognition models, it normalizes for the choice of spoken words in text-independent speaker verification without data fragmentation. We discuss the basics of the MLLR-SVM approach, and show how it can be enhanced by combining transforms relative to multiple reference models, with excellent results on recent English NIST evaluation sets. We then show how the approach can be applied even if no full word-level recognition system is available, which allows its use on non-English data even without matching speech recognizers. Finally, we examine how two recently proposed algorithms for intersession variability compensation perform in conjunction with MLLR-SVM.
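The idea of using adaptation transforms as speaker features can be illustrated with a minimal Python sketch. It assumes the usual affine form of an MLLR mean transform (adapted mean = A * mean + b) and assumes the SVM feature vector is simply the flattened coefficients of [A | b]; the function names and this layout are illustrative choices, not details taken from the abstract.

```python
def apply_mllr(A, b, mean):
    """Adapt one Gaussian mean with an affine transform: mean' = A * mean + b."""
    return [sum(a_ij * m_j for a_ij, m_j in zip(row, mean)) + b_i
            for row, b_i in zip(A, b)]


def transform_to_feature(A, b):
    """Flatten [A | b] row by row into one vector (assumed feature layout)."""
    return [x for row, b_i in zip(A, b) for x in (*row, b_i)]


def concat_features(transforms):
    """Concatenate the features of several (A, b) transforms, for example
    transforms estimated relative to multiple reference models."""
    return [x for A, b in transforms for x in transform_to_feature(A, b)]


# Hypothetical 2-dimensional example values.
A = [[1.0, 0.0], [0.0, 2.0]]
b = [0.5, -1.0]
adapted = apply_mllr(A, b, [2.0, 3.0])
feature = transform_to_feature(A, b)
```

Because the feature depends only on the transform coefficients, its dimension is fixed regardless of how much speech was spoken or which words were used.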
Automatic Scoring of Pronunciation Quality
- Speech Communication, 1999
Cited by 15 (4 self)
We present a paradigm for the automatic assessment of pronunciation quality by machine.
Training Data Clustering For Improved Speech Recognition
- in Proceedings of EUROSPEECH, 1995
Cited by 11 (3 self)
We present an approach to cluster the training data for automatic speech recognition (ASR). A relative-entropy-based distance metric between training data clusters is defined. This metric is used to hierarchically cluster the training data. The metric can also be used to select the closest training data clusters given a small amount of data from the test speaker. The selected clusters are then used to estimate a set of hidden Markov models (HMMs) for recognizing the speech from the test speaker. We present preliminary experimental results of the clustering algorithm and its application to ASR.
1 Introduction
While progress in ASR has been encouraging, it has become increasingly clear that ASR systems must perform well in the presence of mismatches between the training and testing environments. ASR systems trained in one environment often perform poorly in a new environment due to mismatches between the training and testing conditions. Common sources of mismatches include different tran...
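A relative-entropy distance between clusters, and its use for picking the clusters closest to a test speaker, might look like the Python sketch below. The abstract does not give the exact metric, so this assumes each cluster is summarized by a single diagonal-covariance Gaussian (mean list, variance list) and uses the symmetrized Kullback-Leibler divergence; all names are hypothetical.

```python
import math


def kl_diag_gauss(m0, v0, m1, v1):
    """KL(N0 || N1) for diagonal Gaussians given as mean and variance lists."""
    return 0.5 * sum(math.log(b / a) + (a + (x - y) ** 2) / b - 1.0
                     for x, a, y, b in zip(m0, v0, m1, v1))


def sym_kl(g0, g1):
    """Symmetrized relative entropy between two (mean, variance) summaries."""
    return kl_diag_gauss(*g0, *g1) + kl_diag_gauss(*g1, *g0)


def closest_clusters(test, clusters, k):
    """Return the names of the k clusters nearest to the test-speaker summary."""
    return sorted(clusters, key=lambda name: sym_kl(test, clusters[name]))[:k]


# Hypothetical one-dimensional cluster summaries.
clusters = {
    "c1": ([0.0], [1.0]),
    "c2": ([1.0], [1.0]),
    "c3": ([5.0], [2.0]),
}
test_speaker = ([0.8], [1.0])
selected = closest_clusters(test_speaker, clusters, 2)
```

The same pairwise distance could drive a hierarchical (agglomerative) clustering by repeatedly merging the two closest clusters, which is omitted here for brevity.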
Robust Text-Independent Speaker Identification over Telephone Channels
- IEEE Trans. on Speech and Audio Processing, 1997
Cited by 10 (1 self)
This paper addresses the issue of closed-set text-independent speaker identification from samples of speech recorded over the telephone. It focuses on the effects of acoustic mismatches between training and testing data, and concentrates on two approaches: extracting features that are robust against channel variations, and transforming the speaker models to compensate for channel effects. First, an experimental study shows that optimizing the front end processing of the speech signal can significantly improve speaker recognition performance. A new filterbank design is introduced to improve the robustness of the speech spectrum computation in the front-end unit. Next, a new feature based on spectral slopes is described. Its ability to discriminate between speakers is shown to be superior to that of the traditional cepstrum. This feature can be used alone or combined with the cepstrum. The second part of the paper presents two model transformation methods that further reduce channel effe...
Model Transformation For Robust Speaker Recognition From Telephone Data
- in ICASSP-97, 1997
Cited by 9 (0 self)
In the context of automatic speaker recognition, we propose a model transformation technique that renders speaker models more robust to acoustic mismatches and to data scarcity by appropriately increasing their variances. We use a stereo database containing speech recorded simultaneously under different acoustic conditions to derive a synthetic variance distribution. This distribution is then used to modify the variances of other speaker models from other telephone databases. The technique is illustrated with experiments conducted on a locally collected database and on the NIST'95 and '96 subsets of the Switchboard Corpus.
1. INTRODUCTION
Many applications of speaker identification systems (speaker-ID for short) assume that the users access the system remotely. Typically, the channel involved in the communication is that of the telephone. Because the handset and the line can vary from call to call, there is often an acoustic mismatch between the data collected to train the speaker mo...
The development of SRI’s 1997 Broadcast News transcription system
- In Proceedings DARPA Broadcast News Transcription and Understanding Workshop
Cited by 9 (5 self)
This paper describes SRI’s 1997 broadcast news transcription system used for the 1997 DARPA H4 evaluations. Our system had several novel components. These include automatic segmentation of entire broadcast shows, word-internal and cross-word acoustic models robustly estimated with a new Gaussian Merging-Splitting (GMS) algorithm, the use of trigram language models (LMs) in lattices instead of for rescoring N-best lists, and an LM pruning algorithm that allows efficient representation of high-order (such as 4- or 5-gram) LMs. We briefly describe these features and give comparative experimental results. We achieved an 18.7% relative improvement in performance on our 1996 H4 partitioned evaluation (PE) development test set as compared to our 1996 H4 PE evaluation system.
Stochastic Feature Transformation with Divergence-Based Out-of-Handset Rejection for Robust Speaker Verification
- EURASIP J. on Applied Signal Processing, 2004
Cited by 9 (6 self)
The performance of telephone-based speaker verification systems can be severely degraded by linear and non-linear acoustic distortion caused by telephone handsets. This paper proposes to combine a handset selector with stochastic feature transformation to reduce the distortion. Specifically, a GMM-based handset selector is trained to identify the most likely handset used by the claimants, and then handset-specific stochastic feature transformations are applied to the distorted feature vectors. This paper also proposes a divergence-based handset selector with out-of-handset (OOH) rejection capability to identify the 'unseen' handsets. This is achieved by measuring the Jensen difference between the selector's output and a constant vector with identical elements. The resulting handset selector is combined with the proposed feature transformation technique for telephone-based speaker verification. Experimental results based on 150 speakers of the HTIMIT corpus show that the handset selector, either with or without OOH rejection capability, is able to identify the 'seen' handsets accurately (98.3% in both cases). Results also demonstrate that feature transformation performs significantly better than the classical cepstral mean normalization approach. Finally, by using the transformation parameters of the 'seen' handsets to transform the utterances with correctly identified handsets and processing those utterances with 'unseen' handsets by cepstral mean subtraction, verification error rates are reduced significantly (from 12.41% to 6.59% on average).
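The out-of-handset test described above can be sketched in Python as follows. It assumes the selector's output is a vector of handset posterior probabilities and uses the standard entropy-based definition of the Jensen difference, J(p, q) = H((p + q) / 2) - (H(p) + H(q)) / 2. The rejection rule (reject when J falls below a threshold, since the output is then close to uniform) and the threshold value are assumptions for illustration, not figures from the paper.

```python
import math


def entropy(p):
    """Shannon entropy (natural log) of a probability vector."""
    return -sum(x * math.log(x) for x in p if x > 0)


def jensen_difference(p, q):
    """J(p, q) = H((p + q) / 2) - (H(p) + H(q)) / 2, which is always >= 0."""
    mid = [(a + b) / 2 for a, b in zip(p, q)]
    return entropy(mid) - (entropy(p) + entropy(q)) / 2


def select_handset(posteriors, threshold):
    """Return the index of the most likely handset, or None when the output is
    too close to a constant vector (assumed out-of-handset rejection rule)."""
    uniform = [1.0 / len(posteriors)] * len(posteriors)
    if jensen_difference(posteriors, uniform) < threshold:
        return None
    return max(range(len(posteriors)), key=posteriors.__getitem__)


# Hypothetical selector outputs and threshold.
confident = select_handset([0.97, 0.01, 0.01, 0.01], threshold=0.05)
ambiguous = select_handset([0.25, 0.25, 0.25, 0.25], threshold=0.05)
```

A rejected utterance would then fall back to cepstral mean subtraction, while an accepted one would be processed with the transformation of the identified handset.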
Maximum-likelihood stochastic-transformation adaptation of hidden Markov models
- IEEE Trans. on Speech and Audio Processing, 1999
Cited by 8 (0 self)
The recognition accuracy in recent large vocabulary automatic speech recognition (ASR) systems is highly related to the existing mismatch between the training and testing sets. For example, dialect differences across the training and testing speakers result in a significant degradation in recognition performance. Some popular adaptation approaches improve the recognition performance of speech recognizers based on hidden Markov models with continuous mixture densities by using linear transformations to adapt the means, and possibly the covariances, of the mixture Gaussians. The linear assumption, however, is too restrictive, and in this paper we propose a novel adaptation technique that adapts the means and, optionally, the covariances of the mixture Gaussians by using multiple stochastic transformations. We perform both speaker and dialect adaptation experiments, and we show that our method significantly improves the recognition accuracy and the robustness of our system. The experiments are carried out with SRI’s DECIPHER™ speech recognition system.
Index Terms: speaker adaptation, speech recognition, robust recognition.
The Use of Speaker Correlation Information for Automatic Speech Recognition
- 1998
Cited by 7 (3 self)
This dissertation addresses the independence-of-observations assumption which is typically made by today's automatic speech recognition systems. This assumption ignores within-speaker correlations which are known to exist. The assumption clearly damages the recognition ability of standard speaker-independent systems, as can be seen by the severe drop in performance exhibited by systems between their speaker-dependent mode and their speaker-independent mode. The typical solution to this problem is to apply speaker adaptation to the models of the speaker-independent system. This approach is examined in this thesis with the explicit goal of improving the rapid adaptation capabilities of the system by incorporating within-speaker correlation information into the adaptation process. This is achieved through the creation of an adaptation technique called reference speaker weighting and the development of a speaker clustering technique called speaker cluster weighting. However, speaker adaptation is just one way in which the independence assumption can be attacked. This dissertation also introduces a novel speech recognition technique called consistency modeling. This technique utilizes a priori knowledge about the within-speaker correlations which exist between different phonetic events for the purpose of incorporating speaker constraint into a speech recognition system without explicitly applying speaker adaptation. These new techniques are implemented within a segment-based speech recognition system and evaluation results are reported on the DARPA Resource Management recognition task.
On-Line Adaptation Of Hidden Markov Models Using Incremental Estimation Algorithms
- IEEE Trans. Speech Audio Processing, 1999
Cited by 6 (0 self)
The mismatch that frequently occurs between the training and testing conditions of an automatic speech recognizer can be efficiently reduced by adapting the parameters of the recognizer to the testing conditions. The maximum likelihood adaptation algorithms for continuous-density hidden-Markov-model (HMM) based speech recognizers are fast, in the sense that a small amount of data is required for adaptation. They are, however, based on reestimating the model parameters using the batch version of the expectation-maximization (EM) algorithm. The multiple iterations required for the EM algorithm to converge make these adaptation schemes computationally expensive and not suitable for on-line applications, since multiple passes through the adaptation data are required. In this paper we show how incremental versions of the EM and the segmental k-means algorithm can be used to improve the convergence of these adaptation methods so that they can be used in on-line applications.
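The contrast between batch and incremental EM can be illustrated with a toy Python sketch. It is not the paper's HMM adaptation algorithm: it assumes a one-dimensional Gaussian mixture with fixed, equal variances and fixed weights, and adapts only the means in a single pass by updating running sufficient statistics after every frame. The prior_count parameter, which anchors each mean to its initial value, is a hypothetical addition.

```python
import math


def responsibilities(x, means, var, weights):
    """Posterior probability of each component for one observation x."""
    dens = [w * math.exp(-0.5 * (x - m) ** 2 / var)
            for w, m in zip(weights, means)]
    total = sum(dens)
    return [d / total for d in dens]


def incremental_em_means(frames, means, var=1.0, weights=None, prior_count=1.0):
    """Single-pass incremental EM over the component means.

    Each frame triggers a partial E-step (posteriors for that frame only)
    followed by an immediate M-step from the accumulated statistics."""
    k = len(means)
    weights = weights or [1.0 / k] * k
    means = list(means)                      # do not modify the caller's list
    counts = [prior_count] * k               # running soft counts
    sums = [prior_count * m for m in means]  # running weighted sums
    for x in frames:
        post = responsibilities(x, means, var, weights)
        for j in range(k):
            counts[j] += post[j]
            sums[j] += post[j] * x
            means[j] = sums[j] / counts[j]
    return means


# Hypothetical adaptation data clustered near -2 and +2.
frames = [-2.1, 1.9, -1.9, 2.1] * 25
initial = [-1.0, 1.0]
adapted = incremental_em_means(frames, initial)
```

Because the parameters are refreshed after every frame, later frames are scored with already-adapted means, which is what lets the procedure run on-line without repeated passes over the data.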

