Speaker verification using Adapted Gaussian mixture models
- Digital Signal Processing, 2000
Cited by 385 (15 self)
In this paper we describe the major elements of MIT Lincoln Laboratory’s Gaussian mixture model (GMM)-based speaker verification system used successfully in several NIST Speaker Recognition Evaluations (SREs). The system is built around the likelihood ratio test for verification, using simple but effective GMMs for likelihood functions, a universal background model (UBM) for alternative speaker representation, and a form of Bayesian adaptation to derive speaker models from the UBM. The development and use of a handset detector and score normalization to greatly improve verification performance is also described and discussed. Finally, representative performance benchmarks and system behavior experiments on NIST SRE corpora are presented. © 2000 Academic Press. Key Words: speaker recognition; Gaussian mixture models; likelihood ratio detector; universal background model; handset normalization; NIST evaluation.
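The likelihood ratio test and the adaptation of speaker models from a UBM described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the Lincoln Laboratory system: it assumes diagonal-covariance GMMs, adapts only the component means, and uses a relevance factor of 16 as an illustrative default. All function names are hypothetical.

```python
import numpy as np

def _log_joint(X, weights, means, variances):
    """Log of (weight * Gaussian density) per frame and mixture component.

    X: (T, D) feature frames; weights: (M,); means, variances: (M, D), diagonal.
    """
    diff = X[:, None, :] - means[None, :, :]                      # (T, M, D)
    log_norm = -0.5 * np.sum(np.log(2.0 * np.pi * variances), axis=1)
    log_comp = log_norm - 0.5 * np.sum(diff ** 2 / variances, axis=2)
    return np.log(weights) + log_comp                             # (T, M)

def _logsumexp(a):
    """Numerically stable log(sum(exp(a))) over axis 1 of a 2-D array."""
    m = a.max(axis=1, keepdims=True)
    return (m + np.log(np.exp(a - m).sum(axis=1, keepdims=True)))[:, 0]

def avg_loglik(X, weights, means, variances):
    """Average per-frame log-likelihood of X under the GMM."""
    return float(np.mean(_logsumexp(_log_joint(X, weights, means, variances))))

def map_adapt_means(X, weights, means, variances, relevance=16.0):
    """Derive speaker means from UBM means by relevance-MAP adaptation.

    Only the means are adapted (an assumption of this sketch); the
    relevance factor of 16 is an illustrative default.
    """
    lj = _log_joint(X, weights, means, variances)
    post = np.exp(lj - _logsumexp(lj)[:, None])                   # (T, M) posteriors
    n = post.sum(axis=0)                                          # soft counts per component
    ex = (post.T @ X) / np.maximum(n, 1e-10)[:, None]             # posterior-weighted data means
    alpha = (n / (n + relevance))[:, None]                        # data-dependent mixing
    return alpha * ex + (1.0 - alpha) * means

def llr_score(X, spk_means, weights, ubm_means, variances):
    """Log-likelihood ratio: speaker model versus UBM, averaged over frames."""
    return (avg_loglik(X, weights, spk_means, variances)
            - avg_loglik(X, weights, ubm_means, variances))
```

A claim would then be accepted when `llr_score` exceeds a decision threshold chosen on development data. Components that see little enrollment data stay close to the UBM, which is the main appeal of adapting rather than training from scratch.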
Divergence-Based Out-Of-Class Rejection For Telephone Handset
- in Proc. ICSLP’02, 2002
Cited by 12 (12 self)
Research has shown that handset selectors can be used to assist telephone-based speech/speaker recognition. Most handset selectors, however, simply select the most likely handset from a set of known handsets even for speech coming from an 'unseen' handset. This paper proposes a divergence-based handset selector with out-of-handset (OOH) rejection capability to identify the 'unseen' handsets. This is achieved by measuring the Jensen difference between the selector's output and a constant vector with identical elements. The resulting handset selector is combined with a feature-based channel compensation algorithm for telephone-based speaker verification. Utterances whose handsets were identified as 'unseen' are either transformed by a global bias vector or normalized by cepstral mean subtraction (CMS). On the other hand, if the handset can be identified (considered as 'seen'), its corresponding transformation parameters will be used to transform the utterances. Experiments based on ten handsets of the HTIMIT corpus show that using the transformation parameters of the 'seen' handsets to transform the utterances with correctly identified handsets and processing those utterances with 'unseen' handsets by CMS achieves the best result.
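The rejection rule in this abstract can be illustrated with a short sketch. It assumes the selector's output is a vector of handset posteriors and uses the entropy-based Jensen difference, J(a, r) = H((a + r)/2) - (H(a) + H(r))/2, against a uniform reference vector. The threshold value and all function names are hypothetical.

```python
import numpy as np

def shannon_entropy(p):
    """Shannon entropy (natural log) of a probability vector; 0*log(0) is taken as 0."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return float(-np.sum(p[nz] * np.log(p[nz])))

def jensen_difference(a, r):
    """Jensen difference between two probability vectors (zero when they are equal)."""
    a = np.asarray(a, dtype=float)
    r = np.asarray(r, dtype=float)
    return shannon_entropy((a + r) / 2.0) - 0.5 * (shannon_entropy(a) + shannon_entropy(r))

def select_handset(posteriors, threshold):
    """Return (handset index, J), or (None, J) if the output is too close to uniform.

    A small J means the selector cannot tell the known handsets apart,
    which this sketch treats as an 'unseen' handset.
    """
    a = np.asarray(posteriors, dtype=float)
    a = a / a.sum()
    r = np.full(a.shape, 1.0 / a.size)        # constant vector with identical elements
    j = jensen_difference(a, r)
    if j < threshold:
        return None, j
    return int(np.argmax(a)), j

def cepstral_mean_subtraction(features):
    """Fallback for 'unseen' handsets: remove the per-utterance cepstral mean."""
    features = np.asarray(features, dtype=float)
    return features - features.mean(axis=0, keepdims=True)
```

When `select_handset` returns `None`, the utterance would be normalized with `cepstral_mean_subtraction`; otherwise the transformation parameters of the identified handset would be applied.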
Robust Speaker Verification From GSM-Transcoded Speech Based On Decision Fusion And Feature Transformation
- in Proc. IEEE ICASSP’03, 2003
Cited by 9 (6 self)
In speaker verification, a claimant may produce two or more utterances. Typically, the scores of the speech patterns extracted from these utterances are averaged and the resulting mean score is compared with a decision threshold. Rather than simply computing the mean score, we propose to compute the optimal weights for fusing the scores based on the score distribution of the independent utterances and our prior knowledge about the score statistics. More specifically, we use enrollment data to compute the mean scores of client speakers and impostors and consider them to be the prior scores. During verification, we set the fusion weights for individual speech patterns to be a function of the dispersion between the scores of these speech patterns and the prior scores. Experimental results based on the GSM-transcoded speech of 150 speakers from the HTIMIT corpus demonstrate that the proposed fusion algorithm can increase the dispersion between the mean speaker scores and the mean impostor scores. Compared with a baseline approach where equal weights are assigned to all scores, the proposed approach provides a relative error reduction of 19%.
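As a rough illustration of dispersion-dependent fusion, the sketch below weights each utterance score by how far it lies from a prior score. The abstract does not give the weighting formula, so both the prior (the midpoint of the client and impostor mean scores) and the weight function (a softmax over squared deviations, scaled by a hypothetical `sigma`) are assumptions of this sketch rather than the authors' method.

```python
import numpy as np

def fusion_weights(scores, prior_score, sigma=1.0):
    """Weights that grow with a score's squared distance from the prior score.

    The softmax form and the sigma scale are assumptions of this sketch.
    """
    s = np.asarray(scores, dtype=float)
    z = (s - prior_score) ** 2 / (2.0 * sigma ** 2)
    z = z - z.max()                     # stabilise the exponentials
    w = np.exp(z)
    return w / w.sum()

def fused_score(scores, client_mean, impostor_mean, sigma=1.0):
    """Weighted sum of utterance scores; the prior is the midpoint of the two
    enrollment-time mean scores (an assumption of this sketch)."""
    prior = 0.5 * (client_mean + impostor_mean)
    w = fusion_weights(scores, prior, sigma)
    return float(np.dot(w, np.asarray(scores, dtype=float)))
```

With equal weights the fused score is the plain mean used by the baseline. Here, scores far from the prior dominate, which pushes fused client and impostor scores further apart.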
Stochastic Feature Transformation with Divergence-Based Out-of-Handset Rejection for Robust Speaker Verification
- EURASIP J. on Applied Signal Processing, 2004
Cited by 9 (6 self)
The performance of telephone-based speaker verification systems can be severely degraded by linear and non-linear acoustic distortion caused by telephone handsets. This paper proposes to combine a handset selector with stochastic feature transformation to reduce the distortion. Specifically, a GMM-based handset selector is trained to identify the most likely handset used by the claimants, and then handset-specific stochastic feature transformations are applied to the distorted feature vectors. This paper also proposes a divergence-based handset selector with out-of-handset (OOH) rejection capability to identify the 'unseen' handsets. This is achieved by measuring the Jensen difference between the selector's output and a constant vector with identical elements. The resulting handset selector is combined with the proposed feature transformation technique for telephone-based speaker verification. Experimental results based on 150 speakers of the HTIMIT corpus show that the handset selector, either with or without OOH rejection capability, is able to identify the 'seen' handsets accurately (98.3% in both cases). Results also demonstrate that feature transformation performs significantly better than the classical cepstral mean normalization approach. Finally, by using the transformation parameters of the 'seen' handsets to transform the utterances with correctly identified handsets and processing those utterances with 'unseen' handsets by cepstral mean subtraction, verification error rates are reduced significantly (from 12.41% to 6.59% on average).
A New Speaker Change Detection Method For Two-Speaker Segmentation
- Proc. of IEEE ICASSP, Volume 4, IV-3908
Cited by 8 (0 self)
In the absence of prior information about speakers, an important step in speaker segmentation is to obtain initial estimates for training speaker models. In this paper, we present a new method for obtaining these estimates. The method assumes that a conversation must be initiated by one of the speakers. Thus one speaker model is estimated from the small segment at the beginning of the conversation, and the segment that has the largest distance from the initial segment is used to train the second speaker model. We describe a system based on this method and evaluate it on two different tasks: a controlled task with variations in the duration of the initial speaker segment and the amount of overlapped speech, and the 2001 NIST Speaker Recognition Evaluation task, which contains natural conversations. This system shows significant improvements over the conventional system in the absence of overlapped speech on the controlled task.
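The initialization idea in this abstract (take the opening segment for the first speaker, then the segment farthest from it for the second) can be sketched as below. The abstract does not name the distance measure, so this sketch assumes fixed-length segments, each modeled by a single diagonal Gaussian, compared with a symmetric Kullback-Leibler divergence. All names are hypothetical.

```python
import numpy as np

def gaussian_stats(frames, var_floor=1e-6):
    """Mean and floored variance of a segment, i.e. a single diagonal Gaussian."""
    frames = np.asarray(frames, dtype=float)
    return frames.mean(axis=0), np.maximum(frames.var(axis=0), var_floor)

def symmetric_kl(stats_a, stats_b):
    """Symmetric KL divergence between two diagonal Gaussians (assumed distance)."""
    (m1, v1), (m2, v2) = stats_a, stats_b
    return 0.5 * float(np.sum(v1 / v2 + v2 / v1 - 2.0
                              + (m1 - m2) ** 2 * (1.0 / v1 + 1.0 / v2)))

def initial_speaker_segments(features, seg_len):
    """Return (index of first-speaker segment, index of second-speaker segment).

    The first segment is assumed to belong to the speaker who opens the
    conversation; the second is the segment farthest from it.
    """
    features = np.asarray(features, dtype=float)
    n_seg = len(features) // seg_len
    if n_seg < 2:
        raise ValueError("need at least two segments")
    segments = [features[i * seg_len:(i + 1) * seg_len] for i in range(n_seg)]
    first = gaussian_stats(segments[0])
    dists = [symmetric_kl(first, gaussian_stats(s)) for s in segments[1:]]
    return 0, 1 + int(np.argmax(dists))
```

The two returned segments would then seed the two speaker models, which a full system would refine by re-segmenting and retraining.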
Environment Adaptation for Robust Speaker Verification
- in EUROSPEECH’03, 2003
Cited by 8 (5 self)
In speaker verification over public telephone networks, utterances can be obtained from different types of handsets. Different handsets may introduce different degrees of distortion to the speech signals. This paper attempts to combine a handset selector with (1) handset-specific transformations and (2) handset-dependent speaker models to reduce the effect caused by the acoustic distortion. Specifically, a number of Gaussian mixture models are independently trained to identify the most likely handset given a test utterance; then during recognition, the speaker model and background model are either transformed by MLLR-based handset-specific transformation or respectively replaced by a handset-dependent speaker model and a handset-dependent background model whose parameters were adapted by reinforced learning to fit the new environment. Experimental results based on 150 speakers of the HTIMIT corpus show that environment adaptation based on both MLLR and reinforced learning outperforms the classical CMS, Hnorm and Tnorm approaches, with MLLR adaptation achieving the best performance.
Speaker verification via high-level feature based phonetic-class pronunciation modeling
- IEEE Trans. on Computers
Cited by 8 (8 self)
It has been shown recently that the pronunciation characteristics of speakers can be represented by articulatory feature-based conditional pronunciation models (AFCPMs). However, the pronunciation models are phoneme-dependent, which may lead to speaker models with low discriminative power when the amount of enrollment data is limited. This paper proposes to mitigate this problem by grouping similar phonemes into phonetic classes and representing background and speaker models as phonetic-class dependent density functions. Phonemes are grouped by (1) vector quantizing the discrete densities in the phoneme-dependent universal background models, (2) using the phone properties specified in the classical phoneme tree, or (3) combining vector quantization and phone properties. Evaluations based on 2000 NIST SRE show that this phonetic-class approach effectively alleviates the data sparseness problem encountered in conventional AFCPM, which results in better performance when fused with acoustic features. Index Terms: Speaker verification, pronunciation modeling, articulatory features, phonetic classes, NIST speaker recognition evaluation.
Combining Stochastic Feature Transformation And Handset Identification For Telephone-Based Speaker Verification
- in Proc. ICASSP’02, 2002
Cited by 7 (7 self)
The performance of telephone-based speaker verification systems can be severely degraded by the acoustic mismatch caused by telephone handsets. This paper proposes to combine a handset selector with stochastic feature transformation to reduce the mismatch. Specifically, a GMM-based handset selector is trained to identify the most likely handset used by the claimants, and then handset-specific stochastic feature transformations are applied to the distorted feature vectors. To overcome the non-linear distortion introduced by telephone handsets, a 2nd-order stochastic feature transformation is proposed. Estimation algorithms based on the stochastic matching technique and the EM algorithm are derived. Experimental results based on 150 speakers of the HTIMIT corpus show that the handset selector is able to identify the handsets accurately (98.3%), and that both linear and non-linear transformations reduce the error rate significantly (from 12.37% to 5.49%).
Speaker Verification from Coded Telephone Speech Using Stochastic Feature Transformation and Handset Identification
- in The 3rd IEEE Pacific-Rim Conference on Multimedia, 2002
Cited by 5 (5 self)
A handset compensation technique for speaker verification from coded telephone speech is proposed. The proposed technique combines handset selectors with stochastic feature transformation to reduce the acoustic mismatch between different handsets and different speech coders. Coder-dependent GMM-based handset selectors are trained to identify the most likely handset used by the claimants. Stochastic feature transformations are then applied to remove the acoustic distortion introduced by the coder and the handset. Experimental results show that the proposed technique outperforms the CMS approach and significantly reduces the error rates under six different coders with bit rates ranging from 2.4 kb/s to 64 kb/s. Strong correlation between speech quality and verification performance is also observed.
Articulatory Feature-Based Conditional Pronunciation Modeling for Speaker
- Speech Communication, 2006
Cited by 5 (3 self)
Because of differences in educational background, accent, and so on, each person has a unique way of pronouncing words. This paper exploits the pronunciation characteristics of speakers and proposes a new conditional pronunciation modeling (CPM) technique for speaker verification. The proposed technique aims to establish a link between articulatory properties (e.g., manners and places of articulation) and phoneme sequences produced by a speaker. This is achieved by aligning two articulatory feature (AF) streams with a phoneme sequence determined by a phoneme recognizer, and formulating the probabilities of articulatory classes conditioned on the phonemes as speaker-dependent probabilistic models. The scores obtained from the AF-based pronunciation models are then fused with those obtained from a spectral-based speaker verification system, with the frame-by-frame fused scores weighted by the confidence of the pronunciation models. Evaluations based on the SPIDRE corpus demonstrate that AF-based CPM systems can recognize speakers even with short utterances and are readily combined with spectral-based systems to further enhance the reliability of speaker verification.

