Results 1 - 10 of 12
Environment Adaptation for Robust Speaker Verification
- In EUROSPEECH’03, 2003
Cited by 13 (5 self)
Abstract:
In speaker verification over public telephone networks, utterances can be obtained from different types of handsets. Different handsets may introduce different degrees of distortion to the speech signals. This paper attempts to combine a handset selector with (1) handset-specific transformations and (2) handset-dependent speaker models to reduce the effect caused by the acoustic distortion. Specifically, a number of Gaussian mixture models are independently trained to identify the most likely handset given a test utterance; then during recognition, the speaker model and background model are either transformed by MLLR-based handset-specific transformation or respectively replaced by a handset-dependent speaker model and a handset-dependent background model whose parameters were adapted by reinforced learning to fit the new environment. Experimental results based on 150 speakers of the HTIMIT corpus show that environment adaptation based on both MLLR and reinforced learning outperforms the classical CMS, Hnorm and Tnorm approaches, with MLLR adaptation achieving the best performance.
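The selection step this abstract describes — score a test utterance against one GMM per handset and pick the maximum-likelihood model — can be sketched in a few lines. Everything below (the 12-dimensional toy features, the 4-component mixtures, and the per-handset offsets) is invented for illustration and is not from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def diag_gmm_loglik(X, weights, means, variances):
    """Average per-frame log-likelihood of X under a diagonal-covariance GMM."""
    # X: (T, D); weights: (M,); means, variances: (M, D)
    diff = X[:, None, :] - means[None, :, :]                      # (T, M, D)
    log_comp = (-0.5 * np.sum(diff**2 / variances
                              + np.log(2 * np.pi * variances), axis=2)
                + np.log(weights))                                # (T, M)
    m = log_comp.max(axis=1, keepdims=True)                       # log-sum-exp
    return float(np.mean(m.squeeze(1)
                         + np.log(np.exp(log_comp - m).sum(axis=1))))

# Toy "handset" GMMs: each handset shifts the features differently.
handset_models = {}
for h, offset in enumerate([0.0, 2.0, -2.0]):
    means = rng.normal(offset, 0.5, size=(4, 12))   # 4 mixtures, 12-dim features
    handset_models[f"handset_{h}"] = (np.full(4, 0.25), means, np.ones((4, 12)))

# Test utterance drawn near handset_1's acoustic region.
utt = rng.normal(2.0, 0.7, size=(200, 12))
scores = {h: diag_gmm_loglik(utt, *params) for h, params in handset_models.items()}
best = max(scores, key=scores.get)
print(best)
```

In a real system the per-handset GMMs would be trained on cepstral features recorded through each handset; here they are fabricated so the example runs standalone.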
Stochastic Feature Transformation with Divergence-Based Out-of-Handset Rejection for Robust Speaker Verification
- EURASIP J. on Applied Signal Processing, 2004
Cited by 11 (6 self)
Abstract:
The performance of telephone-based speaker verification systems can be severely degraded by linear and non-linear acoustic distortion caused by telephone handsets. This paper proposes to combine a handset selector with stochastic feature transformation to reduce the distortion. Specifically, a GMM-based handset selector is trained to identify the most likely handset used by the claimants, and then handset-specific stochastic feature transformations are applied to the distorted feature vectors. This paper also proposes a divergence-based handset selector with out-of-handset (OOH) rejection capability to identify the 'unseen' handsets. This is achieved by measuring the Jensen difference between the selector's output and a constant vector with identical elements. The resulting handset selector is combined with the proposed feature transformation technique for telephone-based speaker verification. Experimental results based on 150 speakers of the HTIMIT corpus show that the handset selector, either with or without OOH rejection capability, is able to identify the 'seen' handsets accurately (98.3% in both cases). Results also demonstrate that feature transformation performs significantly better than the classical cepstral mean normalization approach. Finally, by using the transformation parameters of the 'seen' handsets to transform the utterances with correctly identified handsets and processing those utterances with 'unseen' handsets by cepstral mean subtraction, verification error rates are reduced significantly (from 12.41% to 6.59% on average).
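The Jensen difference between the selector's output and a constant vector with identical elements can be computed directly from Shannon entropies: J(p, q) = H((p+q)/2) − (H(p) + H(q))/2, which is zero when the two distributions coincide. The posteriors and the rejection threshold below are made-up illustrations, not values from the paper:

```python
import numpy as np

def entropy(p):
    """Shannon entropy in nats, skipping zero entries."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log(p[nz]))

def jensen_difference(p, q):
    """J(p, q) = H((p+q)/2) - (H(p) + H(q))/2; non-negative, 0 iff p == q."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return entropy((p + q) / 2.0) - (entropy(p) + entropy(q)) / 2.0

uniform = np.full(10, 0.1)                     # constant vector, identical elements
peaked = np.array([0.91] + [0.01] * 9)         # confident "seen handset" posterior
flat = np.array([0.12, 0.08] + [0.10] * 8)     # ambiguous "unseen handset" posterior

tau = 0.1  # illustrative rejection threshold, not from the paper
for name, post in [("peaked", peaked), ("flat", flat)]:
    j = jensen_difference(post, uniform)
    print(name, round(j, 4), "seen" if j > tau else "rejected (OOH)")
```

A peaked posterior sits far (in Jensen difference) from the uniform vector and is accepted as a 'seen' handset, while a near-uniform posterior falls below the threshold and is rejected as out-of-handset.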
Robust Speaker Verification From GSM-Transcoded Speech Based On Decision Fusion And Feature Transformation
- in Proc. IEEE ICASSP’03, 2003
Cited by 9 (6 self)
Abstract:
In speaker verification, a claimant may produce two or more utterances. Typically, the scores of the speech patterns extracted from these utterances are averaged and the resulting mean score is compared with a decision threshold. Rather than simply computing the mean score, we propose to compute the optimal weights for fusing the scores based on the score distribution of the independent utterances and our prior knowledge about the score statistics. More specifically, we use enrollment data to compute the mean scores of client speakers and impostors and consider them to be the prior scores. During verification, we set the fusion weights for individual speech patterns to be a function of the dispersion between the scores of these speech patterns and the prior scores. Experimental results based on the GSM-transcoded speech of 150 speakers from the HTIMIT corpus demonstrate that the proposed fusion algorithm can increase the dispersion between the mean speaker scores and the mean impostor scores. Compared with a baseline approach where equal weights are assigned to all scores, the proposed approach provides a relative error reduction of 19%.
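A minimal sketch of dispersion-based score fusion in the spirit of this abstract: each utterance score is weighted by how much closer it lies to the client prior than to the impostor prior. The softmax weight function, the beta parameter, and the prior scores below are all assumptions for illustration — the paper's exact weight function is not given in this abstract:

```python
import numpy as np

def fuse_scores(scores, prior_client, prior_impostor, beta=1.0):
    """Weight each utterance score by its dispersion from the prior scores:
    scores lying nearer the client prior than the impostor prior get more
    weight.  The softmax form is a stand-in for the paper's weight function."""
    s = np.asarray(scores, float)
    dispersion = np.abs(s - prior_impostor) - np.abs(s - prior_client)
    w = np.exp(beta * dispersion)
    w /= w.sum()
    return float(np.dot(w, s))

scores = [1.8, 0.2, 2.1]                 # per-utterance scores (hypothetical)
prior_client, prior_impostor = 2.0, 0.0  # mean scores estimated at enrollment
fused = fuse_scores(scores, prior_client, prior_impostor)
equal = float(np.mean(scores))           # baseline: equal-weight averaging
print(round(fused, 3), round(equal, 3))
```

For a genuine claimant, the outlying low score is down-weighted, so the fused score exceeds the plain mean — the direction of the dispersion increase the abstract reports.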
Applying Articulatory Features To Telephone-Based Speaker Verification
- in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing, 2004
Cited by 8 (4 self)
Abstract:
This paper presents an approach that uses articulatory features (AFs) derived from spectral features for telephone-based speaker verification. To minimize the acoustic mismatch caused by different handsets, handset-specific normalization is applied to the spectral features before the AFs are extracted. Experimental results based on 150 speakers using 10 different handsets show that AFs contain useful speaker-specific information for speaker verification and that handset-specific normalization significantly lowers the error rates under handset-mismatched conditions. Results also demonstrate that fusing the scores obtained from an AF-based system with those obtained from a spectral-feature-based (MFCC) system helps lower the error rates of the individual systems.
Adaptive Decision Fusion for Multi-Sample Speaker Verification over GSM Networks
- in Eurospeech’03, 2003
Cited by 4 (3 self)
Abstract:
In speaker verification, a claimant may produce two or more utterances. In our previous study [1], we proposed to compute the optimal weights for fusing the scores of these utterances based on their score distribution and our prior knowledge about the score statistics estimated from the mean scores of the corresponding client speaker and some pseudo-impostors during enrollment. As the fusion weights depend on the prior scores, in this paper, we propose to adapt the prior scores during verification based on the likelihood of the claimant being an impostor. To this end, a pseudo-impostor GMM score model is created for each speaker. During verification, the claimant's scores are fed to the score model to obtain a likelihood for adapting the prior score. Experimental results based on the GSM-transcoded speech of 150 speakers from the HTIMIT corpus demonstrate that the proposed prior score adaptation approach provides a relative error reduction of 15% when compared with our previous approach where the prior scores are non-adaptive.
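The prior-score adaptation idea can be sketched as follows: evaluate the claimant's scores under per-speaker score models and shift the client prior toward the impostor prior in proportion to how impostor-like the scores look. The single-Gaussian score models (the paper fits a GMM to pseudo-impostor scores), the interpolation rule, and all numbers here are illustrative assumptions:

```python
import numpy as np

def gauss_loglik(s, mean, var):
    """Average log-likelihood of scores s under a 1-D Gaussian score model."""
    s = np.asarray(s, float)
    return float(np.mean(-0.5 * ((s - mean)**2 / var + np.log(2 * np.pi * var))))

# Per-speaker (mean, var) of scores estimated during enrollment -- hypothetical.
client_prior, imp_prior = (2.0, 0.5), (0.0, 0.5)

def adapted_prior(claimant_scores):
    """Posterior-weighted interpolation between the client and impostor prior
    scores -- an illustrative stand-in for the paper's adaptation rule."""
    lc = gauss_loglik(claimant_scores, *client_prior)
    li = gauss_loglik(claimant_scores, *imp_prior)
    p_imp = 1.0 / (1.0 + np.exp(lc - li))   # likelihood ratio -> posterior
    return (1 - p_imp) * client_prior[0] + p_imp * imp_prior[0]

print(round(adapted_prior([1.9, 2.2, 2.0]), 3))   # scores resemble the client
print(round(adapted_prior([0.1, -0.2, 0.3]), 3))  # scores resemble an impostor
```

Client-like scores leave the prior near the client mean; impostor-like scores pull it toward the impostor mean, which in turn changes the fusion weights.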
Blind stochastic feature transformation for channel robust speaker verification
- J. of VLSI Signal Processing, 2006
Cited by 3 (2 self)
Abstract:
To improve the reliability of telephone-based speaker verification systems, channel compensation is indispensable. However, it is also important to ensure that the channel compensation algorithms in these systems suppress channel variations and enhance interspeaker distinction. This paper addresses this problem by a blind feature-based transformation approach in which the transformation parameters are determined online without any a priori knowledge of channel characteristics. Specifically, a composite statistical model formed by the fusion of a speaker model and a background model is used to represent the characteristics of enrollment speech. Based on the difference between the claimant’s speech and the composite model, a stochastic matching type of approach is proposed to transform the claimant’s speech to a region close to the enrollment speech. Therefore, the algorithm can estimate the transformation online without the necessity of detecting the handset types. Experimental results based on the 2001 NIST evaluation set show that the proposed transformation approach achieves significant improvement in both equal error rate and minimum detection cost as compared to cepstral mean subtraction, Znorm, and short-time Gaussianization.
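The stochastic-matching idea — estimate a feature transformation online by maximum likelihood against a composite model, with no knowledge of the channel — can be illustrated with the simplest case: a bias-only transform x → x + b fitted by EM under a two-component diagonal GMM. The paper estimates richer transformations; the model, data, and channel offset below are synthetic:

```python
import numpy as np

rng = np.random.default_rng(1)

# Composite model: fusion of speaker and background models (here, one pooled
# two-component diagonal-covariance GMM over clean features -- a simplification).
means = np.array([[0.0, 0.0], [3.0, 3.0]])
var = np.ones((2, 2))
weights = np.array([0.5, 0.5])

def estimate_bias(X, n_iter=10):
    """EM for a bias-only transform x -> x + b (maximum-likelihood stochastic
    matching).  M-step is the closed-form precision-weighted residual."""
    b = np.zeros(X.shape[1])
    for _ in range(n_iter):
        Y = X + b
        # E-step: component posteriors of the shifted frames
        logp = (-0.5 * np.sum((Y[:, None, :] - means)**2 / var
                              + np.log(2 * np.pi * var), axis=2) + np.log(weights))
        logp -= logp.max(axis=1, keepdims=True)
        gamma = np.exp(logp)
        gamma /= gamma.sum(axis=1, keepdims=True)
        # M-step: b = sum_tm gamma (mu - x) / var  /  sum_tm gamma / var
        num = (np.einsum('tm,md->d', gamma, means / var)
               - np.einsum('tm,td,md->d', gamma, X, 1.0 / var))
        den = np.einsum('tm,md->d', gamma, 1.0 / var)
        b = num / den
    return b

channel_offset = np.array([1.5, -0.8])   # unknown channel bias to be undone
clean = np.concatenate([rng.normal(0, 1, (100, 2)), rng.normal(3, 1, (100, 2))])
test_utt = clean + channel_offset
b_hat = estimate_bias(test_utt)
print(np.round(b_hat, 2))  # should roughly cancel the channel offset
```

The estimated bias moves the distorted utterance back toward the enrollment region without ever identifying the handset — the "blind" property the abstract emphasizes.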
Cluster-Dependent Feature Transformation for Telephone-Based Speaker Verification
- In: Proc. International Conference on Audio- and Video-Based Biometric Person Authentication (AVBPA’03), 2003
Cited by 2 (2 self)
Abstract:
This paper presents a cluster-based feature transformation technique for telephone-based speaker verification when labels of the handset types are not available during the training phase. The technique combines a cluster selector with cluster-dependent feature transformations to reduce the acoustic mismatches among different handsets. Specifically, a...
A New Approach to Channel Robust Speaker Verification via Constrained Stochastic Feature Transformation
- in Proc. ICSLP’04
Cited by 1 (1 self)
Abstract:
This paper proposes a constrained stochastic feature transformation algorithm for robust speaker verification. The algorithm computes the feature transformation parameters based on the statistical difference between a test utterance and a composite GMM formed by combining the speaker and background models. The transformation is then used to transform the test utterance to fit the clean speaker model and background model before verification. By implicitly constraining the transformation, the transformed features can fit both models simultaneously. Experimental results based on the 2001 NIST evaluation set show that the proposed algorithm achieves significant improvement in both equal error rate and minimum detection cost when compared to cepstral mean subtraction and Znorm. The performance of the proposed transformation approach is also slightly better than the short-time Gaussianization method proposed in [1].
Extraction of Speaker Features from Different Stages of DSR Front-ends for Distributed Speaker Verification
- 2004
Cited by 1 (0 self)
Abstract:
The ETSI has recently published a front-end processing standard for distributed speech recognition systems. The key idea of the standard is to extract the spectral features of speech signals at the front-end terminals so that acoustic distortion caused by communication channels can be avoided. This paper investigates the effect of extracting spectral features from different stages of the front-end processing on the performance of distributed speaker verification systems. A technique that combines handset selectors with stochastic feature transformation is also employed in a back-end speaker verification system to reduce the acoustic mismatch between different handsets. Because the feature vectors obtained from the back-end server are vector quantized, the paper proposes two approaches to adding Gaussian noise to the quantized feature vectors for training the Gaussian mixture speaker models. In one approach, the variances of the Gaussian noise are made dependent on the codeword distance. In another approach, the variances are a function of the distance between some unquantized training vectors and their closest code vector. The HTIMIT corpus was
Correspondence should be sent to M.W. Mak, Dept. of Electronic and Information Engineering, The Hong Kong Polytechnic University, Hong Kong. Email: enmwmak@polyu.edu.hk. Tel: (852)27666257. Fax: (852)23628439.
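The two noise-injection approaches the abstract names can be sketched side by side: one scales the Gaussian noise by each code vector's distance to its nearest neighbouring codeword, the other by the average quantization error of the vectors mapped to it. The toy codebook, the data, and the scale factor `alpha` are illustrative assumptions, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(2)

codebook = rng.normal(0, 1, size=(8, 4))   # toy 8-entry, 4-dim VQ codebook
train = rng.normal(0, 1, size=(500, 4))    # unquantized training vectors

# Quantize: map each vector to its nearest code vector.
d2 = ((train[:, None, :] - codebook[None, :, :])**2).sum(axis=2)  # (500, 8)
idx = d2.argmin(axis=1)
quantized = codebook[idx]

alpha = 0.1  # assumed scale factor for the noise variance

# Approach 1 (codeword distance): variance scaled by the squared distance
# from each code vector to its nearest neighbouring codeword.
cb_d2 = ((codebook[:, None, :] - codebook[None, :, :])**2).sum(axis=2)
nn_d = np.sort(cb_d2, axis=1)[:, 1]        # column 0 is the self-distance (0)
noisy1 = quantized + rng.normal(0, 1, quantized.shape) \
                     * np.sqrt(alpha * nn_d[idx])[:, None]

# Approach 2 (quantization error): variance from the average squared distance
# between the unquantized vectors and their chosen code vector.
err = np.zeros(len(codebook))
for k in range(len(codebook)):
    sel = idx == k
    if sel.any():
        err[k] = d2[sel, k].mean()
noisy2 = quantized + rng.normal(0, 1, quantized.shape) \
                     * np.sqrt(alpha * err[idx])[:, None]

print(noisy1.shape, noisy2.shape)
```

Either way, the jittered vectors restore some of the within-cluster spread that quantization removed, which is what makes them usable for training Gaussian mixture speaker models.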
Probabilistic Fusion of Sorted Score Sequences for Robust Speaker Verification
Abstract: Fusion techniques have been widely used in multi-modal biometric authentication systems. While these techniques are mainly applied to combine the outputs of modality-dependent classifiers, they can also be applied to fuse the decisions or scores from a single modality. The idea is to consider the multiple samples extracted from a single modality as independent but coming from the same source. In this chapter, we propose a single-source, multi-sample, data-dependent fusion algorithm for speaker verification. The algorithm is data-dependent in that the fusion weights depend on the verification scores and the prior score statistics of claimed speakers and background speakers. To get the best out of the speaker's scores, scores from multiple utterances are sorted before they are probabilistically combined. Evaluations based on 150 speakers from a GSM-transcoded corpus are presented. Results show that data-dependent fusion of the speaker's scores is significantly better than the conventional score-averaging approach, and that the fusion algorithm is further enhanced by sorting the score sequences before they are probabilistically combined.
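The sort-then-fuse step can be illustrated with rank-dependent weights: sort the multi-sample scores in descending order and let the weight decay with rank, so a claimant's best samples dominate the combination. The geometric decay is an illustrative stand-in for the chapter's probabilistic combination, and the scores are hypothetical:

```python
import numpy as np

def sorted_weighted_fusion(scores, decay=0.7):
    """Sort scores in descending order and combine with geometrically
    decaying rank weights -- a stand-in for the chapter's probabilistic
    combination of sorted score sequences."""
    s = np.sort(np.asarray(scores, float))[::-1]   # best scores first
    w = decay ** np.arange(len(s))                 # 1, decay, decay^2, ...
    return float(np.dot(w, s) / w.sum())

client = [2.1, 1.7, 0.4]   # one noisy sample drags the plain mean down
print(round(sorted_weighted_fusion(client), 3), round(float(np.mean(client)), 3))
```

Because sorting makes the weight depend on a score's rank rather than its arrival order, a single degraded utterance from a genuine speaker has less influence than under plain averaging.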