Results 1 - 10 of 11
An overview of text-independent speaker recognition: from features to supervectors
2009
"... This paper gives an overview of automatic speaker recognition technology, with an emphasis on text-independent recognition. Speaker recognition has been studied actively for several decades. We give an overview of both the classical and the state-of-the-art methods. We start with the fundamentals of ..."
Abstract - Cited by 156 (37 self)
This paper gives an overview of automatic speaker recognition technology, with an emphasis on text-independent recognition. Speaker recognition has been studied actively for several decades. We give an overview of both the classical and the state-of-the-art methods. We start with the fundamentals of automatic speaker recognition, concerning feature extraction and speaker modeling. We elaborate on advanced computational techniques to address robustness and session variability. The recent progress from vectors towards supervectors opens up a new area of exploration and represents a technology trend. We also provide an overview of this recent development and discuss the evaluation methodology of speaker recognition systems. We conclude the paper with a discussion of future directions.
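As an illustrative sketch (not taken from the paper), the classical GMM-UBM decision used in such text-independent systems is a log-likelihood ratio between a speaker model and a universal background model. The code below assumes feature frames (e.g. MFCCs) have already been extracted and that both models are diagonal-covariance GMMs passed as (weights, means, variances) tuples; all names are hypothetical.

import numpy as np

def gmm_loglik(X, weights, means, variances):
    """Average per-frame log-likelihood of frames X (T x D) under a
    diagonal-covariance GMM."""
    log_probs = []
    for w, mu, var in zip(weights, means, variances):
        diff = X - mu
        ll = -0.5 * (np.sum(diff ** 2 / var, axis=1)
                     + np.sum(np.log(2 * np.pi * var)))
        log_probs.append(np.log(w) + ll)
    log_probs = np.stack(log_probs, axis=1)              # T x K
    frame_ll = np.logaddexp.reduce(log_probs, axis=1)    # log-sum-exp over K
    return frame_ll.mean()

def verification_score(X, speaker_gmm, ubm):
    """Classical GMM-UBM log-likelihood ratio: accept if above a threshold."""
    return gmm_loglik(X, *speaker_gmm) - gmm_loglik(X, *ubm)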
Blind stochastic feature transformation for channel robust speaker verification
- J. OF VLSI SIGNAL PROCESSING, 2006
"... To improve the reliability of telephone-based speaker verification systems, channel com-pensation is indispensable. However, it is also important to ensure that the channel com-pensation algorithms in these systems surpress channel variations and enhance interspeaker distinction. This paper addresse ..."
Abstract - Cited by 3 (2 self)
To improve the reliability of telephone-based speaker verification systems, channel compensation is indispensable. However, it is also important to ensure that the channel compensation algorithms in these systems suppress channel variations and enhance interspeaker distinction. This paper addresses this problem with a blind feature-based transformation approach in which the transformation parameters are determined online without any a priori knowledge of channel characteristics. Specifically, a composite statistical model formed by the fusion of a speaker model and a background model is used to represent the characteristics of the enrollment speech. Based on the difference between the claimant's speech and the composite model, a stochastic-matching type of approach is proposed to transform the claimant's speech to a region close to the enrollment speech. Therefore, the algorithm can estimate the transformation online without the need to detect the handset type. Experimental results based on the 2001 NIST evaluation set show that the proposed transformation approach achieves significant improvement in both equal error rate and minimum detection cost as compared to cepstral mean subtraction, Znorm, and short-time Gaussianization.
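A rough sketch of the blind-estimation idea, under simplifying assumptions: the paper estimates a full stochastic feature transformation, whereas the code below estimates only a channel bias b by EM against the composite (speaker plus background) diagonal-covariance GMM, with no handset labels. Function and parameter names are illustrative.

import numpy as np

def posteriors(X, weights, means, variances):
    """Responsibilities of each composite-GMM component for each frame."""
    logs = []
    for w, mu, var in zip(weights, means, variances):
        d = X - mu
        ll = -0.5 * (np.sum(d * d / var, axis=1) + np.sum(np.log(2 * np.pi * var)))
        logs.append(np.log(w) + ll)
    logs = np.stack(logs, axis=1)
    logs -= logs.max(axis=1, keepdims=True)
    g = np.exp(logs)
    return g / g.sum(axis=1, keepdims=True)               # T x K

def blind_bias_transform(X, weights, means, variances, n_iter=5):
    """EM-style estimation of a channel bias b so that X + b fits the
    composite model; no prior knowledge of the channel is needed."""
    b = np.zeros(X.shape[1])
    for _ in range(n_iter):
        g = posteriors(X + b, weights, means, variances)   # E-step
        num = np.zeros_like(b)
        den = np.zeros_like(b)
        for k, (mu, var) in enumerate(zip(means, variances)):  # M-step
            num += (g[:, [k]] * (mu - X) / var).sum(axis=0)
            den += g[:, k].sum() / var
        b = num / den
    return b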
Unseen Handset Mismatch Compensation Based on A Priori Knowledge Interpolation for Robust Speaker Recognition
"... Unseen but mismatch handset is the major source of performance degradation for speaker recognition in telecommunication environment. In this paper, an unseen handset characteristics estimation method based on a priori knowledge interpolation (AKI) is proposed. AKI could be applied in both the featur ..."
Abstract - Cited by 3 (2 self)
An unseen, mismatched handset is the major source of performance degradation for speaker recognition in telecommunication environments. In this paper, a method for estimating the characteristics of unseen handsets based on a priori knowledge interpolation (AKI) is proposed. AKI can be applied in both the feature and the model space to interpolate the feature and model transformation functions measured using stochastic matching (SM) and maximum likelihood linear regression (MLLR), respectively. Cross-validation experiments on the HTIMIT database showed that the average speaker recognition rate could be improved from 59.6%/57.8% to 73.8%/66.8% for seen/unseen handsets. It is therefore a promising method for robust speaker recognition.
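A minimal sketch of the interpolation idea, not the paper's exact formulation: given feature-transformation parameters pre-computed for each seen handset and a match score of the test utterance against each handset, a transform for an unseen handset can be synthesised as a weighted combination of the stored ones. The softmax weighting below is an assumption for illustration, and all names are hypothetical.

import numpy as np

def interpolate_transform(handset_scores, handset_biases):
    """Score-weighted interpolation of per-handset feature biases,
    producing a transform for an unseen handset."""
    scores = np.asarray(handset_scores, dtype=float)
    w = np.exp(scores - scores.max())
    w /= w.sum()                                  # softmax over seen handsets
    return w @ np.asarray(handset_biases)         # interpolated bias vector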
A New Approach to Channel Robust Speaker Verification via Constrained Stochastic Feature Transformation
- in Proc. ICSLP’04
"... This paper proposes a constrained stochastic feature transformation algorithm for robust speaker verification. The algorithm computes the feature transformation parameters based on the statistical difference between a test utterance and a composite GMM formed by combining the speaker and background ..."
Abstract - Cited by 1 (1 self)
This paper proposes a constrained stochastic feature transformation algorithm for robust speaker verification. The algorithm computes the feature transformation parameters based on the statistical difference between a test utterance and a composite GMM formed by combining the speaker and background models. The transformation is then used to transform the test utterance to fit the clean speaker model and background model before verification. By implicitly constraining the transformation, the transformed features can fit both models simultaneously. Experimental results based on the 2001 NIST evaluation set show that the proposed algorithm achieves significant improvement in both equal error rate and minimum detection cost when compared to cepstral mean subtraction and Z-norm. The performance of the proposed transformation approach is also slightly better than that of the short-time Gaussianization method proposed in [1].
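A schematic sketch of the pipeline described above, with hypothetical names and stand-in scorers: a single transform, estimated on the composite model and therefore implicitly constrained, is applied to the test features before the usual speaker-versus-background log-likelihood ratio is computed.

import numpy as np

def llr_after_transform(X, transform, score_speaker, score_background):
    """Apply one feature transform, then verify with the log-likelihood
    ratio; the same transformed features feed both models."""
    Y = transform(X)                              # e.g. Y = a * X + b
    return score_speaker(Y) - score_background(Y)

# usage sketch: placeholders where real GMM log-likelihood scorers would go
a, b = np.ones(20), np.zeros(20)
score = llr_after_transform(
    np.random.randn(300, 20),
    lambda X: a * X + b,
    score_speaker=lambda Y: float(-np.mean(Y ** 2)),
    score_background=lambda Y: float(-np.mean((Y - 0.1) ** 2)),
)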
Extraction of Speaker Features from Different Stages of DSR Front-ends for Distributed Speaker Verification
2004
"... The ETSI has recently published a front-end processing standard for distributed speech recognition systems. The key idea of the standard is to extract the spectral features of speech signals at the front-end terminals so that acoustic distortion caused by communication channels can be avoided. Th ..."
Abstract - Cited by 1 (0 self)
The ETSI has recently published a front-end processing standard for distributed speech recognition systems. The key idea of the standard is to extract the spectral features of speech signals at the front-end terminals so that acoustic distortion caused by communication channels can be avoided. This paper investigates the effect of extracting spectral features from different stages of the front-end processing on the performance of distributed speaker verification systems. A technique that combines handset selectors with stochastic feature transformation is also employed in a back-end speaker verification system to reduce the acoustic mismatch between different handsets. Because the feature vectors obtained from the back-end server are vector quantized, the paper proposes two approaches to adding Gaussian noise to the quantized feature vectors for training the Gaussian mixture speaker models. In one approach, the variances of the Gaussian noise are made dependent on the codeword distance. In another approach, the variances are a function of the distance between some unquantized training vectors and their closest code vector. The HTIMIT corpus was ...
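A sketch of the first noise-addition approach under an assumed variance rule (half the distance to the nearest other codeword); the paper's exact variance laws may differ, and the names below are illustrative.

import numpy as np

def dequantize_with_noise(codes, codebook, rng=None):
    """Turn vector-quantised feature indices back into training vectors by
    adding zero-mean Gaussian noise whose scale depends on the codeword's
    distance to its nearest neighbour in the codebook."""
    if rng is None:
        rng = np.random.default_rng(0)
    codebook = np.asarray(codebook, dtype=float)
    # pairwise codeword distances, ignoring self-distances
    d = np.linalg.norm(codebook[:, None, :] - codebook[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    sigma = 0.5 * d.min(axis=1)                  # one noise scale per codeword
    X = codebook[codes]
    return X + rng.normal(size=X.shape) * sigma[codes, None]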
Channel Robust Speaker Verification via Bayesian Blind Stochastic Feature Transformation
"... In telephone-based speaker verification, the channel conditions can be varied significantly from sessions to sessions. Therefore, it is desirable to estimate the channel conditions online and compensate the acoustic distortion without prior knowledge of the channel characteristics. Because no a prio ..."
Abstract
In telephone-based speaker verification, channel conditions can vary significantly from session to session. Therefore, it is desirable to estimate the channel conditions online and compensate for the acoustic distortion without prior knowledge of the channel characteristics. Because no a priori knowledge is used, the estimation accuracy depends greatly on the length of the verification utterances. This paper extends the Blind Stochastic Feature Transformation (BSFT) algorithm that we recently proposed to handle the short-utterance scenario. The idea is to estimate a set of prior transformation parameters from a development set in which the verification utterances cover a wide variety of channel conditions. The prior transformations are then incorporated into the online estimation of the BSFT parameters in a Bayesian (maximum a posteriori) fashion. The resulting transformation parameters therefore depend on both the prior transformations and the verification utterances. For short (long) utterances, the prior transformations play a more (less) important role. We refer to the extended algorithm as Bayesian BSFT (BBSFT) and apply it to the 2001 NIST SRE task. Results show that Bayesian BSFT outperforms BSFT for utterances shorter than or equal to 4 seconds.
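A minimal sketch of the prior-versus-online trade-off, assuming a bias-only transform and a MAP-style relevance factor tau; both are assumptions for illustration, since the paper's formulation covers the full BSFT parameter set.

import numpy as np

def map_bias(prior_bias, ml_bias, n_frames, tau=200.0):
    """MAP-style smoothing of a blindly estimated channel bias towards a
    prior bias learned on a development set: short utterances lean on the
    prior, long ones on the online estimate (tau is a hypothetical
    relevance factor)."""
    alpha = n_frames / (n_frames + tau)
    return alpha * np.asarray(ml_bias) + (1.0 - alpha) * np.asarray(prior_bias)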
Cluster-Dependent Feature Transformation with Divergence-Based Out-of-Handset Rejection for . . .
2003
"... This paper proposes a divergence-based cluster selector with out-of-handset (OOH) rejection capability to identify the `unseen' handsets. This is achieved by measuring the Jensen di#erence between the selector's output and a constant vector with identical elements. The resulting cluster se ..."
Abstract
This paper proposes a divergence-based cluster selector with out-of-handset (OOH) rejection capability to identify the 'unseen' handsets. This is achieved by measuring the Jensen difference between the selector's output and a constant vector with identical elements. The resulting cluster selector is combined with a feature-based channel compensation algorithm for telephone-based speaker verification. Utterances whose handsets are identified as 'unseen' are normalized by cepstral mean subtraction (CMS). On the other hand, if the handset can be identified (considered 'seen'), a corresponding set of cluster-dependent transformation parameters is used to transform the utterances. Experiments based on ten handsets of the HTIMIT corpus show that using the cluster-dependent transformation parameters to transform the utterances with correctly identified handsets, and processing those utterances with 'unseen' handsets by CMS, achieves the best result.
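The out-of-handset decision can be sketched as follows, assuming the cluster selector outputs a posterior vector over the seen handsets: the Jensen difference against the uniform (constant) vector is small when the selector is undecided, which signals an unseen handset. The threshold is a tunable left unspecified here, and the function names are hypothetical.

import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p / p.sum()
    return -np.sum(p * np.log(p + 1e-12))

def jensen_difference(p, q):
    """Jensen difference H((p+q)/2) - (H(p)+H(q))/2; zero iff p == q."""
    return entropy(0.5 * (np.asarray(p) + np.asarray(q))) \
        - 0.5 * (entropy(p) + entropy(q))

def is_unseen_handset(selector_posteriors, threshold):
    """Flag an utterance as 'unseen' when the selector's posterior vector
    is too close to the uniform vector."""
    k = len(selector_posteriors)
    uniform = np.full(k, 1.0 / k)
    return jensen_difference(selector_posteriors, uniform) < threshold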
Blind Stochastic Feature Transformation for Speaker Verification over Cellular Networks
"... Acoustic mismatch between the training and recognition conditions presents one of the serious challenges faced by speaker recognition researchers today. The goal of channel compensation is to achieve performance approaching that of a "matched condition" system while avoiding the need for a ..."
Abstract
Acoustic mismatch between the training and recognition conditions presents one of the most serious challenges faced by speaker recognition researchers today. The goal of channel compensation is to achieve performance approaching that of a "matched condition" system while avoiding the need for a large amount of training data. It is important to ensure that the channel compensation algorithms in these systems compensate for channel variation rather than speaker variation. This paper addresses the problem of unsupervised compensation, in which the features of a test utterance are transformed to fit the clean speaker model and a gender-dependent background model. Specifically, a feature-based transformation is estimated based on the statistical difference between a test utterance and a composite acoustic model formed by combining the speaker and background models. By transforming the features to fit both models, the transformation is implicitly constrained. Experimental results based on the 2001 NIST evaluation set show that the proposed transformation approach achieves significant improvement in both equal error rate and minimum detection cost as compared to cepstral mean subtraction, Znorm, and short-time Gaussianization.
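For reference, the two classical baselines the paper compares against can be sketched in a few lines (short-time Gaussianization is omitted); these are textbook formulations, not the paper's code.

import numpy as np

def cepstral_mean_subtraction(X):
    """Remove the per-utterance cepstral mean, a simple fix for
    convolutive channel effects."""
    return X - X.mean(axis=0, keepdims=True)

def znorm(raw_score, impostor_scores):
    """Z-norm: normalise a verification score by the mean and standard
    deviation of impostor scores against the same speaker model."""
    mu, sigma = np.mean(impostor_scores), np.std(impostor_scores)
    return (raw_score - mu) / sigma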