Support vector machines using GMM supervectors for speaker verification
- IEEE Signal Processing Letters
, 2006
Cited by 58 (1 self)
... interpretations, conclusions, and recommendations are those of the authors and are not necessarily endorsed by the United States ...
SVM based speaker verification using a GMM supervector kernel and NAP variability compensation
- in Proceedings of ICASSP, 2006
Cited by 53 (3 self)
Gaussian mixture models with universal backgrounds (UBMs) have become the standard method for speaker recognition. Typically, a speaker model is constructed by MAP adaptation of the means of the UBM. A GMM supervector is constructed by stacking the means of the adapted mixture components. A recent discovery is that latent factor analysis of this GMM supervector is an effective method for variability compensation. We consider this GMM supervector in the context of support vector machines. We construct a support vector machine kernel using the GMM supervector. We show similarities based on this kernel between the method of SVM nuisance attribute projection (NAP) and the recent results in latent factor analysis. Experiments on a NIST SRE 2005 corpus demonstrate the effectiveness of the new technique.
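The supervector construction described in this abstract (MAP adaptation of the UBM means, then stacking) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the diagonal-covariance UBM, the relevance factor of 16, and the random toy data are all assumptions made for the example.

```python
import numpy as np

def map_adapt_means(frames, weights, means, variances, relevance=16.0):
    """Relevance-MAP adaptation of UBM means (diagonal covariances assumed).

    frames: (T, D) feature vectors; weights: (C,); means, variances: (C, D).
    The relevance factor is an illustrative default, not taken from the paper.
    """
    # Log-density of every frame under every component, shape (T, C).
    diff = frames[:, None, :] - means[None, :, :]
    log_gauss = -0.5 * (np.sum(diff ** 2 / variances, axis=2)
                        + np.sum(np.log(2.0 * np.pi * variances), axis=1))
    log_post = np.log(weights) + log_gauss
    log_post -= log_post.max(axis=1, keepdims=True)   # numerical stability
    post = np.exp(log_post)
    post /= post.sum(axis=1, keepdims=True)           # responsibilities

    n = post.sum(axis=0)                               # soft counts, (C,)
    first = post.T @ frames                            # first-order stats, (C, D)
    data_mean = first / np.maximum(n, 1e-10)[:, None]  # per-component data mean
    alpha = (n / (n + relevance))[:, None]             # adaptation coefficient
    return alpha * data_mean + (1.0 - alpha) * means

def supervector(adapted_means):
    """Stack the C adapted mean vectors into one vector of length C * D."""
    return adapted_means.reshape(-1)

# Toy data standing in for a trained UBM and one utterance's features.
rng = np.random.default_rng(0)
C, D, T = 4, 3, 200
ubm_weights = np.full(C, 1.0 / C)
ubm_means = rng.normal(size=(C, D))
ubm_vars = np.ones((C, D))
frames = rng.normal(size=(T, D))

sv = supervector(map_adapt_means(frames, ubm_weights, ubm_means, ubm_vars))
print(sv.shape)  # (12,)
```

Components that receive little data stay close to the UBM means, which is why supervectors from different utterances remain comparable coordinate by coordinate.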
An overview of text-independent speaker recognition: from features to supervectors
, 2009
Cited by 31 (14 self)
This paper gives an overview of automatic speaker recognition technology, with an emphasis on text-independent recognition. Speaker recognition has been studied actively for several decades. We give an overview of both the classical and the state-of-the-art methods. We start with the fundamentals of automatic speaker recognition, concerning feature extraction and speaker modeling. We elaborate advanced computational techniques to address robustness and session variability. The recent progress from vectors towards supervectors opens up a new area of exploration and represents a technology trend. We also provide an overview of this recent development and discuss the evaluation methodology of speaker recognition systems. We conclude the paper with discussion on future directions.
Fusion of heterogeneous speaker recognition systems in the STBU submission for the NIST speaker recognition evaluation 2006
- IEEE Transactions on Audio, Speech, and Language Processing
, 2007
Cited by 17 (4 self)
This paper describes and discusses the ‘STBU’ speaker recognition system, which performed well in the NIST Speaker Recognition Evaluation 2006 (SRE). STBU is a consortium ...
Within-class Covariance Normalization for SVM-based Speaker Recognition
- Proc. of ICSLP
, 2006
Cited by 16 (2 self)
This paper extends the within-class covariance normalization (WCCN) technique described in [1, 2] for training generalized linear kernels. We describe a practical procedure for applying WCCN to an SVM-based speaker recognition system where the input feature vectors reside in a high-dimensional space. Our approach involves using principal component analysis (PCA) to split the original feature space into two subspaces: a low-dimensional “PCA space” and a high-dimensional “PCA-complement space.” After performing WCCN in the PCA space, we concatenate the resulting feature vectors with a weighted version of their PCA-complements. When applied to a state-of-the-art MLLR-SVM speaker recognition system, this approach achieves improvements of up to 22% in EER and 28% in minimum decision cost function (DCF) over our previous baseline. We also achieve substantial improvements over an MLLR-SVM system that performs WCCN in the PCA space but discards the PCA-complement. Index Terms: kernel machines, support vector machines, feature normalization, generalized linear kernels, speaker recognition.
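The procedure in this abstract (PCA split, WCCN in the PCA space, weighted PCA-complement appended) might look roughly like the sketch below. The function names, the ridge term, the complement weight `beta`, and the toy data are assumptions for illustration; the paper's exact estimation details are not given in the abstract.

```python
import numpy as np

def within_class_cov(z, labels):
    """Within-class covariance of the rows of z, pooled over classes."""
    dim = z.shape[1]
    w = np.zeros((dim, dim))
    for spk in np.unique(labels):
        zc = z[labels == spk]
        zc = zc - zc.mean(axis=0)
        w += zc.T @ zc
    return w / len(z)

def fit_wccn_pca(x, labels, k, ridge=1e-6):
    """Fit the PCA basis (k directions) and the WCCN transform in that space."""
    mean = x.mean(axis=0)
    # Rows of vt are principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(x - mean, full_matrices=False)
    u = vt[:k].T                                   # (d, k) PCA basis
    w = within_class_cov((x - mean) @ u, labels)   # (k, k)
    # Cholesky factor A with A A^T = W^{-1}; ridge is a hypothetical regularizer.
    a = np.linalg.cholesky(np.linalg.inv(w + ridge * np.eye(k)))
    return mean, u, a

def transform(x, mean, u, a, beta):
    """Concatenate WCCN-normalized PCA coordinates with the weighted complement."""
    xc = x - mean
    z = xc @ u                    # coordinates in the PCA space
    complement = xc - z @ u.T     # residual outside the PCA space (kept in d dims)
    return np.hstack([z @ a, beta * complement])

# Toy data: 5 speakers, 20 vectors each, 10 dimensions.
rng = np.random.default_rng(0)
labels = np.repeat(np.arange(5), 20)
x = 3.0 * rng.normal(size=(5, 10))[labels] + rng.normal(size=(100, 10))

mean, u, a = fit_wccn_pca(x, labels, k=3)
features = transform(x, mean, u, a, beta=0.5)
print(features.shape)  # (100, 13)
```

After the transform, the within-class covariance of the first `k` coordinates is approximately the identity, while `beta` controls how much the discarded-by-PCA directions still contribute to a linear kernel.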
Higher-Level Features in Speaker Recognition
- in Speaker Classification I, Lecture Notes in Computer Science / Artificial Intelligence. Springer, Heidelberg / Berlin
, 2007
Cited by 15 (4 self)
Higher-level features based on linguistic or long-range information have attracted significant attention in automatic speaker recognition. This article briefly summarizes approaches to using higher-level features for text-independent speaker verification over the last decade. To clarify how each approach uses higher-level information, features are described in terms of their type, temporal span, and reliance on automatic speech recognition for both feature extraction and feature conditioning. A subsequent analysis of higher-level features in a state-of-the-art system illustrates that (1) a higher-level cepstral system outperforms standard systems, (2) a prosodic system shows excellent performance individually and in combination, (3) other higher-level systems provide further gains, and (4) higher-level systems provide increasing relative gains as training data increases. Implications for the general field of speaker classification are discussed.
Speaker recognition with session variability normalization based on MLLR adaptation transforms
- IEEE TRANS. AUDIO
, 2007
Cited by 15 (4 self)
We present a new modeling approach for speaker recognition that uses the maximum-likelihood linear regression (MLLR) adaptation transforms employed by a speech recognition system as features for support vector machine (SVM) speaker models. This approach is attractive because, unlike standard frame-based cepstral speaker recognition models, it normalizes for the choice of spoken words in text-independent speaker verification without data fragmentation. We discuss the basics of the MLLR-SVM approach, and show how it can be enhanced by combining transforms relative to multiple reference models, with excellent results on recent English NIST evaluation sets. We then show how the approach can be applied even if no full word-level recognition system is available, which allows its use on non-English data even without matching speech recognizers. Finally, we examine how two recently proposed algorithms for intersession variability compensation perform in conjunction with MLLR-SVM.
A comparison of session variability compensation techniques for SVM-based speaker recognition
- in Proc. Interspeech, 2007
Cited by 9 (8 self)
This paper compares two of the leading techniques for session variability compensation in the context of GMM mean supervector SVM classifiers for speaker recognition: inter-session variability modelling and nuisance attribute projection. The former is incorporated in the GMM model training while the latter is employed as a modified SVM kernel. Results on both the NIST 2005 and 2006 corpora demonstrate the effectiveness of both techniques for reducing the effects of session variation. Further, system- and score-level fusion experiments show that the combination of the two methods provides improved performance.
Emulating DNA: Rigorous quantification of evidential weight in transparent and testable forensic speaker recognition
- IEEE Transactions on Audio, Speech and Language Processing
, 2007
Cited by 6 (4 self)
Forensic DNA profiling is acknowledged as the model for a scientifically defensible approach in forensic identification science, as it meets the most stringent court admissibility requirements demanding transparency in scientific evaluation of evidence and testability of systems and protocols. In this paper, we propose a unified approach to forensic speaker recognition (FSR) oriented to fulfil these admissibility requirements within a framework which is transparent, testable, and understandable, both for scientists and fact-finders. We show how the evaluation of DNA evidence, which is based on a probabilistic similarity-typicality metric in the form of likelihood ratios (LR), can also be generalized to continuous LR estimation, thus providing a common framework for phonetic–linguistic methods and automatic systems. We highlight the importance of calibration, and we exemplify with LRs from diphthongal F-pattern, and LRs in NIST-SRE06 tasks. The application of the proposed approach in daily casework remains a sensitive issue, and special caution is enjoined. Our objective is to show how traditional and automatic FSR methodologies can be transparent and testable, but simultaneously remain conscious of the present limitations. We conclude with a discussion on the combined use of traditional and automatic approaches and current challenges for the admissibility of speech evidence. Index Terms: admissibility of speech evidence, calibration, Daubert, deoxyribonucleic acid (DNA), forensic speaker recognition (FSR), likelihood ratio (LR).
NAP AND WCCN: COMPARISON OF APPROACHES USING MLLR-SVM SPEAKER VERIFICATION SYSTEM
Cited by 3 (0 self)
We compare two recently proposed techniques, within-class covariance normalization (WCCN) [1] and nuisance attribute projection (NAP) [2], for intersession variability compensation in speaker verification. The comparison is performed using an MLLR-SVM speaker verification system. Both techniques model intersession variability using a within-speaker covariance matrix (WSCM). However, they manipulate the eigenvectors of this matrix differently. We compare them on the 2005 and 2006 NIST speaker recognition evaluation (SRE) tasks. Results show that WCCN is more sensitive to the choice of background speakers and NAP is more sensitive to the choice of data for WSCM estimation. WCCN gives the best performance on 2005 SRE. On 2006 SRE, both techniques give similar performance under matched conditions. Further experiments with a simple combination of these techniques show slight improvements over the best performance of either technique. Overall results show that an MLLR-SVM system with either NAP or WCCN performs comparably to the best single systems in the 2006 NIST SRE. Index Terms: speaker recognition, intersession variability, MLLR transforms, SVM.
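The contrast drawn in this abstract, that both methods start from the within-speaker covariance matrix but treat its eigenvectors differently, can be illustrated with a short sketch. It assumes the common formulations: NAP projects out the `k` leading eigen-directions, while WCCN keeps every direction and rescales each by the inverse square root of its eigenvalue. The function names, the ridge term, and the toy data are hypothetical and not taken from the paper.

```python
import numpy as np

def within_speaker_cov(x, labels):
    """Within-speaker covariance matrix (WSCM), pooled over speakers."""
    d = x.shape[1]
    w = np.zeros((d, d))
    for spk in np.unique(labels):
        xs = x[labels == spk]
        xs = xs - xs.mean(axis=0)
        w += xs.T @ xs
    return w / len(x)

def nap_projection(w, k):
    """NAP: P = I - V_k V_k^T removes the k leading eigen-directions of W."""
    _, vecs = np.linalg.eigh(w)        # eigenvalues in ascending order
    v_k = vecs[:, -k:]                 # k directions with the largest variance
    return np.eye(w.shape[0]) - v_k @ v_k.T

def wccn_transform(w, ridge=1e-8):
    """WCCN: B = V diag(lambda^-1/2), so that B B^T approximates W^{-1}."""
    vals, vecs = np.linalg.eigh(w)
    return vecs / np.sqrt(np.maximum(vals, 0.0) + ridge)

# Toy data: 6 speakers, 15 sessions each, 8 dimensions with unequal noise.
rng = np.random.default_rng(0)
labels = np.repeat(np.arange(6), 15)
noise_scale = np.linspace(0.5, 2.0, 8)
x = rng.normal(size=(6, 8))[labels] + rng.normal(size=(90, 8)) * noise_scale

w = within_speaker_cov(x, labels)
p = nap_projection(w, k=2)
b = wccn_transform(w)
x_nap = x @ p      # high-variance session directions removed entirely
x_wccn = x @ b     # all directions kept, within-speaker covariance whitened
```

NAP makes a hard decision about which directions are nuisance, whereas WCCN down-weights directions in proportion to their within-speaker variance, which is one plausible reading of why the two respond differently to the data used to estimate the matrix.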

