Results 1-10 of 74
Front End Factor Analysis for Speaker Verification
 IEEE Transactions on Audio, Speech and Language Processing
, 2010
Cited by 104 (14 self)
Abstract—This paper presents an extension of our previous work which proposes a new speaker representation for speaker verification. In this modeling, a new low-dimensional speaker- and channel-dependent space is defined using a simple factor analysis. This space is named the total variability space because it models both speaker and channel variabilities. Two speaker verification systems are proposed which use this new representation. The first system is a support vector machine-based system that uses the cosine kernel to estimate the similarity between the input data. The second system directly uses the cosine similarity as the final decision score. We tested three channel compensation techniques in the total variability space: within-class covariance normalization (WCCN), linear discriminant analysis (LDA), and nuisance attribute projection (NAP). We found that the best results are obtained when LDA is followed by WCCN. We achieved an equal error rate (EER) of 1.12% and a MinDCF of 0.0094 using cosine distance scoring on the male English trials of the core condition of the NIST 2008 Speaker Recognition Evaluation dataset. We also obtained a 4% absolute EER improvement for both-gender trials on the 10 s-10 s condition compared to classical joint factor analysis scoring. Index Terms—Cosine distance scoring, joint factor analysis (JFA), support vector machines (SVMs), total variability space.
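The second system described in this abstract scores a trial directly with the cosine similarity between compensated i-vectors. A minimal sketch of that idea follows; the function names are illustrative, and folding LDA-followed-by-WCCN into a single projection matrix `P` is an assumption for compactness, not the paper's exact pipeline:

```python
import numpy as np

def cosine_score(w_enroll, w_test, P=None):
    """Cosine distance scoring between two i-vectors, optionally after a
    linear channel-compensation projection P (e.g. LDA then WCCN folded
    into one matrix). Shapes and names are illustrative."""
    if P is not None:
        w_enroll, w_test = P @ w_enroll, P @ w_test
    return float(w_enroll @ w_test /
                 (np.linalg.norm(w_enroll) * np.linalg.norm(w_test)))
```

Because the score is just an angle between fixed-length vectors, no per-trial model training is needed at test time, which is what makes this scoring so cheap compared to full JFA scoring.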
An overview of text-independent speaker recognition: from features to supervectors
, 2009
Cited by 64 (24 self)
This paper gives an overview of automatic speaker recognition technology, with an emphasis on text-independent recognition. Speaker recognition has been studied actively for several decades. We give an overview of both the classical and the state-of-the-art methods. We start with the fundamentals of automatic speaker recognition, concerning feature extraction and speaker modeling. We then elaborate on advanced computational techniques that address robustness and session variability. The recent progress from vectors towards supervectors opens up a new area of exploration and represents a technology trend. We also provide an overview of this recent development and discuss the evaluation methodology of speaker recognition systems. We conclude the paper with a discussion of future directions.
Cosine Similarity Scoring without Score Normalization Techniques
Cited by 15 (2 self)
In recent work [1], a simplified and highly effective approach to speaker recognition based on the cosine similarity between low-dimensional vectors, termed i-vectors, defined in a total variability space was introduced. The total variability space representation is motivated by the popular Joint Factor Analysis (JFA) approach, but does not require the complication of estimating separate speaker and channel spaces and has been shown to be less dependent on score normalization procedures, such as z-norm and t-norm. In this paper, we introduce a modification to the cosine similarity that does not require explicit score normalization, relying instead on simple mean and covariance statistics from a collection of impostor speaker i-vectors. By avoiding the complication of z- and t-norm, the new approach further allows for the application of a new unsupervised speaker adaptation technique to models defined in the i-vector space. Experiments are conducted on the core condition of the NIST 2008 corpora, where, with adaptation, the new approach produces an equal error rate (EER) of 4.8% and a minimum decision cost function (MinDCF) of 2.3% on all female speaker trials.
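One plausible reading of "mean and covariance statistics from a collection of impostor speaker i-vectors" is to center and whiten both i-vectors with those impostor statistics before taking the cosine. The sketch below implements that reading; it is an assumption about the exact form, not the paper's formula:

```python
import numpy as np

def normalized_cosine(w1, w2, imp_mean, imp_cov):
    """Cosine similarity after centering and whitening both i-vectors
    with impostor-cohort statistics (a sketch of the idea, assuming a
    full-rank impostor covariance)."""
    # Whitening transform derived from the impostor covariance.
    L = np.linalg.cholesky(np.linalg.inv(imp_cov))
    a = L.T @ (w1 - imp_mean)
    b = L.T @ (w2 - imp_mean)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```

With zero impostor mean and identity covariance this reduces to the plain cosine, which is a useful sanity check on the normalization.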
An i-vector Extractor Suitable for Speaker Recognition with both Microphone and Telephone Speech
Cited by 13 (3 self)
It is widely believed that speaker verification systems perform better when there is sufficient background training data to deal with the nuisance effects of transmission channels. It is also known that these systems perform at their best when the sound environment of the training data is similar to that of the context of use (test context). For some applications, however, training data from the same type of sound environment is scarce, whereas a considerable amount of data from a different type of environment is available. In this paper, we propose a new architecture for text-independent speaker verification systems that can be satisfactorily trained with a limited amount of application-specific data, supplemented with a sufficient amount of training data from some other context. This architecture is based on the extraction of parameters (i-vectors) from a low-dimensional space (total variability space) proposed by Dehak [1]. Our aim is to extend Dehak's work to speaker recognition on sparse data, namely microphone speech. The main challenge is to overcome the fact that insufficient application-specific data is available to accurately estimate the total variability covariance matrix. We propose a method based on Joint Factor Analysis (JFA) to estimate microphone eigenchannels (sparse data) together with telephone eigenchannels (sufficient data). For classification, we experimented with two approaches: Support Vector Machines (SVM) and a Cosine Distance Scoring (CDS) classifier. We present recognition results for the female portion of the interview data of the NIST 2008 SRE. The best performance is obtained when our system is fused with the state-of-the-art JFA; we achieve a 13% relative improvement in equal error rate, and the minimum value of the detection cost function decreases from 0.0219 to 0.0164.
SUBSPACE GAUSSIAN MIXTURE MODELS FOR SPEECH RECOGNITION
Cited by 8 (4 self)
This technical report contains the details of an acoustic modeling approach based on subspace adaptation of a shared Gaussian Mixture Model. This refers to adaptation to a particular speech state; it is not a speaker adaptation technique, although we do later introduce a speaker adaptation technique that is tied to this particular framework. Our model is a large shared GMM whose parameters vary in a subspace of relatively low dimension (e.g. 50); thus each state is described by a low-dimensional vector which controls the GMM's means and mixture weights in a manner determined by globally shared parameters. In addition, we generalize to having each speech state be a mixture of substates, each with a different vector. Only the mathematical details are provided here; experimental results are being published separately.
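The core mechanism in this abstract, a low-dimensional state vector controlling the means of a shared GMM through globally shared projections, can be sketched as follows. All shapes and names here are illustrative (and mixture-weight control via a softmax, also part of the model, is omitted for brevity):

```python
import numpy as np

def state_means(M0, V, v_state):
    """Subspace GMM sketch: the means of all I Gaussians for one speech
    state are a shared base M0 (I x D) plus shared projections V
    (I x D x S) applied to that state's low-dimensional vector v_state
    of length S. Only M0 and V are shared globally; each state (or
    substate) contributes just its own small vector v_state."""
    return M0 + V @ v_state   # matmul broadcasts: (I, D, S) @ (S,) -> (I, D)
```

The parameter saving is the point: per state you store an S-dimensional vector (e.g. S = 50) instead of I x D mean parameters.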
Speaker comparison with inner product discriminant functions
 in Advances in NIPS
Cited by 7 (4 self)
Speaker comparison, the process of finding the speaker similarity between two speech signals, occupies a central role in a variety of applications—speaker verification, clustering, and identification. Speaker comparison can be placed in a geometric framework by casting the problem as a model comparison process. For a given speech signal, feature vectors are produced and used to adapt a Gaussian mixture model (GMM). Speaker comparison can then be viewed as the process of compensating and finding metrics on the space of adapted models. We propose a framework, inner product discriminant functions (IPDFs), which extends many common techniques for speaker comparison—support vector machines, joint factor analysis, and linear scoring. The framework uses inner products between the parameter vectors of GMM models motivated by several statistical methods. Compensation of nuisances is performed via linear transforms on GMM parameter vectors. Using the IPDF framework, we show that many current techniques are simple variations of each other. We demonstrate, on a 2006 NIST speaker recognition evaluation task, new scoring methods using IPDFs which produce excellent error rates and require significantly less computation than current techniques.
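The unifying object in this abstract is a weighted inner product between GMM parameter vectors, with linear nuisance compensation folded into the weighting. A bare-bones sketch of that scoring form (the matrix `D` and its factorization are illustrative assumptions, not the paper's notation):

```python
import numpy as np

def ipdf_score(a, b, D):
    """Inner product discriminant function sketch: a weighted inner
    product between two GMM parameter (supervector) representations a
    and b. Nuisance compensation and the choice of metric are folded
    into the positive semi-definite matrix D (illustrative)."""
    return float(a @ (D @ b))
```

Different choices of `D` recover different systems (e.g. an identity-like metric for linear scoring, or a projection that removes channel directions), which is how the framework exposes existing techniques as variations of one another.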
Unsupervised Speaker Adaptation based on the Cosine Similarity for Text-Independent Speaker Verification
Cited by 6 (4 self)
This paper proposes a new approach to unsupervised speaker adaptation inspired by the recent success of the factor-analysis-based Total Variability Approach to text-independent speaker verification [1]. This approach effectively represents speaker variability in terms of low-dimensional total factor vectors and, when paired with the simplicity of cosine similarity scoring, allows for easy manipulation and efficient computation [2]. The development of our adaptation algorithm is motivated by the desire to have a robust method of setting an adaptation threshold, to minimize the amount of computation required for each adaptation update, and to simplify the associated score normalization procedures where possible. To address the final issue, we propose the Symmetric Normalization (S-norm) method, which takes advantage of the symmetry in cosine similarity scoring and achieves performance competitive with that of the ZT-norm while requiring fewer parameter calculations. In our subsequent experiments, we also assess an attempt to completely replace score normalization procedures with a Normalized Cosine Similarity scoring function [3]. We evaluated the performance of our unsupervised speaker adaptation algorithm under various score normalization procedures on the 10sec-10sec and core conditions of the 2008 NIST SRE dataset. Using no-adaptation results as our baseline, we found that the proposed methods consistently improve speaker verification performance and achieve state-of-the-art results.
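The S-norm idea exploits the fact that the cosine score is symmetric in the two i-vectors, so one can normalize against an impostor cohort on each side and average. A minimal sketch, assuming the cohort scores are precomputed cosine scores supplied as arrays:

```python
import numpy as np

def s_norm(raw, cohort_enroll, cohort_test):
    """Symmetric score normalization sketch: z-normalize the raw trial
    score against each side's impostor-cohort scores, then average the
    two. The exact cohort construction is an assumption here."""
    z_e = (raw - np.mean(cohort_enroll)) / np.std(cohort_enroll)
    z_t = (raw - np.mean(cohort_test)) / np.std(cohort_test)
    return 0.5 * (z_e + z_t)
```

By construction the result is unchanged if the two sides are swapped, and each side needs only a mean and standard deviation rather than the separate z-norm and t-norm parameter sets of ZT-norm.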
Prosodic Speaker Verification using Subspace Multinomial Models with Intersession Compensation
Cited by 6 (3 self)
We propose a novel approach to modeling prosodic features. Inspired by the Joint Factor Analysis (JFA) model, our model is based on the same idea of introducing a subspace of model parameters. However, the underlying Gaussian mixture distribution of JFA is replaced by a multinomial distribution to model sequences of discrete units rather than continuous features. In this work, we use the subspace model as a feature extractor for support vector machines (SVMs), similar to the recently proposed JFA in total variability space. We show the capability to reduce high-dimensional count vectors to a low dimension while keeping system performance stable. With additional intersession compensation, we achieve a 30% relative improvement over the baseline system and reach an equal error rate of 8.8% on the NIST 2006 SRE dataset. Index Terms: speaker verification, prosody, JFA, multinomial model
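A subspace multinomial model of the kind sketched in this abstract typically parameterizes the unit probabilities as a softmax of base log-odds shifted along a shared low-dimensional subspace. The following is a sketch under that assumption; all names and shapes are illustrative:

```python
import numpy as np

def subspace_multinomial(m, T, v):
    """Subspace multinomial sketch: probabilities over C discrete units
    are softmax(m + T v), where m (C,) is a shared base, T (C x S) a
    shared subspace, and v (S,) a low-dimensional utterance vector.
    The vector v plays the role the count vector plays in the baseline,
    but with S << C."""
    logits = m + T @ v
    e = np.exp(logits - logits.max())   # numerically stable softmax
    return e / e.sum()
```

Estimating `v` per utterance compresses a high-dimensional count vector into a short feature vector, which is what then feeds the SVM.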
Sparse classifier fusion for speaker verification
 IEEE Transactions on Audio, Speech and Language Processing
, 2013
Cited by 5 (5 self)
Abstract—State-of-the-art speaker verification systems take advantage of a number of complementary base classifiers by fusing them to arrive at reliable verification decisions. In speaker verification, fusion is typically implemented as a weighted linear combination of the base classifier scores, where the combination weights are estimated using a logistic regression model. An alternative approach to fusion is classifier ensemble selection, which can be seen as sparse regularization applied to logistic regression. Even though score fusion has been extensively studied in speaker verification, classifier ensemble selection is much less studied. In this study, we extensively investigate sparse classifier fusion on a collection of twelve I4U spectral subsystems on the NIST 2008 and 2010 speaker recognition evaluation (SRE) corpora. Index Terms—Classifier ensemble selection, experimentation, linear fusion, speaker verification.
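Sparse regularization applied to the fusion weights is usually realized with an L1 penalty, whose proximal operator (soft thresholding) zeroes out small weights and thereby drops entire base classifiers. A minimal sketch of that mechanism, not the paper's exact solver:

```python
import numpy as np

def fuse(scores, w):
    """Linear score fusion: scores is (n_trials, n_classifiers),
    w the per-classifier combination weights."""
    return scores @ w

def soft_threshold(w, lam):
    """Proximal step for an L1 penalty of strength lam on the fusion
    weights. Weights with magnitude below lam become exactly zero,
    which is what turns weight estimation into classifier ensemble
    selection (a sketch; the full method wraps this in a logistic
    regression objective)."""
    return np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)
```

Any classifier whose weight lands at exactly zero can be removed from the ensemble without changing the fused score, saving its entire feature-extraction and scoring cost at test time.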
What Else is New Than the Hamming Window? Robust MFCCs for Speaker Recognition via Multitapering
Cited by 5 (2 self)
Usually the mel-frequency cepstral coefficients (MFCCs) are derived from a Hamming-windowed DFT spectrum. In this paper, we advocate using a so-called multitaper method instead. Multitaper methods form a spectrum estimate using multiple window functions and frequency-domain averaging. Multitapers provide a robust spectrum estimate but have not received much attention in speech processing. Our speaker recognition experiment on NIST 2002 yields equal error rates (EERs) of 9.66% (clean data) and 16.41% (10 dB SNR) for the conventional Hamming method and 8.13% (clean data) and 14.63% (10 dB SNR) using multitapers. Multitapering is a simple and robust alternative to the Hamming window method. Index Terms: speaker verification, multiple window method
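The multitaper estimate described here averages periodograms computed with several orthogonal windows instead of a single Hamming window. The sketch below uses a sine-taper family as one simple taper choice (an assumption; the paper may use other taper families such as Thomson or multipeak tapers):

```python
import numpy as np

def sine_tapers(K, N):
    """Sine-taper family: K orthogonal window functions of length N,
    one simple choice of multitaper windows."""
    n = np.arange(1, N + 1)
    return np.array([np.sqrt(2.0 / (N + 1)) * np.sin(np.pi * k * n / (N + 1))
                     for k in range(1, K + 1)])

def multitaper_spectrum(frame, tapers, weights):
    """Weighted average of tapered periodograms over K tapers; with
    K = 1 and a Hamming window this degenerates to the conventional
    single-window estimate. Weights are assumed to sum to one."""
    spectra = np.array([np.abs(np.fft.rfft(t * frame)) ** 2 for t in tapers])
    return weights @ spectra
```

Averaging over independent tapers reduces the variance of the spectrum estimate, which is the robustness property the abstract credits for the EER gains; the mel filterbank and DCT steps of MFCC extraction then proceed unchanged on this spectrum.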