Results 1 - 10 of 121
A study of inter-speaker variability in speaker verification
- IEEE Trans. Audio, Speech and Language Processing, 2008
"... Abstract — We propose a new approach to the problem of estimating the hyperparameters which define the inter-speaker variability model in joint factor analysis. We tested the proposed estimation technique on the NIST 2006 speaker recognition evaluation data and obtained 10–15 % reductions in error r ..."
Cited by 131 (12 self)
Abstract — We propose a new approach to the problem of estimating the hyperparameters which define the inter-speaker variability model in joint factor analysis. We tested the proposed estimation technique on the NIST 2006 speaker recognition evaluation data and obtained 10–15 % reductions in error rates on the core condition and the extended data condition (as measured both by equal error rates and the NIST detection cost function). We show that when a large joint factor analysis model is trained in this way and tested on the core condition, the extended data condition and the cross-channel condition, it is capable of performing at least as well as fusions of multiple systems of other types. (The comparisons are based on the best results on these tasks that have been reported in the literature.) In the case of the cross-channel condition, a factor analysis model with 300 speaker factors and 200 channel factors can achieve equal error rates of less than 3.0%. This is a substantial improvement over the best results that have previously been reported on this task. Index Terms — Speaker verification, Gaussian mixture model, speaker factors, channel factors
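The speaker factors and channel factors mentioned above refer to the standard joint factor analysis decomposition of a speaker- and channel-dependent GMM mean supervector (standard notation, shown here for reference rather than quoted from the paper):

    % JFA decomposition of the mean supervector M
    \[ M = m + V\,y + U\,x + D\,z \]
    % m: UBM mean supervector; V y: speaker factors (300 in the abstract above);
    % U x: channel factors (200 above); D z: residual speaker-specific offset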
Probabilistic models for inference about identity
- IEEE TPAMI, 2012
"... Abstract—Many face recognition algorithms use “distance-based ” methods: feature vectors are extracted from each face and distances in feature space are compared to determine matches. In this paper we argue for a fundamentally different approach. We consider each image as having been generated from ..."
Cited by 52 (0 self)
Abstract—Many face recognition algorithms use “distance-based” methods: feature vectors are extracted from each face and distances in feature space are compared to determine matches. In this paper we argue for a fundamentally different approach. We consider each image as having been generated from several underlying causes, some of which are due to identity (latent identity variables, or LIVs) and some of which are not. In recognition we evaluate the probability that two faces have the same underlying identity cause. We make these ideas concrete by developing a series of novel generative models which incorporate both within-individual and between-individual variation. We consider both the linear case, where signal and noise are represented by a subspace, and the non-linear case, where an arbitrary face manifold can be described and noise is position-dependent. We also develop a “tied” version of the algorithm that allows explicit comparison of faces across quite different viewing conditions. We demonstrate that our model produces results that are comparable to or better than the state of the art for both frontal face recognition and face recognition under varying pose.
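As a schematic of the linear latent identity variable model described above (PLDA-style; generic symbols, not the paper's own notation):

    \[ x_{ij} = \mu + F\,h_i + G\,w_{ij} + \varepsilon_{ij}, \qquad \varepsilon_{ij} \sim \mathcal{N}(0,\Sigma) \]
    % h_i: latent identity variable shared by all images of person i (between-individual variation);
    % G w_{ij} + eps_{ij}: within-individual variation. Verification compares the likelihood that two
    % images share the same h against the likelihood that they were generated from independent h's.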
Language recognition in iVectors space
- Proc. Interspeech, 2011
"... Abstract The concept of so called iVectors, where each utterance is represented by fixed-length low-dimensional feature vector, has recently become very successfully in speaker verification. In this work, we apply the same idea in the context of Language Recognition (LR). To recognize language in t ..."
Cited by 21 (1 self)
Abstract. The concept of so-called iVectors, where each utterance is represented by a fixed-length, low-dimensional feature vector, has recently become very successful in speaker verification. In this work, we apply the same idea in the context of Language Recognition (LR). To recognize language in the iVector space, we experiment with three different linear classifiers: one based on a generative model, where classes are modeled by Gaussian distributions with a shared covariance matrix, and two discriminative classifiers, namely a linear Support Vector Machine and Logistic Regression. The tests were performed on the NIST LRE 2009 dataset and the results were compared with state-of-the-art LR based on Joint Factor Analysis (JFA). While the iVector system offers better performance, it also appears to be complementary to JFA, as their fusion shows a further improvement.
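A minimal sketch of the three linear classifiers mentioned above, using scikit-learn on placeholder i-vectors; the dimensions and data here are invented for illustration, not the paper's setup (LDA is exactly the shared-covariance Gaussian generative classifier):

    import numpy as np
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
    from sklearn.svm import LinearSVC
    from sklearn.linear_model import LogisticRegression

    # Placeholder i-vectors: 1000 utterances, 400 dimensions, 6 languages (assumed sizes).
    rng = np.random.default_rng(0)
    X_train = rng.standard_normal((1000, 400))
    y_train = rng.integers(0, 6, size=1000)
    X_test = rng.standard_normal((200, 400))

    # 1) Generative model: per-class Gaussians with a shared covariance matrix (= LDA).
    gauss = LinearDiscriminantAnalysis().fit(X_train, y_train)

    # 2) Discriminative: linear Support Vector Machine.
    svm = LinearSVC(C=1.0).fit(X_train, y_train)

    # 3) Discriminative: multiclass logistic regression.
    logreg = LogisticRegression(max_iter=1000).fit(X_train, y_train)

    for name, clf in [("Gaussian/LDA", gauss), ("LinearSVC", svm), ("LogReg", logreg)]:
        print(name, clf.predict(X_test)[:5])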
Misalignment-robust face recognition
- IEEE TIP, 2010
"... In this paper, we study the problem of subspace-based face recognition under scenarios with spatial misalign-ments and/or image occlusions. For a given subspace, the embedding of a new datum and the underlying spatial mis-alignment parameters are simultaneously inferred by solv-ing a constrained 1 n ..."
Cited by 17 (1 self)
In this paper, we study the problem of subspace-based face recognition under scenarios with spatial misalignments and/or image occlusions. For a given subspace, the embedding of a new datum and the underlying spatial misalignment parameters are simultaneously inferred by solving a constrained ℓ1-norm optimization problem, which minimizes the error between the misalignment-amended image and the image reconstructed from the given subspace along with its principal complementary subspace. A byproduct of this formulation is the capability to detect the underlying image occlusions. Extensive experiments on spatial misalignment estimation, image occlusion detection, and face recognition with spatial misalignments and image occlusions all validate the effectiveness of our proposed general formulation.
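To make the ℓ1 objective concrete, here is a small Python sketch of fitting an image vector to a fixed subspace under an ℓ1 error criterion, solved as a linear program; the misalignment parameters and the complementary subspace of the paper are omitted, and all names and sizes are illustrative:

    import numpy as np
    from scipy.optimize import linprog

    def l1_subspace_fit(y, U):
        """Solve min_c ||y - U c||_1 as an LP: min sum(t) s.t. -t <= y - U c <= t."""
        n, k = U.shape
        # Variables: [c (k coefficients), t (n slack values)]; cost only on the slack.
        cost = np.concatenate([np.zeros(k), np.ones(n)])
        #  (y - U c) <= t  ->  -U c - t <= -y
        # -(y - U c) <= t  ->   U c - t <=  y
        A_ub = np.block([[-U, -np.eye(n)], [U, -np.eye(n)]])
        b_ub = np.concatenate([-y, y])
        res = linprog(cost, A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * (k + n))
        c = res.x[:k]
        return c, np.abs(y - U @ c)  # coefficients and per-pixel absolute residual

    # Toy example: a 100-pixel image, a 10-dimensional subspace, a few "occluded" pixels.
    rng = np.random.default_rng(1)
    U = rng.standard_normal((100, 10))
    y = U @ rng.standard_normal(10)
    y[:5] += 3.0  # corrupt a few pixels to mimic occlusion
    c, residual = l1_subspace_fit(y, U)
    print("largest residuals at pixels:", np.argsort(residual)[-5:])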
A practical transfer learning algorithm for face verification
- In ICCV, 2013
"... Face verification involves determining whether a pair of facial images belongs to the same or different subjects. This problem can prove to be quite challenging in many im-portant applications where labeled training data is scarce, e.g., family album photo organization software. Herein we propose a ..."
Cited by 17 (0 self)
Face verification involves determining whether a pair of facial images belongs to the same or different subjects. This problem can prove to be quite challenging in many important applications where labeled training data is scarce, e.g., family album photo organization software. Herein we propose a principled transfer learning approach for merging plentiful source-domain data with limited samples from some target domain of interest to create a classifier that ideally performs nearly as well as if rich target-domain data were present. Based upon a surprisingly simple generative Bayesian model, our approach combines a KL-divergence-based regularizer/prior with a robust likelihood function, leading to a scalable implementation via the EM algorithm. As justification for our design choices, we later use principles from convex analysis to recast our algorithm as an equivalent structured rank minimization problem, leading to a number of interesting insights related to solution structure and feature-transform invariance. These insights help both to explain the effectiveness of our algorithm and to elucidate a wide variety of related Bayesian approaches. Experimental testing with challenging datasets validates the utility of the proposed algorithm.
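Schematically, a KL-regularized transfer objective of the kind described above can be written as follows; this is a generic illustration under assumed notation, not the paper's exact criterion or its robust likelihood:

    \[ \hat{\theta} = \arg\max_{\theta} \; \sum_{n=1}^{N_{\mathrm{target}}} \log p(x_n \mid \theta)
       \; - \; \lambda\, \mathrm{KL}\big( p(\cdot \mid \theta) \,\|\, p(\cdot \mid \theta_{\mathrm{source}}) \big) \]
    % theta_source: a model fit on plentiful source-domain data; the KL term keeps the
    % target-domain model close to it when target samples x_1..x_N are scarce.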
I4U submission to NIST SRE 2012: a large-scale collaborative effort for noise-robust speaker verification
- in InterSpeech, 2013
"... I4U is a joint entry of nine research Institutes and Universities across 4 continents to NIST SRE 2012. It started with a brief discussion during the Odyssey 2012 workshop in Singapore. An online discussion group was soon set up, providing a dis-cussion platform for different issues surrounding NIST ..."
Cited by 13 (5 self)
I4U is a joint entry of nine research Institutes and Universities across 4 continents to NIST SRE 2012. It started with a brief discussion during the Odyssey 2012 workshop in Singapore. An online discussion group was soon set up, providing a discussion platform for different issues surrounding NIST SRE’12. Noisy test segments, uneven multi-session training, variable enrollment duration, and the issue of open-set identification were actively discussed, leading to various solutions integrated into the I4U submission. The joint submission and several of its 17 subsystems were among the top-performing systems. We summarize the lessons learnt from this large-scale effort.
Improving local descriptors by embedding global and local spatial information
- In ECCV, 2010
"... Abstract. In this paper, we present a novel problem: “Given local de-scriptors, how can we incorporate both local and global spatial infor-mation into the descriptors, and obtain compact and discriminative fea-tures? ” To address this problem, we proposed a general framework to improve any local des ..."
Cited by 13 (3 self)
Abstract. In this paper, we present a novel problem: “Given local descriptors, how can we incorporate both local and global spatial information into the descriptors, and obtain compact and discriminative features?” To address this problem, we propose a general framework to improve any local descriptor by embedding both local and global spatial information. In addition, we propose a simple and powerful combination method for different types of features. We evaluated the proposed method on the most standard scene and object recognition datasets, and confirmed its effectiveness in terms of both speed and accuracy.
Speaker verification using simplified and supervised i-vector modeling
- Proc. of ICASSP, 2013
"... This paper presents a simplified and supervised i-vector modeling framework that is applied in the task of robust and efficient speaker verification (SRE). First, by concatenating the mean supervector and the i-vector factor loading matrix with respectively the label vector and the linear classifier ..."
Cited by 11 (9 self)
This paper presents a simplified and supervised i-vector modeling framework that is applied in the task of robust and efficient speaker verification (SRE). First, by concatenating the mean supervector and the i-vector factor loading matrix with respectively the label vector and the linear classifier matrix, the traditional i-vectors are extended to label-regularized supervised i-vectors. These supervised i-vectors are optimized to not only reconstruct the mean supervectors well but also minimize the mean squared error between the original and the reconstructed label vectors, such that they become more discriminative. Second, factor analysis (FA) can be performed on the pre-normalized centered GMM first-order statistics supervector to ensure that the Gaussian statistics sub-vector of each Gaussian component is treated equally in the FA, which reduces the computational cost significantly. Experimental results are reported on the female part of the NIST SRE 2010 task with common condition 5. The proposed supervised i-vector approach outperforms the i-vector baseline by a relative 12% and 7% in terms of equal error rate (EER) and norm old minDCF values, respectively. Index Terms — Speaker verification, Simplified i-vector, Supervised i-vector
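A schematic of the label-regularized objective described above, written with generic symbols (assumed notation and weighting, not the paper's exact formulation):

    \[ \min_{T,\,W,\,\{w_u\}} \; \sum_u \Big( \| s_u - T w_u \|^2 + \lambda\, \| l_u - W w_u \|^2 \Big) \]
    % s_u: centered first-order-statistics supervector of utterance u; w_u: its supervised i-vector;
    % l_u: speaker label vector; T and W (factor loading matrix and linear classifier) are the
    % concatenated blocks, so the same w_u must both reconstruct s_u and predict l_u.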
An investigation on back-end for speaker recognition in multi-session enrollment
"... This study explores various back-end classifiers for robust speaker recognition in multi-session enrollment, with emphasis on optimal utilization and organization of speaker information present in the development data. Our objective is to construct a highly discriminative back-end framework by fusin ..."
Cited by 10 (6 self)
This study explores various back-end classifiers for robust speaker recognition in multi-session enrollment, with emphasis on optimal utilization and organization of the speaker information present in the development data. Our objective is to construct a highly discriminative back-end framework by fusing several back-ends within an i-vector system framework. It is demonstrated that, by using different information/data configurations and modeling schemes, the performance of the fused system can be significantly improved compared to an individual system using a single front-end and back-end. Averaged across both genders, we obtain relative improvements in EER and minDCF of 56.5% and 49.4%, respectively. The consistent performance gains obtained using the proposed strategy validate its effectiveness. This system is part of the CRSS’ NIST SRE 2012 submission system.
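Back-end fusion of this kind is typically done at the score level with a linear (logistic-regression) combiner; the sketch below shows that generic approach on synthetic scores and labels, and is not the authors' actual configuration:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Synthetic per-trial scores from three hypothetical back-ends (e.g. PLDA, SVM, cosine),
    # plus 0/1 target labels for a development set of trials.
    rng = np.random.default_rng(2)
    labels = rng.integers(0, 2, size=5000)
    scores = rng.standard_normal((5000, 3)) + labels[:, None] * np.array([1.5, 1.0, 0.8])

    # Fit a linear fusion of the back-end scores on development trials.
    fuser = LogisticRegression().fit(scores, labels)

    # Fused score for new trials: a weighted sum of the individual back-end scores
    # (the log-odds of the logistic model), which can then be thresholded or calibrated.
    new_scores = rng.standard_normal((4, 3))
    fused = fuser.decision_function(new_scores)
    print("fusion weights:", fuser.coef_.ravel(), "fused scores:", fused)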
Linear versus mel-frequency cepstral coefficients for speaker recognition
- in IEEE Automatic Speech Recognition and Understanding Workshop, 2011
"... Abstract—Mel-frequency cepstral coefficients (MFCC) have been dominantly used in speaker recognition as well as in speech recognition. However, based on theories in speech production, some speaker characteristics associated with the structure of the vocal tract, particularly the vocal tract length, ..."
Cited by 10 (1 self)
Abstract—Mel-frequency cepstral coefficients (MFCC) have been dominantly used in speaker recognition as well as in speech recognition. However, based on theories of speech production, some speaker characteristics associated with the structure of the vocal tract, particularly the vocal tract length, are reflected more in the high frequency range of speech. This insight suggests that a linear frequency scale may provide some advantages in speaker recognition over the mel scale. Based on two state-of-the-art speaker recognition back-end systems (one Joint Factor Analysis system and one Probabilistic Linear Discriminant Analysis system), this study compares the performance of MFCC and LFCC (linear frequency cepstral coefficients) in the NIST SRE (Speaker Recognition Evaluation) 2010 extended-core task. Our results in SRE10 show that, while the two are complementary to each other, LFCC consistently outperforms MFCC, mainly due to its better performance in the female trials. This can be explained by the relatively shorter vocal tract in females and the resulting higher formant frequencies in speech. LFCC benefits more in female speech by better capturing the spectral characteristics in the high frequency region. In addition, our results show some advantage of LFCC over MFCC in reverberant speech. LFCC is as robust as MFCC in babble noise, but not in white noise. It is concluded that LFCC should be more widely used, at least for the female trials, by the mainstream of the speaker recognition community.
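The MFCC/LFCC distinction is only the frequency warping of the filterbank: MFCC spaces triangular filters uniformly on the mel scale below, which compresses the high-frequency region carrying the vocal-tract-length cues discussed above, while LFCC spaces the same filters uniformly in Hz:

    \[ m(f) = 2595 \,\log_{10}\!\Big(1 + \frac{f}{700\ \mathrm{Hz}}\Big) \]
    % Filters placed uniformly in m(f) are dense at low frequencies and sparse at high
    % frequencies; a linear placement keeps full resolution above roughly 4 kHz.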