Results 1 - 10 of 20
Speaker verification using Adapted Gaussian mixture models. Digital Signal Processing, 2000.
Cited by 1010 (42 self)
Abstract:
In this paper we describe the major elements of MIT Lincoln Laboratory’s Gaussian mixture model (GMM)-based speaker verification system used successfully in several NIST Speaker Recognition Evaluations (SREs). The system is built around the likelihood ratio test for verification, using simple but effective GMMs for likelihood functions, a universal background model (UBM) for alternative speaker representation, and a form of Bayesian adaptation to derive speaker models from the UBM. The development and use of a handset detector and score normalization to greatly improve verification performance are also described and discussed. Finally, representative performance benchmarks and system behavior experiments on NIST SRE corpora are presented. © 2000 Academic Press
Key Words: speaker recognition; Gaussian mixture models; likelihood ratio detector; universal background model; handset normalization; NIST evaluation.
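The GMM-UBM recipe this abstract outlines (likelihood-ratio scoring plus Bayesian adaptation of a universal background model) can be sketched in miniature. This is a toy 1-D version: the mixtures, relevance factor, and all names below are illustrative choices of ours, not the paper's implementation.

```python
import math

def gmm_logpdf(x, weights, means, variances):
    """Log-density of a scalar frame x under a 1-D Gaussian mixture."""
    dens = sum(w * math.exp(-0.5 * (x - m) ** 2 / v) / math.sqrt(2 * math.pi * v)
               for w, m, v in zip(weights, means, variances))
    return math.log(dens)

def llr_score(frames, speaker_gmm, ubm):
    """Verification score: average per-frame log-likelihood ratio
    log p(x|speaker) - log p(x|UBM), thresholded for accept/reject."""
    return sum(gmm_logpdf(x, *speaker_gmm) - gmm_logpdf(x, *ubm)
               for x in frames) / len(frames)

def map_adapt_means(frames, ubm, relevance=16.0):
    """Relevance-MAP adaptation of the UBM means toward enrollment frames
    (a means-only version of the paper's Bayesian adaptation)."""
    weights, means, variances = ubm
    k = len(weights)
    n, ex = [0.0] * k, [0.0] * k
    for x in frames:
        post = [w * math.exp(-0.5 * (x - m) ** 2 / v) / math.sqrt(2 * math.pi * v)
                for w, m, v in zip(weights, means, variances)]
        z = sum(post)
        for i in range(k):
            n[i] += post[i] / z      # soft count for component i
            ex[i] += (post[i] / z) * x  # soft first-order statistic
    # New mean = data-weighted interpolation between statistics and prior mean.
    new_means = [(ex[i] + relevance * means[i]) / (n[i] + relevance)
                 for i in range(k)]
    return weights, new_means, variances

# Toy usage: adapt a two-component UBM toward a "speaker" whose frames
# cluster near 1.5, then score matching frames against the UBM.
ubm = ([0.5, 0.5], [-1.0, 1.0], [1.0, 1.0])
enroll = [1.4, 1.6, 1.5, 1.3, 1.7]
speaker = map_adapt_means(enroll, ubm)
score = llr_score([1.5, 1.45, 1.55], speaker, ubm)
```

With matching frames the adapted model fits better than the UBM, so the ratio comes out positive; a true-speaker trial accepts for any threshold below that score.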
Generalized Linear Discriminant Sequence Kernels For Speaker Recognition, 2002.
Cited by 95 (23 self)
Abstract:
Support Vector Machines have recently shown dramatic performance gains in many application areas. We show that the same gains can be realized in the area of speaker recognition via sequence kernels. A sequence kernel provides a numerical comparison of speech utterances as entire sequences rather than a probability at the frame level. We introduce a novel sequence kernel derived from generalized linear discriminants. The kernel has several advantages. First, the kernel uses an explicit expansion into "feature space"--this property allows all of the support vectors to be collapsed into a single vector creating a small speaker model. Second, the kernel retains the computational advantage of generalized linear discriminants trained using mean-squared error training. Finally, the kernel shows dramatic reductions in equal error rates over standard mean-squared error training in matched and mismatched conditions on a NIST speaker recognition task.
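A minimal version of the sequence-kernel idea: expand every frame into an explicit polynomial feature space, average over the utterance so a whole variable-length sequence becomes one fixed vector, and compare utterances by an inner product. The degree-2 expansion and the data below are our illustrative choices; the paper additionally normalizes by a correlation matrix, which is omitted here.

```python
def poly_expand(frame):
    """Degree-2 monomial expansion of a 2-D frame (an illustrative choice)."""
    x, y = frame
    return [1.0, x, y, x * x, x * y, y * y]

def utterance_map(frames):
    """Collapse a variable-length utterance into one fixed vector by
    averaging the per-frame expansions."""
    dim = len(poly_expand(frames[0]))
    acc = [0.0] * dim
    for f in frames:
        for i, v in enumerate(poly_expand(f)):
            acc[i] += v
    return [a / len(frames) for a in acc]

def glds_kernel(frames_a, frames_b):
    """Compare two utterances as entire sequences via an inner product of
    their averaged expansions."""
    ba, bb = utterance_map(frames_a), utterance_map(frames_b)
    return sum(p * q for p, q in zip(ba, bb))

utt_a = [(1.0, 2.0), (0.5, 1.5)]
utt_b = [(2.0, 0.0)]
```

Because the feature-space expansion is explicit, a trained SVM's support vectors can be pre-summed into a single model vector (the "small speaker model" the abstract mentions), so scoring a test utterance is a single dot product.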
Fusion of heterogeneous speaker recognition systems in the STBU submission for the NIST speaker recognition evaluation 2006. IEEE Transactions on Audio, Speech and Signal Processing, 2007.
Cited by 63 (14 self)
Abstract:
This paper describes and discusses the ‘STBU’ speaker recognition system, which performed well in the NIST Speaker Recognition Evaluation 2006 (SRE). STBU is a consortium ...
User Authentication via Adapted Statistical Models of Face Images, 2006.
Cited by 61 (10 self)
Abstract:
It has been previously demonstrated that systems based on local features and relatively complex statistical models, namely one-dimensional (1-D) hidden Markov models (HMMs) and pseudo-two-dimensional (2-D) HMMs, are suitable for face recognition. Recently, a simpler statistical model, the Gaussian mixture model (GMM), was also shown to perform well. In much of the literature devoted to these models, the experiments were performed with controlled images (manual face localization, controlled lighting, background, pose, etc.). However, a practical recognition system has to be robust to more challenging conditions. In this article we evaluate, on the relatively difficult BANCA database, the performance, robustness and complexity of GMM- and HMM-based approaches, using both manual and automatic face localization. We extend the GMM approach through the use of local features with embedded positional information, increasing performance without sacrificing its low complexity. Furthermore, we show that the traditionally used maximum likelihood (ML) training approach has problems estimating robust model parameters when only a few training images are available. Considerably more precise models can be obtained through the use of maximum a posteriori (MAP) training. We also show that face recognition techniques which obtain good performance on manually located faces do not necessarily obtain good performance on automatically located faces, indicating that recognition techniques must be designed from the ground up to handle imperfect localization. Finally, we show that while the pseudo-2-D HMM approach has the best overall performance, authentication time on current hardware makes it impractical. The best tradeoff in terms of authentication ...
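The abstract's point about ML versus MAP training can be seen already for a single Gaussian mean: the MAP estimate shrinks the sample mean toward a prior (e.g. world-model) mean, so with only a few training images the estimate stays near the robust prior. The relevance factor and data below are hypothetical values of ours, not the paper's.

```python
def ml_mean(samples):
    """Maximum-likelihood estimate of a Gaussian mean: the sample mean."""
    return sum(samples) / len(samples)

def map_mean(samples, prior_mean, relevance=16.0):
    """MAP estimate: shrink the sample mean toward the prior mean; the less
    data there is, the more the (robust) prior dominates."""
    n = len(samples)
    return (n * ml_mean(samples) + relevance * prior_mean) / (n + relevance)

few = [4.0, 6.0]      # e.g. a feature observed in only two training images
many = [5.0] * 200    # the same feature with plenty of data
prior = 0.0           # a world-model mean
```

With two samples the MAP estimate sits near the prior (0.56 rather than the ML value 5.0); with two hundred samples it approaches the data, matching the abstract's claim that MAP is the safer estimator when training images are scarce.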
On transforming statistical models for non-frontal face verification, 2006.
Cited by 31 (2 self)
Abstract:
We address the pose mismatch problem which can occur in face verification systems that have only a single (frontal) face image available for training. In the framework of a Bayesian classifier based on mixtures of Gaussians, the problem is tackled by extending each frontal face model with artificially synthesized models for non-frontal views. The synthesis methods are based on several implementations of maximum likelihood linear regression (MLLR), as well as standard multivariate linear regression (LinReg). All synthesis techniques rely on prior information and learn how face models for the frontal view are related to face models for non-frontal views. The synthesis and extension approach is evaluated by applying it to two face verification systems: a holistic system (based on PCA-derived features) and a local feature system (based on DCT-derived features). Experiments on the FERET database suggest that for the holistic system the LinReg-based technique is better suited than the MLLR-based techniques, while for the local feature system synthesis via a new MLLR implementation obtains better performance than synthesis based on traditional MLLR. The results further suggest that extending frontal models considerably reduces errors. It is also shown that the local feature system is less affected by view changes than the holistic system; this can be attributed to the parts-based representation of the face and, due to the classifier based on mixtures of Gaussians, the lack of constraints on spatial relations between the face parts, allowing for deformations and movements of face areas.
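The LinReg variant described above can be sketched as ordinary least squares per dimension: from prior subjects for whom both frontal and non-frontal models exist, learn a linear map between the corresponding model parameters, then apply it to a new client's frontal model. The two-dimensional means and data below are invented for illustration.

```python
def fit_linreg(frontal_means, nonfrontal_means):
    """Per-dimension ordinary least squares: learn (a, b) per dimension so
    that nonfrontal ≈ a * frontal + b, from prior subjects with both views."""
    params = []
    for d in range(len(frontal_means[0])):
        xs = [m[d] for m in frontal_means]
        ys = [m[d] for m in nonfrontal_means]
        mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
        sxx = sum((x - mx) ** 2 for x in xs)
        sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        a = sxy / sxx
        params.append((a, my - a * mx))
    return params

def synthesize(frontal_mean, params):
    """Synthesize a non-frontal model mean for a client seen only frontally."""
    return [a * v + b for v, (a, b) in zip(frontal_mean, params)]

# Prior subjects whose non-frontal means happen to follow an exact linear map.
frontal = [[1.0, 2.0], [2.0, 1.0], [3.0, 4.0]]
nonfrontal = [[2 * x + 1, -y + 0.5] for x, y in frontal]
params = fit_linreg(frontal, nonfrontal)
synth = synthesize([4.0, 3.0], params)
```

When the prior subjects follow an exact linear relation, the fit recovers it; in practice the regression only approximates the frontal-to-non-frontal relationship, which is why the paper evaluates several synthesis variants.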
An Introduction to Application-Independent Evaluation of Speaker Recognition Systems
Speaker adaptive cohort selection for Tnorm in text-independent speaker verification. In Proc. ICASSP.
Cited by 18 (4 self)
Abstract:
In this paper we discuss an extension to the widely used score normalization technique of test normalization (Tnorm) for text-independent speaker verification. A new method, speaker Adaptive-Tnorm, which offers advantages over standard Tnorm by adapting the cohort speaker set to the target model, is presented. Examples of this improvement using the 2004 NIST SRE data are also presented.
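Standard Tnorm, which the paper extends, z-normalizes a trial score using the scores the same test segment obtains against a cohort of other speaker models; the adaptive variant then chooses that cohort per target. The similarity measure, cohort size, and identifiers below are illustrative assumptions, not the paper's procedure.

```python
import math

def tnorm(raw_score, cohort_scores):
    """Standard Tnorm: z-normalize a trial score with the mean and standard
    deviation of the scores the test segment gets against cohort models."""
    mu = sum(cohort_scores) / len(cohort_scores)
    var = sum((s - mu) ** 2 for s in cohort_scores) / (len(cohort_scores) - 1)
    return (raw_score - mu) / math.sqrt(var)

def adaptive_cohort(target, similarity, cohort_pool, k=2):
    """Adaptive-Tnorm cohort selection (our reading of the idea): keep the
    k cohort models most similar to the target model, not a fixed global set."""
    return sorted(cohort_pool, key=lambda c: -similarity[(target, c)])[:k]

# Toy usage: pick the two cohort models closest to target "t", then
# normalize a trial score against cohort scores.
sim = {("t", "a"): 0.9, ("t", "b"): 0.1, ("t", "c"): 0.5, ("t", "d"): 0.7}
chosen = adaptive_cohort("t", sim, ["a", "b", "c", "d"])
normalized = tnorm(5.0, [1.0, 2.0, 3.0])
```

Matching the cohort to the target makes the impostor score distribution per trial better centered, which is the advantage the abstract claims over a single global cohort.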
The distribution of calibrated likelihood-ratios in speaker recognition. In Proc. Interspeech, ISCA, 2013.
Cited by 7 (2 self)
Abstract:
This paper studies properties of the score distributions of calibrated log-likelihood-ratios that are used in automatic speaker recognition. We derive the essential condition for calibration that the log-likelihood-ratio of the log-likelihood-ratio is the log-likelihood-ratio. We then investigate the consequence of this condition for the probability density functions (PDFs) of the log-likelihood-ratio score. We show that if the PDF of the non-target distribution is Gaussian, then the PDF of the target distribution must be Gaussian as well. The means and variances of these two PDFs are interrelated, and determined completely by the discrimination performance of the recognizer as characterized by the equal error rate. These relations allow for a new way of computing the offset and scaling parameters for linear calibration; we derive closed-form expressions for these and show that for modern i-vector systems with PLDA scoring this leads to good calibration, comparable to traditional logistic regression, over a wide range of system performance.
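The central condition here, that the log-likelihood-ratio of a calibrated score is the score itself, forces the two score PDFs into a linked Gaussian pair: if non-target scores are N(-d/2, d), target scores must be N(+d/2, d), and the equal error rate is then Φ(-√d/2). The check below verifies this algebra numerically; the value of d is an arbitrary illustration.

```python
import math

def norm_logpdf(x, mean, var):
    """Log-density of a univariate Gaussian."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)

# If non-target scores are N(-d/2, d) and target scores are N(+d/2, d),
# the log-likelihood-ratio of a score s is s itself: the scores are calibrated.
d = 4.0  # shared variance; an arbitrary illustrative value
for s in (-3.0, -0.5, 0.0, 1.7, 4.2):
    llr = norm_logpdf(s, d / 2, d) - norm_logpdf(s, -d / 2, d)
    assert abs(llr - s) < 1e-9

# Discrimination is then fixed by d alone: EER = Phi(-sqrt(d)/2),
# written via the complementary error function.
eer = 0.5 * math.erfc(math.sqrt(d) / (2 * math.sqrt(2)))
```

The quadratic terms in the two exponents cancel because the variances are equal, leaving exactly s; this is why the means and variance cannot be chosen independently once calibration is imposed.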
Channel-dependent GMM and multi-class logistic regression models for language recognition. In Proceedings of the IEEE Odyssey 2006 Speaker and Language Recognition Workshop, 2006.
Cited by 5 (0 self)
Abstract:
This paper describes two new approaches to spoken language recognition, both applied successfully in the NIST 2005 Language Recognition Evaluation. The first approach extends the Gaussian mixture model technique with channel dependency, which results in an actual detection cost (CDET) of 0.095 in NIST LRE-2005, compared to 0.120 for traditional gender-dependent GMM language models. The second approach is a multi-class logistic regression system, which operates similarly to a support vector machine (SVM) but can be trained for all languages simultaneously. This new approach resulted in a CDET of 0.198. The joint TNO-Spescom Datavoice (TNO-SDV) submission to NIST LRE-2005 contained two more systems and obtained a result of 0.0958.
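The second system's core idea, a single discriminative model trained for all languages simultaneously rather than one-vs-rest classifiers, is ordinary softmax (multi-class logistic) regression. The toy 2-D features, learning rate, and data below are our inventions; the paper trains on far richer inputs.

```python
import math

def softmax(zs):
    """Numerically stable softmax over a list of scores."""
    m = max(zs)
    es = [math.exp(z - m) for z in zs]
    s = sum(es)
    return [e / s for e in es]

def scores(w, xi):
    return [sum(wc[j] * xi[j] for j in range(len(xi))) for wc in w]

def train_mlr(xs, ys, n_classes, lr=0.05, epochs=1000):
    """Multi-class logistic regression trained jointly over all classes
    with plain stochastic gradient descent (toy settings)."""
    dim = len(xs[0])
    w = [[0.0] * (dim + 1) for _ in range(n_classes)]  # last weight = bias
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            xi = list(x) + [1.0]
            p = softmax(scores(w, xi))
            for c in range(n_classes):
                g = p[c] - (1.0 if c == y else 0.0)  # dLoss/dz_c
                for j in range(dim + 1):
                    w[c][j] -= lr * g * xi[j]
    return w

def predict(w, x):
    xi = list(x) + [1.0]
    p = softmax(scores(w, xi))
    return max(range(len(p)), key=lambda c: p[c])

# Three toy "languages" as well-separated 2-D clusters.
xs = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0), (0.0, 5.0), (1.0, 5.0)]
ys = [0, 0, 1, 1, 2, 2]
w = train_mlr(xs, ys, 3)
```

Because all class weights share one loss, every update competes across languages at once, which is the training-time contrast with per-language SVMs the abstract draws.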
Results of the 2003 NFI-TNO Forensic Speaker Recognition Evaluation
Cited by 3 (0 self)
Abstract:
In this paper we report on the results of the NFI-TNO speaker recognition evaluation held in 2003. The speech material used in this evaluation was obtained from wire-tapped recordings from real police investigations in the Netherlands. In total, six experiments were carried out: one main experiment in Dutch, one experiment in which speech lengths were systematically varied, three language-dependence experiments, and one experiment evaluating a proposed forensic procedure for providing evidence in court cases. The lowest equal error rate of all systems was 12.1% in the condition using 15-second test segments and 60-second training segments.