Results 1 - 10 of 315
Cosine Similarity Scoring without Score Normalization Techniques
Cited by 36 (3 self)
In recent work [1], a simplified and highly effective approach to speaker recognition based on the cosine similarity between low-dimensional vectors, termed i-vectors, defined in a total variability space was introduced. The total variability space representation is motivated by the popular Joint Factor Analysis (JFA) approach, but does not require the complication of estimating separate speaker and channel spaces and has been shown to be less dependent on score normalization procedures, such as z-norm and t-norm. In this paper, we introduce a modification to the cosine similarity that does not require explicit score normalization, relying instead on simple mean and covariance statistics from a collection of impostor speaker i-vectors. By avoiding the complication of z- and t-norm, the new approach further allows for application of a new unsupervised speaker adaptation technique to models defined in the i-vector space. Experiments are conducted on the core condition of the NIST 2008 corpora, where, with adaptation, the new approach produces an equal error rate (EER) of 4.8% and a minimum decision cost function (MinDCF) of 2.3% on all female speaker trials.
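The normalization described in the abstract can be pictured as centering and whitening i-vectors with impostor statistics before cosine scoring. A minimal numpy sketch of that idea follows; the function name and the Cholesky-based whitening are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def normalized_cosine_score(w_target, w_test, imp_mean, imp_cov):
    """Cosine similarity after centering and whitening both i-vectors
    with mean/covariance statistics from impostor i-vectors.
    Sketch only; the paper's exact normalization may differ."""
    # Whitening transform derived from the impostor covariance.
    L = np.linalg.cholesky(np.linalg.inv(imp_cov))
    a = L.T @ (w_target - imp_mean)
    b = L.T @ (w_test - imp_mean)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```

With zero impostor mean and identity covariance this reduces to the plain cosine similarity, so the statistics only act when the impostor population is shifted or correlated.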
Language recognition via i-vectors and dimensionality reduction
in Interspeech, 2011
Cited by 35 (4 self)
In this paper, a new language identification system is presented based on the total variability approach previously developed in the field of speaker identification. Various techniques are employed to extract the most salient features in the lower-dimensional i-vector space, and the resulting system achieves excellent performance on the 2009 LRE evaluation set without the need for any post-processing or backend techniques. Additional performance gains are observed when the system is combined with other acoustic systems.
Language recognition in iVectors space
in Proc. Interspeech, 2011
Cited by 21 (1 self)
The concept of so-called iVectors, where each utterance is represented by a fixed-length low-dimensional feature vector, has recently become very successful in speaker verification. In this work, we apply the same idea in the context of Language Recognition (LR). To recognize language in the iVector space, we experiment with three different linear classifiers: one based on a generative model, where classes are modeled by Gaussian distributions with a shared covariance matrix, and two discriminative classifiers, namely a linear Support Vector Machine and Logistic Regression. The tests were performed on the NIST LRE 2009 dataset and the results were compared with state-of-the-art LR based on Joint Factor Analysis (JFA). While the iVector system offers better performance, it also seems to be complementary to JFA, as their fusion shows a further improvement.
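The generative classifier mentioned in the abstract (per-class Gaussian means with one shared covariance) yields linear decision boundaries. A small sketch under assumed names and a naive shared-covariance estimate:

```python
import numpy as np

def train_shared_cov_gaussian(X, y):
    """Per-language means plus one covariance pooled over all classes.
    Illustrative sketch of a shared-covariance Gaussian classifier."""
    classes = np.unique(y)
    means = {c: X[y == c].mean(axis=0) for c in classes}
    # Pooled within-class scatter -> shared covariance estimate.
    centered = np.vstack([X[y == c] - means[c] for c in classes])
    cov = centered.T @ centered / len(X)
    # Small ridge keeps the inverse stable for toy data.
    return means, np.linalg.inv(cov + 1e-6 * np.eye(X.shape[1]))

def classify(x, means, prec):
    # Shared covariance: score differences are linear in x.
    scores = {c: -0.5 * (x - m) @ prec @ (x - m) for c, m in means.items()}
    return max(scores, key=scores.get)
```

In practice the i-vectors would first be length-normalized or projected, but the linear-classifier structure is the point of the comparison in the paper.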
Simplification and optimization of i-vector extraction
in Proceedings of the 2011 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2011. IEEE Signal Processing Society, 2011
Cited by 19 (4 self)
This paper introduces some simplifications to i-vector speaker recognition systems. I-vector extraction, as well as training of the i-vector extractor, can be an expensive task both in terms of memory and speed. Under certain assumptions, the formulas for i-vector extraction (also used in i-vector extractor training) can be simplified, leading to faster and more memory-efficient code. The first assumption is that the GMM component alignment is constant across utterances and is given by the UBM GMM weights. The second assumption is that the i-vector extractor matrix can be linearly transformed so that its per-Gaussian components are orthogonal. We use PCA and HLDA to estimate this transform.
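The first assumption can be illustrated concretely: if the occupation counts are fixed to the UBM weights, the per-utterance precision matrix in standard i-vector extraction becomes constant, so its inverse can be precomputed and extraction reduces to one matrix-vector product. A rough numpy sketch under assumed shapes (names and diagonal-covariance layout are illustrative, not the paper's code):

```python
import numpy as np

def make_fast_extractor(T, Sigma_inv_diag, ubm_weights, n_frames):
    """Precompute the extraction projection under the constant-alignment
    assumption: occupation counts N_c = n_frames * ubm_weights[c] for
    every utterance. T: (C*F, R) total variability matrix;
    Sigma_inv_diag: (C*F,) inverse diagonal UBM covariances."""
    C = len(ubm_weights)
    F = T.shape[0] // C
    N = np.repeat(ubm_weights * n_frames, F)            # fixed counts
    precision = np.eye(T.shape[1]) + T.T @ (N[:, None] * Sigma_inv_diag[:, None] * T)
    L_inv = np.linalg.inv(precision)                    # computed once
    proj = (L_inv @ T.T) * Sigma_inv_diag               # (R, C*F), fixed
    # Per-utterance work is now just proj @ first-order statistics.
    return lambda first_order_stats: proj @ first_order_stats
```

The second assumption (orthogonalized per-Gaussian components via PCA/HLDA) further diagonalizes this precomputation, but the caching idea above is the core of the speed-up.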
Duration Mismatch Compensation for I-vector based Speaker Recognition Systems
in Proc. IEEE ICASSP, 2013
Cited by 16 (7 self)
Speaker recognition systems trained on long-duration utterances are known to perform significantly worse when short test segments are encountered. To address this mismatch, we analyze the effect of duration variability on the phoneme distributions of speech utterances and on i-vector length. We demonstrate that, as utterance duration is decreased, the number of detected unique phonemes and the i-vector length approach zero in a logarithmic and non-linear fashion, respectively. Treating duration variability as additive noise in the i-vector space, we propose three different strategies for its compensation: i) multi-duration training of the Probabilistic Linear Discriminant Analysis (PLDA) model, ii) score calibration using log duration as a Quality Measure Function (QMF), and iii) multi-duration PLDA training with synthesized short-duration i-vectors. Experiments are designed based on the 2012 National Institute of Standards and Technology (NIST) Speaker Recognition Evaluation (SRE) protocol with varying test utterance duration. Experimental results demonstrate the effectiveness of the proposed schemes on short-duration test conditions, especially with the QMF calibration approach.
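The QMF calibration in strategy (ii) amounts to an affine score transform in which log duration enters as an extra quality feature. A minimal sketch; the weight vector would be trained on a development set (for instance by logistic regression), and the function and parameter names here are assumptions:

```python
import numpy as np

def qmf_calibrate(raw_score, duration_s, w):
    """Affine calibration with log duration as a Quality Measure
    Function: s' = w0 + w1*s + w2*log(duration). Weights w are
    assumed to be learned on held-out trials."""
    w0, w1, w2 = w
    return w0 + w1 * raw_score + w2 * np.log(duration_s)
```

The effect is that short test segments, which tend to produce unreliable raw scores, get shifted toward a better-calibrated operating point without retraining the PLDA model.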
Leeuwen, "Source-normalised-and-weighted LDA for robust speaker recognition using i-vectors"
in IEEE Int. Conf. on Acoustics, Speech and Signal Processing, 2011
Cited by 15 (5 self)
The recently developed i-vector framework for speaker recognition has set a new performance standard in the research field. An i-vector is a compact representation of a speaker utterance extracted from a low-dimensional total variability subspace. Prior to classification using a cosine kernel, i-vectors are projected into an LDA space in order to reduce inter-session variability and enhance speaker discrimination. The accurate estimation of this LDA space from a training dataset is crucial to classification performance. A typical training dataset, however, does not consist of utterances acquired from all sources of interest (i.e., telephone, microphone and interview speech sources) for each speaker. This has the effect of introducing source-related variation in the between-speaker covariance matrix and results in an incomplete representation of the within-speaker scatter matrix used for LDA. Proposed is a novel source-normalised-and-weighted LDA algorithm developed to improve the robustness of i-vector-based speaker recognition under both mismatched evaluation conditions and conditions for which insufficient speech resources are available for adequate system development. Evaluated on the recent NIST 2008 and 2010 Speaker Recognition Evaluations (SRE), the proposed technique demonstrated improvements of up to 31% in minimum DCF and EER under mismatched and sparsely resourced conditions. Index Terms: speaker recognition, linear discriminant analysis, i-vector, total variability, source variability
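The source-normalisation idea can be made concrete for the between-speaker scatter: compute speaker means within each source and center them on that source's mean, so constant source offsets do not leak into the between-class statistics. A sketch of that one ingredient (names and the exact weighting are assumptions; the paper combines this with further weighting):

```python
import numpy as np

def sn_between_scatter(X, spk, src):
    """Source-normalised between-speaker scatter: speaker means are
    centered per source, removing source-related offsets.
    X: (n, d) i-vectors; spk, src: (n,) integer labels."""
    Sb = np.zeros((X.shape[1], X.shape[1]))
    for s in np.unique(src):
        Xs, spks = X[src == s], spk[src == s]
        mu_s = Xs.mean(axis=0)              # source mean, not global mean
        for p in np.unique(spks):
            d = Xs[spks == p].mean(axis=0) - mu_s
            Sb += np.sum(spks == p) * np.outer(d, d)
    return Sb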
I4U submission to NIST SRE 2012: a large-scale collaborative effort for noise-robust speaker verification
in InterSpeech, 2013
Cited by 13 (5 self)
I4U is a joint entry of nine research institutes and universities across four continents to NIST SRE 2012. It started with a brief discussion during the Odyssey 2012 workshop in Singapore. An online discussion group was soon set up, providing a platform for discussing different issues surrounding NIST SRE'12. Noisy test segments, uneven multi-session training, variable enrollment duration, and the issue of open-set identification were actively discussed, leading to various solutions integrated into the I4U submission. The joint submission and several of its 17 subsystems were among the top-performing systems. We summarize the lessons learnt from this large-scale effort.
A study on spoofing attack in state-of-the-art speaker verification: the telephone speech case
in APSIPA ASC, 2012
Cited by 13 (6 self)
Voice conversion, which modifies one speaker's (source) voice to sound like another speaker's (target), presents a threat to automatic speaker verification. In this paper, we first present new results of evaluating the vulnerability of current state-of-the-art speaker verification systems, Gaussian mixture model with joint factor analysis (GMM-JFA) and probabilistic linear discriminant analysis (PLDA) systems, against spoofing attacks. The spoofing attacks are simulated by two voice conversion techniques: Gaussian mixture model based conversion and unit selection based conversion. To reduce the false acceptance rate caused by spoofing attacks, we propose a general anti-spoofing framework for speaker verification systems, where a converted-speech detector is adopted as a post-processing module for the speaker verification system's acceptance decision. The detector decides whether the accepted claim is human speech or converted speech. A subset of the core task in the NIST SRE 2006 corpus is used to evaluate the vulnerability of the speaker verification systems and the performance of the converted-speech detector. The results indicate that both conversion techniques can increase the false acceptance rate of the GMM-JFA and PLDA systems, while the converted-speech detector can reduce the false acceptance rate.
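The post-processing cascade described in the abstract is simple to state in code: the detector only sees trials the verifier has already accepted, and it can veto them. A sketch with hypothetical names; the converted-speech detector itself is a separate model and is represented here only by its boolean decision:

```python
def verify_with_antispoofing(asv_score, asv_threshold, is_converted_speech):
    """Cascade: verifier decision first, converted-speech detector as a
    post-processing veto on accepts. Sketch of the framework's control
    flow only; both scores come from separately trained models."""
    if asv_score < asv_threshold:
        return "reject"                       # verifier already rejects
    # Accepted claim: pass it to the converted-speech detector.
    return "reject" if is_converted_speech else "accept"
```

Because the detector acts only on accepts, it can lower the false acceptance rate under spoofing without touching the verifier's behaviour on rejected trials.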
Low-Variance Multitaper MFCC Features: a Case Study in Robust Speaker Verification
2012
Cited by 13 (3 self)
In speech and audio applications, the short-term signal spectrum is often represented using mel-frequency cepstral coefficients (MFCCs) computed from a windowed discrete Fourier transform (DFT). Windowing reduces spectral leakage, but the variance of the spectrum estimate remains high. An elegant extension of the windowed DFT is the so-called multitaper method, which uses multiple time-domain windows (tapers) with frequency-domain averaging. Multitapers have received little attention in speech processing even though they produce low-variance features. In this paper, we propose the multitaper method for MFCC extraction with a practical focus. We provide, firstly, a detailed statistical analysis of MFCC bias and variance using autoregressive process simulations on the TIMIT corpus. For speaker verification experiments on the NIST 2002 and 2008 SRE corpora, we consider three Gaussian mixture model based classifiers: universal background model (GMM-UBM), support vector machine (GMM-SVM) and joint factor analysis (GMM-JFA). Multitapers improve MinDCF over the baseline windowed DFT by a relative 20.4% (GMM-SVM) and 13.7% (GMM-JFA) on the interview-interview condition in NIST 2008. The GMM-JFA system further reduces MinDCF by 18.7% on the telephone data. With these improvements and generally non-critical parameter selection, multitaper MFCCs are a viable candidate for replacing conventional MFCCs.
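The variance reduction comes from averaging periodograms computed with several orthogonal tapers instead of a single window. A self-contained numpy sketch using sine tapers, which have a closed form; this is an assumption for illustration, since multitaper MFCC work often uses other taper families (e.g. Thomson DPSS) with non-uniform weights:

```python
import numpy as np

def multitaper_spectrum(frame, n_tapers=6):
    """Low-variance power spectrum estimate: average the periodograms
    of the frame under several orthogonal sine tapers.
    frame: 1-D array of samples; returns the one-sided spectrum."""
    N = len(frame)
    n = np.arange(1, N + 1)
    spec = np.zeros(N // 2 + 1)
    for k in range(1, n_tapers + 1):
        # k-th sine taper (orthonormal family with a closed form).
        taper = np.sqrt(2.0 / (N + 1)) * np.sin(np.pi * k * n / (N + 1))
        spec += np.abs(np.fft.rfft(frame * taper)) ** 2
    return spec / n_tapers
```

The averaged spectrum would then be passed through the usual mel filterbank, log, and DCT stages to obtain multitaper MFCCs in place of the single-window estimate.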
Exploiting intra-conversation variability for speaker diarization
in INTERSPEECH, 2011
Cited by 13 (2 self)
In this paper, we propose a new approach to speaker diarization based on the Total Variability approach to speaker verification. Drawing on previous work in applying factor analysis priors to the diarization problem, we arrive at a simplified approach that exploits intra-conversation variability in the Total Variability space through the use of Principal Component Analysis (PCA). Using our proposed methods, we demonstrate the ability to achieve state-of-the-art performance (0.9% DER) in the diarization of summed-channel telephone data from the NIST 2008 SRE.
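The core intuition can be sketched in a few lines: within one conversation, the dominant principal component of the segment i-vectors tends to separate the two speakers. The following toy version splits segments by the sign of that projection; real systems add clustering, Viterbi resegmentation, and overlap handling, and the names here are illustrative:

```python
import numpy as np

def diarize_two_speakers(segment_ivectors):
    """Two-speaker labelling from intra-conversation PCA: project the
    centered segment i-vectors onto the leading principal component
    and split by sign. Sketch of the idea, not a full diarizer."""
    X = segment_ivectors - segment_ivectors.mean(axis=0)
    # Leading right singular vector = top intra-conversation PC.
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    proj = X @ Vt[0]
    return (proj > 0).astype(int)   # speaker labels 0/1
```

The sign of the labels is arbitrary (PCA directions are defined up to sign), which is harmless for diarization scoring since labels are permutation-invariant.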