Results 1 - 10 of 157
Temporally weighted linear prediction features for tackling additive noise in speaker verification
2010
Cited by 26 (19 self)
We consider text-independent speaker verification under additive noise corruption. In the popular mel-frequency cepstral coefficient (MFCC) front-end, we substitute the conventional Fourier-based spectrum estimation with weighted linear predictive methods, which have earlier shown success in noise-robust speech recognition. We introduce two temporally weighted variants of linear predictive (LP) modeling to speaker verification and compare them to the FFT, which is normally used in computing MFCCs, and to conventional LP. We also investigate the effect of speech enhancement (spectral subtraction) on system performance with each of the four feature representations. Our experiments on the NIST 2002 SRE corpus indicate that the accuracies of the conventional and proposed features are close to each other on clean data. At the 0 dB SNR level, the baseline FFT and the better of the proposed features give EERs of 17.4% and 15.6%, respectively. These accuracies improve to 11.6% and 11.2%, respectively, when spectral subtraction is included as a pre-processing method. The new features hold promise for noise-robust speaker verification.
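As a rough illustration of the idea behind temporally weighted LP, the sketch below weights each squared prediction residual by the short-time energy of the preceding samples, so low-energy (noise-dominated) regions contribute less to the model fit. The specific weighting, the covariance-style normal equations and all parameter values are illustrative assumptions, not the paper's exact recipe:

```python
import numpy as np

def wlp_coefficients(x, order=10):
    """Weighted linear prediction (WLP): the squared residual at each
    sample is weighted by the short-time energy of the `order` preceding
    samples (an assumed, common choice of temporal weight)."""
    n0, N = order, len(x)
    w = np.array([np.sum(x[n - order:n] ** 2) + 1e-8 for n in range(n0, N)])
    # weighted covariance-method normal equations R a = r
    R = np.zeros((order, order))
    r = np.zeros(order)
    for i in range(1, order + 1):
        r[i - 1] = np.sum(w * x[n0:N] * x[n0 - i:N - i])
        for j in range(1, order + 1):
            R[i - 1, j - 1] = np.sum(w * x[n0 - i:N - i] * x[n0 - j:N - j])
    return np.linalg.solve(R, r)  # predictor coefficients a_1..a_p

def lp_spectrum(a, nfft=512):
    """All-pole magnitude spectrum 1/|A(e^{jw})| from LP coefficients;
    in an MFCC front-end this would replace the FFT magnitude spectrum."""
    A = np.fft.rfft(np.concatenate(([1.0], -a)), nfft)
    return 1.0 / np.abs(A)
```

In a front-end like the one described, the per-frame WLP spectrum would then be passed through the usual mel filterbank, log and DCT stages to obtain MFCCs.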
Vulnerability of Speaker Verification Systems Against Voice Conversion Spoofing Attacks: the Case of Telephone Speech
Cited by 23 (10 self)
Voice conversion – the methodology of automatically converting one's utterances to sound as if spoken by another speaker – presents a threat to applications relying on speaker verification. We study the vulnerability of text-independent speaker verification systems against voice conversion attacks using telephone speech. We implemented voice conversion systems with two types of features and nonparallel frame alignment methods, and five speaker verification systems ranging from simple Gaussian mixture models (GMMs) to a state-of-the-art joint factor analysis (JFA) recognizer. Experiments on a subset of the NIST 2006 SRE corpus indicate that the JFA method is the most resilient against conversion attacks, but even it experiences a more than five-fold increase in false acceptance rate, from 3.24% to 17.33%. Index Terms: speaker verification, voice conversion, security
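The vulnerability quoted above is expressed through the false acceptance rate. As a minimal sketch (the threshold and scores here are hypothetical, not the paper's), FAR is simply the fraction of impostor trials, e.g. converted-voice trials, whose score exceeds the verification threshold:

```python
import numpy as np

def false_acceptance_rate(impostor_scores, threshold):
    """FAR: fraction of impostor trials wrongly accepted, i.e. whose
    verification score exceeds the decision threshold."""
    impostor_scores = np.asarray(impostor_scores, dtype=float)
    return float(np.mean(impostor_scores > threshold))
```

A voice conversion attack shifts the impostor score distribution toward the target speaker's scores, so at a fixed threshold the FAR grows, which is exactly the 3.24% to 17.33% jump reported.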
Low-Variance Multitaper MFCC Features: a Case Study in Robust Speaker Verification
2012
Cited by 13 (3 self)
In speech and audio applications, the short-term signal spectrum is often represented using mel-frequency cepstral coefficients (MFCCs) computed from a windowed discrete Fourier transform (DFT). Windowing reduces spectral leakage, but the variance of the spectrum estimate remains high. An elegant extension to the windowed DFT is the so-called multitaper method, which uses multiple time-domain windows (tapers) with frequency-domain averaging. Multitapers have received little attention in speech processing even though they produce low-variance features. In this paper, we propose the multitaper method for MFCC extraction with a practical focus. We first provide a detailed statistical analysis of MFCC bias and variance using autoregressive process simulations on the TIMIT corpus. For speaker verification experiments on the NIST 2002 and 2008 SRE corpora, we consider three Gaussian mixture model based classifiers: universal background model (GMM-UBM), support vector machine (GMM-SVM) and joint factor analysis (GMM-JFA). Multitapers improve MinDCF over the baseline windowed DFT by a relative 20.4% (GMM-SVM) and 13.7% (GMM-JFA) on the interview-interview condition in NIST 2008. The GMM-JFA system further reduces MinDCF by 18.7% on the telephone data. With these improvements and generally noncritical parameter selection, multitaper MFCCs are a viable candidate for replacing conventional MFCCs.
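The core estimator is easy to sketch: average the periodograms obtained with several orthogonal tapers instead of using one window. The sketch below uses Slepian (DPSS) tapers from SciPy; the taper count and bandwidth parameter are illustrative assumptions, not the paper's tuned settings:

```python
import numpy as np
from scipy.signal.windows import dpss

def multitaper_spectrum(frame, n_tapers=6, nw=4.0, nfft=512):
    """Multitaper spectrum estimate: one periodogram per DPSS taper,
    then a uniform average across tapers. Averaging nearly independent
    estimates lowers the variance compared to a single-window DFT."""
    tapers = dpss(len(frame), NW=nw, Kmax=n_tapers)   # shape (K, N)
    periodograms = np.abs(np.fft.rfft(tapers * frame, nfft, axis=1)) ** 2
    return periodograms.mean(axis=0)
```

Feeding this spectrum (instead of the single-window periodogram) into the standard mel filterbank, log and DCT chain yields the low-variance multitaper MFCCs the paper studies.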
Variational Bayes logistic regression as regularized fusion for NIST SRE 2010
- in Proc. Odyssey: the Speaker and Language Recognition Workshop
2012
Cited by 10 (8 self)
Fusion of base classifiers is seen as a way to achieve high performance in state-of-the-art speaker verification systems. Typically, we look for base classifiers that are complementary. We might also be interested in reinforcing good base classifiers by including others that are similar to them. In any case, the final ensemble size is typically small and has to be formed based on some rules of thumb. We are interested in finding a subset of classifiers that has good generalization performance. We approach the problem from a sparse learning point of view. We assume that the true, but unknown, fusion weights are sparse. As a practical solution, we regularize the weighted logistic regression loss function with elastic-net and LASSO constraints. However, all regularization methods have an additional parameter that controls the amount of regularization employed, and this needs to be tuned separately. In this work, we use a variational Bayes approach to automatically obtain sparse solutions without additional cross-validation. The variational Bayes method improves on the baseline method in 3 out of 4 sub-conditions. Index Terms: logistic regression, regularization, compressed sensing, linear fusion, speaker verification
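The variational Bayes solution itself is beyond a short sketch, but the LASSO-regularized logistic regression fusion it builds on can be illustrated directly. The sketch below fits the fusion weights by proximal gradient descent (ISTA); the hyperparameter values are illustrative, and in the regularized baseline the penalty weight is exactly the quantity that would need separate tuning:

```python
import numpy as np

def lasso_logistic_fusion(scores, labels, lam=0.05, lr=0.1, n_iter=2000):
    """Sparse score fusion: L1-regularized logistic regression fit by
    proximal gradient (ISTA). `scores` is (n_trials, n_systems), one
    base-classifier score per column; `labels` are 0/1 (nontarget /
    target). The L1 penalty drives some fusion weights to exactly zero,
    i.e. it performs classifier ensemble selection."""
    n, m = scores.shape
    w, b = np.zeros(m), 0.0
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(scores @ w + b)))
        w -= lr * (scores.T @ (p - labels) / n)
        b -= lr * np.mean(p - labels)
        # soft-thresholding: the proximal step for the L1 penalty
        w = np.sign(w) * np.maximum(np.abs(w) - lr * lam, 0.0)
    return w, b
```

Running this on scores where only some subsystems are informative leaves the uninformative ones with zero weight, which is the sparse-ensemble behavior both the regularized and the variational Bayes formulations aim for.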
Sparse classifier fusion for speaker verification
- IEEE Transactions on Audio, Speech and Language Processing
2013
Cited by 9 (9 self)
State-of-the-art speaker verification systems take advantage of a number of complementary base classifiers by fusing them to arrive at reliable verification decisions. In speaker verification, fusion is typically implemented as a weighted linear combination of the base classifier scores, where the combination weights are estimated using a logistic regression model. An alternative approach is classifier ensemble selection, which can be seen as sparse regularization applied to logistic regression. Even though score fusion has been extensively studied in speaker verification, classifier ensemble selection is much less explored. In this work, we extensively study sparse classifier fusion on a collection of twelve I4U spectral subsystems on the NIST 2008 and 2010 speaker recognition evaluation (SRE) corpora. Index Terms: classifier ensemble selection, experimentation, linear fusion, speaker verification
I-vectors meet imitators: on vulnerability of speaker verification systems against voice mimicry
Cited by 9 (2 self)
Voice imitation is mimicry of another speaker's voice characteristics and speech behavior. Professional voice mimicry can create entertaining, yet realistic-sounding target speaker renditions. As mimicry tends to exaggerate prosodic, idiosyncratic and lexical behavior, it is unclear how modern spectral-feature automatic speaker verification systems respond to mimicry "attacks". We study the vulnerability of two well-known speaker recognition systems: the traditional Gaussian mixture model – universal background model (GMM-UBM) and a state-of-the-art i-vector classifier with cosine scoring. The material consists of one professional Finnish imitator impersonating five well-known Finnish public figures. In a carefully controlled setting, the mimicry attack slightly increases the false acceptance rate for the i-vector system, but the increase is not alarmingly large in comparison to voice conversion or playback attacks. Index Terms: voice imitation, speaker recognition, mimicry attack
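The cosine scoring step used with the i-vector classifier is straightforward to sketch. The i-vector extraction pipeline itself is not reproduced here; given two extracted i-vectors, the verification score is just their cosine similarity, thresholded to accept or reject:

```python
import numpy as np

def cosine_score(enroll_ivec, test_ivec):
    """Cosine similarity between an enrolled speaker's i-vector and the
    test utterance's i-vector; verification accepts when the score
    exceeds a tuned threshold."""
    num = float(np.dot(enroll_ivec, test_ivec))
    den = float(np.linalg.norm(enroll_ivec) * np.linalg.norm(test_ivec))
    return num / den
```

Because the score depends only on the angle between the vectors, any normalization that rescales i-vectors (a common preprocessing step) leaves it unchanged.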
Modulation Features for Noise Robust Speaker Identification
Cited by 8 (7 self)
Current state-of-the-art speaker identification (SID) systems perform exceptionally well under clean conditions, but their performance deteriorates when noise and channel degradations are introduced. The literature has mostly focused on robust modeling techniques to combat degradations due to background noise and/or channel effects, and has demonstrated significant improvement in SID performance in noise. In this paper, we present a robust acoustic feature, used on top of robust modeling techniques, to further improve speaker identification performance. We propose Modulation features of Medium Duration sub-band Speech Amplitudes (MMeDuSA), an acoustic feature motivated by human auditory processing, which is robust to noise corruption and captures speaker stylistic differences. We analyze the performance of MMeDuSA with SRI International's robust SID system on a channel- and noise-degraded multilingual corpus distributed through the Defense Advanced Research Projects Agency (DARPA) Robust Automatic Transcription of Speech (RATS) program. When benchmarked against standard cepstral features (MFCCs) and other noise-robust acoustic features, MMeDuSA provided lower SID error rates than the others. Index Terms: noise-robust speaker identification, modulation features, noise-robust acoustic features
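A generic illustration of the idea behind sub-band modulation features (this is a simplified sketch of the general technique, not the MMeDuSA recipe; the band edges, filter order and modulation cutoff are all assumed values): band-pass the signal into a sub-band, take its amplitude envelope, and measure the energy of the slow envelope fluctuations.

```python
import numpy as np
from scipy.signal import butter, filtfilt, hilbert

def subband_modulation_energy(x, fs=8000, band=(300.0, 800.0), mod_max=20.0):
    """Energy of slow (< mod_max Hz) amplitude modulations inside one
    sub-band: band-pass filter, Hilbert envelope, then low-frequency
    envelope spectrum energy. A full feature would repeat this over a
    filter bank of sub-bands and frame the result over time."""
    b, a = butter(4, [band[0] / (fs / 2), band[1] / (fs / 2)], btype="band")
    sub = filtfilt(b, a, x)
    env = np.abs(hilbert(sub))        # amplitude envelope of the sub-band
    env = env - np.mean(env)
    E = np.abs(np.fft.rfft(env)) ** 2
    freqs = np.fft.rfftfreq(len(env), 1.0 / fs)
    return float(np.sum(E[freqs <= mod_max]) / len(env))
```

Speech carries strong envelope modulations in roughly the 2-20 Hz range while stationary noise does not, which is why features built on such measurements tend to degrade more gracefully in noise.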
Spoofing and countermeasures for automatic speaker verification
Cited by 8 (4 self)
It is widely acknowledged that most biometric systems are vulnerable to spoofing, also known as imposture. While vulnerabilities and countermeasures for other biometric modalities, e.g. face verification, have been widely studied, speaker verification systems remain vulnerable. This paper describes some specific vulnerabilities studied in the literature and presents a brief survey of recent work to develop spoofing countermeasures. The paper concludes with a discussion of the standard datasets, metrics and formal evaluations needed to assess vulnerabilities to spoofing in realistic scenarios without prior knowledge. Index Terms: spoofing, imposture, automatic speaker verification
A practical, self-adaptive voice activity detector for speaker verification with noisy telephone and microphone data
Cited by 8 (5 self)
A voice activity detector (VAD) plays a vital role in robust speaker verification, where energy VAD is most commonly used. Energy VAD works well in noise-free conditions but deteriorates in noisy conditions. One way to tackle this is to introduce speech enhancement preprocessing. We study an alternative, likelihood-ratio-based VAD that trains speech and nonspeech models on an utterance-by-utterance basis from mel-frequency cepstral coefficients (MFCCs). The training labels are obtained from enhanced energy VAD. As the speech and nonspeech models are re-trained for each utterance, minimal assumptions about the background noise are made. According to both VAD error analysis and speaker verification results utilizing a state-of-the-art i-vector system, the proposed method outperforms energy VAD variants by a wide margin. We provide an open-source implementation of the method. Index Terms: voice activity detection, speaker verification
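The per-utterance bootstrapping can be sketched compactly. In this simplified sketch, a plain energy quantile stands in for the paper's enhanced energy VAD, single diagonal Gaussians stand in for its GMMs, and generic feature vectors stand in for MFCCs; only the overall self-adaptive scheme is the same:

```python
import numpy as np

def self_adaptive_vad(features, energies, init_quantile=0.6):
    """Per-utterance likelihood-ratio VAD sketch: initial speech labels
    from an energy threshold, then diagonal Gaussians for speech and
    nonspeech are trained on this same utterance and every frame is
    relabeled by the log-likelihood ratio, for a few passes."""
    labels = energies > np.quantile(energies, init_quantile)

    def loglik(X, mu, var):
        return -0.5 * np.sum(np.log(2 * np.pi * var) + (X - mu) ** 2 / var, axis=1)

    for _ in range(3):
        sp, ns = features[labels], features[~labels]
        if len(sp) < 2 or len(ns) < 2:
            break  # degenerate split; keep the current labels
        llr = (loglik(features, sp.mean(0), sp.var(0) + 1e-6)
               - loglik(features, ns.mean(0), ns.var(0) + 1e-6))
        labels = llr > 0.0
    return labels
```

Because both models are re-estimated from the utterance at hand, the detector adapts to whatever noise that utterance contains, which is the property the paper credits for its robustness.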
The Delta-Phase Spectrum with Application to Voice Activity Detection and Speaker Recognition
, 2011
Cited by 8 (1 self)
For several reasons, the Fourier phase domain is less favoured than the magnitude domain in signal processing and modelling of speech. To correctly analyse the phase, several factors must be considered and compensated for, including the effect of the step size, windowing function and other processing parameters. Building on a review of these factors, this paper investigates a spectral representation based on the instantaneous frequency deviation, but in which the step size between processing frames is used in calculating phase changes, rather than the traditional single-sample interval. Reflecting these longer intervals, the term Delta-Phase Spectrum is used to distinguish this from instantaneous derivatives. Experiments show that mel-frequency cepstral coefficient features derived from the Delta-Phase Spectrum (termed Mel-Frequency Delta-Phase features) can produce broadly similar performance to equivalent magnitude-domain features for both voice activity detection and speaker recognition tasks. Further, the fusion of the magnitude and phase representations is shown to yield performance benefits over either in isolation.
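The central compensation the abstract mentions can be sketched directly: the frame hop itself advances the phase at bin k by 2πk·step/N, and the delta-phase spectrum is the inter-frame phase change with that expected advance subtracted and wrapped back into (-π, π]. Frame length, hop and window below are assumed values, not the paper's settings:

```python
import numpy as np

def delta_phase_spectrum(x, frame_len=256, step=128):
    """Delta-phase spectrum sketch: change in Fourier phase between
    consecutive frames measured over the full frame step, compensated
    for the linear phase advance the hop introduces at each bin."""
    n_frames = (len(x) - frame_len) // step + 1
    win = np.hanning(frame_len)
    k = np.arange(frame_len // 2 + 1)
    expected = 2 * np.pi * k * step / frame_len   # phase advance due to the hop
    out, prev = [], None
    for i in range(n_frames):
        frame = x[i * step : i * step + frame_len] * win
        phase = np.angle(np.fft.rfft(frame))
        if prev is not None:
            d = phase - prev - expected
            out.append(np.angle(np.exp(1j * d)))  # wrap to (-pi, pi]
        prev = phase
    return np.array(out)   # shape (n_frames - 1, frame_len // 2 + 1)
```

For a stationary tone centred on a DFT bin, the measured phase change equals the expected hop-induced advance, so the delta-phase there is near zero; deviations from zero carry the frequency-deviation information the features are built from.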