Low-Variance Multitaper MFCC Features: a Case Study in Robust Speaker Verification
, 2012
Cited by 13 (3 self)
In speech and audio applications, short-term signal spectrum is often represented using mel-frequency cepstral coefficients (MFCCs) computed from a windowed discrete Fourier transform (DFT). Windowing reduces spectral leakage but variance of the spectrum estimate remains high. An elegant extension to windowed DFT is the so-called multitaper method, which uses multiple time-domain windows (tapers) with frequency-domain averaging. Multitapers have received little attention in speech processing even though they produce low-variance features. In this paper, we propose the multitaper method for MFCC extraction with a practical focus. We provide, firstly, detailed statistical analysis of MFCC bias and variance using autoregressive process simulations on the TIMIT corpus. For speaker verification experiments on the NIST 2002 and 2008 SRE corpora, we consider three Gaussian mixture model based classifiers with universal background model (GMM-UBM), support vector machine (GMM-SVM) and joint factor analysis (GMM-JFA). Multitapers improve MinDCF over the baseline windowed DFT by relative 20.4 % (GMM-SVM) and 13.7 % (GMM-JFA) on the interview-interview condition in NIST 2008. The GMM-JFA system further reduces MinDCF by 18.7 % on the telephone data. With these improvements and generally non-critical parameter selection, multitaper MFCCs are a viable candidate for replacing the conventional MFCCs.
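The idea of averaging spectra over several tapers can be illustrated with a minimal NumPy sketch. It is not the implementation from the paper above: the choice of sine tapers, the uniform averaging weights, the function names, and the defaults (`num_tapers=6`, `nfft=512`) are all assumptions made for illustration.

```python
import numpy as np

def sine_tapers(frame_len, num_tapers):
    """Orthonormal sine tapers (an assumed taper family, chosen for simplicity)."""
    n = np.arange(1, frame_len + 1)
    k = np.arange(1, num_tapers + 1)[:, None]
    return np.sqrt(2.0 / (frame_len + 1)) * np.sin(np.pi * k * n / (frame_len + 1))

def multitaper_spectrum(frame, num_tapers=6, nfft=512):
    """Average the power spectra obtained with each taper (uniform weights assumed)."""
    frame = np.asarray(frame, dtype=float)
    tapers = sine_tapers(len(frame), num_tapers)          # shape (K, N)
    spectra = np.abs(np.fft.rfft(tapers * frame, n=nfft, axis=1)) ** 2
    return spectra.mean(axis=0)

def hamming_spectrum(frame, nfft=512):
    """Single-window baseline: power spectrum of a unit-energy Hamming-windowed frame."""
    frame = np.asarray(frame, dtype=float)
    w = np.hamming(len(frame))
    w = w / np.sqrt(np.sum(w ** 2))                       # same energy as each sine taper
    return np.abs(np.fft.rfft(w * frame, n=nfft)) ** 2
```

On white-noise frames, the averaged estimate should fluctuate noticeably less from frame to frame than the single-window estimate, which is the low-variance property the abstract refers to. A mel filterbank, logarithm, and DCT would follow in a full MFCC pipeline; they are omitted here.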
Sparse classifier fusion for speaker verification
 IEEE Transactions on Audio, Speech and Language Processing
, 2013
Cited by 9 (9 self)
Abstract—State-of-the-art speaker verification systems take advantage of a number of complementary base classifiers by fusing them to arrive at reliable verification decisions. In speaker verification, fusion is typically implemented as a weighted linear combination of the base classifier scores, where the combination weights are estimated using a logistic regression model. An alternative way for fusion is to use classifier ensemble selection, which can be seen as sparse regularization applied to logistic regression. Even though score fusion has been extensively studied in speaker verification, classifier ensemble selection is much less studied. In this study, we extensively study a sparse classifier fusion on a collection of twelve I4U spectral subsystems on the NIST 2008 and 2010 speaker recognition evaluation (SRE) corpora. Index Terms—Classifier ensemble selection, experimentation, linear fusion, speaker verification.
What Else is New Than the Hamming Window? Robust MFCCs for Speaker Recognition via Multitapering
Cited by 8 (3 self)
Usually the mel-frequency cepstral coefficients (MFCCs) are derived via Hamming windowed DFT spectrum. In this paper, we advocate to use a so-called multitaper method instead. Multitaper methods form a spectrum estimate using multiple window functions and frequency-domain averaging. Multitapers provide a robust spectrum estimate but have not received much attention in speech processing. Our speaker recognition experiment on NIST 2002 yields equal error rates (EERs) of 9.66 % (clean data) and 16.41 % (10 dB SNR) for the conventional Hamming method and 8.13 % (clean data) and 14.63 % (10 dB SNR) using multitapers. Multitapering is a simple and robust alternative to the Hamming window method. Index Terms: speaker verification, multiple window method
Regularized all-pole models for speaker verification under noisy environments
 IEEE Sig. Proc. Lett
, 2012
Cited by 6 (3 self)
Regularization of linear prediction based mel-frequency cepstral coefficient (MFCC) extraction in speaker verification is considered. Commonly, MFCCs are extracted from the discrete Fourier transform (DFT) spectrum of speech frames. In our recent study, it was shown that replacing the DFT spectrum estimation step with the conventional and temporally weighted linear prediction (LP) and their regularized versions increases the recognition performance considerably. In this paper, we provide a thorough analysis on the regularization of conventional and temporally weighted LP methods. Experiments on the NIST 2002 corpus indicate that regularized all-pole methods yield large improvements on recognition accuracy under additive factory and babble noise conditions in terms of both equal error rate (EER) and minimum detection cost function (MinDCF).
Classifier subset selection and fusion for speaker verification
 in ICASSP 2011
Cited by 6 (5 self)
State-of-the-art speaker verification systems consist of a number of complementary subsystems whose outputs are fused to arrive at more accurate and reliable verification decisions. In speaker verification, fusion is typically implemented as a linear combination of the subsystem scores. Parameters of the linear model are commonly estimated using the logistic regression method, as implemented in the popular FoCal toolkit. In this paper, we study simultaneous use of classifier selection and fusion. We study four alternative fusion strategies, three score warping techniques, and provide interesting experimental bounds on optimal classifier subset selection. Detailed experiments are carried out on the NIST 2008 and 2010 SRE corpora. Index Terms — Classifier selection, linear fusion
M. Hansson-Sandsten, Comparing spectrum estimators in speaker verification under additive noise degradation
 in Proc. of the ICASSP (IEEE
, 2012
Cited by 5 (2 self)
Different short-term spectrum estimators for speaker verification under additive noise are considered. Conventionally, mel-frequency cepstral coefficients (MFCCs) are computed from discrete Fourier transform (DFT) spectra of windowed speech frames. Recently, linear prediction (LP) and its temporally weighted variants have been substituted as the spectrum analysis method in speech and speaker recognition. In this paper, 12 different short-term spectrum estimation methods are compared for speaker verification under additive noise contamination. Experimental results conducted on NIST 2002 SRE show that the spectrum estimation method has a large effect on recognition performance and stabilized weighted LP (SWLP) and minimum variance distortionless response (MVDR) methods yield approximately 7 % and 8 % relative improvements over the standard DFT method at -10 dB SNR level of factory and babble noises, respectively, in terms of equal error rate (EER). Index Terms — spectrum estimation, speaker verification
Using group delay functions from all-pole models for speaker recognition
Cited by 4 (2 self)
Popular features for speech processing, such as mel-frequency cepstral coefficients (MFCCs), are derived from the short-term magnitude spectrum, whereas the phase spectrum remains unused. While the common argument to use only the magnitude spectrum is that the human ear is phase-deaf, phase-based features have remained less explored due to additional signal processing difficulties they introduce. A useful representation of the phase is the group delay function, but its robust computation remains difficult. This paper advocates the use of group delay functions derived from parametric all-pole models instead of their direct computation from the discrete Fourier transform. Using a subset of the vocal effort data in the NIST 2010 speaker recognition evaluation (SRE) corpus, we show that group delay features derived via parametric all-pole models improve recognition accuracy, especially under high vocal effort. Additionally, the group delay features provide comparable or improved accuracy over conventional magnitude-based MFCC features. Thus, the use of group delay functions derived from all-pole models provides an effective way to utilize information from the phase spectrum of speech signals. Index Terms: speaker verification, group delay functions, high vocal effort
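As a rough illustration of how a group delay function can be obtained from an all-pole model rather than directly from the DFT phase, the sketch below fits linear prediction coefficients with the autocorrelation method and evaluates the group delay of 1/A(z) in closed form. The function names, the small diagonal loading term, and the default `nfft=512` are assumptions for this sketch; it does not reproduce the feature extraction used in the paper above.

```python
import numpy as np

def lp_coefficients(frame, order):
    """Autocorrelation-method linear prediction; returns [1, a1, ..., ap] of A(z)."""
    frame = np.asarray(frame, dtype=float)
    full = np.correlate(frame, frame, mode="full")
    r = full[len(frame) - 1 : len(frame) + order]         # lags 0..order
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    R += 1e-9 * r[0] * np.eye(order)                      # tiny loading, numerical safety only
    a = np.linalg.solve(R, -r[1 : order + 1])
    return np.concatenate(([1.0], a))

def allpole_group_delay(a, nfft=512):
    """Group delay (in samples) of H(z) = 1/A(z) on the rfft frequency grid.

    With A(w) = sum_n a[n] exp(-jwn) and B(w) = sum_n n a[n] exp(-jwn),
    the group delay of 1/A is -Re{B(w) / A(w)}.
    """
    a = np.asarray(a, dtype=float)
    n = np.arange(len(a))
    A = np.fft.rfft(a, n=nfft)
    B = np.fft.rfft(n * a, n=nfft)
    return -np.real(B / A)
```

For a single real pole at radius r, that is a = [1, -r], the formula gives a group delay of r / (1 - r) samples at zero frequency, which equals 1 for r = 0.5. Because no phase unwrapping is involved, the result is smooth as long as the model has no poles on the unit circle.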
Intonational Speaker Verification: A Study on Parameters and Performance Under Noisy Conditions
Cited by 2 (1 self)
Prosody-based speaker verification using fundamental frequency (f0) is considered. Our study consists of two phases. First, we do extensive optimization of parameters to establish a baseline system before dealing with noisy conditions. This includes a study of f0 extractor parameters, choice of features (discrete cosine transform, discrete Fourier transform, Legendre polynomials, linear prediction), f0 track interpolation (none, linear, Hermite), framing parameters and windowing (none, Hamming), f0 representation domain (linear, log), number of transformation coefficients and, finally, use of higher-level delta coefficients. Using the optimized parameters, we then explore the robustness of prosody features under white noise and factory noise degradations. Using a GMM-UBM system on the NIST 2006 SRE corpus, we reach an EER of 28.4 % and 27.6 % for the intonational and MFCC features respectively at -20 dB SNR white noise contamination; fusion of the two yields an EER of
SHOUT DETECTION IN NOISE
Cited by 2 (1 self)
For the task of detecting shouted speech in a noisy environment, this paper introduces a system based on mel frequency cepstral coefficient (MFCC) feature extraction, unsupervised frame dropping and Gaussian mixture model (GMM) classification. The evaluation material consists of phonemically identical speech and shouting as well as environmental noise of varying levels. The performance of the shout detection system is analyzed by varying the MFCC feature extraction with respect to 1) the feature vector length and 2) the spectrum estimation method. As for feature vector length, the best performance is obtained using 30 MFCC coefficients, which is more than what is conventionally used. In spectrum estimation, a scheme that combines a linear prediction spectrum envelope with spectral fine structure outperforms the conventional FFT. Index Terms — shout detection
Mixture Linear Prediction in Speaker Verification Under Vocal Effort Mismatch
Cited by 2 (1 self)
Abstract—This paper describes an approach to robust signal analysis using iterative parameter re-estimation of a mixture autoregressive (AR) model. The model’s focus can be adjusted by initialization of the target and non-target states. The variant examined in this study uses an i.i.d. mixture AR model and is designed to tackle the spectral biasing effect caused by the voice excitation in speech signals with variable fundamental frequency. In our speaker verification experiments, this method performed competitively against standard spectrum analysis techniques in non-mismatch conditions and showed significant improvements in vocal effort mismatch conditions. Index Terms—Robust acoustic features, speaker recognition, spectrum analysis, speech feature extraction.