Temporally weighted linear prediction features for tackling additive noise in speaker verification, 2010
Cited by 7 (6 self)
We consider text-independent speaker verification under additive noise corruption. In the popular mel-frequency cepstral coefficient (MFCC) front-end, we substitute the conventional Fourier-based spectrum estimation with weighted linear predictive methods, which have earlier shown success in noise-robust speech recognition. We introduce two temporally weighted variants of linear predictive (LP) modeling to speaker verification and compare them to FFT, which is normally used in computing MFCCs, and to conventional LP. We also investigate the effect of speech enhancement (spectral subtraction) on the system performance with each of the four feature representations. Our experiments on the NIST 2002 SRE corpus indicate that the accuracies of the conventional and proposed features are close to each other on clean data. At 0 dB SNR, the baseline FFT and the better of the proposed features give EERs of 17.4 % and 15.6 %, respectively. These EERs improve to 11.6 % and 11.2 %, respectively, when spectral subtraction is included as a pre-processing method. The new features hold promise for noise-robust speaker verification.
Robust Voice Activity Detection for Interview Speech in NIST Speaker Recognition Evaluation
Cited by 2 (2 self)
The introduction of interview speech in recent NIST Speaker Recognition Evaluations (SREs) has necessitated the development of robust voice activity detectors (VADs) that can work under very low signal-to-noise ratios (SNRs). This paper highlights the characteristics of interview speech files in NIST SREs and discusses the difficulties of detecting speech/non-speech segments in these files. To alleviate these difficulties, this paper proposes a VAD that uses noise reduction as a pre-processing step. A strategy to avoid the undesirable effects of impulsive signals and sinusoidal background signals on the VAD is also proposed. The proposed VAD is compared with the VAD in the ETSI-AMR speech coder for removing silence regions of interview speech files. The results show that the proposed VAD is more robust in detecting speech segments under very low SNR, leading to a significant performance gain in Common Conditions 1–4 of NIST 2008 SRE. Index Terms: voice activity detection; far-field microphone; speaker verification; noise reduction; spectral subtraction; NIST speaker recognition evaluations.
A PARTIAL LEAST SQUARES FRAMEWORK FOR SPEAKER RECOGNITION
Cited by 2 (2 self)
Modern approaches to speaker recognition (verification) operate in a space of “supervectors” created via concatenation of the mean vectors of a Gaussian mixture model (GMM) adapted from a universal background model (UBM). In this space, a number of approaches to model inter-class separability and nuisance attribute variability have been proposed. We develop a method for modeling the variability associated with each class (speaker) by using partial least squares, a latent variable modeling technique that isolates the most informative subspace for each speaker. The method is tested on NIST SRE 2008 data and provides promising results. The method is shown to be noise-robust and able to efficiently learn the subspace corresponding to a speaker from training data consisting of multiple utterances. Index Terms: partial least squares, speaker recognition, latent vector, GMM supervectors.
Classifier subset selection and fusion for speaker verification, in ICASSP 2011
Cited by 1 (1 self)
State-of-the-art speaker verification systems consist of a number of complementary subsystems whose outputs are fused to arrive at a more accurate and reliable verification decision. In speaker verification, fusion is typically implemented as a linear combination of the subsystem scores. Parameters of the linear model are commonly estimated using the logistic regression method, as implemented in the popular FoCal toolkit. In this paper, we study the simultaneous use of classifier selection and fusion. We study four alternative fusion strategies and three score warping techniques, and provide interesting experimental bounds on optimal classifier subset selection. Detailed experiments are carried out on the NIST 2008 and 2010 SRE corpora. Index Terms: classifier selection, linear fusion.
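As a rough illustration of the linear score fusion described in this abstract, the sketch below fits fusion weights by logistic regression using plain gradient descent. It is not the FoCal toolkit's implementation; the function names (`fuse`, `train_fusion`), the learning rate, the epoch count, and the toy scores are all assumptions made for illustration.

```python
import math


def fuse(scores, weights, bias):
    """Linear fusion: bias plus the weighted sum of one trial's subsystem scores."""
    return bias + sum(w * s for w, s in zip(weights, scores))


def train_fusion(trials, labels, lr=0.1, epochs=500):
    """Fit fusion weights by batch gradient descent on the logistic loss.

    trials: list of per-trial score lists (one score per subsystem)
    labels: 1 for a target (same-speaker) trial, 0 for a non-target trial
    """
    n_sys = len(trials[0])
    weights, bias = [0.0] * n_sys, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_sys, 0.0
        for scores, y in zip(trials, labels):
            # Sigmoid of the fused score is the predicted target probability.
            p = 1.0 / (1.0 + math.exp(-fuse(scores, weights, bias)))
            err = p - y
            grad_b += err
            for i, s in enumerate(scores):
                grad_w[i] += err * s
        bias -= lr * grad_b / len(trials)
        weights = [w - lr * g / len(trials) for w, g in zip(weights, grad_w)]
    return weights, bias


# Toy data (invented for illustration): two subsystems, six trials.
trials = [[2.0, 1.5], [1.0, 2.5], [1.8, 0.9],
          [-1.2, -0.4], [-0.5, -2.0], [-1.9, -1.1]]
labels = [1, 1, 1, 0, 0, 0]

weights, bias = train_fusion(trials, labels)
fused = [fuse(t, weights, bias) for t in trials]
```

The fused score for each trial is a single number that can be thresholded for the accept/reject decision. Classifier subset selection, as studied in the paper, would amount to choosing which subsystem columns enter `trials` before training.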
Low-Variance Multitaper MFCC Features: a Case Study in Robust Speaker Verification
Cited by 1 (1 self)
In speech and audio applications, the short-term signal spectrum is often represented using mel-frequency cepstral coefficients (MFCCs) computed from a windowed discrete Fourier transform (DFT). Windowing reduces spectral leakage, but the variance of the spectrum estimate remains high. An elegant extension to the windowed DFT is the so-called multitaper method, which uses multiple time-domain windows (tapers) with frequency-domain averaging. Multitapers have received little attention in speech processing even though they produce low-variance features. In this paper, we propose the multitaper method for MFCC extraction with a practical focus. We provide, firstly, a detailed statistical analysis of MFCC bias and variance using autoregressive process simulations on the TIMIT corpus. For speaker verification experiments on the NIST 2002 and 2008 SRE corpora, we consider three Gaussian mixture model based classifiers: with universal background model (GMM-UBM), support vector machine (GMM-SVM), and joint factor analysis (GMM-JFA). Multitapers improve MinDCF over the baseline windowed DFT by a relative 20.4 % (GMM-SVM) and 13.7 % (GMM-JFA) on the interview-interview condition in NIST 2008. The GMM-JFA system further reduces MinDCF by 18.7 % on the telephone data. With these improvements and generally noncritical parameter selection, multitaper MFCCs are a viable candidate for replacing the conventional MFCCs. Index Terms: mel-frequency cepstral coefficient (MFCC), multitaper, speaker verification, small-variance estimation.
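The multitaper idea in this abstract (several tapers applied to the same frame, with the tapered spectra averaged in the frequency domain) can be sketched as follows. This is a minimal sketch that assumes sine tapers with uniform weights and a naive DFT chosen to keep the example self-contained; the abstract does not state which taper families or weights the paper uses, and all function names here are hypothetical.

```python
import cmath
import math


def sine_tapers(n, k):
    """Return k orthonormal sine tapers, each of length n."""
    norm = math.sqrt(2.0 / (n + 1))
    return [[norm * math.sin(math.pi * (j + 1) * (t + 1) / (n + 1))
             for t in range(n)]
            for j in range(k)]


def power_spectrum(frame):
    """Squared-magnitude DFT of one frame (naive O(n^2) transform)."""
    n = len(frame)
    return [abs(sum(x * cmath.exp(-2j * math.pi * f * t / n)
                    for t, x in enumerate(frame))) ** 2
            for f in range(n // 2 + 1)]


def multitaper_spectrum(frame, k=6):
    """Average the k tapered power spectra with uniform weights."""
    tapers = sine_tapers(len(frame), k)
    spectra = [power_spectrum([w * x for w, x in zip(taper, frame)])
               for taper in tapers]
    return [sum(column) / k for column in zip(*spectra)]


# Toy frame (invented): a cosine at DFT bin 8 of a 64-sample frame.
frame = [math.cos(2 * math.pi * 8 * t / 64) for t in range(64)]
spectrum = multitaper_spectrum(frame, k=6)
```

In an MFCC front-end, `spectrum` would take the place of the single windowed-DFT power spectrum before the mel filterbank, logarithm, and DCT steps. With `k=1` the estimate reduces to an ordinary single-window spectrum; larger `k` trades frequency resolution for lower variance.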
Speaker Verification with Long-Term Ageing Data
Cited by 1 (1 self)
The change experienced by the voice due to ageing must be considered in the development of a long-term speaker verification system. This difficult, largely open research problem has received little attention to date. For this study, a new Speaker Ageing Database has been collected, containing speech from 18 speakers over a 30-60 year time span. A speaker verification evaluation of this data with a Gaussian Mixture Model-Universal Background Model system reveals that the verification scores of genuine speakers decrease progressively as the time span between training and testing increases, while the impostor scores are less affected. As a consequence, applying a decision threshold fixed at the time of enrolment results in a high classification error rate after only a few years. A stacked classifier method of introducing an ageing-dependent decision boundary is applied, significantly improving long-term verification accuracy. However, due to score variability at extended time spans, accurate classification remains a challenging research problem. The ageing-dependent classification approach introduced here represents a first step towards dealing with long-term ageing in speaker verification systems.
FISHERVOICE: A DISCRIMINANT SUBSPACE FRAMEWORK FOR SPEAKER RECOGNITION
We propose a new framework for speaker recognition, referred to as Fishervoice. It includes the design of a feature representation known as the structured score vector (SSV), which relates acoustic structures to “key” frames in an input utterance to capture relevant speaker characteristics. The framework also applies nonparametric Fisher’s discriminant analysis to map the SSVs into a compressed discriminant subspace, where matching is performed between a test sample and reference speaker samples to achieve speaker recognition. The objective is to reduce intra-speaker variability and emphasize discriminative class boundary information to facilitate speaker recognition. Experiments based on the XM2VTSDB corpus show that the Fishervoice framework gives superior performance compared with other commonly used approaches, e.g. GMM-UBM and Eigenvoice. Index Terms: speaker recognition, GMM, subspace model, discriminant analysis, Fishervoice.
On the use of perceptual Line Spectral pairs Frequencies and higher-order residual moments for Speaker Identification
Conventional Speaker Identification (SI) systems utilise spectral features like Mel-Frequency Cepstral Coefficients (MFCC) or Perceptual Linear Prediction (PLP) as a front-end module. Line Spectral pairs Frequencies (LSF) are a popular alternative representation of Linear Prediction Coefficients (LPC). In this paper, an investigation is carried out to extract LSF from perceptually modified speech. A new feature set extracted from the residual signal is also proposed. An SI system based on this residual feature, which contains information complementary to spectral characteristics, shows improved performance when fused with the conventional spectral feature based system as well as with the proposed perceptually modified LSF.
Approaching Human Listener Accuracy with Modern Speaker Verification
Being able to recognize people from their voice is a natural ability that we take for granted. Recent advances have shown significant improvement in automatic speaker recognition performance. Besides being able to process large amounts of data in a fraction of the time required by a human, automatic systems are now able to deal with diverse channel effects. The goal of this paper is to examine how a state-of-the-art automatic system performs in comparison with human listeners, and to investigate a strategy for human-assisted automatic speaker recognition, which is useful in forensic investigation. We set up an experimental protocol using data from the NIST SRE 2008 core set. A total of 36 listeners participated in the listening experiments from three sites, namely Australia, Finland and Singapore. The state-of-the-art automatic system achieved a 20 % error rate, whereas fusion of human listeners achieved 22 %.
Improving Monaural Speaker Identification by Double-Talk Detection
This paper describes a novel approach to improve monaural speaker identification where two speakers are present in a single-microphone recording. The goal is to identify both of the underlying speakers in the given mixture. The proposed approach is composed of a double-talk detector (DTD) as a preprocessor and a speaker identification back-end. We demonstrate that including the double-talk detector improves the speaker identification accuracy. Experiments on the GRID corpus show that including the DTD improves average recognition accuracy from 96.53 % to 97.43 %. Index Terms: speaker identification, double-talk detection, single-channel, Gaussian mixture models.

