Results 1 - 10
of
34
An overview of text-independent speaker recognition: from features to supervectors
, 2009
"... This paper gives an overview of automatic speaker recognition technology, with an emphasis on text-independent recognition. Speaker recognition has been studied actively for several decades. We give an overview of both the classical and the state-of-the-art methods. We start with the fundamentals of ..."
Abstract
-
Cited by 156 (37 self)
- Add to MetaCart
This paper gives an overview of automatic speaker recognition technology, with an emphasis on text-independent recognition. Speaker recognition has been studied actively for several decades. We give an overview of both the classical and the state-of-the-art methods. We start with the fundamentals of automatic speaker recognition, concerning feature extraction and speaker modeling. We elaborate advanced computational techniques to address robustness and session variability. The recent progress from vectors towards supervectors opens up a new area of exploration and represents a technology trend. We also provide an overview of this recent development and discuss the evaluation methodology of speaker recognition systems. We conclude the paper with discussion on future directions.
Evaluation of the Vulnerability of Speaker Verification to Synthetic Speech
"... In this paper, we evaluate the vulnerability of a speaker verification (SV) system to synthetic speech. Although this problem was first examined over a decade ago, dramatic improvements in both SV and speech synthesis have renewed interest in this problem. We use a HMM-based speech synthesizer, whic ..."
Abstract
-
Cited by 12 (3 self)
- Add to MetaCart
(Show Context)
In this paper, we evaluate the vulnerability of a speaker verification (SV) system to synthetic speech. Although this problem was first examined over a decade ago, dramatic improvements in both SV and speech synthesis have renewed interest in this problem. We use a HMM-based speech synthesizer, which creates synthetic speech for a targeted speaker through adaptation of a background model and a GMM-UBM-based SV system. Using 283 speakers from the Wall-Street Journal (WSJ) corpus, our SV system has a 0.4 % EER. When the system is tested with synthetic speech generated from speaker models derived from the WSJ journal corpus, 90 % of the matched claims are accepted. This result suggests a possible vulnerability in SV systems to synthetic speech. In order to detect synthetic speech prior to recognition, we investigate the use of an automatic speech recognizer (ASR), dynamic-timewarping (DTW) distance of mel-frequency cepstral coefficients (MFCC), and previously-proposed average inter-frame difference of log-likelihood (IFDLL). Overall, while SV systems have impressive accuracy, even with the proposed detector, high-quality synthetic speech can lead to an unacceptably high acceptance rate of synthetic speakers. 1.
REVISITING THE SECURITY OF SPEAKER VERIFICATION SYSTEMS AGAINST IMPOSTURE USING SYNTHETIC SPEECH
"... In this paper, we investigate imposture using synthetic speech. Although this problem was first examined over a decade ago, dramatic improvements in both speaker verification (SV) and speech synthesis have renewed interest in this problem. We use a HMM-based speech synthesizer which creates syntheti ..."
Abstract
-
Cited by 12 (3 self)
- Add to MetaCart
(Show Context)
In this paper, we investigate imposture using synthetic speech. Although this problem was first examined over a decade ago, dramatic improvements in both speaker verification (SV) and speech synthesis have renewed interest in this problem. We use a HMM-based speech synthesizer which creates synthetic speech for a targeted speaker through adaptation of a background model. We use two SV systems: standard GMM-UBM-based and a newer SVM-based. Our results show when the systems are tested with human speech, there are zero false acceptances and zero false rejections. However, when the systems are tested with synthesized speech, all claims for the targeted speaker are accepted while all other claims are rejected. We propose a two-step process for detection of synthesized speech in order to prevent this imposture. Overall, while SV systems have impressive accuracy, even with the proposed detector, high-quality synthetic speech will lead to an unacceptably high false acceptance rate. Index Terms — Speech synthesis, Speaker recognition, Security
Particle swarm optimization for sorted adapted Gaussian mixture models
- IEEE Trans. Audio, Speech, Lang. Process
, 2009
"... Abstract—Recently, we introduced the sorted Gaussian mixture models (SGMMs) algorithm providing the means to tradeoff per-formance for operational speed and thus permitting the speed-up of GMM-based classification schemes. The performance of the SGMM algorithm depends on the proper choice of the sor ..."
Abstract
-
Cited by 11 (6 self)
- Add to MetaCart
(Show Context)
Abstract—Recently, we introduced the sorted Gaussian mixture models (SGMMs) algorithm providing the means to tradeoff per-formance for operational speed and thus permitting the speed-up of GMM-based classification schemes. The performance of the SGMM algorithm depends on the proper choice of the sorting function, and the proper adjustment of its parameters. In the present work, we employ particle swarm optimization (PSO) and an appropriate fitness function to find the most advantageous parameters of the sorting function. We evaluate the practical significance of our approach on the text-independent speaker verification task utilizing the NIST 2002 speaker recognition evaluation (SRE) database while following the NIST SRE ex-perimental protocol. The experimental results demonstrate a superior performance of the SGMM algorithm using PSO when compared to the original SGMM. For comprehensiveness we also compared these results with those from a baseline Gaussian mix-ture model–universal background model (GMM-UBM) system. The experimental results suggest that the performance loss due to speed-up is partially mitigated using PSO-derived weights in a sorted GMM-based scheme. Index Terms—Gaussian mixture model–universal background model (GMM-UBM), particle swarm optimization (PSO), sorted GMM, speed-up, text-independent speaker verification. I.
Speaker Model Clustering for Efficient Speaker Identification
- in Large Population, IEEE Transaction on Audio, Speech and Language Processing, Volume
"... Abstract—In large population speaker identification (SI) systems, likelihood computations between an unknown speaker’s feature vectors and the registered speaker models can be very time-consuming and impose a bottleneck. For applications requiring fast SI, this is a recognized problem and improvemen ..."
Abstract
-
Cited by 9 (1 self)
- Add to MetaCart
(Show Context)
Abstract—In large population speaker identification (SI) systems, likelihood computations between an unknown speaker’s feature vectors and the registered speaker models can be very time-consuming and impose a bottleneck. For applications requiring fast SI, this is a recognized problem and improvements in efficiency would be beneficial. In this paper, we propose a method whereby GMM-based speaker models are clustered using a simple k-means algorithm. Then during the test stage only a small proportion of speaker models in selected clusters are used in the likelihood computations resulting in a significant speed-up with little to no loss in accuracy. In general, as the number of selected clusters is reduced, the identification accuracy decreases, however, this loss can be controlled through proper trade-off. The proposed method may also be combined with other test stage speed-up techniques resulting in even greater speed-up gains without additional sacrifices in accuracy. I.
Hautamäki, “Long-term F0 modeling for text-independent speaker recognition
- in Int. Conf. on Speech and Computer (SPECOM’2005
, 2005
"... ..."
(Show Context)
On the Use of Long-Term Average Spectrum in Automatic Speaker Recognition
"... Abstract. State-of-the-art automatic speaker recognition systems use mel-frequency cepstral coefficients (MFCC) features to describe the spectral properties of speakers. In forensic phonetics, the long-term average spectrum (LTAS) has been used for the same purpose. LTAS provides an intuitive graphi ..."
Abstract
-
Cited by 5 (2 self)
- Add to MetaCart
(Show Context)
Abstract. State-of-the-art automatic speaker recognition systems use mel-frequency cepstral coefficients (MFCC) features to describe the spectral properties of speakers. In forensic phonetics, the long-term average spectrum (LTAS) has been used for the same purpose. LTAS provides an intuitive graphical representation which can be used to visualize and quantify speaker differences. However, few studies have reported the use of LTAS in automatic speaker recognition. Thus, the purpose of this paper is to systematically study how to use the LTAS in automatic speaker recognition. We will also find out whether it provides additional discriminative information in respect to the MFCC-based system. 1
REDUCING SPEAKER MODEL SEARCH SPACE IN SPEAKER IDENTIFICATION
"... For large population speaker identification (SID) systems, likelihood computations between an unknown speaker’s test feature set and speaker models can be very time-consuming and detrimental to applications where fast SID is required. In this paper, we propose a method whereby speaker models are clu ..."
Abstract
-
Cited by 3 (3 self)
- Add to MetaCart
(Show Context)
For large population speaker identification (SID) systems, likelihood computations between an unknown speaker’s test feature set and speaker models can be very time-consuming and detrimental to applications where fast SID is required. In this paper, we propose a method whereby speaker models are clustered during the training stage. Then during the testing stage, only those clusters which are likely to contain high-likelihood speaker models are searched. The proposed method reduces the speaker model space which directly results in faster SID. Although there maybe a slight loss in identification accuracy depending on the number of clusters searched, this loss can be controlled by trading off speed and accuracy. 1.
Speaker identification in the presence of room reverberation
- in Proc. IEEE Biometrics Symp
, 2007
"... ABSTRACT Speaker identification (SI) systems based on Gaussian Mixture Models (GMMs) have demonstrated high levels of accuracy when both training and testing signals are acquired in near ideal conditions. These same systems when trained and tested with signals acquired under non-ideal channels such ..."
Abstract
-
Cited by 2 (1 self)
- Add to MetaCart
(Show Context)
ABSTRACT Speaker identification (SI) systems based on Gaussian Mixture Models (GMMs) have demonstrated high levels of accuracy when both training and testing signals are acquired in near ideal conditions. These same systems when trained and tested with signals acquired under non-ideal channels such as telephone have been shown to have markedly lower accuracy levels. In this paper, we consider a reverberant test environment and its impact on SI. We measure the degradation in SI accuracy when the system is trained with clean signals but tested with reverberant signals. Next, we propose a method whereby training signals are first filtered with a family of reverberation filters prior to construction of speaker models; the reverberation filters are designed to approximate expected test room reverberation. Reverberant test signals are then scored against the family of speaker models and identification is made. Our research demonstrates that by approximating test room reverberation in the training signals, the channel mismatch problem can be reduced and SI accuracy increased.
Joint Frame and Gaussian Selection for Text Independent Speaker Verification
"... Gaussian selection is a technique applied in the GMM-UBM framework to accelerate score calculation. We have recently introduced a novel Gaussian selection method known as sorted GMM (SGMM). SGMM uses scalar-indexing of the universal background model mean vectors to achieve fast search of the topscor ..."
Abstract
-
Cited by 2 (0 self)
- Add to MetaCart
(Show Context)
Gaussian selection is a technique applied in the GMM-UBM framework to accelerate score calculation. We have recently introduced a novel Gaussian selection method known as sorted GMM (SGMM). SGMM uses scalar-indexing of the universal background model mean vectors to achieve fast search of the topscoring Gaussians. In the present work we extend this method by using 2-dimensional indexing, which leads to simultaneous frame and Gaussian selection. Our results on the NIST 2002 speaker recognition evaluation corpus indicate that both the 1- and 2-dimensional SGMMs outperform frame decimation and temporal tracking of top-scoring Gaussians by a wide margin (in terms of Gaussian computations relative to GMM-UBM as baseline). Index Terms — speaker verification, Gaussian selection, particle swarm optimization