An overview of automatic speaker diarization systems
- IEEE TASLP, 2006
Cited by 100 (2 self)
Abstract: Audio diarization is the process of annotating an input audio channel with information that attributes (possibly overlapping) temporal regions of signal energy to their specific sources. These sources can include particular speakers, music, background noise sources, and other signal source/channel characteristics. Diarization can be used for helping speech recognition, facilitating the searching and indexing of audio archives, and increasing the richness of automatic transcriptions, making them more readable. In this paper, we provide an overview of the approaches currently used in a key area of audio diarization, namely speaker diarization, and discuss their relative merits and limitations. Performances using the different techniques are compared within the framework of the speaker diarization task in the DARPA EARS Rich Transcription evaluations. We also look at how the techniques are being introduced into real broadcast news systems and their portability to other domains and tasks such as meetings and speaker verification. Index Terms: Speaker diarization, speaker segmentation and clustering.
Robust speaker recognition in noisy conditions
- IEEE TRANS. AUDIO, SPEECH LANG. PROCESS, 2007
Improving Speaker Diarization
- IN PROC. FALL 2004 RICH TRANSCRIPTION WORKSHOP (RT-04, 2004
Cited by 22 (2 self)
This paper describes the LIMSI speaker diarization system used in the RT-04F evaluation. The RT-04F system builds upon the LIMSI baseline data partitioner, which is used in the broadcast news transcription system. This partitioner provides a high cluster purity but has a tendency to split the data from a speaker into several clusters when there is a large quantity of data for the speaker. In the RT-03S evaluation, the baseline partitioner had a 24.5% diarization error rate. Several improvements to the baseline diarization system have been made. A standard Bayesian information criterion (BIC) agglomerative clustering has been integrated, replacing the iterative Gaussian mixture model (GMM) clustering; a local BIC criterion is used for comparing single Gaussians with full covariance matrices. A second clustering stage has been added, making use of a speaker identification method: maximum a posteriori adaptation of a reference GMM with 128 Gaussians. A final post-processing stage refines the segment boundaries using the output of the transcription system. Compared to the best configuration baseline system for this task, the improved system reduces the speaker error time by over 75% on the development data. On evaluation data, an 8.5% overall diarization error rate was obtained, a 60% reduction in error compared to the baseline.
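The BIC merging test mentioned in this abstract can be written out as a short formula. The block below is a sketch of one commonly used form for two clusters modeled by single full-covariance Gaussians; the symbols (n_i, n_j, Sigma, lambda, d, N) are notation introduced here, not taken from the abstract, and constant factors differ between papers and can be absorbed into lambda.

```latex
% c_i, c_j : candidate clusters with n_i and n_j feature frames
% \Sigma_i, \Sigma_j : their full covariance matrices
% \Sigma : covariance of the merged cluster c_i \cup c_j
% d : feature dimension, \lambda : tunable penalty weight (assumed notation)
\Delta\mathrm{BIC}(c_i, c_j)
  = (n_i + n_j)\log\lvert\Sigma\rvert
  - n_i\log\lvert\Sigma_i\rvert
  - n_j\log\lvert\Sigma_j\rvert
  - \lambda P,
\qquad
P = \frac{1}{2}\left(d + \frac{d(d+1)}{2}\right)\log N .
% Local BIC:  N = n_i + n_j  (only the two clusters being compared)
% Global BIC: N = total number of frames in the recording
% Under this sign convention, two clusters are merged while \Delta\mathrm{BIC} < 0.
```

With the local variant, the penalty depends only on the two clusters under comparison, so the merge decision does not change with the overall length of the recording.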
Combining Derivative and Parametric Kernels for Speaker Verification
- 2007
Cited by 18 (1 self)
Support Vector Machine-based speaker verification (SV) has become a standard approach in recent years. These systems typically use dynamic kernels to handle the dynamic nature of the speech utterances. This paper shows that many of these kernels fall into one of two general classes, derivative and parametric kernels. The attributes of these classes are contrasted and the conditions under which the two forms of kernel are identical are described. By avoiding these conditions, gains may be obtained by combining derivative and parametric kernels. One combination strategy is to combine at the kernel level. This paper describes a maximum-margin based scheme for learning kernel weights for the SV task. Various dynamic kernels and combinations were evaluated on the NIST 2002 SRE task, including derivative and parametric kernels based upon different model structures. The best overall performance was 7.78% EER, achieved when combining five kernels.
The MIT Mobile Device Speaker Verification Corpus: Data collection and preliminary experiments
- In Proc. of Odyssey, The Speaker & Language Recognition Workshop, 2006
Cited by 13 (3 self)
In this paper we discuss data collection and preliminary experiments for a new speaker verification corpus collected on a small handheld device in multiple environments using multiple microphones. This corpus, which has been made publicly available by MIT, is intended for explorations of the problem of robust speaker verification on handheld devices in noisy environments with limited training data. To provide a set of preliminary results, we examine text-dependent speaker verification under a variety of cross-conditional environment and microphone training constraints. Our preliminary results indicate that the presence of noise in the training data improves the robustness of our speaker verification models even when tested in mismatched environments.
Particle swarm optimization for sorted adapted Gaussian mixture models
- IEEE Trans. Audio, Speech, Lang. Process, 2009
Cited by 11 (6 self)
Abstract: Recently, we introduced the sorted Gaussian mixture models (SGMMs) algorithm, which provides the means to trade off performance for operational speed and thus permits the speed-up of GMM-based classification schemes. The performance of the SGMM algorithm depends on the proper choice of the sorting function and the proper adjustment of its parameters. In the present work, we employ particle swarm optimization (PSO) and an appropriate fitness function to find the most advantageous parameters of the sorting function. We evaluate the practical significance of our approach on the text-independent speaker verification task utilizing the NIST 2002 speaker recognition evaluation (SRE) database while following the NIST SRE experimental protocol. The experimental results demonstrate a superior performance of the SGMM algorithm using PSO when compared to the original SGMM. For comprehensiveness, we also compared these results with those from a baseline Gaussian mixture model–universal background model (GMM-UBM) system. The experimental results suggest that the performance loss due to speed-up is partially mitigated using PSO-derived weights in a sorted GMM-based scheme. Index Terms: Gaussian mixture model–universal background model (GMM-UBM), particle swarm optimization (PSO), sorted GMM, speed-up, text-independent speaker verification.
Image-set matching using a geodesic distance and cohort normalization
- In IEEE International Conference on Automatic Face and Gesture Recognition, 2008
Cited by 9 (3 self)
An image-set based face recognition algorithm is proposed that exploits the full geometrical interpretation of Canonical Correlation Analysis (CCA). CCA maximizes the correlation between two linear subspaces associated with image-sets, where an image-set is assumed to contain multiple images of a person’s face. When these linear subspaces are viewed as points on a Grassmann manifold, geodesic distance on the manifold becomes the natural way to compare image-sets. The proposed method is tested on the ORL data set, where it achieves a rank one identification rate of 98.75%. The proposed method is also tested on a subset of the Face Recognition Grand Challenge Experiment 4 data. Specifically, the subset comprises 82 probe and 230 gallery subjects, with 32 images per probe and gallery image-set. Our algorithm achieves a rank one identification rate of 87% and a verification rate of 81% at a false accept rate
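The link between CCA and the geodesic distance described in this abstract can be made explicit through principal angles. The block below is a sketch in notation introduced here (Q_1, Q_2, theta_k, m are not from the abstract); it gives the standard arc-length distance on the Grassmann manifold and makes no claim about the exact variant or normalization the paper uses.

```latex
% Q_1, Q_2 \in \mathbb{R}^{D \times m} : orthonormal bases of the two
% image-set subspaces (assumed notation, m = subspace dimension)
Q_1^{\top} Q_2 = U \,\mathrm{diag}(\cos\theta_1, \dots, \cos\theta_m)\, V^{\top},
\qquad 0 \le \theta_1 \le \dots \le \theta_m \le \tfrac{\pi}{2},
% The canonical correlations from CCA are the cosines \cos\theta_k
% of the principal angles between the two subspaces.
\qquad
d_{\mathrm{geo}}(Q_1, Q_2) = \Bigl(\sum_{k=1}^{m} \theta_k^{2}\Bigr)^{1/2}.
```

Two identical subspaces have all principal angles equal to zero and therefore distance zero; the distance grows as the subspaces rotate away from each other.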
Unsupervised Online Adaptation for Speaker Verification Over The Telephone
- IN PROC. SPEAKER ODYSSEY, 2004
Cited by 7 (0 self)
This paper presents experiments on unsupervised adaptation for a speaker detection system. The system used is a standard speaker verification system based on cepstral features and Gaussian mixture models. Experiments were performed on cellular speech data taken from the NIST 2002 speaker detection evaluation. There were about 30,000 trials in total, involving 330 target speakers, of which more than 90% were impostor trials. Unsupervised adaptation significantly increases the system accuracy, with a reduction of the minimal detection cost function (DCF) from 0.33 for the baseline system to 0.25 with unsupervised online adaptation. Two incremental adaptation modes were tested, either by using a fixed decision threshold for adaptation, or by using the a posteriori probability of the true target for weighting the adaptation. Both methods provide similar results in the best configurations, but the latter is less sensitive to the actual threshold value.
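As a rough illustration of the minimal DCF figure quoted in this abstract, the sketch below sweeps a decision threshold over a set of trial scores and keeps the lowest cost. The cost parameters (`c_miss=10`, `c_fa=1`, `p_target=0.01`) are assumptions chosen to resemble values typically used in NIST evaluations of that period; the abstract states neither the parameters nor whether its figures are normalized, so the `normalize` switch is likewise hypothetical.

```python
def detection_cost(p_miss, p_fa, c_miss=10.0, c_fa=1.0, p_target=0.01):
    """Weighted sum of miss and false-alarm rates (cost values are assumptions)."""
    return c_miss * p_miss * p_target + c_fa * p_fa * (1.0 - p_target)


def min_dcf(target_scores, impostor_scores, normalize=False,
            c_miss=10.0, c_fa=1.0, p_target=0.01):
    """Lowest detection cost over all decision thresholds (accept if score >= t)."""
    # Every observed score is a candidate threshold; the extra value above the
    # maximum covers the operating point where every trial is rejected.
    thresholds = sorted(set(target_scores) | set(impostor_scores))
    thresholds.append(thresholds[-1] + 1.0)
    best = float("inf")
    for t in thresholds:
        p_miss = sum(s < t for s in target_scores) / len(target_scores)
        p_fa = sum(s >= t for s in impostor_scores) / len(impostor_scores)
        best = min(best, detection_cost(p_miss, p_fa, c_miss, c_fa, p_target))
    if normalize:
        # Divide by the cost of the better trivial system (accept all or reject all).
        best /= min(c_miss * p_target, c_fa * (1.0 - p_target))
    return best


if __name__ == "__main__":
    targets = [2.0, 3.0, 4.0, 5.0]      # toy scores for true-speaker trials
    impostors = [0.0, 1.0, 2.5, 1.5]    # toy scores for impostor trials
    print(min_dcf(targets, impostors))
    print(min_dcf(targets, impostors, normalize=True))
```

Because the minimum is taken after the fact over all thresholds, this measures how well the scores separate the two classes rather than how well a threshold was chosen in advance.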
Discriminative adaptation for speaker verification
- in Proceedings InterSpeech, 2006
Cited by 7 (3 self)
Speaker verification is a binary classification task to determine whether a claimed speaker uttered a phrase. Current approaches to speaker verification tasks typically involve adapting a general speaker Universal Background Model (UBM), normally a Gaussian Mixture Model (GMM), to model a particular speaker. Verification is then performed by comparing the likelihoods from the speaker model to the UBM. Maximum A-Posteriori (MAP) adaptation is commonly used to adapt the UBM to a particular speaker. However, speaker verification is a classification task. Thus, robust discriminative-based adaptation schemes should yield gains over the standard MAP approach. This paper describes and evaluates two discriminative approaches to speaker verification. The first is a discriminative version of MAP based on Maximum Mutual Information (MMI-MAP). The second is to use an augmented-GMM (A-GMM) as the speaker-specific model. The additional, augmented, parameters are discriminatively, and robustly, trained using a maximum margin estimation approach. The performance of these models is evaluated on the NIST 2002 SRE dataset. Though no gains were obtained using MMI-MAP, the A-GMM system gave an Equal Error Rate (EER) of 7.31%, a 30% relative reduction in EER compared to the best performing GMM system. Index Terms: augmented statistical models, discriminative training, sequence kernels, speaker verification.
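The standard MAP baseline that this abstract contrasts against can be sketched in a few lines. The code below is an illustration only, assuming a diagonal-covariance UBM, mean-only adaptation, and a relevance factor of 16; none of these settings is stated in the abstract, and the function names are hypothetical.

```python
import math


def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at frame x."""
    return -0.5 * sum(math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
                      for xi, m, v in zip(x, mean, var))


def map_adapt_means(frames, weights, means, variances, relevance=16.0):
    """Relevance-MAP adaptation of UBM means (weights and variances unchanged)."""
    n_comp, dim = len(weights), len(means[0])
    counts = [0.0] * n_comp                          # soft frame counts n_k
    sums = [[0.0] * dim for _ in range(n_comp)]      # posterior-weighted frame sums
    for x in frames:
        logp = [math.log(w) + log_gauss_diag(x, m, v)
                for w, m, v in zip(weights, means, variances)]
        top = max(logp)                              # subtract max for stability
        post = [math.exp(lp - top) for lp in logp]
        norm = sum(post)
        for k in range(n_comp):
            gamma = post[k] / norm                   # component posterior
            counts[k] += gamma
            for d in range(dim):
                sums[k][d] += gamma * x[d]
    adapted = []
    for k in range(n_comp):
        alpha = counts[k] / (counts[k] + relevance)  # data-dependent mixing weight
        data_mean = ([s / counts[k] for s in sums[k]]
                     if counts[k] > 0.0 else means[k])
        adapted.append([alpha * e + (1.0 - alpha) * m
                        for e, m in zip(data_mean, means[k])])
    return adapted


if __name__ == "__main__":
    ubm_weights, ubm_means, ubm_vars = [1.0], [[0.0]], [[1.0]]
    speaker_frames = [[2.0]] * 16                    # toy enrollment data
    print(map_adapt_means(speaker_frames, ubm_weights, ubm_means, ubm_vars))
```

Components that see little enrollment data stay close to the UBM, while well-observed components move toward the speaker's data; the relevance factor controls how quickly that shift happens.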
Singer identification in rembetiko music
- Sound and Music Computing, 2007
Cited by 5 (2 self)
Abstract: In this paper, the problem of the automatic identification of a singer is investigated using methods known from speaker identification. Ways of using world models are presented and the use of Cepstral Mean Subtraction (CMS) is evaluated. To minimize differences due to musical style, we use a novel data set consisting of samples from Greek Rembetiko music, which are very similar in style. The data set also explores for the first time the influence of the recording quality, by including many historical gramophone recordings. Experimental evaluations show the benefits of world models for frame selection and CMS, resulting in an average classification accuracy of about 81% among 21 different singers.
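Cepstral Mean Subtraction, evaluated in this abstract, is simple enough to sketch directly. The code below assumes per-recording normalization over all frames of one excerpt; the abstract does not say which window or scope the authors use, and the function name is hypothetical.

```python
def cepstral_mean_subtraction(frames):
    """Subtract the per-coefficient mean over all frames of one recording."""
    if not frames:
        return []
    dim = len(frames[0])
    # Mean of each cepstral coefficient across time.
    mean = [sum(f[d] for f in frames) / len(frames) for d in range(dim)]
    return [[f[d] - mean[d] for d in range(dim)] for f in frames]


if __name__ == "__main__":
    # Two toy frames with two cepstral coefficients each.
    print(cepstral_mean_subtraction([[1.0, 10.0], [3.0, 14.0]]))
```

A stationary channel effect, such as the frequency response of an old gramophone recording, appears as a constant offset in the cepstral domain, which is why removing the mean can reduce mismatch between recordings.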