SPEAKER RECOGNITION USING SYLLABLE-BASED CONSTRAINTS FOR CEPSTRAL FRAME SELECTION
Cited by 5 (4 self)
We describe a new GMM-UBM speaker recognition system that uses standard cepstral features but selects different frames of speech for different subsystems. Subsystems, or “constraints”, are based on syllable-level information and combined at the score level. Results on both the NIST 2006 and 2008 test data sets for the English telephone train and test condition reveal that a set of eight constraints performs extremely well, resulting in better performance than other commonly used cepstral models. Given the still largely unexplored world of possible constraints and combinations, it is likely that the approach can be improved even further. Index Terms: Speaker recognition, higher-level features, GMMs, cepstral features, MFCCs, syllables. The resulting system outperforms SRI’s otherwise top current cepstral-based systems on English telephone data, for both the NIST SRE 2006 and NIST SRE 2008 test data sets.
Prosodic and other Long-Term Features for Speaker Diarization
Cited by 4 (3 self)
Speaker diarization is defined as the task of determining “who spoke when” given an audio track and no other prior knowledge of any kind. The following article shows how a state-of-the-art speaker diarization system can be improved by combining traditional short-term features (MFCCs) with prosodic and other long-term features. First, we present a framework to study the speaker discriminability of 70 different long-term features. Then, we show how the top-ranked long-term features can be combined with short-term features to increase the accuracy of speaker diarization. The results were measured on standardized datasets (NIST RT) and show a consistent improvement of about 30% relative in diarization error rate compared to the best system presented at the NIST evaluation in 2007. Index Terms: Long-term features, prosody, speaker diarization.
L.: A Text-Constrained Prosodic System for Speaker Verification. In: Proceedings of Interspeech, 2007
Cited by 2 (1 self)
We describe four improvements to a prosody SVM system, including a new method based on text- and part-of-speech-constrained prosodic features. The improved system shows remarkably good performance on NIST SRE06 data, reducing the error rate of an MLLR system by as much as 23% after combination. In addition, an N-best system analysis using eight systems reveals that the prosody SVM is the third and second most important system for 1- and 8-side training conditions, respectively, providing more complementary information than other state-of-the-art cepstral systems. We conclude that as cepstral systems continue to improve, it should become only more important to develop systems based on higher-level features.
A New Adaptation Approach to High-Level Speaker-Model Creation in Speaker Verification
Cited by 2 (2 self)
Research has shown that speaker verification based on high-level speaker features requires long enrollment utterances to guarantee a low error rate during verification. However, in practical speaker verification, it is common to model speakers based on a limited amount of enrollment data, which makes the speaker models unreliable. This paper proposes a new adaptation method for creating high-level speaker models to alleviate this undesirable effect. Unlike conventional methods in which only the phoneme-dependent background model is adapted, the proposed adaptation method also adapts the phoneme-independent speaker model to fully utilize all the information available in the training data. A proportional factor, which is derived from the ratio between the phoneme-dependent background model and the phoneme-independent background model, is used to adjust the phoneme-independent speaker models during adaptation. The proposed method was evaluated under the NIST 2000 and NIST 2002 SRE frameworks. Experimental results show that the proposed adaptation method can alleviate the data-sparseness problem effectively and achieves better performance than traditional MAP adaptation. Key words: speaker verification, high-level features, model adaptation, maximum a posteriori (MAP) adaptation
S.: Duration and Pronunciation Conditioned Lexical Modeling for Speaker Verification. In: Proceedings of Interspeech, 2007
Cited by 1 (1 self)
We propose a method to improve speaker recognition lexical model performance using acoustic-prosodic information. More specifically, the lexical model is trained using duration- and pronunciation-conditioned word N-grams, simultaneously modeling lexical information along with its acoustic and prosodic characteristics. Support vector machines are used for modeling and scoring, with N-gram frequency vectors serving as features. Experimental results using NIST Speaker Recognition Evaluation data sets show that this method outperforms the regular word N-gram-based lexical models. Furthermore, our approach gives additional information when combined with a high-accuracy acoustic speaker model. We believe that this is a promising step toward integrated speaker recognition models that combine multiple types of high-level features. Index Terms: speaker verification, speaker recognition, lexical modeling, SVM.
Speaker Diarization: A Review of Recent Research, 2010
Cited by 1 (0 self)
Speaker diarization is the task of determining “who spoke when?” in an audio or video recording that contains an unknown amount of speech and an unknown number of speakers. Initially, it was proposed as a research topic related to automatic speech recognition, where speaker diarization serves as an upstream processing step. Over recent years, however, speaker diarization has become a key technology for many tasks, such as navigation, retrieval, or higher-level inference on audio data. Accordingly, many important improvements in accuracy and robustness have been reported in journals and conferences in the area. The application domains, ranging from broadcast news to lectures and meetings, vary greatly and pose different problems, such as having access to multiple microphones and multimodal information or overlapping speech. The most recent review of existing technology dates back to 2006 and focuses on the broadcast news domain. In this paper we review the current state of the art, focusing on research developed since 2006 that relates predominantly to speaker diarization for conference meetings. Finally, we present an analysis of speaker diarization performance as reported through the NIST Rich Transcription evaluations on meeting data and identify important areas for future research. Index Terms: Speaker diarization, rich transcription, meetings
IMPROVING ACOUSTIC SPEAKER VERIFICATION WITH VISUAL BODY-LANGUAGE FEATURES
Cited by 1 (1 self)
We show how an SVM-based acoustic speaker verification system can be significantly improved by incorporating new visual features that capture the speaker’s “Body Language.” We apply this system to many hours of Internet videos and TV broadcasts of politicians and other public figures. Our data ranges from current and former US election candidates to the Queen of England, the President of France, and the Pope, while giving speeches. Index Terms: Speaker recognition, Machine vision,
FUSING SHORT TERM AND LONG TERM FEATURES FOR IMPROVED SPEAKER DIARIZATION
Cited by 1 (0 self)
The following article shows how a state-of-the-art speaker diarization system can be improved by combining traditional short-term features (MFCCs) with prosodic and other long-term features. First, we present a framework to study the speaker discriminability of 70 different long-term features. Then, we show how the top-ranked long-term features can be combined with short-term features to increase the accuracy of speaker diarization. The results were measured on standardized data sets (NIST RT) and show a consistent improvement of about 30% relative in diarization error rate compared to the best system presented at the NIST evaluation in 2007. This result was also verified on a wide set of meetings, which we call CombDev, that contains 21 meetings from previous evaluations. Since the prosodic and long-term features were selected using a diarization-independent speaker-discriminability study, we are confident that the same features are able to improve other systems that perform similar tasks.
The Case for Automatic Higher-Level Features in Forensic Speaker Recognition
Cited by 1 (0 self)
Approaches from standard automatic speaker recognition, which rely on cepstral features, suffer from a lack of interpretability for forensic applications. But the growing practice of using “higher-level” features in automatic systems offers promise in this regard. We provide an overview of automatic higher-level systems and discuss potential advantages, as well as issues, for their use in the forensic context. Index Terms: speaker recognition, higher-level features, forensics
Darwin Phones: the Evolution of Sensing and Inference on Mobile Phones
Cited by 1 (0 self)
We present Darwin, an enabling technology for mobile phone sensing that combines collaborative sensing and classification techniques to reason about human behavior and context on mobile phones. Darwin advances mobile phone sensing through the deployment of efficient but sophisticated machine learning techniques specifically designed to run directly on sensor-enabled mobile phones (i.e., smartphones). Darwin tackles three key sensing and inference challenges that are barriers to mass-scale adoption of mobile phone sensing applications: (i) the human burden of training classifiers, (ii) the ability to perform reliably in different environments (e.g., indoor, outdoor), and (iii) the ability to scale to a large number of phones without jeopardizing the “phone experience” (e.g., usability and battery lifetime). Darwin is a collaborative reasoning framework built on three concepts: classifier/model evolution, model pooling, and collaborative inference. To the best of our knowledge, Darwin is the first system that applies distributed machine learning techniques and collaborative inference concepts to mobile phones. We implement the Darwin system on the Nokia N97 and Apple iPhone. While Darwin represents a general framework applicable to a wide variety of emerging mobile sensing applications, we implement a speaker recognition application and an augmented reality application to evaluate the benefits of Darwin. We show experimental results from eight individuals carrying Nokia N97s and demonstrate that Darwin improves the reliability and scalability of the proof-of-concept speaker recognition application without additional burden to users.

