Speaker Verification Using Adapted Gaussian Mixture Models
- Digital Signal Processing
, 2000
Cited by 385 (15 self)
In this paper we describe the major elements of MIT Lincoln Laboratory’s Gaussian mixture model (GMM)-based speaker verification system used successfully in several NIST Speaker Recognition Evaluations (SREs). The system is built around the likelihood ratio test for verification, using simple but effective GMMs for likelihood functions, a universal background model (UBM) for alternative speaker representation, and a form of Bayesian adaptation to derive speaker models from the UBM. The development and use of a handset detector and score normalization to greatly improve verification performance is also described and discussed. Finally, representative performance benchmarks and system behavior experiments on NIST SRE corpora are presented.
© 2000 Academic Press
Key Words: speaker recognition; Gaussian mixture models; likelihood ratio detector; universal background model; handset normalization; NIST evaluation.
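The likelihood ratio test and UBM adaptation named in the abstract above can be illustrated with a minimal sketch. It is not the Lincoln Laboratory implementation. The abstract gives no formulas, so the following are assumptions made here for illustration: diagonal covariances, mean-only adaptation controlled by a relevance factor (the `relevance` parameter and its default are hypothetical), and all function names. A model is represented as a `(weights, means, variances)` tuple.

```python
import math


def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at feature vector x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def component_logs(x, weights, means, variances):
    """Log of (weight * component density) for every mixture component."""
    return [
        math.log(w) + log_gauss_diag(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]


def gmm_log_likelihood(x, weights, means, variances):
    """Log p(x | GMM), using log-sum-exp for numerical stability."""
    logs = component_logs(x, weights, means, variances)
    peak = max(logs)
    return peak + math.log(sum(math.exp(t - peak) for t in logs))


def avg_llr(frames, speaker, ubm):
    """Average per-frame log-likelihood ratio: speaker model vs. UBM."""
    total = sum(
        gmm_log_likelihood(x, *speaker) - gmm_log_likelihood(x, *ubm)
        for x in frames
    )
    return total / len(frames)


def map_adapt_means(frames, ubm, relevance=16.0):
    """Derive a speaker model by shifting UBM means toward the training data.

    Assumption: mean-only adaptation with a relevance factor; weights and
    variances are copied from the UBM unchanged.
    """
    weights, means, variances = ubm
    k, dim = len(weights), len(means[0])
    counts = [0.0] * k                       # soft frame counts per component
    sums = [[0.0] * dim for _ in range(k)]   # posterior-weighted feature sums
    for x in frames:
        logs = component_logs(x, weights, means, variances)
        peak = max(logs)
        post = [math.exp(t - peak) for t in logs]
        norm = sum(post)
        for i in range(k):
            g = post[i] / norm
            counts[i] += g
            for d in range(dim):
                sums[i][d] += g * x[d]
    # Interpolate between the data mean and the UBM mean; components that
    # saw little data stay close to the UBM.
    new_means = [
        [(sums[i][d] + relevance * means[i][d]) / (counts[i] + relevance)
         for d in range(dim)]
        for i in range(k)
    ]
    return (weights, new_means, variances)
```

A verification decision would then compare `avg_llr(test_frames, speaker, ubm)` against a threshold chosen on development data. The handset detection and score normalization steps mentioned in the abstract are not modeled here.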
The NIST 1999 Speaker Recognition Evaluation - An Overview
, 2000
Cited by 29 (2 self)
This article summarizes the 1999 NIST Speaker Recognition Evaluation. It discusses the overall research objectives, the three task definitions, the development and evaluation data sets, the specified performance measures and their manner of presentation, and the overall quality of the results. More than a dozen sites from the United States, Europe, and Asia participated in this evaluation. There were three primary tasks for which automatic systems could be designed: one-speaker detection, two-speaker detection, and speaker tracking. All three tasks were performed in the context of mu-law encoded conversational telephone speech. The one-speaker detection task used single-channel data, while the other two tasks used summed two-channel data. About 500 target speakers were specified, with two minutes of training speech data provided for each. Both multiple- and single-speaker test segments were selected from about 2000 conversations that were not used for training material. The duration of the multiple-speaker test data was nominally one minute, while the duration of the single-speaker test segments varied from near zero up to sixty seconds. For each task, systems had to make independent decisions for selected combinations of a test segment and a hypothesized target speaker. The data sets for each task were designed to be large enough to provide statistically meaningful results on test subsets of interest. Results were analyzed with respect to various conditions, including duration, pitch differences, and handset types.
Keywords: speaker recognition, speaker verification, speaker detection, speaker tracking, DET curve, NIST evaluation
Model-based sequential organization in cochannel speech
- IEEE TRANS. ON AUDIO, SPEECH AND LANGUAGE PROCESSING
, 2006
Cited by 7 (5 self)
A human listener has the ability to follow a speaker’s voice while others are speaking simultaneously; in particular, the listener can organize the time–frequency energy of the same speaker across time into a single stream. In this paper, we focus on sequential organization in cochannel speech, or mixtures of two voices. We extract minimally corrupted segments, or usable speech, in cochannel speech using a robust multipitch tracking algorithm. The extracted usable speech is shown to capture speaker characteristics and improves speaker identification (SID) performance across various target-to-interferer ratios. To utilize speaker characteristics for sequential organization, we extend the traditional SID framework to cochannel speech and derive a joint objective for sequential grouping and SID, leading to a problem of search for the optimum hypothesis. Subsequently we propose a hypothesis pruning algorithm based on speaker models in order to make the search computationally efficient. Evaluation results show that the proposed system approaches the ceiling SID performance obtained with prior pitch information and yields significant improvement over alternative approaches to sequential organization.
The MIT Lincoln Laboratory RT-04F Diarization Systems: Applications to Broadcast Audio and Telephone Conversations
- NIST Rich Transcription Workshop
, November 2004
Audio diarization is the process of annotating an input audio channel with information that attributes (possibly overlapping) temporal regions of signal energy to their specific sources. These sources can include particular speakers, music, background noise sources, and other signal source/channel characteristics. Diarization has utility in making automatic transcripts more readable and in searching and indexing audio archives. In this paper we describe the systems developed by MITLL and used in DARPA EARS Rich Transcription Fall 2004 (RT-04F) speaker diarization evaluation. The primary system is based on a new proxy speaker model approach and the secondary system follows a more standard BIC-based clustering approach. We present experiments analyzing performance of the systems and present a cross-cluster recombination approach that significantly improves performance. In addition, we also present results applying our system to a telephone speech, summed-channel speaker detection task.
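The abstract above only names BIC-based clustering without giving details. As a rough illustration of the general idea, and not of the MITLL system, the sketch below computes a merge criterion for two clusters of feature frames. Assumptions made here: each cluster is modeled by a single diagonal-covariance Gaussian (a simplification chosen to keep the example dependency-free), `penalty_weight` is a hypothetical tuning parameter, and the function names are illustrative.

```python
import math


def diag_log_det(frames, floor=1e-8):
    """Sum of log variances (log-determinant of a diagonal covariance)."""
    n, dim = len(frames), len(frames[0])
    total = 0.0
    for d in range(dim):
        column = [x[d] for x in frames]
        mean = sum(column) / n
        var = sum((c - mean) ** 2 for c in column) / n
        total += math.log(max(var, floor))  # floor guards against log(0)
    return total


def delta_bic(cluster_a, cluster_b, penalty_weight=1.0):
    """BIC difference between modeling two clusters separately vs. merged.

    Positive values favor keeping the clusters apart (likely different
    sources); negative values favor merging them.
    """
    n_a, n_b = len(cluster_a), len(cluster_b)
    n = n_a + n_b
    dim = len(cluster_a[0])
    # Likelihood gain from using two Gaussians instead of one.
    gain = 0.5 * (
        n * diag_log_det(cluster_a + cluster_b)
        - n_a * diag_log_det(cluster_a)
        - n_b * diag_log_det(cluster_b)
    )
    # The extra Gaussian adds `dim` means and `dim` variances.
    penalty = 0.5 * penalty_weight * (2 * dim) * math.log(n)
    return gain - penalty
```

In an agglomerative scheme, the pair of clusters with the lowest `delta_bic` would be merged repeatedly until no pair scores below zero.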

