A Tutorial on Text-Independent Speaker Verification
- EURASIP JOURNAL ON APPLIED SIGNAL PROCESSING 2004:4, 430–451
, 2004
Cited by 138 (13 self)
This paper presents an overview of a state-of-the-art text-independent speaker verification system. First, an introduction proposes a modular scheme of the training and test phases of a speaker verification system. Then, the most commonly used speech parameterization in speaker verification, namely cepstral analysis, is detailed. Gaussian mixture modeling, which is the speaker modeling technique used in most systems, is then explained. A few speaker modeling alternatives, namely neural networks and support vector machines, are mentioned. Normalization of scores is then explained, as this is a very important step in dealing with real-world data. The evaluation of a speaker verification system is then detailed, and the detection error trade-off (DET) curve is explained. Several extensions of speaker verification are then enumerated, including speaker tracking and segmentation by speakers. Then, some applications of speaker verification are proposed, including on-site applications, remote applications, applications relative to structuring audio information, and games. Issues concerning the forensic area are then recalled, as we believe it is very important to inform people about the actual performance and limitations of speaker verification systems. This paper concludes by giving a ...
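As a rough illustration of the evaluation step mentioned in this abstract, the sketch below computes the miss and false-alarm rates that make up the points of a DET curve. The function names and the toy scores are assumptions made for this example, not taken from the paper, and a real DET plot would additionally warp both axes onto a normal-deviate scale.

```python
def error_rates(target_scores, impostor_scores, threshold):
    """Return (miss_rate, false_alarm_rate) when scores >= threshold are accepted."""
    # A miss is a genuine (target) trial that falls below the threshold.
    misses = sum(1 for s in target_scores if s < threshold)
    # A false alarm is an impostor trial that reaches the threshold.
    false_alarms = sum(1 for s in impostor_scores if s >= threshold)
    return misses / len(target_scores), false_alarms / len(impostor_scores)


def det_points(target_scores, impostor_scores):
    """Sweep the threshold over every observed score.

    Each returned point is (false_alarm_rate, miss_rate), ordered by
    increasing threshold.
    """
    thresholds = sorted(set(target_scores) | set(impostor_scores))
    points = []
    for t in thresholds:
        miss, false_alarm = error_rates(target_scores, impostor_scores, t)
        points.append((false_alarm, miss))
    return points


# Hypothetical verification scores, for illustration only.
targets = [2.1, 1.7, 0.9, 2.8, 0.2]
impostors = [-1.0, 0.4, -0.3, 1.1, -2.2]

if __name__ == "__main__":
    for false_alarm, miss in det_points(targets, impostors):
        print(f"false alarm {false_alarm:.1f}  miss {miss:.1f}")
```

Raising the threshold trades false alarms for misses, which is the trade-off the DET curve displays.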
Robust speaker change detection
- IEEE Signal Process. Lett
, 2004
Cited by 49 (8 self)
The most commonly used criteria for speaker change detection, such as the log likelihood ratio (LLR) and the Bayesian information criterion (BIC), have an adjustable threshold or penalty parameter for making speaker change decisions. These parameters are not always robust to different acoustic conditions and have to be tuned. In this letter, we present a criterion which can be used to identify speaker changes in an audio stream without such tuning. The criterion consists of calculating the LLR of two models with the same number of parameters. Results on the Hub4 1997 evaluation set indicate that we achieve a performance comparable to using BIC with the optimal penalty term. Index Terms: Bayesian information criterion (BIC), log likelihood ratio (LLR), speaker change detection.
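To make the tuning problem concrete, here is a minimal sketch of the conventional BIC change test that this abstract contrasts itself with. It is not the tuning-free criterion proposed in the letter. The sketch assumes one-dimensional features modeled by a single Gaussian per segment; `delta_bic` and `penalty_weight` are names chosen for this example.

```python
import math


def _variance(xs):
    """Maximum-likelihood variance, floored to keep the logarithm finite."""
    m = sum(xs) / len(xs)
    return max(sum((x - m) ** 2 for x in xs) / len(xs), 1e-12)


def delta_bic(left, right, penalty_weight=1.0):
    """BIC difference between 'two Gaussians' and 'one Gaussian'.

    A positive value favors a change point between `left` and `right`.
    `penalty_weight` is the adjustable parameter that normally needs tuning.
    """
    both = left + right
    n, n1, n2 = len(both), len(left), len(right)
    # Log-likelihood gain from modeling the two halves separately.
    gain = 0.5 * (
        n * math.log(_variance(both))
        - n1 * math.log(_variance(left))
        - n2 * math.log(_variance(right))
    )
    # The two-model hypothesis has 2 extra parameters (one mean, one variance).
    penalty = 0.5 * penalty_weight * 2 * math.log(n)
    return gain - penalty


# Toy 1-D "features" for two hypothetical speakers.
speaker_a = [0.0, 0.1, -0.1, 0.05, -0.05] * 4
speaker_b = [5.0, 5.1, 4.9, 5.05, 4.95] * 4

if __name__ == "__main__":
    print(delta_bic(speaker_a, speaker_b) > 0)  # different sources
    print(delta_bic(speaker_a, speaker_a) > 0)  # same source
```

With real cepstral features the Gaussians would be multivariate, and the decision would depend on how `penalty_weight` is set.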
Location Based Speaker Segmentation
- in Proceedings of the 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP-03), Hong Kong
, 2003
Cited by 33 (12 self)
This paper proposes a technique that segments audio according to speakers based on their location. In many multi-party conversations, such as meetings, the location of participants is restricted to a small number of regions, such as seats around a table, or at a whiteboard. In such cases, segmentation according to these discrete regions would be a reliable means of determining speaker turns. We propose a system that uses microphone pair time delays as features to represent speaker locations. These features are integrated in a GMM/HMM framework to determine an optimal segmentation of the audio according to location. The HMM framework also allows extensions to recognise more complex structure, such as the presence of two simultaneous speakers. Experiments testing the system on real recordings from a meeting room show that the proposed location features can provide greater discrimination than standard cepstral features, and also demonstrate the success of an extension to handle dual-speaker overlap.
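The location features described above are microphone-pair time delays. The sketch below estimates such a delay with a plain brute-force cross-correlation. This is an assumption made for illustration: the abstract does not say which delay estimator the authors use, and `estimate_delay` is a hypothetical helper name.

```python
def estimate_delay(mic_a, mic_b, max_lag):
    """Return the lag, in samples, by which mic_b trails mic_a.

    Searches lags in [-max_lag, max_lag] and keeps the one that maximizes
    the cross-correlation sum over the overlapping samples.
    """
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i, a in enumerate(mic_a):
            j = i + lag
            if 0 <= j < len(mic_b):
                score += a * mic_b[j]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


# A short pulse reaching the second microphone two samples later.
near_mic = [0, 0, 1, 3, 1, 0, 0, 0, 0, 0]
far_mic = [0, 0, 0, 0, 1, 3, 1, 0, 0, 0]

if __name__ == "__main__":
    print(estimate_delay(near_mic, far_mic, max_lag=4))
```

A speaker at a fixed seat produces a roughly constant delay per microphone pair, which is what makes such delays usable as location features.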
Music Thumbnailing Via Structural Analysis
- Proceedings of ACM Multimedia Conference
, 2003
Cited by 24 (3 self)
Music thumbnailing (or music summarization) aims at finding the most representative part of a song, which can be used for web browsing, web searching and music recommendation. Three strategies are proposed in this paper for automatically generating the thumbnails of music. All the strategies are based on the results of music structural analysis, which identifies the recurrent structure of musical signals. Instead of being evaluated subjectively, the generated thumbnails are evaluated by several criteria, mainly based on previous human experiments on music thumbnailing and the properties of thumbnails used for commercial web sites. Additionally, the performance of the structural analysis is demonstrated visually using figures for qualitative evaluation, and by three novel structural similarity metrics for quantitative evaluation. The preliminary results obtained using a corpus of Beatles' songs demonstrate the promise of our method and suggest that different thumbnailing strategies may suit different applications.
Automated Analysis of Musical Structure
- PhD thesis, MIT
, 2005
Cited by 23 (0 self)
Listening to music and perceiving its structure is a fairly easy task for humans, even for listeners without formal musical training. For example, we can notice changes of notes, chords and keys, though we might not be able to name them (segmentation based on tonality and harmonic analysis); we can parse a musical piece into phrases or sections (segmentation based on recurrent structural analysis); we can identify and memorize the main themes or the catchiest parts (hooks) of a piece (summarization based on hook analysis); we can detect the most informative musical parts for making certain judgments (detection of salience for classification). However, building computational models to mimic these processes is a hard problem. Furthermore, the amount of digital music that has been generated and stored has already become unfathomable. How to efficiently store and retrieve the digital content is an important real-world problem. This dissertation presents our research on automatic music segmentation, summarization and ...
A New Speaker Change Detection Method For Two-Speaker Segmentation
- Proc. of IEEE ICASSP, Volume 4, IV-3908
Cited by 18 (0 self)
In the absence of prior information about speakers, an important step in speaker segmentation is to obtain initial estimates for training speaker models. In this paper, we present a new method for obtaining these estimates. The method assumes that a conversation must be initiated by one of the speakers. Thus, one speaker model is estimated from the small segment at the beginning of the conversation, and the segment that has the largest distance from the initial segment is used to train the second speaker model. We describe a system based on this method and evaluate it on two different tasks: a controlled task with variations in the duration of the initial speaker segment and the amount of overlapped speech, and the 2001 NIST Speaker Recognition Evaluation task, which contains natural conversations. This system shows significant improvements over the conventional system in the absence of overlapped speech on the controlled task.
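The seeding idea in this abstract (first segment for one speaker, farthest segment for the other) can be sketched as follows. The abstract does not name the distance measure, so the absolute difference between segment means of one-dimensional features is used here purely as a stand-in. `pick_seed_segments` is a hypothetical name.

```python
def _mean(xs):
    return sum(xs) / len(xs)


def pick_seed_segments(segments):
    """Return (first_index, second_index) of the segments used to seed two models.

    The first speaker is seeded from the opening segment. The second speaker
    is seeded from the segment whose mean lies farthest from the opening one.
    """
    reference = _mean(segments[0])
    farthest = max(
        range(1, len(segments)),
        key=lambda j: abs(_mean(segments[j]) - reference),
    )
    return 0, farthest


# Toy 1-D feature segments: segment 2 clearly differs from the opening segment.
toy_segments = [
    [0.1, 0.0, -0.1],
    [0.2, 0.1, 0.0],
    [3.0, 3.1, 2.9],
    [0.0, 0.1, 0.2],
]

if __name__ == "__main__":
    print(pick_seed_segments(toy_segments))
```

In a full system the two seed segments would train initial speaker models, which are then refined over the whole conversation.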
Towards Computer Understanding of Human Interactions
- Proc. European Symp. on Ambient Intelligence (EUSAI), LNCS 2875
, 2003
Cited by 15 (0 self)
People meet in order to interact - disseminating information, making decisions, and creating new ideas. Automatic analysis of meetings is therefore important from two points of view: extracting the information they contain, and understanding human interaction processes. Based on this view, this article presents an approach in which relevant information content of a meeting is identified from a variety of audio and visual sensor inputs and statistical models of interacting people. We present a framework for computer observation and understanding of interacting people, and discuss particular tasks within this framework, issues in the meeting context, and particular algorithms that we have adopted.
Speaker segmentation and clustering
- Signal Processing
, 2008
Cited by 13 (2 self)
This survey focuses on two challenging speech processing topics, namely speaker segmentation and speaker clustering. Speaker segmentation aims at finding speaker change points in an audio stream, whereas speaker clustering aims at grouping speech segments based on speaker characteristics. Model-based, metric-based, and hybrid speaker segmentation algorithms are reviewed. Concerning speaker clustering, deterministic and probabilistic algorithms are examined. A comparative assessment of the reviewed algorithms is undertaken, the algorithm advantages and disadvantages are indicated, insight into the algorithms is offered, and deductions as well as recommendations are given. Rich transcription and movie analysis are candidate applications that benefit from combined speaker segmentation and clustering.
Blind Change Detection for Audio Segmentation
- In Proc. 2005 IEEE Int. Conf. Acoustics, Speech, and Signal Processing (ICASSP’05)
, 2005
Cited by 10 (1 self)
Automatic segmentation of audio streams according to speaker identities, environmental conditions, and channel conditions has become an important preprocessing step for speech recognition, speaker recognition, and audio data mining [7], [8], [?], [?]. In this paper, we test and compare the cumulative sum (CuSum) algorithm [2], [?], the Bayesian information criterion (BIC) algorithm [?], [4], and the Kolmogorov-Smirnov test [?] for detecting changes in speaker identity, environmental conditions, and channel conditions in audio signals. We present a novel approach that combines hypothesized boundaries from the three algorithms to achieve the final segmentation of the audio signal. Our experiments on the 1998 EARS Hub4 Broadcast News show that a variation of the CuSum algorithm significantly outperforms the other two approaches and that combining the three approaches using a voting scheme improves the performance slightly compared to using the CuSum algorithm alone.
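For readers unfamiliar with CuSum, the following is a minimal sketch of the textbook one-sided test for a shift in the mean of a Gaussian signal with known parameters. It is not the CuSum variation evaluated in the paper, whose details the abstract does not give. The function name and all parameter values are assumptions for illustration.

```python
def cusum_alarm_index(samples, mean_before, mean_after, variance, threshold):
    """Return the index at which the CUSUM statistic first exceeds `threshold`.

    Returns None if no alarm is raised. The returned index is the detection
    time, which lags the true change point.
    """
    g = 0.0
    for k, x in enumerate(samples):
        # Log-likelihood ratio of one sample under N(mean_after, variance)
        # versus N(mean_before, variance).
        llr = (mean_after - mean_before) / variance * (
            x - (mean_before + mean_after) / 2.0
        )
        # Accumulate evidence for a change, resetting at zero.
        g = max(0.0, g + llr)
        if g > threshold:
            return k
    return None


# Toy signal whose mean jumps from 0 to 1 at index 10.
signal = [0.0] * 10 + [1.0] * 10

if __name__ == "__main__":
    print(cusum_alarm_index(signal, 0.0, 1.0, 1.0, threshold=2.0))
```

A lower `threshold` shortens the detection delay at the cost of more false alarms on noisy data.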
Semantic Segmentation and Summarization of Music
- IEEE Signal Processing Magazine
, 2006
Cited by 9 (0 self)
Automatic segmentation and summarization of music is a key issue in music browsing, searching and recommendation. This article presents methods for segmenting music based on its tonality and recurrent structure, and summarizing music based on its structure. Experimental results are evaluated quantitatively to demonstrate the promise of the proposed methods.