Location Based Speaker Segmentation
- In Proceedings of the 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP-03), Hong Kong, 2003
Cited by 22 (11 self)
This paper proposes a technique that segments audio according to speakers based on their location. In many multi-party conversations, such as meetings, the location of participants is restricted to a small number of regions, such as seats around a table, or at a whiteboard. In such cases, segmentation according to these discrete regions would be a reliable means of determining speaker turns. We propose a system that uses microphone pair time delays as features to represent speaker locations. These features are integrated in a GMM/HMM framework to determine an optimal segmentation of the audio according to location. The HMM framework also allows extensions to recognise more complex structure, such as the presence of two simultaneous speakers. Experiments testing the system on real recordings from a meeting room show that the proposed location features can provide greater discrimination than standard cepstral features, and also demonstrate the success of an extension to handle dual-speaker overlap.
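The time-delay feature mentioned in this abstract can be illustrated with a minimal sketch. It assumes plain time-domain cross-correlation between two equal-length microphone signals; the abstract does not say how the delays are actually estimated, and the function name `estimate_delay` and the parameter `max_lag` are hypothetical.

```python
import numpy as np

def estimate_delay(x, y, max_lag):
    """Return the lag in samples (within +/- max_lag) at which y best matches x.

    A positive result means y trails x. Assumes x and y have equal length.
    This is an illustrative estimator, not the method used in the paper.
    """
    x = np.asarray(x, dtype=float) - np.mean(x)
    y = np.asarray(y, dtype=float) - np.mean(y)
    n = len(x)
    lags = np.arange(-max_lag, max_lag + 1)
    scores = []
    for lag in lags:
        if lag >= 0:
            # Compare x[t] with y[t + lag]
            scores.append(np.dot(x[:n - lag], y[lag:]))
        else:
            # Compare x[t - lag] with y[t]
            scores.append(np.dot(x[-lag:], y[:n + lag]))
    return int(lags[int(np.argmax(scores))])
```

One delay per microphone pair would give a feature vector per frame, which could then be modelled by a GMM/HMM as the abstract describes.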
Robust speaker change detection
- IEEE Signal Process. Lett., 2004
Cited by 19 (5 self)
The most commonly used criteria for speaker change detection, such as the log likelihood ratio (LLR) and the Bayesian information criterion (BIC), have an adjustable threshold or penalty parameter for making speaker change decisions. These parameters are not always robust to different acoustic conditions and have to be tuned. In this letter, we present a criterion that can identify speaker changes in an audio stream without such tuning. The criterion consists of calculating the LLR of two models with the same number of parameters. Results on the Hub4 1997 evaluation set indicate that we achieve performance comparable to using BIC with an optimal penalty term. Index Terms: Bayesian information criterion (BIC), log likelihood ratio (LLR), speaker change detection.
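For context, the tunable baseline that this abstract contrasts itself with can be sketched as the standard BIC change test with full-covariance Gaussians. This is not the tuning-free criterion proposed in the letter; the function name `delta_bic` and the parameter `penalty_weight` (the adjustable penalty) are hypothetical.

```python
import numpy as np

def delta_bic(features, split, penalty_weight=1.0):
    """Standard Delta-BIC for a candidate change point at index `split`.

    `features` is an (n_frames, n_dims) array. A positive value favours
    modelling the two sides with separate Gaussians, i.e. a change.
    """
    X = np.asarray(features, dtype=float)
    n, d = X.shape

    def logdet_cov(segment):
        # Maximum-likelihood covariance (bias=True divides by N)
        cov = np.atleast_2d(np.cov(segment, rowvar=False, bias=True))
        _, logdet = np.linalg.slogdet(cov)
        return logdet

    left, right = X[:split], X[split:]
    gain = 0.5 * (n * logdet_cov(X)
                  - len(left) * logdet_cov(left)
                  - len(right) * logdet_cov(right))
    # Extra parameters of the two-model hypothesis: one mean + one covariance
    n_params = d + d * (d + 1) / 2
    return gain - penalty_weight * 0.5 * n_params * np.log(n)
```

The dependence on `penalty_weight` is exactly the tuning burden the abstract refers to.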
Music Thumbnailing Via Structural Analysis
- Proceedings of ACM Multimedia Conference, 2003
Cited by 16 (3 self)
Music thumbnailing (or music summarization) aims at finding the most representative part of a song, which can be used for web browsing, web searching and music recommendation. Three strategies are proposed in this paper for automatically generating the thumbnails of music. All the strategies are based on the results of music structural analysis, which identifies the recurrent structure of musical signals. Instead of being evaluated subjectively, the generated thumbnails are evaluated by several criteria, mainly based on previous human experiments on music thumbnailing and the properties of thumbnails used for commercial web sites. Additionally, the performance of the structural analysis is demonstrated visually using figures for qualitative evaluation, and by three novel structural similarity metrics for quantitative evaluation. The preliminary results obtained using a corpus of Beatles' songs demonstrate the promise of our method and suggest that different thumbnailing strategies might suit different applications.
Towards Computer Understanding of Human Interactions
- Proc. European Symp. on Ambient Intelligence (EUSAI), LNCS 2875, 2003
Cited by 11 (0 self)
People meet in order to interact - disseminating information, making decisions, and creating new ideas. Automatic analysis of meetings is therefore important from two points of view: extracting the information they contain, and understanding human interaction processes. Based on this view, this article presents an approach in which relevant information content of a meeting is identified from a variety of audio and visual sensor inputs and statistical models of interacting people. We present a framework for computer observation and understanding of interacting people, and discuss particular tasks within this framework, issues in the meeting context, and particular algorithms that we have adopted.
A New Speaker Change Detection Method For Two-Speaker Segmentation
- Proc. of IEEE ICASSP, Volume 4, IV-3908
Cited by 8 (0 self)
In the absence of prior information about speakers, an important step in speaker segmentation is to obtain initial estimates for training speaker models. In this paper, we present a new method for obtaining these estimates. The method assumes that a conversation must be initiated by one of the speakers. Thus one speaker model is estimated from the small segment at the beginning of the conversation, and the segment that has the largest distance from the initial segment is used to train the second speaker model. We describe a system based on this method and evaluate it on two different tasks: a controlled task with variations in the duration of the initial speaker segment and the amount of overlapped speech, and the 2001 NIST Speaker Recognition Evaluation task, which contains natural conversations. This system shows significant improvements over the conventional system in the absence of overlapped speech on the controlled task.
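The seed-selection idea in this abstract can be sketched as follows. The abstract does not name the distance measure, so this sketch assumes the Euclidean distance between per-segment mean feature vectors purely for illustration; `pick_seed_segments` is a hypothetical name.

```python
import numpy as np

def pick_seed_segments(segments):
    """Return (index of first-speaker seed, index of second-speaker seed).

    `segments` is a list of (n_frames, n_dims) feature arrays in time order.
    The first segment seeds speaker one; the segment farthest from it seeds
    speaker two. The distance used here is an assumption, not the paper's.
    """
    first_mean = np.mean(segments[0], axis=0)
    distances = [np.linalg.norm(np.mean(seg, axis=0) - first_mean)
                 for seg in segments]
    return 0, int(np.argmax(distances))
```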
Automated Analysis of Musical Structure
- PhD thesis, MIT, 2005
Cited by 8 (0 self)
Listening to music and perceiving its structure is a fairly easy task for humans, even for listeners without formal musical training. For example, we can notice changes of notes, chords and keys, though we might not be able to name them (segmentation based on tonality and harmonic analysis); we can parse a musical piece into phrases or sections (segmentation based on recurrent structural analysis); we can identify and memorize the main themes or the catchiest parts (hooks) of a piece (summarization based on hook analysis); we can detect the most informative musical parts for making certain judgments (detection of salience for classification). However, building computational models to mimic these processes is a hard problem. Furthermore, the amount of digital music that has been generated and stored has already become unfathomable. How to efficiently store and retrieve the digital content is an important real-world problem. This dissertation presents our research on automatic music segmentation, summarization and
Semantic Segmentation and Summarization of Music
- IEEE Signal Processing Magazine, 2006
Cited by 5 (0 self)
Automatic segmentation and summarization of music is a key issue in music browsing, searching and recommendation. This article presents methods for segmenting music based on its tonality and recurrent structure, and summarizing music based on its structure. Experimental results are evaluated quantitatively to demonstrate the promise of the proposed methods.
Blind Change Detection for Audio Segmentation
- In Proc. 2005 IEEE Int. Conf. Acoustics, Speech, and Signal Processing (ICASSP'05), 2005
Cited by 5 (0 self)
Automatic segmentation of these audio streams according to speaker identities, environmental and channel conditions has become an important preprocessing step for speech recognition, speaker recognition, and audio data mining [7], [8], [?], and [?]. In this paper, we test and compare the cumulative sum (CuSum) algorithm [2], [?], the Bayesian information criterion (BIC) algorithm [?], [4], and the Kolmogorov-Smirnov test [?] for detecting changes in speaker identity, environmental conditions and channel conditions in audio signals. We present a novel approach that combines hypothesized boundaries from the three algorithms to achieve the final segmentation of the audio signal. Our experiments on the 1998 EARS Hub4 Broadcast News show that a variation of the CuSum algorithm significantly outperforms the other two approaches and that combining the three approaches using a voting scheme improves the performance slightly compared to using the CuSum algorithm alone.
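As a rough illustration of the cumulative sum idea, the sketch below implements a textbook two-sided CUSUM test for a shift in the mean of a scalar sequence. It is not the CuSum variation evaluated in the paper, whose details the abstract does not give; `first_cusum_alarm`, `drift` and `threshold` are hypothetical names.

```python
def first_cusum_alarm(samples, target_mean, drift, threshold):
    """Return the index of the first CUSUM alarm, or None if none fires.

    `drift` is the slack subtracted at each step so that small deviations
    do not accumulate; `threshold` is the alarm level.
    """
    upper = 0.0  # accumulates evidence for an upward shift
    lower = 0.0  # accumulates evidence for a downward shift
    for i, x in enumerate(samples):
        upper = max(0.0, upper + (x - target_mean) - drift)
        lower = max(0.0, lower - (x - target_mean) - drift)
        if upper > threshold or lower > threshold:
            return i
    return None
```

With a step from 0 to 2 at index 50, `drift=0.5` and `threshold=4.0`, the upper sum grows by 1.5 per sample and first exceeds the threshold three samples after the step.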
Automatic Segmentation of Sung Melodies
- 2002
Cited by 3 (2 self)
The present work explores several techniques for the automatic segmentation of sung melodies. Most contemporary music information retrieval (MIR) systems require sung queries to be segmented into disjoint regions representing individual notes for database searching. The fundamental philosophy adhered to throughout this work is that a melody segmentation algorithm should rely primarily on fundamental pitch information. Three classes of segmentation algorithms are explored: predictive filtering, LMS detection and curve fitting. Two different predictive filter formulations are presented: Kalman filters and adaptive RLS filters. A single-node neural network, or perceptron, is implemented as an LMS detector. Lastly, a curve fitting algorithm is developed. The curve fitting algorithm makes use of dynamic programming to keep the computational complexity manageable.
Speech/Music/Silence and Gender Detection Algorithm
- In Proceedings of the 7th International Conference on Distributed Multimedia Systems (DMS01), 2001
Cited by 3 (0 self)
A speech/music/silence discrimination algorithm and a gender detection algorithm are presented in this paper. First, silence segments are extracted from the audio stream using energy and zero-crossing rate (ZCR) features. Speech segments are then detected using the energy envelope and harmonic features. Music segments are classified using the energy envelope and harmonic features as well. For gender detection, we propose a feature that discriminates between men's and women's voices. Unlike algorithms based on Gaussian mixture models, the proposed algorithm needs no training phase, and it classifies the audio stream into four classes (speech, music, silence, and other) with a delay of 1 s. Once speech is extracted, gender detection can be applied, and the detection can run in real time. To evaluate the proposed algorithm, we applied it to 10 min of audio extracted from CNN programs: it achieves 93% classification accuracy for speech, 84% for music, and 80% gender detection accuracy in speech segments.
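The two features named for silence extraction, short-time energy and zero-crossing rate, can be computed as in the minimal sketch below. The frame length, hop size and any decision thresholds are assumptions, since the abstract does not state them; `frame_energy_and_zcr` is a hypothetical name.

```python
import numpy as np

def frame_energy_and_zcr(signal, frame_len, hop):
    """Return per-frame mean energy and zero-crossing rate as two arrays."""
    x = np.asarray(signal, dtype=float)
    energies, zcrs = [], []
    for start in range(0, len(x) - frame_len + 1, hop):
        frame = x[start:start + frame_len]
        energies.append(np.mean(frame ** 2))
        signs = np.sign(frame)
        signs[signs == 0] = 1  # treat exact zeros as positive
        # Fraction of adjacent sample pairs whose sign differs
        zcrs.append(np.mean(signs[1:] != signs[:-1]))
    return np.array(energies), np.array(zcrs)
```

A frame could then be marked as silence when its energy falls below a chosen threshold, with the ZCR helping to separate low-energy noise from low-energy speech; the exact rule used in the paper is not given in the abstract.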

