Speaker, Environment And Channel Change Detection And Clustering Via The Bayesian Information Criterion
, 1998
Cited by 272 (2 self)
In this paper, we are interested in detecting changes in speaker identity, environmental condition and channel condition; we call this the problem of acoustic change detection. The input audio stream can be modeled as a Gaussian process in the cepstral space. We present a maximum likelihood approach to detect turns of a Gaussian process; the decision of a turn is based on the Bayesian Information Criterion (BIC), a model selection criterion well-known in the statistics literature. The BIC criterion can also be applied as a termination criterion in hierarchical methods for clustering of audio segments: two nodes can be merged only if the merging increases the BIC value. Our experiments on the Hub4 1996 and 1997 evaluation data show that our segmentation algorithm can successfully detect acoustic changes; our clustering algorithm can produce clusters with high purity, leading to improvements in accuracy through unsupervised adaptation as much as the ideal clustering by the true speaker i...
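The BIC-based turn detection described in this abstract can be illustrated with a minimal sketch. It assumes a single full-covariance Gaussian per segment and uses a commonly cited form of the criterion (half the log-determinant ratio minus a weighted penalty); the function names, the `margin` parameter, and the penalty weight `lam` are illustrative assumptions, not details taken from the paper.

```python
import numpy as np


def delta_bic(frames, i, lam=1.0):
    """Delta-BIC for splitting `frames` (n x d) into frames[:i] and frames[i:].

    A positive value favours two Gaussians (a change at i) over one.
    """
    n, d = frames.shape

    def logdet_cov(x):
        # Maximum-likelihood (biased) covariance; reshape keeps d == 1 working.
        cov = np.cov(x, rowvar=False, bias=True).reshape(d, d)
        _, value = np.linalg.slogdet(cov)
        return value

    # Log-likelihood ratio between the two-model and one-model hypotheses.
    ratio = 0.5 * (
        n * logdet_cov(frames)
        - i * logdet_cov(frames[:i])
        - (n - i) * logdet_cov(frames[i:])
    )
    # Penalty for the extra mean vector and covariance matrix.
    penalty = 0.5 * (d + 0.5 * d * (d + 1)) * np.log(n)
    return ratio - lam * penalty


def detect_change(frames, margin=20, lam=1.0):
    """Return the most likely change index, or None if no split has positive Delta-BIC."""
    n = frames.shape[0]
    candidates = list(range(margin, n - margin))
    if not candidates:
        return None
    scores = [delta_bic(frames, i, lam) for i in candidates]
    best = int(np.argmax(scores))
    return candidates[best] if scores[best] > 0 else None
```

Calling `detect_change` on a window of cepstral feature vectors returns a candidate turn only when the two-Gaussian model beats the single-Gaussian model after the penalty. The `margin` keeps both covariance estimates from being computed on too few frames.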
The LIMSI Broadcast News Transcription System
- Speech Communication
, 2002
Cited by 131 (12 self)
This paper reports on activities at LIMSI over the last few years directed at the transcription of broadcast news data. We describe our development work in moving from laboratory read speech data to real-world or 'found' speech data in preparation for the ARPA Nov96, Nov97 and Nov98 evaluations. Two main problems needed to be addressed to deal with the continuous flow of inhomogeneous data. These concern the varied acoustic nature of the signal (signal quality, environmental and transmission noise, music) and different linguistic styles (prepared and spontaneous speech on a wide range of topics, spoken by a large variety of speakers).
A robust speaker clustering algorithm
- In Proc. IEEE Automatic Speech Recognition Understanding Workshop
, 2003
Cited by 106 (16 self)
In this paper, we present a novel speaker segmentation and clustering algorithm. The algorithm automatically performs both speaker segmentation and clustering without any prior knowledge of the identities or the number of speakers. Advantages of this algorithm over other approaches are: no need for training/development data, no threshold adjustment requirements, and robustness to different data conditions. This paper also reports the performance of the algorithm on different datasets released by NIST with different initial conditions and parameter settings. The consistently low speaker diarization error rate clearly indicates the robustness of the algorithm.
An overview of automatic speaker diarization systems
- IEEE TASLP
, 2006
Cited by 100 (2 self)
Audio diarization is the process of annotating an input audio channel with information that attributes (possibly overlapping) temporal regions of signal energy to their specific sources. These sources can include particular speakers, music, background noise sources, and other signal source/channel characteristics. Diarization can be used for helping speech recognition, facilitating the searching and indexing of audio archives, and increasing the richness of automatic transcriptions, making them more readable. In this paper, we provide an overview of the approaches currently used in a key area of audio diarization, namely speaker diarization, and discuss their relative merits and limitations. Performances using the different techniques are compared within the framework of the speaker diarization task in the DARPA EARS Rich Transcription evaluations. We also look at how the techniques are being introduced into real broadcast news systems and their portability to other domains and tasks such as meetings and speaker verification. Index Terms: speaker diarization, speaker segmentation and clustering.
Speaker-Based Segmentation For Audio Data Indexing
- Speech Communication
, 1999
Cited by 93 (1 self)
In this paper, we address the problem of speaker-based segmentation, which is the first necessary step for several indexing tasks. It consists in recognizing from their voice the sequence of people engaged in a conversation. In our context, we make no assumptions about prior knowledge of the speaker characteristics (no speaker model, no speech model, no training phase). However, we assume that people do not speak simultaneously. Our segmentation technique takes advantage of two different types of segmentation algorithms. It is organized in two passes: first, the most likely speaker changing points are detected, and then they are validated or discarded. Our algorithm is effective at detecting speaker changing points even close to one another and is thus suited for segmenting conversations containing segments of any length.
1. INTRODUCTION
With the development of telecommunications and of computer science, it is now easy to store large amounts of speech data. The problem is however how...
Music Summarization Using Key Phrases
- In Proc. IEEE ICASSP
, 2000
Cited by 88 (1 self)
Systems to automatically provide a representative summary or 'Key Phrase' of a piece of music are described. For a 'rock' song with 'verse' and 'chorus' sections, we aim to return the chorus or in any case the most repeated and hence most memorable section. The techniques are less applicable to music with more complicated structure, although possibly our general framework could still be used with different heuristics.
Speech/music Discrimination Based On Posterior Probability Features
, 1999
Cited by 51 (9 self)
A hybrid connectionist-HMM speech recognizer uses a neural network acoustic classifier. This network estimates the posterior probability that the acoustic feature vectors at the current time step should be labelled as each of around 50 phone classes. We sought to exploit informal observations of the distinctions in this posterior domain between nonspeech audio and speech segments well-modeled by the network. We describe four statistics that successfully capture these differences, and which can be combined to make a reliable speech/nonspeech categorization that is closely related to the likely performance of the speech recognizer. We test these features on a database of speech/music examples, and our results match the previously-reported classification error, based on a variety of special-purpose features, of 1.4% for 2.5 second segments. We also show that recognizing segments ordered according to their resemblance to clean speech can result in an error rate close to the ideal minimum o...
Unknown-Multiple Speaker Clustering Using HMM
- In Proceedings of ICSLP-2002
, 2002
Cited by 50 (15 self)
An HMM-based speaker clustering framework is presented, where the number of speakers and segmentation boundaries are unknown a priori. Ideally, the system aims to create one pure cluster for each speaker. The HMM is ergodic in nature with a minimum duration topology. The final number of clusters is determined automatically by merging the closest clusters and retraining the new cluster, until a decrease in likelihood is observed. In the same framework, we also examine the effect of using only the features from highly voiced frames as a means of improving the robustness and reducing the computational complexity of the algorithm. The proposed system is assessed on the 1996 HUB-4 evaluation test set in terms of both cluster and speaker purity. It is shown that the number of clusters found often corresponds to the actual number of speakers.
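The cluster and speaker purity measures mentioned in this abstract can be sketched as follows. This assumes one common definition (the fraction of frames that carry the dominant label of their group), which may differ from the exact formula used in the paper; the function name and the frame-level label lists are hypothetical.

```python
from collections import Counter


def average_purity(group_ids, label_ids):
    """Fraction of frames whose label is the dominant label within their group.

    Cluster purity: groups are clusters, labels are true speakers.
    Speaker purity: swap the two arguments.
    """
    if len(group_ids) != len(label_ids) or not group_ids:
        raise ValueError("inputs must be non-empty and of equal length")

    # Count how often each label occurs inside each group.
    counts = {}
    for group, label in zip(group_ids, label_ids):
        counts.setdefault(group, Counter())[label] += 1

    # Sum the dominant-label counts and normalise by the total frame count.
    dominant = sum(max(c.values()) for c in counts.values())
    return dominant / len(group_ids)
```

For example, with frame-level cluster assignments `clusters` and true speakers `speakers`, `average_purity(clusters, speakers)` gives cluster purity and `average_purity(speakers, clusters)` gives speaker purity. Reporting both guards against trivial solutions such as one cluster per frame.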
Strategies for automatic segmentation of audio data
- In Proc. ICASSP
, 2000
Cited by 45 (0 self)
In many applications, like indexing of broadcast news or surveillance applications, the input data consists of a continuous, unsegmented audio stream. Speech recognition technology, however, usually requires segments of relatively short length as input. For such applications, effective methods to segment continuous audio streams into homogeneous segments are required. In this paper, three different segmenting strategies (model-based, metric-based and energy-based) are compared on the same broadcast news test data. It is shown that model-based and metric-based techniques outperform the simpler energy-based algorithms. While model-based segmenters achieve a very high level of segment boundary precision, the metric-based segmenter performs better in terms of segment boundary recall (RCL). To combine the advantages of both strategies, a new hybrid algorithm is introduced. For this, the results of a preliminary metric-based segmentation are used to construct the models for the final model-based segmenter run. The new hybrid approach is shown to outperform the other segmenting strategies.
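The segment boundary precision and recall used to compare the strategies in this abstract can be computed along these lines. This is a sketch under assumptions: boundaries are given as times in seconds, a hypothesised boundary counts as correct if it lies within a tolerance of a not yet matched reference boundary, and both the `tolerance` value and the greedy matching are illustrative choices rather than the paper's scoring protocol.

```python
def boundary_precision_recall(hypothesis, reference, tolerance=0.5):
    """Return (precision, recall) of hypothesised boundaries against a reference.

    Each reference boundary can be matched by at most one hypothesis boundary.
    """
    matched = set()  # indices of reference boundaries already used
    hits = 0
    for h in hypothesis:
        best = None
        for j, r in enumerate(reference):
            if j in matched or abs(h - r) > tolerance:
                continue
            # Prefer the closest unmatched reference boundary.
            if best is None or abs(h - r) < abs(h - reference[best]):
                best = j
        if best is not None:
            matched.add(best)
            hits += 1

    precision = hits / len(hypothesis) if hypothesis else 0.0
    recall = hits / len(reference) if reference else 0.0
    return precision, recall
```

A segmenter that proposes few but accurate boundaries scores high precision and low recall, while one that proposes many boundaries tends toward the opposite. That is the trade-off the abstract describes between the model-based and metric-based segmenters.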
Connectionist speech recognition of Broadcast News
, 2002
Cited by 38 (15 self)
This paper describes connectionist techniques for recognition of Broadcast News. The fundamental difference between connectionist systems and more conventional mixture-of-Gaussian systems is that connectionist models directly estimate posterior probabilities as opposed to likelihoods. Access to posterior probabilities has enabled us to develop a number of novel approaches to confidence estimation, pronunciation modelling and search. In addition we have investigated a new feature extraction technique based on the modulation-filtered spectrogram (MSG), and methods for combining multiple information sources. We have incorporated all of these techniques into a system for the transcription