Results 1 - 10 of 19
Speaker, Environment And Channel Change Detection And Clustering Via The Bayesian Information Criterion
1998
Cited by 153 (2 self)
In this paper, we are interested in detecting changes in speaker identity, environmental condition and channel condition; we call this the problem of acoustic change detection. The input audio stream can be modeled as a Gaussian process in the cepstral space. We present a maximum likelihood approach to detect turns of a Gaussian process; the decision of a turn is based on the Bayesian Information Criterion (BIC), a model selection criterion well-known in the statistics literature. The BIC criterion can also be applied as a termination criterion in hierarchical methods for clustering of audio segments: two nodes can be merged only if the merging increases the BIC value. Our experiments on the Hub4 1996 and 1997 evaluation data show that our segmentation algorithm can successfully detect acoustic changes; our clustering algorithm can produce clusters with high purity, leading to improvements in accuracy through unsupervised adaptation as much as the ideal clustering by the true speaker i...
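The BIC-based decision described in this abstract can be sketched as follows. This is a minimal illustration and not the authors' implementation: it assumes one full-covariance Gaussian per segment, maximum likelihood covariance estimates, and a penalty weight `lam`. The function names, `lam`, and `margin` are hypothetical.

```python
import numpy as np

def _logdet_cov(x):
    # Log-determinant of the maximum likelihood covariance of the frames in x.
    cov = np.atleast_2d(np.cov(x, rowvar=False, bias=True))
    _, logdet = np.linalg.slogdet(cov)
    return logdet

def delta_bic(x, i, lam=1.0):
    # Compare "two Gaussians split at frame i" against "one Gaussian".
    # A positive value favours a change at frame i.
    n, d = x.shape
    gain = 0.5 * (n * _logdet_cov(x)
                  - i * _logdet_cov(x[:i])
                  - (n - i) * _logdet_cov(x[i:]))
    # Extra parameters of the second Gaussian: d means, d(d+1)/2 covariances.
    penalty = 0.5 * lam * (d + d * (d + 1) / 2) * np.log(n)
    return gain - penalty

def detect_change(x, margin=10, lam=1.0):
    # Scan candidate change points, keeping `margin` frames on each side so
    # that both covariance estimates stay well conditioned (an assumption).
    n = x.shape[0]
    scores = {i: delta_bic(x, i, lam) for i in range(margin, n - margin + 1)}
    best = max(scores, key=scores.get)
    if scores[best] > 0:
        return best, scores[best]
    return None, scores[best]
```

Under the same assumptions, the sign of this quantity could also serve as a stopping rule for clustering: two segments are left unmerged when modeling them separately is still favoured.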
Modeling Those F-Conditions -- Or Not
Proc. DARPA Speech Recognition Workshop, 1997
Cited by 13 (0 self)
After several disappointing preliminary attempts to make condition-specific models, we decided that it would be more advantageous to spend our time just trying to improve the core recognition system and use general adaptation techniques to deal with variations. This simplified the system immensely, freed up people, and made the transition from the PE system to the UE system much easier. We describe our attempts at condition-specific modeling/adaptation/training. We show that the benefit for channel-specific modeling is smaller than that for general adaptation procedures and we argue that the cost is too high.
A Distance Measure Between Collections Of Distributions And Its Application To Speaker Recognition
Proc. ICASSP98, 1998
Cited by 11 (3 self)
This paper presents a distance measure for evaluating the closeness of two sets of distributions. The job of finding the distance between two distributions has been addressed with many solutions present in the literature. To cluster speakers using the pre-computed models of their speech, a need arises for computing a distance between these models which are normally built of a collection of distributions such as Gaussians (e.g., comparison between two HMM models). The definition of this distance measure creates many possibilities for speaker verification, speaker adaptation, speaker segmentation and many other related applications. A distance measure is presented for evaluating the closeness of a collection of distributions with centralized atoms such as Gaussians (but not limited to Gaussians). Several applications including some in speaker recognition with some results are presented using this distance measure. 1. INTRODUCTION A practical solution to the problem of computing a meanin...
A Hierarchical Approach To Large-Scale Speaker Recognition
In Proc. Eurospeech 1999, 1999
Cited by 7 (2 self)
This paper presents a hierarchical approach to the Large-Scale Speaker Recognition problem. Here the authors present a binary tree database approach for arranging the trained speaker models based on a distance measure designed for comparing two sets of distributions. The combination of this hierarchical structure and the distance measure [1] provides the means for conducting a large-scale verification task. In addition, two techniques are presented for creating a model of the complement-space to the cohort which is used for rejection purposes. Results are presented for the drastic improvements achieved mainly in reducing the false acceptance of the speaker verification system without any significant false-rejection degradation. 1. INTRODUCTION Let us consider a possible model for speech as being a collection of distributions (e.g., Gaussian distributions). To be able to rank speakers within a database based on their similarity of speech characteristics, one needs a distance measure ...
Automatic Transcription of Broadcast News
2001
Cited by 6 (1 self)
This paper describes the IBM approach to Broadcast News Transcription. Typical problems in the Broadcast News Transcription task are segmentation, clustering, acoustic modeling, language modeling and acoustic model adaptation. This paper presents new algorithms for each of these focus problems. Some key ideas include Bayesian Information Criterion (for segmentation, clustering and acoustic modeling) and Speaker/Cluster Adapted Training
The 1996 BBN Byblos Hub-4 Transcription System
In Proc. of DARPA Speech Recognition Workshop, 1996
Cited by 5 (1 self)
In this paper, we describe the BBN Byblos system used for the 1996 Hub-4 Partitioned Evaluation (PE) and Unpartitioned Evaluation (UE) tests. For the PE, we chose to ignore the segment feature labels that were given to the system as side-information so that our approach would generalize trivially to the UE. Moreover, we chose not to model specific channel conditions in the training because the observed gains were too small to warrant the additional system complexity required to support them. In the end, we estimated a single set of acoustic models from only 40 hours of broadcast news data. For the UE, the data was automatically segmented with a simple dual-gender phoneme recognizer that efficiently located pauses and changes in speakers' gender. After this preliminary stage of segmentation and gender-classification, our UE and PE systems were identical. We achieved a 30.2% word error rate on the PE test and 31.8% on the UE test - only a 5% relative degradation from our PE result. 1. I...
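The "5% relative degradation" quoted in this abstract can be checked from the two reported word error rates. The short sketch below is only that arithmetic; the function name is hypothetical.

```python
def relative_degradation(baseline_wer, new_wer):
    # Increase in word error rate, expressed as a fraction of the baseline.
    return (new_wer - baseline_wer) / baseline_wer

pe_wer, ue_wer = 30.2, 31.8  # word error rates (%) reported in the abstract
print(f"{100 * relative_degradation(pe_wer, ue_wer):.1f}%")  # prints "5.3%"
```

The absolute difference is 1.6 points, which is about 5.3% of the PE result, consistent with the rounded figure in the abstract.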
Speaker Tracking in Broadcast Audio Material in the Framework of the THISL Project
In Proceedings of the ESCA ETRW workshop Accessing Information in Spoken Audio, 1999
Cited by 5 (0 self)
In this paper, we present a first approach to build an automatic system for broadcast news speaker-based segmentation. Based on a Chop-and-Recluster method, this system is developed in the framework of the THISL project. A metric-based segmentation is used for the Chop procedure and different distances have been investigated. The Recluster procedure relies on a bottom-up clustering of segments obtained beforehand and represented by non-parametric models. Various hierarchical clustering schemes have been tested. Some experiments on BBC broadcast news recordings show that the system can detect real speaker changes with high accuracy (mean error ≈ 0.7 s) and a fair false alarm rate (mean false alarm rate ≈ 5.5%). The Recluster procedure can produce homogeneous clusters but it is not yet robust enough to tackle overly complex classification tasks. 1. INTRODUCTION THISL (THematic Indexing of Spoken Language) is an ESPRIT Long Term Research project that is investigating the development ...
Applied Clustering for Automatic Speaker-based segmentation of Audio Material
Belgian Journal of Operations Research, Statistics and Computer Science (JORBEL), 2002
Cited by 4 (1 self)
In this paper, we introduce an algorithm dedicated to speaker-based segmentation of audio material. The algorithm consists of two distinct procedures, namely splitting and merging. Its performance is assessed on broadcast news recordings provided by the British Broadcasting Corporation (BBC). Results show that the splitting is performed with high accuracy and a low missed detection rate, while the merging procedure provides satisfactory results.
FAST AND ROBUST SPEAKER CLUSTERING USING THE EARTH MOVER’S DISTANCE AND MIXMAX MODELS
Speech and Music Classification and Separation: A Review
2006
The classification and separation of speech and music signals have attracted the attention of many researchers. The classification process is needed to build two different libraries, a speech library and a music library, from a stream of sounds. The separation process, in contrast, is needed in a cocktail-party problem to separate speech from music and remove the undesired one. In this paper, a review of the existing classification and separation algorithms is presented and discussed. The classification algorithms are divided into three categories: time-domain, frequency-domain, and time-frequency domain approaches. The time-domain approaches used in the literature are: the zero-crossing rate (ZCR), the short-time energy (STE), the ZCR and the STE with positive derivative, with some of their modified versions, the variance of the roll-off, and neural networks. The frequency-domain approaches are mainly based on: spectral centroid, variance of the spectral centroid, spectral flux, variance of the spectral flux, roll-off of the spectrum, cepstral residual, and the delta pitch. The time-frequency domain approaches have not yet been tested thoroughly in the literature, so the spectrogram and the evolutionary spectrum are introduced. Some new algorithms dealing with music and speech separation and segregation processes are also presented.
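Two of the time-domain features named in this abstract, the zero-crossing rate (ZCR) and the short-time energy (STE), can be computed per frame as sketched below. This uses common textbook definitions and is not taken from the reviewed paper; the frame length, hop size, and function names are assumptions.

```python
import numpy as np

def frame_signal(x, frame_len, hop):
    # Slice a 1-D signal into overlapping frames of frame_len samples.
    # Assumes len(x) >= frame_len; trailing samples that do not fill a
    # whole frame are dropped.
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]

def zero_crossing_rate(frames):
    # Fraction of adjacent sample pairs in each frame whose signs differ.
    signs = np.signbit(frames)
    return np.mean(signs[:, 1:] != signs[:, :-1], axis=1)

def short_time_energy(frames):
    # Mean squared amplitude of each frame.
    return np.mean(frames ** 2, axis=1)
```

For a pure tone of frequency f sampled at rate fs, the ZCR defined this way is close to 2f/fs, which is why it is often used as a rough indicator of spectral content.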

