Prosody-based automatic segmentation of speech into sentences and topics
- SPEECH COMMUNICATION, 2000
- Cited by 137 (41 self)
A crucial step in processing speech audio data for information extraction, topic detection, or browsing/playback is to segment the input into sentence and topic units. Speech segmentation is challenging, since the cues typically present for segmenting text (headers, paragraphs, punctuation) are absent in spoken language. We investigate the use of prosody (information gleaned from the timing and melody of speech) for these tasks. Using decision tree and hidden Markov modeling techniques, we combine prosodic cues with word-based approaches, and evaluate performance on two speech corpora, Broadcast News and Switchboard. Results show that the prosodic model alone performs on par with, or better than, word-based statistical language models, for both true and automatically recognized words in news speech. The prosodic model achieves comparable performance with significantly less training data, and requires no hand-labeling of prosodic events. Across tasks and corpora, we obtain a significant improvement over word-only models using a probabilistic combination of prosodic and lexical information. Inspection reveals that the prosodic models capture language-independent boundary indicators described in the literature. Finally, cue usage is task and corpus dependent. For example, pause and pitch features are highly informative for segmenting news speech, whereas pause, duration and word-based cues dominate for natural conversation.
The LIMSI Broadcast News Transcription System
- Speech Communication, 2002
- Cited by 84 (5 self)
This paper reports on activities at LIMSI over the last few years directed at the transcription of broadcast news data. We describe our development work in moving from laboratory read speech data to real-world or 'found' speech data in preparation for the ARPA Nov96, Nov97 and Nov98 evaluations. Two main problems needed to be addressed to deal with the continuous flow of inhomogeneous data. These concern the varied acoustic nature of the signal (signal quality, environmental and transmission noise, music) and different linguistic styles (prepared and spontaneous speech on a wide range of topics, spoken by a large variety of speakers).
Segmentation, Classification and Clustering of an Italian Broadcast News Corpus
- IN PROC. OF RIAO, 2000
- Cited by 12 (5 self)
This work reports on preliminary activity at ITC-irst on the problem of acoustic segmentation, classification and clustering of an Italian audio broadcast news corpus. The approach is based on the following stages. First, the input data stream is segmented by detecting spectral changes through the Bayesian Information Criterion (BIC). Second, segments are classified in terms of acoustic conditions, modeled by mixtures of Gaussians. Finally, segments from the same speaker are clustered, again using the BIC. The proposed scheme for automatic segmentation, classification and clustering degrades the recognition error rate, relative to the fully supervised experiment, by 1.3% before adaptation and 3.4% after adaptation.
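The BIC-based change detection mentioned in this abstract can be illustrated with a minimal sketch. It is not the ITC-irst implementation: it assumes a single diagonal-covariance Gaussian per segment (published systems typically use full covariances), and the names `delta_bic`, `best_change_point`, `margin` and `penalty_weight` are hypothetical. A positive score means that modeling the window as two segments is preferred over one.

```python
import math
import random

def log_det_diag(frames):
    """Log-determinant of the ML diagonal covariance (sum of log variances)."""
    n = len(frames)
    dim = len(frames[0])
    total = 0.0
    for d in range(dim):
        col = [f[d] for f in frames]
        mean = sum(col) / n
        var = sum((x - mean) ** 2 for x in col) / n
        total += math.log(max(var, 1e-10))  # variance floor avoids log(0)
    return total

def delta_bic(frames, t, penalty_weight=1.0):
    """Delta-BIC for splitting frames at index t; positive favors a change."""
    n = len(frames)
    dim = len(frames[0])
    left, right = frames[:t], frames[t:]
    # Log-likelihood gain of two Gaussians over one Gaussian.
    gain = 0.5 * (n * log_det_diag(frames)
                  - len(left) * log_det_diag(left)
                  - len(right) * log_det_diag(right))
    # The second diagonal Gaussian adds 2 * dim free parameters.
    penalty = 0.5 * penalty_weight * (2 * dim) * math.log(n)
    return gain - penalty

def best_change_point(frames, margin=10, penalty_weight=1.0):
    """Return (index, score) of the best split, or (None, score) if none is favored."""
    best_t, best_score = None, float("-inf")
    for t in range(margin, len(frames) - margin):
        score = delta_bic(frames, t, penalty_weight)
        if score > best_score:
            best_t, best_score = t, score
    if best_score <= 0:
        return None, best_score
    return best_t, best_score

# Synthetic two-dimensional "features": the mean shifts after frame 100.
random.seed(0)
frames = ([[random.gauss(0.0, 1.0), random.gauss(0.0, 1.0)] for _ in range(100)]
          + [[random.gauss(5.0, 1.0), random.gauss(5.0, 1.0)] for _ in range(100)])
change_index, score = best_change_point(frames)
```

In practice the penalty weight is tuned on held-out data, and the search runs over a sliding window rather than the whole stream.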
Real-Time Speaker Identification and Verification
- ACCEPTED FOR PUBLICATION IN IEEE TRANS. SPEECH & AUDIO PROCESSING
- Cited by 10 (7 self)
In speaker identification, most of the computation originates from the distance or likelihood computations between the feature vectors of the unknown speaker and the models in the database. The identification time depends on the number of feature vectors, their dimensionality, the complexity of the speaker models and the number of speakers. In this paper, we concentrate on optimizing vector quantization (VQ) based speaker identification. We reduce the number of test vectors by pre-quantizing the test sequence prior to matching, and the number of speakers by pruning out unlikely speakers during the identification process. The best variants are then generalized to Gaussian mixture model (GMM) based modeling. We also apply the algorithms to efficient cohort set search for score normalization in speaker verification. We obtain a speed-up factor of 16:1 in the case of VQ-based modeling with minor degradation in the identification accuracy, and 34:1 in the case of GMM-based modeling. An equal error rate of 7% can be reached in 0.84 seconds on average when the length of the test utterance is 30.4 seconds.
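As a rough illustration of the matching step this abstract describes, the sketch below scores a test sequence against per-speaker VQ codebooks by average quantization distortion. The codebooks, speaker names and the `step` parameter are made up for the example, and keeping every `step`-th vector is only a crude stand-in for the pre-quantization studied in the paper, not the authors' algorithm. Speaker pruning is omitted.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def avg_distortion(vectors, codebook):
    """Average squared distance from each vector to its nearest code vector."""
    return sum(min(sq_dist(v, c) for c in codebook) for v in vectors) / len(vectors)

def identify(test_vectors, codebooks, step=1):
    """Return (speaker, scores); the lowest average distortion wins.

    step > 1 keeps every step-th test vector, which reduces the number of
    distance computations roughly by that factor.
    """
    reduced = test_vectors[::step]
    scores = {spk: avg_distortion(reduced, cb) for spk, cb in codebooks.items()}
    return min(scores, key=scores.get), scores

# Toy two-dimensional codebooks for two hypothetical speakers.
codebooks = {
    "alice": [(0.0, 0.0), (1.0, 1.0)],
    "bob": [(5.0, 5.0), (6.0, 6.0)],
}
test_vectors = [(0.1, 0.2), (0.9, 1.1), (0.2, -0.1), (1.2, 0.8)]

speaker, scores = identify(test_vectors, codebooks)
fast_speaker, _ = identify(test_vectors, codebooks, step=2)
```

The cost of matching grows with the number of test vectors times the number of speakers times the codebook size, which is why reducing either of the first two factors yields the speed-ups the abstract reports.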
Spectral Features for Automatic Text-Independent Speaker Recognition
- 2003
- Cited by 9 (2 self)
The front-end, or feature extractor, is the first component in an automatic speaker recognition system. Feature extraction transforms the raw speech signal into a compact but effective representation that is more stable and discriminative than the original signal. Since the front-end is the first component in the chain, the quality of the later components (speaker modeling and pattern matching) is strongly determined by the quality of the front-end. In other words, classification can be at most as accurate as the features.
Model Selection Criteria for Acoustic Segmentation
- in Proc. of the ISCA ITRW ASR2000 Automatic Speech Recognition, 2000
- Cited by 9 (2 self)
Robust acoustic segmentation has become a critical issue in order to apply speech recognition to audio streams with variable acoustic content, e.g. radio programs. Many techniques in the literature base segmentation on statistical model selection, by applying the Bayesian Information Criterion. This work reviews alternative model selection criteria and presents comparative experiments both under controlled conditions and on a broadcast news corpus.
The 1999 BBN BYBLOS 10xRT Broadcast News Transcription System
- in 2000 Speech Transcription Workshop, 2000
- Cited by 1 (0 self)
In this paper, we describe the BBN BYBLOS system used for the 1999 Hub-4E 10xRT evaluation benchmark, and discuss the improvements made to the system in 1999. We focus on the techniques that were new in this year's system to achieve an optimal tradeoff between accuracy and speed for the evaluation benchmark test. Overall, we improved the recognition accuracy on the 1998 Hub-4E evaluation test by 14% relative to our 1998 10xRT system (from 17.1% to 14.7% word error rate), or equivalently we sped up the 1998 Primary system 24 times (from 240xRT to 10xRT) while maintaining the same word error rate (14.7%). This progress was attributed to improvement in fast segmentation using dual-band and dual-gender phone-class models based on RASTA-normalized features, supervised MLLR adaptation of band-limited models to real telephone training data, adaptation between decoding passes, and various adaptation speedups.
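The two headline figures in this abstract follow from simple ratios, reproduced in the short check below. The helper name `relative_reduction` is hypothetical; the input numbers are the ones quoted in the abstract.

```python
def relative_reduction(old, new):
    """Relative reduction of an error rate, as a fraction of the old value."""
    return (old - new) / old

# Word error rate went from 17.1% to 14.7%.
wer_gain = relative_reduction(17.1, 14.7)

# Run time went from 240 times real time to 10 times real time.
speedup = 240 / 10
```

The absolute improvement is 2.4 points of word error rate; the 14% figure is relative to the 17.1% baseline.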
Blind Segmentation and Labeling of Speakers via the Bayesian Information Criterion for Video-Conference Indexing
- Cited by 1 (1 self)
The purpose of this paper is to present a system which breaks input speech into segments and identifies each new appearance of the same speaker with a consistent label. This task adds up to a topic detection system that makes use of key-word recognition to obtain suitable labels for an automatic indexing system project. Both the definition of the segments and the identification of the speaker for each segment are performed using an acoustic similarity measure. Our task is to separate and identify the different speakers who appear in a video-conference session without any prior knowledge of the speakers or their number. The first aim is to detect the time points where a speaker change takes place using a robust acoustic change detection (ACD) system. Afterwards, the regions defined by these time marks are labeled with the use of a clustering algorithm. The Bayesian Information Criterion (BIC) is the key element in the system, and is used in several ways as a measure to compare speech. EERs of 13.66% are obtained for this task with a soft feedback of clustering information to enhance ACD performance.
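The equal error rate (EER) quoted here is the operating point at which the miss rate equals the false-alarm rate. The sketch below approximates it by sweeping a threshold over the observed scores; the function name and the toy scores are invented for illustration and do not come from the paper, and the convention that higher scores indicate a true change is an assumption.

```python
def equal_error_rate(target_scores, impostor_scores):
    """Approximate the EER by trying every observed score as a threshold."""
    thresholds = sorted(set(target_scores) | set(impostor_scores))
    best_gap, eer = float("inf"), None
    for th in thresholds:
        # A target scoring below the threshold is missed.
        miss = sum(s < th for s in target_scores) / len(target_scores)
        # A non-target scoring at or above the threshold is a false alarm.
        false_alarm = sum(s >= th for s in impostor_scores) / len(impostor_scores)
        gap = abs(miss - false_alarm)
        if gap < best_gap:
            best_gap, eer = gap, (miss + false_alarm) / 2
    return eer

# Toy scores: true change points versus non-change points.
targets = [0.9, 0.8, 0.7, 0.4]
impostors = [0.6, 0.3, 0.2, 0.1]
eer = equal_error_rate(targets, impostors)
```

With few scores the two error curves may not cross exactly, so the function reports the midpoint at the closest threshold; evaluation toolkits usually interpolate instead.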
Speaker Diarization: A Review of Recent Research
- 2010
- Cited by 1 (0 self)
Speaker diarization is the task of determining “who spoke when?” in an audio or video recording that contains an unknown amount of speech and also an unknown number of speakers. Initially, it was proposed as a research topic related to automatic speech recognition, where speaker diarization serves as an upstream processing step. Over recent years, however, speaker diarization has become an important key technology for many tasks, such as navigation, retrieval, or higher-level inference on audio data. Accordingly, many important improvements in accuracy and robustness have been reported in journals and conferences in the area. The application domains, from broadcast news, to lectures and meetings, vary greatly and pose different problems, such as having access to multiple microphones and multimodal information or overlapping speech. The most recent review of existing technology dates back to 2006 and focuses on the broadcast news domain. In this paper we review the current state-of-the-art, focusing on research developed since 2006 that relates predominantly to speaker diarization for conference meetings. Finally, we present an analysis of speaker diarization performance as reported through the NIST Rich Transcription evaluations on meeting data and identify important areas for future research. Index Terms: speaker diarization, rich transcription, meetings.
Using Acoustic Condition Clustering To Improve Acoustic Change Detection On Broadcast News
- Proc. ICSLP 2000
We have developed a system that breaks input speech into segments using an acoustic similarity measure. The aim is to detect the time points where the acoustic characteristics change, usually due to speaker changes but also resulting from changes in the acoustic environment. We have also developed a system to cluster the segments generated by the first system into clusters composed of homogeneous acoustic conditions. In this paper, we present a technique to improve the robustness of the acoustic change detection by feeding back the results of the segment clustering, exploiting the extra information available in the distance between the two clusters to which the segments belong. The interaction between the acoustic change detection and clustering systems gives us a substantial improvement over results previously reported on the 1997 Hub-4 Broadcast News test set that we employed [1][2]: Feedback of clustering information improved the Equal Error Rate (EER) of our acoustic change detect...

