Automatic Dialog Act Segmentation and Classification in Multiparty Meetings
- in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), 2005
Cited by 58 (10 self)
We explore the two related tasks of dialog act (DA) segmentation and DA classification for speech from the ICSI Meeting Corpus. We employ simple lexical and prosodic knowledge sources, and compare results for human-transcribed versus automatically recognized words. Since there is little previous work on DA segmentation and classification in the meeting domain, our study provides baseline performance rates for both tasks. We introduce a range of metrics for use in evaluation, each of which measures different aspects of interest. Results show that both tasks are difficult, particularly for a fully automatic system. We find that a very simple prosodic model aids performance over lexical information alone, especially for segmentation. Both tasks, but particularly word-based segmentation, are degraded by word recognition errors. Finally, while classification results for meeting data show some similarities to previous results for telephone conversations, findings also suggest a potential difference with respect to the effect of modeling DA context.
Observations on overlap: Findings and implications for automatic processing of multi-party conversation
- Proc. EUROSPEECH, 2001
Cited by 51 (10 self)
We examine the distribution of overlapping speech in different corpora of natural multi-party conversations, including two types of meetings, and two corpora of telephone conversations. Analyses are based on forced alignment and speech recognition using an identical recognizer across tasks. Three results are discussed. First, all corpora show high overall rates of overlap, with similar rates for meetings and telephone conversations. Second, speech recognition performance in non-overlapped regions of meetings is no worse than that in single-channel telephone conversations, while recognition in overlap regions degrades considerably. Finally, interrupt locations are associated with endpoints of word-level events in a speaker’s turn, including backchannels, discourse markers, and disfluencies. Results suggest that overlap is an important inherent characteristic of conversational speech that should not be ignored; on the contrary, it should be jointly modeled with acoustic and language model information in machine processing of conversation.
Minimum cut model for spoken lecture segmentation
- In Proceedings of the Annual Meeting of the Association for Computational Linguistics (COLING-ACL 2006), 2006
Cited by 35 (7 self)
We consider the task of unsupervised lecture segmentation. We formalize segmentation as a graph-partitioning task that optimizes the normalized cut criterion. Our approach moves beyond localized comparisons and takes into account long-range cohesion dependencies. Our results demonstrate that global analysis improves the segmentation accuracy and is robust in the presence of speech recognition errors.
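As a rough illustration of the normalized cut criterion named in this abstract, the sketch below scores a single boundary in a sequence of sentences using the standard two-way definition, cut(A,B)/vol(A) + cut(A,B)/vol(B). It is not the paper's algorithm: the abstract does not describe the optimization procedure, and the exhaustive single-boundary search, the function names, and the toy similarity matrix are all assumptions made for illustration.

```python
def normalized_cut(similarity, boundary):
    """Two-way normalized cut for splitting items [0, boundary) from [boundary, n).

    `similarity` is a symmetric n x n list of lists of non-negative weights
    (a hypothetical sentence-similarity graph).
    """
    n = len(similarity)
    if not 0 < boundary < n:
        raise ValueError("boundary must leave both segments non-empty")
    left = range(boundary)
    right = range(boundary, n)
    # Total weight of edges crossing the boundary.
    cut = sum(similarity[i][j] for i in left for j in right)
    # Volume of a segment: total weight from its nodes to all nodes.
    vol_left = sum(similarity[i][j] for i in left for j in range(n))
    vol_right = sum(similarity[i][j] for i in right for j in range(n))
    return cut / vol_left + cut / vol_right


def best_boundary(similarity):
    """Exhaustively pick the single boundary with the lowest normalized cut."""
    n = len(similarity)
    return min(range(1, n), key=lambda b: normalized_cut(similarity, b))


# Toy graph (assumed data): sentences 0-2 are mutually similar, as are 3-4.
toy = [[1.0 if (i < 3) == (j < 3) else 0.1 for j in range(5)] for i in range(5)]
boundary = best_boundary(toy)
```

Because each volume sums over all nodes rather than only nearby ones, the score reflects cohesion across the whole sequence, which is the kind of global comparison the abstract contrasts with localized ones.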
Spotting "Hot Spots" in Meetings: Human Judgments and Prosodic Cues
- in Proc. Eurospeech, 2003
Cited by 30 (3 self)
Recent interest in the automatic processing of meetings is motivated by a desire to summarize, browse, and retrieve important information from lengthy archives of spoken data. One of the most useful capabilities such a technology could provide is a way for users to locate "hot spots" or regions in which participants are highly involved in the discussion (e.g. heated arguments, points of excitement, etc.). We ask two questions about hot spots in meetings in the ICSI Meeting Recorder corpus. First, we ask whether involvement can be judged reliably by human listeners. Results show that despite the subjective nature of the task, raters show significant agreement in distinguishing involved from non-involved utterances. Second, we ask whether there is a relationship between human judgments of involvement and automatically extracted prosodic features of the associated regions. Results show that there are significant differences in both F0 and energy between involved and non-involved utterances. These findings suggest that humans do agree to some extent on the judgment of hot spots, and that acoustic-only cues could be used for automatic detection of hot spots in natural meetings.
Is the speaker done yet? Faster and more accurate end-of-utterance detection using prosody
- in Proc. ICSLP
Cited by 30 (2 self)
We examine the problem of end-of-utterance (EOU) detection for real-time speech recognition, particularly in the context of a human-computer dialog system. Current EOU detection algorithms use only a simple pause threshold for making this decision, leading to two problems. First, especially as speech-driven interfaces become more natural, users often pause inside utterances, resulting in a premature cutoff by the system. Second, when users really are done, the minimum system wait is always the threshold value, needlessly adding time to the interaction. We have developed a new approach to EOU detection that uses prosodic features to address both of these problems. Prosodic features are modeled by decision trees and combined with an event N-gram language model to obtain a score that measures the likelihood that any nonspeech region is an EOU. We find that this approach dramatically improves both the accuracy and speed of online EOU detection.
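The pause-threshold baseline that this abstract criticizes can be sketched in a few lines. This is only an illustration of the baseline behavior, not the prosodic system the paper proposes; the function name, the 0.5-second default, and the example pause durations are hypothetical.

```python
def pause_threshold_eou(pause_durations, threshold=0.5):
    """Return the index of the first pause treated as end of utterance.

    `pause_durations` lists the nonspeech gaps (in seconds) in the order they
    occur; the last one is assumed to be the true end of the utterance.
    The 0.5 s default is an illustrative value, not one taken from the paper.
    """
    for index, duration in enumerate(pause_durations):
        if duration >= threshold:
            return index
    return None  # no pause was long enough to trigger a decision


# A speaker hesitates for 0.8 s mid-utterance before the true final pause.
pauses = [0.2, 0.8, 1.5]
early = pause_threshold_eou(pauses, threshold=0.5)  # fires on the hesitation
late = pause_threshold_eou(pauses, threshold=1.0)   # waits for the final pause
```

With the lower threshold the hesitation is mistaken for the end of the utterance; with the higher one the decision is correct, but the system must always wait at least one second after the user has finished.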
Modeling Prosodic Dynamics for Speaker Recognition
- ICASSP, 2003
Cited by 26 (4 self)
Most current state-of-the-art automatic speaker recognition systems extract speaker-dependent features by looking at short-term spectral information. This approach ignores long-term information that can convey supra-segmental information, such as prosodics and speaking style. We propose two approaches that use the fundamental frequency and energy trajectories to capture long-term information. The first approach uses bigram models to model the dynamics of the fundamental frequency and energy trajectories for each speaker. The second approach uses the fundamental frequency trajectories of a pre-defined set of words as the speaker templates and then, using dynamic time warping, computes the distance between the templates and the words from the test message. The results presented in this work are on Switchboard I using the NIST Extended Data evaluation design. We show that these approaches can achieve an equal error rate of 3.7%, which is a 77% relative improvement over a system based on short-term pitch and energy features alone.
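The template-matching step in the second approach relies on dynamic time warping (DTW). A minimal textbook version is sketched below, assuming one-dimensional F0 trajectories and an absolute-difference local cost. The abstract does not give the paper's local cost, step constraints, or length normalization, so those choices, along with the function name and sample values, are assumptions.

```python
def dtw_distance(template, test):
    """Textbook DTW distance between two 1-D trajectories (e.g. F0 in Hz)."""
    n, m = len(template), len(test)
    inf = float("inf")
    # cost[i][j] holds the best alignment cost of template[:i] with test[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(template[i - 1] - test[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # template frame advances, test frame repeats
                cost[i][j - 1],      # test frame advances, template frame repeats
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


# Hypothetical F0 contours for the same word spoken at two speeds.
speaker_template = [120.0, 130.0, 125.0]
slower_rendition = [120.0, 130.0, 130.0, 125.0]
distance = dtw_distance(speaker_template, slower_rendition)
```

Because the warping path may repeat frames, two renditions with the same contour shape but different durations receive a small distance, which is why DTW suits comparing word-level trajectories of varying length.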
Punctuation annotation using statistical prosody models
- in Proc. ISCA Workshop on Prosody in Speech Recognition and Understanding, 2001
Cited by 26 (2 self)
This paper is about the development of statistical models of prosodic features to generate linguistic meta-data for spoken language. In particular, we are concerned with automatically punctuating the output of a broadcast news speech recogniser. We present a statistical finite state model that combines prosodic, linguistic and punctuation class features. Experimental results are presented using the Hub–4 Broadcast News corpus, and in the light of our results we discuss the issue of a suitable method of evaluating the present task.
Using conditional random fields for sentence boundary detection in speech
- In Proceedings of the 43rd Annual Meeting of the ACL, 2005
Cited by 24 (5 self)
Sentence boundary detection in speech is important for enriching speech recognition output, making it easier for humans to read and downstream modules to process. In previous work, we have developed hidden Markov model (HMM) and maximum entropy (Maxent) classifiers that integrate textual and prosodic knowledge sources for detecting sentence boundaries. In this paper, we evaluate the use of a conditional random field (CRF) for this task and relate results with this model to our prior work. We evaluate across two corpora (conversational telephone speech and broadcast news speech) on both human transcriptions and speech recognition output. In general, our CRF model yields a lower error rate than the HMM and Maxent models on the NIST sentence boundary detection task in speech, although it is interesting to note that the best results are achieved by three-way voting among the classifiers. This probably occurs because each model has different strengths and weaknesses for modeling the knowledge sources.
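The three-way voting mentioned in this abstract can be pictured as a per-boundary majority vote. The abstract does not say how the votes are combined, so the sketch below assumes each classifier emits one hard label per inter-word position; the function name and the example label sequences are hypothetical.

```python
from collections import Counter


def majority_vote(hmm_labels, maxent_labels, crf_labels):
    """Combine three per-position label sequences by simple majority.

    Each argument holds one label per inter-word position, for example
    1 for "sentence boundary" and 0 for "no boundary".
    """
    if not (len(hmm_labels) == len(maxent_labels) == len(crf_labels)):
        raise ValueError("all classifiers must label the same positions")
    voted = []
    for votes in zip(hmm_labels, maxent_labels, crf_labels):
        # With three voters and two classes, one label always has a majority.
        label, _count = Counter(votes).most_common(1)[0]
        voted.append(label)
    return voted


# Hypothetical outputs for three word boundaries.
combined = majority_vote([1, 0, 1], [1, 0, 0], [0, 0, 1])
```

A vote of this kind helps only when the classifiers make different errors, which is consistent with the abstract's remark that each model has different strengths and weaknesses.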
Automatic Punctuation And Disfluency Detection In Multi-Party Meetings Using Prosodic And Lexical Cues
- 2002
Cited by 21 (4 self)
We investigate automatic approaches to finding "hidden" spontaneous speech events, such as sentence boundaries and disfluencies, in multi-party meetings. Hidden events are characterized prosodically by a large array of automatically extracted energy, duration, and pitch features, and are modeled by decision tree classifiers; lexical cues are modeled by N-gram language models. Both sources of information are combined in a hidden Markov model framework. Results show that combined classifiers achieve higher accuracy than either single knowledge source alone. We also study classifiers that use only the preceding context for predicting events, simulating online processing. We find that prosodic features are more robust than are language model features to this constraint. Finally, we examine the effect of automatic word recognition errors, in both training and testing, on classification accuracy. We find that lexical models degrade much more severely than do prosodic models in this case, again showing the relative robustness of prosodic information for hidden-event detection in natural conversation.
Meetings About Meetings: Research At ICSI On Speech In Multiparty Conversations
- Proc. IEEE ICASSP, 2003
Cited by 19 (2 self)
In early 2001 we reported (at the Human Language Technology meeting) the early stages of an ICSI project on processing speech from meetings (in collaboration with other sites, principally SRI, Columbia, and UW). In this paper we report our progress from the first few years of this effort, including: the collection and subsequent release of a 75-meeting corpus (over 70 meeting-hours and up to 16 channels for each meeting); the development of a prosodic database for a large subset of these meetings, and its subsequent use for punctuation and disfluency detection; the development of a dialog annotation scheme and its implementation for a large subset of the meetings; and the improvement of both near-mic and far-mic speech recognition results for meeting speech test sets.

