Prosody-based automatic segmentation of speech into sentences and topics
- SPEECH COMMUNICATION
, 2000
Cited by 137 (41 self)
A crucial step in processing speech audio data for information extraction, topic detection, or browsing/playback is to segment the input into sentence and topic units. Speech segmentation is challenging, since the cues typically present for segmenting text (headers, paragraphs, punctuation) are absent in spoken language. We investigate the use of prosody (information gleaned from the timing and melody of speech) for these tasks. Using decision tree and hidden Markov modeling techniques, we combine prosodic cues with word-based approaches, and evaluate performance on two speech corpora, Broadcast News and Switchboard. Results show that the prosodic model alone performs on par with, or better than, word-based statistical language models, for both true and automatically recognized words in news speech. The prosodic model achieves comparable performance with significantly less training data, and requires no hand-labeling of prosodic events. Across tasks and corpora, we obtain a significant improvement over word-only models using a probabilistic combination of prosodic and lexical information. Inspection reveals that the prosodic models capture language-independent boundary indicators described in the literature. Finally, cue usage is task and corpus dependent. For example, pause and pitch features are highly informative for segmenting news speech, whereas pause, duration and word-based cues dominate for natural conversation.
Can Prosody Aid the Automatic Classification of Dialog Acts in Conversational Speech?
, 1998
Cited by 72 (16 self)
Identifying whether an utterance is a statement, question, greeting, and so forth is integral to effective automatic understanding of natural dialog. Little is known, however, about how such dialog acts (DAs) can be automatically classified in truly natural conversation. This study asks whether current approaches, which use mainly word information, could be improved by adding prosodic information. The study is based on more than 1000 conversations from the Switchboard corpus. DAs were hand-annotated, and prosodic features (duration, pause, F0, energy, and speaking rate) were automatically extracted for each DA. In training, decision trees based on these features were inferred
Integrating prosodic and lexical cues for automatic topic segmentation
- Computational Linguistics
, 2001
Cited by 30 (6 self)
SRI International. We present a probabilistic model that uses both prosodic and lexical cues for the automatic segmentation of speech into topically coherent units. We propose two methods for combining lexical and prosodic information using hidden Markov models and decision trees. Lexical information is obtained from a speech recognizer, and prosodic features are extracted automatically from speech waveforms. We evaluate our approach on the Broadcast News corpus, using the DARPA TDT evaluation metrics. Results show that the prosodic model alone is competitive with word-based segmentation methods. Furthermore, we achieve a significant reduction in error by combining the prosodic and word-based knowledge sources.
Prosodic Cues to Recognition Errors
- In Proceedings of the Automatic Speech Recognition and Understanding Workshop
, 1999
Cited by 24 (7 self)
We identify methods of distinguishing between correctly and incorrectly recognized utterances (scored by hand for semantic concept accuracy) for a speech recognition system, using acoustic/prosodic characteristics. The analysis was performed on data collected during independent experiments done with an interactive voice response system that provides travel information over the phone. 1. INTRODUCTION There has been little research in the field of automatic speech recognition (ASR) on the question of how misrecognized utterances differ from correctly recognized utterances. Recognition performance is known to vary depending upon the relative formality or casualness of speaking style [14], but there has been little attempt to identify this variation precisely. An exception is a study of the effect of speaking style on recognition performance in the Switchboard Corpus in which a standard recognition system was augmented with a conditioning variable, the speaking style (mode) [8]. Lexical ...
Predicting Automatic Speech Recognition Performance Using Prosodic Cues
- IN PROCEEDINGS OF NAACL-00
, 2000
Cited by 21 (6 self)
In spoken dialogue systems, it is important for a system to know how likely a speech recognition hypothesis is to be correct, so it can reprompt for fresh input, or, in cases where many errors have occurred, change its interaction strategy or switch the caller to a human attendant. We have discovered prosodic features which more accurately predict when a recognition hypothesis contains a word error than the acoustic confidence score thresholds traditionally used in automatic speech recognition. We present analytic results indicating that there are significant prosodic differences between correctly and incorrectly recognized turns in the TOOT train information corpus. We then present machine learning results showing how the use of prosodic features to automatically predict correctly versus incorrectly recognized turns improves over the use of acoustic confidence scores alone.
Combining Words and Prosody for Information Extraction from Speech
- in Proc. Eurospeech
, 1999
Cited by 18 (3 self)
Information extraction from speech is a crucial step on the way from speech recognition to speech understanding. A preliminary step toward speech understanding is the detection of topic boundaries, sentence boundaries, and proper names in speech recognizer output. This is important since speech recognizer output lacks the usual textual cues to these entities (such as headers, paragraphs, sentence punctuation, and capitalization). Numerous word-based approaches to these tasks have been developed in the past; in this work we demonstrate the use of prosodic cues, alone and in combination with words, for segmentation and name finding. In experiments on the Broadcast News corpus, we find that prosodic cues alone allow sentence and topic segmentation that is at least as good as word-based methods alone, and that combining both types of cues gives significant wins. Named entity recognition, on the other hand, currently does not seem to benefit from prosodic cues, for several interesting reasons.
Filled Pauses As Markers Of Discourse Structure
, 1996
Cited by 17 (0 self)
This study aims to test quantitatively whether filled pauses (FPs) may highlight discourse structure. More specifically, it is first investigated whether FPs are more typical in the vicinity of major discourse boundaries. Secondly, the FPs are analyzed acoustically, to check whether those occurring at major discourse boundaries are segmentally and prosodically different from those at shallower breaks. Analyses of twelve spontaneous monologues (Dutch) show that phrases following major discourse boundaries more often contain FPs. Additionally, FPs after stronger breaks tend to occur phrase-initially, whereas the majority of the FPs after weak boundaries are in phrase-internal position. Also, acoustic observations reveal that FPs at major discourse boundaries are both segmentally and prosodically distinct. They also differ with respect to the distribution of neighbouring silent pauses.
Combining words and speech prosody for automatic topic segmentation
- In Proceedings DARPA Broadcast News Workshop (pp. 61–64)
, 1999
Cited by 14 (3 self)
We present a probabilistic model that uses both prosodic and lexical cues for the automatic segmentation of speech into topic units. The approach combines hidden Markov models, statistical language models, and prosody-based decision trees. Lexical information is obtained from a speech recognizer, and prosodic features are extracted automatically from speech waveforms. We evaluate our approach on the Broadcast News corpus, using standard evaluation metrics. Results show that the prosodic model alone outperforms the word-based segmentation method. Furthermore, we achieve an additional reduction in error by combining the prosodic and word-based knowledge sources.
Linguistic adaptations during spoken and multimodal error resolution
- Language and Speech. Special issue on Prosody and Conversation
, 1998
Cited by 14 (1 self)
error resolution, hyperarticulation, linguistic contrast, multimodal interaction, spiral errors, spoken and multimodal interaction
Information Extraction from Broadcast News
- Philosophical Transactions of the Royal Society of London, Series A
, 2000
Cited by 14 (7 self)
This paper discusses the development of trainable statistical models for extracting content from television and radio news broadcasts. In particular we concentrate on statistical finite state models for identifying proper names and other named entities in broadcast speech. Two models are presented: the first models name class information as a word attribute; the second explicitly models both word-word and class-class transitions. A common n-gram based formulation is used for both models. The task of named entity identification is characterized by relatively sparse training data and issues related to smoothing are discussed. Experiments are reported using the DARPA/NIST Hub-4E evaluation for North American Broadcast News.

