Characterizing and Recognizing Spoken Corrections in Human-Computer Dialogue
- In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics, COLING/ACL 98
, 1998
Cited by 51 (5 self)
Miscommunication in speech recognition systems is unavoidable, but a detailed characterization of user corrections will enable speech systems to identify when a correction is taking place and to more accurately recognize the content of correction utterances. In this paper we investigate the adaptations of users when they encounter recognition errors in interactions with a voice-in/voice-out spoken language system. In analyzing more than 300 pairs of original and repeat correction utterances, matched on speaker and lexical content, we found overall increases in both utterance and pause duration from original to correction. Interestingly, corrections of misrecognition errors (CMEs) exhibited significantly heightened pitch variability, while corrections of rejection errors (CREs) showed only a small but significant decrease in pitch minimum. CMEs demonstrated much greater increases in measures of duration and pitch variability than CREs. These contrasts allow the development of decision t...
How to find trouble in communication
, 2003
Cited by 48 (7 self)
Automatic dialogue systems used, for instance, in call centers should be able to determine, in a critical phase of the dialogue (indicated by the customer's vocal expression of anger/irritation), when it is better to pass over to a human operator. At first glance, this does not seem to be a complicated task: it is reported in the literature that emotions can be told apart quite reliably on the basis of prosodic features. However, these results are achieved most of the time in a laboratory setting, with experienced speakers (actors), and with elicited, controlled speech. We compare classification results obtained with the same feature set for elicited speech and for a Wizard-of-Oz scenario, where users believe that they are really communicating with an automatic dialogue system. It turns out that the closer we get to a realistic scenario, the less reliable prosody is as an indicator of the speaker's emotional state. As a consequence, we propose to change the target such that we cease looking for traces of particular emotions in the user's speech, but instead look for indicators of TROUBLE IN COMMUNICATION. For this reason, we propose the module Monitoring of User State [especially of] Emotion (MOUSE), in which a prosodic classifier is combined with other knowledge sources, such as conversationally peculiar linguistic behavior, for example, the use of repetitions. For this module, preliminary experimental results are reported showing a more adequate modelling of TROUBLE IN COMMUNICATION.
Let's stop pushing the envelope and start addressing it: a Reference Task Agenda for HCI
, 2000
Cited by 44 (2 self)
We identify a problem with the process of research in the HCI community -- an overemphasis on "radical invention" at the price of achieving a common research focus. Without such a focus, it is difficult to build on previous work, to compare different interaction techniques objectively, and to make progress in developing theory. These problems at the research level have implications for practice, too; as
Desperately Seeking Emotions Or: Actors, Wizards, And Human Beings
, 2000
Cited by 35 (4 self)
Automatic dialogue systems used in call centers, for instance, should be able to determine, in a critical phase of the dialogue (indicated by the customer's vocal expression of anger/irritation), when it is better to pass over to a human operator. At first glance, this does not seem to be a complicated task: it is reported in the literature that emotions can be told apart quite reliably on the basis of prosodic features. However, these results are most of the time achieved in a laboratory setting, with experienced speakers (actors), and with elicited, controlled speech. We report classification results obtained within different experimental settings for the two-class problem `neutral vs. anger' using a vector of prosodic features and discuss the impact of single features on the classification rate. Recognition rates for these settings are best for a speaker-specific classifier (one experienced speaker, acting), worse for a speaker-independent classifier (several less experienced speakers, reading), and even worse for a speaker-independent classifier with naive subjects performing the task of appointment scheduling in a Wizard-of-Oz scenario where a malfunctioning system is simulated in order to evoke anger. The first situation mirrors most of the settings reported in the literature; the third is closest to the `real-life' task. It thus turns out that the closer we get to a realistic scenario, the less reliable prosody alone is as an indicator of the speaker's emotional state. As a consequence, the prosodic classifier was combined with other knowledge sources in the module Monitoring Of User State [especially of] Emotion (MoUSE).
Corrections In Spoken Dialogue Systems
- In Proceedings of the Sixth International Conference on Spoken Language Processing
, 2000
Cited by 27 (5 self)
This study analyzes user corrections of system errors in the TOOT spoken dialogue system. We find that corrections differ from non-corrections prosodically, in ways consistent with hyperarticulated speech, although many corrections are not hyperarticulated. Yet both are misrecognized more frequently than non-corrections, though they are no more likely to be rejected by the system. Corrections more distant from the error they correct tend to exhibit greater prosodic differences, and also to be recognized more poorly. System dialogue strategy affects users' choice of correction type, suggesting that strategy-specific methods of detecting or coaching users on corrections may be useful. Strategies that produce longer tasks but fewer misrecognitions and subsequent corrections are preferred by users. 1. INTRODUCTION Since spoken dialogue systems often make mistakes in recognizing user input, accurate methods of detecting and correcting system errors are essential to supporting successful interact...
Repetition and its phonetic realizations: Investigating a Swedish database of spontaneous computer-directed speech
- In Proceedings of ICPhS-99, San Francisco. International Congress of Phonetic Sciences
, 1999
Cited by 27 (5 self)
This paper is an investigation of repetitive utterances in a Swedish database of spontaneous computer-directed speech. A spoken dialogue system was installed in a public location in downtown Stockholm and spontaneous human-computer interactions with adults and children were recorded [1]. Several acoustic and prosodic features such as duration, shifting of focus and hyperarticulation were examined to see whether repetitions could be distinguished from what the users first said to the system. The present study indicates that adults and children use partly different strategies as they attempt to resolve errors by means of repetition. As repetition occurs, duration is increased and words are often hyperarticulated or contrastively focused. These results could have implications for the development of future spoken dialogue systems with robust error handling.
Real-time Handling of Fragmented Utterances
- In Proceedings of the NAACL Workshop on Adaptation in Dialogue Systems
, 2001
Cited by 25 (9 self)
In this paper, we discuss an adaptive method of handling fragmented user utterances to a speech-based multimodal dialogue system. Inserted silent pauses between fragments present the following problem: Does the current silence indicate that the user has completed her utterance, or is the silence just a pause between two fragments, so that the system should wait for more input? Our system incrementally classifies user utterances as either closing (more input is unlikely to come) or non-closing (more input is likely to come), partly depending on the current dialogue state. Utterances that are categorized as non-closing allow the dialogue system to await additional spoken or graphical input before responding.
Predicting Automatic Speech Recognition Performance Using Prosodic Cues
- In Proceedings of NAACL-00
, 2000
Cited by 21 (6 self)
In spoken dialogue systems, it is important for a system to know how likely a speech recognition hypothesis is to be correct, so it can reprompt for fresh input, or, in cases where many errors have occurred, change its interaction strategy or switch the caller to a human attendant. We have discovered prosodic features which more accurately predict when a recognition hypothesis contains a word error than the acoustic confidence score thresholds traditionally used in automatic speech recognition. We present analytic results indicating that there are significant prosodic differences between correctly and incorrectly recognized turns in the TOOT train information corpus. We then present machine learning results showing how the use of prosodic features to automatically predict correctly versus incorrectly recognized turns improves over the use of acoustic confidence scores alone.
Designing and evaluating conversational interfaces with animated characters
- In Embodied Conversational Agents
, 2000
Cited by 20 (6 self)
During the past decade, due largely to progress inspired by the DARPA Speech Grand Challenge project and similar international efforts (Martin et al. 1997; Cole et al. 1997), significant progress has occurred in the development of spoken language technology (SLT). Spoken language systems now are
Predicting user reactions to system error
- In Proc. of ACL
, 2001
Cited by 19 (6 self)
This paper focuses on the analysis and prediction of so-called aware sites, defined as turns where a user of a spoken dialogue system first becomes aware that the system has made a speech recognition error. We describe statistical comparisons of features of these aware sites in a train timetable spoken dialogue corpus, which reveal significant prosodic differences between such turns, compared with turns that ‘correct’ speech recognition errors as well as with ‘normal’ turns that are neither aware sites nor corrections. We then present machine learning results in which we show how prosodic features in combination with other automatically available features can predict whether or not a user turn was a normal turn, a correction, and/or an aware site.

