Preliminaries to a Theory of Speech Disfluencies, 1994
Cited by 97 (7 self)
This thesis examines disfluencies (e.g., "um", repeated words, and a variety of forms of self-repair) in the spontaneous speech of adult normal speakers of American English. Despite their prevalence, disfluencies have traditionally been viewed as irregular events and have received little attention. The goal of the thesis is to provide evidence that, on the contrary, disfluencies show remarkably regular trends in a number of dimensions. These regularities have consequences for models of human language production; they can also be exploited to improve performance in speech applications. The method includes analysis of over 5000 hand-annotated disfluencies from a database (250,000 words) containing three different styles of spontaneous speech: task-oriented human-computer dialog, task-oriented human-human dialog, and human-human conversation on a prescribed topic. The approach is theory-neutral and strongly data-driven. The annotations correspond to observable characteristics ("features") ...
Statistical language modeling for speech disfluencies - in Proc. ICASSP, 1996
Cited by 52 (11 self)
Speech disfluencies (such as filled pauses, repetitions, restarts) are among the characteristics distinguishing spontaneous speech from planned or read speech. We introduce a language model that predicts disfluencies probabilistically and uses an edited, fluent context to predict following words. The model is based on a generalization of the standard N-gram language model. It uses dynamic programming to compute the probability of a word sequence, taking into account possible hidden disfluency events. We analyze the model’s performance for various disfluency types on the Switchboard corpus. We find that the model reduces word perplexity in the neighborhood of disfluency events; however, overall differences are small and have no significant impact on recognition accuracy. We also note that for modeling of the most frequent type of disfluency, filled pauses, a segmentation of utterances into linguistic (rather than acoustic) units is required. Our analysis illustrates a generally useful technique for language model evaluation based on local perplexity comparisons.
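The local perplexity comparison mentioned in this abstract can be illustrated with a minimal sketch. It assumes that each word already has a natural-log probability from some language model and that "neighborhood" means a fixed window of word positions around each disfluency event; the function names and the `window` parameter are hypothetical and are not taken from the paper.

```python
import math
from typing import Iterable, Sequence, Set


def perplexity(log_probs: Iterable[float]) -> float:
    """Perplexity from natural-log word probabilities: exp(-mean log p)."""
    values = list(log_probs)
    if not values:
        raise ValueError("need at least one word probability")
    return math.exp(-sum(values) / len(values))


def local_perplexity(log_probs: Sequence[float],
                     event_positions: Set[int],
                     window: int = 1) -> float:
    """Perplexity over only the words within `window` positions of an event.

    `event_positions` holds word indices where a disfluency occurs.
    The window definition is an assumption made for illustration.
    """
    selected = [
        lp for i, lp in enumerate(log_probs)
        if any(abs(i - e) <= window for e in event_positions)
    ]
    return perplexity(selected)
```

Running both functions on the same word sequence for two models lets the overall and the near-disfluency perplexities be compared side by side, which is the kind of contrast the abstract reports.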
Characterizing and Recognizing Spoken Corrections in Human-Computer Dialogue - In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics, COLING/ACL 98, 1998
Cited by 51 (5 self)
Miscommunication in speech recognition systems is unavoidable, but a detailed characterization of user corrections will enable speech systems to identify when a correction is taking place and to more accurately recognize the content of correction utterances. In this paper we investigate the adaptations of users when they encounter recognition errors in interactions with a voice-in/voice-out spoken language system. In analyzing more than 300 pairs of original and repeat correction utterances, matched on speaker and lexical content, we found overall increases in both utterance and pause duration from original to correction. Interestingly, corrections of misrecognition errors (CME) exhibited significantly heightened pitch variability, while corrections of rejection errors (CRE) showed only a small but significant decrease in pitch minimum. CME's demonstrated much greater increases in measures of duration and pitch variability than CRE's. These contrasts allow the development of decision t...
Predicting Spoken Disfluencies During Human-Computer Interaction, 1995
Cited by 47 (6 self)
This research characterizes the spontaneous spoken disfluencies typical of human-computer interaction, and presents a predictive model accounting for their occurrence. Data were collected during three empirical studies in which people spoke or wrote to a highly interactive simulated system as they completed service transactions. The studies involved within-subject factorial designs in which the input modality and presentation format were varied. Spoken disfluency rates during human-computer interaction were documented to be substantially lower than rates typically observed during comparable human-human speech. Two separate factors, both associated with increased planning demands, were statistically related to higher disfluency rates: (1) length of utterance, and (2) lack of structure in the presentation format. Regression techniques demonstrated that a linear model based simply on utterance length accounted for over 77% of the variability in spoken disfluencies. Therefore, design methods ca...
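As a rough illustration of the kind of linear model described in this abstract, the sketch below fits a least-squares line predicting a disfluency measure from utterance length and reports the proportion of variance explained (R squared). Reading "accounted for over 77% of the variability" as R squared is an assumption, and the function names are hypothetical; no data from the study is reproduced here.

```python
from typing import Sequence, Tuple


def fit_line(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def r_squared(x: Sequence[float], y: Sequence[float],
              slope: float, intercept: float) -> float:
    """Proportion of the variance in y explained by the fitted line."""
    mean_y = sum(y) / len(y)
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    if ss_tot == 0:
        raise ValueError("y values must not all be identical")
    ss_res = sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))
    return 1.0 - ss_res / ss_tot
```

Here `x` would hold utterance lengths and `y` the corresponding disfluency measure; an R squared above 0.77 would match the level of fit the abstract reports.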
A prosody-only decision-tree model for disfluency detection - Proc. EUROSPEECH, 1997
Cited by 45 (14 self)
Speech disfluencies (filled pauses, repetitions, repairs, and false starts) are pervasive in spontaneous speech. The ability to detect and correct disfluencies automatically is important for effective natural language understanding, as well as to improve speech models in general. Previous approaches to disfluency detection have relied heavily on lexical information, which makes them less applicable when word recognition is unreliable. We have developed a disfluency detection method using decision tree classifiers that use only local and automatically extracted prosodic features. Because the model doesn’t rely on lexical information, it is widely applicable even when word recognition is unreliable. The model performed significantly better than chance at detecting four disfluency types. It also outperformed a language model in the detection of false starts, given the correct transcription. Combining the prosody model with a specialized language model improved accuracy over either model alone for the detection of false starts. Results suggest that a prosody-only model can aid the automatic detection of disfluencies in spontaneous speech.
Edit Detection and Parsing for Transcribed Speech - In Proc. NAACL, 2001
Cited by 42 (5 self)
We present a simple architecture for parsing transcribed speech in which an edited-word detector first removes such words from the sentence string, and then a standard statistical parser trained on transcribed speech parses the remaining words. The edit detector achieves a misclassification rate on edited words of 2.2%. (The NULL-model, which marks everything as not edited, has an error rate of 5.9%.) To evaluate our parsing results we introduce a new evaluation metric, the purpose of which is to make evaluation of a parse tree relatively indifferent to the exact tree position of EDITED nodes. By this metric the parser achieves 85.3% precision and 86.5% recall.
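The comparison between the edit detector and the NULL-model can be made concrete with a small sketch. It assumes per-word boolean labels (True meaning "edited"); the function names are hypothetical and the code is not the authors' evaluation script. Because the NULL-model marks every word as not edited, its error rate equals the fraction of words that are actually edited.

```python
from typing import List, Sequence


def error_rate(gold: Sequence[bool], predicted: Sequence[bool]) -> float:
    """Fraction of words whose predicted edited/not-edited label is wrong."""
    if len(gold) != len(predicted) or len(gold) == 0:
        raise ValueError("label sequences must be non-empty and equal length")
    wrong = sum(1 for g, p in zip(gold, predicted) if g != p)
    return wrong / len(gold)


def null_model(n_words: int) -> List[bool]:
    """Baseline that labels every word as not edited."""
    return [False] * n_words
```

For example, with gold labels in which one word out of ten is edited, `error_rate(gold, null_model(10))` is 0.1, and any useful detector has to score below that baseline.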
Phonetic Consequences Of Speech Disfluency, 1999
Cited by 29 (4 self)
Unlike read or laboratory speech, spontaneous speech contains high rates of disfluencies (e.g., repetitions, repairs, filled pauses). Such events reflect production problems frequently encountered in everyday conversation. Analyses of American English show that disfluency affects a variety of phonetic aspects of speech, including segment durations, intonation, voice quality, vowel quality, and coarticulation patterns. These effects provide clues about production processes, and can guide methods for disfluency processing in speech recognition applications.
Intonational Boundaries, Speech Repairs and Discourse Markers: Modeling Spoken Dialog, 1997
Cited by 29 (5 self)
To understand a speaker's turn of a conversation, one needs to segment it into intonational phrases, clean up any speech repairs that might have occurred, and identify discourse markers. In this paper, we argue that these problems must be resolved together, and that they must be resolved early in the processing stream. We put forward a statistical language model that resolves these problems, does POS tagging, and can be used as the language model of a speech recognizer. We find that accounting for the interactions between these tasks improves the performance on each task, as well as POS tagging and perplexity.
Corrections In Spoken Dialogue Systems - In Proceedings of the Sixth International Conference on Spoken Language Processing, 2000
Cited by 27 (5 self)
This study analyzes user corrections of system errors in the TOOT spoken dialogue system. We find that corrections differ from non-corrections prosodically, in ways consistent with hyperarticulated speech, although many corrections are not hyperarticulated. Yet both are misrecognized more frequently than non-corrections, though they are no more likely to be rejected by the system. Corrections more distant from the error they correct tend to exhibit greater prosodic differences, and also to be recognized more poorly. System dialogue strategy affects users' choice of correction type, suggesting that strategy-specific methods of detecting or coaching users on corrections may be useful. Strategies that produce longer tasks but fewer misrecognitions and subsequent corrections are preferred by users. 1. INTRODUCTION Since spoken dialogue systems often make mistakes in recognizing user input, accurate methods of detecting and correcting system errors are essential to supporting successful interact...
Modeling the prosody of hidden events for improved word recognition - in Proc. EUROSPEECH, 1999
Cited by 27 (4 self)
We investigate a new approach for using speech prosody as a knowledge source for speech recognition. The idea is to penalize word hypotheses that are inconsistent with prosodic features such as duration and pitch. To model the interaction between words and prosody we modify the language model to represent hidden events such as sentence boundaries and various forms of disfluency, and combine with it decision trees that predict such events from prosodic features. N-best rescoring experiments on the Switchboard corpus show a small but consistent reduction of word error as a result of this modeling. We conclude with a preliminary analysis of the types of errors that are corrected by the prosodically informed model.
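N-best rescoring in general can be sketched as follows. The abstract does not say how the knowledge sources are combined, so the weighted sum of log scores, the field names, and the weights below are assumptions made for illustration only, not the authors' formulation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Hypothesis:
    words: str
    acoustic_logp: float  # log score from the acoustic model
    lm_logp: float        # log score from the language model
    prosody_logp: float   # log score of the hypothesis given prosodic features


def combined_score(h: Hypothesis,
                   lm_weight: float = 1.0,
                   prosody_weight: float = 1.0) -> float:
    """Weighted sum of log scores; the weights are hypothetical tuning knobs."""
    return (h.acoustic_logp
            + lm_weight * h.lm_logp
            + prosody_weight * h.prosody_logp)


def rescore(nbest: List[Hypothesis],
            lm_weight: float = 1.0,
            prosody_weight: float = 1.0) -> Hypothesis:
    """Return the hypothesis with the highest combined score."""
    if not nbest:
        raise ValueError("N-best list is empty")
    return max(nbest,
               key=lambda h: combined_score(h, lm_weight, prosody_weight))
```

Setting `prosody_weight` to 0 recovers a ranking without prosody, so the same list can be rescored both ways to see which hypotheses the prosodic score promotes or penalizes.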

