Discourse segmentation of multi-party conversation
- in 41st Annual Meeting of ACL, 2003
Cited by 65 (1 self)
We present a domain-independent topic segmentation algorithm for multi-party speech. Our feature-based algorithm combines knowledge about content using a text-based algorithm as a feature and about form using linguistic and acoustic cues about topic shifts extracted from speech. This segmentation algorithm uses automatically induced decision rules to combine the different features. The embedded text-based algorithm builds on lexical cohesion and has performance comparable to state-of-the-art algorithms based on lexical information. A significant error reduction is obtained by combining the two knowledge sources.
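As a rough illustration of what a lexical-cohesion feature can look like, the sketch below scores each gap between utterances by the cosine similarity of the word counts on either side and proposes the lowest-scoring gap as a topic boundary. This is a generic windowed-similarity sketch, not the algorithm from the paper; the function names, the whitespace tokenizer, and the window size are all assumptions made for the example.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cohesion_scores(utterances, window=2):
    """Score the gap before utterance i (for i = 1..n-1).

    Each score compares the words in the `window` utterances before the gap
    with the words in the `window` utterances after it. Low scores suggest
    a vocabulary shift. The window size is an illustrative assumption.
    """
    bags = [Counter(u.lower().split()) for u in utterances]
    scores = []
    for gap in range(1, len(bags)):
        left = sum(bags[max(0, gap - window):gap], Counter())
        right = sum(bags[gap:gap + window], Counter())
        scores.append(cosine(left, right))
    return scores


def pick_boundary(utterances, window=2):
    """Return the index of the utterance that starts the proposed new topic."""
    scores = cohesion_scores(utterances, window)
    return scores.index(min(scores)) + 1


if __name__ == "__main__":
    talk = [
        "the budget is tight",
        "the budget needs cuts",
        "lunch is pizza today",
        "pizza for lunch again",
    ]
    print(pick_boundary(talk))
```

In a feature-based segmenter of the kind the abstract describes, a score like this would be one input among several, alongside linguistic and acoustic cues.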
Conversational Scene Analysis
- 2002
Cited by 39 (0 self)
In this thesis, we develop computational tools for analyzing conversations based on nonverbal auditory cues. We develop a notion of conversations as being made up of a variety of scenes: in each scene, either one speaker is holding the floor or both are speaking at equal levels. Our goal is to find conversations, find the scenes within them, determine what is happening inside the scenes, and then use the scene structure to characterize entire conversations. We begin by
Scanmail: a voicemail interface that makes speech browsable, readable and searchable
- in Proceedings of CHI2002 Conference on Human Computer Interaction, 2002
Cited by 38 (10 self)
Increasing amounts of public, corporate, and private speech data are now available on-line. These are limited in their usefulness, however, by the lack of tools to permit their browsing and search. The goal of our research is to provide tools to overcome the inherent difficulties of speech access, by supporting visual scanning, search, and information extraction. We describe a novel principle for the design of UIs to speech data: What You See Is Almost What You Hear (WYSIAWYH). In WYSIAWYH, automatic speech recognition (ASR) generates a transcript of the speech data. The transcript is then used as a visual analogue to that underlying data. A graphical user interface allows users to visually scan, read, annotate and search these transcripts. Users can also use the transcript to access and play specific regions of the underlying message. We first summarize previous studies of voicemail usage that motivated the WYSIAWYH principle, and describe a voicemail UI, SCANMail, that embodies WYSIAWYH. We report on a laboratory experiment and a two-month field trial evaluation. SCANMail outperformed a state-of-the-art voicemail system on core voicemail tasks. This was attributable to SCANMail’s support for visual scanning, search and information extraction. While the ASR transcripts contain errors, they nevertheless improve the efficiency of voicemail processing. Transcripts provide enough information for users either to extract key points or to navigate to important regions of the underlying speech, which they can then play directly.
SCAN: Designing and evaluating user interfaces to support retrieval from speech archives
- in Proceedings of the 22nd ACM-SIGIR International Conference on Research and Development in Information Retrieval, 1999
Cited by 36 (7 self)
Previous examinations of search in textual archives have assumed that users first retrieve a ranked set of documents relevant to their query, and then visually scan through these documents to identify the information they seek. While document scanning is possible in text, it is much more laborious in speech archives, due to the inherently serial nature of speech. Yet, in developing tools for speech access, little attention has so far been paid to users' problems in scanning and extracting information from within "speech documents". We demonstrate the extent of these problems in two user studies. We show that users experience severe problems with local navigation in extracting relevant information from within "speech documents". Based on these results, we propose a new user interface (UI) design paradigm: What You See Is (Almost) What You Hear (WYSIAWYH), a multimodal method for accessing speech archives. This paradigm presents a visual analogue to the underlying speech, enabling vi...
Sentence Boundary Detection in Broadcast Speech Transcripts
- in Proc. of ISCA Workshop: Automatic Speech Recognition: Challenges for the new Millennium ASR-2000, 2000
Cited by 34 (3 self)
This paper presents an approach to identifying sentence boundaries in broadcast speech transcripts. We describe finite state models that extract sentence boundary information statistically from text and audio sources. An n-gram language model is constructed from a collection of British English news broadcasts and scripts. An alternative model is estimated from pause duration information in speech recogniser outputs aligned with their programme script counterparts. Experimental results show that the pause duration model alone outperforms the language modelling approach, and that combining the two models improves performance further, attaining precision and recall scores of over 70% for the task. 1. INTRODUCTION Spoken audio data is a rich information source. Extensive research efforts during past decades have resulted in automatic speech transcription systems that can perform certain tasks (e.g., large vocabulary dictation from a cooperative speaker) with a high degree of a...
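To make the evaluation terms concrete, here is a minimal sketch of pause-based boundary detection scored by precision and recall. It substitutes a fixed pause threshold for the statistical finite state pause duration model the abstract describes; the threshold value, the function names, and the toy data are all assumptions for illustration.

```python
def detect_boundaries(pauses, threshold=0.5):
    """Return indices of words followed by a pause of at least `threshold` seconds.

    `pauses[i]` is the silence (in seconds) after word i. A fixed threshold is
    a simplified stand-in for a trained pause duration model.
    """
    return {i for i, pause in enumerate(pauses) if pause >= threshold}


def precision_recall(predicted, reference):
    """Precision and recall of predicted boundary positions against a reference set."""
    hits = len(predicted & reference)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(reference) if reference else 0.0
    return precision, recall


if __name__ == "__main__":
    pauses = [0.1, 0.7, 0.05, 0.6, 0.2, 0.9]  # toy values, not from the paper
    reference = {1, 4, 5}                     # hypothetical true boundaries
    predicted = detect_boundaries(pauses)
    print(precision_recall(predicted, reference))
```

Precision is the fraction of proposed boundaries that are correct, and recall is the fraction of true boundaries that were found, so a lower threshold typically trades precision for recall.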
Punctuation annotation using statistical prosody models
- in Proc. ISCA Workshop on Prosody in Speech Recognition and Understanding, 2001
Cited by 26 (2 self)
This paper is about the development of statistical models of prosodic features to generate linguistic meta-data for spoken language. In particular, we are concerned with automatically punctuating the output of a broadcast news speech recogniser. We present a statistical finite state model that combines prosodic, linguistic and punctuation class features. Experimental results are presented using the Hub–4 Broadcast News corpus, and in the light of our results we discuss the issue of a suitable method of evaluating the present task.
Meeting Structure Annotation: Data and Tools
- in Proceedings of the SIGdial Workshop on Discourse and Dialogue, 2005
Cited by 17 (8 self)
We present a set of annotations of hierarchical topic segmentations and action item subdialogues collected over 65 meetings from the ICSI and ISL meeting corpora, designed to support automatic meeting understanding and analysis. We describe an architecture for representing, annotating, and analyzing multi-party discourse, including: an ontology of multimodal discourse, a programming interface for that ontology, and an audiovisual toolkit which facilitates browsing and annotating discourse, as well as visualizing and adjusting features for machine learning tasks.
Prosody modeling for automatic speech recognition and understanding
- in Proc. Workshop on Mathematical Foundations of Natural Language Modeling, 2002
Cited by 17 (2 self)
This paper summarizes statistical modeling approaches for the use of prosody (the rhythm and melody of speech) in automatic recognition and understanding of speech. We outline effective prosodic feature extraction, model architectures, and techniques to combine prosodic with lexical (word-based) information. We then survey a number of applications of the framework, and give results for automatic sentence segmentation and disfluency detection, topic segmentation, dialog act labeling, and word recognition. Key words: prosody, speech recognition and understanding, hidden Markov models.
Can Prosody Aid the Automatic Processing of Multi-Party Meetings? Evidence from Predicting . . .
- in Proc. ISCA Tutorial and Research Workshop on Prosody in Speech Recognition and Understanding, 2001
Cited by 17 (2 self)
We investigate whether probabilistic modeling of prosody can aid various automatic labeling tasks essential for processing of multi-party meetings. Task 1, automatic punctuation, seeks to classify sentence boundaries and disfluencies. Task 2, jump-in points, predicts locations within foreground speech at which background speakers start talking; Task 3, jump-in words, examines characteristics of the speech they use to do so. Data are from the ICSI Meeting Recorder corpus. To infer inherent cues, analyses are based on close-talking microphone signals and recognizer forced alignments. As a generous baseline for word-level cues, we compare prosodic models to those of a language model given the true words. Results for Task 1 show prosody reduces classification error by 10% relative over the cheating language model; furthermore, when this task is run in "online" mode, the prosodic model degrades less than does the language model. For Task 2, the language model provides no information, while the prosodic model reduces entropy by 13% over chance. For Task 3, a prosodic model reduces entropy by 25% over chance. Analyses also show interesting prosodic patterns, which differ over tasks. Task 1 uses cues similar to those for Switchboard (but not Broadcast News) data. Task 2 predicts jump-in points that look prosodically like sentence boundaries but that are not actually such boundaries. And Task 3 shows that speakers "raise" their voice when starting during another's talk, compared to starting during silence. These results provide evidence that prosodic modeling can be of use for the automatic processing of meetings. Further results and implications for future automatic meeting processing systems are discussed.
Finding information in audio: A new paradigm for audio browsing/retrieval
- in Proceedings of the ESCA workshop: Accessing information in spoken audio, 1999
Cited by 14 (1 self)
Information retrieval from audio data is sharply different from information retrieval from text, not simply because speech recognition errors affect retrieval effectiveness, but more fundamentally because of the linear nature of speech, and of the differences in human capabilities for processing speech versus text. We describe SCAN, a prototype speech retrieval and browsing system that addresses these challenges of speech retrieval in an integrated way. On the retrieval side, we use novel document expansion techniques to improve retrieval from automatic transcription to a level competitive with retrieval from human transcription. Given these retrieval results, our graphical user interface, based on the novel WYSIAWYH (“What you see is almost what you hear”) paradigm, infers text formatting such as paragraph boundaries and highlighted words from acoustic information and information retrieval term scores to help users navigate the errorful automatic transcription. This interface supports information extraction and relevance ranking demonstrably better than simple speech-alone interfaces, according to results of empirical studies.

