An Annotation Scheme for Speech Reconstruction on a Dialog Corpus
In Fourth International Workshop on Human-Computer Conversation. Bellagio, Italy: [http://www.companions-project.org/events, 2008
"... Abstract. This1 paper presents the ongoing manual speech reconstruction annotation of the NAP corpus, which is a corpus of recorded conversations between pairs of people above family photographs, relating it to a more complex annotation scheme of the Prague Dependency Treebank family. The result of ..."
Abstract
-
Cited by 1 (1 self)
Abstract. This paper presents the ongoing manual speech reconstruction annotation of the NAP corpus, a corpus of recorded conversations between pairs of people over family photographs, and relates it to the more complex annotation scheme of the Prague Dependency Treebank family. The result of this effort will be a resource that contains, on top of the audio recording of each dialog and its usual transcription, an edited and fully grammatical “reconstructed” dialog. The reconstructed dialog is aligned with the original audio and transcription on one side and linked to a deep analysis of the natural language sentences uttered in the dialog on the other, so that the resource can serve as training and testing material for machine learning experiments in both intelligent editing and dialog language understanding. The resource will be used in the Companions project, but it will also be publicly available outside the project.
Structural Metadata Annotation of Speech Corpora: Comparing Broadcast News and Broadcast Conversations
"... Structural metadata extraction (MDE) research aims to develop techniques for automatic conversion of raw speech recognition output to forms that are more useful to humans and to downstream automatic processes. It may be achieved by inserting boundaries of syntactic/semantic units to the flow of spee ..."
Abstract
- Add to MetaCart
Structural metadata extraction (MDE) research aims to develop techniques for automatically converting raw speech recognition output into forms that are more useful to humans and to downstream automatic processes. This may be achieved by inserting boundaries of syntactic/semantic units into the flow of speech, labeling non-content words such as filled pauses and discourse markers for optional removal, and identifying sections of disfluent speech. This paper compares two Czech MDE speech corpora, one in the domain of broadcast news and the other in the domain of broadcast conversations. A variety of statistics about fillers, edit disfluencies, and syntactic/semantic units are presented. Among other results, we report statistics indicating that the distribution of parts of speech (POS) in disfluent portions of speech differs from the overall POS distribution. The two Czech corpora are compared not only with each other, but also with available statistics for English MDE corpora of broadcast news and telephone conversations.
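The three MDE steps named in this abstract (unit boundaries, filler labeling, disfluency marking) can be illustrated with a minimal Python sketch. It is not the authors' method or the LDC annotation format: the `Token` class, its field names, and the `clean_transcript` function are hypothetical, and the example assumes the labels have already been assigned by an annotator or a model.

```python
from dataclasses import dataclass


@dataclass
class Token:
    """One transcribed word with hypothetical MDE-style labels."""
    word: str
    filler: bool = False   # filled pause or discourse marker
    edit: bool = False     # word in the discarded part of an edit disfluency
    su_end: bool = False   # last token of a sentence-like unit


def clean_transcript(tokens):
    """Drop fillers and edited-out words, then split at unit boundaries."""
    units, current = [], []
    for tok in tokens:
        if not (tok.filler or tok.edit):
            current.append(tok.word)
        if tok.su_end:
            # Close the unit even if its last token was itself removed.
            if current:
                units.append(" ".join(current))
            current = []
    if current:
        units.append(" ".join(current))
    return units


# Invented example utterance: "uh we we went home you know it was late"
tokens = [
    Token("uh", filler=True),
    Token("we", edit=True),
    Token("we"),
    Token("went"),
    Token("home", su_end=True),
    Token("you", filler=True),
    Token("know", filler=True),
    Token("it"),
    Token("was"),
    Token("late", su_end=True),
]
units = clean_transcript(tokens)
print(units)
```

Keeping the labels on the tokens rather than deleting words up front mirrors the "optional removal" wording of the abstract: the raw transcript stays intact and the cleaned form is derived from it.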
The Czech Broadcast Conversation Corpus
"... Abstract. This paper presents the final version of the Czech Broadcast Conversation Corpus released at the Linguistic Data Consortium (LDC). The corpus contains 72 recordings of a radio discussion program, which yield about 33 hours of transcribed conversational speech from 128 speakers. The release ..."
Abstract
- Add to MetaCart
(Show Context)
Abstract. This paper presents the final version of the Czech Broadcast Conversation Corpus released at the Linguistic Data Consortium (LDC). The corpus contains 72 recordings of a radio discussion program, which yield about 33 hours of transcribed conversational speech from 128 speakers. The release includes not only verbatim transcripts and speaker information, but also structural metadata (MDE) annotation that involves labeling of sentence-like unit boundaries, marking of non-content words such as filled pauses and discourse markers, and annotation of speech disfluencies. The annotation is based on the LDC’s MDE annotation standard for English, with changes applied to accommodate phenomena that are specific to Czech. In addition to its importance to speech recognition, speaker diarization, and structural metadata extraction research, the corpus is also useful for linguistic analysis of conversational Czech.
The Czech Broadcast Conversation Corpus
"... Abstract. This paper presents the final version of the Czech Broadcast Conver-sation Corpus that will shortly be released at the Linguistic Data Consortium (LDC). The corpus contains 72 recordings of a radio discussion program, which yields about 33 hours of transcribed conversational speech from 12 ..."
Abstract
- Add to MetaCart
(Show Context)
Abstract. This paper presents the final version of the Czech Broadcast Conversation Corpus that will shortly be released at the Linguistic Data Consortium (LDC). The corpus contains 72 recordings of a radio discussion program, which yield about 33 hours of transcribed conversational speech from 128 speakers. The release includes not only verbatim transcripts and speaker information, but also structural metadata (MDE) annotation that involves labeling of sentence-like unit boundaries, marking of non-content words such as filled pauses and discourse markers, and annotation of speech disfluencies. The MDE annotation is based on the LDC’s annotation standard for English, with changes applied to accommodate phenomena that are specific to Czech. In addition to its importance to speech recognition, speaker diarization, and structural metadata extraction research, the corpus is also useful for linguistic analysis of conversational Czech.