Transcriber: Development and Use of a Tool for Assisting Speech Corpora Production
, 2000
Cited by 73 (5 self)
We present Transcriber, a tool for assisting in the creation of speech corpora, and describe some aspects of its development and use. Transcriber was designed for the manual segmentation and transcription of long duration broadcast news recordings, including annotation of speech turns, topics and acoustic conditions. It is highly portable, relying on the scripting language Tcl/Tk with extensions such as Snack for advanced audio functions and tcLex for lexical analysis, and has been tested on various Unix systems and Windows. The data format follows the XML standard with Unicode support for multilingual transcriptions. Distributed as free software in order to encourage the production of corpora, ease their sharing, increase user feedback and motivate software contributions, Transcriber has been in use for over a year in several countries. As a result of this collective experience, new requirements arose to support additional data formats, video control, and a better management of conv...
A Kullback-Leibler Divergence Based Kernel for SVM Classification in Multimedia Applications
- In Advances in Neural Information Processing Systems 16
, 2004
Cited by 58 (2 self)
... In this paper we suggest an alternative procedure to the Fisher kernel for systematically finding kernel functions that naturally handle variable-length sequence data in multimedia domains. In particular, for domains such as speech and images, we explore the use of kernel functions that take full advantage of well-known probabilistic models such as Gaussian Mixtures and single full covariance Gaussian models. We derive a kernel distance based on the Kullback-Leibler (KL) divergence between generative models. In effect, our approach combines the best of both generative and discriminative methods and replaces the standard SVM kernels. We perform experiments on speaker identification/verification and image classification tasks and show that these new kernels have the best performance in speaker verification and mostly outperform the Fisher-kernel-based SVMs and the generative classifiers in speaker identification and image classification.
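As a rough illustration of the idea in this abstract, the sketch below builds a kernel from the KL divergence between two generative models. It is not taken from the paper: it assumes each sequence has already been summarized by a single diagonal-covariance Gaussian (a simplification of the Gaussian mixture and full covariance models the abstract mentions), it symmetrizes the divergence by summing both directions, and it exponentiates the result with a hypothetical scale parameter `a`. All function names are made up for the example.

```python
import math


def kl_diag_gauss(mu_p, var_p, mu_q, var_q):
    """KL(p || q) for two diagonal-covariance Gaussians given as lists."""
    total = 0.0
    for mp, vp, mq, vq in zip(mu_p, var_p, mu_q, var_q):
        # Per-dimension closed form of the Gaussian KL divergence.
        total += math.log(vq / vp) + (vp + (mp - mq) ** 2) / vq - 1.0
    return 0.5 * total


def symmetric_kl(mu_p, var_p, mu_q, var_q):
    """Sum of the two directed divergences (assumed symmetrization)."""
    return (kl_diag_gauss(mu_p, var_p, mu_q, var_q)
            + kl_diag_gauss(mu_q, var_q, mu_p, var_p))


def kl_kernel(mu_p, var_p, mu_q, var_q, a=1.0):
    """Similarity in (0, 1]; `a` is a hypothetical scale parameter."""
    return math.exp(-a * symmetric_kl(mu_p, var_p, mu_q, var_q))
```

For two unit-variance Gaussians whose means differ by 1 in one dimension, `kl_diag_gauss` gives 0.5 in each direction, so the symmetric divergence is 1.0 and the kernel value is `exp(-a)`. Identical models give a kernel value of 1.0.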
An Experimental Study Of An Audio Indexing System For The Web
- in Proc. ICSLP
, 1996
Cited by 16 (2 self)
We have developed a speech recognition based audio search engine for indexing spoken documents found on the World Wide Web. Our site (http://www.compaq.com/speechbot) indexes around 20 news and talk radio shows covering a wide range of topics, speaking styles and acoustic conditions from a selection of public Web sites with multimedia archives. In this paper, we describe our system and its performance, focusing on the speech recognition and retrieval aspects. We describe our training procedure in some detail and report our historical error rate since the site launch. We also investigate the impact of Out Of Vocabulary (OOV) words. Finally we report the results of retrieval experiments which demonstrate that our system can index effectively.
The Effect of Speech Recognition Accuracy Rates on the Usefulness and Usability of Webcast Archives
, 2006
Cited by 15 (7 self)
The widespread availability of broadband connections has led to an increase in the use of Internet broadcasting (webcasting). Most webcasts are archived and accessed numerous times retrospectively. In the absence of transcripts of what was said, users have difficulty searching and scanning for specific topics. This research investigates user needs for transcription accuracy in webcast archives, and measures how the quality of transcripts affects user performance in a question-answering task, and how quality affects overall user experience. We tested 48 subjects in a within-subjects design under 4 conditions: perfect transcripts, transcripts with 25% Word Error Rate (WER), transcripts with 45% WER, and no transcript. Our data reveals that speech recognition accuracy linearly influences both user performance and experience, shows that transcripts with 45% WER are unsatisfactory, and suggests that transcripts having a WER of 25% or less would be useful and usable in webcast archives.
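For readers unfamiliar with the WER figures quoted in this abstract, the sketch below shows the usual way the metric is computed: the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and a recognizer hypothesis, divided by the number of reference words. The abstract does not say how the authors scored their transcripts, so the whitespace tokenization and the function name here are assumptions made for illustration.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # match or substitution
        prev = cur
    return prev[-1] / len(ref)
```

With the reference "the cat sat on the mat" and the hypothesis "the cat sit on mat", there is one substitution and one deletion against six reference words, so the function returns 2/6, or about 33% WER. Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.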
Perceptually Inspired Signal-processing Strategies for Robust Speech Recognition in Reverberant Environments
, 1998
Cited by 12 (0 self)
Natural, hands-free interaction with computers is currently one of the great unfulfilled promises of automatic speech recognition (ASR), in part because ASR systems cannot reliably recognize speech under everyday, reverberant conditions that pose no problems for most human listeners. The specific properties of the auditory representation of speech likely contribute to reliable human speech recognition under such conditions. This dissertation explores the use of perceptually inspired signal-processing strategies -- critical-band-like frequency analysis, an emphasis on slow changes in the spectral structure of the speech signal, adaptation, integration of phonetic information over syllabic durations, and use of multiple signal representations for...
The development of SRI’s 1997 Broadcast News transcription system
- In Proceedings DARPA Broadcast News Transcription and Understanding Workshop
Cited by 9 (5 self)
This paper describes SRI’s 1997 broadcast news transcription system used for the 1997 DARPA H4 evaluations. Our system had several novel components. These include automatic segmentation of entire broadcast shows, word-internal and cross-word acoustic models robustly estimated with a new Gaussian Merging-Splitting (GMS) algorithm, the use of trigram language models (LMs) in lattices instead of for rescoring N-best lists, and an LM pruning algorithm that allows efficient representation of high-order (like 4- or 5-gram) LMs. We briefly describe these features and give comparative experimental results. We achieved an 18.7% relative improvement in performance on our 1996 H4 partitioned evaluation (PE) development test set as compared to our 1996 H4 PE evaluation system.
Improved Modeling and Efficiency for Automatic Transcription of Broadcast News
, 2000
Cited by 5 (0 self)
Over the last few years, the DARPA-sponsored Hub4 continuous speech recognition evaluations have pushed speech recognition technology for the very interesting and difficult task of automatically transcribing broadcast news. In this paper, we report on our research and progress on this problem. We focus on individual techniques we developed, rather than on descriptions of our evaluation systems. We provide comparative experimental results showing the improvements obtained with the novel approaches we developed.
1 Introduction
In recent years there has been increasing interest in developing large-vocabulary continuous speech recognition (LVCSR) systems for speech found in real sources. Broadcast news, in particular, has been the testbed for the DARPA-sponsored Hub4 continuous speech recognition (CSR) evaluations over the last few years, and represents a significant challenge to speech recognition researchers. Many interesting problems are associated with the automatic recognition of b...
Acoustic Confidence Measures For Segmenting Broadcast News
- In Proceedings of the International Conference on Spoken Language Processing
Cited by 5 (5 self)
In this paper we define an acoustic confidence measure based on the estimates of local posterior probabilities produced by an HMM/ANN large vocabulary continuous speech recognition system. We use this measure to segment continuous audio into regions where it is and is not appropriate to expend recognition effort. The segmentation is computationally inexpensive and provides reductions in both overall word error rate and decoding time. The technique is evaluated using material from the Broadcast News corpus.
1. INTRODUCTION
Most speech recognition tasks to date have required the recognition of discrete utterances over which both the speaker and channel characteristics remain constant. It is given that the data supplied to the recogniser is speech and so speech detection amounts to little more than trimming off leading and trailing silences. However, practical speech recognition systems cannot expect to be supplied with such pre-segmented data. Faced with an unsegmented stream of audio, f...
Acoustic Modeling for the SRI Hub4 Partitioned Evaluation Continuous Speech Recognition System
- In Proceedings of the DARPA Speech Recognition Workshop
, 1997
Cited by 5 (4 self)
We describe the development of the SRI system evaluated in the 1996 DARPA continuous speech recognition (CSR) Hub4 partitioned evaluation (PE). The task for the Hub4 evaluation was to recognize speech from broadcast television and radio shows. Recognizing such speech by machines poses many challenges. First, the segments to be recognized could be very long. This introduces a problem in training and recognition because of the consequent increased system memory requirement. A simple segmentation technique is used to break long segments into shorter, more manageable lengths. The speech from broadcast news sources exhibits a variety of difficult acoustic conditions, such as spontaneous speech, band-limited speech, and speech in the presence of noise, music, or background speakers. Such background conditions lead to significant degradation in performance. We describe techniques, based on acoustic adaptation, that adapt recognition models to the different acoustic background conditions, so as to im...
N-best: The Northern- and Southern-Dutch Benchmark Evaluation of Speech Recognition Technology
- in Interspeech
, 2007
Cited by 4 (2 self)
In this paper, we describe N-best 2008, the first Large Vocabulary Speech Recognition (LVCSR) benchmark evaluation held for the Dutch language. Both the accent as spoken in the Netherlands (Northern-Dutch) and in Belgium (Southern-Dutch or Flemish) will be evaluated. The evaluation tasks are broadcast news (BN) and conversational telephone speech (CTS). The N-best evaluation will take place in the spring of 2008 and is open to all research institutes and industries on a voluntary basis. The goal of this first N-best evaluation is to define, set up and conduct a Dutch LVCSR benchmark evaluation. In this paper, we will describe the state of the art of Dutch LVCSR, recognition problems that are typical for the Dutch language, and the evaluation protocol. Index Terms: Northern- and Southern-Dutch, large vocabulary speech recognition, benchmark test, evaluation, conversational telephone speech, broadcast news.

