Informedia: News-on-Demand Multimedia Information Acquisition and Retrieval
- INTELLIGENT MULTIMEDIA INFORMATION RETRIEVAL
, 1997
Cited by 75 (6 self)
In theory, speech recognition technology can make any spoken words in video or audio media subject to text indexing, search and retrieval. This article describes the News-on-Demand application created within the Informedia Digital Video Library project and discusses how speech recognition is used for transcript creation from video, time alignment of closed-captioned transcripts, a speech query interface, and audio paragraph segmentation. Our results show that speech recognition accuracy varies dramatically depending on the quality and type of data used, but the system is quite useable with only moderate speech recognition accuracy.
An Architecture for a Generic Dialogue Shell
, 2000
Cited by 72 (21 self)
(Draft, 2/00; to appear in Natural Language Engineering, 2000.) ... semantic hierarchy and to a world KB manager that handles queries about the current situation, managing the interfaces to domain-dependent reasoners and knowledge bases as needed. One of the key things to note about this architecture is the separation of the basic dialogue system components from the more domain-specific components that provide the application (shown within the dotted lines at the lower left corner of Figure 1). To illustrate this separation, consider a specific example: a travel-agent application. The back-end would provide schedule and reservation information, booking, and so on, much as current computer systems provide to human travel agents. The behavioral agent and plan manager would be driven from a specification of desired behavior of the system as a travel agent, including the actions it typically will be asked to perform (e.g., what information is relevant to ...
Text, Speech and Vision for Video Segmentation: The Informedia Project
- AAAI Fall Symposium, Computational Models for Integrating Language and Vision
, 1995
Cited by 36 (1 self)
We describe three technologies involved in creating a digital video library suitable for full-content search and retrieval. Image processing analyzes scenes, speech processing transcribes the audio signal, and natural language processing determines word relevance. The integration of these technologies enables us to include vast amounts of video data in the library.
Getting More Mileage from Web Text Sources for Conversational Speech Language Modeling using Class-Dependent Mixtures
- Proc. HLT-NAACL 2003
, 2003
Cited by 36 (8 self)
Sources of training data suitable for language modeling of conversational speech are limited. In this paper, we show how training data can be supplemented with text from the web filtered to match the style and/or topic of the target recognition task, but also that it is possible to get bigger performance gains from the data by using class-dependent interpolation of N-grams.
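As a rough illustration of class-dependent interpolation of N-grams, the sketch below mixes two component bigram models with weights that depend on a class assigned to the preceding word. The abstract does not say how classes are defined or how weights are estimated, so the class map, the weights, the probabilities, and all function names here are hypothetical.

```python
def make_bigram(table, floor=1e-6):
    """Wrap a {(prev, word): prob} table as a bigram model with a small floor."""
    def prob(word, history):
        prev = history[-1] if history else "<s>"
        return table.get((prev, word), floor)
    return prob


def interpolate(word, history, components, weights_by_class, word_class):
    """Mix component probabilities using weights chosen by the history's class."""
    # Assumption: the class is determined by the previous word only.
    cls = word_class.get(history[-1], "other") if history else "other"
    weights = weights_by_class[cls]
    return sum(w * p(word, history) for w, p in zip(weights, components))


# Toy component models (made-up numbers): conversational data and web text.
conversational = make_bigram({("uh", "yeah"): 0.5, ("the", "meeting"): 0.1})
web_text = make_bigram({("uh", "yeah"): 0.05, ("the", "meeting"): 0.3})
components = (conversational, web_text)

# Hypothetical classes and per-class mixture weights (each pair sums to 1).
word_class = {"uh": "filler", "the": "function"}
weights_by_class = {
    "filler": (0.9, 0.1),    # trust conversational data after fillers
    "function": (0.4, 0.6),  # lean on web text after function words
    "other": (0.5, 0.5),
}

p_yeah = interpolate("yeah", ["uh"], components, weights_by_class, word_class)
p_meeting = interpolate("meeting", ["the"], components, weights_by_class, word_class)
```

With a single class this reduces to ordinary linear interpolation; the per-class weights let each source dominate in the contexts where it is assumed to be more reliable.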
Interactive Speech Translation in the DIPLOMAT Project
, 1997
Cited by 17 (3 self)
The DIPLOMAT rapid-deployment speech translation system is intended to allow naive users to communicate across a language barrier, without strong domain restrictions, despite the error-prone nature of current speech and translation technologies. Achieving this ambitious goal depends in large part on allowing the users to interactively correct recognition and translation errors.
Rapid language model development for new task domains
- Proc. First International Conference on Language Resources and Evaluation (LREC)
, 1998
Cited by 16 (6 self)
Data sparseness has been regularly indicted as the primary problem in statistical language modelling. We go one step further to consider the situation when no text data is available for the target domain. We present two techniques for building efficient language models quickly for new domains. The first technique is based on using a context-free grammar to generate a corpus of word collocations. The second is an adaptation technique based on using out-of-domain corpora to estimate target domain language models. We report results of successfully using these two techniques individually and in combination to build efficient models for a spontaneous speech recognition task in a medium-sized vocabulary domain.
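The first technique, generating training text from a context-free grammar, can be sketched as follows. This is only a minimal illustration under assumed details: the toy grammar, the uniform choice among expansions, and the bigram counting step are hypothetical and are not taken from the paper.

```python
import random
from collections import Counter

# Hypothetical toy grammar: keys are nonterminals, lowercase tokens are terminals.
GRAMMAR = {
    "S": [["REQUEST", "FLIGHT"]],
    "REQUEST": [["show", "me"], ["i", "want"]],
    "FLIGHT": [
        ["flights", "to", "CITY"],
        ["a", "flight", "from", "CITY", "to", "CITY"],
    ],
    "CITY": [["boston"], ["denver"], ["pittsburgh"]],
}


def generate(symbol, rng):
    """Expand a symbol into a list of terminal words, choosing rules uniformly."""
    if symbol not in GRAMMAR:
        return [symbol]
    expansion = rng.choice(GRAMMAR[symbol])
    return [word for part in expansion for word in generate(part, rng)]


def build_corpus(n_sentences, seed=0):
    """Sample sentences from the grammar; a fixed seed keeps runs repeatable."""
    rng = random.Random(seed)
    return [generate("S", rng) for _ in range(n_sentences)]


def bigram_counts(corpus):
    """Count adjacent word pairs, with sentence boundary markers."""
    counts = Counter()
    for sentence in corpus:
        padded = ["<s>"] + sentence + ["</s>"]
        counts.update(zip(padded, padded[1:]))
    return counts


corpus = build_corpus(200)
counts = bigram_counts(corpus)
```

The resulting counts could then be smoothed and normalized into an N-gram model in the usual way; that step is omitted here.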
Speech Recognition for a Digital Video Library
- JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE (JASIS)
, 1998
Cited by 13 (2 self)
The standard method for making the full content of audio and video material searchable is to annotate it with human-generated meta-data that describes the content in a way that the search can understand, as is done in the creation of multimedia CD-ROMs. However, for the huge amounts of data that could usefully be included in digital video and audio libraries, the cost of producing this meta-data is prohibitive. In the Informedia Digital Video Library, the production of the meta-data supporting the library interface is automated using techniques derived from artificial intelligence (AI) research. By applying speech recognition together with natural language processing, information retrieval and image analysis, an interface has been produced that helps users locate the information they want and navigate or browse the digital video library more effectively. Specific interface components include automatic titles, filmstrips, video skims, word location marking and representative frames f...
The Effects of Corpus Size and Homogeneity on Language Model Quality
, 1997
Cited by 8 (2 self)
Generic speech recognition systems typically use language models that are trained to cope with a broad variety of input. However, many recognition applications are more constrained, often to a specific topic or domain. In cases such as these, a knowledge of the particular topic can be used to advantage. This report describes the development of a number of techniques for augmenting domain-specific language models with data from a more general source.
Rapid Language Model Development Using External Resources for New Spoken Dialog Domains
- in Proc. ICASSP, 2005
Cited by 5 (0 self)
This paper addresses a critical problem in deploying a spoken dialog system (SDS). One of the main bottlenecks of SDS deployment for a new domain is data sparseness in building a statistical language model. Our goal is to devise a method to efficiently build a reliable language model for a new SDS. We consider the worst yet quite common scenario where only a small amount (∼1.7K utterances) of domain-specific data is available for the target domain. We present a new method that exploits external static text resources that are collected for other speech recognition tasks as well as dynamic text resources acquired from the World Wide Web (WWW). We show that language models built using external resources can jointly be used with a limited in-domain (baseline) language model to obtain significant improvements in speech recognition accuracy. Combining language models built using external resources with the in-domain language model provides over 20% reduction in WER over the baseline in-domain language model. Equivalently, we achieve almost the same level of performance by having ten times as much in-domain data (17K utterances).
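For readers unfamiliar with the metric, the sketch below shows how word error rate (WER) and a relative WER reduction of the kind quoted above are conventionally computed. The function names and the example sentences are hypothetical, and the numbers are illustrative only; they are not results from the paper.

```python
def word_error_rate(reference, hypothesis):
    """WER = word-level edit distance (sub + del + ins) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Row-by-row Levenshtein distance over words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, 1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,                          # deletion
                cur[j - 1] + 1,                       # insertion
                prev[j - 1] + (ref_word != hyp_word)  # substitution or match
            ))
        prev = cur
    return prev[-1] / len(ref)


def relative_reduction(baseline_wer, new_wer):
    """Fraction of the baseline error removed by the new system."""
    return (baseline_wer - new_wer) / baseline_wer


reference = "show me flights to boston"
baseline = word_error_rate(reference, "show me flight to austin")   # 2 errors / 5 words
improved = word_error_rate(reference, "show me flights to austin")  # 1 error / 5 words
gain = relative_reduction(baseline, improved)
```

A relative reduction is measured against the baseline error, so moving from 40% to 20% WER in this toy example is a 50% relative reduction, not 20 points.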
Automatic induction of language model data for a spoken dialogue system
- In Proceedings of SIGDIAL
, 2005
Cited by 5 (3 self)
When building a new spoken dialogue application, large amounts of domain-specific data are required. This paper addresses the issue of generating in-domain training data when little or no real user data are available. The two-stage approach taken begins with a data induction phase whereby linguistic constructs from out-of-domain sentences are harvested and integrated with artificially constructed in-domain phrases. After some syntactic and semantic filtering, a large corpus of synthetically assembled user utterances is induced. The second stage involves sampling the synthetic corpus towards the goal of obtaining data that would be representative of the statistics of application-specific real user interactions. The sampling methods proposed employ an example-based generation framework, a simulated user model and information extracted from development data. Evaluation is conducted on recognition performance in a restaurant information domain. We show that word error rate can be reduced when limited amounts of real user training data are augmented with synthetic data derived by our methods.

