SpeechBuilder: Facilitating Spoken Dialogue System Development
, 2001
Cited by 40 (6 self)
SpeechBuilder is a suite of tools that helps facilitate the creation of mixed-initiative spoken dialogue systems for both novice and experienced developers of human language applications. SpeechBuilder employs intuitive methods of specification to allow developers to create human language interfaces to structured information stored in a relational database, or to control- and transaction-based applications. The goal of this project has been both to robustly accommodate the various scenarios where spoken dialogue systems may be needed, and to provide a stable and reliable infrastructure for design and deployment of applications. SpeechBuilder has been used in various spoken language domains, including a directory of the people working at the MIT Laboratory for Computer Science, an application to control the various physical items in a typical office environment, and a system for real-time weather information access.
Pronunciation Modeling Using a Finite-State Transducer Representation
- in Proc. ISCA Tutorial and Research Workshop on Pronunciation Modeling and Lexicon Adaptation
, 2002
Cited by 23 (5 self)
The MIT SUMMIT speech recognition system models pronunciation using a phonemic baseform dictionary along with rewrite rules for modeling phonological variation and multi-word reductions. Each pronunciation component is encoded within a finite-state transducer (FST) representation whose transition weights can be probabilistically trained using a modified EM algorithm for finite-state networks. This paper explains the modeling approach we use and the details of its realization. We demonstrate the benefits and weaknesses of the approach both conceptually and empirically using the recognizer for our JUPITER weather information system. Our experiments demonstrate that the use of phonological rewrite rules within our system reduces word error rates by between 4% and 8% over different test sets when compared against a system using no phonological rewrite rules.
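As a rough illustration of what a phonological rewrite rule does to a phonemic baseform, the sketch below expands a baseform into its surface variants under one optional rule. It is not the SUMMIT implementation: the phone labels, the vowel set, the single flapping rule, and the function names are all assumptions made for this example, and the trained FST transition weights described in the abstract are left out entirely.

```python
from itertools import product

# Hypothetical vowel inventory for this toy example only.
VOWELS = {"aa", "ae", "ah", "ao", "er", "ih", "iy", "ow", "uw"}


def flap_rule(phones, i):
    """Optional flapping: /t/ between two vowels may surface as [dx]."""
    is_intervocalic = (
        0 < i < len(phones) - 1
        and phones[i - 1] in VOWELS
        and phones[i + 1] in VOWELS
    )
    if phones[i] == "t" and is_intervocalic:
        return ["t", "dx"]  # keep the canonical phone and add the variant
    return [phones[i]]      # rule does not apply; phone is unchanged


def expand(baseform):
    """Return every surface pronunciation the rule allows for a baseform."""
    phones = baseform.split()
    options = [flap_rule(phones, i) for i in range(len(phones))]
    return [" ".join(variant) for variant in product(*options)]


variants = expand("b ah t er")
```

Here `expand("b ah t er")` yields both the canonical form and the flapped form, while a word-initial /t/ as in `expand("t ah p")` is left alone. In a weighted FST, each alternative would additionally carry a probability.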
Word and phone level acoustic confidence scoring
- IN PROC. ICASSP
, 2000
Cited by 20 (4 self)
This paper presents a word level confidence scoring technique based on a combination of multiple features extracted from the output of a phonetic classifier. The goal of this research was to develop a robust confidence measure based strictly on acoustic information. This research focused on methods for augmenting standard log likelihood ratio techniques with additional information to improve the robustness of the acoustic confidence scores for word recognition tasks. The most successful approach utilized a Fisher linear discriminant projection to reduce a set of acoustic features, extracted from phone level classification results, to a single dimension confidence score. The experiments in this paper were implemented within the JUPITER weather information system. The paper presents results indicating that the technique achieved significant improvements over standard log likelihood ratio techniques for confidence scoring.
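A two-class Fisher linear discriminant of the kind the abstract mentions can be sketched as follows: the projection direction is w = S_w^{-1}(m_correct - m_incorrect), and the confidence score of a word is the dot product of w with its feature vector. The two-dimensional feature vectors and their values below are invented for illustration and are not taken from the paper, which uses its own set of acoustic features.

```python
def mean(rows):
    """Component-wise mean of a list of 2-D feature vectors."""
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(2)]


def scatter(rows, mu):
    """2x2 scatter matrix of the rows around their class mean."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for r in rows:
        d = [r[0] - mu[0], r[1] - mu[1]]
        for a in range(2):
            for b in range(2):
                s[a][b] += d[a] * d[b]
    return s


def fisher_direction(correct, incorrect):
    """Fisher direction w = Sw^-1 (m_correct - m_incorrect) for 2-D features."""
    m1, m0 = mean(correct), mean(incorrect)
    s1, s0 = scatter(correct, m1), scatter(incorrect, m0)
    sw = [[s1[a][b] + s0[a][b] for b in range(2)] for a in range(2)]
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
    inv = [[sw[1][1] / det, -sw[0][1] / det],
           [-sw[1][0] / det, sw[0][0] / det]]
    diff = [m1[0] - m0[0], m1[1] - m0[1]]
    return [inv[0][0] * diff[0] + inv[0][1] * diff[1],
            inv[1][0] * diff[0] + inv[1][1] * diff[1]]


def score(w, x):
    """Project a feature vector onto w to get a one-dimensional score."""
    return w[0] * x[0] + w[1] * x[1]


# Made-up training features for correctly and incorrectly recognized words.
correct = [(2.0, 1.0), (3.0, 1.5), (2.5, 0.5)]
incorrect = [(0.0, 0.0), (1.0, 0.5), (0.5, -0.5)]
w = fisher_direction(correct, incorrect)
```

On this toy data every correctly recognized word scores higher than every misrecognized one, so a single threshold on the projected score separates the two classes. Real features overlap, and the threshold is then chosen to trade false acceptances against false rejections.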
A comparison and combination of methods for OOV word detection and word confidence scoring
- In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP-01). Salt Lake City. IEEE
Cited by 18 (2 self)
This paper examines an approach for combining two different methods for detecting errors in the output of a speech recognizer. The first method attempts to alleviate recognition errors by using an explicit model for detecting the presence of out-of-vocabulary (OOV) words. The second method identifies potentially misrecognized words from a set of confidence features extracted from the recognition process using a confidence scoring model. Since these two methods are inherently different, an approach which combines the techniques can provide significant advantages over either of the individual methods. In experiments in the JUPITER weather domain, we compare and contrast the two approaches and demonstrate the advantage of the combined approach. In comparison to either of the two individual approaches, the combined approach achieves over 25% fewer false acceptances of incorrectly recognized
HealthLine: Speech-based Access to Health Information by Low-literate Users
Cited by 18 (3 self)
Health information access by low-literate community health workers is a pressing need of community health programs across the developing world. We present results from a needs assessment we conducted to understand the health information access practices and needs of various types of health workers in Pakistan. We also present a prototype for speech-based health information access, as well as discuss our experiences from a pilot study involving its use by community health workers in a rural health center. Index Terms: speech recognition, dialog systems, developing regions, health information, community health, Pakistan
Integrating Recognition Confidence Scoring With Language Understanding And Dialogue Modeling
- IN PROC. ICSLP
, 2000
Cited by 13 (2 self)
In this paper we present a method for integrating confidence scores into the understanding and dialogue components of a speech understanding system. The understanding component of our system receives an N-best list of recognition hypotheses augmented with word-level confidence scores. The confidence scores are used by the understanding component to hypothesize when words in a recognizer's N-best list have been misrecognized. The understanding component has the ability to predict the semantic class of misrecognized words based on the surrounding context and also to suggest when key words which may have been misunderstood should be re-confirmed by the user. The output of the understanding component is passed on to a dialogue control component which can act on various suggestions made by the understanding component. To evaluate the system, experiments were conducted using the JUPITER weather information system. Evaluation was performed at the understanding level using key-value pair conc...
A Framework For Developing Conversational User Interfaces
, 2004
Cited by 12 (1 self)
In this work we report our efforts to facilitate the creation of mixed-initiative conversational interfaces for novice and experienced developers of human language technology. Our focus has been on a framework that allows developers to easily specify the basic concepts of their applications, and rapidly prototype conversational interfaces for a variety of configurations. In this paper we describe the current knowledge representation, the compilation processes for speech understanding, generation, and dialogue turn management, as well as the user interfaces created for novice users and more experienced developers. Finally, we report our experiences with several user groups in which developers used this framework to prototype a variety of conversational interfaces.
Intelligent Barge-In in Conversational Systems
, 2000
Cited by 11 (0 self)
In this paper we present novel solutions to problems related to barge-in in telephony-based conversational systems. In particular we address recovery from falsely detected barge-in events and a method for signaling to the user that barge-in is disallowed at a particular dialogue state. The mechanisms and signals used to manage turn taking are similar to those in human-human conversation, which makes them easy to understand for users without explanation or prior training.
1. INTRODUCTION
In telephony-based spoken language systems, it is desirable to let users interrupt system output at any time, in particular if the output is based on erroneous understanding or contains superfluous information. Thus, enabling barge-in, i.e., the ability for the user to start speaking before system output has ended, can significantly enhance the user experience. However, users' new freedom also poses new challenges. One challenge is sorting out true user barge-in from background noise and nonspeech soun...
FST-based recognition techniques for multi-lingual and multi-domain spontaneous speech
- Proceedings of the European Conference on Speech Communication and Technology
, 2001
Cited by 8 (1 self)
In this paper we present techniques for building multi-domain and multi-lingual recognizers within a finite-state transducer (FST) framework. The flexibility of the FST approach is also demonstrated on the task of incorporating networks modeling different types of non-speech events into an existing word lattice network. The ability to create robust multi-domain and/or multi-lingual recognizers for spontaneous speech will enable a conversational system to switch seamlessly and automatically among different domains and/or languages. Preliminary results using a bi-domain recognizer exhibit only small recognition accuracy degradation in comparison to domain-dependent recognition. Similarly promising results were observed using a bilingual recognizer which performs simultaneous language identification and recognition. When using the FST techniques to add non-speech models to the recognizer, experiments show a 10% reduction in word error rate across all utterances and a 30% reduction on utterances containing non-speech events.

