Towards multi-domain speech understanding with flexible and dynamic vocabulary (2001)

by G Chung
Venue: Ph.D. thesis, MIT

Modeling Out-Of-Vocabulary Words For Robust Speech Recognition

by Issam Bazzi, James Glass, Arthur C. Smith, 2000
Cited by 43 (5 self)
This thesis concerns the problem of unknown or out-of-vocabulary (OOV) words in continuous speech recognition. Most of today's state-of-the-art speech recognition systems can recognize only words that belong to some predefined finite word vocabulary. When encountering an OOV word, a speech recognizer erroneously substitutes the OOV word with a similar-sounding word from its vocabulary. Furthermore, a recognition error due to an OOV word tends to spread errors into neighboring words, dramatically degrading overall recognition performance.

Jupiter: A Telephone-Based Conversational Interface for Weather Information

by Victor Zue, Stephanie Seneff, James Glass, Joseph Polifroni, Christine Pao, Timothy J. Hazen, Lee Hetherington - IEEE Trans. on Speech and Audio Processing, 2000
Cited by 32 (3 self)
In early 1997, our group initiated a project to develop Jupiter, a conversational interface that allows users to obtain worldwide weather forecast information over the telephone using spoken dialogue. It has served as the primary research platform for our group on many issues related to human language technology, including telephone-based speech recognition, robust language understanding, language generation, dialogue modelling, and multilingual interfaces. Over a two-year period since coming on line in May 1997, Jupiter has received, via a toll-free number in North America, over 30,000 calls (totalling over 180,000 utterances), mostly from naive users. The purpose of this paper is to describe our development effort in terms of the underlying human language technologies as well as other system-related issues such as utterance rejection and content harvesting. We will also present some evaluation results on the system and its components.

Speech Technology on Trial: Experiences from the August System

by Joakim Gustafson, Linda Bell - Natural Language Engineering, 2000
Cited by 17 (8 self)
In this paper, the August spoken dialogue system is described. This experimental Swedish dialogue system, which featured an animated talking agent, was exposed to the general public during a trial period of six months. The construction of the system was partly motivated by the need to collect genuine speech data from people with little or no previous experience of spoken dialogue systems. A corpus of more than 10,000 utterances of spontaneous computer-directed speech was collected and empirical linguistic analyses were carried out. Acoustical, lexical and syntactical aspects of this data were examined. In particular, user behavior and user adaptation during error resolution were emphasized. Repetitive sequences in the database were analyzed in detail. Results suggest that computer-directed speech during error resolution is increased in duration, hyperarticulated and contains inserted pauses. Design decisions which may have influenced how the users behaved when they interacted with August are discussed and implications for the development of future systems are outlined.

VoiceXML-Based Dynamic Plug and Play Dialogue Management for Mobile Environments

by Botond Pakucs - in Proceedings of ISCA T&R Workshop on Multi-Modal Dialogue in Mobile Environments, Kloster Irsee, 2002
Cited by 5 (2 self)
This paper argues for the necessity of plug and play functionality in speech interfaces in mobile environments. Further, a VoiceXML-based plug and play dialogue management solution is introduced. The paper focuses in particular on the plug and play functionality and the dynamic handling of the plug and playable dialogue management capabilities. The plug and play solution is applied in the SesaME architecture and employed within the framework of the PER demonstrator. Finally, first experiences related to the plug and play functionality of the dialogue management are discussed.

A Three-Stage Solution For Flexible Vocabulary Speech Understanding

by Grace Chung - Proc. of ICSLP
This paper discusses our three-stage approach to a flexible vocabulary speech understanding system, which can detect out-of-vocabulary (OOV) words, and hypothesize their phonetic and orthographic transcriptions. In the first stage, we introduce the column-bigram finite-state transducer (FST) which, while embedding ANGIE sublexical models, also supports previously unseen data from unknown words. Secondly, the ANGIE models utilize grapheme information, providing tighter linguistic constraint as well as instantaneous sound-to-letter capability during recognition. Thirdly, the syllable-level lexical units of the first stage are automatically derived via an iterative procedure to optimize performance. The second-stage recognizer employs ANGIE to output a word network which is parsed by TINA, our natural language (NL) processor, in stage three. Experiments with a JUPITER implementation of this system are described in [1].

Automatically Incorporating Unknown Words In Jupiter

by Grace Chung - Proceedings of the International Conference on Spoken Language Processing
This paper concerns the handling of out-of-vocabulary (OOV) words in the JUPITER weather information system. Specifically, our objective is to deal with weather queries regarding unknown cities. We have implemented a system which can detect the presence of an unknown city name, and immediately propose a plausible spelling for that city. Potentially, the city can be dynamically incorporated into the recognizer lexicon. The three-stage system described in [1] was implemented in the JUPITER domain, and this paper will detail the development of a system that uses an ANGIE-based framework to model both spelling and pronunciation simultaneously, and uses automatically derived novel lexical units in the first stage. We report results on an independent test set containing unknown cities. Compared with a single-stage baseline, word error was reduced by 29.3% (from 24.6% to 17.4%) and understanding error was reduced by 67.5% (from 67.0% to 21.8%) on the three-stage configuration.
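
The percentages quoted in this abstract are relative reductions, not differences in percentage points. A minimal sketch of that arithmetic follows, using only the error rates stated above; the function name `relative_reduction` is a hypothetical helper introduced here for illustration and does not come from the paper.

```python
def relative_reduction(baseline: float, improved: float) -> float:
    """Relative reduction, in percent, going from a baseline error rate
    to an improved error rate (hypothetical helper, not from the paper)."""
    return 100.0 * (baseline - improved) / baseline


# Error rates (in percent) as quoted in the abstract.
word = relative_reduction(24.6, 17.4)           # word error
understanding = relative_reduction(67.0, 21.8)  # understanding error

print(f"word error reduction: {word:.1f}%")
print(f"understanding error reduction: {understanding:.1f}%")
```

Rounded to one decimal place, these reproduce the quoted 29.3% and 67.5%.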

Towards a Unified Framework for Sub-lexical and Supra-lexical Linguistic Modeling

by Xiaolong Mou, 2002
Conversational interfaces have received much attention as a promising natural communication channel between humans and computers. A typical conversational interface consists of three major systems: speech understanding, dialog management and spoken language generation. In such a conversational interface, speech recognition as the front-end of speech understanding remains one of the fundamental challenges for establishing robust and effective human/computer communications. On the one hand, the speech recognition component in a conversational interface lives in a rich system environment. Diverse sources of knowledge are available and can potentially be beneficial to its robustness and accuracy. For example, the natural language understanding component can provide linguistic knowledge in syntax and semantics that helps constrain the recognition search space. On the other hand, the speech recognition component also faces the challenge of spontaneous speech, and it is important to address the casualness of speech using the knowledge sources available. For example, sub-lexical linguistic information would be very useful in providing linguistic support for previously unseen words, and dynamic reliability modeling may help improve recognition robustness for poorly articulated speech.

From Word-Spotting to OOV Modeling

by Paul Fitzpatrick, 2001
This paper explores one dimension along which word spotting and speech recognition differ: the nature of the background model. In word spotting, a relatively small number of keywords float on a sea of unknown words. In speech recognition, an occasional unknown word punctuates utterances that are otherwise completely in-vocabulary. Despite this difference in viewpoint, in some circumstances implementations of the two may become very similar. When transcribed data is available for a domain, word spotting benefits from the more detailed background model this can support [9]. The manner in which the background is modeled in these cases is reminiscent of speech recognition. For example, a large vocabulary with good coverage may be extracted from the corpus, so that relatively few words in an utterance remain unmodeled. In this case, the situation is qualitatively similar to OOV modeling in a conventional speech recognizer, except that the vocabulary is strictly divided into "filler" and "keyword"

Towards Dynamic Multi-Domain Dialogue Processing

by Botond Pakucs
This paper introduces SesaME, a generic dialogue management framework, especially designed for supporting dynamic multi-domain dialogue processing. SesaME supports a multitude of highly distributed applications and facilitates simultaneous adaptation to individual users and their environment. The dynamic multi-domain dialogue processing is supported through the use of standardised and highly distributed domain descriptions. For fast, runtime handling of these domain descriptions, a specially developed, dynamic plug and play solution is employed. In this paper, a description of how SesaME's functionality is evaluated within the framework of the PER demonstrator is also presented.

Clustering Wide-Contexts and HMM Topologies for Spontaneous Speech Recognition

by Izhak Shafran, 2001
In most speech recognition systems today, all the acoustic variation associated with a phoneme is characterized in terms of the identity of its neighboring phonemes. The neighbors influence only the state observation density of a fixed Hidden Markov Model. Other sources of variation are captured implicitly by using Gaussian mixture models for the state observations. Consequently, these models can be very broad, particularly for casual spontaneous speech. In this thesis, we explore conditioning of phonemes on higher level linguistic structure, specifically syllable- and word-level structure to learn models for phonemes that are more specific to the context, reporting experimental results on a large vocabulary (35k words) conversational speech task (Switchboard). In particular, this thesis makes three main contributions related to wide context conditioning. First, we demonstrate that syllable- and word-level structure can be incorporated into current acoustic models to improve recognition accuracy over triphones. For a fixed number of parameters, these models are computationally more efficient than pentaphones, both in training and in testing. In addition, use of syllable and word features leads to a small but significant improvement in performance. The wide-contexts used in our acoustic model can implicitly capture re-syllabification effects to a certain extent. However, we find that explicitly modeling re-syllabification does not improve recognition further, because there are only a small number of phones that exhibit acoustic difference after re-syllabification. The second contribution addresses the difficulties that arise when a large number of additional conditioning features are used. As the number of conditioning features increases, the training cost can increase exponentially. 
Moreover, a large fraction of the training labels tends to have too few examples to have reliable statistics associated with them, and this could potentially cause decision trees to learn bad clusters. A new method has been developed for clustering with multiple stages, where each stage clusters a different subset of features, and also has a choice of using the partitions learned in the previous stages. Apart from reducing the risk of unreliable statistics, it is designed to ameliorate the data fragmentation problem and is computationally less expensive. This method was successfully demonstrated with pentaphones, resulting in equivalent performance at a lower cost. Finally, a new algorithm is described to design context-specific HMMs. The idea is to model reduction of a phone for certain contexts, and to learn a more constrained topology. Using contextual information, the algorithm clusters HMM paths where each path has a different number of states. An HMM distance measure has been formulated to prune out the paths which are similar. During decoding, the paths are allocated dynamically for each sub-word unit according to their context. We investigated this algorithm to model phone topologies, finding improved characterization of speech given known word sequences but no significant improvement in word error rate.