Active Learning For Automatic Speech Recognition
, 2002
Cited by 30 (5 self)
State-of-the-art speech recognition systems are trained using transcribed utterances, preparation of which is labor intensive and time-consuming. In this paper, we describe a new method for reducing the transcription effort for training in automatic speech recognition (ASR). Active learning aims at reducing the number of training examples to be labeled by automatically processing the unlabeled examples, and then selecting the most informative ones with respect to a given cost function for a human to label. We automatically estimate a confidence score for each word of the utterance, exploiting the lattice output of a speech recognizer, which was trained on a small set of transcribed data. We compute utterance confidence scores based on these word confidence scores, then selectively sample the utterances to be transcribed using the utterance confidence scores. In our experiments, we show that we reduce the amount of labeled data needed for a given word accuracy by 27%.
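The selective sampling step described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not say how word confidences are combined into an utterance score, so the arithmetic mean used here is an assumption, as are the function names `utterance_confidence` and `select_for_transcription` and the toy confidence values.

```python
from statistics import mean


def utterance_confidence(word_confidences):
    """Combine per-word confidence scores into one utterance score.

    Assumption: the arithmetic mean is used as the aggregation; the
    abstract does not state the actual formula.
    """
    return mean(word_confidences)


def select_for_transcription(utterances, k):
    """Return the ids of the k utterances with the lowest confidence.

    `utterances` maps an utterance id to its list of word confidences
    (hypothetical data layout chosen for this sketch).
    """
    ranked = sorted(
        utterances.items(),
        key=lambda item: utterance_confidence(item[1]),
    )
    return [utt_id for utt_id, _ in ranked[:k]]


# Toy pool of untranscribed utterances with made-up word confidences.
pool = {
    "utt1": [0.9, 0.8, 0.95],
    "utt2": [0.4, 0.6, 0.5],
    "utt3": [0.7, 0.3, 0.8],
}
selected = select_for_transcription(pool, 2)
```

The least confident utterances are sent to a human transcriber first, on the assumption that the recognizer has the most to learn from them.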
Stochastic Language Adaptation over Time and State in Natural Spoken Dialogue Systems
, 2000
Cited by 20 (1 self)
We are interested in adaptive spoken dialogue systems for automated services. People's spoken language usage varies over time for a given task, and furthermore varies depending on the state of the dialogue. Thus, it is crucial to adapt ASR language models to these varying conditions. We characterize and quantify these variations based on a database of 30K user-transactions with AT&T's experimental How May I Help You? spoken dialogue system. We describe a novel adaptation algorithm for language models with time and dialogue-state varying parameters. Our language adaptation framework allows for recognizing and understanding unconstrained speech at each stage of the dialogue, enabling context-switching and error recovery. These models have been used to train state-dependent ASR language models. We have evaluated their performance with respect to word accuracy and perplexity over time and dialogue states. We have achieved a reduction of 40% in perplexity and of 8.4% in word error rate ov...
A Spoken Language System For Automated Call Routing
, 1997
Cited by 15 (3 self)
We are interested in the problem of understanding fluently spoken language. In particular, we consider people's responses to the open-ended prompt of 'How May I help you?'. We then further restrict the problem to classifying and automatically routing such a call, based on the meaning of the user's response. Thus, we aim at extracting a relatively small number of semantic actions from the utterances of a very large set of users who are not trained to the system's capabilities and limitations. In this paper, we describe the main components of our speech understanding system: the large vocabulary recognizer and the language understanding module performing the call-type classification. In particular, we propose automatic algorithms for selecting phrases from a training corpus in order to enhance the prediction power of the standard word n-gram. The phrase language models are integrated into stochastic finite state machines which outperform standard word n-gram language models. From the spee...
Active And Unsupervised Learning for Automatic Speech Recognition
, 2003
Cited by 15 (4 self)
State-of-the-art speech recognition systems are trained using human transcriptions of speech utterances. In this paper, we describe a method to combine active and unsupervised learning for automatic speech recognition (ASR). The goal is to minimize the human supervision for training acoustic and language models and to maximize the performance given the transcribed and untranscribed data. Active learning aims at reducing the number of training examples to be labeled by automatically processing the unlabeled examples, and then selecting the most informative ones with respect to a given cost function. For unsupervised learning, we utilize the remaining untranscribed data by using their ASR output and word confidence scores. Our experiments show that the amount of labeled data needed for a given word accuracy can be reduced by 75% by combining active and unsupervised learning.
Rational Interpolation Of Maximum Likelihood Predictors In Stochastic Language Modeling
, 1997
Cited by 14 (11 self)
In our paper, we address the problem of estimating stochastic language models based on n-gram statistics. We present a novel approach, rational interpolation, for the combination of a competing set of conditional n-gram word probability predictors, which consistently outperforms the traditional linear interpolation scheme. The superiority of rational interpolation is substantiated by experimental results from language modeling, speech recognition, dialog act classification, and language identification. 1. INTRODUCTION In our paper, we address the problem of estimating stochastic language models $P(w)$ for sentences $w = w_1 \ldots w_T$ of words $w_t$ from a finite vocabulary $V$. The joint distribution $P(w)$ can be decomposed by the well-known chain rule $P(w) = \prod_{t=1}^{T} P(w_t \mid w_1^{t-1}) = \prod_{t=1}^{T} P(w_t \mid w_1 \ldots w_{t-1})$ (1) into a product of conditional word probabilities (by $w_s^t$ we denote the substring $w_s \ldots w_t$ of $w$). The latter, in turn, are usually approximate...
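As a rough illustration of the chain-rule decomposition in equation (1), the sketch below scores a sentence by summing log conditional probabilities. The excerpt is cut off before it states which approximation the authors use, so conditioning on only the previous word (a bigram model) is an assumption made purely for illustration. The function name `sentence_log_prob`, the start symbol `<s>`, and the probability table are likewise hypothetical.

```python
import math


def sentence_log_prob(words, bigram_prob, start="<s>"):
    """Sum log P(w_t | w_{t-1}) over the sentence.

    Assumption: each conditional in the chain rule is truncated to a
    one-word history; `bigram_prob` maps (previous, current) to a
    probability.
    """
    history = start
    total = 0.0
    for word in words:
        total += math.log(bigram_prob[(history, word)])
        history = word
    return total


# Made-up probabilities for a three-word sentence.
bigram_prob = {
    ("<s>", "how"): 0.5,
    ("how", "may"): 0.4,
    ("may", "i"): 0.5,
}
log_p = sentence_log_prob(["how", "may", "i"], bigram_prob)
```

Working in log space avoids numerical underflow when the product in equation (1) runs over long sentences.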
Stochastic Language Models For Speech Recognition And Understanding
, 1998
Cited by 13 (0 self)
Stochastic language models for speech recognition have traditionally been designed and evaluated in order to optimize word accuracy. In this work, we present a novel framework for training stochastic language models by optimizing two different criteria appropriate for speech recognition and language understanding. First, the language entropy and salience measure are used for learning the relevant spoken language features (phrases). Secondly, a novel algorithm for training stochastic finite state machines is presented which incorporates the acquired phrase structure into a single stochastic language model. Thirdly, we show the benefit of our novel framework with an end-to-end evaluation of a large vocabulary spoken language system for call routing. 2. INTRODUCTION Traditionally, the design of stochastic language models for data-driven speech understanding systems is partitioned into two sub-problems. In other words, two language models are independently trained as optimize the speech ...
Integration Of Utterance Verification With Statistical Language Modeling And Spoken Language Understanding
, 1998
Cited by 8 (1 self)
Methods for utterance verification (UV) and their integration into statistical language modeling and spoken language understanding formalisms for a large vocabulary spoken understanding system are presented. The paper consists of three parts. First, a set of acoustic likelihood ratio based utterance verification techniques are described and applied to the problem of rejecting portions of a hypothesized word string that may have been incorrectly decoded by a large vocabulary continuous speech recognizer. Second, a procedure for integrating the acoustic level confidence measures with the statistical language model is described. Finally, the effect of integrating acoustic level confidence into the spoken language understanding unit (SLU) in a call-type classification task is discussed. These techniques were evaluated on utterances collected from a highly unconstrained call routing task performed over the telephone network. They have been evaluated in terms of their ability to classify u...
Speech and language processing for next-millennium communications services
- Proceedings of the IEEE
, 2000
Cited by 7 (0 self)
In the future, the world of telecommunications will be vastly different than it is today. The driving force will be the seamless integration of real-time communications (e.g., voice, video, music, etc.) and data into a single network, with ubiquitous access to that network anywhere, anytime, and by a wide range of devices. The only currently available ubiquitous access device to the network is the telephone, and the only ubiquitous user access technology mode is spoken voice commands and natural language dialogues with machines. In the future, new access devices and modes will augment speech in this role, but are unlikely to supplant the telephone and access by speech anytime soon. Speech technologies have progressed to the point where they are now viable for a broad range of communications services, including compression of speech for use over wired and wireless networks; speech synthesis, recognition, and understanding for dialogue access to information, people, and messaging; and speaker verification for secure access to information and services. This paper provides brief overviews of these technologies, discusses some of the unique properties of wireless, plain old telephone service, and Internet protocol networks that make voice communication and control problematic, and describes the types of voice services available in the past and today, and those that we foresee becoming available over the next several years. Keywords: Dialogue management, speaker recognition, speech coding, speech processing, speech recognition, speech synthesis, spoken language understanding.
Arc Minimization in Finite State Decoding Graphs with Cross-Word Acoustic Context
- In Proc. ICSLP’02
, 2002
Cited by 6 (2 self)
Recent approaches to large vocabulary decoding with finite state graphs have focused on the use of state minimization algorithms to produce relatively compact graphs. This paper extends the finite state approach by developing complementary arc-minimization techniques. The use of these techniques in concert with state minimization allows us to statically compile decoding graphs in which the acoustic models utilize a full word of cross-word context. This is in significant contrast to typical systems which use only a single phone. We show that the particular arc-minimization problem that arises is in fact an NP-complete combinatorial optimization problem, and describe the reduction from 3-SAT. We present experimental results that illustrate the moderate sizes and runtimes of graphs for the Switchboard task.
Beyond ASR 1-best: Using word confusion networks in spoken language understanding
, 2006
Cited by 6 (0 self)
We are interested in the problem of robust understanding from noisy spontaneous speech input. With the advances in automated speech recognition (ASR), there has been increasing interest in spoken language understanding (SLU). A challenge in large vocabulary spoken language understanding is robustness to ASR errors. State-of-the-art spoken language understanding relies on the best ASR hypotheses (ASR 1-best). In this paper, we propose methods for a tighter integration of ASR and SLU using word confusion networks (WCNs). WCNs obtained from ASR word graphs (lattices) provide a compact representation of multiple aligned ASR hypotheses along with word confidence scores, without compromising recognition accuracy. We present our work on exploiting WCNs instead of simply using ASR one-best hypotheses. In this work, we focus on the tasks of named entity detection and extraction and call classification in a spoken dialog system, although the idea is more general and applicable to other spoken language processing tasks. For named entity detection, we have improved the F-measure by 6–10% absolute by using both word lattices and WCNs. The processing of WCNs was 25 times faster than that of lattices, which is very important for real-life applications. For call classification, we have shown between 5% and 10% relative reduction in error rate using WCNs compared to ASR 1-best output.
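To make the idea of a word confusion network concrete, here is a small sketch of one possible in-memory representation. It is only an illustration: the list-of-dictionaries layout, the function names `one_best` and `candidates`, the threshold value, and the toy posteriors are all assumptions, and the abstract does not describe how the authors store or process WCNs.

```python
def one_best(wcn):
    """Pick the highest-scoring word in each slot of the network."""
    return [max(slot, key=slot.get) for slot in wcn]


def candidates(wcn, threshold):
    """Keep every word per slot whose confidence meets the threshold.

    This exposes the alternative hypotheses that a 1-best output
    would discard.
    """
    return [
        [word for word, score in slot.items() if score >= threshold]
        for slot in wcn
    ]


# Hypothetical WCN: one dict per aligned slot, word -> confidence score.
wcn = [
    {"call": 0.7, "all": 0.3},
    {"new": 0.6, "you": 0.4},
    {"york": 0.9, "work": 0.1},
]
best = one_best(wcn)
alternatives = candidates(wcn, 0.35)
```

A downstream understanding component could consult `alternatives` when the top word in a slot is uncertain, which is the kind of information a plain 1-best string does not carry.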

