Context-Dependent Pre-trained Deep Neural Networks for Large Vocabulary Speech Recognition
- IEEE Transactions on Audio, Speech, and Language Processing, 2012
Cited by 9 (1 self)
Abstract: We propose a novel context-dependent (CD) model for large vocabulary speech recognition (LVSR) that leverages recent advances in using deep belief networks for phone recognition. We describe a pre-trained deep neural network hidden Markov model (DNN-HMM) hybrid architecture that trains the DNN to produce a distribution over senones (tied triphone states) as its output. The deep belief network pre-training algorithm is a robust and often helpful way to initialize deep neural networks generatively, which can aid optimization and reduce generalization error. We illustrate the key components of our model, describe the procedure for applying CD-DNN-HMMs to LVSR, and analyze the effects of various modeling choices on performance. Experiments on a challenging business search dataset demonstrate that CD-DNN-HMMs can significantly outperform the conventional context-dependent Gaussian mixture model (GMM)-HMMs, with an absolute sentence accuracy improvement of 5.8% and 9.2% (or relative error reduction of 16.0% and 23.2%) over the CD-GMM-HMMs trained using the minimum phone error rate (MPE) and maximum likelihood (ML) criteria, respectively. Index Terms: Speech recognition, deep belief network, context-dependent phone, LVSR, DNN-HMM, ANN-HMM
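The absolute and relative figures quoted in this abstract are linked by simple arithmetic: a relative error reduction equals the absolute accuracy gain divided by the baseline error rate. The sketch below inverts that relation to estimate the baseline sentence error each pair of numbers implies. The function name `baseline_error` is illustrative, and the derived baselines are estimates computed from the rounded figures in the abstract, not values quoted from the paper.

```python
def baseline_error(absolute_gain, relative_reduction):
    """Baseline error rate (in %) implied by an absolute accuracy gain
    and a relative error reduction, both given in percent."""
    return absolute_gain / (relative_reduction / 100.0)


# (criterion, absolute sentence accuracy gain, relative error reduction)
# taken from the abstract; everything derived from them is an estimate.
for name, gain, rel in [("MPE", 5.8, 16.0), ("ML", 9.2, 23.2)]:
    err = baseline_error(gain, rel)
    print(
        f"{name}: implied baseline sentence error ~{err:.1f}%, "
        f"baseline accuracy ~{100 - err:.1f}%, "
        f"CD-DNN-HMM accuracy ~{100 - err + gain:.1f}%"
    )
```

Both baselines lead to roughly the same estimated CD-DNN-HMM accuracy, which is what one would expect if the two comparisons refer to the same DNN system.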
Language modeling for voice search: a machine translation approach
- in Proc. ICASSP'08, 2008
Cited by 7 (4 self)
This paper presents a novel approach to language modeling for voice search based on the idea and method of statistical machine translation. We propose an n-gram based translation model that can be used for listing-to-query translation. We then leverage the query forms translated from listings to improve language modeling. The translation model is trained in an unsupervised manner using a set of transcribed voice search queries. Experiments show that the translation approach yielded drastic perplexity reductions compared with a baseline language model where no translation is applied. Index Terms — language modeling, machine translation, voice search, directory assistance
People Watcher: A Game for Eliciting Human-Transcribed Data for Automated Directory Assistance
Cited by 7 (1 self)
Automated Directory Assistance (ADA) allows users to request telephone or address information of residential and business listings using speech recognition. Because callers often express listings differently than how they are registered in the directory, ADA systems require transcriptions of alternative phrasings for directory listings as training data, which can be costly to acquire. As such, a framework in which data can be contributed voluntarily by large numbers of Internet users has tremendous value. In this paper, we introduce People Watcher, a computer game that elicits transcribed, alternative user phrasings for directory listings while at the same time entertaining players. Data generated from the game not only overlapped actual audio transcriptions, but resulted in a statistically significant 15% relative reduction in semantic error rate when utilized for ADA. Furthermore, semantic accuracy was not statistically different than using the actual audio transcriptions. Index Terms: game, automated directory assistance
A Voice Search Approach to Replying to SMS Messages in Automobiles
Cited by 3 (3 self)
Automotive infotainment systems now provide drivers the ability to hear incoming Short Message Service (SMS) text messages using text-to-speech. However, the question of how best to allow users to respond to these messages using speech recognition remains unsettled. In this paper, we propose a robust voice search approach to replying to SMS messages based on template matching. The templates are empirically derived from a large SMS corpus and matches are accurately retrieved using a vector space model. In evaluating SMS replies within the acoustically challenging environment of automobiles, the voice search approach consistently outperformed using just the recognition results of a statistical language model or a probabilistic context-free grammar. For SMS replies covered by our templates, the approach achieved as high as 89.7% task completion when evaluating the top five reply candidates. Index Terms: SMS, information retrieval, voice UI, voice search
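As a rough illustration of retrieving reply templates with a vector space model, the sketch below ranks a handful of templates by TF-IDF cosine similarity to a (possibly misrecognized) hypothesis and returns the top candidates. The templates, the smoothed IDF weighting, and all function names are assumptions made for this example; the abstract does not specify the paper's actual weighting scheme or templates.

```python
import math
from collections import Counter

# Hypothetical reply templates; the paper derives its own from an SMS corpus.
TEMPLATES = [
    "i am running late",
    "call you later",
    "see you at home",
    "i am on my way",
]


def tokenize(text):
    return text.lower().split()


def build_idf(docs):
    """Smoothed inverse document frequency for every term in the templates."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(tokenize(doc)))
    return {term: math.log((1 + n) / (1 + count)) + 1.0 for term, count in df.items()}


def vectorize(text, idf):
    """Sparse TF-IDF vector; terms unseen in the templates are dropped."""
    tf = Counter(t for t in tokenize(text) if t in idf)
    return {t: c * idf[t] for t, c in tf.items()}


def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_matches(hypothesis, templates, k=5):
    """Return up to k (score, template) pairs, best match first."""
    idf = build_idf(templates)
    query = vectorize(hypothesis, idf)
    scored = [(cosine(query, vectorize(t, idf)), t) for t in templates]
    return sorted(scored, reverse=True)[:k]


# A recognition hypothesis with one misrecognized word still finds its template.
for score, template in top_matches("i am runing late", TEMPLATES):
    print(f"{score:.3f}  {template}")
```

Returning a short ranked list rather than a single match mirrors the abstract's evaluation over the top five reply candidates.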
Accommodating Explicit User Expressions of Uncertainty in Voice Search or Something Like That
Cited by 3 (1 self)
Voice search applications encourage users to “just say what you want” in order to obtain useful mobile content such as automated directory assistance (ADA). Unfortunately, when users only remember part of what they are looking for, they are forced to guess, even though what they know may be sufficient to retrieve the desired information. In this paper, we propose expanding the capabilities of voice search to allow users to explicitly express their uncertainties as part of their queries, and as such, to provide partial knowledge. Applied to ADA, we highlight the enhanced user experience uncertain expressions afford and delineate how we performed language modeling and information retrieval. We evaluate our approach by assessing its impact on overall ADA performance and by discussing the results of an experiment in which users generated both uncertain expressions as well as guesses for directory listings. Uncertain expressions reduced relative error rate by 31.8% compared to guessing. Index Terms: voice search, user uncertainty, something
SEMANTIC CONFIDENCE CALIBRATION FOR SPOKEN DIALOG APPLICATIONS
Cited by 1 (0 self)
The success of spoken dialog applications depends strongly on the quality of the semantic confidence measure that determines the selection of the dialog strategy. However, the semantic confidence measure obtained from typical automatic speech recognition engines is not optimized for specific semantic slots and applications. We present our recent work on using a novel maximum entropy model with distribution constraints to calibrate the semantic confidence scores with the inputs of only the raw semantic confidence and the associated raw word confidence scores. We illustrate how features can be constructed from the raw confidence scores with a variable number of words and how the quality of the semantic confidence measure can be further improved by adding another calibration stage for the word confidence measure. We demonstrate the effectiveness of our approach for two types of semantic slots of practical significance. For the ZIP-code semantic slot, the new measure achieves relative reductions of 10.6% in mean square error (MSE), 19.3% in normalized negative log-likelihood (NNLL), and 38.5% in equal error rate (EER). The corresponding reductions for the date-time semantic slot are 37.8%, 38.7%, and 23.1%, respectively. Index Terms: Score calibration, confidence measure, maximum entropy, distribution constraint, semantic confidence
to Voice Search
[A look at the technology, the technological challenges, and the solutions] Voice search is the technology underlying many spoken dialog systems
Inductive and Example-Based Learning for Text Classification
Text classification has been widely applied to many practical tasks. Inductive models trained from labeled data are the most commonly used technique. The basic assumption underlying an inductive model is that the training data are drawn from the same distribution as the test data. However, labeling such a training set is often expensive for practical applications. On the other hand, a large amount of labeled data, which have been drawn from a different distribution, is often available in the same application domain. It is thus very desirable to take advantage of these data even though there is a discrepancy between their underlying distribution and that of the test set. This paper compares three text classification algorithms applied in this scenario, including two inductive Maximum Entropy (MaxEnt) models, one flatly initialized and the other initialized with a term-frequency/inverse document frequency (Tf*Idf) weighted vector space model, and an example-based learning algorithm, which assigns a class label to a text by learning from the labels assigned to the training data that are similar to the text. Experimental results show that example-based learning has achieved more than 5% improvement in precision across almost all coverage levels. Index Terms: text classification, inductive models, maximum entropy model, Tf*Idf vector space model, example-based learning.
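The idea of labeling a text from the labels of similar training examples can be sketched as a similarity-weighted nearest-neighbor vote. This is only one plausible reading of "example-based learning": the Jaccard word-overlap similarity, the value of `k`, the toy training set, and the function names are all assumptions for illustration, and the paper's actual algorithm may differ.

```python
from collections import Counter

# Hypothetical labeled examples, invented for this sketch.
TRAINING = [
    ("cheap flights to boston", "travel"),
    ("book a hotel in boston", "travel"),
    ("pizza delivery near me", "food"),
    ("best pizza in town", "food"),
]


def jaccard(a, b):
    """Word-set overlap between two texts, in [0, 1]."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0


def classify(text, examples, k=3):
    """Label a text by a similarity-weighted vote of its k most similar examples."""
    nearest = sorted(examples, key=lambda ex: jaccard(text, ex[0]), reverse=True)[:k]
    votes = Counter()
    for example_text, label in nearest:
        votes[label] += jaccard(text, example_text)
    return votes.most_common(1)[0][0]


print(classify("flights to boston tonight", TRAINING))
print(classify("pizza in town tonight", TRAINING))
```

Unlike an inductive model, nothing is fitted in advance here; every prediction consults the stored examples directly, which is why such a method can make use of labeled data drawn from a somewhat different distribution.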
USING COLLECTIVE INFORMATION IN SEMI-SUPERVISED LEARNING FOR SPEECH RECOGNITION
Training accurate acoustic models typically requires a large amount of transcribed data, which can be expensive to obtain. In this paper, we describe a novel semi-supervised learning algorithm for automatic speech recognition. The algorithm determines whether a hypothesized transcription should be used in training by considering collective information from all available utterances, rather than relying solely on the confidence of that utterance itself. It estimates the expected entropy reduction that each utterance and transcription pair may cause across the whole unlabeled dataset and chooses the pairs with positive gains. We compare our algorithm with an existing confidence-based semi-supervised learning algorithm and show that the former consistently outperforms the latter when the same number of utterances is selected for the training set. We also show that our algorithm may determine the cutoff point in a principled way by demonstrating that the point it finds is very close to the achievable peak point. Index Terms: Semi-supervised learning, entropy reduction, lattice, confidence, collective information

