Results 1 - 9 of 9
From HMM's to Segment Models: A Unified View of Stochastic Modeling for Speech Recognition
, 1996
COMBINING KNOWLEDGE SOURCES TO REORDER N-BEST SPEECH HYPOTHESIS LISTS
, 1994
Cited by 40 (13 self)
A simple and general method is described that can combine different knowledge sources to reorder N-best lists of hypotheses produced by a speech recognizer. The method is automatically trainable, acquiring information from both positive and negative examples. In experiments, the method was tested on a 1000-utterance sample of unseen ATIS data.
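As a rough illustration of what reordering an N-best list with several knowledge sources can look like, the sketch below ranks hypotheses by a weighted sum of per-source scores. The abstract does not say how the sources are combined or how the weights are trained, so the linear combination, the source names (`acoustic`, `syntax`), the weights, and the example hypotheses are all assumptions made for illustration only.

```python
from typing import Dict, List, Tuple

# A hypothesis is its text plus one score per knowledge source.
Hypothesis = Tuple[str, Dict[str, float]]


def rerank(nbest: List[Hypothesis], weights: Dict[str, float]) -> List[Hypothesis]:
    """Sort hypotheses by a weighted sum of their knowledge-source scores.

    Assumption: a linear combination; sources without a weight count as 0.
    """
    def combined(hyp: Hypothesis) -> float:
        _, scores = hyp
        return sum(weights.get(name, 0.0) * value for name, value in scores.items())

    return sorted(nbest, key=combined, reverse=True)


# Made-up example: the acoustically best hypothesis is not the top one
# once a second (hypothetical) knowledge source is taken into account.
nbest = [
    ("show me flights to boston", {"acoustic": -120.0, "syntax": 1.0}),
    ("show me flights to austin", {"acoustic": -118.0, "syntax": 0.0}),
]
weights = {"acoustic": 1.0, "syntax": 5.0}  # hypothetical, not trained

best_text = rerank(nbest, weights)[0][0]
```

In a trainable setup the weights would be fitted on examples of correct and incorrect hypotheses rather than set by hand as here.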
Speaker Adaptation Using Combined Transformation and Bayesian Methods
, 1994
Cited by 38 (4 self)
Adapting the parameters of a statistical speaker-independent continuous-speech recognizer to the speaker and the channel can significantly improve the recognition performance and robustness of the system. In continuous mixture-density hidden Markov models the number of component densities is typically very large, and it may not be feasible to acquire a sufficient amount of adaptation data for robust maximum-likelihood estimates. To solve this problem, we have recently proposed a constrained estimation technique for Gaussian mixture densities. To improve the behavior of our adaptation scheme for large amounts of adaptation data, we combine it here with Bayesian techniques. We evaluate our algorithms on the large-vocabulary Wall Street Journal corpus for nonnative speakers of American English. The recognition error rate is approximately halved with only a small amount of adaptation data, and it approaches the speaker-independent accuracy achieved for native speakers.
A Phone-Dependent Confidence Measure For Utterance Rejection
- In Proceedings of the International Conference on Acoustics, Speech and Signal Processing
Cited by 13 (0 self)
An acoustic confidence measure for acceptance/rejection of recognition hypotheses for continuous speech utterances is proposed. This measure is useful for rejecting utterances that are out of domain, or contain out-of-vocabulary words or speech disfluencies. A phone-based approach is implemented so that a single global threshold can be applied to hypothesis rejection for any word sequence. Phone confidence is computed for each frame of speech as the posterior phone probability given the acoustic observation. Word sequence confidence is evaluated as the average phone confidence, either by weighting all frames equally or by normalizing by phone duration. The confidence measure is tested on a database of spoken company names. When normalized by phone duration, it achieves, in some cases with less computational expense, rejection performance comparable to a baseline system implementing a common filler-model approach. When all frames are equally weighted, performance is substantially poorer...
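The two averaging schemes mentioned in this abstract can be sketched as follows. This is one reading of the description, not the authors' implementation: it assumes the per-frame posterior phone probabilities are already available, grouped by hypothesized phone, and it interprets "normalizing by phone duration" as averaging within each phone first and then across phones. The posterior values and the threshold are invented.

```python
from typing import List

# One inner list per hypothesized phone; each value is the posterior
# probability of that phone for one frame (values are made up).
PhoneFrames = List[List[float]]


def frame_weighted_confidence(phones: PhoneFrames) -> float:
    """Average over all frames, so long phones dominate the score."""
    frames = [p for phone in phones for p in phone]
    return sum(frames) / len(frames)


def duration_normalized_confidence(phones: PhoneFrames) -> float:
    """Average within each phone, then across phones (assumed reading)."""
    phone_means = [sum(phone) / len(phone) for phone in phones]
    return sum(phone_means) / len(phone_means)


def accept(confidence: float, threshold: float) -> bool:
    """Apply a single global threshold to the hypothesis confidence."""
    return confidence >= threshold


# A long, well-matched phone followed by a short, poorly matched one.
phones = [[0.9, 0.9, 0.9, 0.9], [0.1]]
by_frame = frame_weighted_confidence(phones)
by_phone = duration_normalized_confidence(phones)
```

With these numbers the frame-weighted score is pulled up by the long phone, while the duration-normalized score gives the short, badly matched phone equal say, which is the kind of difference the abstract's comparison turns on.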
Speech Recognition System Design Based on Automatically Derived Units
, 1999
Cited by 10 (0 self)
In most speech recognition systems today, acoustic modeling and lexical modeling are viewed as separable problems. Currently the most popular approach is to manually define canonical word pronunciations in terms of phonetic units and let the acoustic models capture differences between actual spoken and canonical pronunciations implicitly with Gaussian mixture models. As a result, these models can be very broad, particularly for casual spontaneous speech. An alternative approach, explored in this thesis, is to learn a unit inventory and pronunciation dictionary from training data using a maximum likelihood objective function. In particular,
Large Vocabulary Continuous Speech Recognition: from Laboratory Systems towards Real-World Applications
, 1996
Cited by 6 (4 self)
This paper provides an overview of the state-of-the-art in laboratory speaker-independent, large vocabulary continuous speech recognition (LVCSR) systems with a view towards adapting such technology to the requirements of real-world applications. While in speech recognition the principal concern is to transcribe the speech signal as a sequence of words, the same core technology can be applied to domains other than dictation. The main topics addressed are acoustic-phonetic modeling, lexical representation, language modeling, decoding and model adaptation. After a brief summary of experimental results, some directions towards usable systems are given. In moving from laboratory systems towards real-world applications, different constraints arise which influence the system design. The application imposes limitations on computational resources, constraints on signal capture, requirements for noise and channel compensation, and rejection capability. The difficulties and costs of adapting existing technology to new languages and applications need to be assessed. Near-term applications for LVCSR technology are likely to grow in somewhat limited domains such as spoken language systems for information retrieval, and limited domain dictation. Perspectives on some unresolved problems are given, indicating areas for future research.
Estimating Performance of Pipelined Spoken Language Translation Systems
- ICSLP'94. MULTILINGUAL EVALUATION
, 1994
Cited by 3 (1 self)
Most spoken language translation systems developed to date rely on a pipelined architecture, in which the main stages are speech recognition, linguistic analysis, transfer, generation and speech synthesis. When making projections of error rates for systems of this kind, it is natural to assume that the error rates for the individual components are independent, making the system accuracy the product of the component accuracies. The paper reports experiments carried out using the SRI-SICS-Telia Research Spoken Language Translator and a 1000-utterance sample of unseen data. The results suggest that the naive performance model leads to serious overestimates of system error rates, since there are in fact strong dependencies between the components. Predicting the system error rate on the independence assumption by simple multiplication resulted in a 16% proportional overestimate for all utterances,
Towards A Compact Speech Recognizer: Subspace Distribution Clustering Hidden Markov Model
, 1998
Cited by 2 (1 self)
1 Introduction
1.1 The Problem: Too Many Parameters
1.2 Proposed Solution: It Is Time to Share More!
1.3 Thesis Summary and Outline
2 Review of Acoustic Modeling Using Hidden Markov Model
2.1 Speech Characteristics
2.2 Selection of Input Speech Space and Speech Model
2.2.1 Cepstral Input
2.2.2 Hidden Markov Model
2.2.3 Our Choice of HMM for Acoustic Modeling
2.3 Speech Unit to Model ...
Abstract
, 2008
Most spoken language translation systems developed to date rely on a pipelined architecture, in which the main stages are speech recognition, linguistic analysis, transfer, generation and speech synthesis. When making projections of error rates for systems of this kind, it is natural to assume that the error rates for the individual components are independent, making the system accuracy the product of the component accuracies. The paper reports experiments carried out using the SRI-SICS-Telia Research Spoken Language Translator and a 1000-utterance sample of unseen data. The results suggest that the naive performance model leads to serious overestimates of system error rates, since there are in fact strong dependencies between the components. Predicting the system error rate on the independence assumption by simple multiplication resulted in a 16% proportional overestimate for all utterances, and a 19% overestimate when only utterances of length 1-10 words were considered.
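The naive performance model described in this abstract amounts to a short piece of arithmetic, sketched below. The component accuracies and the observed error rate are invented numbers, not figures from the paper, and "proportional overestimate" is taken here to mean the gap between predicted and observed error rates relative to the observed one, which is an assumption about the paper's definition.

```python
import math
from typing import Sequence


def naive_system_accuracy(component_accuracies: Sequence[float]) -> float:
    """Independence assumption: system accuracy is the product of the
    component accuracies."""
    return math.prod(component_accuracies)


def proportional_overestimate(predicted_error: float, observed_error: float) -> float:
    """Relative gap between predicted and observed error rates
    (assumed definition)."""
    return (predicted_error - observed_error) / observed_error


# Hypothetical two-stage pipeline, e.g. recognition followed by translation.
component_accuracies = [0.9, 0.8]
predicted_error = 1.0 - naive_system_accuracy(component_accuracies)

# Hypothetical measured end-to-end error rate; it is lower than predicted
# because component errors tend to fall on the same utterances.
observed_error = 0.25
overestimate = proportional_overestimate(predicted_error, observed_error)
```

When errors in different stages coincide on the same utterances, the end-to-end error rate stays below the multiplied prediction, which is the dependency effect the abstract reports.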

