Mean and Variance Adaptation within the MLLR Framework
- Computer Speech & Language
, 1996
Cited by 80 (15 self)
One of the key issues for adaptation algorithms is to modify a large number of parameters with only a small amount of adaptation data. Speaker adaptation techniques try to obtain near speaker dependent (SD) performance with only small amounts of speaker specific data, and are often based on initial speaker independent (SI) recognition systems. Some of these speaker adaptation techniques may also be applied to the task of adaptation to a new acoustic environment. In this case an SI recognition system trained in, typically, a clean acoustic environment is adapted to operate in a new, noise-corrupted, acoustic environment. This paper examines the Maximum Likelihood Linear Regression (MLLR) adaptation technique. MLLR estimates linear transformations for groups of model parameters to maximise the likelihood of the adaptation data. Previously, MLLR has been applied to the mean parameters in mixture Gaussian HMM systems. In this paper MLLR is extended to also update the Gaussian variances and re-estimation formulae are derived for these variance transforms. MLLR with variance compensation is evaluated on several large vocabulary recognition tasks. The use of mean and variance MLLR adaptation was found to give an additional 2% to 7% decrease in word error rate over mean-only MLLR adaptation.
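The mean update that this abstract builds on can be sketched in a few lines. The sketch below assumes the commonly used formulation in which an extended mean vector [1, mu] is multiplied by a transform W = [b | A], so the adapted mean is A mu + b. The function name, the 2-D example values, and the omission of transform estimation and of the variance transform are all assumptions made for illustration, not details taken from the abstract.

```python
def apply_mllr_mean(W, mu):
    """Apply an MLLR-style mean transform: returns W @ [1, mu] = A @ mu + b.

    W is an n x (n + 1) matrix whose first column is the bias b and whose
    remaining columns form A (layout assumed for this sketch).
    """
    xi = [1.0] + list(mu)  # extended mean vector with a leading offset term
    return [sum(w * x for w, x in zip(row, xi)) for row in W]


# Hypothetical 2-D example: bias (0.5, -1.0) and a diagonal scaling (2.0, 0.5).
W = [[0.5, 2.0, 0.0],
     [-1.0, 0.0, 0.5]]
adapted = apply_mllr_mean(W, [1.0, 4.0])
print(adapted)
```

In practice one W would be shared by a whole group of Gaussians, which is what lets a small amount of adaptation data update many parameters.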
Recent advances in the automatic recognition of audio-visual speech
- PROC. IEEE
, 2003
Cited by 64 (10 self)
Visual speech information from the speaker’s mouth region has been successfully shown to improve noise robustness of automatic speech recognizers, thus promising to extend their usability in the human computer interface. In this paper, we review the main components of audio-visual automatic speech recognition and present novel contributions in two main areas: First, the visual front end design, based on a cascade of linear image transforms of an appropriate video region-of-interest, and subsequently, audio-visual speech integration. On the latter topic, we discuss new work on feature and decision fusion combination, the modeling of audio-visual speech asynchrony, and incorporating modality reliability estimates to the bimodal recognition process. We also briefly touch upon the issue of audio-visual adaptation. We apply our algorithms to three multi-subject bimodal databases, ranging from small- to large-vocabulary recognition tasks, recorded in both visually controlled and challenging environments. Our experiments demonstrate that the visual modality improves automatic speech recognition over all conditions and data considered, though less so for visually challenging environments and large vocabulary tasks.
Recognition of conversational telephone speech using the Janus Speech Engine
- IN PROCEEDINGS OF THE ICASSP’97
, 1997
Cited by 31 (12 self)
Recognition of conversational speech is one of the most challenging speech recognition tasks to date. While recognition error rates of 10% or lower can now be reached on speech dictation tasks over vocabularies in excess of 60,000 words, recognition of conversational speech has persistently resisted most attempts at improvements by way of the proven techniques to date. Difficulties arise from shorter words, telephone channel degradation, and highly disfluent and coarticulated speech. In this paper, we describe the application, adaptation, and performance evaluation of our JANUS speech recognition engine to the Switchboard conversational speech recognition task. Through a number of algorithmic improvements, we have been able to reduce error rates from more than 50% word error to 38%, measured on the official 1996 NIST evaluation test set. Improvements include vocal tract length normalization, polyphonic modeling, label boosting, speaker adaptation with and without confidence measures, and speaking mode dependent pronunciation modeling.
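For readers comparing the absolute figures above with relative figures quoted elsewhere, the arithmetic can be sketched as follows. The helper name is hypothetical, and because the abstract only says the starting point was "more than 50%", the value computed from 50% should be read as a lower bound on the relative reduction.

```python
def relative_reduction(baseline_wer, new_wer):
    """Relative word error rate reduction, as a fraction of the baseline."""
    return (baseline_wer - new_wer) / baseline_wer


# 50% is used as a stand-in for "more than 50%", so this is a lower bound.
lower_bound = relative_reduction(50.0, 38.0)
print(f"at least {lower_bound:.0%} relative reduction")
```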
The SRI March 2000 Hub-5 conversational speech transcription system
- In Proceedings of the NIST Speech Transcription Workshop
, 2000
Cited by 26 (6 self)
We describe SRI’s large vocabulary conversational speech recognition system as used in the March 2000 NIST Hub-5E evaluation. The system performs four recognition passes: (1) bigram recognition with phone-loop-adapted, within-word triphone acoustic models, (2) lattice generation with transcription-mode-adapted models, (3) trigram lattice recognition with adapted cross-word triphone models, and (4) N-best rescoring and reranking with various additional knowledge sources. The system incorporates two new kinds of acoustic model: triphone models conditioned on speaking rate, and an explicit joint model of within-word phone durations. We also obtained an unusually large improvement from modeling crossword pronunciation variants in “multiword” vocabulary items. The language model (LM) was enhanced with an “anti-LM” representing acoustically confusable word sequences. Finally, we applied a generalized ROVER algorithm to combine the N-best hypotheses from several systems based on different acoustic models.
Uncertainty decoding for noise robust speech recognition
- in Proc. Interspeech
, 2004
Cited by 26 (8 self)
This dissertation is the result of my own work and includes nothing which is the outcome of work done in collaboration. It has not been submitted in whole or in part for a degree at any other university. Some of the work has been published previously in conference proceedings
Joint uncertainty decoding for robust large vocabulary speech recognition
, 2006
Cited by 23 (20 self)
Standard techniques to increase automatic speech recognition noise robustness typically assume recognition models are clean trained. This “clean” training data may in fact not be clean at all, but may contain channel variations, varying noise conditions, as well as different speakers. Hence rather than considering noise robustness techniques as compensating clean acoustic models for environmental noise, they may be thought of as reducing the acoustic mismatch between training and test conditions. This report examines the application of VTS model compensation or model-based Joint uncertainty decoding to clean and multistyle trained systems. An EM-based noise estimation procedure is also presented to produce ML VTS or Joint noise models depending on the form of compensation used. Alternatively, compared to multistyle training, adaptive training with Joint uncertainty transforms, also referred to as JAT in this work, provides a better method for handling heterogeneous data. With JAT, the uncertainty bias added to the model variances de-weights observations proportional to the noise level. In this way, Joint transforms normalise the noise from the data allowing the canonical model to solely represent the underlying “clean” acoustic signal. This
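The de-weighting effect of a variance bias can be illustrated with a small sketch. It assumes diagonal-covariance Gaussians and simply adds a per-dimension bias to the model variance before scoring; the feature transform that Joint uncertainty decoding also applies is left out, and the function name and numbers are hypothetical.

```python
import math


def diag_gauss_loglik(obs, mean, var, var_bias):
    """Log-likelihood of obs under a diagonal Gaussian with variance var + var_bias."""
    total = 0.0
    for o, m, v, b in zip(obs, mean, var, var_bias):
        s = v + b  # the uncertainty bias inflates the model variance
        total += -0.5 * (math.log(2.0 * math.pi * s) + (o - m) ** 2 / s)
    return total


obs = [0.0]
# Score gap between a matching model (mean 0) and a competing model (mean 2).
gap_low_noise = (diag_gauss_loglik(obs, [0.0], [1.0], [0.0])
                 - diag_gauss_loglik(obs, [2.0], [1.0], [0.0]))
gap_high_noise = (diag_gauss_loglik(obs, [0.0], [1.0], [3.0])
                  - diag_gauss_loglik(obs, [2.0], [1.0], [3.0]))
print(gap_low_noise, gap_high_noise)
```

With a larger bias the gap between competing models shrinks, so a noisy frame contributes less to the decision than a clean one.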
Predictive Model-Based Compensation Schemes for Robust Speech Recognition
- Speech Communication
, 1998
Cited by 17 (1 self)
For practical applications speech recognition systems need to be insensitive to differences between training and test acoustic conditions. Differences in the acoustic environment may result from various sources, such as ambient background noise, channel variations and speaker stress. These differences can dramatically degrade the performance of a speech recognition system. A wide range of techniques have been proposed for achieving noise robustness. This paper considers one particular approach to model-based compensation, predictive model-based compensation, which has been shown to achieve good noise robustness in a wide range of acoustic environments. The characteristic of these schemes is that they combine a speech model with an additive noise model, a channel model and, in the general case, a speaker stress model, to generate a corrupted-speech model. The general theory of these predictive techniques is discussed. Various approximations for rapidly performing the model combination stage have been proposed and are reviewed in this paper. The advantages and the limitations of such a predictive approach to noise robustness are also discussed. In addition, methods for combining predictive schemes with schemes which make use of speech data in the new environment, adaptive schemes, are detailed. This combined approach overcomes some of the limitations of the predictive schemes.
Author note: The author is now at the IBM T.J. Watson Research Center, Yorktown Heights, NY 10598, USA.
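One fast approximation of this kind is the log-add approximation, sketched below for illustration. The abstract does not say which approximations the paper reviews, so this choice is an assumption. The sketch works on log-spectral means, treats the channel as an additive offset in that domain, leaves the variances untouched, and uses hypothetical names and values.

```python
import math


def log_add_means(speech_mean, noise_mean, channel=None):
    """Combine log-spectral speech and noise means: log(exp(s + h) + exp(n))."""
    if channel is None:
        channel = [0.0] * len(speech_mean)  # assume no channel distortion
    return [math.log(math.exp(s + h) + math.exp(n))
            for s, n, h in zip(speech_mean, noise_mean, channel)]


# Hypothetical two-bin example: speech energy 3 and 1, noise energy 1 in each bin.
corrupted = log_add_means([math.log(3.0), 0.0], [0.0, 0.0])
print(corrupted)
```

Energies add in the linear spectral domain, which is why the means are exponentiated, summed, and mapped back with a logarithm.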
Speech technology in computer-aided language learning: Strengths and limitations of a new CALL paradigm
- Language Learning & Technology
, 1998
Cited by 15 (0 self)
We investigate the suitability of deploying speech technology in computer-based systems that can be used to teach foreign language skills. In reviewing the current state of speech recognition and speech processing technology and by examining a number of voice-interactive CALL applications, we suggest how to create robust interactive learning environments that exploit the strengths of speech technology while working around its limitations. In the conclusion, we draw on our review of these applications to identify directions of future research that might improve both the design and the overall performance of voice-interactive CALL systems.
Fast Robust Inverse Transform SAT and Multi-stage Adaptation
- in Proceedings DARPA Broadcast News Transcription and Understanding Workshop
, 1998
Cited by 9 (1 self)
We present a new method of Speaker Adapted Training (SAT) that is more robust, faster, and results in lower error rate than the previous methods. The method, called Inverse Transform SAT (ITSAT), is based on removing the differences between speakers before training, rather than modeling the differences during training. We develop several methods to avoid the problems associated with inverting the transformation. In one method, we interpolate the transformation matrix with an identity or diagonal transformation. We also apply constraints to the matrix to avoid estimation problems. We show that by using many diagonal-only transformation matrices with constraints we can achieve performance that is comparable to that of the original SAT method at a fraction of the cost. In addition, we describe a multi-stage approach to Maximum Likelihood Linear Regression (MLLR) unsupervised adaptation and we show that it is more effective than a single stage regular MLLR adaptation. As a final stage, we adapt the...
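The idea of interpolating a transformation with the identity so that it can be inverted safely might look like the sketch below. The abstract does not give the interpolation formula, so the linear blend, the weight value, and the function names are assumptions chosen only to show the effect on invertibility.

```python
def interpolate_with_identity(A, weight):
    """Return weight * A + (1 - weight) * I for a square matrix A."""
    n = len(A)
    return [[weight * A[i][j] + (1.0 - weight) * (1.0 if i == j else 0.0)
             for j in range(n)]
            for i in range(n)]


def det2(M):
    """Determinant of a 2 x 2 matrix."""
    return M[0][0] * M[1][1] - M[0][1] * M[1][0]


# Hypothetical singular transform: its second row is all zeros.
A = [[2.0, 1.0],
     [0.0, 0.0]]
smoothed = interpolate_with_identity(A, 0.5)
print(det2(A), det2(smoothed))
```

The original matrix has zero determinant and cannot be inverted, while the blended one can, which is the kind of inversion problem the abstract refers to.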
The development of SRI’s 1997 Broadcast News transcription system
- In Proceedings DARPA Broadcast News Transcription and Understanding Workshop
Cited by 9 (5 self)
This paper describes SRI’s 1997 broadcast news transcription system used for the 1997 DARPA H4 evaluations. Our system had several novel components. These include automatic segmentation of entire broadcast shows, word-internal and crossword acoustic models robustly estimated with a new Gaussian Merging-Splitting (GMS) algorithm, the use of trigram language models (LMs) in lattices instead of for rescoring N-best lists, and an LM pruning algorithm that allows efficient representation of high-order (like 4- or 5-gram) LMs. We briefly describe these features and give comparative experimental results. We achieved an 18.7% relative improvement in performance on our 1996 H4 partitioned evaluation (PE) development test set as compared to our 1996 H4 PE evaluation system.

