Environmental Adaptation for Robust Speech Recognition (1994)

by Fu-Hua Liu

Results 1 - 10 of 11

Speech Recognition in Noisy Environments

by Pedro J. Moreno - Ph.D. Dissertation, ECE Department, CMU, 1996
Abstract - Cited by 72 (3 self)
Acknowledgments
Chapter 1 Introduction
  1.1. Thesis goals
  1.2. Dissertation Outline
Chapter 2 The SPHINX-II Recognition System
  2.1. An Overview of the SPHINX-II System
    2.1.1. Signal Processing
    2.1.2. Hidden Markov Models
    2.1.3. Recognition Unit
    2.1.4. Training
    2.1.5. Recognition
  2.2. Experimental Tasks and Corpora ...

A vector Taylor series approach for environment-independent speech recognition

by Pedro J. Moreno, Bhiksha Raj, Richard M. Stern - Proc. ICASSP-96, 1996
Abstract - Cited by 66 (16 self)
In this paper we introduce a new analytical approach to environment compensation for speech recognition. Previous attempts at solving analytically the problem of noisy speech recognition have either used an overly-simplified mathematical description of the effects of noise on the statistics of speech or they have relied on the availability of large environment-specific adaptation sets. Some of the previous methods required the use of adaptation data that consists of simultaneously recorded or “stereo” recordings of clean and degraded speech. In this work we introduce the use of a Vector Taylor series (VTS) expansion to characterize efficiently and accurately the effects on speech statistics of unknown additive noise and unknown linear filtering in a transmission channel. The VTS approach is computationally efficient. It can be applied either to the incoming speech feature vectors, or to the statistics representing these vectors. In the first case the speech is compensated and then recognized; in the second case HMM statistics are modified using the VTS formulation. Both approaches use only the actual speech segment being recognized to compute the parameters required for environmental compensation. We evaluate the performance of two implementations of VTS algorithms using the CMU SPHINX-II system on the 100-word alphanumeric CENSUS database and on the 1993 5000-word ARPA Wall Street Journal database. Artificial white Gaussian noise is added to both databases. The VTS approaches provide significant improvements in recognition accuracy compared to previous algorithms.
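
The abstract above does not give the VTS equations, so the following is only a minimal sketch of the general idea, under stated assumptions. It assumes the commonly used log-spectral model of additive noise and linear filtering, y = x + h + log(1 + exp(n - x - h)), applied to one scalar channel, with clean speech x and noise n independent. The function names (`noisy_log_spectrum`, `vts_first_order`) are hypothetical and are not taken from the paper.

```python
import math

def noisy_log_spectrum(x, n, h):
    """Assumed environment model for one log-spectral channel:
    y = x + h + log(1 + exp(n - x - h)), with clean speech x,
    additive noise n and channel (filter) term h."""
    return x + h + math.log1p(math.exp(n - x - h))

def vts_first_order(mu_x, var_x, mu_n, var_n, h):
    """First-order Taylor approximation of the noisy-speech mean and
    variance, expanded around the clean-speech and noise means."""
    # Partial derivative dy/dx at the expansion point; dy/dn equals 1 - a.
    a = 1.0 / (1.0 + math.exp(mu_n - mu_x - h))
    mu_y = noisy_log_spectrum(mu_x, mu_n, h)      # linear terms vanish in the mean
    var_y = a * a * var_x + (1.0 - a) ** 2 * var_n
    return mu_y, var_y

if __name__ == "__main__":
    # Noise far below the speech level: statistics are almost unchanged.
    print(vts_first_order(10.0, 1.0, -50.0, 1.0, 0.0))
    # Noise at the speech level: the mean shifts by log(2).
    print(vts_first_order(0.0, 1.0, 0.0, 1.0, 0.0))
```

In this sketch the compensated statistics could be used either to correct incoming features or to modify model parameters, which mirrors the two uses the abstract mentions; how the noise and channel parameters are estimated is not covered here.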

Multi-Microphone Correlation-Based Processing for Robust Automatic Speech Recognition

by Thomas M. Sullivan - IEEE International Conference on Acoustics, Speech, and Signal Processing, 1996
Abstract - Cited by 27 (3 self)
Acknowledgments
Chapter 1. Introduction
  1.1. The Cross-Condition Problem
  1.2. Thesis Statement
  1.3. Thesis Overview
Chapter 2. Background
  2.1. Delay-and-Sum Beamforming
    2.1.1. Application of Delay-and-Sum Processing to Speech Recognition
  2.2. Traditional Adaptive Arrays
    2.2.1. Adaptive Noise Cancelling
    2.2.2. Application of Traditional Adaptive Methods to Speech Recognition
  2.3. Cross-Correlation Based Arrays
    2.3.1. Phenomena ...

On adaptive decision rules and decision parameter adaptation for automatic speech recognition

by Chin-hui Lee, Qiang Huo - Proc. IEEE, 2000
Abstract - Cited by 16 (3 self)
Recent advances in automatic speech recognition are accomplished by designing a plug-in maximum a posteriori decision rule such that the forms of the acoustic and language model distributions are specified and the parameters of the assumed distributions are estimated from a collection of speech and language training corpora. Maximum-likelihood point estimation is by far the most prevailing training method. However, due to the problems of unknown speech distributions, sparse training data, high spectral and temporal variabilities in speech, and possible mismatch between training and testing conditions, a dynamic training strategy is needed. To cope with the changing speakers and speaking conditions in real operational conditions for high-performance speech recognition, such paradigms incorporate a small amount of speaker and environment specific adaptation data into the training process. Bayesian adaptive learning is an optimal way to combine

Cepstral compensation by polynomial approximation for environment-independent speech recognition

by Bhiksha Raj, Evandro B. Gouvêa, Pedro J. Moreno, Richard M. Stern - In International Conference on Spoken Language Processing, 1996
Abstract - Cited by 8 (2 self)
Speech recognition systems perform poorly on speech degraded by even simple effects such as linear filtering and additive noise. One possible solution to this problem is to modify the probability density function (PDF) of clean speech to account for the effects of the degradation. However, even for the case of linear filtering and additive noise, it is extremely difficult to do this analytically. Previously attempted analytical solutions to the problem of noisy speech recognition have either used an overly-simplified mathematical description of the effects of noise on the statistics of speech, or they have relied on the availability of large environment-specific adaptation sets. Some of the previous methods required the use of adaptation data that consists of simultaneously-recorded or “stereo” recordings of clean and degraded speech. In this paper we introduce an approximation-based method to compute the effects of the environment on the parameters of the PDF of clean speech. In this work, we perform compensation by Vector Polynomial approximationS (VPS) for the effects of linear filtering and additive noise on the clean speech. We also estimate the parameters of the environment, namely the noise and the channel, by using piecewise-linear approximations of these effects. We evaluate the performance of this method (VPS) using the CMU SPHINX-II system and the 100-word alphanumeric CENSUS database. Performance is evaluated at several SNRs, with artificial white Gaussian noise added to the database. VPS provides improvements of up to 15 percent in relative recognition accuracy.

Multivariate-Gaussian-Based Cepstral Normalization for Robust Speech Recognition

by Pedro J. Moreno, Bhiksha Raj, Evandro Gouvêa, Richard M. Stern - In Proceedings ICASSP, 1995
Abstract - Cited by 7 (2 self)
In this paper we introduce a new family of environmental compensation algorithms called Multivariate Gaussian Based Cepstral Normalization (RATZ). RATZ assumes that the effects of unknown noise and filtering on speech features can be compensated by corrections to the mean and variance of components of Gaussian mixtures, and an efficient procedure for estimating the correction factors is provided. The RATZ algorithm can be implemented to work with or without the use of “stereo” development data that had been simultaneously recorded in the training and testing environments. “Blind” RATZ partially overcomes the loss of information that would have been provided by stereo training through the use of a more accurate description of how noisy environments affect clean speech. We evaluate the performance of the two RATZ algorithms using the CMU SPHINX-II system on the alphanumeric census database and compare their performance with that of previous environmental-robustness developed at CMU.
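
The abstract states only that the environment is modeled as corrections to the means and variances of Gaussian mixture components. As a rough illustration of how such corrections might be used to clean up a feature, the sketch below works on a one-dimensional mixture and subtracts a posterior-weighted mean correction from the noisy observation. The posterior-weighted subtraction rule, the function names and all parameter names are assumptions made for this example, not details confirmed by the abstract.

```python
import math

def gaussian_pdf(y, mean, var):
    """Density of a one-dimensional Gaussian."""
    return math.exp(-0.5 * (y - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def compensate(y, weights, means, variances, mean_corr, var_corr):
    """Estimate a clean feature from the noisy feature y.

    The noisy-speech mixture is assumed to have component k with
    mean means[k] + mean_corr[k] and variance variances[k] + var_corr[k].
    The estimate subtracts the mean corrections, weighted by the
    posterior probability of each component given y.
    """
    likes = [
        w * gaussian_pdf(y, m + r, v + dv)
        for w, m, v, r, dv in zip(weights, means, variances, mean_corr, var_corr)
    ]
    total = sum(likes)
    posteriors = [l / total for l in likes]
    return y - sum(p * r for p, r in zip(posteriors, mean_corr))

if __name__ == "__main__":
    # One component shifted by +2.0: the shift is removed.
    print(compensate(5.0, [1.0], [3.0], [1.0], [2.0], [0.0]))
```

With stereo data the correction terms could be estimated from paired clean and noisy frames; without it they would have to be estimated from the noisy data alone, which is the distinction the abstract draws between the two RATZ variants.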

Cepstral compensation using statistical linearization. In Robust speech recognition using unknown communication channels

by Bhiksha Raj, Evandro Gouvêa, Richard M. Stern - ESCA-NATO Tutorial and Research Workshop, 1997
Abstract - Cited by 2 (1 self)
Speech recognition systems perform poorly on speech degraded by even simple effects such as linear filtering and additive noise. One solution to this problem is to modify the probability density function (PDF) of clean speech to account for the effects of the degradation. However, even for the case of linear filtering and additive noise, it is extremely difficult to do this analytically. Previously-attempted analytical solutions for the problem of noisy speech recognition have either used an overly-simplified mathematical description of the effects of noise on the statistics of speech, or they have relied on the availability of large environment-specific adaptation sets. In this paper we present the Vector Polynomial approximationS (VPS) method to compensate for the effects of linear filtering and additive noise on the PDF of clean speech. VPS also estimates the parameters of the environment, namely the noise and the channel, by using statistically linearized approximations of these effects. We evaluate the performance of this method (VPS) using the CMU SPHINX-II system on the alphanumeric CENSUS database corrupted with artificial white Gaussian noise. VPS provides improvements of up to 15 percent in relative recognition accuracy over our previous best algorithm, VTS, while being up to 20 percent more computationally efficient.

Approaches to Environment Compensation in Automatic Speech Recognition

by Pedro J. Moreno, Bhiksha Raj, Richard M. Stern - Proceeding of the 1995 International Conference in Acoustics ICA'95, 1995
Abstract - Cited by 1 (1 self)
This paper describes a series of cepstral-based compensation procedures that render the SPHINX-II continuous speech recognition system more robust with respect to acoustical changes in the environment. The first two algorithms, SNR based MultivaRiate gAussian based cepsTral normaliZation (SNR-based RATZ) and STAtistical Reestimation of HMMs (STAR), compensate for environmental degradation based on comparisons of simultaneously recorded data in the training and testing environments (“stereo data”). They differ in that RATZ modifies the incoming feature vectors to a recognition system while STAR modifies the internal representation of speech by the system. We also describe N-CDCN, an improved version of codeword-dependent cepstral normalization (CDCN) which does not require stereo training data but nevertheless achieves performance levels comparable to RATZ and other algorithms that require stereo training. Use of these compensation algorithms significantly reduces the error rates for SPHINX-II. The algorithms are tested in a variety of databases and environmental conditions.

Continuous Recognition of Large-Vocabulary Telephone-Quality Speech

by Pedro J. Moreno, Matthew A. Siegler, Uday Jain, Richard M. Stern - Proceedings of the ARPA Workshop on Spoken Language Technology, 1994
Abstract - Cited by 1 (1 self)
The problem of speech recognition over telephone lines is growing in importance, as many near-term applications of spoken-language processing are likely to involve telephone speech. This paper describes recent efforts by the CMU speech group to improve the recognition accuracy of telephone-channel speech, particularly in the context of the 1994 ARPA common Hub 2 evaluation of speech over long-distance telephone lines. The greatest amount of work was directed toward determining a training procedure that provides the greatest recognition accuracy when the incoming speech is known to be collected over the telephone. We compare the effectiveness of three training procedures, finding that training using high-quality speech that is bandlimited to 8 kHz can achieve results that are as good as those obtained by training on speech of a similar bandwidth collected over actual telephone channels. We also compare the recognition accuracy of the SPHINX-II system using high-quality speech and telephone speech, and we comment on the reasons for differences in system performance.

Pronunciation Models and Their Evaluation Using Confidence Measures

by Mathew Magimai Doss, Herve Bourlard, 2001
Abstract - Cited by 1 (0 self)
In this report, we present preliminary experiments towards automatic inference and evaluation of pronunciation models based on multiple utterances of each lexicon word and their given baseline pronunciation model (baseform phonetic transcription). In the present system, the pronunciation models are extracted by decoding each of the training utterances through a series of hidden Markov models (HMM), first initialized to only allow the generation of the baseline transcription but iteratively relaxed to converge to a truly ergodic HMM. Each of the generated pronunciation models is then evaluated based on its confidence measure and its Levenshtein distance from the baseform model. The goal of this study is twofold. First, we show that this approach is appropriate to generate robust pronunciation variants. Second, we intend to use this approach to optimize these pronunciation models, by modifying/extending the acoustic features, to increase their confidence scores. In other words, while classical pronunciation modeling approaches usually attempt to make the models more and more complex to capture the pronunciation variability, we intend to fix the pronunciation models and optimize the acoustic parameters to maximize their matching and discriminant properties.
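
The abstract above compares each inferred pronunciation with the baseform using a Levenshtein distance. A minimal sketch of that measure over phone sequences is given below; it uses unit costs for insertion, deletion and substitution, which is an assumption, since the abstract does not state the cost scheme. The function name and the example phone strings are illustrative only.

```python
def levenshtein(a, b):
    """Edit distance between two sequences (e.g. lists of phone labels),
    with unit cost for each insertion, deletion and substitution."""
    prev = list(range(len(b) + 1))          # distances from the empty prefix of a
    for i, x in enumerate(a, 1):
        cur = [i]                           # deleting the first i symbols of a
        for j, y in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,                # delete x
                cur[j - 1] + 1,             # insert y
                prev[j - 1] + (x != y),     # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]

if __name__ == "__main__":
    baseform = ["t", "ah", "m", "ey", "t", "ow"]
    variant = ["t", "ah", "m", "aa", "t", "ow"]
    print(levenshtein(baseform, variant))
```

A small distance indicates a variant close to the baseline transcription; the report combines this with a confidence measure, which is not modeled here.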