Predictive Model-Based Compensation Schemes for Robust Speech Recognition
Speech Communication, 1998
Cited by 17 (1 self)
For practical applications speech recognition systems need to be insensitive to differences between training and test acoustic conditions. Differences in the acoustic environment may result from various sources, such as ambient background noise, channel variations and speaker stress. These differences can dramatically degrade the performance of a speech recognition system. A wide range of techniques have been proposed for achieving noise robustness. This paper considers one particular approach to model-based compensation, predictive model-based compensation, which has been shown to achieve good noise robustness in a wide range of acoustic environments. The characteristic of these schemes is that they combine a speech model with an additive noise model, a channel model and, in the general case, a speaker stress model, to generate a corrupted-speech model. The general theory of these predictive techniques is discussed. Various approximations for rapidly performing the model combination stage have been proposed and are reviewed in this paper. The advantages and the limitations of such a predictive approach to noise robustness are also discussed. In addition, methods for combining predictive schemes with schemes which make use of speech data in the new environment, adaptive schemes, are detailed. This combined approach overcomes some of the limitations of the predictive schemes.
Footnote: The author is now at the IBM T.J. Watson Research Center, Yorktown Heights, NY 10598, USA.
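To make the model combination step concrete, here is a minimal sketch of one simple way to combine a clean-speech model with an additive noise model and a channel offset, often called a log-add approximation. It is an illustration under stated assumptions, not necessarily any of the approximations the paper reviews: the function name `log_add_combine`, the use of per-band log-spectral means, and the choice to leave variances unchanged are all assumptions of this sketch.

```python
import math

def log_add_combine(speech_means, noise_means, channel=None):
    """Predict corrupted-speech means from clean-speech and noise means.

    All inputs are per-band log-spectral means. `channel` is an optional
    per-band log-domain offset (a convolutional channel becomes additive
    after taking the log). Variances are not updated in this sketch.
    """
    if channel is None:
        channel = [0.0] * len(speech_means)
    combined = []
    for mu_x, mu_n, h in zip(speech_means, noise_means, channel):
        shifted = mu_x + h
        top = max(shifted, mu_n)
        # log(exp(a) + exp(b)), computed stably around the larger term
        combined.append(
            top + math.log(math.exp(shifted - top) + math.exp(mu_n - top))
        )
    return combined
```

When the noise mean is far below the speech mean, the combined mean stays close to the speech mean; when the two are equal, it rises by log 2.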
Compensation for environmental degradation in automatic speech recognition
ESCA-NATO Tutorial and Research Workshop, 1997
Cited by 16 (5 self)
The accuracy of speech recognition systems degrades when operated in adverse acoustical environments. This paper reviews various methods by which more detailed mathematical descriptions of the effects of environmental degradation can improve speech recognition accuracy using both “data-driven” and “model-based” compensation strategies. Data-driven methods learn environmental characteristics through direct comparisons of speech recorded in the noisy environment with the same speech recorded under optimal conditions. Model-based methods use a mathematical model of the environment and attempt to use samples of the degraded speech to estimate model parameters. These general approaches to environmental compensation are discussed in terms of recent research in environmental robustness at CMU, and in terms of similar efforts at other sites. These compensation algorithms are evaluated in a series of experiments measuring recognition accuracy for speech from the ARPA Wall Street Journal database that is corrupted by artificially-added noise at various signal-to-noise ratios (SNRs), and in more natural speech recognition tasks.
Feature Extraction for Robust Speech Recognition using a Power-Law Nonlinearity and Power-Bias Subtraction
Cited by 15 (11 self)
This paper presents a new feature extraction algorithm called Power-Normalized Cepstral Coefficients (PNCC) that is based on auditory processing. Major new features of PNCC processing include the use of a power-law nonlinearity that replaces the traditional log nonlinearity used for MFCC coefficients, and a novel algorithm that suppresses background excitation by estimating SNR based on the ratio of the arithmetic to geometric mean power, and subtracts the inferred background power. Experimental results demonstrate that the PNCC processing provides substantial improvements in recognition accuracy compared to MFCC and PLP processing for various types of additive noise. The computational cost of PNCC is only slightly greater than that of conventional MFCC processing. Index Terms: Robust speech recognition, physiological modeling, rate-level curve, power function, ratio of arithmetic mean to geometric mean, power distribution normalization
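The two quantities named in this abstract can be illustrated with a short sketch: the ratio of arithmetic to geometric mean power, and a power-law nonlinearity used in place of the log. This is not the PNCC algorithm itself. The function names and the default exponent are assumptions chosen for illustration, and the power-bias subtraction step is not reproduced.

```python
import math

def am_gm_ratio(powers):
    """Ratio of arithmetic mean to geometric mean of positive power values.

    The ratio is 1 for a constant sequence and grows as the values
    become more unevenly distributed.
    """
    n = len(powers)
    arithmetic = sum(powers) / n
    geometric = math.exp(sum(math.log(p) for p in powers) / n)
    return arithmetic / geometric

def power_law(power, exponent=1.0 / 15.0):
    """Power-law compression; the default exponent is an assumed value."""
    return power ** exponent

def log_compress(power):
    """Conventional log compression, shown for comparison."""
    return math.log(power)
```

Unlike the log, the power law stays finite and close to zero as the input power approaches zero, so very low-energy frames do not produce large negative feature values.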
A. Acero, “Enhancement of log Mel power spectra of speech using a phase-sensitive model of the acoustic environment and sequential estimation of the corrupting noise”
IEEE Trans. on SAP, 2004
Cited by 14 (1 self)
This paper presents a novel speech feature enhancement technique based on a probabilistic, nonlinear acoustic environment model that effectively incorporates the phase relationship (hence phase sensitive) between the clean speech and the corrupting noise in the acoustic distortion process. The core of the enhancement algorithm is the MMSE (minimum mean square error) estimator for the log Mel power spectra of clean speech based on the phase-sensitive environment model, using highly efficient single-point, second-order Taylor series expansion to approximate the joint probability of clean and noisy speech modeled as a multivariate Gaussian. Since a noise estimate is required by the MMSE estimator, a high-quality, sequential noise estimation algorithm is also developed and presented. Both the noise estimation and speech feature enhancement algorithms are evaluated on the Aurora2 task of connected digit recognition. Noise-robust speech recognition results demonstrate that the new acoustic environment model which takes into account the relative phase in speech and noise mixing is superior to the earlier environment model which discards the phase under otherwise identical experimental conditions. The results also show that the sequential MAP (maximum a posteriori) learning for noise estimation is better than the sequential ML (maximum likelihood) learning, both evaluated under the identical phase-sensitive MMSE enhancement condition. Index Terms: Noise estimate, noise-robust ASR, phase-sensitive acoustic environment model, sequential algorithm, speech feature enhancement.
Histogram Equalization of the Speech Representation for Robust Speech Recognition
2001
Cited by 12 (2 self)
The noise degrades the performance of Automatic Speech Recognition systems mainly due to the mismatch between the training and recognition conditions it introduces. The noise causes a distortion of the feature space which usually presents a non-linear behavior. In order to reduce this mismatch, the methods proposed for robust speech recognition try to compensate the noise effect either by obtaining an estimation of the clean speech or by adapting the recognizer acoustic models for a proper modeling of the noisy speech. In this paper we propose a method to compensate the noise effect over the speech representation. This method is based on the histogram equalization technique frequently applied for Digital Image Processing, which has been adapted to the speech representation. For each component of the feature vectors representing the speech signal, the histogram is estimated and the transformation which converts it into a reference histogram is calculated. Such transformations tend to compensate the distortion the noise produces over the different components of the feature vector and improve the performance of the recognition systems under noise conditions. We describe how the histogram equalization method can be adapted to robust speech recognition and present some recognition experiments to evaluate the proposed method.
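The per-component procedure described above (estimate the histogram, then map it onto a reference histogram) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the rank-based estimate of the cumulative distribution, the standard Gaussian reference, and the function names are assumptions of this sketch.

```python
from statistics import NormalDist

def equalize_component(values, reference=NormalDist(0.0, 1.0)):
    """Map one feature component onto a reference distribution.

    Each value is replaced by the reference quantile at its empirical
    CDF position. The CDF is estimated from ranks, so tied values
    receive adjacent (not identical) outputs in this sketch.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    out = [0.0] * n
    for rank, idx in enumerate(order):
        cdf = (rank + 0.5) / n  # stays strictly inside (0, 1)
        out[idx] = reference.inv_cdf(cdf)
    return out

def equalize_features(frames):
    """Equalize every component of a list of feature vectors independently."""
    columns = list(zip(*frames))
    equalized = [equalize_component(list(col)) for col in columns]
    return [list(frame) for frame in zip(*equalized)]
```

Because the mapping is monotonic within each component, the ordering of values is preserved while their distribution is pulled toward the reference, which is what lets the transformation absorb a non-linear distortion.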
Sequential Noise Estimation With Optimal Forgetting For Robust Speech Recognition
2001
Cited by 8 (2 self)
Mismatch is known to degrade the performance of speech recognition systems. In real life applications mismatch is usually nonstationary, and a general way to compensate for slowly time varying mismatch is by using sequential algorithms with forgetting. The choice of forgetting factor is usually performed empirically on some development data, and no optimality criterion is used. In this paper we introduce a framework for obtaining optimal forgetting factor. The proposed method is applied in conjunction with a sequential noise estimation algorithm, but can be extended to sequential bias or affine transformation estimation. Speech recognition experiments conducted first under a controlled scenario on the 5K Wall Street Journal task corrupted by different noise types, then under a real-life scenario on speech recorded in a noisy car environment validate the proposed method.
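As a rough illustration of what a sequential algorithm with forgetting looks like, the sketch below tracks an exponentially weighted mean with a fixed forgetting factor. It does not reproduce the paper's framework for choosing the factor optimally: the function name, the scalar observations, and the default value 0.95 are assumptions made for this example.

```python
def sequential_mean(observations, forgetting=0.95):
    """Track a slowly varying mean with exponential forgetting.

    With forgetting = 1.0 this is the ordinary running mean; smaller
    values discount older observations more strongly. Returns the
    estimate after each observation.
    """
    estimate = 0.0
    weight = 0.0  # effective number of observations remembered
    estimates = []
    for x in observations:
        weight = forgetting * weight + 1.0
        estimate += (x - estimate) / weight
        estimates.append(estimate)
    return estimates
```

A factor close to 1 gives smooth but slowly adapting estimates, while a small factor adapts quickly at the cost of higher variance, which is why the choice is usually tuned on development data.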
Cepstral compensation by polynomial approximation for environment-independent speech recognition
In International Conference on Spoken Language Processing, 1996
Cited by 8 (2 self)
Speech recognition systems perform poorly on speech degraded by even simple effects such as linear filtering and additive noise. One possible solution to this problem is to modify the probability density function (PDF) of clean speech to account for the effects of the degradation. However, even for the case of linear filtering and additive noise, it is extremely difficult to do this analytically. Previously attempted analytical solutions to the problem of noisy speech recognition have either used an overly-simplified mathematical description of the effects of noise on the statistics of speech, or they have relied on the availability of large environment-specific adaptation sets. Some of the previous methods required the use of adaptation data that consists of simultaneously-recorded or “stereo” recordings of clean and degraded speech. In this paper we introduce an approximation-based method to compute the effects of the environment on the parameters of the PDF of clean speech. In this work, we perform compensation by Vector Polynomial approximationS (VPS) for the effects of linear filtering and additive noise on the clean speech. We also estimate the parameters of the environment, namely the noise and the channel, by using piecewise-linear approximations of these effects. We evaluate the performance of this method (VPS) using the CMU SPHINX-II system and the 100-word alphanumeric CENSUS database. Performance is evaluated at several SNRs, with artificial white Gaussian noise added to the database. VPS provides improvements of up to 15 percent in relative recognition accuracy.
Transforming Binary Uncertainties for Robust Speech Recognition
Cited by 7 (5 self)
Recently, several algorithms have been proposed to enhance noisy speech by estimating a binary mask that can be used to select those time–frequency regions of a noisy speech signal that contain more speech energy than noise energy. This binary mask encodes the uncertainty associated with enhanced speech in the linear spectral domain. The use of the cepstral transformation smears the information from the noise dominant time–frequency regions across all the cepstral features. We propose a supervised approach using regression trees to learn the nonlinear transformation of the uncertainty from the linear spectral domain to the cepstral domain. This uncertainty is used by a decoder that exploits the variance associated with the enhanced cepstral features to improve robust speech recognition. Systematic evaluations on a subset of the Aurora4 task using the estimated uncertainty show substantial improvement over the baseline performance across various noise conditions. Index Terms: Binary time–frequency mask, computational auditory scene analysis (CASA), robust automatic speech recognition, spectrogram reconstruction, uncertainty decoding.
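The binary mask mentioned at the start of this abstract can be sketched as below. The sketch assumes the speech and noise energies are known separately for each time–frequency unit (an oracle setting used only for illustration), whereas the algorithms referred to in the abstract have to estimate the mask from the noisy signal. The function names are hypothetical, and the regression-tree uncertainty transformation is not shown.

```python
def binary_mask(speech_energy, noise_energy):
    """Mark time-frequency units where speech energy exceeds noise energy.

    Both arguments are lists of frames, each frame a list of per-band
    energies. Returns 1 for speech-dominant units and 0 otherwise.
    """
    return [
        [1 if s > n else 0 for s, n in zip(s_frame, n_frame)]
        for s_frame, n_frame in zip(speech_energy, noise_energy)
    ]

def apply_mask(noisy_energy, mask):
    """Keep the noisy energy in speech-dominant units and zero the rest."""
    return [
        [e * m for e, m in zip(e_frame, m_frame)]
        for e_frame, m_frame in zip(noisy_energy, mask)
    ]
```

The zeroed units are the noise-dominant regions whose information the cepstral transformation would otherwise spread across all cepstral features.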
Coupling particle filters with automatic speech recognition for speech feature enhancement
Proc. of Interspeech, 2006
Cited by 6 (6 self)
This paper addresses robust speech feature extraction in combination with statistical speech feature enhancement and couples the particle filter to the speech recognition hypotheses. To extract noise robust features the Fourier transformation is replaced by the warped and scaled minimum variance distortionless response spectral envelope. To enhance the features, particle filtering has been used. Further, we show that the robust extraction and statistical enhancement can be combined to good effect. One of the critical aspects in particle filter design is the particle weight calculation which is traditionally based on a general, time independent speech model approximated by a Gaussian mixture distribution. We replace this general, time independent speech model by time- and phoneme-specific models. The knowledge of the phonemes to be used is obtained by the hypothesis of a speech recognition system, therefore establishing a coupling between the particle filter and the speech recognition system which have been treated as independent components in the past. Index Terms: particle filters, automatic speech recognition, speech feature enhancement, phoneme-specific
Data-driven environmental compensation for speech recognition: A unified approach
Speech Communication, 1998
Cited by 6 (1 self)
Environmental robustness for automatic speech recognition systems based on parameter modification can be accomplished in two complementary ways. One approach is to modify the incoming features of environmentally-degraded speech to more closely resemble the features of the (normally undegraded) speech used to train the classifier. The other approach is to modify the internal statistical representations of speech features used by the classifier to more closely resemble the features representing degraded speech in a particular target environment. This paper attempts to unify these two approaches to robust speech recognition by presenting several techniques that share the same basic assumptions and internal structure while differing in whether they modify the features of incoming speech or whether they modify the statistics of the classifier itself. We present the multivaRiate gAussian-based cepsTral normaliZation (RATZ) family of algorithms which modify incoming cepstral features, along with the STAR (STAtistical Reestimation) family of algorithms, which modify the internal statistics of the classifier. Both types of algorithms are data driven, in that they make use of a certain amount of adaptation data for learning compensation parameters. The algorithms were evaluated using the SPHINX-II speech recognition system on subsets of the Wall Street Journal database. While all algorithms demonstrated improved recognition accuracy compared to previous algorithms, the STAR family of algorithms tended to provide lower error rates than the RATZ family of algorithms as the SNR was decreased. © 1998 Elsevier Science

