On the decorrelation of filter-bank energies in speech recognition
Proc. Eurospeech, 1995
Cited by 26 (6 self)
Cepstral coefficients are widely used in speech recognition. In this paper, we claim that they are not the best way of representing the spectral envelope, at least for some usual speech recognition systems. In fact, cepstrum has several disadvantages: poor physical meaning, need of transformation, and low capacity of adaptation to some recognition systems. In this paper, we propose a new representation that significantly outperforms both mel-cepstrum and LPC-cepstrum techniques in both recognition rate and computational cost. It consists of filtering the frequency sequence of filter-bank energies with an extremely simple filter that equalizes the variance of the cepstral coefficients. Excellent results of the new technique using a continuous observation density HMM recognition system and two very different recognition tasks, connected digits and phone recognition, are presented.
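The idea of filtering the frequency sequence of filter-bank energies can be sketched roughly as follows. The abstract does not state which filter is used, so the taps (1, 0, -1), the function name `frequency_filter`, and the zero-padding at the band edges are all assumptions made purely for illustration.

```python
import numpy as np

def frequency_filter(log_energies, taps=(1.0, 0.0, -1.0)):
    """Filter each frame's log filter-bank energies along the band axis.

    log_energies: array of shape (frames, bands).
    taps: FIR coefficients; (1, 0, -1) is an assumed example,
          not a value taken from the abstract.
    """
    x = np.asarray(log_energies, dtype=float)
    # mode="same" zero-pads at the edges so each band keeps one output value
    return np.array([np.convolve(frame, taps, mode="same") for frame in x])

# One toy frame with four bands (made-up values)
frames = np.array([[1.0, 2.0, 4.0, 7.0]])
filtered = frequency_filter(frames)
```

With these assumed taps, each output value is the difference between the neighbouring higher and lower bands, so the filtered sequence behaves like a spectral slope along frequency.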
A Model of Dynamic Auditory Perception and Its Application to Robust Word Recognition
1997
Cited by 22 (7 self)
This paper describes two mechanisms that augment the common automatic speech recognition (ASR) front end and provide adaptation and isolation of local spectral peaks. A dynamic model consisting of a linear filterbank with a novel additive logarithmic adaptation stage after each filter output is proposed. An extensive series of perceptual forward masking experiments, together with previously reported forward masking data, determine the model's dynamic parameters. Once parameterized, the simple exponential dynamic mechanism predicts the nature of forward masking data from several studies across wide ranging frequencies, input levels, and probe delay times. An initial evaluation of the dynamic model together with a local peak isolation mechanism as a front end for dynamic time warp (DTW) and hidden Markov model (HMM) word recognition systems shows an improvement in robustness to background noise when compared to Mel-frequency cepstral coefficients (MFCC), linear prediction cepstral coefficients (LPCC), and relative spectra (RASTA) based front ends.
Evaluation of objective quality measures for speech enhancement
IEEE Trans. Audio Speech Language Processing, 2008
Cited by 19 (8 self)
In this paper, we evaluate the performance of several objective measures in terms of predicting the quality of noisy speech enhanced by noise suppression algorithms. The objective measures considered a wide range of distortions introduced by four types of real-world noise at two signal-to-noise ratio levels by four classes of speech enhancement algorithms: spectral subtractive, subspace, statistical-model based, and Wiener algorithms. The subjective quality ratings were obtained using the ITU-T P.835 methodology designed to evaluate the quality of enhanced speech along three dimensions: signal distortion, noise distortion, and overall quality. This paper reports on the evaluation of correlations of several objective measures with these three subjective rating scales. Several new composite objective measures are also proposed by combining the individual objective measures using nonparametric and parametric regression analysis techniques. Index Terms: objective measures, speech enhancement, speech quality assessment, subjective listening tests.
Robust Text-Independent Speaker Identification over Telephone Channels
IEEE Trans. on Speech and Audio Processing, 1997
Cited by 10 (1 self)
This paper addresses the issue of closed-set text-independent speaker identification from samples of speech recorded over the telephone. It focuses on the effects of acoustic mismatches between training and testing data, and concentrates on two approaches: extracting features that are robust against channel variations, and transforming the speaker models to compensate for channel effects. First, an experimental study shows that optimizing the front end processing of the speech signal can significantly improve speaker recognition performance. A new filterbank design is introduced to improve the robustness of the speech spectrum computation in the front-end unit. Next, a new feature based on spectral slopes is described. Its ability to discriminate between speakers is shown to be superior to that of the traditional cepstrum. This feature can be used alone or combined with the cepstrum. The second part of the paper presents two model transformation methods that further reduce channel effe...
Robust Feature-Estimation and Objective Quality Assessment for Noisy Speech Recognition using the Credit Card Corpus
1994
Cited by 7 (2 self)
It is well known that the introduction of acoustic background distortion into speech causes recognition algorithms to fail. In order to improve the environmental robustness of speech recognition in adverse conditions, a novel constrained-iterative feature-estimation algorithm, which was previously formulated for speech enhancement, is considered and shown to produce improved feature characterization in a variety of actual noise conditions such as computer fan, large crowd, and voice communications channel noise. In addition, an objective measure based MAP estimator is formulated as a means of predicting changes in robust recognition performance at the speech feature extraction stage. The four measures considered include (i) NIST SNR, (ii) Itakura-Saito log-likelihood, (iii) log-area-ratio, and (iv) the weighted-spectral slope measure. A continuous distribution, monophone based, hidden Markov model recognition algorithm is used for objective measure based MAP estimator analysis and reco...
Evaluation of objective measures for speech enhancement
Proc. of INTERSPEECH, 2006
Cited by 6 (2 self)
In this paper, we evaluate the performance of several objective measures in terms of predicting the quality of noisy speech enhanced by noise suppression algorithms. The objective measures considered a wide range of distortions introduced by four types of real-world noise at two SNRs by four classes of speech enhancement algorithms: spectral subtractive, subspace, statistical-model based and Wiener algorithms. The subjective quality ratings were obtained using the ITU-T P.835 methodology designed to evaluate the speech quality along three dimensions: signal distortion, noise distortion and overall quality. This paper reports the correlations of five common objective measures with these three subjective measures. Improvements to the PESQ measure are reported along with new composite objective measures. Index Terms: speech enhancement, noise reduction, ITU-T P.835, objective measures, subjective listening test, correlation analysis.
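Correlating an objective measure with subjective ratings, as described in the abstract above, might look like the minimal sketch below. It assumes a plain Pearson correlation. The function name `pearson` and the numbers are invented for illustration and are not data from the paper.

```python
import numpy as np

def pearson(objective, subjective):
    """Pearson correlation between objective scores and subjective ratings."""
    o = np.asarray(objective, dtype=float)
    s = np.asarray(subjective, dtype=float)
    o_c = o - o.mean()  # centre both series
    s_c = s - s.mean()
    return float((o_c @ s_c) / np.sqrt((o_c @ o_c) * (s_c @ s_c)))

# Toy values, made up purely for illustration (not results from the paper)
objective_scores = [1.0, 2.0, 3.0, 4.0]
subjective_ratings = [2.0, 4.0, 6.0, 8.0]
r = pearson(objective_scores, subjective_ratings)
```

A value of `r` near 1 would indicate that the objective measure tracks the listeners' ratings closely. In practice one such coefficient would be computed per objective measure and per rating scale.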
Can Objective Measures Predict the Intelligibility of Modified HMM-based Synthetic Speech in Noise?
Cited by 3 (3 self)
Synthetic speech can be modified to improve intelligibility in noise. In order to perform modifications automatically, it would be useful to have an objective measure that could predict the intelligibility of modified synthetic speech for human listeners. We analysed the impact on intelligibility – and on how well objective measures predict it – when we separately modify speaking rate, fundamental frequency, line spectral pairs and spectral peaks. Shifting LSPs can increase intelligibility for human listeners; other modifications had weaker effects. Among the objective measures we evaluated, the Dau model and the Glimpse proportion were the best predictors of human performance. Index Terms: objective measures for speech intelligibility, HMM-based speech synthesis, Lombard speech
ICARUS: Source Generator Based Real-time Recognition of Speech in Noisy Stressful and Lombard Effect Environments
1995
Cited by 2 (1 self)
The problem of real-time automatic speech recognition in an adverse environment is addressed in this paper. Though much research has been performed in the area of speech recognition, only limited success has been demonstrated for real-time recognition in noisy, stressful environments. The primary reason for this is that the performance of present-day recognition algorithms is predicated on the assumptions of the environmental settings in which the algorithms have been formulated and implemented. In this paper, we discuss the effects of additive background noise on speech quality and recognition parameters, and propose a source generator based framework to address stress and noise. Using this framework, a computationally efficient real-time recognition system called ICARUS is developed. The speech recognition system incorporates direct processing steps to address the effects of additive noise on the speech signal and stress on the speech production system. Central issues which are addre...
Speech Driven Facial Animation
2001
Cited by 1 (0 self)
The results reported in this article are an integral part of a larger project aimed at achieving perceptually realistic animations, including the individualized nuances, of three-dimensional human faces driven by speech. The audiovisual system that has been developed for learning the spatio-temporal relationship between speech acoustics and facial animation is described, including video and speech processing, pattern analysis, and MPEG-4 compliant facial animation for a given speaker. In particular, we propose a perceptual transformation of the speech spectral envelope, which is shown to capture the dynamics of articulatory movements. An efficient nearest-neighbor algorithm is used to predict novel articulatory trajectories from the speech dynamics. The results are very promising and suggest a new way to approach the modeling of synthetic lip motion of a given speaker driven by his/her speech. This would also provide clues toward a more general cross-speaker realistic animation.
A Gaussian Mixture Model Spectral Representation for Speech Recognition
Cited by 1 (0 self)
Most modern speech recognition systems use either Mel-frequency cepstral coefficients or perceptual linear prediction as acoustic features. Recently, there has been some interest in alternative speech parameterisations based on using formant features. Formants are the resonant frequencies in the vocal tract which form the characteristic shape of the speech spectrum. However, formants are difficult to reliably and robustly estimate from the speech signal and in some cases may not be clearly present. Rather than estimating the resonant frequencies, formant-like features can be used instead. Formant-like features use the characteristics of the spectral peaks to represent the spectrum. In this work, novel features are developed based on estimating a Gaussian mixture model (GMM) from the speech spectrum. This approach has previously been used successfully as a speech codec. The EM algorithm is used to estimate the parameters of the GMM. The extracted parameters (the means, standard deviations and component weights) can be related to the formant locations, bandwidths and magnitudes. As the features directly represent the linear spectrum, it is possible to apply techniques for vocal tract length normalisation and additive noise
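Estimating a GMM from a spectrum with EM, as this abstract describes, can be sketched as below. This is a hedged illustration and not the authors' implementation: it assumes the normalised magnitude spectrum is treated as a probability mass over frequency bins, and the function name `fit_spectrum_gmm`, the quantile-based initialisation, the iteration count and the synthetic two-peak spectrum are all hypothetical.

```python
import numpy as np

def fit_spectrum_gmm(freqs, magnitudes, n_components=2, n_iter=50):
    """Fit a 1-D GMM to a spectrum with EM, weighting each bin by its magnitude."""
    f = np.asarray(freqs, dtype=float)
    w = np.asarray(magnitudes, dtype=float)
    w = w / w.sum()  # assumed: spectrum acts as a probability mass over bins
    # Assumed initialisation: means at evenly spaced quantiles of the spectral mass
    cdf = np.cumsum(w)
    targets = (np.arange(n_components) + 0.5) / n_components
    means = f[np.searchsorted(cdf, targets).clip(0, len(f) - 1)]
    stds = np.full(n_components, (f.max() - f.min()) / (2 * n_components))
    weights = np.full(n_components, 1.0 / n_components)
    for _ in range(n_iter):
        # E-step: responsibility of each component for each frequency bin
        z = (f[:, None] - means) / stds
        dens = weights * np.exp(-0.5 * z ** 2) / (stds * np.sqrt(2 * np.pi))
        resp = dens / np.maximum(dens.sum(axis=1, keepdims=True), 1e-300)
        # M-step: update parameters, weighting bins by their spectral mass
        mass = w[:, None] * resp
        nk = np.maximum(mass.sum(axis=0), 1e-12)
        weights = nk
        means = (mass * f[:, None]).sum(axis=0) / nk
        var = (mass * (f[:, None] - means) ** 2).sum(axis=0) / nk
        stds = np.sqrt(np.maximum(var, 1e-6))
    return weights, means, stds

# Synthetic spectrum with two peaks (made-up stand-in for formant-like peaks)
freqs = np.linspace(0.0, 2000.0, 401)
spectrum = (np.exp(-0.5 * ((freqs - 500.0) / 100.0) ** 2)
            + np.exp(-0.5 * ((freqs - 1500.0) / 100.0) ** 2))
weights, means, stds = fit_spectrum_gmm(freqs, spectrum)
```

In this sketch the fitted means, standard deviations and weights play the roles the abstract assigns to formant locations, bandwidths and magnitudes.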

