AM-Demodulation of Speech Spectra and Its Application to Noise Robust Speech Recognition
Cited by 6 (0 self)
In this paper, a novel algorithm that resembles amplitude demodulation in the frequency domain is introduced, and its application to automatic speech recognition (ASR) is studied. Speech production can be regarded as a result of amplitude modulation (AM) with the source (excitation) spectrum being the carrier and the vocal tract transfer function (VTTF) being the modulating signal. From this point of view, the VTTF can be recovered by amplitude demodulation. Amplitude demodulation of the speech spectrum is achieved by a novel nonlinear technique, which effectively performs envelope detection by using amplitudes of the harmonics and discarding inter-harmonic valleys. The technique is noise robust since frequency bands of low energy are discarded. The same principle is used to reshape the detected envelope. The algorithm is then used to construct an ASR feature extraction module. It is shown that this technique achieves superior performance to MFCCs in the presence of additive noise. Rec...
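The envelope-detection idea described above (keep harmonic amplitudes, discard inter-harmonic valleys) can be illustrated with a minimal sketch. This is not the paper's algorithm: it assumes that local maxima of a magnitude spectrum stand in for harmonic amplitudes and that linear interpolation between them is an acceptable envelope. The function name `harmonic_envelope` and the toy spectrum are hypothetical.

```python
def harmonic_envelope(magnitudes):
    """Estimate a spectral envelope by linearly interpolating between local peaks."""
    n = len(magnitudes)
    # Assumption: local maxima approximate harmonic amplitudes; valleys are ignored.
    peaks = [i for i in range(1, n - 1)
             if magnitudes[i] >= magnitudes[i - 1] and magnitudes[i] > magnitudes[i + 1]]
    if not peaks:
        return list(magnitudes)
    envelope = [0.0] * n
    # Hold the first and last peak values out to the spectrum edges.
    for i in range(peaks[0]):
        envelope[i] = magnitudes[peaks[0]]
    for i in range(peaks[-1], n):
        envelope[i] = magnitudes[peaks[-1]]
    # Linear interpolation between consecutive peaks.
    for left, right in zip(peaks, peaks[1:]):
        for i in range(left, right):
            frac = (i - left) / (right - left)
            envelope[i] = (1 - frac) * magnitudes[left] + frac * magnitudes[right]
    return envelope


# Toy spectrum (hypothetical): harmonics at bins 2, 6, 10 with decaying amplitude.
spectrum = [0.0, 0.1, 1.0, 0.1, 0.0, 0.1, 0.8, 0.1, 0.0, 0.1, 0.6, 0.1]
envelope = harmonic_envelope(spectrum)
```

Because only the peak bins contribute, low-energy bins between harmonics, where additive noise tends to dominate, have no influence on the result in this sketch.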
Speech-Video Synchronization Using Lips Movements and Speech Envelope Correlation
In this paper, we propose a novel correlation-based method for speech-video synchronization (synch) and relationship classification. The method uses the envelope of the speech signal and data extracted from lip movement. First, a nonlinear time-varying model is considered to represent the speech signal as a sum of amplitude- and frequency-modulated (AM-FM) signals. Each AM-FM signal in this sum models a single speech formant frequency. Using a Taylor series expansion, the model is formulated in a way that characterizes the relation between the speech amplitude and the instantaneous frequency of each AM-FM signal with respect to lip movements. Second, the envelope of the speech signal is estimated and then correlated with signals generated from lip movement. From the resulting correlation, the relation between the two signals is classified and the delay between them is estimated. The proposed method is applied to real cases, and the results show that it is able to (i) classify whether the speech and video signals belong to the same source and (ii) estimate delays between audio and video signals as small as 0.1 second when speech signals are noisy and 0.04 second when the additive noise is less significant.
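The delay-estimation step can be sketched as a search for the lag that maximizes the cross-correlation between two sequences. This is a generic illustration under stated assumptions, not the authors' method: both signals are assumed to be sampled at the same rate, and `estimate_delay`, `envelope`, and `lip_signal` are hypothetical names with made-up data.

```python
def estimate_delay(reference, delayed, max_lag):
    """Return the lag (in samples) at which `delayed` best matches `reference`."""
    def mean_removed(x):
        m = sum(x) / len(x)
        return [v - m for v in x]

    a = mean_removed(reference)
    b = mean_removed(delayed)
    best_lag, best_score = 0, float("-inf")
    # Evaluate the (unnormalized) cross-correlation at every candidate lag.
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i, v in enumerate(a):
            j = i + lag
            if 0 <= j < len(b):
                score += v * b[j]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


# Hypothetical data: the lip-derived signal is the envelope shifted 3 samples later.
envelope = [0, 0, 1, 3, 1, 0, 0, 2, 0, 0, 0, 0]
lip_signal = [0, 0, 0, 0, 0, 1, 3, 1, 0, 0, 2, 0]
lag = estimate_delay(envelope, lip_signal, max_lag=5)
```

Dividing `lag` by the common sampling rate converts it to seconds. A low peak correlation across all lags could also serve as a cue that the two signals do not belong to the same source, though the threshold for that decision is not specified here.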
A Novel Analytical Approach for Lip Synchronization
We present a novel approach for lip synchronization that analyzes the relationship between a person's speech signal and data extracted from his or her lip movements. To model the speech, we use a nonlinear time-varying sum of AM-FM signals, each of which models a single formant frequency. The model is then realized using Taylor series expansions, yielding a closed-form formula that relates the speech amplitudes and instantaneous frequencies to the varying width and height of the lips. Based on this formula, lip movement data are used to generate a semi-speech signal, which is then correlated with the original speech over a span of delays. From the resulting correlation, the delay between the two signals is estimated, and hence lip sync is achieved. The approach is applied to practical speech examples, and the obtained results support the correctness and consistency of the proposed approach. The developed method can estimate delays of around 0.1 second at low SNRs and 0.04 second at high SNRs.
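As an illustration of the signal model mentioned above, a sum of AM-FM components can be written as s(t) = Σ_k a_k(t) cos(φ_k(t)), where φ_k is the running integral of 2π f_k(t). The sketch below synthesizes such a sum numerically. The amplitude and frequency functions are arbitrary placeholders, not values from the paper, and `am_fm_sum` is a hypothetical helper.

```python
import math


def am_fm_sum(components, duration, sample_rate):
    """Synthesize sum_k a_k(t) * cos(phase_k(t)), with phase_k the integral of 2*pi*f_k(t)."""
    n = round(duration * sample_rate)
    dt = 1.0 / sample_rate
    signal = [0.0] * n
    for amplitude, frequency in components:
        phase = 0.0
        for i in range(n):
            t = i * dt
            signal[i] += amplitude(t) * math.cos(phase)
            # Accumulate phase from the instantaneous frequency (rectangle rule).
            phase += 2.0 * math.pi * frequency(t) * dt
    return signal


# Two placeholder "formant" components: (amplitude function, frequency function in Hz).
components = [
    (lambda t: 1.0 + 0.5 * math.sin(2 * math.pi * 4 * t), lambda t: 500.0 + 50.0 * t),
    (lambda t: 0.6, lambda t: 1500.0),
]
signal = am_fm_sum(components, duration=0.01, sample_rate=8000)
```

In the approach summarized above, the amplitude and frequency terms would be tied to lip width and height through the closed-form formula; that mapping is not given in the abstract and is not reproduced here.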
Wavelet Based Noise Robust Features for Speaker Recognition
Extracting and selecting the best parametric representation of the acoustic signal is the most important task in designing any speaker recognition system. A wide range of possibilities exists for parametrically representing the speech signal, such as Linear Prediction Coding (LPC), Mel-frequency cepstral coefficients (MFCC), and others. MFCCs are currently the most popular choice for speaker recognition systems. One of their shortcomings, however, is that the signal is assumed to be stationary within a given time frame, so they cannot capture non-stationary behavior and are therefore not well suited to noisy speech signals. To overcome this problem, several researchers have used different types of AM-FM modulation/demodulation techniques to extract features from the speech signal. Other approaches use wavelet filterbanks to extract the features. This paper proposes a feature extraction technique that combines the two: features are extracted from the envelope of the signal and then passed through a wavelet filterbank. The proposed method is found to outperform the existing feature extraction techniques.
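The envelope-then-wavelet pipeline can be illustrated with a rough sketch. Everything specific here is an assumption rather than the paper's design: the envelope is a moving average of the rectified signal, the filterbank is a plain Haar decomposition, and the features are log subband energies. `envelope` and `haar_subband_energies` are hypothetical helpers.

```python
import math


def envelope(signal, window):
    """Crude amplitude envelope: moving average of the rectified signal."""
    rectified = [abs(v) for v in signal]
    out = []
    for i in range(len(rectified)):
        start = max(0, i - window + 1)
        chunk = rectified[start:i + 1]
        out.append(sum(chunk) / len(chunk))
    return out


def haar_subband_energies(x, levels):
    """Energies of Haar detail bands (finest first), then the final approximation energy."""
    energies = []
    approx = list(x)
    for _ in range(levels):
        pairs = [(approx[i], approx[i + 1]) for i in range(0, len(approx) - 1, 2)]
        detail = [(a - b) / math.sqrt(2) for a, b in pairs]
        approx = [(a + b) / math.sqrt(2) for a, b in pairs]
        energies.append(sum(d * d for d in detail))
    energies.append(sum(a * a for a in approx))
    return energies


# Placeholder test signal: a 50 Hz tone with slow amplitude modulation, sampled at 800 Hz.
signal = [math.sin(2 * math.pi * 50 * n / 800) * (1 + 0.5 * math.sin(2 * math.pi * 5 * n / 800))
          for n in range(64)]
env = envelope(signal, window=8)
features = [math.log(e + 1e-12) for e in haar_subband_energies(env, levels=3)]
```

A real system would likely use a longer wavelet and a frame-based analysis; the Haar transform is used here only because it is short enough to write out in full.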

