Acoustic Feature Combination for Robust Speech Recognition
- Proc. IEEE Intern. Conf. on Acoustics, Speech, and Signal Processing, 2005
- Cited by 6 (2 self)
In this paper, we consider the use of multiple acoustic features of the speech signal for robust speech recognition. We investigate the combination of various auditory-based features (Mel Frequency Cepstrum Coefficients, Perceptual Linear Prediction, etc.) and articulatory-based features (voicedness). Features are combined using two techniques: one based on Linear Discriminant Analysis and one based on log-linear model combination. We describe the two feature combination techniques and compare the experimental results. Experiments performed on the large-vocabulary task VerbMobil II (German conversational speech) show that the accuracy of automatic speech recognition systems can be improved by the combination of different acoustic features.
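As a rough illustration of the log-linear combination idea mentioned in this abstract, the sketch below weights per-stream log-likelihoods and renormalises them over classes. It is a generic formulation, not the paper's exact model: the function name `log_linear_combine`, the two-class toy scores, and the weights are all hypothetical.

```python
import math

def log_linear_combine(stream_log_likelihoods, weights):
    """Combine per-stream class log-likelihoods with exponent weights.

    stream_log_likelihoods: one dict per feature stream, mapping
        class label -> log p_i(x_i | class).
    weights: one exponent per stream (hypothetical values, normally tuned).
    Returns normalised posteriors assuming a uniform class prior.
    """
    classes = list(stream_log_likelihoods[0].keys())
    # Weighted sum of log scores = log of a product of powered likelihoods.
    scores = {
        c: sum(w * s[c] for w, s in zip(weights, stream_log_likelihoods))
        for c in classes
    }
    # Log-sum-exp normalisation for numerical stability.
    m = max(scores.values())
    log_z = m + math.log(sum(math.exp(v - m) for v in scores.values()))
    return {c: math.exp(v - log_z) for c, v in scores.items()}

# Toy example: an MFCC-like stream and a voicing-like stream (made-up numbers).
mfcc = {"a": math.log(0.6), "b": math.log(0.4)}
voicing = {"a": math.log(0.2), "b": math.log(0.8)}
posteriors = log_linear_combine([mfcc, voicing], [1.0, 0.5])
```

Setting a stream's weight to zero removes its influence entirely, which is one simple way to check how much each feature stream contributes.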
Phoneme confusions in human and automatic speech recognition
- in Proc. Interspeech, 2007
- Cited by 2 (2 self)
A comparison between automatic speech recognition (ASR) and human speech recognition (HSR) is performed as a prerequisite for identifying sources of errors and improving feature extraction in ASR. HSR and ASR experiments are carried out with the same logatome database, which consists of nonsense syllables. Two different kinds of signals are presented to human listeners: First, noisy speech samples are converted to Mel-frequency cepstral coefficients, which are resynthesized to speech, with information about voicing and fundamental frequency being discarded. Second, the original signals with added noise are presented in order to evaluate the loss of information caused by the process of resynthesis. The analysis also covers the degradation of ASR caused by dialect or accent and shows that different error patterns emerge for ASR and HSR. The information loss induced by the calculation of ASR features has the same effect as a deterioration of the SNR by 10 dB. Index Terms: human speech recognition, automatic speech recognition, dialect, accent, phoneme confusions, MFCC
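To make the "10 dB" figure concrete, the following sketch scales a noise signal so that the mixture reaches a requested SNR. It only illustrates the standard power-ratio definition of SNR; the function name `mix_at_snr` and the toy signals are hypothetical and not taken from the paper.

```python
import math

def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech so that the speech-to-noise power ratio is snr_db.

    Returns the noisy signal and the gain applied to the noise.
    """
    p_speech = sum(x * x for x in speech) / len(speech)
    p_noise = sum(x * x for x in noise) / len(noise)
    # SNR_dB = 10 * log10(p_speech / (gain^2 * p_noise)), solved for gain.
    gain = math.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    noisy = [s + gain * n for s, n in zip(speech, noise)]
    return noisy, gain

# Toy signals (made up): a sine as "speech", an alternating sequence as "noise".
speech = [math.sin(2 * math.pi * 5 * t / 100) for t in range(100)]
noise = [1.0 if t % 2 == 0 else -1.0 for t in range(100)]
_, gain_0db = mix_at_snr(speech, noise, 0.0)
_, gain_minus10db = mix_at_snr(speech, noise, -10.0)
```

Lowering the SNR by 10 dB multiplies the noise power by 10, so the noise amplitude gain grows by a factor of the square root of 10.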
The Delta-Phase Spectrum with Application to Voice Activity Detection and Speaker Recognition
For several reasons, the Fourier phase domain is less favoured than the magnitude domain in signal processing and modelling of speech. To correctly analyse the phase, several factors must be considered and compensated, including the effect of the step size, windowing function and other processing parameters. Building on a review of these factors, this paper investigates a spectral representation based on the Instantaneous Frequency Deviation, but in which the step size between processing frames is used in calculating phase changes, rather than the traditional single sample interval. Reflecting these longer intervals, the term Delta-Phase Spectrum is used to distinguish this from instantaneous derivatives. Experiments show that mel-frequency cepstral coefficient features derived from the Delta-Phase Spectrum (termed Mel-Frequency Delta-Phase features) can produce broadly similar performance to equivalent magnitude domain features for both voice activity detection and speaker recognition tasks. Further, it is shown that the fusion of the magnitude and phase representations yields performance benefits over either in isolation. Index Terms: phase, instantaneous frequency, speech analysis, voice activity detection, speaker recognition.
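The idea of measuring phase change across the frame step can be sketched as follows. This is a minimal illustration under assumptions, not the paper's definition: it takes the per-bin phase difference between consecutive frames, subtracts the advance expected for the bin centre frequency over one hop, and wraps the result, as is common in phase-vocoder style analysis. The names `delta_phase`, `dft`, and `wrap`, the Hann window, and the frame parameters are all hypothetical choices.

```python
import cmath
import math

def dft(frame):
    """Naive DFT returning bins 0..N/2 (adequate for short illustrative frames)."""
    n = len(frame)
    return [
        sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame))
        for k in range(n // 2 + 1)
    ]

def wrap(phi):
    """Wrap an angle into the interval [-pi, pi)."""
    return (phi + math.pi) % (2 * math.pi) - math.pi

def delta_phase(signal, frame_len, hop):
    """Per-bin phase deviation between consecutive frames spaced hop samples apart."""
    # Periodic Hann window (an assumption; any analysis window could be used).
    window = [0.5 - 0.5 * math.cos(2 * math.pi * t / frame_len)
              for t in range(frame_len)]
    starts = range(0, len(signal) - frame_len + 1, hop)
    phases = [
        [cmath.phase(c)
         for c in dft([w * x for w, x in zip(window, signal[s:s + frame_len])])]
        for s in starts
    ]
    deviations = []
    for prev, cur in zip(phases, phases[1:]):
        row = []
        for k, (p0, p1) in enumerate(zip(prev, cur)):
            # Phase advance expected for bin k's centre frequency over one hop.
            expected = 2 * math.pi * k * hop / frame_len
            row.append(wrap(p1 - p0 - expected))
        deviations.append(row)
    return deviations

# A sinusoid centred exactly on bin 4 of a 32-point frame.
on_bin = [math.cos(2 * math.pi * 4 * t / 32) for t in range(64)]
dev_on_bin = delta_phase(on_bin, 32, 8)
```

For a sinusoid centred exactly on a bin, the deviation at that bin is close to zero; a sinusoid slightly above the bin centre produces a positive deviation proportional to the frequency offset and the hop size.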

