A New Perspective on Feature Extraction for Robust In-Vehicle Speech Recognition
- ISCA Proc.: Eurospeech, 2003
, 2003
Cited by 10 (0 self)
The problem of reliable speech recognition for in-vehicle applications has recently emerged as a challenging research domain. This study focuses on the feature extraction stage of this problem. The approach is based on Minimum Variance Distortionless Response (MVDR) spectrum estimation. MVDR is used to robustly estimate the envelope of the speech signal and is shown to be very accurate and less sensitive to additive noise. The proposed feature estimation process removes the traditional Mel-scaled filterbank as a perceptually motivated frequency partitioning. Instead, we directly warp the FFT power spectrum of speech. The word error rate (WER) is shown to decrease by 27.3% with respect to MFCCs and by 18.8% with respect to the recently proposed PMCCs on an extended digit recognition task in real car environments. The proposed feature estimation approach is called PMVDR and is conclusively shown to be a better speech representation in real environments, with emphasis on time-varying car noise.
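As a rough illustration of the MVDR spectrum estimation this abstract builds on, the sketch below computes the textbook Capon/MVDR estimate, P(f) = 1 / (v(f)^H R^{-1} v(f)), from the autocorrelation of one frame. It is not the paper's PMVDR front end (no frequency warping and no cepstral step); the function name `mvdr_spectrum`, the model order, the frequency grid, and the diagonal loading are all assumptions chosen for the example.

```python
import numpy as np

def mvdr_spectrum(frame, order=12, n_freqs=257, loading=1e-6):
    """Textbook Capon/MVDR power spectrum on a grid from 0 to 0.5 cycles/sample.

    Hypothetical helper for illustration only; parameters are assumptions.
    """
    x = np.asarray(frame, dtype=float)
    n = len(x)
    # Biased autocorrelation estimate for lags 0..order.
    r = np.array([np.dot(x[: n - k], x[k:]) / n for k in range(order + 1)])
    # Symmetric Toeplitz autocorrelation matrix with light diagonal loading
    # (the loading is an assumption, added for numerical stability).
    idx = np.arange(order + 1)
    lags = np.abs(np.subtract.outer(idx, idx))
    R = r[lags] + loading * r[0] * np.eye(order + 1)
    R_inv = np.linalg.inv(R)
    freqs = np.linspace(0.0, 0.5, n_freqs)
    # Steering vectors v(f) = [1, e^{j2*pi*f}, ..., e^{j2*pi*f*order}].
    V = np.exp(2j * np.pi * np.outer(freqs, idx))
    # Quadratic form v^H R^{-1} v for every frequency on the grid.
    denom = np.einsum("fi,ij,fj->f", V.conj(), R_inv, V).real
    return freqs, 1.0 / denom

# Usage on a synthetic frame: a sinusoid at 0.125 cycles/sample plus noise.
rng = np.random.default_rng(0)
t = np.arange(512)
frame = np.cos(2 * np.pi * 0.125 * t) + 0.1 * rng.standard_normal(512)
freqs, power = mvdr_spectrum(frame)
peak_freq = freqs[np.argmax(power)]
```

On this synthetic frame the estimate should peak close to the sinusoid's frequency; real speech frames would normally be windowed first.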
Synthesizing speech from speech recognition parameters
, 2004
Cited by 5 (0 self)
The merits of different signal preprocessing schemes for speech recognizers are usually assessed purely on the basis of the resulting recognition accuracy. Such benchmarks give a good indication as to whether one preprocessing is better than another, but little knowledge is acquired about why it is better or how it could be further improved. In order to gain more insight into the preprocessing, we seek to re-synthesize speech from speech recognition features. This way, we are able to pinpoint some deficiencies in our current preprocessing scheme. Additional analysis of successful new preprocessing schemes may allow us one day to identify precisely those properties that are desirable in a feature set. Next to these purely scientific aims, the re-synthesis of speech from recognition features is of interest to thin-client speech applications, and as an alternative to the classical LPC source-filter model for speech manipulation.
Robust Speech Recognition Using A Voiced-Unvoiced Feature
- IN PROC. INT. CONF. ON SPOKEN LANGUAGE PROCESSING
, 2002
Cited by 5 (2 self)
In this paper, a voiced-unvoiced measure is used as an acoustic feature for continuous speech recognition. The voiced-unvoiced measure was combined with the standard Mel Frequency Cepstral Coefficients (MFCC) using linear discriminant analysis (LDA) to choose the most relevant features. Experiments were performed on the SieTill corpus (German digit strings recorded over telephone line) and on the SPINE corpus (English spontaneous speech under different simulated noisy environments). The additional voiced-unvoiced measure results in improvements in word error rate (WER) of up to 11% relative to using MFCC alone with the same overall number of parameters in the system.
A harmonic-model-based front end for robust speech recognition
- in EUROSPEECH
, 2003
Cited by 4 (0 self)
Speech recognition accuracy degrades significantly when the speech has been corrupted by noise, especially when the system has been trained on clean speech. Many robust techniques have been developed which require reliable online noise estimates or a priori knowledge of the noise. In situations where such estimates or knowledge are difficult to obtain, these methods fail. We present a new robustness algorithm which avoids these problems by making no assumptions about the corrupting noise. Instead, we exploit properties inherent to the speech signal itself to denoise the recognition features. In this method, speech is decomposed into harmonic and noise-like components, which are then processed independently and recombined. By processing noise-corrupted speech in this manner, we are able to achieve significant improvements in recognition accuracy on the Aurora 2 task.
Perceptual MVDR-based cepstral coefficients (PMCCs) for robust speech recognition
- in ICASSP, (Hong Kong
, 2003
Cited by 4 (1 self)
This paper describes a robust feature extraction technique for continuous speech recognition. Central to the technique is the Minimum Variance Distortionless Response (MVDR) method of spectrum estimation. We incorporate perceptual information directly into the spectrum estimation. This provides improved robustness and computational efficiency when compared with the previously proposed MVDR-MFCC technique [10]. On an in-car speech recognition task, this method, which we refer to as PMCC, is 15% more accurate in WER and requires approximately 4 times less computation than the MVDR-MFCC technique. On the same task, PMCC yields a 20% relative improvement over MFCC and an 11% relative improvement over PLP front ends. Similar improvements are observed on the Aurora 2 database.
Accurate Spectral Envelope Estimation for Articulation-to-Speech Synthesis
- in ‘Proc. 5th ISCA Speech Synthesis Workshop’, CMU, Pittsburgh
Cited by 3 (1 self)
This paper introduces a novel articulatory-acoustic mapping in which detailed spectral envelopes are estimated based on the cepstrum, inclusive of the high-quefrency elements which are discarded in conventional speech synthesis to eliminate the pitch component of speech. For this estimation, the method deals with the harmonics of multiple voiced-speech spectra so that several sets of harmonics can be obtained at various pitch frequencies to form a spectral envelope. The experimental result shows that the method estimates spectral envelopes with the highest accuracy when the cepstral order is 48-64, which suggests that the higher-order coefficients are required to represent detailed envelopes reflecting the real vocal-tract responses.
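For context, the conventional approach this abstract contrasts itself with can be sketched as a truncated-cepstrum envelope: take the real cepstrum of a frame, keep only the coefficients up to a chosen order, and transform back to a smoothed log-magnitude spectrum. The sketch below shows only that baseline, not the paper's multi-spectrum harmonic method; the function name `cepstral_envelope` and the small log floor are assumptions.

```python
import numpy as np

def cepstral_envelope(frame, order):
    """Smoothed log-magnitude spectrum from a cepstrum truncated at `order`.

    Hypothetical helper for illustration; assumes an even frame length.
    """
    x = np.asarray(frame, dtype=float)
    n = len(x)
    # Log-magnitude spectrum (small floor avoids log(0); an assumption).
    log_mag = np.log(np.abs(np.fft.rfft(x)) + 1e-12)
    # Real cepstrum: inverse FFT of the log-magnitude spectrum.
    cep = np.fft.irfft(log_mag, n)
    # Keep quefrencies 0..order and their mirrored counterparts.
    lifter = np.zeros(n)
    lifter[: order + 1] = 1.0
    lifter[n - order:] = 1.0
    # Back to the frequency domain: the smoothed log-magnitude envelope.
    return np.fft.rfft(cep * lifter).real

# Usage: a low order gives a coarse envelope, a higher order keeps more detail.
rng = np.random.default_rng(0)
frame = rng.standard_normal(256)
coarse = cepstral_envelope(frame, 20)
detailed = cepstral_envelope(frame, 64)
```

Raising `order` retains more high-quefrency detail, at the cost of letting pitch harmonics leak into the envelope for voiced speech; that trade-off is the one the abstract addresses.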
Combining spectral representations for large vocabulary continuous speech recognition
- IEEE Transactions on Audio, Speech and Language Processing
, 2008
Cited by 3 (3 self)
In this paper we investigate the combination of complementary acoustic feature streams in large vocabulary continuous speech recognition (LVCSR). We have explored the use of acoustic features obtained using a pitch-synchronous analysis, STRAIGHT, in combination with conventional features such as mel frequency cepstral coefficients. Pitch-synchronous acoustic features are of particular interest when used with vocal tract length normalisation (VTLN), which is known to be affected by the fundamental frequency. We have combined these spectral representations directly at the acoustic feature level using heteroscedastic linear discriminant analysis (HLDA) and at the system level using ROVER. We evaluated this approach on three LVCSR tasks: dictated newspaper text (WSJCAM0), conversational telephone speech (CTS), and multiparty meeting transcription. The CTS and meeting transcription experiments were both evaluated using standard NIST test sets and evaluation protocols. Our results indicate that combining conventional and pitch-synchronous acoustic feature sets using HLDA results in a consistent, significant decrease in word error rate across all three tasks. Combining at the system level using ROVER resulted in a further significant decrease in word error rate.
Estimating detailed spectral envelopes using articulatory clustering
- in ‘Proc. ICSLP2004’, Jeju, Korea
, 2004
Cited by 2 (1 self)
This paper presents an articulatory-acoustic mapping where detailed spectral envelopes are estimated. During the estimation, the harmonics of a range of F0 values are derived from the spectra of multiple voiced speech signals vocalized with similar articulator settings. The envelope formed by these harmonics is represented by a cepstrum, which is computed by fitting the peaks of all the harmonics based on the weighted least squares method in the frequency domain. The experimental result shows that the spectral envelopes are estimated with the highest accuracy when the cepstral order is 48-64 for a female speaker, which suggests that representing the real response of the vocal tract requires high-quefrency elements that conventional speech synthesis methods are forced to discard in order to eliminate the pitch component of speech.
Source-filter separation for articulation-to-speech synthesis
- in Proc. ICSLP2004, Jeju, Korea
, 2004
Cited by 2 (2 self)
In this paper we examine a method for separating out the vocal-tract filter response from the voice source characteristic using a large articulatory database. The method realises such separation for voiced speech using an iterative approximation procedure under the assumption that the speech production process is a linear system composed of a voice source and a vocal-tract filter, and that each of the components is controlled independently by different sets of factors. Experimental results show that the spectral variation is evidently influenced by the fundamental frequency or the power of speech, and that the tendency of the variation may be related closely to speaker identity. The method enables independent control over the voice source characteristic in our articulation-to-speech synthesis.
Third-Order Moments of Filtered Speech Signals for Robust Speech Recognition
Cited by 1 (1 self)
Novel speech features calculated from third-order statistics of subband-filtered speech signals are introduced and studied for robust speech recognition. These features have the potential to capture nonlinear information not represented by cepstral coefficients. Also, because the features presented in this paper are based on the third-order moments, they may be more immune to Gaussian noise than cepstral coefficients, as Gaussian distributions have zero third-order moments. Preliminary experiments on the AURORA2 database studying these features in combination with Mel-frequency cepstral coefficients (MFCCs) are presented, and improvement over the MFCC-only baseline is shown with the combined feature set for several noise conditions.
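The property this abstract relies on, that zero-mean Gaussian noise has a vanishing third-order moment while skewed signals do not, can be checked numerically. The sketch below uses synthetic samples rather than subband-filtered speech, and the helper name `third_central_moment` is an assumption for illustration, not the paper's feature definition.

```python
import numpy as np

def third_central_moment(x):
    """Sample third central moment, E[(x - mean)^3]. Hypothetical helper."""
    x = np.asarray(x, dtype=float)
    return np.mean((x - x.mean()) ** 3)

# Synthetic comparison: Gaussian samples versus a skewed (exponential) source.
rng = np.random.default_rng(0)
gaussian = rng.standard_normal(200_000)
skewed = rng.exponential(1.0, 200_000)

m3_gaussian = third_central_moment(gaussian)  # close to zero in expectation
m3_skewed = third_central_moment(skewed)      # clearly non-zero
```

With finite data the Gaussian estimate is only approximately zero, so any immunity to Gaussian noise depends on how many samples enter each estimate.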

