Improved Methods For Vocal Tract Normalization
- In Proc. of the IEEE Int. Conf. on Acoustics, Speech and Signal Processing, 1999
Cited by 16 (6 self)
This paper presents improved methods for vocal tract normalization (VTN) along with experimental tests on three databases. We propose a new method for VTN in training: using acoustic models with single Gaussian densities per state to select the normalization scales prevents the models from learning the normalization scales of the training speakers. We show that using single Gaussian densities for selecting the normalization scales in training results in lower error rates than using mixture densities. For VTN in recognition, we propose an improvement of the well-known multiple-pass strategy: using an unnormalized acoustic model instead of a normalized model for the first recognition pass yields lower error rates. In recognition tests, this method is compared with a fast variant of VTN. The multiple-pass strategy is an efficient method, but it is suboptimal because the normalization scale and the word sequence are determined sequentially. We found that for telephon...
Automatic Question Generation For Decision Tree Based State Tying
- Proceedings of the IEEE Conference on Acoustics, Speech and Signal Processing, 1998
Cited by 13 (3 self)
Decision tree based state tying uses so-called phonetic questions to assign triphone states to reasonable acoustic models. These phonetic questions are in fact phonetic categories such as vowels, plosives or fricatives. The assumption behind this is that context phonemes which belong to the same phonetic class have a similar influence on the pronunciation of a phoneme. For a new phoneme set, which has to be used, e.g., when switching to a different corpus, a phonetic expert is needed to define proper phonetic questions. In this paper a new method is presented which automatically defines good phonetic questions for a phoneme set. This method uses the intermediate clusters from a phoneme clustering algorithm, which are afterwards reduced to an appropriate number. Recognition results on the Wall Street Journal data for within-word and across-word phoneme models show competitive performance of the automatically generated questions with our best handcrafted question set.
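The idea of turning intermediate clusters into questions can be illustrated with a minimal sketch. It assumes each phoneme is summarized by a single mean feature vector and that clusters are merged bottom-up by centroid distance. The abstract does not specify the clustering algorithm, the distance measure, or the reduction step, so `generate_questions`, the toy vectors, and the rule of dropping only the all-phoneme cluster are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def generate_questions(phoneme_means):
    """Bottom-up clustering of phonemes (assumed procedure, not the paper's).

    Every intermediate cluster is recorded as a candidate phonetic question,
    i.e. a set of phonemes. The final cluster containing all phonemes is
    dropped because it cannot split any data.
    """
    clusters = {frozenset([p]): list(m) for p, m in phoneme_means.items()}
    all_phonemes = frozenset(phoneme_means)
    questions = []
    while len(clusters) > 1:
        keys = sorted(clusters, key=sorted)  # deterministic iteration order
        best = None
        for i in range(len(keys)):
            for j in range(i + 1, len(keys)):
                d = squared_distance(clusters[keys[i]], clusters[keys[j]])
                if best is None or d < best[0]:
                    best = (d, keys[i], keys[j])
        _, a, b = best
        na, nb = len(a), len(b)
        # Size-weighted centroid of the merged cluster.
        centroid = [(na * x + nb * y) / (na + nb)
                    for x, y in zip(clusters[a], clusters[b])]
        merged = a | b
        del clusters[a], clusters[b]
        clusters[merged] = centroid
        if merged != all_phonemes:
            questions.append(merged)
    return questions


# Toy 2-D "acoustic" means: two vowel-like and two plosive-like phonemes.
toy_means = {
    "a": [1.0, 0.0],
    "e": [0.9, 0.1],
    "p": [0.0, 1.0],
    "t": [0.2, 0.8],
}
questions = generate_questions(toy_means)
for q in questions:
    print(sorted(q))
```

On the toy data the recovered sets group the vowel-like and the plosive-like phonemes, which is the kind of category a phonetic expert would otherwise define by hand.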
Speaker Adaptive Modeling by Vocal Tract Normalization
- IEEE Trans. on Speech and Audio Processing, 2002
Cited by 10 (1 self)
This paper presents methods for speaker adaptive modeling using vocal tract normalization (VTN) along with experimental tests on three databases. We propose a new training method for VTN: By using single-density acoustic models per HMM state for selecting the scale factor of the frequency axis, we avoid the problem that a mixture-density tends to learn the scale factors of the training speakers and thus cannot be used for selecting the scale factor. We show that using single Gaussian densities for selecting the scale factor in training results in lower error rates than using mixture densities.
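Selecting a scale factor for the frequency axis can be sketched as a grid search that keeps the factor with the highest likelihood under a single Gaussian. The sketch below is an illustration under assumptions, not the authors' implementation: the piecewise-linear warping function, the break point, the grid from 0.88 to 1.12, and the use of a few formant-like frequencies as "features" are all hypothetical choices. A real system would score full feature sequences against HMM state densities.

```python
import math


def warp_frequency(f, alpha, f_max=8000.0, break_ratio=0.875):
    """Piecewise-linear warping of frequency f (Hz) by scale factor alpha.

    Below the break point the axis is scaled by alpha. Above it, a second
    linear segment maps f_max onto itself so the bandwidth is preserved.
    """
    f_break = break_ratio * f_max / max(alpha, 1.0)
    if f <= f_break:
        return alpha * f
    slope = (f_max - alpha * f_break) / (f_max - f_break)
    return alpha * f_break + slope * (f - f_break)


def gaussian_loglik(x, mean, var):
    """Log-likelihood of x under a single diagonal-covariance Gaussian."""
    return sum(-0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))


def select_warp_factor(freqs, mean, var, alphas):
    """Return the scale factor whose warped features score best."""
    def score(alpha):
        warped = [warp_frequency(f, alpha) for f in freqs]
        return gaussian_loglik(warped, mean, var)
    return max(alphas, key=score)


# Toy data: a speaker whose resonances lie a factor of 1.10 below the reference.
reference_mean = [500.0, 1500.0, 2500.0]
reference_var = [100.0 ** 2] * 3
speaker_freqs = [m / 1.10 for m in reference_mean]
alphas = [round(0.88 + 0.02 * i, 2) for i in range(13)]  # 0.88 ... 1.12
best_alpha = select_warp_factor(speaker_freqs, reference_mean,
                                reference_var, alphas)
print(best_alpha)
```

The single Gaussian here plays the role the abstract describes: it stays broad enough that it does not absorb the speakers' individual scale factors, so the likelihood remains informative about the warp.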
Pronunciation Modelling In The RWTH Large Vocabulary Speech Recognizer
- 1998
Cited by 8 (0 self)
In this paper we describe the application of pronunciation variants for our large vocabulary continuous speech recognizer. We will explain how the pronunciation variants were used in training and recognition and give some recognition results on three different corpora. The recognition tests were performed on the Wall Street Journal (WSJ) November 92 development and evaluation corpora (5 000 words), the North American Business (NAB) H1 development corpus (20 000 words) and on the Verbmobil 1996 evaluation corpus (5 000 words). For the WSJ and NAB corpora, a slight improvement in recognition accuracy can be observed, while for the Verbmobil corpus the error rate remains unchanged.
Using Phase Spectrum Information For Improved Speech Recognition Performance
- 2001
Cited by 6 (1 self)
In this work, new acoustic features for continuous speech recognition based on the short-term Fourier phase spectrum are introduced for mono (telephone) recordings. The new phase-based features were combined with standard Mel Frequency Cepstral Coefficients (MFCC), and results were produced with and without using additional linear discriminant analysis (LDA) to choose the most relevant features. Experiments were performed on the SieTill corpus of German digit strings recorded over telephone lines. Using LDA to combine purely phase-based features with MFCCs, we obtained improvements in word error rate of up to 25% relative to using MFCCs alone with the same overall number of parameters in the system.
On Feature Extraction By Mutual Information Maximization
- In Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, 2002
Cited by 5 (2 self)
In order to learn discriminative feature transforms, we discuss mutual information between class labels and transformed features as a criterion. Instead of Shannon's definition we use measures based on Rényi entropy, which lend themselves to an efficient implementation and an interpretation of "information potentials" and "information forces" induced by samples of data. This paper presents two routes towards practical usability of the method, especially aimed at large databases: The first is an on-line stochastic gradient algorithm, and the second is based on approximating class densities in the output space by Gaussian mixture models.
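The "information potential" can be made concrete with a small sketch. It assumes the commonly used estimator in which the density is a Parzen window estimate with isotropic Gaussian kernels of width `sigma`. The integral of the squared density then reduces to a double sum over sample pairs with kernel variance `2 * sigma**2`, and Rényi's quadratic entropy is the negative logarithm of that sum. The function names and toy data are hypothetical, and the abstract does not confirm that the paper uses exactly this form.

```python
import math


def information_potential(samples, sigma):
    """Parzen estimate of the integral of p(x)^2 (assumed estimator).

    V = 1/N^2 * sum_i sum_j G(x_i - x_j; 2 * sigma^2 * I)
    where G is an isotropic Gaussian density evaluated at the difference.
    """
    n = len(samples)
    dim = len(samples[0])
    norm = (4.0 * math.pi * sigma ** 2) ** (-dim / 2.0)
    total = 0.0
    for x in samples:
        for y in samples:
            sq = sum((a - b) ** 2 for a, b in zip(x, y))
            total += norm * math.exp(-sq / (4.0 * sigma ** 2))
    return total / (n * n)


def renyi_quadratic_entropy(samples, sigma):
    """Renyi's quadratic entropy, H2 = -log(information potential)."""
    return -math.log(information_potential(samples, sigma))


# Tightly packed samples interact strongly: high potential, low entropy.
compact = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1]]
spread = [[0.0, 0.0], [3.0, 0.0], [0.0, 3.0]]
print(renyi_quadratic_entropy(compact, sigma=0.5))
print(renyi_quadratic_entropy(spread, sigma=0.5))
```

Each pairwise term behaves like an interaction between two "particles", which is where the potential and force vocabulary comes from. The double loop costs O(N^2), which suggests why the abstract is concerned with usability on large databases.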
Noise Level Normalization And Reference Adaptation For Robust Speech Recognition
- In ASR2000, International Workshop on Automatic Speech Recognition, 2000
Cited by 4 (3 self)
This paper describes an approach to normalize the noise level of a speech signal at the outputs of the Mel scaled filter-bank used in MFCC feature extraction. An adaptive normalizing function that distinguishes between speech and silence parts of the signal is used to normalize the noise level, without altering the speech parts of the signal. This technique is combined with an adaptation of the reference vectors, depending on the average norm of the incoming feature vectors. On a database with training data recorded in an office environment and testing data recorded in driving cars, the word error rate could be reduced from 35.5% to 14.7% for the city traffic testing set and from 78.0% to 24.1% for the highway testing set.
1. INTRODUCTION
Noise level normalization (NLN) is based on the observation that a combination of spectral subtraction (SS) and signal-to-noise-ratio normalization (SNRN) gives better recognition results when the subtraction and normalization are only applied to the...
Computing Mel-Frequency Cepstral Coefficients
- Proc. Int. Conf. on Acoustics, Speech and Signal Processing, 2001
In this paper we present a method to derive Mel-frequency cepstral coefficients directly from the power spectrum of a speech signal. We show that omitting the filterbank in signal analysis does not affect the word error rate. The presented approach simplifies the speech recognizer's front end by merging subsequent signal analysis steps into a single one. It avoids possible interpolation and discretization problems and results in a compact implementation. We show that frequency warping schemes like vocal tract normalization (VTN) can be integrated easily in our concept without additional computational effort. Recognition test results obtained with the RWTH large vocabulary speech recognition system are presented for two different corpora: The German VerbMobil II dev99 corpus, and the English North American Business News 94 20k development corpus.
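For context, the conventional pipeline that this abstract proposes to simplify can be sketched as follows: power spectrum, triangular Mel filterbank, logarithm, then a cosine transform. This is the textbook baseline, not the filterbank-free method of the paper, whose details the abstract does not give. The filter count, the number of coefficients, the sample rate, and the unnormalized DCT-II are assumptions chosen for illustration.

```python
import math


def hz_to_mel(f):
    """Common Mel-scale formula."""
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_filterbank(num_filters, num_bins, sample_rate):
    """Triangular filters with edges equally spaced on the Mel scale."""
    nyquist = sample_rate / 2.0
    top = hz_to_mel(nyquist)
    mel_points = [top * i / (num_filters + 1) for i in range(num_filters + 2)]
    # Edge frequencies expressed as fractional power-spectrum bin positions.
    edges = [mel_to_hz(m) / nyquist * (num_bins - 1) for m in mel_points]
    bank = []
    for k in range(1, num_filters + 1):
        left, center, right = edges[k - 1], edges[k], edges[k + 1]
        weights = []
        for b in range(num_bins):
            if left < b <= center:
                weights.append((b - left) / (center - left))
            elif center < b < right:
                weights.append((right - b) / (right - center))
            else:
                weights.append(0.0)
        bank.append(weights)
    return bank


def mfcc_from_power_spectrum(power, num_filters=20, num_ceps=12,
                             sample_rate=16000):
    """Conventional MFCCs c0..c_num_ceps from one power-spectrum frame."""
    bank = mel_filterbank(num_filters, len(power), sample_rate)
    log_energies = [
        math.log(max(sum(w * p for w, p in zip(weights, power)), 1e-10))
        for weights in bank
    ]
    # Unnormalized DCT-II of the log filterbank energies.
    return [
        sum(e * math.cos(math.pi * c * (k + 0.5) / num_filters)
            for k, e in enumerate(log_energies))
        for c in range(num_ceps + 1)
    ]


# Synthetic frame with 129 bins, as produced by a 256-point FFT.
frame = [1.0 + 0.5 * math.sin(0.1 * b) for b in range(129)]
cepstrum = mfcc_from_power_spectrum(frame)
print(len(cepstrum))
```

In this baseline the filterbank and the cosine transform are two separate linear steps around the logarithm. The discretization of filter edges onto FFT bins in `mel_filterbank` is one plausible source of the "interpolation and discretization problems" the abstract mentions.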

