Improved Methods for Vocal Tract Normalization (1999)

by L. Welling, S. Kanthak, H. Ney
Venue: ICASSP

Results 1 - 10 of 13

Vocal Tract Normalization Equals Linear Transformation in Cepstral Space

by Michael Pitz, Sirko Molau, Ralf Schlüter, Hermann Ney - In Proc. of EUROSPEECH’01, 2001
Cited by 27 (6 self)
We show that vocal tract normalization (VTN) frequency warping results in a linear transformation in the cepstral domain. For the special case of a piece-wise linear warping function, the transformation matrix is analytically calculated. This approach enables us to compute the Jacobian determinant of the transformation matrix, which allows the normalization of the probability distributions used in speaker-normalization for automatic speech recognition.
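
As a rough numerical illustration of the claim in this abstract, the sketch below builds a piecewise-linear warping function and the matrix that maps unwarped to warped cepstral coefficients. The breakpoint `omega0`, the cepstral scaling convention, the truncation to `num_ceps` coefficients, and all function names are assumptions made for illustration; the paper calculates the matrix analytically, whereas this sketch integrates numerically.

```python
import numpy as np


def piecewise_linear_warp(omega, alpha, omega0=0.875 * np.pi):
    """Map frequencies in [0, pi] onto [0, pi].

    The slope is alpha below the (assumed) breakpoint omega0; above it a
    second linear segment is chosen so that pi is mapped to pi.
    """
    omega = np.asarray(omega, dtype=float)
    upper_slope = (np.pi - alpha * omega0) / (np.pi - omega0)
    return np.where(
        omega <= omega0,
        alpha * omega,
        alpha * omega0 + upper_slope * (omega - omega0),
    )


def warp_matrix(alpha, num_ceps=12, num_points=4096):
    """Numerically approximate A so that warped_cepstrum = A @ cepstrum.

    Assumed convention: log-spectrum X(w) = c_0 + 2 * sum_{k>=1} c_k cos(k w)
    and c_n = (1/pi) * integral_0^pi X(w) cos(n w) dw. The result is
    truncated to num_ceps coefficients.
    """
    # Midpoint-rule grid on [0, pi].
    omega = (np.arange(num_points) + 0.5) * np.pi / num_points
    warped = piecewise_linear_warp(omega, alpha)
    index = np.arange(num_ceps)[:, None]
    basis = np.cos(index * omega)          # cos(n * w), shape (num_ceps, num_points)
    warped_basis = np.cos(index * warped)  # cos(k * g(w)), same shape
    scale = np.where(np.arange(num_ceps) == 0, 1.0, 2.0)
    return (basis @ warped_basis.T) * scale[None, :] / num_points
```

With `alpha = 1` the warp is the identity, so `warp_matrix(1.0)` should reduce to the identity matrix. For other values of `alpha`, each warped coefficient is a fixed linear combination of the unwarped ones, which is the sense in which the warping acts as a linear transformation.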

Investigating Recognition Of Children's Speech

by Diego Giuliani, Matteo Gerosa - In Proc. ICASSP, 2003
Cited by 13 (0 self)
In this work recognition of children's speech was investigated by considering a phone recognition task. Two baseline systems were trained, one for children and one for adults, by exploiting two Italian speech databases. Under matching conditions, training and recognition performed with data from the same population group, the phone recognition accuracy was 77.30% and 79.43% for children and adults, respectively. It was

Fast search for large vocabulary speech recognition

by Stephan Kanthak, Achim Sixtus, Sirko Molau, Ralf Schlüter, Hermann Ney - In Verbmobil: Foundations of Speech-to-Speech Translation, W. Wahlster, Ed., 2000
Cited by 11 (11 self)
In this article we describe methods for improving the RWTH German speech recognizer used within the VERBMOBIL project. In particular, we present acceleration methods for the search based on both within-word and across-word phoneme models. We also study incremental methods to reduce the response time of the online speech recognizer. Finally, we present experimental off-line results for the three VERBMOBIL scenarios. We report on word error rates and real-time factors for both speaker independent and speaker dependent recognition.

Speaker Adaptive Modeling by Vocal Tract Normalization

by Lutz Welling, Hermann Ney, Stephan Kanthak (Lehrstuhl für Informatik VI) - IEEE Trans. on Speech and Audio Processing, 2002
Cited by 10 (1 self)
This paper presents methods for speaker adaptive modeling using vocal tract normalization (VTN) along with experimental tests on three databases. We propose a new training method for VTN: By using single-density acoustic models per HMM state for selecting the scale factor of the frequency axis, we avoid the problem that a mixture-density tends to learn the scale factors of the training speakers and thus cannot be used for selecting the scale factor. We show that using single Gaussian densities for selecting the scale factor in training results in lower error rates than using mixture densities.
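
The idea of selecting the scale factor with a single density can be sketched as a maximum-likelihood grid search. This is a deliberately simplified illustration, not the authors' training procedure: it collapses the per-state models and the state alignment into one diagonal Gaussian, ignores any Jacobian term, and assumes the features for each candidate factor have already been computed. The function names and the `features_by_alpha` dictionary are hypothetical.

```python
import numpy as np


def diag_gaussian_loglik(frames, mean, var):
    """Total log-likelihood of frames under one diagonal-covariance Gaussian."""
    diff = frames - mean
    per_frame = -0.5 * (np.log(2.0 * np.pi * var) + diff ** 2 / var).sum(axis=1)
    return per_frame.sum()


def select_warp_factor(features_by_alpha, mean, var):
    """Pick the candidate warp factor whose features score highest.

    features_by_alpha maps each candidate factor to a (num_frames, dim)
    array of features extracted with that factor (assumed precomputed).
    """
    scores = {
        alpha: diag_gaussian_loglik(frames, mean, var)
        for alpha, frames in features_by_alpha.items()
    }
    best_alpha = max(scores, key=scores.get)
    return best_alpha, scores
```

A single Gaussian cannot model several warp factors at once, which matches the abstract's argument for preferring it over a mixture density when choosing the scale factor.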

Recent Improvements Of The RWTH Large Vocabulary Speech Recognition System On Spontaneous Speech

by Achim Sixtus, Sirko Molau, Stephan Kanthak, Ralf Schlüter, Hermann Ney - Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, 2000
Cited by 8 (5 self)
This paper presents recent improvements of the RWTH large vocabulary continuous speech recognition system (LVCSR). In particular, we will report on the integration of across-word models into the first recognition pass, and describe better algorithms for fast vocal tract normalization (VTN). We will focus both on the improvements in word error rate and how to speed up the recognizer with only minimal loss in recognition accuracy. Implementation details and experimental results are given for the VerbMobil task, a German spontaneous speech corpus. The 25.0% word error rate (WER) of our within-word baseline system was reduced to 21.4% with VTN and across-word models. Decreasing the real-time factor (RTF) by up to 85% resulted in only a small degradation in recognition performance of 2% relative on average. 1. INTRODUCTION The RWTH LVCSR system is a continuous Gaussian mixture density speech recognition system, which has been described in detail in [6]. The baseline system is a trigram Vit...
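
To put the absolute figures above next to the relative ones, the following small helper computes a relative reduction in percent. The function name is an assumption for illustration; only the 25.0% and 21.4% values come from the abstract.

```python
def relative_reduction(baseline, improved):
    """Return the relative reduction from baseline to improved, in percent."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return 100.0 * (baseline - improved) / baseline


# WER values quoted in the abstract: 25.0% baseline, 21.4% with VTN and
# across-word models.
print(f"{relative_reduction(25.0, 21.4):.1f}% relative WER reduction")
# prints "14.4% relative WER reduction"
```

The 3.6-point absolute drop therefore corresponds to roughly a 14.4% relative improvement, which is the scale on which the 2% relative degradation should be read.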

The RWTH Aachen University Open Source Speech Recognition System

by David Rybach, Christian Gollan, Georg Heigold, Björn Hoffmeister, Jonas Lööf, Ralf Schlüter, Hermann Ney
Cited by 3 (1 self)
We announce the public availability of the RWTH Aachen University speech recognition toolkit. The toolkit includes state-of-the-art speech recognition technology for acoustic model training and decoding. Speaker adaptation, speaker adaptive training, unsupervised training, a finite state automata library, and an efficient tree search decoder are notable components. Comprehensive documentation, example setups for training and recognition, and a tutorial are provided to support newcomers. Index Terms: speech recognition, LVCSR, software

The RWTH Large Vocabulary Speech Recognition System For Spontaneous Speech

by Stephan Kanthak, Sirko Molau, Achim Sixtus, Ralf Schlüter, Hermann Ney - In Proceedings of the Konvens 2000, 2000
Cited by 2 (0 self)
This paper presents details of the RWTH large vocabulary continuous speech recognition system used in the VERBMOBIL spontaneous speech translation system. In particular, we report on methods for accelerating the search and algorithms for fast vocal tract normalization (VTN). We focus both on the improvements in word error rate and how to speed up the recognizer with only minimal loss in recognition accuracy. Implementation details and experimental results are given for the VERBMOBIL German development corpus dev99. The 24.6% word error rate of the baseline system is reduced to 22.8% using VTN. Decreasing the real-time factor by a factor of 5 resulted in only a small degradation in recognition performance of 2% relative on average. Furthermore, we study incremental methods for reducing the response time of the online speech recognizer and an efficient method to reduce the density of word graphs. 1. Introduction This paper describes the RWTH large vocabulary continuous speech recogniti...

On Extending VTLN to Phoneme-specific Warping in Automatic Speech Recognition

by Daniel Elenius, Mats Blomberg
Cited by 1 (1 self)
Phoneme- and formant-specific warping has been shown to decrease formant and cepstral mismatch. These findings have not yet been fully implemented in speech recognition. This paper discusses a few reasons why this may be. A small experimental study is also included where phoneme-independent warping is extended towards phoneme-specific warping. The results of this investigation did not show a significant decrease in error rate during recognition. This is also in line with earlier experiments of methods discussed in the paper.

Performance Analysis of the Aurora Large Vocabulary Baseline System

by N. Parihar, J. Picone
Cited by 1 (0 self)
In this paper, we present the design and analysis of a large vocabulary speech recognition system that was used to conduct the ETSI Aurora large vocabulary evaluation. The experimental paradigm is presented along with the results from a number of experiments designed to minimize the computational requirements for the system. It is shown that increasing the sampling frequency from 8 kHz to 16 kHz improves performance significantly only for the noisy test conditions. Utterance detection resulted in significant improvements only on the noisy conditions for the mismatched training conditions. Use of the DSR standard lossy VQ-based compression algorithm did not result in a significant degradation in performance. A mismatch between training and testing conditions (model mismatch) resulted in a 300% relative increase in WER. Mismatches in microphones also resulted in a 200% relative increase in WER. The Aurora LV baseline system achieved a WER of 14.0% on the standard 5K Wall Street Journal task, and required 4 xRT for training and 15 xRT for decoding (on an 800 MHz Pentium processor).

Speech Recognition using Wavelet Packet Features

by Mihalis Siafarikas, Iosif Mporas, Todor Ganchev, Nikos Fakotakis
In view of the growing use of automatic speech recognition in modern society, we study various alternative representations of the speech signal that have the potential to contribute to the improvement of the recognition performance. Specifically, the main targets of the present article are to overview and evaluate the practical importance of some recently proposed, and thus less studied, wavelet packet-based speech parameterization methods on the speech recognition task, illustrating their merits compared to other well-known approaches. To this end, working on the widely acknowledged TIMIT (Texas Instruments and Massachusetts Institute of Technology) speech database and relying on the Sphinx-III speech recognizer, we contrast the performance of four wavelet packet-based speech parameterizations against traditional Fourier-based techniques that have been considered for the task of speech recognition for over two decades, including Mel Frequency Cepstral Coefficients (MFCC) and Perceptual Linear Predictive (PLP) cepstral coefficients that presently dominate the speech recognition field. The experimental results demonstrate that the wavelet packet-based speech features of interest provide a superior performance over the baseline parameters. This validates the wavelet packet-based speech parameterization schemes as a promising research direction that could bring further reduction of the speech recognition error rate.