Results 1 - 10
of
35
Maximum Likelihood Linear Transformations for HMM-Based Speech Recognition
- Computer Speech and Language
, 1998
"... This paper examines the application of linear transformations for speaker and environmental adaptation in an HMM-based speech recognition system. In particular, transformations that are trained in a maximum likelihood sense on adaptation data are investigated. Other than in the form of a simple bias ..."
Abstract
-
Cited by 275 (44 self)
- Add to MetaCart
This paper examines the application of linear transformations for speaker and environmental adaptation in an HMM-based speech recognition system. In particular, transformations that are trained in a maximum likelihood sense on adaptation data are investigated. Other than in the form of a simple bias, strict linear feature-space transformations are inappropriate in this case. Hence, only model-based linear transforms are considered. The paper compares the two possible forms of model-based transforms: (i) unconstrained, where any combination of mean and variance transform may be used, and (ii) constrained, which requires the variance transform to have the same form as the mean transform (sometimes referred to as feature-space transforms). Re-estimation formulae for all appropriate cases of transform are given. This includes a new and efficient "full" variance transform and the extension of the constrained model-space transform from the simple diagonal case to the full or block-diagonal case. The constrained and unconstrained transforms are evaluated in terms of computational cost, recognition time efficiency, and use for speaker adaptive training. The recognition performance of the two model-space transforms on a large vocabulary speech recognition task using incremental adaptation is investigated. In addition, initial experiments using the constrained model-space transform for speaker adaptive training are detailed. 1 The author is now at the IBM T.J. Watson Research Center, Yorktown Heights, NY 10598, USA 1
Continuous Probabilistic Transform for Voice Conversion
- IEEE Transactions on Speech and Audio Processing
, 1998
"... Abstract — Voice conversion, as considered in this paper, is defined as modifying the speech signal of one speaker (source speaker) so that it sounds as if it had been pronounced by a different speaker (target speaker). Our contribution includes the design of a new methodology for representing the r ..."
Abstract
-
Cited by 78 (3 self)
- Add to MetaCart
Abstract — Voice conversion, as considered in this paper, is defined as modifying the speech signal of one speaker (source speaker) so that it sounds as if it had been pronounced by a different speaker (target speaker). Our contribution includes the design of a new methodology for representing the relationship between two sets of spectral envelopes. The proposed method is based on the use of a Gaussian mixture model of the source speaker spectral envelopes. The conversion itself is represented by a continuous parametric function which takes into account the probabilistic classification provided by the mixture model. The parameters of the conversion function are estimated by least squares optimization on the training data. This conversion method is implemented in the context of the HNM (harmonic C noise model) system, which allows high-quality modifications of speech signals. Compared to earlier methods based on vector quantization, the proposed conversion scheme results in a much better match between the converted envelopes and the target envelopes. Evaluation by objective tests and formal listening tests shows that the proposed transform greatly improves the quality and naturalness of the converted speech signals compared with previous proposed conversion methods. I.
Speech Recognition in Noisy Environments
- Ph. D. Dissertation, ECE Department, CMU
, 1996
"... . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 Chapter 1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 1.1. Thesis goals . . . . . . . . . . . . . . . . . . . . . ..."
Abstract
-
Cited by 72 (3 self)
- Add to MetaCart
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 Chapter 1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 1.1. Thesis goals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 1.2. Dissertation Outline . . . . . . . . . . . . . . . . . . . . . . . . . . 15 Chapter 2 The SPHINX-II Recognition System . . . . . . . . . . . . . . . . . . . . . . 17 2.1. An Overview of the SPHINX-II System . . . . . . . . . . . . . . . . . . 17 2.1.1. Signal Processing . . . . . . . . . . . . . . . . . . . . . . . . . 17 2.1.2. Hidden Markov Models . . . . . . . . . . . . . . . . . . . . . . 20 2.1.3. Recognition Unit . . . . . . . . . . . . . . . . . . . . . . . . . 22 2.1.4. Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 2.1.5. Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 2.2. Experimental Tasks and Corpora . ...
A vector Taylor series approach for environment-independent speech recognition
- Proc. ICASSP-96
, 1996
"... In this paper we introduce a new analytical approach to environment compensation for speech recognition. Previous attempts at solving analytically the problem of noisy speech recognition have either used an overly-simplified mathematical description of the effects of noise on the statistics of speec ..."
Abstract
-
Cited by 66 (16 self)
- Add to MetaCart
In this paper we introduce a new analytical approach to environment compensation for speech recognition. Previous attempts at solving analytically the problem of noisy speech recognition have either used an overly-simplified mathematical description of the effects of noise on the statistics of speech or they have relied on the availability of large environment-specific adaptation sets. Some of the previous methods required the use of adaptation data that consists of simultaneouslyrecorded or “stereo ” recordings of clean and degraded speech. In this work we introduce the use of a Vector Taylor series (VTS) expansion to characterize efficiently and accurately the effects on speech statistics of unknown additive noise and unknown linear filtering in a transmission channel. The VTS approach is computationally efficient. It can be applied either to the incoming speech feature vectors, or to the statistics representing these vectors. In the first case the speech is compensated and then recognized; in the second case HMM statistics are modified using the VTS formulation. Both approaches use only the actual speech segment being recognized to compute the parameters required for environmental compensation. We evaluate the performance of two implementations of VTS algorithms using the CMU SPHINX-II system on the 100word alphanumeric CENSUS database and on the 1993 5000word ARPA Wall Street Journal database. Artificial white Gaussian noise is added to both databases. The VTS approaches provide significant improvements in recognition accuracy compared to previous algorithms. 1.
Survey of the State of the Art in Human Language Technology
, 1995
"... Contents 1 Spoken Language Input 1 Ron Cole & Victor Zue, chapter editors 1.1 Overview : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 1 Victor Zue & Ron Cole 1.2 Speech Recognition : : : : : : : : : : : : : : : : : : : : : : : : : : : 4 Victor Zue, Ron Cole, & Wayne Ward 1.3 Sig ..."
Abstract
-
Cited by 47 (0 self)
- Add to MetaCart
Contents 1 Spoken Language Input 1 Ron Cole & Victor Zue, chapter editors 1.1 Overview : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 1 Victor Zue & Ron Cole 1.2 Speech Recognition : : : : : : : : : : : : : : : : : : : : : : : : : : : 4 Victor Zue, Ron Cole, & Wayne Ward 1.3 Signal Representation : : : : : : : : : : : : : : : : : : : : : : : : : : 11 Melvyn J. Hunt 1.4 Robust Speech Recognition : : : : : : : : : : : : : : : : : : : : : : 17 Richard M. Stern 1.5 HMM Methods in Speech Recognition : : : : : : : : : : : : : : : 24 Renato De Mori & Fabio Brugnara 1.6 Language Representation : : : : : : : : : : : : : : : : : : : : : : : : 35 Salim Roukos 1.7 Speaker Recognition : : : : : : : : : : : : : : : : : : : : : : : : : : :<F35.37
Should recognizers have ears
- Speech Communication
, 1998
"... The paper discusses author’s experience with applying auditory knowledge to automatic recognition of speech. It indirectly argues against blind implementing of scattered accidental knowledge which may be irrelevant to a speech recognition task. It advances the notion that the reason for applying kno ..."
Abstract
-
Cited by 44 (3 self)
- Add to MetaCart
The paper discusses author’s experience with applying auditory knowledge to automatic recognition of speech. It indirectly argues against blind implementing of scattered accidental knowledge which may be irrelevant to a speech recognition task. It advances the notion that the reason for applying knowledge of human auditory perception in engineering applications should be the ability of perception to suppress some parts of information in the speech message. Three properties of human speech perception: limited spectral resolution, use of information from about syllable-length segments ability to alleviate unreliable cues, are discussed in some detail. Overall, we are advocating selective use of auditory knowledge,optimized on real speechdata. Fig. I A good hard working man. Fig. II A foolish man?
Uncertainty decoding for noise robust speech recognition
- in Proc. Interspeech
, 2004
"... This dissertation is the result of my own work and includes nothing which is the outcome of work done in collaboration. It has not been submitted in whole or in part for a degree at any other university. Some of the work has been published previously in conference proceedings ..."
Abstract
-
Cited by 26 (8 self)
- Add to MetaCart
This dissertation is the result of my own work and includes nothing which is the outcome of work done in collaboration. It has not been submitted in whole or in part for a degree at any other university. Some of the work has been published previously in conference proceedings
Environmental Adaptation for Robust Speech Recognition
, 1994
"... . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 Chapter 1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 1.1. Approaches to Overcoming Environmental Variability . . . . . . ..."
Abstract
-
Cited by 17 (0 self)
- Add to MetaCart
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 Chapter 1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 1.1. Approaches to Overcoming Environmental Variability . . . . . . . . . . . . . . 6 1.1.1. Re-Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 1.1.2. Multi-Style Training . . . . . . . . . . . . . . . . . . . . . . . . . 7 1.1.3. Environmental Compensation Using Dynamic Adaptation . . . . . . . . . . 8 1.2. Towards Environment-Independent Recognition . . . . . . . . . . . . . . . . 8 1.2.1. Sources of Environmental Variability . . . . . . . . . . . . . . . . . . 9 1.2.2. Performance Evaluation . . . . . . . . . . . . . . . . . . . . . . . 9 1.3. Dissertation Outline . . . . . . . . . . . . . . . . . . . . . . . . . . 10 Chapter 2 Overview of Environmental Robustness in Speech Recognition . . . . . . 12 2.1. Sources of Degradation...
Compensation for environmental degradation in automatic speech recognition
- ESCA-NATO Tutorial and Research Workshop
, 1997
"... The accuracy of speech recognition systems degrades when operated in adverse acoustical environments. This paper reviews various methods by which more detailed mathematical descriptions of the effects of environmental degradation can improve speech recognition accuracy using both “data-driven” and “ ..."
Abstract
-
Cited by 16 (5 self)
- Add to MetaCart
The accuracy of speech recognition systems degrades when operated in adverse acoustical environments. This paper reviews various methods by which more detailed mathematical descriptions of the effects of environmental degradation can improve speech recognition accuracy using both “data-driven” and “model-based ” compensation strategies. Data-driven methods learn environmental characteristics through direct comparisons of speech recorded in the noisy environment with the same speech recorded under optimal conditions. Model-based methods use a mathematical model of the environment and attempt to use samples of the degraded speech to estimate model parameters. These general approaches to environmental compensation are discussed in terms of recent research in environmental robustness at CMU, and in terms of similar efforts at other sites. These compensation algorithms are evaluated in a series of experiments measuring recognition accuracy for speech from the ARPA Wall Street Journal database that is corrupted by artificially-added noise at various signal-to-noise ratios (SNRs), and in more natural speech recognition tasks. 1.
Predictive linear transforms for noise robust speech recognition
- in Proceedings of the ASRU Workshop
"... It is well known that the addition of background noise alters the correlations between the elements of, for example, the MFCC feature vector. However, standard model-based compensation techniques do not modify the feature-space in which the diagonal covariance matrix Gaussian mixture models are esti ..."
Abstract
-
Cited by 14 (12 self)
- Add to MetaCart
It is well known that the addition of background noise alters the correlations between the elements of, for example, the MFCC feature vector. However, standard model-based compensation techniques do not modify the feature-space in which the diagonal covariance matrix Gaussian mixture models are estimated. One solution to this problem, which yields good performance, is Joint Uncertainty Decoding (JUD) with full transforms. Unfortunately, this results in a high computational cost during decoding. This paper contrasts two approaches to approximating full JUD while lowering the computational cost. Both use predictive linear transforms to modify the feature-space: adaptation-based linear transforms, where the model parameters are restricted to be the same as the original clean system; and precision matrix modelling approaches, in particular semi-tied covariance matrices. These predicitve transforms are estimated using statistics derived from the full JUD transforms rather noisy data. The schemes are evaluated on AURORA 2 and a noise-corrupted Resource Management task. Index Terms — Noise robust speech recognition, joint uncertainty

