Maximum Likelihood Linear Transformations for HMM-Based Speech Recognition
- Computer Speech and Language, 1998
- Cited by 275 (44 self)
This paper examines the application of linear transformations for speaker and environmental adaptation in an HMM-based speech recognition system. In particular, transformations that are trained in a maximum likelihood sense on adaptation data are investigated. Other than in the form of a simple bias, strict linear feature-space transformations are inappropriate in this case. Hence, only model-based linear transforms are considered. The paper compares the two possible forms of model-based transforms: (i) unconstrained, where any combination of mean and variance transform may be used, and (ii) constrained, which requires the variance transform to have the same form as the mean transform (sometimes referred to as feature-space transforms). Re-estimation formulae for all appropriate cases of transform are given. This includes a new and efficient "full" variance transform and the extension of the constrained model-space transform from the simple diagonal case to the full or block-diagonal case. The constrained and unconstrained transforms are evaluated in terms of computational cost, recognition time efficiency, and use for speaker adaptive training. The recognition performance of the two model-space transforms on a large vocabulary speech recognition task using incremental adaptation is investigated. In addition, initial experiments using the constrained model-space transform for speaker adaptive training are detailed.
Note: the author is now at the IBM T.J. Watson Research Center, Yorktown Heights, NY 10598, USA.
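The remark that constrained transforms are "sometimes referred to as feature-space transforms" can be illustrated for the simple diagonal case. The sketch below is not taken from the paper: the parameterisation (a per-dimension `scale` and `bias` applied to the observation) and all function names are assumptions chosen for illustration. It shows that scoring a transformed observation, plus the log-Jacobian of the transform, gives the same log-likelihood as scoring the original observation against a Gaussian whose mean and variance were transformed in the matching way.

```python
import math

def diag_gauss_loglik(x, mean, var):
    """Log-density of a diagonal-covariance Gaussian at x."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))

def constrained_loglik(obs, mean, var, scale, bias):
    """Feature-space view: transform the observation, add log|det A|.

    Assumes a diagonal transform A = diag(scale), so log|det A| is the
    sum of log|scale_d|. This parameterisation is illustrative only.
    """
    transformed = [a * o + b for a, o, b in zip(scale, obs, bias)]
    log_jacobian = sum(math.log(abs(a)) for a in scale)
    return diag_gauss_loglik(transformed, mean, var) + log_jacobian

def adapted_model(mean, var, scale, bias):
    """Model-space view: the equivalent adapted mean and variance."""
    new_mean = [(m - b) / a for m, a, b in zip(mean, scale, bias)]
    new_var = [v / (a * a) for v, a in zip(var, scale)]
    return new_mean, new_var
```

Because both views agree, a recogniser can apply such a transform once to the features instead of updating every Gaussian, which is one reason the constrained form is attractive at recognition time.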
Cluster Adaptive Training Of Hidden Markov Models
- IEEE Transactions on Speech and Audio Processing, 1999
- Cited by 36 (11 self)
When performing speaker adaptation there are two conflicting requirements. First, the transform must be powerful enough to represent the speaker. Second, the transform must be quickly and easily estimated for any particular speaker. The most popular adaptation schemes have used many parameters to adapt the models to be representative of an individual speaker. This limits how rapidly the models may be adapted to a new speaker or acoustic environment. This paper examines an adaptation scheme requiring very few parameters, cluster adaptive training (CAT). CAT may be viewed as a simple extension to speaker clustering. Rather than selecting a single cluster as representative of a particular speaker, a linear interpolation of all the cluster means is used as the mean of the particular speaker. This scheme naturally falls into an adaptive training framework. Maximum likelihood estimates of the interpolation weights are given. Furthermore, simple re-estimation formulae for cluster means, represented both explicitly and by sets of transforms of some canonical mean, are given. On a speaker-independent task, CAT reduced the word error rate using very little adaptation data. In addition, when combined with other adaptation schemes, it gave a 5% reduction in word error rate over adapting a speaker-independent model set.
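The interpolation idea in this abstract can be sketched in a few lines. The code below is an illustration under strong simplifying assumptions, not the paper's re-estimation formulae: it handles a single diagonal-covariance Gaussian with every frame fully assigned to it, whereas a real system would accumulate statistics over all components using posterior occupancies. The function names are hypothetical. Under these assumptions the maximum likelihood weights are the solution of a weighted least-squares problem.

```python
def interpolate_mean(cluster_means, weights):
    """Speaker mean as a weighted sum of P cluster means (each of length D)."""
    dim = len(cluster_means[0])
    return [sum(w * mean[d] for w, mean in zip(weights, cluster_means))
            for d in range(dim)]

def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

def estimate_weights(cluster_means, variances, frames):
    """ML interpolation weights for one diagonal-covariance Gaussian.

    Minimises sum_t sum_d (o_td - sum_p w_p * mu_pd)^2 / var_d by solving
    the normal equations G w = k.
    """
    p, dim, count = len(cluster_means), len(variances), len(frames)
    g = [[count * sum(cluster_means[i][d] * cluster_means[j][d] / variances[d]
                      for d in range(dim))
          for j in range(p)] for i in range(p)]
    k = [sum(cluster_means[i][d] * o[d] / variances[d]
             for o in frames for d in range(dim))
         for i in range(p)]
    return solve(g, k)
```

Only P weights are estimated per speaker, which is why such a scheme can adapt from very little data compared with transforms that carry many parameters.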
Vocal Tract Normalization Equals Linear Transformation in Cepstral Space
- In Proc. of the EUROSPEECH’01, 2001
- Cited by 27 (6 self)
We show that vocal tract normalization (VTN) frequency warping results in a linear transformation in the cepstral domain. For the special case of a piece-wise linear warping function, the transformation matrix is analytically calculated. This approach enables us to compute the Jacobian determinant of the transformation matrix, which allows the normalization of the probability distributions used in speaker-normalization for automatic speech recognition.
The SRI March 2000 Hub-5 conversational speech transcription system
- In Proceedings of the NIST Speech Transcription Workshop, 2000
- Cited by 26 (6 self)
We describe SRI’s large vocabulary conversational speech recognition system as used in the March 2000 NIST Hub-5E evaluation. The system performs four recognition passes: (1) bigram recognition with phone-loop-adapted, within-word triphone acoustic models, (2) lattice generation with transcription-mode-adapted models, (3) trigram lattice recognition with adapted cross-word triphone models, and (4) N-best rescoring and reranking with various additional knowledge sources. The system incorporates two new kinds of acoustic model: triphone models conditioned on speaking rate, and an explicit joint model of within-word phone durations. We also obtained an unusually large improvement from modeling cross-word pronunciation variants in “multiword” vocabulary items. The language model (LM) was enhanced with an “anti-LM” representing acoustically confusable word sequences. Finally, we applied a generalized ROVER algorithm to combine the N-best hypotheses from several systems based on different acoustic models.
Speech recognition in noisy environments using first-order vector Taylor series
- Speech Communication, 1998
- Cited by 22 (1 self)
In this paper, we generalize the relations between clean and noisy speech signals using a vector Taylor series (VTS) expansion for noise-robust speech recognition. We use it for both noisy data compensation and hidden Markov model (HMM) parameter adaptation, and apply it in the cepstral domain directly, while Moreno used it to estimate the log-spectral parameters. Also, we develop a detailed procedure to estimate environmental variables in the cepstral domain using the expectation and maximization (EM) algorithm in the maximum likelihood (ML) sense. To evaluate the developed method, we conduct speaker-independent isolated word and continuous speech recognition experiments. White Gaussian and driving car noises added to clean speech at various SNRs are used as disturbing sources. Using only noise statistics obtained from three frames of silence and the noisy speech to be recognized, we achieve significant performance improvement. Especially, HMM parameter adaptation with VTS i...
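As a rough illustration of what a first-order VTS expansion does to a model parameter, the sketch below compensates a single Gaussian dimension for additive noise. It is deliberately simpler than the method in the abstract: it works per dimension in the log-spectral domain (the form the abstract attributes to Moreno), ignores channel effects, and assumes speech and noise are independent. The function name and arguments are hypothetical.

```python
import math

def vts_compensate(mean_x, var_x, mean_n, var_n):
    """First-order VTS compensation of one log-spectral Gaussian dimension.

    Mismatch function for additive noise: y = x + log(1 + exp(n - x)).
    The function is linearised around the clean-speech and noise means.
    """
    offset = math.log1p(math.exp(mean_n - mean_x))
    # dy/dx at the expansion point; dy/dn is 1 - gain.
    gain = 1.0 / (1.0 + math.exp(mean_n - mean_x))
    mean_y = mean_x + offset
    var_y = gain ** 2 * var_x + (1.0 - gain) ** 2 * var_n
    return mean_y, var_y
```

When the noise mean is far below the speech mean, the gain approaches 1 and the Gaussian is left almost unchanged; when noise dominates, the compensated Gaussian moves toward the noise statistics.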
On adaptive decision rules and decision parameter adaptation for automatic speech recognition
- Proc. IEEE, 2000
- Cited by 16 (3 self)
Recent advances in automatic speech recognition are accomplished by designing a plug-in maximum a posteriori decision rule such that the forms of the acoustic and language model distributions are specified and the parameters of the assumed distributions are estimated from a collection of speech and language training corpora. Maximum-likelihood point estimation is by far the most prevailing training method. However, due to the problems of unknown speech distributions, sparse training data, high spectral and temporal variabilities in speech, and possible mismatch between training and testing conditions, a dynamic training strategy is needed. To cope with the changing speakers and speaking conditions in real operational conditions for high-performance speech recognition, such paradigms incorporate a small amount of speaker and environment specific adaptation data into the training process. Bayesian adaptive learning is an optimal way to combine
Speaker Normalization with All-Pass Transforms
- 1998
- Cited by 14 (5 self)
Speaker normalization is a process in which the short-time features of speech from a given speaker are transformed so as to better match some speaker independent model. Vocal tract length normalization (VTLN) is a popular speaker normalization scheme wherein the frequency axis of the short-time spectrum associated with a particular speaker's speech is rescaled or warped prior to the extraction of cepstral features. In this work, we develop a novel speaker normalization scheme by exploiting the fact that frequency domain transformations similar to that inherent in VTLN can be accomplished entirely in the cepstral domain through the use of conformal maps. We propose a class of such maps, designated all-pass transforms for reasons given hereafter, and rigorously investigate their properties. Theoretical results are provided relating to the transformation of cepstral sequences under these maps. Additionally, all relations necessary to determine maximum likelihood estimates of the mapping parameters are derived for both speaker normalization and adaptation.
Divergence-Based Out-Of-Class Rejection For Telephone Handset
- In Proc. ICSLP’02, 2002
- Cited by 12 (12 self)
Research has shown that handset selectors can be used to assist telephone-based speech/speaker recognition. Most handset selectors, however, simply select the most likely handset from a set of known handsets even for speech coming from an ‘unseen’ handset. This paper proposes a divergence-based handset selector with out-of-handset (OOH) rejection capability to identify the ‘unseen’ handsets. This is achieved by measuring the Jensen difference between the selector's output and a constant vector with identical elements. The resulting handset selector is combined with a feature-based channel compensation algorithm for telephone-based speaker verification. Utterances whose handsets were identified as ‘unseen’ are either transformed by a global bias vector or normalized by cepstral mean subtraction (CMS). On the other hand, if the handset can be identified (considered as ‘seen’), its corresponding transformation parameters will be used to transform the utterances. Experiments based on ten handsets of the HTIMIT corpus show that the best result is achieved by transforming utterances from correctly identified handsets with the parameters of the ‘seen’ handsets and processing utterances from ‘unseen’ handsets by CMS.
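The rejection rule described in this abstract can be sketched as follows. The Jensen difference used here is the standard entropy-based definition, J(p, q) = H((p + q) / 2) - (H(p) + H(q)) / 2. The decision direction (reject when the selector output is close to the constant vector) and the threshold value are assumptions for illustration, as are all function names; the abstract does not specify them.

```python
import math

def entropy(p):
    """Shannon entropy in nats, treating 0 * log(0) as 0."""
    return -sum(x * math.log(x) for x in p if x > 0.0)

def jensen_difference(p, q):
    """Entropy of the midpoint minus the mean entropy of p and q."""
    mid = [(a + b) / 2.0 for a, b in zip(p, q)]
    return entropy(mid) - (entropy(p) + entropy(q)) / 2.0

def select_handset(scores, threshold):
    """Return the most likely handset index, or None for an 'unseen' handset.

    scores: selector outputs that sum to 1. A near-uniform output carries
    little evidence for any known handset, so it is rejected.
    """
    uniform = [1.0 / len(scores)] * len(scores)
    if jensen_difference(scores, uniform) < threshold:
        return None
    return max(range(len(scores)), key=lambda i: scores[i])

def cepstral_mean_subtraction(frames):
    """Fallback for rejected utterances: remove the per-dimension mean."""
    dim = len(frames[0])
    means = [sum(f[d] for f in frames) / len(frames) for d in range(dim)]
    return [[f[d] - means[d] for d in range(dim)] for f in frames]
```

In use, an utterance whose selector output yields `None` would be passed through `cepstral_mean_subtraction`, while an accepted utterance would be transformed with the parameters stored for the returned handset index.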
Transformation Smoothing for Speaker and Environmental Adaptation
- Cited by 10 (1 self)
Recently there has been much work done on how to transform HMMs, trained typically in a speaker-independent fashion on clean training data, to be more representative of data from a particular speaker or acoustic environment. These transforms are trained on a small amount of training data, so large numbers of components are required to share the same transform. Normally, each component is constrained to only use one transform. This paper examines how to optimally, in a maximum likelihood sense, assign components to transforms and allow each component, or component grouping, to make use of many transformations. The theory for obtaining both "weights" for each transform and transforms given a set of weights is given. The techniques are evaluated on both speaker and environmental adaptation tasks.
Robust Text-Independent Speaker Identification over Telephone Channels
- IEEE Trans. on Speech and Audio Processing, 1997
- Cited by 10 (1 self)
This paper addresses the issue of closed-set text-independent speaker identification from samples of speech recorded over the telephone. It focuses on the effects of acoustic mismatches between training and testing data, and concentrates on two approaches: extracting features that are robust against channel variations, and transforming the speaker models to compensate for channel effects. First, an experimental study shows that optimizing the front end processing of the speech signal can significantly improve speaker recognition performance. A new filterbank design is introduced to improve the robustness of the speech spectrum computation in the front-end unit. Next, a new feature based on spectral slopes is described. Its ability to discriminate between speakers is shown to be superior to that of the traditional cepstrum. This feature can be used alone or combined with the cepstrum. The second part of the paper presents two model transformation methods that further reduce channel effe...

