Uncertainty decoding for noise robust speech recognition
- in Proc. Interspeech, 2004
Cited by 26 (8 self)
This dissertation is the result of my own work and includes nothing which is the outcome of work done in collaboration. It has not been submitted in whole or in part for a degree at any other university. Some of the work has been published previously in conference proceedings
Investigation of acoustic modeling techniques for LVCSR systems
- in Proc. ICASSP, 2005
Cited by 11 (7 self)
The CU-HTK evaluation systems typically make use of a multiple-pass, multiple-branch framework. This allows a range of acoustic models to be used in the framework and the output from all the systems, or branches, to be combined to give the final output. This paper describes experiments with several advanced acoustic modelling techniques that were candidate approaches for the 2004 CU-HTK large vocabulary speech recognition systems. These techniques include Gaussianization for speaker normalization, discriminative cluster adaptive training, discriminative subspace for precision and mean modelling of inverse covariances, and discriminative complexity control. Acoustic models built using these techniques were integrated into a state-of-the-art 10× real-time multi-pass system with sophisticated adaptation for performance evaluation. Experimental results are presented on both broadcast news (BN) and conversational telephone speech (CTS) transcription tasks.
Very Fast Adaptation with a Compact Context-Dependent Eigenvoice Model
- in Proc. ICASSP, 2001
Cited by 10 (1 self)
The “eigenvoice” technique achieves rapid speaker adaptation by employing prior knowledge of speaker space obtained from reference speakers to place strong constraints on the initial model for each new speaker [9,10]. It has recently been shown to yield very fast adaptation for a large-vocabulary system [3] ([5] modifies the technique in an interesting way). In this paper, we describe a new way of applying the eigenvoice technique to context-dependent acoustic modeling, called the “Eigencentroid plus Delta Trees” (EDT) model. Here, the context-dependent model is defined so that it consists of a speaker-dependent component with a small number of parameters linked to a speaker-independent component with far more parameters. The eigenvoice technique can then be applied to the speaker-dependent component alone to attain very fast adaptation of the entire context-dependent model (e.g., 10% relative reduction in error rate after 3 sentences). EDT requires only a small number of parameters to represent speaker space and works even if only a small amount of data is available per reference speaker (in contrast to the system described in [3]).
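As a rough illustration of the general eigenvoice idea (not the EDT model itself), the sketch below represents a new speaker as the mean supervector plus a weighted sum of a few eigenvoices. It assumes the eigenvoices are already available and orthonormal, and it estimates the weights by simple projection, which is a stand-in for the likelihood-based estimation used in practice. The function names and the toy vectors are hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def adapt_speaker(mean, eigenvoices, observed):
    """Constrain a speaker estimate to the eigenvoice subspace.

    mean        -- mean supervector of the reference speakers
    eigenvoices -- orthonormal basis vectors (assumed precomputed)
    observed    -- noisy supervector estimated from little adaptation data
    Returns the eigenvoice weights and the adapted supervector.
    """
    centered = [o - m for o, m in zip(observed, mean)]
    # With an orthonormal basis, the least-squares weights are projections.
    weights = [dot(e, centered) for e in eigenvoices]
    adapted = list(mean)
    for w, e in zip(weights, eigenvoices):
        adapted = [a + w * x for a, x in zip(adapted, e)]
    return weights, adapted


# Toy example: 4-dimensional supervectors, 2 eigenvoices (hypothetical values).
mean = [1.0, 2.0, 3.0, 4.0]
eigenvoices = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
observed = [1.5, 1.0, 3.2, 4.1]
weights, adapted = adapt_speaker(mean, eigenvoices, observed)
```

Only two weights are estimated here instead of four free parameters, which is why this kind of constraint can work with very little adaptation data.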
Combining Derivative and Parametric Kernels for Speaker Verification
- 2007
Cited by 9 (1 self)
Support Vector Machine-based speaker verification (SV) has become a standard approach in recent years. These systems typically use dynamic kernels to handle the dynamic nature of the speech utterances. This paper shows that many of these kernels fall into one of two general classes, derivative and parametric kernels. The attributes of these classes are contrasted and the conditions under which the two forms of kernel are identical are described. By avoiding these conditions, gains may be obtained by combining derivative and parametric kernels. One combination strategy is to combine at the kernel level. This paper describes a maximum-margin based scheme for learning kernel weights for the SV task. Various dynamic kernels and combinations were evaluated on the NIST 2002 SRE task, including derivative and parametric kernels based upon different model structures. The best overall performance was 7.78% EER, achieved when combining five kernels.
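Combination at the kernel level can be sketched as a non-negative weighted sum of Gram matrices. The example below uses fixed, hand-picked weights purely for illustration. It does not implement the maximum-margin weight learning the abstract refers to, and the matrices and names are hypothetical.

```python
def combine_kernels(gram_matrices, weights):
    """Return the weighted sum of equally sized Gram matrices.

    Non-negative weights keep the combined matrix a valid
    (positive semi-definite) kernel when each input is one.
    """
    if len(gram_matrices) != len(weights):
        raise ValueError("need exactly one weight per kernel")
    if any(w < 0 for w in weights):
        raise ValueError("kernel weights must be non-negative")
    n = len(gram_matrices[0])
    combined = [[0.0] * n for _ in range(n)]
    for gram, w in zip(gram_matrices, weights):
        for i in range(n):
            for j in range(n):
                combined[i][j] += w * gram[i][j]
    return combined


# Toy Gram matrices over two utterances (hypothetical values).
k_derivative = [[1.0, 0.2], [0.2, 1.0]]
k_parametric = [[1.0, 0.6], [0.6, 1.0]]
k_combined = combine_kernels([k_derivative, k_parametric], [0.5, 0.5])
```

The combined matrix can then be passed to any SVM trainer that accepts a precomputed kernel.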
A Novel Framework and Training Algorithm for Variable-Parameter Hidden Markov Models
- IEEE Trans. on Audio, Speech, and Language Processing
Cited by 8 (6 self)
We propose a new framework and the associated maximum-likelihood and discriminative training algorithms for the variable-parameter hidden Markov model (VPHMM) whose mean and variance parameters vary as functions of additional environment-dependent conditioning parameters. Our framework differs from the VPHMM proposed by Cui and Gong (2007) in that piecewise spline interpolation instead of global polynomial regression is used to represent the dependency of the HMM parameters on the conditioning parameters, and a more effective functional form is used to model the variances. Our framework unifies and extends the conventional discrete VPHMM. It no longer requires quantization in estimating the model parameters and can support both parameter sharing and instantaneous conditioning parameters naturally. We investigate the strengths and weaknesses of the model on the Aurora-3 corpus. We show that under the well-matched condition the proposed discriminatively trained VPHMM outperforms the conventional HMM trained in the same way with relative word error rate (WER) reduction of 19% and 15%, respectively, when only mean is updated and when both mean and variances are updated. Index Terms—Discriminative training, growth transformation, parameter clustering, speech recognition, spline interpolation, variable-parameter hidden Markov model (VPHMM).
Bayesian adaptive inference and adaptive training
- IEEE Transactions on Speech and Audio Processing, 2007
Cited by 7 (5 self)
Large-vocabulary speech recognition systems are often built using found data, such as broadcast news. In contrast to carefully collected data, found data normally contains multiple acoustic conditions, such as speaker or environmental noise. Adaptive training is a powerful approach to build systems on such data. Here, transforms are used to represent the different acoustic conditions, and then a canonical model is trained given this set of transforms. This paper describes a Bayesian framework for adaptive training and inference. This framework addresses some limitations of standard maximum-likelihood approaches. In contrast to the standard approach, the adaptively trained system can be directly used in unsupervised inference, rather than having to rely on initial hypotheses being present. In addition, for limited adaptation data, robust recognition performance can be obtained. The limited data problem often occurs in testing as there is no control over the amount of the adaptation data available. In contrast, for adaptive training, it is possible to control the system complexity to reflect the available data. Thus, the standard point estimates may be used. As the integral associated with Bayesian adaptive inference is intractable, various marginalization approximations are described, including a variational Bayes approximation. Both batch and incremental modes of adaptive inference are discussed. These approaches are applied to adaptive training of maximum-likelihood linear regression and evaluated on a large-vocabulary speech recognition task. Bayesian adaptive inference is shown to significantly outperform standard approaches. Index Terms—Adaptive training, Bayesian adaptation, Bayesian inference, incremental, variational Bayes.
Embedded kernel eigenvoice speaker adaptation and its implication to reference speaker weighting
- IEEE Transactions on Speech and Audio Processing, 2006
Cited by 6 (3 self)
Recently, we proposed an improvement to the conventional eigenvoice (EV) speaker adaptation using kernel methods. In our novel kernel eigenvoice (KEV) speaker adaptation [1], speaker supervectors are mapped to a kernel-induced high dimensional feature space, where eigenvoices are computed using kernel principal component analysis. A new speaker model is then constructed as a linear combination of the leading eigenvoices in the kernel-induced feature space. KEV adaptation was shown to outperform EV, MAP, and MLLR adaptation in a TIDIGITS task with less than 10 s of adaptation speech [2]. Nonetheless, due to many kernel evaluations, both adaptation and subsequent recognition in KEV adaptation are considerably slower than conventional EV adaptation. In this paper, we solve the efficiency problem and eliminate all kernel evaluations involving adaptation or testing observations by finding an approximate preimage of the implicit adapted model found by KEV adaptation in the feature space; we call our new method embedded kernel eigenvoice (eKEV) adaptation. eKEV adaptation is faster than KEV adaptation, and subsequent recognition runs as fast as normal HMM decoding. eKEV adaptation makes use of a multi-dimensional scaling technique so that the resulting adapted model lies in the span of a subset of carefully chosen training speakers. It is related to the reference speaker weighting (RSW) adaptation method that is based on speaker clustering. Our experimental results on Wall Street Journal show that eKEV adaptation continues to outperform EV, MAP, MLLR, and the original RSW method. However, by adopting the way we choose the subset of reference speakers for eKEV adaptation, we may also improve RSW adaptation so that it performs as well as our eKEV adaptation.
Speaker comparison with inner product discriminant functions
- in Advances in NIPS
Cited by 6 (3 self)
Speaker comparison, the process of finding the speaker similarity between two speech signals, occupies a central role in a variety of applications—speaker verification, clustering, and identification. Speaker comparison can be placed in a geometric framework by casting the problem as a model comparison process. For a given speech signal, feature vectors are produced and used to adapt a Gaussian mixture model (GMM). Speaker comparison can then be viewed as the process of compensating and finding metrics on the space of adapted models. We propose a framework, inner product discriminant functions (IPDFs), which extends many common techniques for speaker comparison—support vector machines, joint factor analysis, and linear scoring. The framework uses inner products between the parameter vectors of GMM models motivated by several statistical methods. Compensation of nuisances is performed via linear transforms on GMM parameter vectors. Using the IPDF framework, we show that many current techniques are simple variations of each other. We demonstrate, on a 2006 NIST speaker recognition evaluation task, new scoring methods using IPDFs which produce excellent error rates and require significantly less computation than current techniques.
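To make the notion of an inner product between GMM parameter vectors concrete, the sketch below scores two adapted GMMs by a weighted inner product of their component means, scaled by shared mixture weights and diagonal variances. This is one commonly used supervector-style form, chosen here as an assumption for illustration. It is not necessarily any of the specific IPDFs in the paper, and it omits nuisance compensation. All names and values are hypothetical.

```python
def supervector_inner_product(means_a, means_b, mix_weights, variances):
    """Weighted inner product between the mean parameters of two GMMs.

    means_a, means_b -- per-component mean vectors of the two adapted models
    mix_weights      -- mixture weights shared by both models
    variances        -- shared diagonal variances, one vector per component
    """
    total = 0.0
    for m_a, m_b, lam, var in zip(means_a, means_b, mix_weights, variances):
        total += lam * sum(x * y / v for x, y, v in zip(m_a, m_b, var))
    return total


# Two toy 2-component, 2-dimensional adapted models (hypothetical values).
means_a = [[1.0, 0.0], [0.0, 2.0]]
means_b = [[1.0, 1.0], [0.0, 1.0]]
mix_weights = [0.5, 0.5]
variances = [[1.0, 1.0], [2.0, 2.0]]
score = supervector_inner_product(means_a, means_b, mix_weights, variances)
```

Because the score is a plain inner product, it costs a single pass over the model parameters.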
Adaptive Training for Large Vocabulary Continuous Speech Recognition
- 2006
Cited by 6 (2 self)
In recent years, there has been a trend towards training large vocabulary continuous speech recognition (LVCSR) systems on a large amount of found data. Found data is recorded from spontaneous speech without careful control of the recording acoustic conditions, for example, conversational telephone speech. Hence, it typically has greater variability in terms of speaker and acoustic conditions than specially collected data. Thus, in addition to the desired speech variability required to discriminate between words, it also includes various non-speech variabilities, for example, the change of speakers or acoustic environments. The standard approach to handle this type of data is to train hidden Markov models (HMMs) on the whole data set as if all data comes from a single acoustic condition. This is referred to as multi-style training, for example speaker-independent training. Effectively, the non-speech variabilities are ignored. Though good performance has been obtained with multi-style systems, these systems account for all variabilities. Improvement may be obtained if the two types of variabilities in the found data are modelled separately. Adaptive training has been proposed for this purpose. In contrast to multi-style training, a set of transforms is used to represent the non-speech variabilities. A canonical
Multiple-Cluster Adaptive Training Schemes
- in Proc. ICASSP, 2001
Cited by 5 (1 self)
This paper examines the training of multiple-cluster systems using adaptive training schemes. Various forms of transformation and canonical model are described in a consistent framework, allowing re-estimation formulae for all cases to be simply derived. Initial experiments using these various schemes on a large vocabulary speech recognition task are presented. The initial experiments indicate that achieving the best performance when adapting these multiple-cluster systems requires adaptive training schemes rather than simpler cluster initialisation schemes.

