Results 1-10 of 12
Augmented Statistical Models for Classifying Sequence Data
, 2006
Abstract

Cited by 18 (0 self)
Declaration This dissertation is the result of my own work and includes nothing that is the outcome of work done in collaboration. It has not been submitted in whole or in part for a degree at any other university. Some of the work has been published previously in conference proceedings [66,69], two journal articles [36,68], two workshop papers [35,67] and a technical report [65]. The length of this thesis including appendices, bibliography, footnotes, tables and equations is approximately 60,000 words. This thesis contains 27 figures and 20 tables.
Rao-Blackwellised Gibbs Sampling for Switching Linear Dynamical Systems
 In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2004)
, 2004
Abstract

Cited by 16 (2 self)
This paper describes the application of Rao-Blackwellised Gibbs sampling (RBGS) to speech recognition using switching linear dynamical systems (SLDSs). The SLDS is a hybrid of standard hidden Markov models (HMMs) and linear dynamical systems. It is an extension of the stochastic segment model as it relaxes the assumption of independent segments. SLDSs explicitly take into account the strong coarticulation present in speech. Unfortunately, inference in SLDSs is intractable unless the discrete state sequence is known. RBGS is one approach that may be applied for both improved training and decoding for this form of intractable model. The theory of SLDS and RBGS is described, along with an efficient proposal mechanism. The performance of the SLDS using RBGS for training and inference is evaluated on the ARPA Resource Management task.
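The RBGS idea can be sketched concretely: the continuous state is marginalised analytically with a Kalman filter (the Rao-Blackwellisation step) while the discrete state sequence is Gibbs-sampled. The toy 1-D SLDS below uses made-up parameters and a brute-force conditional; it is a minimal illustration, not the paper's model or its efficient proposal mechanism.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D SLDS: discrete state s_t selects dynamics a[s_t];
#   x_t = a[s_t] * x_{t-1} + w_t,  y_t = x_t + v_t   (hypothetical parameters)
A = np.array([0.9, 0.2])
Q, R = 0.1, 0.1

def log_marginal(y, s):
    """Kalman filter: log p(y | s) with the continuous state integrated out
    (the Rao-Blackwellisation step)."""
    m, P, ll = 0.0, 1.0, 0.0
    for t, yt in enumerate(y):
        a = A[s[t]]
        m, P = a * m, a * a * P + Q            # predict
        S = P + R                              # innovation variance
        ll += -0.5 * (np.log(2 * np.pi * S) + (yt - m) ** 2 / S)
        K = P / S                              # Kalman gain
        m, P = m + K * (yt - m), (1 - K) * P   # update
    return ll

# Simulate observations from a known discrete state sequence
T = 20
s_true = np.array([0] * 10 + [1] * 10)
x, y = 0.0, []
for t in range(T):
    x = A[s_true[t]] * x + rng.normal(0.0, np.sqrt(Q))
    y.append(x + rng.normal(0.0, np.sqrt(R)))
y = np.array(y)

# Gibbs sweeps: resample each s_t from its conditional (uniform discrete prior assumed)
s = rng.integers(0, 2, T)
for _ in range(5):
    for t in range(T):
        logp = np.empty(2)
        for k in range(2):
            s[t] = k
            logp[k] = log_marginal(y, s)
        p = np.exp(logp - logp.max())
        s[t] = rng.choice(2, p=p / p.sum())
```

Each conditional draw scores the full marginal likelihood, so the sampler never has to represent the continuous state explicitly.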
Linear Gaussian models for speech recognition
 CAMBRIDGE UNIVERSITY
, 2004
Abstract

Cited by 15 (0 self)
Currently the most popular acoustic model for speech recognition is the hidden Markov model (HMM). However, HMMs are based on a series of assumptions, some of which are known to be poor. In particular, the assumption that successive speech frames are conditionally independent given the discrete state that generated them is not a good assumption for speech recognition. State space models may be used to address some shortcomings of this assumption. State space models are based on a continuous state vector evolving through time according to a state evolution ...
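A state space model of the kind described can be sketched in a few lines; the dimensions and parameter values below are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(1)

# Minimal linear-Gaussian state space model (illustrative dimensions/values):
#   state evolution:  x_t = A x_{t-1} + w_t,  w_t ~ N(0, Q)
#   observation:      y_t = C x_t + v_t,      v_t ~ N(0, R)
# Successive observations are correlated through the continuous trajectory x_t,
# unlike HMM frames, which are conditionally independent given the discrete state.
dx, dy, T = 2, 3, 50
A = np.array([[0.95, 0.10], [0.00, 0.90]])
C = rng.normal(size=(dy, dx))
Q = 0.05 * np.eye(dx)
R = 0.10 * np.eye(dy)

x = np.zeros(dx)
X, Y = [], []
for _ in range(T):
    x = A @ x + rng.multivariate_normal(np.zeros(dx), Q)
    X.append(x)
    Y.append(C @ x + rng.multivariate_normal(np.zeros(dy), R))
X, Y = np.array(X), np.array(Y)
```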
SVMs, score-spaces and maximum margin statistical models
 in Beyond HMM workshop, ATR
, 2004
Abstract

Cited by 9 (5 self)
There has been significant interest in developing new forms of acoustic model, in particular models which allow additional dependencies to be represented beyond those allowed within a standard hidden Markov model (HMM). This paper discusses one such class of models, augmented statistical models. Here a locally exponential approximation is made about some point on a base distribution. This allows additional dependencies within the data to be modelled beyond those represented in the base distribution. Augmented models based on Gaussian mixture models (GMMs) and HMMs are briefly described. These augmented models are then related to generative kernels, one approach used for allowing support vector machines (SVMs) to be applied to variable length data. The training of augmented statistical models within an SVM, generative kernel, framework is then discussed. This may be viewed as using maximum margin training to estimate statistical models. Augmented Gaussian mixture models are then evaluated using rescoring on a large vocabulary speech recognition task.
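One common generative kernel, the derivative (Fisher-score) kernel, can be sketched as follows: a variable-length sequence is mapped to the fixed-dimension gradient of its log-likelihood under a base GMM, which a standard SVM kernel can then consume. The toy 1-D GMM parameters below are hypothetical.

```python
import numpy as np

# Toy 1-D base GMM (hypothetical parameters)
mu = np.array([-1.0, 1.0])
var = np.array([1.0, 1.0])
w = np.array([0.5, 0.5])

def fisher_score(seq):
    """Map a variable-length sequence to the fixed-dimension gradient of its
    GMM log-likelihood w.r.t. the component means, length-normalised."""
    seq = np.asarray(seq, dtype=float)
    logp = (-0.5 * (seq[:, None] - mu) ** 2 / var
            - 0.5 * np.log(2 * np.pi * var) + np.log(w))
    gamma = np.exp(logp - logp.max(axis=1, keepdims=True))
    gamma /= gamma.sum(axis=1, keepdims=True)       # component posteriors
    return (gamma * (seq[:, None] - mu) / var).sum(axis=0) / len(seq)

def kernel(seq_a, seq_b):
    """Linear generative kernel between sequences of possibly different length."""
    return float(fisher_score(seq_a) @ fisher_score(seq_b))
```

Because the score vector has fixed dimension regardless of sequence length, any standard SVM machinery applies unchanged.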
Temporally varying model parameters for large vocabulary continuous speech recognition
 In Proc. Interspeech
, 2005
Abstract

Cited by 4 (2 self)
Many forms of time-varying acoustic models have been applied to the area of speech recognition. However, there has been little success in applying these models to Large Vocabulary Continuous Speech Recognition (LVCSR). Recently, fMPE was introduced as a discriminative feature space estimation scheme for HMM-based LVCSR. This method estimates a projection matrix from a high dimensional space (∼100,000) down to a standard feature space (typically 39). This projection is then added on to the original feature vector (e.g. MFCC or PLP) to yield a feature vector to train the final model. This paper considers fMPE as a time-varying model for the mean vectors by applying the time-varying feature offset to the Gaussian mean vectors. This approach naturally yields the update formulae for fMPE and motivates an alternative style of training systems. This concept is then extended to temporal precision matrix modelling (pMPE). In pMPE, a temporally varying positive scale is applied to each element of the diagonal precision matrices. Experimental results are presented on an English conversational telephone speech task.
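The fMPE feature construction described above can be sketched as follows; the projection matrix here is random (in practice it is discriminatively trained) and the high-dimensional space is shrunk from ∼100,000 to 1,000 for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

# fMPE-style feature offset. Dimensions are illustrative: the paper quotes a
# ~100,000-dimensional space projected to 39; shrunk to 1,000 here. M is
# random below, whereas fMPE estimates it discriminatively.
d_high, d_feat = 1000, 39
M = 0.01 * rng.normal(size=(d_feat, d_high))

def fmpe_features(x_t, h_t):
    """x_t: original feature frame, e.g. MFCC/PLP (39,);
    h_t: sparse high-dimensional posterior vector (1000,)."""
    return x_t + M @ h_t           # projected offset is *added* to the frame

x_t = rng.normal(size=d_feat)
h_t = np.zeros(d_high)
h_t[rng.choice(d_high, 5, replace=False)] = 0.2   # few active posteriors
y_t = fmpe_features(x_t, h_t)
```

Viewing the same offset as applied to the Gaussian mean vectors, rather than the features, is what motivates the paper's time-varying-mean interpretation.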
Discriminative Semiparametric Trajectory Model for Speech Recognition
Abstract

Cited by 4 (0 self)
Hidden Markov Models (HMMs) are the most commonly used acoustic model for speech recognition. In HMMs, the probability of successive observations is assumed independent given the state sequence. This is known as the conditional independence assumption. Consequently, the temporal (inter-frame) correlations are poorly modelled. This limitation may be reduced by incorporating some form of trajectory modelling. In this paper, a general perspective on trajectory modelling is provided, where time-varying model parameters are used for the Gaussian components. A discriminative semiparametric trajectory model is then described where the Gaussian mean vector and covariance matrix parameters vary with time. The time variation is modelled as a semiparametric function of the observation sequence via a set of centroids in the acoustic space. The model parameters are estimated discriminatively using the Minimum Phone Error (MPE) criterion. The performance of these models is investigated and benchmarked against state-of-the-art CU-HTK Mandarin evaluation systems. Key words: speech recognition, trajectory model, discriminative training, minimum phone error
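A minimal sketch of such a semiparametric time-varying mean: the component mean is offset by bias vectors weighted by centroid posteriors computed from the current observation. All parameters below are illustrative stand-ins for MPE-trained values.

```python
import numpy as np

rng = np.random.default_rng(3)

# Semiparametric time-varying mean: mu_t = mu + sum_i gamma_i(o_t) * b_i,
# where gamma_i are posteriors of centroids in acoustic space.
d, n_centroids = 4, 8
centroids = rng.normal(size=(n_centroids, d))
bias = 0.1 * rng.normal(size=(n_centroids, d))  # would be MPE-estimated
mu = np.zeros(d)

def centroid_posteriors(o_t):
    """Soft assignment of the observation to centroids
    (unit-variance Gaussians assumed for the centroids)."""
    logp = -0.5 * ((o_t - centroids) ** 2).sum(axis=1)
    p = np.exp(logp - logp.max())
    return p / p.sum()

def mean_at_time(o_t):
    g = centroid_posteriors(o_t)
    return mu + g @ bias        # mean varies with the observation sequence

o_t = rng.normal(size=d)
mu_t = mean_at_time(o_t)
```

Because the offsets depend on the observation sequence itself, the effective mean traces a trajectory rather than staying piecewise constant per state.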
Structured Precision Matrix Modelling for Speech Recognition
, 2006
Abstract

Cited by 2 (0 self)
Declaration This dissertation is the result of my own work and includes nothing which is the outcome of the work done in collaboration, except where stated. It has not been submitted in whole or part for a degree at any other university. The length of this thesis including footnotes and appendices is approximately 53,000 words. Summary The most extensively and successfully applied acoustic model for speech recognition is the Hidden Markov Model (HMM). In particular, a multivariate Gaussian Mixture Model (GMM) is typically used to represent the output density function of each HMM state. For reasons of efficiency, the covariance matrix associated with each Gaussian component is assumed diagonal and the probability of successive observations is assumed independent given the HMM state sequence. Consequently, the spectral (intra-frame) and temporal (inter-frame) correlations are poorly modelled. This thesis investigates ways of improving these aspects by extending the standard HMM. Parameters for these extended models are estimated discriminatively using the ...
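One widely used structured-precision form, a basis superposition as in SPAM/EMLLT-style models (which may differ from the specific structures this thesis investigates), can be sketched as follows: each precision matrix is a positive-weighted sum of shared symmetric bases, so intra-frame correlation is modelled with only a few weights per component.

```python
import numpy as np

rng = np.random.default_rng(4)

# Each component precision is a positive-weighted sum of shared symmetric PSD
# basis matrices: P = sum_k lam[k] * S[k]. Values are illustrative.
d, n_basis = 5, 3
basis = np.empty((n_basis, d, d))
for k in range(n_basis):
    B = rng.normal(size=(d, d))
    basis[k] = B @ B.T / d                     # symmetric positive semi-definite
lam = np.abs(rng.normal(size=n_basis)) + 0.1   # per-component weights, kept positive
P = np.einsum('k,kij->ij', lam, basis)         # structured precision matrix
eigvals = np.linalg.eigvalsh(P)                # all positive => valid precision
```

The bases are shared across all Gaussians, so the per-component cost is only the handful of weights rather than a full d×d covariance.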
Kernel Methods for Text-Independent Speaker Verification
, 2010
Abstract

Cited by 2 (0 self)
In recent years, systems based on support vector machines (SVMs) have become standard for speaker verification (SV) tasks. An important aspect of these systems is the dynamic kernel. These operate on sequence data and handle the dynamic nature of the speech. In this thesis a number of techniques are proposed for improving dynamic kernel-based SV systems.

The first contribution of this thesis is the development of alternative forms of dynamic kernel. Several popular dynamic kernels proposed for SV are based on the Kullback-Leibler divergence between Gaussian mixture models. Since this has no closed-form solution, typically a matched-pair upper bound is used instead. This places significant restrictions on the forms of model structure that may be used. In this thesis, dynamic kernels are proposed based on alternative, variational approximations to the divergence. Unlike standard approaches, these allow the use of a more flexible modelling framework. Also, using a more accurate approximation may lead to performance gains.

The second contribution of this thesis is to investigate the combination of multiple systems to improve SV performance. Typically, systems are combined by fusing the output scores. For SVM classifiers, an alternative strategy is to combine at the kernel level. Recently an efficient maximum-margin scheme for learning kernel weights has been developed. In this thesis several modifications are proposed to allow this scheme to be applied to SV tasks.

System combination will only lead to gains when the kernels are complementary. In this thesis it is shown that many commonly used dynamic kernels can be placed into one of two broad classes, derivative and parametric kernels. The attributes of these classes are contrasted and the conditions under which the two forms of kernel are identical are described. By avoiding these conditions gains may be obtained by combining derivative and parametric kernels.

The final contribution of this thesis is to investigate the combination of dynamic kernels with traditional static kernels for vector data. Here two general combination strategies are available: static kernel functions may be defined over the dynamic feature vectors. Alternatively, a static kernel may be applied at the observation level. In general, it is not possible to explicitly train a model in the feature space associated with a static kernel. However, it is shown in this thesis that this form of kernel can be computed by using a suitable metric with approximate component posteriors. Generalised versions of standard parametric and derivative kernels, that include an observation-level static kernel, are proposed based on this approach.
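Kernel-level combination of the kind described can be sketched as a convex weighted sum of normalised Gram matrices; the fixed weights below stand in for the maximum-margin learned weights the thesis proposes, and the two Gram matrices are generic stand-ins for a derivative and a parametric kernel.

```python
import numpy as np

def normalise(K):
    """Scale a Gram matrix to unit self-similarity before combining."""
    d = np.sqrt(np.diag(K))
    return K / np.outer(d, d)

def combine(kernels, weights):
    """Convex weighted sum of normalised Gram matrices; a convex combination
    of positive semi-definite kernels is itself a valid kernel."""
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    return sum(w * normalise(K) for w, K in zip(weights, kernels))

rng = np.random.default_rng(5)
X = rng.normal(size=(6, 4))
K1 = X @ X.T                                            # linear Gram (stand-in)
K2 = np.exp(-((X[:, None] - X[None, :]) ** 2).sum(-1))  # Gaussian Gram (stand-in)
K = combine([K1, K2], [0.7, 0.3])
```

The combined matrix can be passed to any SVM solver in place of a single kernel; gains depend on the component kernels being complementary, as the thesis notes.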
Antti-Veikko Ilmari ROSTI
, 1975
Abstract
Laboratory, University of Cambridge, UK. HTK Rich Audio Transcription project (part of the DARPA EARS programme). Responsibilities include:
- automatic segmentation for conversational telephone speech;
- fast likelihood calculation to speed up training;
- HTK software development.
7/1999 - 9/2000 Researcher, Audio Research Group, Digital Media Institute, Tampere University of Technology, Finland. Creation of Finnish speech databases recorded over a fixed telephone line in SpeechDat(II) and SpeechDat-Car projects (http://www.speechdat.org/). Responsibilities included:
- design and development of tools to create prompt sheets using C, Perl and Unix shell scripts;
- administration of the recording platform in an NT environment;
- design and development of tools to annotate the recordings and create label files using C, Matlab and Unix shell scripts;
- running initial speech recognition experiments for all major European languages using the HTK;
- maintenance and documentation ...
Discriminative Complexity Control and Linear Projections for Large Vocabulary Speech Recognition
, 2005
Abstract
Selecting the optimal model structure with the “appropriate” complexity is a standard problem for training large vocabulary continuous speech recognition (LVCSR) systems, and machine learning in general. State-of-the-art LVCSR systems are highly complex. A wide variety of techniques may be used which alter the system complexity and word error rate (WER). Explicitly evaluating systems for all possible configurations is infeasible. Automatic model complexity control criteria are needed. Most existing complexity control schemes can be classified into two types, Bayesian learning techniques and information theory approaches. An implicit assumption is made in both that increasing the likelihood on held-out data decreases the WER. However, this correlation is found to be quite weak for current speech recognition systems. Hence it is preferable to employ discriminative methods for complexity control. In this thesis a novel discriminative model selection technique, the marginalization of a discriminative growth function, is presented. This is a closer approximation to the true WER than standard likelihood-based approaches. The number of Gaussian components and feature dimensions of an HMM-based LVCSR system is controlled. Experimental results on a wide range of LVCSR tasks showed that ...
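The likelihood-based style of complexity control the thesis argues against can be illustrated with BIC for choosing a Gaussian mixture's component count; the crude hard-assignment fit below is an illustrative stand-in for EM.

```python
import numpy as np

rng = np.random.default_rng(6)

# Bimodal 1-D data: a 2-component mixture is the "right" complexity
data = np.concatenate([rng.normal(-2, 1, 200), rng.normal(2, 1, 200)])

def fit_and_bic(x, k):
    """Crude hard-assignment GMM fit (illustrative stand-in for EM), scored by
    BIC = log-likelihood - 0.5 * n_params * log(N)."""
    centers = np.quantile(x, (np.arange(k) + 0.5) / k)
    for _ in range(20):
        assign = np.abs(x[:, None] - centers).argmin(axis=1)
        centers = np.array([x[assign == j].mean() if (assign == j).any()
                            else centers[j] for j in range(k)])
    assign = np.abs(x[:, None] - centers).argmin(axis=1)
    var = np.array([x[assign == j].var() + 1e-3 if (assign == j).any()
                    else 1.0 for j in range(k)])
    w = np.array([(assign == j).mean() for j in range(k)]) + 1e-6
    w /= w.sum()
    dens = (w * np.exp(-0.5 * (x[:, None] - centers) ** 2 / var)
            / np.sqrt(2 * np.pi * var)).sum(axis=1)
    n_params = 3 * k - 1                      # weights + means + variances
    return np.log(dens).sum() - 0.5 * n_params * np.log(len(x))

bics = {k: fit_and_bic(data, k) for k in (1, 2, 4)}
best_k = max(bics, key=bics.get)
```

BIC trades likelihood against a parameter-count penalty; the thesis's point is that this likelihood-based trade-off correlates only weakly with WER, motivating discriminative selection criteria instead.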