Uncertainty decoding for noise robust speech recognition
in Proc. Interspeech, 2004
Abstract

Cited by 45 (12 self)
This dissertation is the result of my own work and includes nothing which is the outcome of work done in collaboration. It has not been submitted in whole or in part for a degree at any other university. Some of the work has been published previously in conference proceedings
Bayesian Ying Yang system, best harmony learning, and Gaussian manifold based family
 Computational Intelligence: Research Frontiers, WCCI2008 Plenary/Invited Lectures. Lecture Notes in Computer Science
Machine Learning Paradigms for Speech Recognition: An Overview
, 2013
Abstract

Cited by 11 (1 self)
Automatic Speech Recognition (ASR) has historically been a driving force behind many machine learning (ML) techniques, including the ubiquitously used hidden Markov model, discriminative learning, structured sequence learning, Bayesian learning, and adaptive learning. Moreover, ML can and occasionally does use ASR as a large-scale, realistic application to rigorously test the effectiveness of a given technique, and to inspire new problems arising from the inherently sequential and dynamic nature of speech. On the other hand, even though ASR is available commercially for some applications, it is largely an unsolved problem: for almost all applications, the performance of ASR is not on par with human performance. New insight from modern ML methodology shows great promise to advance the state of the art in ASR technology. This overview article provides readers with an overview of modern ML techniques as utilized in current, and as relevant to future, ASR research and systems. The intent is to foster further cross-pollination between the ML and ASR communities than has occurred in the past. The article is organized according to the major ML paradigms that are either popular already or have potential for making significant contributions to ASR technology. The paradigms presented and elaborated in this overview include: generative and discriminative learning; supervised, unsupervised, semi-supervised, and active learning; adaptive and multi-task learning; and Bayesian learning. These learning paradigms are motivated and discussed in the context of ASR technology and applications. We finally present and analyze recent developments of deep learning and learning with sparse representations, focusing on their direct relevance to advancing ASR technology.
Lightly Supervised Discriminative Training of Grapheme Models for Improved Sentence-level Alignment of Speech and Text Data
in Proc. of Interspeech (accepted), 2013
Abstract

Cited by 5 (4 self)
This paper introduces a method for lightly supervised discriminative training using MMI to improve the alignment of speech and text data for use in training HMM-based TTS systems for low-resource languages. In TTS applications, due to the use of long-span contexts, it is important to select training utterances which have wholly correct transcriptions. In a low-resource setting, when using poorly trained grapheme models, we show that the use of MMI discriminative training at the grapheme level enables us to increase the amount of correctly aligned data by 40%, while maintaining a 7% sentence error rate and 0.8% word error rate. We present the procedure for lightly supervised discriminative training with regard to the objective of minimising sentence error rate. Index Terms: automatic alignment, grapheme models, light supervision, MMI, text-to-speech
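The MMI criterion underlying this kind of discriminative training can be sketched numerically: it is the log posterior of the reference transcription against a set of competing hypotheses. The sketch below is a minimal, hypothetical illustration (the function name and the toy scores are not from the paper), not the authors' lattice-based implementation.

```python
import numpy as np

def mmi_objective(num_loglik, den_logliks):
    """MMI criterion for one utterance: log posterior of the reference
    transcription. num_loglik is the combined acoustic + language log
    score of the reference; den_logliks holds the scores of all
    competing hypotheses (including the reference itself)."""
    return num_loglik - np.logaddexp.reduce(den_logliks)

# Toy example: the reference scores -10; competitors score -12 and -15.
ref = -10.0
lattice = np.array([-10.0, -12.0, -15.0])
obj = mmi_objective(ref, lattice)  # approaches 0 as the reference dominates
```

Training then adjusts the model parameters to push this objective towards zero, i.e. to make the reference transcription dominate the competing hypotheses.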
Discriminative Classifiers with Generative Kernels for Noise Robust ASR
Abstract

Cited by 5 (5 self)
Discriminative classifiers are a popular approach to solving classification problems. However, one of the problems with these approaches, in particular kernel-based classifiers such as Support Vector Machines (SVMs), is that they are hard to adapt to mismatches between the training and test data. This paper describes a scheme for overcoming this problem for speech recognition in noise. Generative kernels, defined using generative models, allow SVMs to handle sequence data. By compensating the generative models for the noise conditions, noise-specific generative kernels can be obtained. These can be used to train a noise-independent SVM on a range of noise conditions, which can then be used with a test-set noise kernel for classification. Initial experiments using an idealised version of model-based compensation were run on the AURORA 2.0 continuous digit task. The proposed scheme yielded large gains in performance over the compensated models.
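The core idea of a generative kernel can be sketched in a few lines: map each variable-length sequence to a fixed-length vector of scores under a set of generative models, then take an inner product in that score space. The sketch below is hypothetical; it uses simple diagonal Gaussians as stand-ins for the paper's noise-compensated HMMs, and the names `score_space_features` and `generative_kernel` are illustrative.

```python
import numpy as np

def score_space_features(seq, models):
    """Map a (T, D) sequence to a fixed-length score vector: the
    average per-frame log-likelihood under each generative model.
    Each model is a (mean, variance) pair of a diagonal Gaussian."""
    feats = []
    for mu, var in models:
        # frame-wise diagonal-Gaussian log-likelihood, summed over dims
        ll = -0.5 * (np.log(2 * np.pi * var) + (seq - mu) ** 2 / var)
        feats.append(ll.sum(axis=1).mean())
    return np.array(feats)

def generative_kernel(x, y, models):
    """Linear kernel computed in the generative score space."""
    return score_space_features(x, models) @ score_space_features(y, models)

# Two hypothetical "noise condition" models and two sequences of
# different lengths; the kernel is well defined despite the mismatch.
models = [(np.zeros(2), np.ones(2)), (np.full(2, 3.0), np.ones(2))]
x = np.zeros((5, 2))       # 5 frames near the first model
y = np.full((8, 2), 3.0)   # 8 frames near the second model
k = generative_kernel(x, y, models)
```

Because the score vector has fixed dimensionality regardless of sequence length, any standard kernel machine (e.g. an SVM) can be trained on top of it; noise compensation would enter by adapting the `models` before the features are computed.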
EFFICIENT DECODING WITH GENERATIVE SCORE-SPACES USING THE EXPECTATION SEMIRING
Abstract

Cited by 3 (2 self)
State-of-the-art speech recognisers are usually based on hidden Markov models (HMMs). They model a hidden symbol sequence with a Markov process, with the observations independent given that sequence. These assumptions yield efficient algorithms, but limit the power of the model. An alternative model that allows a wide range of features, including word- and phone-level features, is a log-linear model. To handle, for example, word-level variable-length features, the original feature vectors must be segmented into words. Thus, decoding must find the optimal combination of segmentation of the utterance into words and word sequence. Features must therefore be extracted for each possible segment of audio. For many types of features, this becomes slow. In this paper, long-span features are derived from the likelihoods of word HMMs. Derivatives of the log-likelihoods, which break the Markov assumption, are appended. Previously, decoding with this model took cubic time in the length of the sequence, and longer for higher-order derivatives. This paper shows how to decode in quadratic time. Index Terms: speech recognition, log-linear models, weighted finite-state transducers, expectation semiring
KERNELIZED LOG LINEAR MODELS FOR CONTINUOUS SPEECH RECOGNITION
Abstract

Cited by 1 (1 self)
Large-margin criteria and discriminative models are two effective improvements for HMM-based speech recognition. This paper proposes a large-margin trained log-linear model with kernels for CSR. To avoid explicitly computing in the high-dimensional feature space and to achieve non-linear decision boundaries, a kernel-based training and decoding framework is proposed in this work. To make the system robust to noise, a kernel adaptation scheme is also presented. Previous work in this area is extended in two directions. First, most kernels for CSR focus on measuring the similarity between two observation sequences. The proposed joint kernels define a similarity between two observation-label sequence pairs at the sentence level. Second, this paper addresses how to efficiently employ kernels in large-margin training and decoding with lattices. To the best of our knowledge, this is the first attempt at using large-margin kernel-based log-linear models for CSR. The model is evaluated on a noise-corrupted continuous digit task: AURORA 2.0. Index Terms: log-linear model, large margin, kernel
Information Theoretical Kernels for Generative Embeddings Based on Hidden Markov Models
Abstract

Cited by 1 (1 self)
Many approaches to learning classifiers for structured objects (e.g., shapes) use generative models in a Bayesian framework. However, state-of-the-art classifiers for vectorial data (e.g., support vector machines) are learned discriminatively. A generative embedding is a mapping from the object space into a fixed-dimensional feature space, induced by a generative model which is usually learned from data. The fixed dimensionality of these feature spaces permits the use of state-of-the-art discriminative machines based on vectorial representations, thus bringing together the best of the discriminative and generative paradigms. Using a generative embedding involves two steps: (i) defining and learning the generative model used to build the embedding; (ii) discriminatively learning a (possibly kernel) classifier on the adopted feature space. The literature on generative embeddings is essentially focused on step (i), usually adopting some standard off-the-shelf tool (e.g., an SVM with a linear or RBF kernel) for step (ii). In this paper, we follow a different route, by combining several hidden-Markov-model-based generative embeddings (including the classical Fisher score) with the recently proposed non-extensive information-theoretic kernels. We test this methodology on a 2D shape recognition task, showing that the proposed method is competitive with the state of the art.
unknown title
Abstract
This chapter builds upon the reviews in the previous chapter on aspects of probability theory and statistics, including random variables and Gaussian mixture models, and extends the reviews to the Markov chain and the hidden Markov sequence or model (HMM). Central to the HMM is the concept of state, which is itself a random variable typically taking discrete values. Extending from a Markov chain to an HMM involves adding uncertainty or a statistical distribution on each of the states in the Markov chain. Hence, an HMM is a doubly-stochastic process, or probabilistic function of a Markov chain. When the state of the Markov sequence or HMM is confined to be discrete and the distributions associated with the HMM states do not overlap, we reduce it to a Markov chain. This chapter covers several key aspects of the HMM, including its parametric characterization, its simulation by random number generators, its likelihood evaluation, its parameter estimation via the EM algorithm, and its state decoding via the Viterbi algorithm or a dynamic programming procedure. We then provide discussions on the use of the HMM as a generative model for speech feature sequences and its use as the basis for speech recognition. Finally, we discuss the limitations of the HMM, leading to its various extended versions, where each state is made associated with a dynamic system or a hidden time-varying trajectory instead of with a temporally independent stationary distribution such as a Gaussian mixture. These variants of the HMM with state-conditioned dynamic systems expressed in the state-space formulation are introduced as a generative counterpart of the recurrent neural networks to be described in detail in Chapter ??.
THE BLAME GAME IN MEETING ROOM ASR: AN ANALYSIS OF FEATURE VERSUS MODEL ERRORS IN NOISY AND MISMATCHED CONDITIONS
Abstract
Given a test waveform, state-of-the-art ASR systems extract a sequence of MFCC features and decode them with a set of trained HMMs. When this test data is clean, and it matches the condition used for training the models, then there are few errors. While it is known that ASR systems are brittle in noisy or mismatched conditions, there has been little work in quantitatively attributing the errors to features or to models. This paper attributes the sources of these errors in three conditions: (a) matched near-field, (b) matched far-field, and (c) a mismatched condition. We undertake a series of diagnostic analyses employing the bootstrap method to probe a meeting-room ASR system. Results show that when the conditions are matched (even if they are far-field), the model errors dominate; however, in mismatched conditions the features are neither invariant nor separable, and this causes as many errors as the model does.