Results 1 - 10
of
40
Latent-Dynamic Discriminative Models for Continuous Gesture Recognition
"... Many problems in vision involve the prediction of a class label for each frame in an unsegmented sequence. In this paper, we develop a discriminative framework for simultaneous sequence segmentation and labeling which can capture both intrinsic and extrinsic class dynamics. Our approach incorporates ..."
Abstract
-
Cited by 18 (0 self)
- Add to MetaCart
Many problems in vision involve the prediction of a class label for each frame in an unsegmented sequence. In this paper, we develop a discriminative framework for simultaneous sequence segmentation and labeling which can capture both intrinsic and extrinsic class dynamics. Our approach incorporates hidden state variables which model the sub-structure of a class sequence and learn dynamics between class labels. Each class label has a disjoint set of associated hidden states, which enables efficient training and inference in our model. We evaluated our method on the task of recognizing human gestures from unsegmented video streams and performed experiments on three different datasets of head and eye gestures. Our results demonstrate that our model compares favorably to Support Vector Machines, Hidden Markov Models, and Conditional Random Fields on visual gesture recognition tasks.
Augmented statistical models for speech recognition
- in Proc. ICASSP
, 2006
"... Recently there has been significant interest in developing new acoustic models for speech recognition. One such model, that allows complex dependencies to be represented, is the augmented statistical model. This incorporates additional dependencies by constructing a local exponential expansion of a ..."
Abstract
-
Cited by 16 (9 self)
- Add to MetaCart
Recently there has been significant interest in developing new acoustic models for speech recognition. One such model, that allows complex dependencies to be represented, is the augmented statistical model. This incorporates additional dependencies by constructing a local exponential expansion of a standard HMM. Unfortunately, the resulting model often has an intractable normalisation term, rendering training difficult for all but binary classification tasks. In this paper, conditional augmented (C-Aug) models are proposed as an attractive alternative. Instead of modelling utterance likelihoods and inferring decision boundaries, C-Aug models directly model the posterior probability of class labels, conditioned on the utterance. The resulting model is easy to normalise and can be trained using conditional maximum likelihood estimation. In addition, as a convex model, the optimisation converges to a global maximum. 1.
Minimizing and learning energy functions for side-chain prediction
- In RECOMB2007
, 2007
"... Side-chain prediction is an important subproblem of the general protein folding problem. Despite much progress in side-chain prediction, performance is far from satisfactory. As an example, the ROSETTA protocol that uses simulated annealing to select the minimum energy conformations, correctly predi ..."
Abstract
-
Cited by 13 (0 self)
- Add to MetaCart
Side-chain prediction is an important subproblem of the general protein folding problem. Despite much progress in side-chain prediction, performance is far from satisfactory. As an example, the ROSETTA protocol that uses simulated annealing to select the minimum energy conformations, correctly predicts the first two side-chain angles for approximately 72 % of the buried residues in a standard data set. Is further improvement more likely to come from better search methods, or from better energy functions? Given that exact minimization of the energy is NP hard, it is difficult to get a systematic answer to this question. In this paper, we present a novel search method and a novel method for learning energy functions from training data that are both based on Tree Reweighted Belief Propagation (TRBP). We find that TRBP can find the global optimum of the ROSETTA energy function in a few minutes of computation for approximately 85 % of the proteins in a standard benchmark set. TRBP can also effectively bound the partition function which enables using the Conditional Random Fields (CRF) framework for learning. Interestingly, finding the global minimum does not significantly improve side-chain prediction for
Discriminative classifiers with adaptive kernels for noise robust speech recognition
- Comput. Speech Lang
, 2010
"... Discriminative classifiers are a popular approach to solving classification problems. However one of the problems with these approaches, in particular kernel based classifiers such as Support Vector Machines (SVMs), is that they are hard to adapt to mismatches between the training and test data. Thi ..."
Abstract
-
Cited by 12 (10 self)
- Add to MetaCart
Discriminative classifiers are a popular approach to solving classification problems. However one of the problems with these approaches, in particular kernel based classifiers such as Support Vector Machines (SVMs), is that they are hard to adapt to mismatches between the training and test data. This paper describes a scheme for overcoming this problem for speech recognition in noise by adapting the kernel rather than the SVM decision boundary. Generative kernels, defined using generative models, are one type of kernel that allows SVMs to handle sequence data. By compensating the parameters of the generative models for each noise condition noise-specific generative kernels can be obtained. These can be used to train a noiseindependent SVM on a range of noise conditions, which can then be used with a test-set noise kernel for classification. The noise-specific kernels used in this paper are based on Vector Taylor Series (VTS) model-based compensation. VTS allows all the model parameters to be compensated and the background noise to be estimated in a maximum likelihood fashion. A brief discussion of VTS, and the optimisation of the mismatch function representing the impact of noise on the clean speech, is also included. Experiments using these VTS-based test-set noise kernels were run on the AURORA 2 continuous digit task. The proposed SVM rescoring scheme yields large gains in performance over the VTS compensated models. Key words: speech recognition, noise robustness, support vector machines, generative kernels
Speech Recognition Using Augmented Conditional Random Fields
"... Abstract—Acoustic modeling based on hidden Markov models (HMMs) is employed by state-of-the-art stochastic speech recognition systems. Although HMMs are a natural choice to warp the time axis and model the temporal phenomena in the speech signal, their conditional independence properties limit their ..."
Abstract
-
Cited by 11 (0 self)
- Add to MetaCart
Abstract—Acoustic modeling based on hidden Markov models (HMMs) is employed by state-of-the-art stochastic speech recognition systems. Although HMMs are a natural choice to warp the time axis and model the temporal phenomena in the speech signal, their conditional independence properties limit their ability to model spectral phenomena well. In this paper, a new acoustic modeling paradigm based on augmented conditional random fields (ACRFs) is investigated and developed. This paradigm addresses some limitations of HMMs while maintaining many of the aspects which have made them successful. In particular, the acoustic modeling problem is reformulated in a data driven, sparse, augmented space to increase discrimination. Acoustic context modeling is explicitly integrated to handle the sequential phenomena of the speech signal. We present an efficient framework for estimating these models that ensures scalability and generality. In the TIMIT
Context-Dependent Pre-trained Deep Neural Networks for Large Vocabulary Speech Recognition
- IEEE Transactions on Audio, Speech, and Language Processing
, 2012
"... Abstract—We propose a novel context-dependent (CD) model for large vocabulary speech recognition (LVSR) that leverages recent advances in using deep belief networks for phone recognition. We describe a pre-trained deep neural network hidden Markov model (DNN-HMM) hybrid architecture that trains the ..."
Abstract
-
Cited by 9 (1 self)
- Add to MetaCart
Abstract—We propose a novel context-dependent (CD) model for large vocabulary speech recognition (LVSR) that leverages recent advances in using deep belief networks for phone recognition. We describe a pre-trained deep neural network hidden Markov model (DNN-HMM) hybrid architecture that trains the DNN to produce a distribution over senones (tied triphone states) as its output. The deep belief network pre-training algorithm is a robust and often helpful way to initialize deep neural networks generatively that can aid in optimization and reduce generalization error. We illustrate the key components of our model, describe the procedure for applying CD-DNN-HMMs to LVSR, and analyze the effects of various modeling choices on performance. Experiments on a challenging business search dataset demonstrate that CD-DNN-HMMs can significantly outperform the conventional context-dependent Gaussian mixture model (GMM)-HMMs, with an absolute sentence accuracy improvement of 5.8 % and 9.2 % (or relative error reduction of 16.0 % and 23.2%) over the CD-GMM-HMMs trained using the minimum phone error rate (MPE) and maximum likelihood (ML) criteria, respectively. Index Terms—Speech recognition, deep belief network, context-dependent phone, LVSR, DNN-HMM, ANN-HMM I.
Noise Robust Phonetic Classification with Linear Regularized Least Squares and Second-Order Features
- PROC. ICASSP 2007, UNDER SUBMISSION
, 2007
"... We perform phonetic classification with an architecture whose elements are binary classifiers trained via linear regularized least squares (RLS). RLS is a simple yet powerful regularization algorithm with the desirable property that a good value of the regularization parameter can be found efficient ..."
Abstract
-
Cited by 7 (3 self)
- Add to MetaCart
We perform phonetic classification with an architecture whose elements are binary classifiers trained via linear regularized least squares (RLS). RLS is a simple yet powerful regularization algorithm with the desirable property that a good value of the regularization parameter can be found efficiently by minimizing leave-one-out error on the training set. Our system achieves state-of-the-art single classifier performance on the TIMIT phonetic classification task, (slightly) beating other recent systems. We also show that in the presence of additive noise, our model is much more robust than a well-trained Gaussian mixture model.
A flat direct model for speech recognition
- in Proc. ICASSP
, 2009
"... We introduce a direct model for speech recognition that assumes an unstructured, i.e., flat text output. The flat model allows us to model arbitrary attributes and dependences of the output. This is different from the HMMs typically used for speech recognition. This conventional modeling approach is ..."
Abstract
-
Cited by 7 (7 self)
- Add to MetaCart
We introduce a direct model for speech recognition that assumes an unstructured, i.e., flat text output. The flat model allows us to model arbitrary attributes and dependences of the output. This is different from the HMMs typically used for speech recognition. This conventional modeling approach is based on sequential data and makes rigid assumptions on the dependences. HMMs have proven to be convenient and appropriate for large vocabulary continuous speech recognition. Our task under consideration, however, is the Windows Live Search for Mobile (WLS4M) task [1]. This is a cellphone application that allows users to interact with web-based information portals. In particular, the set of valid outputs can be considered discrete and finite (although probably large, i.e., unseen events are an issue). Hence, a flat direct model lends itself to this task, making the adding of different knowledge sources and dependences straightforward and cheap. Using e.g. HMM posterior, m-gram, and spotter features, significant improvements over the conventional HMM system were observed. Index Terms — maximum entropy, language model, nearest neighbor, voice search, speech recognition
Augmented Statistical Models for Classifying Sequence Data
, 2006
"... Declaration This dissertation is the result of my own work and includes nothing that is the outcome of work done in collaboration. It has not been submitted in whole or in part for a degree at any other university. Some of the work has been published previously in conference proceedings [66,69], two ..."
Abstract
-
Cited by 7 (0 self)
- Add to MetaCart
Declaration This dissertation is the result of my own work and includes nothing that is the outcome of work done in collaboration. It has not been submitted in whole or in part for a degree at any other university. Some of the work has been published previously in conference proceedings [66,69], two journal articles [36,68], two workshop papers [35,67] and a tech-nical report [65]. The length of this thesis including appendices, bibliography, footnotes, tables and equations is approximately 60,000 words. This thesis contains 27 figures and 20 tables. i
Training algorithms for hidden conditional random fields
- In Proc. ICASSP
, 2006
"... We investigate algorithms for training hidden conditional random fields (HCRFs) – a class of direct models with hidden state sequences. We compare stochastic gradient ascent with the RProp algorithm, and investigate stochastic versions of RProp. We propose a new scheme for model flattening, and com ..."
Abstract
-
Cited by 6 (3 self)
- Add to MetaCart
We investigate algorithms for training hidden conditional random fields (HCRFs) – a class of direct models with hidden state sequences. We compare stochastic gradient ascent with the RProp algorithm, and investigate stochastic versions of RProp. We propose a new scheme for model flattening, and compare it to the state of the art. Finally we give experimental results on the TIMIT phone classification task showing how these training options interact, comparing HCRFs to HMMs trained using extended Baum-Welch as well as

