Results 1 - 10
of
27
Large Scale Discriminative Training For Speech Recognition
, 2000
"... This paper describes, and evaluates on a large scale, the lattice based framework for discriminative training of large vocabulary speech recognition systems based on Gaussian mixture hidden Markov models (HMMs). The paper concentrates on the maximum mutual information estimation (MMIE) criterion whi ..."
Abstract
-
Cited by 58 (5 self)
- Add to MetaCart
This paper describes, and evaluates on a large scale, the lattice based framework for discriminative training of large vocabulary speech recognition systems based on Gaussian mixture hidden Markov models (HMMs). The paper concentrates on the maximum mutual information estimation (MMIE) criterion which has been used to train HMM systems for conversational telephone speech transcription using up to 265 hours of training data. These experiments represent the largest-scale application of discriminative training techniques for speech recognition of which the authors are aware, and have led to significant reductions in word error rate for both triphone and quinphone HMMs compared to our best models trained using maximum likelihood estimation. The MMIE latticebased implementation used; techniques for ensuring improved generalisation; and interactions with maximum likelihood based adaptation are all discussed. Furthermore several variations to the MMIE training scheme are introduced with the a...
Speech Recognition Using Augmented Conditional Random Fields
"... Abstract—Acoustic modeling based on hidden Markov models (HMMs) is employed by state-of-the-art stochastic speech recognition systems. Although HMMs are a natural choice to warp the time axis and model the temporal phenomena in the speech signal, their conditional independence properties limit their ..."
Abstract
-
Cited by 11 (0 self)
- Add to MetaCart
Abstract—Acoustic modeling based on hidden Markov models (HMMs) is employed by state-of-the-art stochastic speech recognition systems. Although HMMs are a natural choice to warp the time axis and model the temporal phenomena in the speech signal, their conditional independence properties limit their ability to model spectral phenomena well. In this paper, a new acoustic modeling paradigm based on augmented conditional random fields (ACRFs) is investigated and developed. This paradigm addresses some limitations of HMMs while maintaining many of the aspects which have made them successful. In particular, the acoustic modeling problem is reformulated in a data driven, sparse, augmented space to increase discrimination. Acoustic context modeling is explicitly integrated to handle the sequential phenomena of the speech signal. We present an efficient framework for estimating these models that ensures scalability and generality. In the TIMIT
Discriminative Training of Gaussian Mixtures for Image Object Recognition
- DAGM Symposium Mustererkennung
, 1999
"... In this paper we present a discriminative training procedure for Gaussian mixture densities. Conventional maximum likelihood (ML) training of such mixtures proved to be very efficient for object recognition, even though each class is treated separately in training. Discriminative criteria offer the ..."
Abstract
-
Cited by 10 (6 self)
- Add to MetaCart
In this paper we present a discriminative training procedure for Gaussian mixture densities. Conventional maximum likelihood (ML) training of such mixtures proved to be very efficient for object recognition, even though each class is treated separately in training. Discriminative criteria offer the advantage that they also use out-of-class data, that is they aim at optimizing class separability. We present results on the US Postal Service (USPS) handwritten digits database and compare the discriminative results to those obtained by ML training. We also compare our best results with those reported by other groups, proving them to be state-of-the-art.
Relevance feedback from eye movements for proactive information retrieval
- WORKSHOP ON PROCESSING SENSORY INFORMATION FOR PROACTIVE SYSTEMS (PSIPS 2004
, 2004
"... We study whether it is possible to infer from eye movements measured during reading what is relevant for the user in an information retrieval task. Inference is made using hidden Markov and discriminative hidden Markov models. The result of this feasibility study is that prediction of relevance is p ..."
Abstract
-
Cited by 8 (6 self)
- Add to MetaCart
We study whether it is possible to infer from eye movements measured during reading what is relevant for the user in an information retrieval task. Inference is made using hidden Markov and discriminative hidden Markov models. The result of this feasibility study is that prediction of relevance is possible to a certain extent, and models benefit from taking into account the time series nature of the data.
Minimum classification error training of landmark models for real-time continuous speech recognition
- in Proc. IEEE ICASSP
, 2004
"... Though many studies have shown the effectiveness of the Minimum Classification Error (MCE) approach to discriminative training of HMMs for speech recognition, few if any have reported MCE results for large (> 100 hours) training sets in the context of real-world, continuous speech recognition. Here ..."
Abstract
-
Cited by 7 (2 self)
- Add to MetaCart
Though many studies have shown the effectiveness of the Minimum Classification Error (MCE) approach to discriminative training of HMMs for speech recognition, few if any have reported MCE results for large (> 100 hours) training sets in the context of real-world, continuous speech recognition. Here we report large gains in performance for the MIT JUPITER weather information task as a result of MCE-based batch optimization of acoustic models. Investigation of word error rate vs. computation time showed that small MCE models significantly outperform the Maximum Likelihood (ML) baseline at all points of equal computation time, resulting in up to 20 % word error rate reduction for in-vocabulary utterances. The overall MCE loss function was minimized using Quickprop, a simple but effective second-order optimization method suited to parallelization over large training sets.
Augmented Statistical Models for Classifying Sequence Data
, 2006
"... Declaration This dissertation is the result of my own work and includes nothing that is the outcome of work done in collaboration. It has not been submitted in whole or in part for a degree at any other university. Some of the work has been published previously in conference proceedings [66,69], two ..."
Abstract
-
Cited by 7 (0 self)
- Add to MetaCart
Declaration This dissertation is the result of my own work and includes nothing that is the outcome of work done in collaboration. It has not been submitted in whole or in part for a degree at any other university. Some of the work has been published previously in conference proceedings [66,69], two journal articles [36,68], two workshop papers [35,67] and a tech-nical report [65]. The length of this thesis including appendices, bibliography, footnotes, tables and equations is approximately 60,000 words. This thesis contains 27 figures and 20 tables. i
Using Phase Spectrum Information For Improved Speech Recognition Performance
, 2001
"... In this work, new acoustic features for continuous speech recognition based on the short-term Fourier phase spectrum are introduced for mono (telephone) recordings. The new phase based features were combined with standard Mel Frequency Cepstral Coefficients (MFCC), and results were produced with and ..."
Abstract
-
Cited by 6 (1 self)
- Add to MetaCart
In this work, new acoustic features for continuous speech recognition based on the short-term Fourier phase spectrum are introduced for mono (telephone) recordings. The new phase based features were combined with standard Mel Frequency Cepstral Coefficients (MFCC), and results were produced with and without using additional linear discriminant analysis (LDA) to choose the most relevant features. Experiments were performed on the SieTill corpus for telephone line recorded German digit strings. Using LDA to combine purely phase based features with MFCCs, we obtained improvements in word error rate of up to 25% relative to using MFCCs alone with the same overall number of parameters in the system.
Adaptive Training for Large Vocabulary Continuous Speech Recognition
, 2006
"... Summary In recent years, there has been a trend towards training large vocabulary continuous speech recognition (LVCSR) systems on a large amount of found data. Found data is recorded from spontaneous speech without careful control of the recording acoustic conditions, for example, conversational te ..."
Abstract
-
Cited by 6 (2 self)
- Add to MetaCart
Summary In recent years, there has been a trend towards training large vocabulary continuous speech recognition (LVCSR) systems on a large amount of found data. Found data is recorded from spontaneous speech without careful control of the recording acoustic conditions, for example, conversational telephone speech. Hence, it typically has greater variability in terms of speaker and acoustic conditions than specially collected data. Thus, in addition to the desired speech variability required to discriminate between words, it also includes various non-speech variabil-ities, for example, the change of speakers or acoustic environments. The standard approach to handle this type of data is to train hidden Markov models (HMMs) on the whole data set as if all data comes from a single acoustic condition. This is referred to as multi-style training, for exam-ple speaker-independent training. Effectively, the non-speech variabilities are ignored. Though good performance has been obtained with multi-style systems, these systems account for all variabilities. Improvement may be obtained if the two types of variabilities in the found data are modelled separately. Adaptive training has been proposed for this purpose. In contrast to multi-style training, a set of transforms is used to represent the non-speech variabilities. A canonical
Patch-based object recognition using discriminatively trained gaussian mixtures
, 2006
"... We present an approach using Gaussian mixture models for part-based object recognition where spatial relationships of the parts are explicitly modeled and parameters of the generative model are tuned discriminatively. These extensions lead to great improvements of the classification accuracy. Furthe ..."
Abstract
-
Cited by 4 (1 self)
- Add to MetaCart
We present an approach using Gaussian mixture models for part-based object recognition where spatial relationships of the parts are explicitly modeled and parameters of the generative model are tuned discriminatively. These extensions lead to great improvements of the classification accuracy. Furthermore we evaluate several improvements over our baseline system which incrementally improve the obtained results which compare favorable well to other published results for the three Caltech tasks and the PASCAL evaluation 05 tasks. 1
Discriminative Training for Large-Vocabulary Speech Recognition Using Minimum Classification Error
"... Abstract—The minimum classification error (MCE) framework for discriminative training is a simple and general formalism for directly optimizing recognition accuracy in pattern recognition problems. The framework applies directly to the optimization of hidden Markov models (HMMs) used for speech reco ..."
Abstract
-
Cited by 2 (0 self)
- Add to MetaCart
Abstract—The minimum classification error (MCE) framework for discriminative training is a simple and general formalism for directly optimizing recognition accuracy in pattern recognition problems. The framework applies directly to the optimization of hidden Markov models (HMMs) used for speech recognition problems. However, few if any studies have reported results for the application of MCE training to large-vocabulary, continuous-speech recognition tasks. This article reports significant gains in recognition performance and model compactness as a result of discriminative training based on MCE training applied to HMMs, in the context of three challenging large-vocabulary (up to 100 k word) speech recognition tasks: the Corpus of Spontaneous Japanese lecture speech transcription task, a telephone-based name recognition task, and the MIT JUPITER telephone-based conversational weather information task. On these tasks, starting from maximum likelihood (ML) baselines, MCE training yielded relative reductions in word error ranging from 7 % to 20%. Furthermore, this paper evaluates the use of different methods for optimizing the MCE criterion function, as well as the use of precomputed recognition lattices to speed up training. An overview of the MCE framework is given, with an emphasis on practical implementation issues. Index Terms—Discriminative training, pattern recognition, speech recognition. I.

