Results 1 - 10
of
14
Contrastive estimation: Training log-linear models on unlabeled data
- In Proc. of ACL
, 2005
"... Conditional random fields (Lafferty et al., 2001) are quite effective at sequence labeling tasks like shallow parsing (Sha and Pereira, 2003) and namedentity extraction (McCallum and Li, 2003). CRFs are log-linear, allowing the incorporation of arbitrary features into the model. To train on unlabele ..."
Abstract
-
Cited by 89 (11 self)
- Add to MetaCart
Conditional random fields (Lafferty et al., 2001) are quite effective at sequence labeling tasks like shallow parsing (Sha and Pereira, 2003) and namedentity extraction (McCallum and Li, 2003). CRFs are log-linear, allowing the incorporation of arbitrary features into the model. To train on unlabeled data, we require unsupervised estimation methods for log-linear models; few exist. We describe a novel approach, contrastive estimation. We show that the new technique can be intuitively understood as exploiting implicit negative evidence and is computationally efficient. Applied to a sequence labeling problem—POS tagging given a tagging dictionary and unlabeled text—contrastive estimation outperforms EM (with the same feature set), is more robust to degradations of the dictionary, and can largely recover by modeling additional features. 1
Large Scale Discriminative Training For Speech Recognition
, 2000
"... This paper describes, and evaluates on a large scale, the lattice based framework for discriminative training of large vocabulary speech recognition systems based on Gaussian mixture hidden Markov models (HMMs). The paper concentrates on the maximum mutual information estimation (MMIE) criterion whi ..."
Abstract
-
Cited by 58 (5 self)
- Add to MetaCart
This paper describes, and evaluates on a large scale, the lattice based framework for discriminative training of large vocabulary speech recognition systems based on Gaussian mixture hidden Markov models (HMMs). The paper concentrates on the maximum mutual information estimation (MMIE) criterion which has been used to train HMM systems for conversational telephone speech transcription using up to 265 hours of training data. These experiments represent the largest-scale application of discriminative training techniques for speech recognition of which the authors are aware, and have led to significant reductions in word error rate for both triphone and quinphone HMMs compared to our best models trained using maximum likelihood estimation. The MMIE latticebased implementation used; techniques for ensuring improved generalisation; and interactions with maximum likelihood based adaptation are all discussed. Furthermore several variations to the MMIE training scheme are introduced with the a...
Guiding unsupervised grammar induction using contrastive estimation
- In Proc. of IJCAI Workshop on Grammatical Inference Applications
, 2005
"... We describe a novel training criterion for probabilistic grammar induction models, contrastive estimation [Smith and Eisner, 2005], which can be interpreted as exploiting implicit negative evidence and includes a wide class of likelihood-based objective functions. This criterion is a generalization ..."
Abstract
-
Cited by 21 (6 self)
- Add to MetaCart
We describe a novel training criterion for probabilistic grammar induction models, contrastive estimation [Smith and Eisner, 2005], which can be interpreted as exploiting implicit negative evidence and includes a wide class of likelihood-based objective functions. This criterion is a generalization of the function maximized by the Expectation-Maximization algorithm [Dempster et al., 1977]. CE is a natural fit for log-linear models, which can include arbitrary features but for which EM is computationally difficult. We show that, using the same features, log-linear dependency grammar models trained using CE can drastically outperform EMtrained generative models on the task of matching human linguistic annotations (the MATCHLIN-GUIST task). The selection of an implicit negative evidence class—a “neighborhood”—appropriate to a given task has strong implications, but a good neighborhood one can target the objective of grammar induction to a specific application. 1
Frame Discrimination Training Of HMMs For Large Vocabulary Speech Recognition
- Proc. ICASSP’99
, 1999
"... This paper describes the application of a discriminative HMM parameter estimation technique called Frame Discrimination (FD), to medium and large vocabulary continuous speech recognition. Previous work has shown that FD training can give better results than maximum mutual information (MMI) training ..."
Abstract
-
Cited by 20 (4 self)
- Add to MetaCart
This paper describes the application of a discriminative HMM parameter estimation technique called Frame Discrimination (FD), to medium and large vocabulary continuous speech recognition. Previous work has shown that FD training can give better results than maximum mutual information (MMI) training for small tasks. The use of FD for much larger tasks required the development of a technique to be able to rapidly find the most likely set of Gaussians for each frame in the system. Experiments on the Resource Management and North American Business tasks show that FD training can give comparable improvements to MMI, but is less computationally intensive. 1. INTRODUCTION Previous research has shown that the accuracy of a speech recognition system trained using Maximum Likelihood Estimation (MLE) can often be improved further using discriminative training. All such techniques normally give much greater improvements in recognition accuracy on the training data than on the test set except wh...
Novel Estimation Methods for Unsupervised Discovery of Latent Structure in Natural Language Text
, 2006
"... This thesis is about estimating probabilistic models to uncover useful hidden structure in data; specifically, we address the problem of discovering syntactic structure in natural language text. We present three new parameter estimation techniques that generalize the standard approach, maximum likel ..."
Abstract
-
Cited by 20 (7 self)
- Add to MetaCart
This thesis is about estimating probabilistic models to uncover useful hidden structure in data; specifically, we address the problem of discovering syntactic structure in natural language text. We present three new parameter estimation techniques that generalize the standard approach, maximum likelihood estimation, in different ways. Contrastive estimation maximizes the conditional probability of the observed data given a “neighborhood” of implicit negative examples. Skewed deterministic annealing locally maximizes likelihood using a cautious parameter search strategy that starts with an easier optimization problem than likelihood, and iteratively moves to harder problems, culminating in likelihood. Structural annealing is similar, but starts with a heavy bias toward simple syntactic structures and gradually relaxes the bias. Our estimation methods do not make use of annotated examples. We consider their performance in both an unsupervised model selection setting, where models trained under different initialization and regularization settings are compared by evaluating the training objective on a small set of unseen, unannotated development data, and supervised model selection, where the most accurate model on the development set (now with annotations)
Large Scale Mmie Training For Conversational Telephone Speech Recognition
, 2000
"... This paper describes a lattice-based framework for maximum mutual information estimation (MMIE) of HMM parameters which has been used to train HMM systems for conversational telephone speech transcription using up to 265 hours of training data. These experiments represent the largest-scale applicati ..."
Abstract
-
Cited by 19 (1 self)
- Add to MetaCart
This paper describes a lattice-based framework for maximum mutual information estimation (MMIE) of HMM parameters which has been used to train HMM systems for conversational telephone speech transcription using up to 265 hours of training data. These experiments represent the largest-scale application of discriminative training techniques for speech recognition of which the authors are aware, and have led to significant reductions in word error rate for both triphone and quinphone HMMs compared to our best models trained using maximum likelihood estimation. The use of MMIE training was a key contributer to the performance of the CU-HTK March 2000 Hub5 evaluation system. 1 INTRODUCTION The model parameters in HMM based speech recognition systems are normally estimated using Maximum Likelihood Estimation (MLE). If certain conditions hold, including model correctness, then MLE can be shown to be optimal. However, when estimating the parameters of HMM-based speech recognisers, the true d...
The CU-HTK March 2000 Hub5E Transcription System
, 2000
"... This paper describes the Cambridge University HTK (CU-HTK) system developed for the NIST March 2000 evaluation of English conversational telephone speech transcription (Hub5E). A range of new features have been added to the HTK system used in the 1998 Hub5 evaluation, and the changes taken together ..."
Abstract
-
Cited by 18 (1 self)
- Add to MetaCart
This paper describes the Cambridge University HTK (CU-HTK) system developed for the NIST March 2000 evaluation of English conversational telephone speech transcription (Hub5E). A range of new features have been added to the HTK system used in the 1998 Hub5 evaluation, and the changes taken together have resulted in an 11% relative decrease in word error rate on the 1998 evaluation test set. Major changes include the use of maximum mutual information estimation in training as well as conventional maximum likelihood estimation; the use of a full variance transform for adaptation; the inclusion of unigram pronunciation probabilities; and word-level posterior probability estimation using confusion networks for use in minimum word error rate decoding, confidence score estimation and system combination. On the March 2000 Hub5 evaluation set the CU-HTK system gave an overall word error rate of 25.4%, which was the best performance by a statistically significant margin. This paper describes th...
Improved Discriminative Training Techniques for Large Vocabulary Continuous Speech Recognition
- IEEE ICASSP'01
, 2001
"... This paper investigates the use of discriminative training techniques for large vocabulary speech recogntion with training datasets up to 265 hours. Techniques for improving lattice-based Maximum Mutual Information Estimation (MMIE) training are described and compared to Frame Discrimination (FD). ..."
Abstract
-
Cited by 7 (0 self)
- Add to MetaCart
This paper investigates the use of discriminative training techniques for large vocabulary speech recogntion with training datasets up to 265 hours. Techniques for improving lattice-based Maximum Mutual Information Estimation (MMIE) training are described and compared to Frame Discrimination (FD). An objective function which is an interpolation of MMIE and standard Maximum Likelihood Estimation (MLE) is also discussed. Experimental results on both the Switchboard and North American Business News tasks show that MMIE training can yield significant performance improvements over standard MLE even for the most complex speech recognition problems with very large training sets.
A Discriminative Training Algorithm for Hidden Markov Models
- IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING
, 2004
"... We introduce a discriminative training algorithm for the estimation of hidden Markov model (HMM) parameters. This algorithm is based on an approximation of the maximum mutual information (MMI) objective function and its maximization in a technique similar to the expectation-maximization (EM) algorit ..."
Abstract
-
Cited by 4 (0 self)
- Add to MetaCart
We introduce a discriminative training algorithm for the estimation of hidden Markov model (HMM) parameters. This algorithm is based on an approximation of the maximum mutual information (MMI) objective function and its maximization in a technique similar to the expectation-maximization (EM) algorithm. The algorithm is implemented by a simple modification of the standard Baum--Welch algorithm, and can be applied to speech recognition as well as to word-spotting systems. Three tasks were tested: Isolated digit recognition in a noisy environment, connected digit recognition in a noisy environment and word-spotting. In all tasks a significant improvement over maximum likelihood (ML) estimation was observed. We also compared the new algorithm to the commonly used extended Baum--Welch MMI algorithm. In our tests the algorithm showed advantages in terms of both performance and computational complexity.
New Features In The Cu-Htk System For Transcription Of Conversational Telephone Speech
- Online]. Available: http://www.ietf.org/rfc/rfc0791.txt
, 2001
"... This paper discusses new features integrated into the Cambridge University HTK (CU-HTK) system for the transcription of conversational telephone speech. Major improvements have been achieved ..."
Abstract
-
Cited by 2 (1 self)
- Add to MetaCart
This paper discusses new features integrated into the Cambridge University HTK (CU-HTK) system for the transcription of conversational telephone speech. Major improvements have been achieved

