Results 1–7 of 7
Minimum Error Rate Training in Statistical Machine Translation, 2003
Abstract

Cited by 663 (7 self)
Often, the training procedure for statistical machine translation models is based on maximum likelihood or related criteria. A general problem of this approach is that there is only a loose relation to the final translation quality on unseen text. In this paper, we analyze various training criteria which directly optimize translation quality.
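The line optimisation at the heart of this training procedure can be sketched in a few lines: along a search direction, each hypothesis scores as a line in the step size gamma, so the 1-best output (and hence the error) is piecewise constant in gamma and can be minimised exactly over the crossing points. A minimal sketch on toy data (the function name, data layout and per-hypothesis error counts are illustrative, not the paper's implementation):

```python
# Minimal sketch of MERT line optimisation on toy data (hypothetical names;
# a real system would track the upper envelope rather than probe every crossing).
import numpy as np

def line_search(hyps, w, d, errors):
    """Return (gamma, error): the step along direction d from weights w that
    minimises the total error of the 1-best hypothesis per sentence."""
    # Each hypothesis scores as a line a + gamma * b; collect crossing points.
    gammas = {0.0}
    for feats in hyps:
        a = [float(np.dot(w, f)) for f in feats]
        b = [float(np.dot(d, f)) for f in feats]
        for i in range(len(feats)):
            for j in range(i + 1, len(feats)):
                if b[i] != b[j]:
                    gammas.add((a[j] - a[i]) / (b[i] - b[j]))
    # The error is piecewise constant, so probe each crossing (and just past it).
    best_g, best_e = 0.0, float("inf")
    for g in sorted(gammas):
        for probe in (g, g + 1e-6):
            total = 0
            for feats, errs in zip(hyps, errors):
                scores = [np.dot(w, f) + probe * np.dot(d, f) for f in feats]
                total += errs[int(np.argmax(scores))]
            if total < best_e:
                best_g, best_e = probe, total
    return best_g, best_e
```

Production implementations compute the upper envelope of the score lines once per sentence instead of re-scoring at every crossing, but the toy version makes the piecewise-constant structure of the error explicit.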
Lattice-Based Minimum Error Rate Training using Weighted Finite-State Transducers with Tropical Polynomial Weights
Abstract

Cited by 1 (1 self)
Minimum Error Rate Training (MERT) is a method for training the parameters of a log-linear model. One advantage of this method of training is that it can use the large number of hypotheses encoded in a translation lattice as training data. We demonstrate that the MERT line optimisation can be modelled as computing the shortest distance in a weighted finite-state transducer using a tropical polynomial semiring.
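The tropical polynomial weights the abstract refers to can be sketched as sets of lines (slope, intercept) in the line-search parameter gamma: semiring addition takes the upper envelope of the union of two line sets, and semiring multiplication sums lines pairwise, so a shortest-distance computation over the lattice yields the envelope of all path scores at once. A rough sketch under a max-plus convention (the representation and function names are assumptions, not the paper's):

```python
# Sketch of tropical polynomial weights for lattice MERT (assumption: max-plus
# convention; a weight is a set of lines (slope, intercept) in the parameter gamma).

def prune(lines):
    """Keep only the lines on the upper envelope over gamma (convex hull trick)."""
    lines = sorted(set(lines))              # by slope, then intercept
    hull = []
    for s, i in lines:
        if hull and hull[-1][0] == s:       # equal slopes: keep the higher line
            hull.pop()
        while len(hull) >= 2:
            (s1, i1), (s2, i2) = hull[-2], hull[-1]
            # the middle line never wins if it lies at or below the crossing
            # point of its two neighbours
            if (s2 - s1) * (i1 - i) <= (i1 - i2) * (s - s1):
                hull.pop()
            else:
                break
        hull.append((s, i))
    return hull

def oplus(p, q):
    """Semiring addition: pointwise max, i.e. the envelope of the union."""
    return prune(list(p) + list(q))

def otimes(p, q):
    """Semiring multiplication: pointwise sum, i.e. all pairwise line sums."""
    return prune([(sp + sq, ip + iq) for sp, ip in p for sq, iq in q])
```

For example, a two-edge path with weights {(1, 0)} and {(0, 2)} combines under `otimes` to the single line gamma + 2, and `oplus` with an alternative path {(2, 0)} keeps both lines, since each is 1-best for some range of gamma; pruning to the envelope after every operation is what keeps the lattice-level computation tractable.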
Topics on Minimum Classification Error Rate Based Discriminant Function Approach to Speech Recognition
Abstract
In this paper, we study a discriminant-function-based minimum recognition error rate approach to pattern recognition. This approach departs from the conventional paradigm, which links a classification/recognition task to the problem of distribution estimation. Instead, it takes a discriminant-function-based statistical pattern recognition approach, and its suitability for classification error rate minimization is established through a special loss function. It remains meaningful even when the model correctness assumption is known to be invalid. The use of discriminant functions has a significant impact on classifier design, since in many realistic applications, such as speech recognition, the true distribution form of the source is rarely known precisely, and without the model correctness assumption the classical optimality theory of the distribution estimation approach cannot be applied directly. We discuss issues in this new classifier design paradigm and present various extensions of this approach for applications in speech processing.
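The special loss function mentioned above is typically a sigmoid applied to a misclassification measure, giving a smoothed, differentiable stand-in for the 0-1 error. A minimal sketch (using the limiting variant in which the competitor term is simply the best wrong-class discriminant score; the names and the slope parameter are illustrative):

```python
# Sketch of an MCE-style smoothed loss (assumption: the limiting variant where
# the competitor term is the best wrong-class score; zeta sets the sigmoid slope).
import math

def mce_loss(scores, correct, zeta=1.0):
    """Sigmoid of the misclassification measure d = best wrong score - correct
    score: near 0 for confident correct decisions, near 1 for confident errors."""
    d = max(s for k, s in enumerate(scores) if k != correct) - scores[correct]
    return 1.0 / (1.0 + math.exp(-zeta * d))
```

Because the sigmoid is differentiable in the discriminant scores, the classifier parameters can be adjusted by gradient descent on this loss, which is the core of the MCE training paradigm.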
Regularized Minimum Error Rate Training, Michel Galley
Abstract
Minimum Error Rate Training (MERT) remains one of the preferred methods for tuning linear parameters in machine translation systems, yet it faces significant issues. First, MERT is an unregularized learner and is therefore prone to overfitting. Second, it is commonly used on a noisy, non-convex loss function that becomes more difficult to optimize as the number of parameters increases. To address these issues, we study the addition of a regularization term to the MERT objective function. Since standard regularizers such as ℓ2 are inapplicable to MERT due to the scale invariance of its objective function, we turn to two regularizers, ℓ0 and a modification of ℓ2, and present methods for efficiently integrating them during search. To improve search in large parameter spaces, we also present a new direction finding algorithm that uses the gradient of expected BLEU to orient MERT's exact line searches. Experiments with up to 3600 features show that these extensions of MERT yield results comparable to PRO, a learner often used with large feature sets.
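The scale invariance that rules out a plain ℓ2 penalty is easy to see concretely: the 1-best hypothesis under a linear model depends only on the direction of the weight vector, so shrinking its norm leaves the training error untouched while driving the ℓ2 term to zero. A toy demonstration (the data is made up):

```python
# Toy demonstration (made-up data) of the scale invariance of MERT's objective:
# scaling the weight vector never changes the 1-best hypothesis, so a plain
# l2 penalty can be driven to zero at no cost in training error.
import numpy as np

def one_best(w, feats):
    return int(np.argmax(feats @ w))

feats = np.array([[1.0, 0.0], [0.2, 0.9], [0.5, 0.5]])  # one hypothesis per row
w = np.array([0.3, 0.8])
for c in (1e-6, 1.0, 1e6):
    assert one_best(c * w, feats) == one_best(w, feats)
```

This is why the paper turns to ℓ0 and a modified ℓ2, which are not defeated by rescaling the weights.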
Discriminative Models for Speech Recognition
Abstract
The vast majority of automatic speech recognition systems use Hidden Markov Models (HMMs) as the underlying acoustic model. Initially these models were trained based on the maximum likelihood criterion. Significant performance gains have been obtained by using discriminative training criteria, such as maximum mutual information and minimum phone error. However, the underlying acoustic model is still generative, with the associated constraints on the state and transition probability distributions, and classification is based on Bayes' decision rule. Recently, there has been interest in examining discriminative, or direct, models for speech recognition. This paper briefly reviews the forms of discriminative models that have been investigated. These include maximum entropy Markov models, hidden conditional random fields and conditional augmented models. The relationships between the various models and issues with applying them to large vocabulary continuous speech recognition will be discussed.
the reviewers for helpful discussions and comments.
Abstract
When training the parameters for a natural language system, one would prefer to minimize 1-best loss (error) on an evaluation set. Since the error surface for many natural language problems is piecewise constant and riddled with local minima, many systems instead optimize log-likelihood, which is conveniently differentiable and convex. We propose training instead to minimize the expected loss, or risk. We define this expectation using a probability distribution over hypotheses that we gradually sharpen (anneal) to focus on the 1-best hypothesis. Besides the linear loss functions used in previous work, we also describe techniques for optimizing nonlinear functions such as precision or the BLEU metric. We present experiments training log-linear combinations of models for dependency parsing and for machine translation. In machine ...
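The annealed expectation described above can be sketched directly: place a distribution p(h) proportional to exp(gamma * score(h)) over the hypotheses and compute the expected loss under it; as the annealing parameter gamma grows, the risk approaches the loss of the 1-best hypothesis. A toy sketch (the scores, losses and parameter name are illustrative):

```python
# Sketch of annealed risk on toy data (hypothetical names): the distribution
# p(h) proportional to exp(gamma * score(h)) is sharpened by raising gamma, so
# the expected loss moves from the average loss toward the loss of the 1-best.
import numpy as np

def annealed_risk(scores, losses, gamma):
    """Expected loss under the gamma-sharpened distribution over hypotheses."""
    z = gamma * np.asarray(scores, dtype=float)
    p = np.exp(z - z.max())                  # subtract max for numerical stability
    p /= p.sum()
    return float(p @ np.asarray(losses, dtype=float))
```

At gamma = 0 the distribution is uniform and the risk is the mean loss; at large gamma it concentrates on the top-scoring hypothesis, recovering the piecewise-constant 1-best objective the abstract contrasts with.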
IMPROVING STATISTICAL MACHINE TRANSLATION THROUGH N-BEST LIST RERANKING AND OPTIMIZATION, 2014