Results 1–10 of 408
A Systematic Comparison of Various Statistical Alignment Models
Computational Linguistics, 2003
Cited by 1249 (58 self)
Abstract: In this article we address the problem of finding the word alignment of a bilingual sentence-aligned corpus by using language-independent statistical methods. There is a vast literature on this topic, and many different systems have been suggested to solve this problem. Our work follows and extends the methods introduced by Brown, Della Pietra, Della Pietra, and Mercer (1993) by using refined statistical models for the translation process. The basic idea of this approach is to develop a model of the translation process with the word alignment as a hidden variable of this process, to apply statistical estimation theory to compute the "optimal" model parameters, and to perform an alignment search to compute the best word alignment.
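The hidden-alignment idea described in this abstract can be illustrated with IBM Model 1, the simplest of the Brown et al. (1993) models: EM treats the word alignment as a latent variable, collecting expected alignment counts in the E-step and renormalizing them into translation probabilities in the M-step. The following is a minimal sketch on a toy corpus; the function and variable names and the data are invented for illustration, not taken from the paper.

```python
from collections import defaultdict

def ibm_model1(corpus, iterations=10):
    """EM training of IBM Model 1 translation probabilities t(f|e).

    corpus: list of (source_words, target_words) sentence pairs.
    The alignment of each source word is the hidden variable: the
    E-step computes soft alignment counts, the M-step renormalizes.
    """
    # Uniform initialization over all co-occurring word pairs.
    t = defaultdict(lambda: 1.0)
    for _ in range(iterations):
        count = defaultdict(float)   # expected count(f, e)
        total = defaultdict(float)   # expected count(e)
        for f_sent, e_sent in corpus:
            for f in f_sent:
                # P(f aligns to e_i) is proportional to t(f | e_i).
                norm = sum(t[(f, e)] for e in e_sent)
                for e in e_sent:
                    c = t[(f, e)] / norm
                    count[(f, e)] += c
                    total[e] += c
        # M-step: t(f|e) = count(f, e) / count(e).
        t = defaultdict(float, {(f, e): count[(f, e)] / total[e]
                                for (f, e) in count})
    return t

corpus = [(["la", "maison"], ["the", "house"]),
          (["la", "fleur"], ["the", "flower"]),
          (["maison"], ["house"])]
t = ibm_model1(corpus)
```

After a few iterations the third sentence pair pins "maison" to "house", which in turn pulls "la" toward "the" in the first pair, the usual EM disambiguation effect.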
The Mathematics of Statistical Machine Translation: Parameter Estimation
Computational Linguistics, 1993
Cited by 1173 (1 self)
Abstract: In this paper, we focus on the translation modeling problem. Before we turn to this problem, however, we should address an issue that may be a concern to some readers: Why do we estimate Pr(e) and Pr(f|e) rather than estimate Pr(e|f) directly? We are really interested in this latter probability. Wouldn't we reduce our problems from three to two by this direct approach? If we can estimate Pr(f|e) adequately, why can't we just turn the whole process around to estimate Pr(e|f)? To understand this, imagine that we divide French and English strings into those that are well-formed and those that are ill-formed. This is not a precise notion. We have in mind that strings like Il va à la bibliothèque, or I live in a house, or even Colorless green ideas sleep furiously are well-formed, but that strings like la va Il bibliothèque or a I in live house are not. When we translate a French string into English, we can think of ourselves as springing from a well-formed French string into the sea of well-formed English strings with the hope of landing on a good one. It is important, therefore, that our model for Pr(e|f) concentrate its probability as much as possible on well-formed English strings. But it is not important that our model for Pr(f|e) concentrate its probability on well-formed French strings. If we were to reduce the probability of all well-formed French strings by the same factor, spreading the probability thus liberated over ill-formed French strings, there would be no effect on our translations: the argument that maximizes some function f(x) also maximizes cf(x) for any positive constant c. As we shall see below, our translation models are prodigal, spraying probability all over the place, most of it on ill-formed French strings. In fact, as we discuss in Section 4.5, two...
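The scaling argument at the end of this excerpt, that argmax f(x) also maximizes cf(x) for any positive constant c, is easy to demonstrate directly. The scores below are made up for illustration; they are not real translation probabilities.

```python
# The decoder chooses the candidate e maximizing a score; scaling every
# score by the same positive constant c cannot change which e wins.
scores = {"the house": 0.20, "house the": 0.01, "a dwelling": 0.05}

def best(c):
    # argmax over candidates of c * score(e)
    return max(scores, key=lambda e: c * scores[e])

# The winner is identical for any positive c.
assert best(1.0) == best(0.001) == best(1e6) == "the house"
```

This is why the paper can afford a Pr(f|e) model that wastes mass on ill-formed French strings: a uniform rescaling of the well-formed strings' probabilities leaves every argmax decision unchanged.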
An Empirical Study of Smoothing Techniques for Language Modeling
1998
Cited by 849 (20 self)
Abstract: We present an extensive empirical comparison of several smoothing techniques in the domain of language modeling, including those described by Jelinek and Mercer (1980), Katz (1987), and Church and Gale (1991). We investigate for the first time how factors such as training data size, corpus (e.g., Brown versus Wall Street Journal), and n-gram order (bigram versus trigram) affect the relative performance of these methods, which we measure through the cross-entropy of test data. In addition, we introduce two novel smoothing techniques, one a variation of Jelinek-Mercer smoothing and one a very simple linear interpolation technique, both of which outperform existing methods.
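The Jelinek-Mercer family of techniques mentioned here interpolates a higher-order maximum-likelihood estimate with a lower-order one. A minimal sketch, with a fixed interpolation weight (in practice the weight is tuned on held-out data, and the paper's variants bucket it by context count):

```python
from collections import Counter

def train_interpolated_bigram(tokens, lam=0.7):
    """Linear interpolation: P(w|v) = lam * P_ML(w|v) + (1-lam) * P_ML(w).

    Unseen bigrams still receive probability mass through the unigram
    term, which is the point of the smoothing.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n = len(tokens)

    def prob(w, v):
        p_uni = unigrams[w] / n
        p_bi = bigrams[(v, w)] / unigrams[v] if unigrams[v] else 0.0
        return lam * p_bi + (1 - lam) * p_uni
    return prob

tokens = "the cat sat on the mat".split()
p = train_interpolated_bigram(tokens)
```

For example, the bigram "cat mat" never occurs in the toy data, yet `p("mat", "cat")` is nonzero because the unigram term contributes.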
Class-Based n-gram Models of Natural Language
Computational Linguistics, 1992
Cited by 697 (5 self)
Abstract: We address the problem of predicting a word from previous words in a sample of text. In particular, we discuss n-gram models based on classes of words. We also discuss several statistical algorithms for assigning words to classes based on the frequency of their co-occurrence with other words. We find that we are able to extract classes that have the flavor of either syntactically based groupings or semantically based groupings, depending on the nature of the underlying statistics.
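A class-based bigram model factors P(w2|w1) into a class-transition term and a word-emission term, so that words in the same class share distributional statistics. The sketch below takes the class assignment as given (the paper's contribution is inducing it from co-occurrence counts, which is omitted here); names and the toy data are illustrative.

```python
from collections import Counter

def class_bigram_model(tokens, word2class):
    """Class-based bigram: P(w2|w1) = P(w2 | c(w2)) * P(c(w2) | c(w1))."""
    classes = [word2class[w] for w in tokens]
    class_uni = Counter(classes)           # count(c)
    class_bi = Counter(zip(classes, classes[1:]))  # count(c1, c2)
    emit = Counter(zip(classes, tokens))   # count(c, w)

    def prob(w2, w1):
        c1, c2 = word2class[w1], word2class[w2]
        p_emit = emit[(c2, w2)] / class_uni[c2]
        p_trans = class_bi[(c1, c2)] / class_uni[c1]
        return p_emit * p_trans
    return prob

tokens = "the cat sat the dog sat".split()
word2class = {"the": "DET", "cat": "N", "dog": "N", "sat": "V"}
p = class_bigram_model(tokens, word2class)
```

Because "cat" and "dog" share a class, they receive identical probability after "the", even though each individual bigram is rare: the class structure pools their evidence.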
Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models
1995
"... ..."
Inducing Features of Random Fields
IEEE Transactions on Pattern Analysis and Machine Intelligence, 1997
Cited by 552 (14 self)
Abstract: We present a technique for constructing random fields from a set of training samples. The learning paradigm builds increasingly complex fields by allowing potential functions, or features, that are supported by increasingly large subgraphs. Each feature has a weight that is trained by minimizing the Kullback-Leibler divergence between the model and the empirical distribution of the training data. A greedy algorithm determines how features are incrementally added to the field, and an iterative scaling algorithm is used to estimate the optimal values of the weights. The random field models and techniques introduced in this paper differ from those common to much of the computer vision literature in that the underlying random fields are non-Markovian and have a large number of parameters that must be estimated. Relations to other learning approaches, including decision trees, are given. As a demonstration of the method, we describe its application to the problem of automatic word classification.
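The weight-training step described here, minimizing KL divergence between the empirical distribution and a Gibbs-form model, can be sketched over a tiny finite space. The paper uses iterative scaling; the sketch below substitutes plain gradient ascent on the log-likelihood (which has the same fixed point: empirical and model feature expectations match). All names and the toy data are invented for illustration.

```python
import math

def fit_random_field(samples, features, space, lr=0.5, steps=200):
    """Fit weights of p(x) proportional to exp(sum_i w_i * f_i(x)).

    Gradient of the log-likelihood in w_i is the gap between the
    empirical and model expectations of feature f_i.
    """
    w = [0.0] * len(features)
    emp = [sum(f(x) for x in samples) / len(samples) for f in features]
    for _ in range(steps):
        # Unnormalized weights and partition function over the space.
        scores = [math.exp(sum(wi * f(x) for wi, f in zip(w, features)))
                  for x in space]
        z = sum(scores)
        model = [sum((s / z) * f(x) for s, x in zip(scores, space))
                 for f in features]
        w = [wi + lr * (e - m) for wi, e, m in zip(w, emp, model)]
    return w

# One parity feature; 3 of 4 samples are odd, so the fitted model
# should put about 3/4 of its mass on odd outcomes.
w = fit_random_field([1, 1, 3, 2], [lambda x: x % 2], [0, 1, 2, 3])
z = sum(math.exp(w[0] * (x % 2)) for x in [0, 1, 2, 3])
p_odd = sum(math.exp(w[0] * (x % 2)) / z for x in [0, 1, 2, 3] if x % 2)
```

The greedy feature-induction loop of the paper would wrap this: propose candidate features, add the one with the largest gain, then refit the weights.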
Maximum A Posteriori Estimation for Multivariate Gaussian Mixture Observations of Markov Chains
IEEE Transactions on Speech and Audio Processing, 1994
Cited by 489 (38 self)
Abstract: In this paper a framework for maximum a posteriori (MAP) estimation of hidden Markov models (HMMs) is presented. Three key issues of MAP estimation, namely the choice of prior distribution family, the specification of the parameters of prior densities, and the evaluation of the MAP estimates, are addressed. Using HMMs with Gaussian mixture state observation densities as an example, it is assumed that the prior densities for the HMM parameters can be adequately represented as a product of Dirichlet and normal-Wishart densities. The classical maximum likelihood estimation algorithms, namely the forward-backward algorithm and the segmental k-means algorithm, are expanded, and MAP estimation formulas are developed. Prior density estimation issues are discussed for two classes of applications: parameter smoothing and model adaptation, and some experimental results are given illustrating the practical interest of this approach. Because of its adaptive nature, Bayesian learning is shown to serve as a unified approach for a wide range of speech recognition applications.
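The role of the Dirichlet prior in this framework is easiest to see on the discrete (multinomial) HMM parameters, such as transition or mixture weights: the MAP estimate is the ML estimate with alpha_k - 1 pseudo-counts added, which smooths sparse counts. A minimal sketch (the full paper also handles the Gaussian means and covariances via normal-Wishart priors, which is omitted here):

```python
def map_multinomial(counts, alpha):
    """MAP estimate of multinomial parameters under a Dirichlet(alpha) prior:

        theta_k = (n_k + alpha_k - 1) / (N + sum_k alpha_k - K)

    With alpha_k > 1 the prior acts like pseudo-counts, so categories
    unseen in the data still get nonzero probability.
    """
    k = len(counts)
    n = sum(counts)
    a0 = sum(alpha)
    return [(c + a - 1) / (n + a0 - k) for c, a in zip(counts, alpha)]

# Toy counts with one unseen category; alpha = 2 adds one pseudo-count each.
theta = map_multinomial([8, 2, 0], [2.0, 2.0, 2.0])
```

With all alpha_k = 1 the prior is uniform and the formula reduces to the ML estimate, which is the sense in which MAP generalizes the classical algorithms mentioned in the abstract.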
Unsupervised word sense disambiguation rivaling supervised methods
Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics, 1995
Cited by 487 (4 self)
Abstract: This paper presents an unsupervised learning algorithm for sense disambiguation that, when trained on unannotated English text, rivals the performance of supervised techniques that require time-consuming hand annotations. The algorithm is based on two powerful constraints, that words tend to have one sense per discourse and one sense per collocation, exploited in an iterative bootstrapping procedure. Tested accuracy exceeds 96%.
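The bootstrapping loop can be sketched for a single ambiguous word: start from a few seed collocations, label all occurrences they cover, then promote new collocations that co-occur (almost) exclusively with one sense, and repeat. The sketch below covers only the "one sense per collocation" constraint; the paper's discourse-level smoothing and log-likelihood rule ranking are omitted, and the data and names are invented.

```python
from collections import Counter, defaultdict

def yarowsky(contexts, seeds, rounds=5, min_ratio=0.9):
    """Bootstrap sense labels for the occurrences of one ambiguous word.

    contexts: one set of context words per occurrence.
    seeds: initial collocation -> sense rules.
    """
    rules = dict(seeds)
    labels = [None] * len(contexts)
    for _ in range(rounds):
        # Label every occurrence containing a known collocation.
        for i, ctx in enumerate(contexts):
            hits = [rules[w] for w in ctx if w in rules]
            if hits:
                labels[i] = Counter(hits).most_common(1)[0][0]
        # Promote collocations seen almost only with one sense.
        stats = defaultdict(Counter)
        for ctx, lab in zip(contexts, labels):
            if lab is not None:
                for w in ctx:
                    stats[w][lab] += 1
        for w, c in stats.items():
            sense, n = c.most_common(1)[0]
            if n / sum(c.values()) >= min_ratio:
                rules[w] = sense
    return labels

contexts = [{"life"}, {"life", "water"}, {"water"},
            {"manufacturing"}, {"manufacturing", "equipment"}, {"equipment"}]
labels = yarowsky(contexts, {"life": "A", "manufacturing": "B"})
```

In the toy run, "water" is first seen alongside the seed "life", gets promoted to a rule, and then labels the occurrence that contains only "water", which is the essence of the bootstrap.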
Real-time American Sign Language recognition using desk and wearable computer based video
IEEE Transactions on Pattern Analysis and Machine Intelligence, 1998
Cited by 444 (23 self)
Abstract: We present two real-time hidden Markov model-based systems for recognizing sentence-level continuous American Sign Language (ASL) using a single camera to track the user's unadorned hands. The first system observes the user from a desk-mounted camera and achieves 92 percent word accuracy. The second system mounts the camera in a cap worn by the user and achieves 98 percent accuracy (97 percent with an unrestricted grammar). Both experiments use a 40-word lexicon.
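Decoding in HMM-based recognizers like these rests on the Viterbi algorithm, which finds the most likely hidden state sequence for an observation sequence. The sketch below uses discrete emission probabilities and a textbook toy model for clarity; a continuous-density recognizer as in the paper would replace the emission lookup with Gaussian likelihoods over tracked hand features.

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most likely hidden state sequence for obs under a discrete HMM."""
    # v[t][s]: best score of any path ending in state s at time t.
    v = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
    back = [{}]
    for o in obs[1:]:
        v.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((p, v[-2][p] * trans_p[p][s]) for p in states),
                key=lambda t: t[1])
            v[-1][s] = score * emit_p[s][o]
            back[-1][s] = prev
    # Trace back the best path from the best final state.
    last = max(v[-1], key=v[-1].get)
    path = [last]
    for bp in reversed(back[1:]):
        path.append(bp[path[-1]])
    return list(reversed(path))

states = ["Healthy", "Fever"]
start_p = {"Healthy": 0.6, "Fever": 0.4}
trans_p = {"Healthy": {"Healthy": 0.7, "Fever": 0.3},
           "Fever": {"Healthy": 0.4, "Fever": 0.6}}
emit_p = {"Healthy": {"normal": 0.5, "cold": 0.4, "dizzy": 0.1},
          "Fever": {"normal": 0.1, "cold": 0.3, "dizzy": 0.6}}
path = viterbi(["normal", "cold", "dizzy"], states, start_p, trans_p, emit_p)
```

In a sign language recognizer the states would model sub-sign phases, observations would be per-frame hand features, and a grammar would constrain the word-level transitions, but the dynamic program is the same.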