Results 1-10 of 18
A Maximum Entropy approach to Natural Language Processing
Computational Linguistics, 1996
Cited by 1287 (5 self)
The concept of maximum entropy can be traced back along multiple threads to Biblical times. Only recently, however, have computers become powerful enough to permit the widescale application of this concept to real world problems in statistical estimation and pattern recognition. In this paper we describe a method for statistical modeling based on maximum entropy. We present a maximum-likelihood approach for automatically constructing maximum entropy models and describe how to implement this approach efficiently, using as examples several problems in natural language processing.
A Gaussian prior for smoothing maximum entropy models
1999
Cited by 246 (2 self)
In certain contexts, maximum entropy (ME) modeling can be viewed as maximum likelihood training for exponential models, and like other maximum likelihood methods is prone to overfitting of training data. Several smoothing methods for maximum entropy models have been proposed to address this problem, but previous results do not make it clear how these smoothing methods compare with smoothing methods for other types of related models. In this work, we survey previous work in maximum entropy smoothing and compare the performance of several of these algorithms with conventional techniques for smoothing n-gram language models. Because of the mature body of research in n-gram model smoothing and the close connection between maximum entropy and conventional n-gram models, this domain is well suited to gauge the performance of maximum entropy smoothing methods. Over a large number of data sets, we find that an ME smoothing method proposed to us by Lafferty [1] performs as well as or better than all other algorithms under consideration. This general and efficient method involves using a Gaussian prior on the parameters of the model and selecting maximum a posteriori instead of maximum likelihood parameter values. We contrast this method with previous n-gram smoothing methods to explain its superior performance.
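The effect of a Gaussian prior described in this abstract can be illustrated with a minimal sketch. It is not the paper's n-gram setup: the function name `fit_maxent`, the toy data, and the step size are assumptions made for illustration. The sketch fits a two-class conditional exponential model by gradient ascent, once by plain maximum likelihood and once with a zero-mean Gaussian prior of variance `sigma2` on each weight (the MAP estimate).

```python
import math

def fit_maxent(data, sigma2=None, steps=500, lr=0.5):
    """Gradient ascent for a two-class conditional exponential model.

    data:   list of (feature_vector, label) pairs, label in {0, 1}.
    sigma2: variance of a zero-mean Gaussian prior on each weight;
            None means plain maximum likelihood (no prior).
    """
    dim = len(data[0][0])
    w = [0.0] * dim
    for _ in range(steps):
        grad = [0.0] * dim
        for x, y in data:
            score = sum(wi * xi for wi, xi in zip(w, x))
            p = 1.0 / (1.0 + math.exp(-score))  # model P(y = 1 | x)
            for j in range(dim):
                # observed feature value minus its expectation under the model
                grad[j] += (y - p) * x[j]
        if sigma2 is not None:
            # gradient of the log prior: -w_j / sigma^2
            for j in range(dim):
                grad[j] -= w[j] / sigma2
        w = [wi + lr * g / len(data) for wi, g in zip(w, grad)]
    return w

# Hypothetical, linearly separable toy data: (bias feature, signal feature).
data = [([1.0, 1.0], 1), ([1.0, -1.0], 0), ([1.0, 2.0], 1), ([1.0, -2.0], 0)]

w_ml = fit_maxent(data)               # weights keep growing with more steps
w_map = fit_maxent(data, sigma2=1.0)  # weights settle at a finite value
```

On separable data like this the maximum likelihood weight on the signal feature grows without bound as training continues, while the MAP weight converges, which is the overfitting control the abstract refers to.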
Discriminative training of Markov logic networks
In Proc. of the Natl. Conf. on Artificial Intelligence, 2005
Cited by 99 (18 self)
Many machine learning applications require a combination of probability and first-order logic. Markov logic networks (MLNs) accomplish this by attaching weights to first-order clauses, and viewing these as templates for features of Markov networks. Model parameters (i.e., clause weights) can be learned by maximizing the likelihood of a relational database, but this can be quite costly and lead to suboptimal results for any given prediction task. In this paper we propose a discriminative approach to training MLNs, one which optimizes the conditional likelihood of the query predicates given the evidence ones, rather than the joint likelihood of all predicates. We extend Collins’s (2002) voted perceptron algorithm for HMMs to MLNs by replacing the Viterbi algorithm with a weighted satisfiability solver. Experiments on entity resolution and link prediction tasks show the advantages of this approach compared to generative MLN training, as well as compared to purely probabilistic and purely logical approaches.
Statistical Machine Translation with Scarce Resources Using Morphosyntactic Information
Computational Linguistics (Vol 30, Num ...), 2004
Cited by 90 (3 self)
In statistical machine translation, correspondences between the words in the source and the target language are learned from parallel corpora, and often little or no linguistic knowledge is used to structure the underlying models. In particular, existing statistical systems for machine translation often treat different inflected forms of the same lemma as if they were independent of one another. The bilingual training data can be better exploited by explicitly taking into account the interdependencies of related inflected forms. We propose the construction of hierarchical lexicon models on the basis of equivalence classes of words. In addition, we introduce sentence-level restructuring transformations which aim at the assimilation of word order in related sentences. We have systematically investigated the amount of bilingual training data required to maintain an acceptable quality of machine translation. The combination of the suggested methods for improving translation quality in frameworks with scarce resources has been successfully tested: We were able to reduce the amount of bilingual training data to less than 10% of the original corpus, while losing only 1.6% in translation quality. The improvement of the translation results is demonstrated on two German-English corpora taken from the Verbmobil task and the Nespole! task.
Parameter Estimation in Stochastic Logic Programs
Machine Learning, 2000
Cited by 80 (5 self)
Stochastic logic programs (SLPs) are logic programs with labelled clauses which define a log-linear distribution over refutations of goals. The log-linear distribution provides, by marginalisation, a distribution over variable bindings, allowing SLPs to compactly represent quite complex distributions. We analyse the fundamental statistical properties of SLPs addressing issues concerning infinite derivations, 'unnormalised' SLPs and impure SLPs. After detailing existing approaches to parameter estimation for log-linear models and their application to SLPs, we present a new algorithm called failure-adjusted maximisation (FAM). FAM is an instance of the EM algorithm that applies specifically to normalised SLPs and provides a closed form for computing parameter updates within an iterative maximisation approach. We empirically show that FAM works on some small examples and discuss methods for applying it to bigger problems. © 2000 Kluwer Academic Publishers. Printed in the Netherlands. ...
Boosting as Entropy Projection
1999
Cited by 70 (9 self)
We consider the AdaBoost procedure for boosting weak learners. In AdaBoost, a key step is choosing a new distribution on the training examples based on the old distribution and the mistakes made by the present weak hypothesis. We show how AdaBoost's choice of the new distribution can be seen as an approximate solution to the following problem: Find a new distribution that is closest to the old distribution subject to the constraint that the new distribution is orthogonal to the vector of mistakes of the current weak hypothesis. The distance (or divergence) between distributions is measured by the relative entropy. Alternatively, we could say that AdaBoost approximately projects the distribution vector onto a hyperplane defined by the mistake vector. We show that this new view of AdaBoost as an entropy projection is dual to the usual view of AdaBoost as minimizing the normalization factors of the updated distributions.
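The orthogonality constraint in this abstract can be checked numerically with a small sketch. The encoding is an assumption: the mistake vector is taken as u_i = y_i * h(x_i), so +1 marks a correct prediction and -1 a mistake, and the function name `adaboost_reweight` and the example numbers are hypothetical. For such ±1-valued weak hypotheses and the standard AdaBoost step size, the updated distribution comes out exactly orthogonal to u.

```python
import math

def adaboost_reweight(dist, margins):
    """One AdaBoost reweighting step for a +/-1-valued weak hypothesis.

    dist:    current distribution over training examples (sums to 1).
    margins: u_i = y_i * h(x_i); +1 where the weak hypothesis is right,
             -1 where it is wrong (assumed encoding of the mistake vector).
    Requires a weighted error strictly between 0 and 1.
    """
    eps = sum(d for d, u in zip(dist, margins) if u < 0)  # weighted error
    alpha = 0.5 * math.log((1.0 - eps) / eps)             # standard step size
    unnorm = [d * math.exp(-alpha * u) for d, u in zip(dist, margins)]
    z = sum(unnorm)                                       # normalization factor
    return [v / z for v in unnorm], alpha, z

dist = [0.1, 0.2, 0.3, 0.4]
margins = [1, -1, 1, 1]  # the weak hypothesis errs only on the second example
new_dist, alpha, z = adaboost_reweight(dist, margins)

# Inner product of the new distribution with the mistake vector.
correlation = sum(d * u for d, u in zip(new_dist, margins))
```

Here `correlation` is zero up to rounding, and `z` equals 2 * sqrt(eps * (1 - eps)), the normalization factor that the usual view of AdaBoost minimizes.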
Lexicalized Stochastic Modeling of Constraint-Based Grammars using Log-Linear Measures and EM Training
In Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics (ACL'00), Hong Kong, 2000
Cited by 67 (11 self)
We present a new approach to stochastic modeling of constraint-based grammars that is based on log-linear models and uses EM for estimation from unannotated data. The techniques are applied to an LFG grammar for German. Evaluation on an exact match task yields 86% precision for an ambiguity rate of 5.4, and 90% precision on a subcat frame match for an ambiguity rate of 25. Experimental comparison to training from a parsebank shows a 10% gain from EM training. Also, a new class-based grammar lexicalization is presented, showing a 10% gain over unlexicalized models.
Text Segmentation Using Exponential Models
In Proceedings of the Second Conference on Empirical Methods in Natural Language Processing, 1997
Cited by 60 (0 self)
This paper introduces a new statistical approach to partitioning text automatically into coherent segments. Our approach enlists both short-range and long-range language models to help it sniff out likely sites of topic changes in text. To aid its search, the system consults a set of simple lexical hints it has learned to associate with the presence of boundaries through inspection of a large corpus of annotated data. We also propose a new probabilistically motivated error metric for use by the natural language processing and information retrieval communities, intended to supersede precision and recall for appraising segmentation algorithms. Qualitative assessment of our algorithm as well as evaluation using this new metric demonstrate the effectiveness of our approach in two very different domains, Wall Street Journal articles and the TDT Corpus, a collection of newswire articles and broadcast news transcripts.
Formal grammar and information theory: Together again?
Philosophical Transactions of the Royal Society, 2000
Cited by 30 (0 self)
In the last 40 years, research on models of spoken and written language has been split between two seemingly irreconcilable traditions: formal linguistics in the Chomsky tradition, and information theory in the Shannon tradition. Zellig Harris had advocated a close alliance between grammatical and information-theoretic principles in the analysis of natural language, and early formal-language theory provided another strong link between information theory and linguistics. Nevertheless, in most research on language and computation, grammatical and information-theoretic approaches had moved far apart. Today, after many years on the defensive, the information-theoretic approach has gained new strength and achieved practical successes in speech recognition, information retrieval, and, increasingly, in language analysis and machine translation. The exponential increase in the speed and storage capacity of computers is the proximate cause of these engineering successes, allowing the automatic estimation of the parameters of probabilistic models of language by counting occurrences of linguistic events in very large bodies of text and speech. However, I will argue that information-theoretic and computational ideas are also playing an increasing role in the scientific understanding of language, and will help bring together formal-linguistic and information-theoretic perspectives.
Statistical Learning Algorithms Based on Bregman Distances
1997
Cited by 28 (1 self)
We present a class of statistical learning algorithms formulated in terms of minimizing Bregman distances, a family of generalized entropy measures associated with convex functions. The inductive learning scheme is akin to growing a decision tree, with the Bregman distance filling the role of the impurity function in tree-based classifiers. Our approach is based on two components. In the feature selection step, each linear constraint in a pool of candidate features is evaluated by the reduction in Bregman distance that would result from adding it to the model. In the constraint satisfaction step, all of the parameters are adjusted to minimize the Bregman distance subject to the chosen constraints. We introduce a new iterative estimation algorithm for carrying out both the feature selection and constraint satisfaction steps, and outline a proof of the convergence of these algorithms.
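For readers unfamiliar with the term, the following sketch computes a Bregman distance from its standard definition, B_F(p, q) = F(p) - F(q) - <grad F(q), p - q>, for a convex function F. The abstract does not spell the definition out, so the formula, the helper names, and the two example choices of F are assumptions added for illustration, not the paper's algorithm.

```python
import math

def bregman(F, grad_F, p, q):
    """Bregman distance B_F(p, q) = F(p) - F(q) - <grad F(q), p - q>."""
    inner = sum(g * (pi - qi) for g, pi, qi in zip(grad_F(q), p, q))
    return F(p) - F(q) - inner

# F(x) = sum_i x_i log x_i (negative entropy) yields the relative entropy
# when p and q are both probability distributions.
def neg_entropy(x):
    return sum(xi * math.log(xi) for xi in x)

def neg_entropy_grad(x):
    return [math.log(xi) + 1.0 for xi in x]

# F(x) = 0.5 * ||x||^2 yields half the squared Euclidean distance.
def half_sq_norm(x):
    return 0.5 * sum(xi * xi for xi in x)

def half_sq_norm_grad(x):
    return list(x)

p = [0.5, 0.5]
q = [0.9, 0.1]
kl_as_bregman = bregman(neg_entropy, neg_entropy_grad, p, q)
euclid_as_bregman = bregman(half_sq_norm, half_sq_norm_grad, p, q)
```

Swapping in a different convex F changes the divergence being minimized without changing the surrounding procedure, which is the sense in which Bregman distances form a family of generalized entropy measures.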