Results 1–10 of 35
A Maximum Entropy Approach to Adaptive Statistical Language Modeling
Computer Speech and Language, 1996
Cited by 242 (11 self)

Abstract:
An adaptive statistical language model is described, which successfully integrates long-distance linguistic information with other knowledge sources. Most existing statistical language models exploit only the immediate history of a text. To extract information from further back in the document's history, we propose and use trigger pairs as the basic information-bearing elements. This allows the model to adapt its expectations to the topic of discourse. Next, statistical evidence from multiple sources must be combined. Traditionally, linear interpolation and its variants have been used, but these are shown here to be seriously deficient. Instead, we apply the principle of Maximum Entropy (ME). Each information source gives rise to a set of constraints, to be imposed on the combined estimate. The intersection of these constraints is the set of probability functions which are consistent with all the information sources. The function with the highest entropy within that set is the ME solution...
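The abstract's core idea, choosing the highest-entropy distribution consistent with feature-expectation constraints, can be sketched numerically. This is a minimal illustration, not the paper's method: the vocabulary, the trigger-style features, and the target expectations below are invented, and plain gradient ascent on the dual stands in for the Generalized Iterative Scaling used in practice.

```python
import math

# Hypothetical toy vocabulary and two binary "trigger fired" features.
words = ["bank", "loan", "river", "water"]
features = [
    lambda w: 1.0 if w in ("bank", "loan") else 0.0,    # finance trigger
    lambda w: 1.0 if w in ("river", "water") else 0.0,  # nature trigger
]
targets = [0.7, 0.3]  # desired expectations E[f_k] (invented training counts)

lam = [0.0, 0.0]  # one Lagrange multiplier per constraint
for _ in range(2000):
    # Model distribution p(w) proportional to exp(sum_k lam_k * f_k(w))
    scores = [math.exp(sum(l * f(w) for l, f in zip(lam, features)))
              for w in words]
    z = sum(scores)
    p = [s / z for s in scores]
    # Gradient of the dual: target expectation minus model expectation
    for k, f in enumerate(features):
        model_exp = sum(pi * f(w) for pi, w in zip(p, words))
        lam[k] += 0.5 * (targets[k] - model_exp)

# After fitting, the model expectations match the imposed constraints.
for k, f in enumerate(features):
    print(round(sum(pi * f(w) for pi, w in zip(p, words)), 3))
```

Within each constraint set the fitted distribution stays uniform, which is exactly the "highest entropy consistent with the constraints" behavior the abstract describes.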
Maximum Entropy Models for Natural Language Ambiguity Resolution
, 1998
Cited by 202 (1 self)

Abstract:
The best aspect of a research environment, in my opinion, is the abundance of bright people with whom you argue, discuss, and nurture your ideas. I thank all of the people at Penn and elsewhere who have given me the feedback that has helped me to separate the good ideas from the bad ideas. I hope that I have kept the good ideas in this thesis, and left the bad ideas out! I would like to acknowledge the following people for their contribution to my education: I thank my advisor Mitch Marcus, who gave me the intellectual freedom to pursue what I believed to be the best way to approach natural language processing, and also gave me direction when necessary. I also thank Mitch for many fascinating conversations, both personal and professional, over the last four years at Penn. I thank all of my thesis committee members: John Lafferty from Carnegie Mellon University, Aravind Joshi, Lyle Ungar, and Mark Liberman, for their extremely valuable suggestions and comments about my thesis research. I thank Mike Collins, Jason Eisner, and Dan Melamed, with whom I've had many stimulating and impromptu discussions in the LINC lab. I owe them much gratitude for their valuable feedback on numerous rough drafts of papers and thesis chapters.
A maximum entropy approach to named entity recognition
, 1999
Cited by 146 (4 self)

Abstract:
Acknowledgments This work would not have been possible without the support of many people inside and outside of New York University. My advisor, Professor Ralph Grishman, has provided me with a great deal of useful advice, including suggesting the problem of named entity recognition to me as a promising application for maximum entropy modeling. More than that, he has helped me work through a great deal of literature in statistical computational linguistics and he generously supplied me with the necessary time, equipment, and resources of his research staff which enabled me to put together the MENE system. I would also like to thank the other members of NYU's Proteus project for their assistance. In particular, John Sterling helped me to develop the idea of integrating the Proteus parser with the MENE system in the month before the MUC-7 evaluation. He and Eugene Agichtein put in extremely long hours leading up to the evaluation and helped to make it a success. The work on porting the MENE system to Japanese would not have been possible without the assistance of my friend and colleague, Satoshi Sekine. In addition, I would like to thank him for helping me out as the only English-speaking participant in the IREX evaluation. For his assistance with my upcoming trip to Japan and for all his work on translating IREX instructions for my benefit, I am very grateful.
Using Unlabeled Data to Improve Text Classification
, 2001
Cited by 49 (0 self)

Abstract:
One key difficulty with text classification learning algorithms is that they require many hand-labeled examples to learn accurately. This dissertation demonstrates that supervised learning algorithms that use a small number of labeled examples and many inexpensive unlabeled examples can create high-accuracy text classifiers. By assuming that documents are created by a parametric generative model, Expectation-Maximization (EM) finds local maximum a posteriori models and classifiers from all the data, labeled and unlabeled. These generative models do not capture all the intricacies of text; however, on some domains this technique substantially improves classification accuracy, especially when labeled data are sparse. Two problems arise from this basic approach. First, unlabeled data can hurt performance in domains where the generative modeling assumptions are too strongly violated. In this case the assumptions can be made more representative in two ways: by modeling subtopic class structure, and by modeling supertopic hierarchical class relationships. By doing so, model probability and classification accuracy come into correspondence, allowing unlabeled data to improve classification performance. The second problem is that even with a representative model, the improvements given by unlabeled data do not sufficiently compensate for a paucity of labeled data. Here, limited labeled data provide EM initializations that lead to low-probability models. Performance can be significantly improved by using active learning to select high-quality initializations, and by using alternatives to EM that avoid low-probability local maxima.
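The EM scheme the abstract describes, a naive Bayes classifier initialized from labeled documents and refined with soft labels on unlabeled ones, can be sketched on toy data. The corpus and the two classes below are invented for illustration; the dissertation works with real text corpora and richer model variants.

```python
import math
from collections import Counter

# Invented toy corpus; class 0 = sports, class 1 = cooking.
labeled = [(["ball", "goal"], 0), (["oven", "flour"], 1)]
unlabeled = [["ball", "team", "goal"], ["flour", "sugar", "oven"],
             ["team", "ball"], ["sugar", "oven"]]
vocab = sorted({w for d, _ in labeled for w in d} | {w for d in unlabeled for w in d})

def train(soft_docs):
    """M-step: estimate priors and word probabilities from soft-labeled docs."""
    prior = [1.0, 1.0]               # add-one smoothing on class priors
    counts = [Counter(), Counter()]  # fractional word counts per class
    for doc, post in soft_docs:
        for c in (0, 1):
            prior[c] += post[c]
            for word in doc:
                counts[c][word] += post[c]
    total = sum(prior)
    prior = [x / total for x in prior]
    cond = []
    for c in (0, 1):
        denom = sum(counts[c].values()) + len(vocab)  # add-one smoothing
        cond.append({word: (counts[c][word] + 1.0) / denom for word in vocab})
    return prior, cond

def posterior(doc, prior, cond):
    """E-step for one document: class posteriors under naive Bayes."""
    logp = [math.log(prior[c]) + sum(math.log(cond[c][word]) for word in doc)
            for c in (0, 1)]
    m = max(logp)
    exps = [math.exp(v - m) for v in logp]
    return [e / sum(exps) for e in exps]

hard = [(d, [1.0, 0.0] if y == 0 else [0.0, 1.0]) for d, y in labeled]
prior, cond = train(hard)              # initialize from labeled data only
for _ in range(5):                     # EM iterations folding in unlabeled docs
    soft_unl = [(d, posterior(d, prior, cond)) for d in unlabeled]
    prior, cond = train(hard + soft_unl)

print([round(posterior(d, prior, cond)[0], 2) for d in unlabeled])
```

Even with two labeled documents, the unlabeled documents pull the word statistics toward the right classes, which is the effect the dissertation quantifies.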
On the toric algebra of graphical models
, 2006
Cited by 36 (6 self)

Abstract:
We formulate necessary and sufficient conditions for an arbitrary discrete probability distribution to factor according to an undirected graphical model, or a log-linear model, or other more general exponential models. For decomposable graphical models these conditions are equivalent to a set of conditional independence statements similar to the Hammersley–Clifford theorem; however, we show that for non-decomposable graphical models they are not. We also show that non-decomposable models can have non-rational maximum likelihood estimates. These results are used to give several novel characterizations of decomposable graphical models.
Kullback-Leibler approximation of spectral density functions
IEEE Trans. Inform. Theory, 2003
Cited by 26 (15 self)

Abstract:
We introduce a Kullback–Leibler-type distance between spectral density functions of stationary stochastic processes and solve the problem of optimal approximation of a given spectral density Ψ by one that is consistent with prescribed second-order statistics. In general, such statistics are expressed as the state covariance of a linear filter driven by a stochastic process whose spectral density is sought. In this context, we show i) that there is a unique spectral density Φ which minimizes this Kullback–Leibler distance, ii) that this optimal approximant is of the form Ψ/Q, where the "correction term" Q is a rational spectral density function, and iii) that the coefficients of Q can be obtained numerically by solving a suitable convex optimization problem. In the special case where Ψ = 1, the convex functional becomes quadratic and the solution is then specified by linear equations. Index Terms: approximation of power spectra, cross-entropy minimization, Kullback–Leibler distance, mutual information, optimization, spectral estimation.
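A Kullback–Leibler-type divergence between two spectral densities Ψ and Φ on the unit circle is commonly written as below. This is a standard form for such a distance between scalar spectral densities, shown for orientation only; the paper's exact definition and normalization may differ.

```latex
S(\Psi \,\|\, \Phi) \;=\; \int_{-\pi}^{\pi} \Psi(\theta)\,
  \log\frac{\Psi(\theta)}{\Phi(\theta)}\,\frac{d\theta}{2\pi}
```

As with the ordinary Kullback–Leibler divergence, this quantity is nonnegative and vanishes exactly when Ψ = Φ (when both integrate to the same value), which is what makes it usable as an approximation criterion.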
A convex optimization approach to generalized moment problems
Control and Modeling of Complex Systems: Cybernetics in the 21st Century: Festschrift in Honor of Hidenori Kimura on the Occasion of his 60th, 2003
Cited by 17 (11 self)

Abstract:
In this paper we present a universal solution to the generalized moment problem, with a non-classical complexity constraint. We show that this solution can be obtained by minimizing a strictly convex nonlinear functional. This optimization problem is derived in two different ways. We first derive it intrinsically, in a geometric way, by path integration of a one-form which defines the generalized moment problem. It is observed that this one-form is closed and defined on a convex set, and thus exact with, perhaps surprisingly, a strictly convex primitive function. We also derive this convex functional as the dual of a problem to maximize a cross-entropy functional. In particular, these approaches give a constructive parameterization of all solutions to the Nevanlinna-Pick interpolation problem, with possible higher-order interpolation at certain points in the complex plane, with a degree constraint, as well as all solutions to the rational covariance extension problem, two areas which have been advanced by the work of Hidenori Kimura. Illustrations of these results in system identification and probability are also mentioned. Key words: moment problems, convex optimization, Nevanlinna-Pick interpolation, covariance extension, systems identification, Kullback-Leibler distance.
Maximum entropy Gaussian approximation for the number of integer points and volumes of polytopes
, 2009
Cited by 6 (5 self)

Abstract:
We describe a maximum entropy approach for computing volumes and counting integer points in polyhedra. To estimate the number of points from a particular set X ⊂ R^n in a polyhedron P ⊂ R^n, by solving a certain entropy maximization problem, we construct a probability distribution on the set X such that a) the probability mass function is constant on the set P ∩ X and b) the expectation of the distribution lies in P. This allows us to apply Central Limit Theorem-type arguments to deduce computationally efficient approximations for the number of integer points, volumes, and the number of 0-1 vectors in the polytope. As an application, we obtain asymptotic formulas for volumes of multi-index transportation polytopes and for the number of multi-way contingency tables.
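The flavor of the method, maximize entropy subject to a mean constraint and then invoke a local central limit theorem, shows up already in the simplest case: counting 0-1 vectors with a fixed coordinate sum. There the maximum-entropy distribution is a product of identical Bernoullis and the estimate reduces to the Stirling approximation of a binomial coefficient. This sketch illustrates only that special case, not the general polytope machinery of the paper.

```python
import math

def me_count_01_sum(n, k):
    """Max-entropy/CLT estimate of the number of 0-1 vectors of length n
    whose coordinates sum to k, i.e. an estimate of C(n, k)."""
    p = k / n
    # The product-Bernoulli(p) distribution maximizes entropy among
    # distributions on {0,1}^n whose expected coordinate sum is k;
    # its entropy is n * H(p).
    entropy = n * (-p * math.log(p) - (1 - p) * math.log(1 - p))
    # Local central limit theorem: P(sum == k) is roughly the Gaussian
    # density at its mean, 1 / sqrt(2*pi*n*p*(1-p)).
    return math.exp(entropy) / math.sqrt(2 * math.pi * n * p * (1 - p))

exact = math.comb(100, 30)
approx = me_count_01_sum(100, 30)
print(round(approx / exact, 3))  # close to 1 already for n = 100
```

The ratio to the exact binomial coefficient tends to 1 as n grows, which is the kind of asymptotic accuracy the paper establishes for far more general polytopes.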
Paraphrase Recognition Using Machine Learning to Combine Similarity Measures
Cited by 6 (1 self)

Abstract:
This paper presents three methods that can be used to recognize paraphrases. They all employ string similarity measures applied to shallow abstractions of the input sentences, and a Maximum Entropy classifier to learn how to combine the resulting features. Two of the methods also exploit WordNet to detect synonyms, and one of them also exploits a dependency parser. We experiment on two datasets, the MSR paraphrasing corpus and a dataset that we automatically created from the MTC corpus. Our system achieves state-of-the-art or better results.
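A Maximum Entropy (i.e. logistic regression) classifier over string-similarity features can be sketched as follows. The sentence pairs and the two features (word-set Jaccard overlap and a character-length ratio) are invented stand-ins for the richer similarity measures and corpora the paper combines.

```python
import math

def jaccard(a, b):
    """Word-set overlap between two sentences."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb)

def feats(s1, s2):
    """Feature vector: bias, Jaccard similarity, length ratio."""
    lr = min(len(s1), len(s2)) / max(len(s1), len(s2))
    return [1.0, jaccard(s1, s2), lr]

# Invented toy pairs: (sentence1, sentence2, is_paraphrase).
pairs = [
    ("the cat sat on the mat", "the cat sat on a mat", 1),
    ("he bought a new car", "he purchased a new car", 1),
    ("the weather is sunny today", "stocks fell sharply on monday", 0),
    ("she plays the piano well", "the report is due tomorrow", 0),
]

w = [0.0, 0.0, 0.0]
for _ in range(3000):  # stochastic gradient ascent on the log-likelihood
    for s1, s2, y in pairs:
        x = feats(s1, s2)
        p = 1.0 / (1.0 + math.exp(-sum(wi * xi for wi, xi in zip(w, x))))
        w = [wi + 0.1 * (y - p) * xi for wi, xi in zip(w, x)]

for s1, s2, y in pairs:
    x = feats(s1, s2)
    p = 1.0 / (1.0 + math.exp(-sum(wi * xi for wi, xi in zip(w, x))))
    print(y, round(p, 2))
```

Binary logistic regression is exactly a two-class Maximum Entropy model, so the learned weights play the role of the feature-combination weights the paper's classifier estimates.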