Results 1–10 of 60
A Maximum Entropy Approach to Adaptive Statistical Language Modeling
 Computer, Speech and Language, 1996
Abstract

Cited by 278 (12 self)
An adaptive statistical language model is described, which successfully integrates long distance linguistic information with other knowledge sources. Most existing statistical language models exploit only the immediate history of a text. To extract information from further back in the document's history, we propose and use trigger pairs as the basic information bearing elements. This allows the model to adapt its expectations to the topic of discourse. Next, statistical evidence from multiple sources must be combined. Traditionally, linear interpolation and its variants have been used, but these are shown here to be seriously deficient. Instead, we apply the principle of Maximum Entropy (ME). Each information source gives rise to a set of constraints, to be imposed on the combined estimate. The intersection of these constraints is the set of probability functions which are consistent with all the information sources. The function with the highest entropy within that set is the ME solution...
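The ME recipe the abstract describes (pick the highest-entropy distribution consistent with every constraint) can be sketched in a few lines for the single-constraint case: the ME solution has exponential form, and its one parameter can be found by bisection. Everything below, from the function name to the toy support {0, 1, 2} and the target mean, is illustrative rather than taken from the paper:

```python
import math

def max_entropy_dist(values, target_mean):
    """Maximum-entropy distribution over `values` subject to the single
    constraint E[X] = target_mean.  The ME solution has the exponential
    form p_i ∝ exp(lam * x_i); lam is found by bisection, since the
    implied mean is strictly increasing in lam."""
    def mean_at(lam):
        w = [math.exp(lam * v) for v in values]
        z = sum(w)
        return sum(v * wi for v, wi in zip(values, w)) / z

    lo, hi = -50.0, 50.0
    for _ in range(200):                  # bisect to machine precision
        mid = (lo + hi) / 2.0
        if mean_at(mid) < target_mean:
            lo = mid
        else:
            hi = mid
    lam = (lo + hi) / 2.0
    w = [math.exp(lam * v) for v in values]
    z = sum(w)
    return [wi / z for wi in w]

# One constraint: a distribution on {0, 1, 2} whose mean must be 1.2.
p = max_entropy_dist([0, 1, 2], 1.2)
```

Among all distributions on {0, 1, 2} with mean 1.2, this `p` has strictly higher entropy than any alternative, which is the sense in which ME "assumes nothing beyond the constraints".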
Axiomatic derivation of the principle of maximum entropy and the principle of minimum cross entropy
 IEEE Trans. Information Theory, 1980
Abstract

Cited by 239 (0 self)
...principle of minimum cross-entropy (minimum directed divergence) are shown to be uniquely correct methods for inductive inference when new information is given in the form of expected values. Previous justifications...
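A toy discrete analogue of minimum cross-entropy inference: given a prior and new information in the form of an expected value, update to the distribution closest to the prior in KL divergence among those matching the constraint. The minimizer is an exponentially tilted prior. The sketch below solves the one-constraint finite case; the prior, support, and target mean are invented for illustration:

```python
import math

def min_cross_entropy(prior, values, target_mean):
    """I-projection: among all q with E_q[X] = target_mean, find the one
    minimizing KL(q || prior).  The minimizer has the tilted form
    q_i ∝ prior_i * exp(lam * x_i); lam is found by bisection."""
    def mean_at(lam):
        w = [p * math.exp(lam * v) for p, v in zip(prior, values)]
        z = sum(w)
        return sum(v * wi for v, wi in zip(values, w)) / z

    lo, hi = -50.0, 50.0
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if mean_at(mid) < target_mean:
            lo = mid
        else:
            hi = mid
    lam = (lo + hi) / 2.0
    w = [p * math.exp(lam * v) for p, v in zip(prior, values)]
    z = sum(w)
    return [wi / z for wi in w]

def kl(q, p):
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

# Prior belief over {0, 1, 2}; new information: the mean is actually 1.0.
prior = [0.5, 0.3, 0.2]
q = min_cross_entropy(prior, [0, 1, 2], 1.0)
```

With a uniform prior this reduces to maximum entropy, which is why the two principles are treated together.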
Maximum Entropy Models for Natural Language Ambiguity Resolution
, 1998
Abstract

Cited by 226 (1 self)
The best aspect of a research environment, in my opinion, is the abundance of bright people with whom you argue, discuss, and nurture your ideas. I thank all of the people at Penn and elsewhere who have given me the feedback that has helped me to separate the good ideas from the bad ideas. I hope that I have kept the good ideas in this thesis, and left the bad ideas out! I would like to acknowledge the following people for their contribution to my education: I thank my advisor Mitch Marcus, who gave me the intellectual freedom to pursue what I believed to be the best way to approach natural language processing, and also gave me direction when necessary. I also thank Mitch for many fascinating conversations, both personal and professional, over the last four years at Penn. I thank all of my thesis committee members: John Lafferty from Carnegie Mellon University, Aravind Joshi, Lyle Ungar, and Mark Liberman, for their extremely valuable suggestions and comments about my thesis research. I thank Mike Collins, Jason Eisner, and Dan Melamed, with whom I've had many stimulating and impromptu discussions in the LINC lab. I owe them much gratitude for their valuable feedback on numerous rough drafts of papers and thesis chapters.
Using Unlabeled Data to Improve Text Classification
, 2001
Abstract

Cited by 64 (0 self)
One key difficulty with text classification learning algorithms is that they require many hand-labeled examples to learn accurately. This dissertation demonstrates that supervised learning algorithms that use a small number of labeled examples and many inexpensive unlabeled examples can create high-accuracy text classifiers. By assuming that documents are created by a parametric generative model, Expectation-Maximization (EM) finds local maximum a posteriori models and classifiers from all the data, labeled and unlabeled. These generative models do not capture all the intricacies of text; however, on some domains this technique substantially improves classification accuracy, especially when labeled data are sparse. Two problems arise from this basic approach. First, unlabeled data can hurt performance in domains where the generative modeling assumptions are too strongly violated. In this case the assumptions can be made more representative in two ways: by modeling subtopic class structure, and by modeling supertopic hierarchical class relationships. By doing so, model probability and classification accuracy come into correspondence, allowing unlabeled data to improve classification performance. The second problem is that even with a representative model, the improvements given by unlabeled data do not sufficiently compensate for a paucity of labeled data. Here, limited labeled data provide EM initializations that lead to low-probability models. Performance can be significantly improved by using active learning to select high-quality initializations, and by using alternatives to EM that avoid low-probability local maxima.
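The EM procedure described above (fit a generative model to the labeled data, then alternate between soft-labeling the unlabeled documents and re-estimating the model from all the data) can be sketched with a multinomial naive Bayes model. The corpus, vocabulary, and class names below are invented, and this is a minimal sketch under those assumptions, not the dissertation's implementation:

```python
import math
from collections import defaultdict

def train_nb(docs, soft_labels, vocab):
    """Multinomial naive Bayes from soft (fractional) class labels."""
    classes = sorted({c for sl in soft_labels for c in sl})
    prior = {c: 0.0 for c in classes}
    counts = {c: defaultdict(float) for c in classes}
    for doc, sl in zip(docs, soft_labels):
        for c, p in sl.items():
            prior[c] += p
            for word in doc:
                counts[c][word] += p
    total = sum(prior.values())
    log_prior = {c: math.log(prior[c] / total) for c in classes}
    log_lik = {}
    for c in classes:
        denom = sum(counts[c][w] for w in vocab) + len(vocab)  # Laplace
        log_lik[c] = {w: math.log((counts[c][w] + 1.0) / denom) for w in vocab}
    return log_prior, log_lik

def posterior(doc, log_prior, log_lik):
    scores = {c: log_prior[c] + sum(log_lik[c][w] for w in doc)
              for c in log_prior}
    m = max(scores.values())
    ex = {c: math.exp(s - m) for c, s in scores.items()}
    z = sum(ex.values())
    return {c: e / z for c, e in ex.items()}

# Invented toy corpus: two labeled documents, two unlabeled ones.
labeled = [(["ball", "goal"], "sport"), (["vote", "law"], "politics")]
unlabeled = [["ball", "ball", "goal"], ["law", "vote", "vote"]]
vocab = {"ball", "goal", "vote", "law"}

docs = [d for d, _ in labeled]
hard = [{c: 1.0} for _, c in labeled]
model = train_nb(docs, hard, vocab)        # initialize from labeled data only

for _ in range(5):
    # E-step: soft-label the unlabeled documents with the current model.
    soft = [posterior(d, *model) for d in unlabeled]
    # M-step: re-estimate from labeled (hard) plus unlabeled (soft) counts.
    model = train_nb(docs + unlabeled, hard + soft, vocab)
```

The unlabeled documents sharpen the word statistics of each class beyond what the two labeled examples alone provide, which is the effect the dissertation studies.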
On the toric algebra of graphical models
, 2006
Abstract

Cited by 39 (7 self)
We formulate necessary and sufficient conditions for an arbitrary discrete probability distribution to factor according to an undirected graphical model, or a log-linear model, or other more general exponential models. For decomposable graphical models these conditions are equivalent to a set of conditional independence statements similar to the Hammersley–Clifford theorem; however, we show that for nondecomposable graphical models they are not. We also show that nondecomposable models can have nonrational maximum likelihood estimates. These results are used to give several novel characterizations of decomposable graphical models.
Kullback–Leibler approximation of spectral density functions
 IEEE Trans. Inform. Theory, 2003
Abstract

Cited by 34 (16 self)
We introduce a Kullback–Leibler-type distance between spectral density functions of stationary stochastic processes and solve the problem of optimal approximation of a given spectral density Ψ by one that is consistent with prescribed second-order statistics. In general, such statistics are expressed as the state covariance of a linear filter driven by a stochastic process whose spectral density is sought. In this context, we show i) that there is a unique spectral density Φ which minimizes this Kullback–Leibler distance, ii) that this optimal approximant is of the form Ψ/Q, where the "correction term" Q is a rational spectral density function, and iii) that the coefficients of Q can be obtained numerically by solving a suitable convex optimization problem. In the special case where Ψ = 1, the convex functional becomes quadratic and the solution is then specified by linear equations.
A convex optimization approach to generalized moment problems
 Control and Modeling of Complex Systems: Cybernetics in the 21st Century: Festschrift in Honor of Hidenori Kimura on the Occasion of his 60th Birthday, 2003
Abstract

Cited by 19 (10 self)
In this paper we present a universal solution to the generalized moment problem, with a nonclassical complexity constraint. We show that this solution can be obtained by minimizing a strictly convex nonlinear functional. This optimization problem is derived in two different ways. We first derive this intrinsically, in a geometric way, by path integration of a one-form which defines the generalized moment problem. It is observed that this one-form is closed and defined on a convex set, and thus exact with, perhaps surprisingly, a strictly convex primitive function. We also derive this convex functional as the dual problem of a problem to maximize a cross entropy functional. In particular, these approaches give a constructive parameterization of all solutions to the Nevanlinna–Pick interpolation problem, with possible higher-order interpolation at certain points in the complex plane, with a degree constraint, as well as all solutions to the rational covariance extension problem, two areas which have been advanced by the work of Hidenori Kimura. Illustrations of these results in system identification and probability are also mentioned. Key words: moment problems, convex optimization, Nevanlinna–Pick interpolation, covariance extension, systems identification, Kullback–Leibler distance.
Paraphrase Recognition Using Machine Learning to Combine Similarity Measures
Abstract

Cited by 8 (1 self)
This paper presents three methods that can be used to recognize paraphrases. They all employ string similarity measures applied to shallow abstractions of the input sentences, and a Maximum Entropy classifier to learn how to combine the resulting features. Two of the methods also exploit WordNet to detect synonyms, and one of them also exploits a dependency parser. We experiment on two datasets, the MSR paraphrasing corpus and a dataset that we automatically created from the MTC corpus. Our system achieves state-of-the-art or better results.
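A minimal sketch of the overall scheme (similarity features on sentence pairs, combined by a learned classifier) is given below, with two substitutions: difflib's SequenceMatcher ratio and word-set Jaccard stand in for the paper's similarity measures over shallow abstractions, and binary logistic regression trained by SGD stands in for the Maximum Entropy classifier (for two classes the two coincide). All sentence pairs are invented:

```python
import difflib
import math

def features(s1, s2):
    # Two cheap stand-in similarity measures (the paper's actual measures
    # operate on richer shallow abstractions of the sentences).
    seq = difflib.SequenceMatcher(None, s1, s2).ratio()
    w1, w2 = set(s1.split()), set(s2.split())
    jac = len(w1 & w2) / len(w1 | w2) if (w1 | w2) else 0.0
    return [1.0, seq, jac]                      # bias + two features

def train(pairs, labels, lr=1.0, epochs=500):
    """Binary maximum-entropy model (logistic regression) fit by SGD."""
    w = [0.0, 0.0, 0.0]
    for _ in range(epochs):
        for (s1, s2), y in zip(pairs, labels):
            x = features(s1, s2)
            p = 1.0 / (1.0 + math.exp(-sum(wi * xi for wi, xi in zip(w, x))))
            w = [wi - lr * (p - y) * xi for wi, xi in zip(w, x)]
    return w

# Invented toy data: (sentence pair, 1 = paraphrase / 0 = not).
pairs = [("the cat sat on the mat", "a cat sat on a mat"),
         ("stocks fell sharply today", "the cat sat on the mat"),
         ("he bought a new car", "he purchased a new car"),
         ("rain is expected tomorrow", "he bought a new car")]
labels = [1, 0, 1, 0]
w = train(pairs, labels)

def predict(s1, s2):
    x = features(s1, s2)
    return 1.0 / (1.0 + math.exp(-sum(wi * xi for wi, xi in zip(w, x))))
```

The point of the design, as in the paper, is that the classifier learns how much weight each similarity measure deserves instead of relying on hand-set thresholds.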