Results 1 - 7 of 7
Discriminative Reranking for Natural Language Parsing
2005
Abstract

Cited by 268 (9 self)
This article considers approaches which rerank the output of an existing probabilistic parser. The base parser produces a set of candidate parses for each input sentence, with associated probabilities that define an initial ranking of these parses. A second model then attempts to improve upon this initial ranking, using additional features of the tree as evidence. The strength of our approach is that it allows a tree to be represented as an arbitrary set of features, without concerns about how these features interact or overlap and without the need to define a derivation or a generative model which takes these features into account. We introduce a new method for the reranking task, based on the boosting approach to ranking problems described in Freund et al. (1998). We apply the boosting method to parsing the Wall Street Journal treebank. The method combined the log-likelihood under a baseline model (that of Collins [1999]) with evidence from an additional 500,000 features over parse trees that were not included in the original model. The new model achieved 89.75% F-measure, a 13% relative decrease in F-measure error over the baseline model's score of 88.2%. The article also introduces a new algorithm for the boosting approach which takes advantage of the sparsity of the feature space in the parsing data. Experiments show significant efficiency gains for the new algorithm over the obvious implementation of the boosting approach. We argue that the method is an appealing alternative, in terms of both simplicity and efficiency, to work on feature selection methods within log-linear (maximum-entropy) models. Although the experiments in this article are on natural language parsing (NLP), the approach should be applicable to many other NLP problems which are naturally framed as ranking tasks, for example, speech recognition, machine translation, or natural language generation.
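The reranking step described in this abstract can be illustrated with a minimal sketch. It assumes a linear score that adds a weighted base-parser log-probability to the weights of whichever binary features fire on a candidate tree. The feature names, weight values, and function names below are hypothetical, and the boosting procedure that would learn the weights is not shown. A small helper also reproduces the reported relative error reduction from the two F-measure figures.

```python
from typing import Dict, List, Set, Tuple

# A candidate parse: (log-probability from the base parser, set of active feature names).
# The feature names used here are placeholders, not features from the article.
Candidate = Tuple[float, Set[str]]


def rerank_score(log_prob: float, features: Set[str],
                 w0: float, weights: Dict[str, float]) -> float:
    """Linear score: weighted base log-probability plus weights of active features."""
    return w0 * log_prob + sum(weights.get(f, 0.0) for f in features)


def rerank(candidates: List[Candidate], w0: float,
           weights: Dict[str, float]) -> int:
    """Return the index of the highest-scoring candidate."""
    return max(range(len(candidates)),
               key=lambda i: rerank_score(candidates[i][0], candidates[i][1],
                                          w0, weights))


def relative_error_reduction(baseline_f: float, new_f: float) -> float:
    """Relative decrease in F-measure error, with both scores given in percent."""
    baseline_error = 100.0 - baseline_f
    new_error = 100.0 - new_f
    return (baseline_error - new_error) / baseline_error


if __name__ == "__main__":
    # Hypothetical two-candidate example: the base parser prefers candidate 0,
    # but the extra features move candidate 1 to the top.
    candidates = [(-10.0, {"feat_a"}), (-10.5, {"feat_b"})]
    weights = {"feat_a": -0.2, "feat_b": 1.0}
    print(rerank(candidates, w0=1.0, weights=weights))
    print(round(relative_error_reduction(88.2, 89.75), 4))
```

The error arithmetic is (11.8 - 10.25) / 11.8, which is roughly 0.13 and matches the 13% figure quoted in the abstract.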
Potential Boosters?
Advances in Neural Information Processing Systems 12, 2000
Abstract

Cited by 20 (4 self)
Recent interpretations of the AdaBoost algorithm view it as performing a gradient descent on a potential function. Simply changing the potential function allows one to create new algorithms related to AdaBoost. However, these new algorithms are generally not known to have the formal boosting property. This paper examines the question of which potential functions lead to new algorithms that are boosters. The two main results are general sets of conditions on the potential; one set implies that the resulting algorithm is a booster, while the other implies that the algorithm is not. These conditions are applied to previously studied potential functions, such as those used by LogitBoost and Doom II.

1 Introduction

The first boosting algorithm appeared in Rob Schapire's thesis [1]. This algorithm was able to boost the performance of a weak PAC learner [2] so that the resulting algorithm satisfies the strong PAC learning [3] criteria. We will call any method that builds a strong PA...
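The gradient-descent view mentioned in this abstract can be made concrete with a small sketch. In that view, each training example is weighted in proportion to the negative derivative of the potential at the example's current margin. The code assumes two standard potentials: the exponential potential exp(-m) associated with AdaBoost, and a logistic potential log(1 + exp(-m)) of the kind associated with LogitBoost-style algorithms. These exact functional forms and the function names are illustrative assumptions and are not taken from the paper.

```python
import math
from typing import Callable, List


def exp_potential(margin: float) -> float:
    """Exponential potential, the one AdaBoost is usually said to descend."""
    return math.exp(-margin)


def logistic_potential(margin: float) -> float:
    """Logistic potential (assumed form for a LogitBoost-style variant)."""
    return math.log1p(math.exp(-margin))


def exp_weight(margin: float) -> float:
    """Negative derivative of exp(-m) with respect to m."""
    return math.exp(-margin)


def logistic_weight(margin: float) -> float:
    """Negative derivative of log(1 + exp(-m)) with respect to m."""
    return 1.0 / (1.0 + math.exp(margin))


def example_distribution(margins: List[float],
                         weight_fn: Callable[[float], float]) -> List[float]:
    """Normalize per-example weights into a distribution over examples."""
    raw = [weight_fn(m) for m in margins]
    total = sum(raw)
    return [r / total for r in raw]


if __name__ == "__main__":
    margins = [-1.0, 0.0, 2.0]  # hypothetical margins y * F(x) of three examples
    print(example_distribution(margins, exp_weight))
    print(example_distribution(margins, logistic_weight))
```

Swapping `weight_fn` is the only change needed to move between the two variants. The logistic weight is bounded by 1 for badly misclassified examples, whereas the exponential weight grows without bound.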
Game Theory, Maximum Generalized Entropy, Minimum Discrepancy, Robust Bayes and Pythagoras
2002
Strong Entropy Concentration, Game Theory and Algorithmic Randomness
2001
Abstract

Cited by 3 (2 self)
We give a characterization of Maximum Entropy/Minimum Relative Entropy inference by providing two 'strong entropy concentration' theorems. These theorems unify and generalize Jaynes' 'concentration phenomenon' and Van Campenhout and Cover's 'conditional limit theorem'. The theorems characterize exactly in what sense a 'prior' distribution Q conditioned on a given constraint and the distribution P̃ minimizing D(P‖Q) over all P satisfying the constraint are 'close' to each other. We show how our theorems are related to 'universal models' for exponential families, thereby establishing a link with Rissanen's MDL/stochastic complexity. We then apply our theorems to establish the relationship (A) between entropy concentration and a game-theoretic characterization of Maximum Entropy Inference due to Topsøe and others; (B) between maximum entropy distributions and sequences that are random (in the sense of Martin-Löf/Kolmogorov) with respect to the given constraint. These two applications have strong implications for the use of Maximum Entropy distributions in sequential prediction tasks, both for the logarithmic loss and for general loss functions. We identify circumstances under which Maximum Entropy predictions are almost optimal.
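The object at the center of this abstract, the distribution minimizing the relative entropy D(P‖Q) subject to a constraint, can be computed numerically in a simple case. The sketch below assumes a finite outcome space and a single constraint fixing the mean of a function of the outcome. Under those assumptions the minimizer is an exponential tilt of Q, and the tilt parameter can be found by bisection. The die example, the function names, and the search interval are illustrative choices, not taken from the paper.

```python
import math
from typing import List


def tilt(q: List[float], values: List[float], lam: float) -> List[float]:
    """Exponentially tilt q: p_i is proportional to q_i * exp(lam * value_i)."""
    w = [qi * math.exp(lam * v) for qi, v in zip(q, values)]
    z = sum(w)
    return [wi / z for wi in w]


def mean(p: List[float], values: List[float]) -> float:
    return sum(pi * v for pi, v in zip(p, values))


def min_relative_entropy(q: List[float], values: List[float],
                         target_mean: float,
                         lo: float = -50.0, hi: float = 50.0,
                         iters: int = 200) -> List[float]:
    """Find the tilt of q whose mean equals target_mean.

    The mean of the tilted distribution is increasing in lam, so bisection
    on lam works; [lo, hi] is an assumed search interval.
    """
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if mean(tilt(q, values, mid), values) < target_mean:
            lo = mid
        else:
            hi = mid
    return tilt(q, values, (lo + hi) / 2.0)


def relative_entropy(p: List[float], q: List[float]) -> float:
    """D(P||Q) in nats, skipping outcomes with p_i = 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


if __name__ == "__main__":
    faces = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
    uniform = [1.0 / 6.0] * 6          # prior Q: a fair die
    p = min_relative_entropy(uniform, faces, target_mean=4.5)
    print([round(x, 4) for x in p], round(mean(p, faces), 6))
```

With a uniform prior, minimizing D(P‖Q) is the same as maximizing the entropy of P, so this is also a Maximum Entropy computation under the mean constraint.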
Strong Entropy Concentration, Coding, Game Theory and Randomness
2001
Abstract

Cited by 1 (1 self)
We give a characterization of Maximum Entropy/Minimum Relative Entropy inference by providing two 'strong entropy concentration' theorems. These theorems unify and generalize Jaynes' 'concentration phenomenon' and Van Campenhout and Cover's 'conditional limit theorem'. The theorems characterize exactly in what sense a 'prior' distribution Q conditioned on a given constraint and the distribution P̃ minimizing D(P‖Q) over all P satisfying the constraint are 'close' to each other. We show how our theorems are related to 'universal models' for exponential families, thereby establishing a link with Rissanen's MDL/stochastic complexity. We then apply our theorems to establish the relationship (A) between entropy concentration and a game-theoretic characterization of Maximum Entropy Inference due to Topsøe and others; (B) between maximum entropy distributions and sequences that are random (in the sense of Martin-Löf/Kolmogorov) with respect to the given constraint. These two applications have strong implications for the use of Maximum Entropy distributions in sequential prediction tasks, both for the logarithmic loss and for general loss functions. We identify circumstances under which Maximum Entropy predictions are almost optimal.

This is an extended version, containing all the proofs, of the paper Strong Entropy Concentration, Game Theory and Algorithmic Randomness, Proceedings of the Fourteenth Annual Conference on Computational Learning Theory (COLT/EUROCOLT 2001), Amsterdam, 2001. The author would like to thank Dan Roth and especially Phil Dawid for providing stimulating conversations and deep insights. The author is with EURANDOM, Postbus 513, 5600 MB Eindhoven, the Netherlands. URL: www.cwi.nl/~pdg.

PETER GRÜNWALD, STRONG ENTROPY CONCENTRATION, CODING...
Boosting versus Covering
In Advances in Neural Information Processing Systems 16, 2003
Abstract
We investigate improvements of AdaBoost that can exploit the fact that the weak hypotheses are one-sided, i.e., for each weak hypothesis either all of its positive predictions or all of its negative predictions are correct. In particular, for any set of m labeled examples consistent with a disjunction of k literals (which are one-sided in this case), AdaBoost constructs a consistent hypothesis by using O(k^2 log m) iterations. On the other hand, a greedy set covering algorithm finds a consistent hypothesis of size O(k log m). Our primary question is whether there is a simple boosting algorithm that performs as well as the greedy set covering.
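The greedy set covering baseline mentioned in this abstract can be sketched as follows. The sketch assumes examples given as 0/1 vectors and restricts itself to monotone disjunctions (unnegated variables only), which is a simplifying assumption made here rather than the paper's setting. A variable that is 0 on every negative example is one-sided in the abstract's sense: whenever it predicts positive, it is correct. The greedy rule repeatedly picks the such variable covering the most still-uncovered positive examples. Function and variable names are hypothetical.

```python
from typing import List, Sequence


def greedy_disjunction(examples: Sequence[Sequence[int]],
                       labels: Sequence[int]) -> List[int]:
    """Greedily build a monotone disjunction consistent with the sample.

    Returns the indices of the chosen variables. Raises ValueError if no
    monotone disjunction is consistent with the labeled examples.
    """
    n_vars = len(examples[0])
    negatives = [x for x, y in zip(examples, labels) if y == 0]
    # One-sided literals: never fire on a negative example.
    safe = [j for j in range(n_vars) if all(x[j] == 0 for x in negatives)]
    # Positive examples not yet covered by a chosen variable.
    uncovered = {i for i, y in enumerate(labels) if y == 1}
    chosen: List[int] = []
    while uncovered:
        best = max(safe,
                   key=lambda j: sum(examples[i][j] for i in uncovered),
                   default=None)
        gain = ({i for i in uncovered if examples[i][best]}
                if best is not None else set())
        if not gain:
            raise ValueError("sample is not consistent with a monotone disjunction")
        chosen.append(best)
        uncovered -= gain
    return chosen


def predict(chosen: List[int], x: Sequence[int]) -> int:
    """Evaluate the disjunction of the chosen variables on one example."""
    return int(any(x[j] for j in chosen))


if __name__ == "__main__":
    # Hypothetical sample consistent with the disjunction x0 OR x1.
    examples = [[1, 0, 0], [0, 1, 0], [1, 1, 0], [0, 0, 1], [0, 0, 0]]
    labels = [1, 1, 1, 0, 0]
    chosen = greedy_disjunction(examples, labels)
    print(chosen, [predict(chosen, x) for x in examples])
```

Each greedy step is the standard set cover heuristic applied to the positive examples, which is where the O(k log m) size bound quoted in the abstract comes from.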