Results 1 - 10 of 38
Logistic Regression, AdaBoost and Bregman Distances
, 2000
Cited by 171 (39 self)
We give a unified account of boosting and logistic regression in which each learning problem is cast in terms of optimization of Bregman distances. The striking similarity of the two problems in this framework allows us to design and analyze algorithms for both simultaneously, and to easily adapt algorithms designed for one problem to the other. For both problems, we give new algorithms and explain their potential advantages over existing methods. These algorithms can be divided into two types based on whether the parameters are iteratively updated sequentially (one at a time) or in parallel (all at once). We also describe a parameterized family of algorithms which interpolates smoothly between these two extremes. For all of the algorithms, we give convergence proofs using a general formalization of the auxiliary-function proof technique. As one of our sequential-update algorithms is equivalent to AdaBoost, this provides the first general proof of convergence for AdaBoost. We show that all of our algorithms generalize easily to the multiclass case, and we contrast the new algorithms with iterative scaling. We conclude with a few experimental results with synthetic data that highlight the behavior of the old and newly proposed algorithms in different settings.
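The sequential (one parameter at a time) update mentioned in the abstract can be illustrated with the textbook view of AdaBoost as greedy coordinate descent on the exponential loss. The sketch below is not taken from the paper: it assumes base hypotheses with outputs in {-1, +1}, and the function names `adaboost_sequential` and `exp_loss` are hypothetical.

```python
import math


def exp_loss(features, labels, weights):
    """Exponential loss: sum_i exp(-y_i * sum_j w_j * h_j(x_i))."""
    return sum(
        math.exp(-y * sum(w * f for w, f in zip(weights, x)))
        for x, y in zip(features, labels)
    )


def adaboost_sequential(features, labels, rounds):
    """Update one weight per round (sequential update).

    Assumption: features[i][j] in {-1, +1} is the output of base
    hypothesis j on example i, and labels[i] is in {-1, +1}.
    """
    n, m = len(features), len(features[0])
    weights = [0.0] * m
    for _ in range(rounds):
        # Distribution over examples induced by the current combination.
        d = [
            math.exp(-labels[i] * sum(w * f for w, f in zip(weights, features[i])))
            for i in range(n)
        ]
        total = sum(d)
        d = [v / total for v in d]
        # Edge (weighted correlation with the labels) of each base hypothesis.
        edges = [
            sum(d[i] * labels[i] * features[i][j] for i in range(n))
            for j in range(m)
        ]
        j = max(range(m), key=lambda k: abs(edges[k]))
        r = edges[j]
        if r == 0.0 or abs(r) >= 1.0:
            break  # no progress possible, or a perfect base hypothesis
        # Closed-form minimiser of the exponential loss along coordinate j.
        weights[j] += 0.5 * math.log((1.0 + r) / (1.0 - r))
    return weights
```

A parallel-update variant would change every weight in each round instead of only the selected coordinate `j`; the abstract describes both types without giving the update formulas, so they are not reproduced here.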
Maximum Entropy Discrimination
, 1999
Cited by 95 (20 self)
We present a general framework for discriminative estimation based on the maximum entropy principle and its extensions. All calculations involve distributions over structures and/or parameters rather than specific settings and reduce to relative entropy projections. This holds even when the data is not separable within the chosen parametric class, in the context of anomaly detection rather than classification, or when the labels in the training set are uncertain or incomplete. Support vector machines are naturally subsumed under this class and we provide several extensions. We are also able to estimate exactly and efficiently discriminative distributions over tree structures of class-conditional models within this framework. Preliminary experimental results are indicative of the potential in these techniques.
An introduction to boosting and leveraging
- Advanced Lectures on Machine Learning, LNCS
, 2003
Boosting and Maximum Likelihood for Exponential Models
- In Advances in Neural Information Processing Systems
, 2001
Cited by 66 (5 self)
Recent research has considered the relationship between boosting and more standard statistical methods, such as logistic regression, concluding that AdaBoost is similar but somehow still very different from statistical methods in that it minimizes a different loss function. In this paper we derive an equivalence between AdaBoost and the dual of a convex optimization problem. In this setting, it is seen that the only difference between minimizing the exponential loss used by AdaBoost and maximum likelihood for exponential models is that the latter requires the model to be normalized to form a conditional probability distribution over labels; the two methods minimize the same Kullback-Leibler divergence objective function subject to identical feature constraints. In addition to establishing a simple and easily understood connection between the two methods, this framework enables us to derive new regularization procedures for boosting that directly correspond to penalized maximum likelihood. Experiments on UCI datasets, comparing exponential loss and maximum likelihood for parallel and sequential update algorithms, confirm our theoretical analysis, indicating that AdaBoost and maximum likelihood typically yield identical results as the number of features increases to allow the models to fit the training data.
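The two objectives contrasted in this abstract can be compared numerically in their standard binary forms. This is a minimal sketch under that assumption: the abstract does not spell out the formulas, and the helper names `exponential_loss` and `logistic_loss` are hypothetical. Here `margin` stands for y * F(x), the label times the model score.

```python
import math


def exponential_loss(margin):
    """Per-example loss in the usual AdaBoost form: exp(-margin)."""
    return math.exp(-margin)


def logistic_loss(margin):
    """Per-example negative log-likelihood of a normalized binary
    logistic model: log(1 + exp(-margin))."""
    return math.log1p(math.exp(-margin))


if __name__ == "__main__":
    for m in (-2.0, 0.0, 2.0, 5.0):
        print(m, exponential_loss(m), logistic_loss(m))
```

Because log(1 + z) <= z for z >= 0, the logistic loss never exceeds the exponential loss, and for large positive margins z is small so log(1 + z) is close to z and the two nearly coincide.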
Game Theory, Maximum Entropy, Minimum Discrepancy And Robust Bayesian Decision Theory
- Annals of Statistics
, 2004
Cited by 53 (3 self)
this paper appeared in the Proceedings of the 2002 IEEE Information Theory Workshop [see Grünwald and Dawid (2002)]
Feature Selection and Dualities in Maximum Entropy Discrimination
- In Uncertainty in Artificial Intelligence
, 2000
Cited by 35 (5 self)
We present the maximum entropy discrimination (MED) formalism as a regularization approach with information theoretic penalties. By extending discriminative and large margin concepts to a probabilistic setting, MED permits many important generalizations to SVMs. We introduce feature selection as a particularly critical augmentation of the learning machine. MED derivations for both regression and classification cases are shown and lead to promising experimental results. Features are pruned simultaneously with parameter estimation to generate substantial improvements with relatively sparse training data. Furthermore, in the linear model case, complexity scales linearly with dimensionality and can remain tractable under explicit feature expansions of non-linear kernels. The MED formalism also accommodates discriminant functions that arise from generative probability models (log-likelihood ratios) although feature selection may require more computational effort and ap...
Matrix exponentiated gradient updates for on-line learning and Bregman projections
- Journal of Machine Learning Research
, 2005
Cited by 32 (8 self)
We address the problem of learning a symmetric positive definite matrix. The central issue is to design parameter updates that preserve positive definiteness. Our updates are motivated with the von Neumann divergence. Rather than treating the most general case, we focus on two key applications that exemplify our methods: On-line learning with a simple square loss and finding a symmetric positive definite matrix subject to symmetric linear constraints. The updates generalize the Exponentiated Gradient (EG) update and AdaBoost, respectively: the parameter is now a symmetric positive definite matrix of trace one instead of a probability vector (which in this context is a diagonal positive definite matrix with trace one). The generalized updates use matrix logarithms and exponentials to preserve positive definiteness. Most importantly, we show how the analysis of each algorithm generalizes to the non-diagonal case. We apply both new algorithms, called the Matrix Exponentiated Gradient (MEG) update and DefiniteBoost, to learn a kernel matrix from distance measurements.
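The diagonal special case named in this abstract, the classical Exponentiated Gradient update on a probability vector, can be sketched as follows. This is the standard vector EG step for the square loss, not the paper's matrix algorithm; the function name `eg_update` and the learning-rate parameter `eta` are assumptions made for illustration.

```python
import math


def eg_update(w, x, y, eta):
    """One Exponentiated Gradient step for the square loss (w.x - y)^2.

    Assumption: w is a probability vector (positive entries summing to 1).
    The multiplicative update followed by renormalization keeps it one.
    """
    pred = sum(wi * xi for wi, xi in zip(w, x))
    grad = [2.0 * (pred - y) * xi for xi in x]
    unnormalized = [wi * math.exp(-eta * gi) for wi, gi in zip(w, grad)]
    z = sum(unnormalized)
    return [u / z for u in unnormalized]
```

According to the abstract, the matrix version replaces the probability vector with a trace-one symmetric positive definite matrix and uses matrix logarithms and exponentials where this sketch uses elementwise ones.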
Constructing Boosting Algorithms from SVMs: An Application to One-class Classification
, 2002
Efficient Margin Maximizing with Boosting
, 2002
Cited by 26 (6 self)
AdaBoost produces a linear combination of base hypotheses and predicts with the sign of this linear combination. It has been observed that the generalization error of the algorithm continues to improve even after all examples are classified correctly by the current signed linear combination, which can be viewed as a hyperplane in feature space where the base hypotheses form the features.
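The prediction rule described above, together with the margin of an example with respect to the combination, can be sketched as follows. The L1 normalization of the margin is the usual convention in the boosting literature and is an assumption here, since the abstract does not define it; the function names are hypothetical.

```python
def score(alphas, outputs):
    """Linear combination of base-hypothesis outputs for one example."""
    return sum(a * h for a, h in zip(alphas, outputs))


def predict(alphas, outputs):
    """Predict with the sign of the linear combination (ties go to +1)."""
    return 1 if score(alphas, outputs) >= 0 else -1


def margin(alphas, outputs, label):
    """Margin of one example: label * score / (L1 norm of the coefficients).

    Assumption: at least one coefficient is non-zero.
    """
    norm = sum(abs(a) for a in alphas)
    return label * score(alphas, outputs) / norm
```

An example is classified correctly exactly when its margin is positive, so the margin can keep growing after the training error has reached zero.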

