Results 1  10
of
39
Sequential minimal optimization: A fast algorithm for training support vector machines
 Advances in Kernel MethodsSupport Vector Learning
, 1999
"... This paper proposes a new algorithm for training support vector machines: Sequential Minimal Optimization, or SMO. Training a support vector machine requires the solution of a very large quadratic programming (QP) optimization problem. SMO breaks this large QP problem into a series of smallest possi ..."
Abstract

Cited by 286 (3 self)
 Add to MetaCart
This paper proposes a new algorithm for training support vector machines: Sequential Minimal Optimization, or SMO. Training a support vector machine requires the solution of a very large quadratic programming (QP) optimization problem. SMO breaks this large QP problem into a series of smallest possible QP problems. These small QP problems are solved analytically, which avoids using a timeconsuming numerical QP optimization as an inner loop. The amount of memory required for SMO is linear in the training set size, which allows SMO to handle very large training sets. Because matrix computation is avoided, SMO scales somewhere between linear and quadratic in the training set size for various test problems, while the standard chunking SVM algorithm scales somewhere between linear and cubic in the training set size. SMO’s computation time is dominated by SVM evaluation, hence SMO is fastest for linear SVMs and sparse data sets. On realworld sparse data sets, SMO can be more than 1000 times faster than the chunking algorithm. 1.
Logistic Regression, AdaBoost and Bregman Distances
, 2000
"... We give a unified account of boosting and logistic regression in which each learning problem is cast in terms of optimization of Bregman distances. The striking similarity of the two problems in this framework allows us to design and analyze algorithms for both simultaneously, and to easily adapt al ..."
Abstract

Cited by 203 (43 self)
 Add to MetaCart
We give a unified account of boosting and logistic regression in which each learning problem is cast in terms of optimization of Bregman distances. The striking similarity of the two problems in this framework allows us to design and analyze algorithms for both simultaneously, and to easily adapt algorithms designed for one problem to the other. For both problems, we give new algorithms and explain their potential advantages over existing methods. These algorithms can be divided into two types based on whether the parameters are iteratively updated sequentially (one at a time) or in parallel (all at once). We also describe a parameterized family of algorithms which interpolates smoothly between these two extremes. For all of the algorithms, we give convergence proofs using a general formalization of the auxiliaryfunction proof technique. As one of our sequentialupdate algorithms is equivalent to AdaBoost, this provides the first general proof of convergence for AdaBoost. We show that all of our algorithms generalize easily to the multiclass case, and we contrast the new algorithms with iterative scaling. We conclude with a few experimental results with synthetic data that highlight the behavior of the old and newly proposed algorithms in different settings.
Relative Loss Bounds for Online Density Estimation with the Exponential Family of Distributions
 MACHINE LEARNING
, 2000
"... We consider online density estimation with a parameterized density from the exponential family. The online algorithm receives one example at a time and maintains a parameter that is essentially an average of the past examples. After receiving an example the algorithm incurs a loss, which is the n ..."
Abstract

Cited by 116 (11 self)
 Add to MetaCart
We consider online density estimation with a parameterized density from the exponential family. The online algorithm receives one example at a time and maintains a parameter that is essentially an average of the past examples. After receiving an example the algorithm incurs a loss, which is the negative loglikelihood of the example with respect to the past parameter of the algorithm. An oline algorithm can choose the best parameter based on all the examples. We prove bounds on the additional total loss of the online algorithm over the total loss of the best oline parameter. These relative loss bounds hold for an arbitrary sequence of examples. The goal is to design algorithms with the best possible relative loss bounds. We use a Bregman divergence to derive and analyze each algorithm. These divergences are relative entropies between two exponential distributions. We also use our methods to prove relative loss bounds for linear regression.
Adaptive and SelfConfident OnLine Learning Algorithms
, 2000
"... We study online learning in the linear regression framework. Most of the performance bounds for online algorithms in this framework assume a constant learning rate. To achieve these bounds the learning rate must be optimized based on a posteriori information. This information depends on the wh ..."
Abstract

Cited by 62 (7 self)
 Add to MetaCart
We study online learning in the linear regression framework. Most of the performance bounds for online algorithms in this framework assume a constant learning rate. To achieve these bounds the learning rate must be optimized based on a posteriori information. This information depends on the whole sequence of examples and thus it is not available to any strictly online algorithm. We introduce new techniques for adaptively tuning the learning rate as the data sequence is progressively revealed. Our techniques allow us to prove essentially the same bounds as if we knew the optimal learning rate in advance. Moreover, such techniques apply to a wide class of online algorithms, including pnorm algorithms for generalized linear regression and Weighted Majority for linear regression with absolute loss. Our adaptive tunings are radically dierent from previous techniques, such as the socalled doubling trick. Whereas the doubling trick restarts the online algorithm several ti...
Boosting as Entropy Projection
, 1999
"... We consider the AdaBoost procedure for boosting weak learners. In AdaBoost, a key step is choosing a new distribution on the training examples based on the old distribution and the mistakes made by the present weak hypothesis. We show how AdaBoost 's choice of the new distribution can be seen ..."
Abstract

Cited by 58 (8 self)
 Add to MetaCart
We consider the AdaBoost procedure for boosting weak learners. In AdaBoost, a key step is choosing a new distribution on the training examples based on the old distribution and the mistakes made by the present weak hypothesis. We show how AdaBoost 's choice of the new distribution can be seen as an approximate solution to the following problem: Find a new distribution that is closest to the old distribution subject to the constraint that the new distribution is orthogonal to the vector of mistakes of the current weak hypothesis. The distance (or divergence) between distributions is measured by the relative entropy. Alternatively, we could say that AdaBoost approximately projects the distribution vector onto a hyperplane dened by the mistake vector. We show that this new view of AdaBoost as an entropy projection is dual to the usual view of AdaBoost as minimizing the normalization factors of the updated distributions.
Matrix exponentiated gradient updates for online learning and Bregman projections
 Journal of Machine Learning Research
, 2005
"... We address the problem of learning a symmetric positive definite matrix. The central issue is to design parameter updates that preserve positive definiteness. Our updates are motivated with the von Neumann divergence. Rather than treating the most general case, we focus on two key applications that ..."
Abstract

Cited by 47 (8 self)
 Add to MetaCart
We address the problem of learning a symmetric positive definite matrix. The central issue is to design parameter updates that preserve positive definiteness. Our updates are motivated with the von Neumann divergence. Rather than treating the most general case, we focus on two key applications that exemplify our methods: Online learning with a simple square loss and finding a symmetric positive definite matrix subject to symmetric linear constraints. The updates generalize the Exponentiated Gradient (EG) update and AdaBoost, respectively: the parameter is now a symmetric positive definite matrix of trace one instead of a probability vector (which in this context is a diagonal positive definite matrix with trace one). The generalized updates use matrix logarithms and exponentials to preserve positive definiteness. Most importantly, we show how the analysis of each algorithm generalizes to the nondiagonal case. We apply both new algorithms, called the Matrix Exponentiated Gradient (MEG) update and DefiniteBoost, to learn a kernel matrix from distance measurements. 1
Legendre Functions and the Method of Random Bregman Projections
, 1997
"... this paper, Bregman's method is studied within the powerful framework of Convex Analysis. New insights are obtained and the rich class of "Bregman/Legendre functions" is introduced. Bregman's method still works, if the underlying function is Bregman/Legendre or more generally if it is Legendre but s ..."
Abstract

Cited by 44 (13 self)
 Add to MetaCart
this paper, Bregman's method is studied within the powerful framework of Convex Analysis. New insights are obtained and the rich class of "Bregman/Legendre functions" is introduced. Bregman's method still works, if the underlying function is Bregman/Legendre or more generally if it is Legendre but some constraint qualification holds additionally. The key advantage is the broad applicability and
Additive Models, Boosting, and Inference for Generalized Divergences
 In Proc. 12th Annu. Conf. on Comput. Learning Theory
, 1999
"... We present a framework for designing incremental learning algorithms derived from generalized entropy functionals. Our approach is based on the use of Bregman divergences together with the associated class of additive models constructed using the Legendre transform. A particular oneparameter family ..."
Abstract

Cited by 39 (3 self)
 Add to MetaCart
We present a framework for designing incremental learning algorithms derived from generalized entropy functionals. Our approach is based on the use of Bregman divergences together with the associated class of additive models constructed using the Legendre transform. A particular oneparameter family of Bregman divergences is shown to yield a family of loss functions that includes the loglikelihood criterion of logistic regression as a special case, and that closely approximates the exponential loss criterion used in the AdaBoost algorithms of Schapire et al., as the natural parameter of the family varies. We also show how the quadratic approximation of the gain in Bregman divergence results in a weighted leastsquares criterion. This leads to a family of incremental learning algorithms that builds upon and extends the recent interpretation of boosting in terms of additive models proposed by Friedman, Hastie, and Tibshirani. 1 Introduction Logistic regression is a widely used statisti...
DUAL COORDINATE STEP METHODS FOR LINEAR NETWORK FLOW PROBLEMS
, 1988
"... We review a class of recentlyproposed linearcost network flow methods which are amenable to distributed implementation. All the methods in the class use the notion of ecomplementary slackness, and most do not explicitly manipulate any "global " objects such as paths, trees, or cuts. Interestingly ..."
Abstract

Cited by 29 (7 self)
 Add to MetaCart
We review a class of recentlyproposed linearcost network flow methods which are amenable to distributed implementation. All the methods in the class use the notion of ecomplementary slackness, and most do not explicitly manipulate any "global " objects such as paths, trees, or cuts. Interestingly, these methods have stimulated a large number of new serial computational complexity results. We develop the basic theory of these methods and present two specific methods, the erelaxation algorithm for the minimumcost flow problem, and the auction algorithm for the assignment problem. We show how to implement these methods with serial complexities of O(N 3 log NC) and O(NA log NC), respectively. We also discuss practical implementation issues and computational experience to date. Finally, we show how to implement erelaxation in a completely asynchronous, "chaotic" environment in which some processors compute faster than others, some processors communicate faster than others, and there can be arbitrarily large communication delays.
Learning permutations with exponential weights
 In 20th Annual Conference on Learning Theory
, 2007
"... Abstract. We give an algorithm for learning a permutation online. The algorithm maintains its uncertainty about the target permutation as a doubly stochastic matrix. This matrix is updated by multiplying the current matrix entries by exponential factors. These factors destroy the doubly stochastic ..."
Abstract

Cited by 27 (5 self)
 Add to MetaCart
Abstract. We give an algorithm for learning a permutation online. The algorithm maintains its uncertainty about the target permutation as a doubly stochastic matrix. This matrix is updated by multiplying the current matrix entries by exponential factors. These factors destroy the doubly stochastic property of the matrix and an iterative procedure is needed to renormalize the rows and columns. Even though the result of the normalization procedure does not have a closed form, we can still bound the additional loss of our algorithm over the loss of the best permutation chosen in hindsight. 1