## Pegasos: Primal Estimated sub-gradient solver for SVM

Citations: 284 (15 self)

### BibTeX

```
@MISC{Shalev-Shwartz_pegasos:primal,
  author = {Shai Shalev-Shwartz and Yoram Singer and Nathan Srebro and Andrew Cotter},
  title  = {Pegasos: Primal Estimated sub-gradient solver for SVM},
  year   = {}
}
```

### Abstract

We describe and analyze a simple and effective stochastic sub-gradient descent algorithm for solving the optimization problem cast by Support Vector Machines (SVM). We prove that the number of iterations required to obtain a solution of accuracy ɛ is Õ(1/ɛ), where each iteration operates on a single training example. In contrast, previous analyses of stochastic gradient descent methods for SVMs require Ω(1/ɛ²) iterations. As in previously devised SVM solvers, the number of iterations also scales linearly with 1/λ, where λ is the regularization parameter of SVM. For a linear kernel, the total run-time of our method is Õ(d/(λɛ)), where d is a bound on the number of non-zero features in each example. Since the run-time does not depend directly on the size of the training set, the resulting algorithm is especially suited for learning from large datasets. Our approach also extends to non-linear kernels while working solely on the primal objective function, though in this case the runtime does depend linearly on the training set size. Our algorithm is particularly well suited for large text classification problems, where we demonstrate an order-of-magnitude speedup over previous SVM learning methods.
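The update the abstract describes is short enough to sketch directly. Below is an illustrative sketch of the basic Pegasos step under our reading of the abstract: one random example per iteration, step size 1/(λt), a sub-gradient update of the hinge loss, and an optional projection onto the ball of radius 1/√λ. Function and variable names are ours, not the authors'; this is not the reference implementation.

```python
import numpy as np

def pegasos(X, y, lam=0.1, T=1000, seed=0):
    """Sketch of the Pegasos stochastic sub-gradient update.

    X: (m, d) array of examples, y: labels in {-1, +1}.
    lam is the SVM regularization parameter lambda.
    """
    rng = np.random.default_rng(seed)
    m, d = X.shape
    w = np.zeros(d)
    for t in range(1, T + 1):
        i = rng.integers(m)                 # one random training example
        eta = 1.0 / (lam * t)               # step size 1/(lambda * t)
        if y[i] * X[i].dot(w) < 1.0:        # hinge loss is active
            w = (1 - eta * lam) * w + eta * y[i] * X[i]
        else:                               # only the regularizer contributes
            w = (1 - eta * lam) * w
        # optional projection onto the ball of radius 1/sqrt(lambda)
        norm = np.linalg.norm(w)
        radius = 1.0 / np.sqrt(lam)
        if norm > radius:
            w *= radius / norm
    return w
```

Because each iteration touches a single example, the cost per step is O(d) for sparse or dense features alike, which is where the Õ(d/(λɛ)) total run-time quoted in the abstract comes from.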

### Citations

9002 | The Nature of Statistical Learning Theory - Vapnik - 1995
Citation context: ...Support Vector Machines (SVMs) are effective and popular classification learning tools (Vapnik, 1998; Cristianini & Shawe-Taylor, 2000). The task of learning a support vector machine is cast as a constrained quadratic programming problem. However, in its native form, it is in fact an unconstrained e...

3927 | Pattern classification and scene analysis - Duda, Hart - 1973
Citation context: ...approaches to learn the bias term and underscore the advantages and disadvantages of each approach. The first approach is rather well known and its roots go back to early work on pattern recognition (Duda & Hart, 1973). This approach simply amounts to adding one more feature to each instance x, thus increasing the dimension to n + 1. The artificially added feature always takes the same value. We assume w.l.o.g. that ...
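The bias trick described in this snippet, appending a constant feature so the bias is learned as an ordinary weight, can be sketched in a couple of lines. The helper name `add_bias_feature` is ours for illustration, not from the paper.

```python
import numpy as np

def add_bias_feature(X, value=1.0):
    """Append a constant feature to every instance: each x in R^n
    becomes (x, value) in R^(n+1), so the bias b is absorbed into w."""
    m = X.shape[0]
    return np.hstack([X, np.full((m, 1), value)])
```

A classifier trained on the augmented data then predicts with ⟨w, (x, value)⟩ = ⟨w₁..ₙ, x⟩ + wₙ₊₁·value, i.e. the last weight plays the role of the bias term.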

3701 | Convex Optimization - Boyd, Vandenberghe - 2004
Citation context: ...quite a few methods were devised and analyzed. The different approaches can be roughly divided into the following categories. Interior Point (IP) methods: IP methods (see for instance (Boyd & Vandenberghe, 2004) and the references therein) cast SVM learning as a quadratic optimization problem subject to linear constraints. The constraints are replaced with a barrier function. The result is a sequence of unc...

3274 | Convex Analysis - Rockafellar - 1998

2036 | Online learning with kernels - Kivinen, Smola, et al. - 2001
Citation context: ...the Pegasos algorithm is an improved stochastic sub-gradient method. Two concrete algorithms closely related to Pegasos that are based on gradient methods are the NORMA algorithm (Kivinen et al., 2002) and a stochastic gradient algorithm by Zhang (2004). The Pegasos algorithm uses a sub-sample of k training examples to compute an approximate sub-gradient. When k = 1 the Pegasos algorithm becomes v...

1007 | Fast training of support vector machines using sequential minimal optimization - Platt - 1999

942 | An Introduction to Support Vector Machines - Cristianini, Shawe-Taylor
Citation context: ...Support Vector Machines (SVMs) are effective and popular classification learning tools (Vapnik, 1998; Cristianini & Shawe-Taylor, 2000). The task of learning a support vector machine is cast as a constrained quadratic programming problem. However, in its native form, it is in fact an unconstrained empirical loss minimization with a ...

467 | Making large-scale support vector machine learning practical - Joachims - 1999
Citation context: ...rous approximate solutions for multiple choices of λ. Decomposition methods: To overcome the quadratic memory requirement of IP methods, decomposition methods such as SMO (Platt, 1998) and SVM-Light (Joachims, 1998) switch to the dual representation of the SVM optimization problem, and employ an active set of constraints, thus working on a subset of dual variables. In the extreme case, called row-action methods ...

412 | Large margin classification using the perceptron algorithm - Freund, Schapire - 1999
Citation context: ...For instance, the Passive Aggressive algorithm (Crammer et al., 2006) applies the objective function of SVM to each example. Online learning algorithms were also suggested as fast alternatives to SVM (see (Freund & Schapire, 1999)). Such algorithms can be used to obtain a predictor with low generalization error using an online-to-batch conversion scheme (Cesa-Bianchi et al., 2004). However, the conversion schemes do not necess...

375 | Stochastic Approximation Algorithms and Applications - Yin - 1997

322 | Training Linear SVMs in Linear Time - Joachims - 2006
Citation context: ...We start by showing that Pegasos is indeed a practical tool for solving large-scale problems. In particular, we compare its runtime to a new state-of-the-art solver (Joachims, 2006) on three large datasets. Next, we compare Pegasos to two previously proposed methods that are based on stochastic gradient descent, namely to Norma (Kivinen et al., 2002) and to the method given in ...

290 | Natural gradient works efficiently in learning - Amari - 1998

280 | Some results on Tchebycheffian spline functions - Kimeldorf, Wahba - 1971
Citation context: ...port vector machines is their ability to incorporate and construct non-linear predictors using kernels which satisfy Mercer's conditions. The crux of this property stems from the representer theorem (Kimeldorf & Wahba, 1971), which implies that the optimal solution of SVM can be expressed as a linear combination of its constraints. In the classification problem, the representer theorem implies that w is a linear combina...
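The representer-theorem statement referenced in this context is commonly written as follows (our notation): the SVM solution lies in the span of the training examples,

```latex
w \;=\; \sum_{i=1}^{m} \alpha_i \, y_i \, x_i ,
\qquad\text{so}\qquad
\langle w, x \rangle \;=\; \sum_{i=1}^{m} \alpha_i \, y_i \, K(x_i, x) ,
```

which is what allows a primal method such as Pegasos to handle non-linear kernels: the predictor is maintained through the coefficients α rather than an explicit w.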

255 | Parallel optimization: theory, algorithms, and applications - Censor, Zenios - 1997
Citation context: ...switch to the dual representation of the SVM optimization problem, and employ an active set of constraints, thus working on a subset of dual variables. In the extreme case, called row-action methods (Censor & Zenios, 1997), the active set consists of a single constraint. While algorithms in this family are fairly simple to implement and entertain general asymptotic...

193 | Efficient SVM training using low-rank kernel representations - Fine, Scheinberg
Citation context: ...use of IP methods very difficult when the training set has many examples. It should be noted that there have been several attempts to reduce the complexity based on additional assumptions (see e.g. (Fine & Scheinberg, 2001)). However, the dependence on m remains super-linear. In addition, while the focus of the paper is the optimization problem cast by SVM, one needs to bear in mind that the optimization problem is a p...

182 | Introduction to Stochastic Search and Optimization - Spall - 2003

135 | On the generalization ability of on-line learning algorithms - Cesa-Bianchi, Conconi, et al. - 2004
Citation context: ...from Thm. 3 we obtain that to achieve accuracy ɛ with confidence 1 − δ we need Õ(1/(λδɛ)) iterations. In contrast, by applying previously studied conversions of online algorithms in the PAC setting (e.g. (Cesa-Bianchi et al., 2004; Cesa-Bianchi & Gentile, 2006)) one can obtain accuracy ɛ with confidence 1 − δ using Õ(ln(1/δ)/(λɛ²)) iterations. Thus, as long as the desired confidence is not too high, our convergence rate is ...

133 | The tradeoffs of large scale learning - Bottou, Bousquet - 2008

122 | Logarithmic regret algorithms for online convex optimization - Hazan, Agarwal, et al. - 2007
Citation context: ...ding the average instantaneous objective of the algorithm relative to the average instantaneous objective of the optimal solution. We first need the following lemma, which generalizes a result from (Hazan et al., 2006). The lemma relies on the notion of strongly convex functions. A detailed proof and further explanations can be found in (Shalev-Shwartz & Singer, 2007). Lemma 1. Let f1,...,fT be a sequence of λ-str...

102 | Fast kernel classifiers with online and active learning - Bordes, Ertekin, et al.

99 | A dual coordinate descent method for large-scale linear SVM - Hsieh, Chang, et al. - 2008

91 | Training a support vector machine in the primal - Chapelle

73 | Primal-Dual Subgradient Methods for Convex Problems - Nesterov

62 | Online passive aggressive algorithms - Crammer, Dekel, et al. - 2003
Citation context: ...objective function (see also the discussion in (Hush et al., 2006)). Some of the decomposition methods do, though, yield a regret bound in the online learning setting. For instance, the Passive Aggressive algorithm (Crammer et al., 2006) applies the objective function of SVM to each example. Online learning algorithms were also suggested as fast alternatives to SVM (see (Freund & Schapire, 1999)). Such algorithms can be used to obta...

57 | Solving large scale linear prediction problems using stochastic gradient descent algorithms - Zhang - 2004

53 | SVM optimization: inverse dependence on training set size - Shalev-Shwartz, Srebro - 2008

45 | Large scale online learning - Bottou, LeCun - 2003

38 | Bundle methods for machine learning - Smola, Vishwanathan, et al. - 2007

36 | Online algorithms and stochastic approximations - Bottou - 1998

22 | On the generalization ability of online strongly convex programming - Kakade, Tewari - 2008

20 | Improved risk tail bounds for on-line algorithms - Cesa-Bianchi, Gentile - 2005
Citation context: ...to achieve accuracy ɛ with confidence 1 − δ we need Õ(1/(λδɛ)) iterations. In contrast, by applying previously studied conversions of online algorithms in the PAC setting (e.g. (Cesa-Bianchi et al., 2004; Cesa-Bianchi & Gentile, 2006)) one can obtain accuracy ɛ with confidence 1 − δ using Õ(ln(1/δ)/(λɛ²)) iterations. Thus, as long as the desired confidence is not too high, our convergence rate is significantly better. If we wo...

19 | Fast rates for regularized objectives - Shamir, Sridharan, et al. - 2008

16 | Logarithmic regret algorithms for strongly convex repeated games - Shalev-Shwartz, Singer - 2007
Citation context: ...need the following lemma, which generalizes a result from (Hazan et al., 2006). The lemma relies on the notion of strongly convex functions. A detailed proof and further explanations can be found in (Shalev-Shwartz & Singer, 2007). Lemma 1. Let f1,...,fT be a sequence of λ-strongly convex functions w.r.t. the function ½‖·‖². Let B be a closed convex set and define ΠB(w) = argmin_{w′ ∈ B} ‖w − w′‖. Let w1,...,wT+1 be a s...
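The λ-strong convexity that the lemma relies on is the standard notion; in our phrasing, f is λ-strongly convex w.r.t. ½‖·‖² when

```latex
f(u) \;\ge\; f(v) + \langle \nabla f(v),\, u - v \rangle + \frac{\lambda}{2}\,\|u - v\|^{2}
\quad \text{for all } u, v
```

(with ∇f(v) replaced by any sub-gradient where f is non-differentiable). The SVM objective satisfies this because of its (λ/2)‖w‖² regularization term, which is what drives the Õ(1/ɛ) iteration bound quoted in the abstract.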

16 | QP algorithms with guaranteed accuracy and run time for support vector machines - Hush, Kelly, et al.

14 | A statistical study of on-line learning - Murata - 1998

14 | Statistical analysis of learning dynamics - Murata, Amari - 1999

11 | Proximal regularization for online and batch learning - Do, Foo - 2009

5 | Stochastic Approximations and Efficient Learning - Bottou, Murata - 2002

3 | QP algorithms with guaranteed accuracy and run time for support vector machines - Hush, Kelly, et al. - 2006
Citation context: ...al solution and their goal is to maximize the dual objective function, they often result in a rather slow convergence rate to the optimum of the primal objective function (see also the discussion in (Hush et al., 2006)). Some of the decomposition methods do, though, yield a regret bound in the online learning setting. For instance, the Passive Aggressive algorithm (Crammer et al., 2006) applies the objective function of SVM t...