## Sequential Minimal Optimization: A Fast Algorithm for Training Support Vector Machines (1998)

Venue: Advances in Kernel Methods - Support Vector Learning

Citations: 331 (3 self)

### BibTeX

    @TECHREPORT{Platt98sequentialminimal,
      author      = {John C. Platt},
      title       = {Sequential Minimal Optimization: A Fast Algorithm for Training Support Vector Machines},
      institution = {Advances in Kernel Methods - Support Vector Learning},
      year        = {1998}
    }

### Abstract

This paper proposes a new algorithm for training support vector machines: Sequential Minimal Optimization, or SMO. Training a support vector machine requires the solution of a very large quadratic programming (QP) optimization problem. SMO breaks this large QP problem into a series of smallest possible QP problems. These small QP problems are solved analytically, which avoids using a time-consuming numerical QP optimization as an inner loop. The amount of memory required for SMO is linear in the training set size, which allows SMO to handle very large training sets. Because matrix computation is avoided, SMO scales somewhere between linear and quadratic in the training set size for various test problems, while the standard chunking SVM algorithm scales somewhere between linear and cubic in the training set size. SMO's computation time is dominated by SVM evaluation, hence SMO is fastest for linear SVMs and sparse data sets. On real-world sparse data sets, SMO can be more than 1000 times...
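For intuition, the abstract's central claim — that each smallest-possible QP subproblem (a pair of Lagrange multipliers) has a closed-form solution — can be sketched as below. This is an illustrative fragment, not Platt's full algorithm: the function name, argument layout, and precomputed error cache `E` are hypothetical, and the pair-selection heuristics are omitted entirely.

```python
def smo_step(alpha, i, j, y, K, E, C):
    """One analytic SMO update on multipliers i and j (illustrative only).
    alpha: current multipliers; y: labels in {-1,+1}; K: kernel matrix;
    E: cached errors E[k] = u_k - y_k; C: box-constraint bound."""
    # Second derivative of the objective along the constraint line.
    eta = K[i][i] + K[j][j] - 2.0 * K[i][j]
    if eta <= 0:
        return alpha  # skip degenerate directions in this sketch
    # Unconstrained optimum for alpha_j along the constraint line.
    aj_new = alpha[j] + y[j] * (E[i] - E[j]) / eta
    # Clip to the segment [L, H] implied by 0 <= alpha <= C and the
    # equality constraint sum_k alpha_k * y_k = 0.
    if y[i] == y[j]:
        L, H = max(0.0, alpha[i] + alpha[j] - C), min(C, alpha[i] + alpha[j])
    else:
        L, H = max(0.0, alpha[j] - alpha[i]), min(C, C + alpha[j] - alpha[i])
    aj_new = min(max(aj_new, L), H)
    # alpha_i moves in the opposite direction to preserve the equality constraint.
    ai_new = alpha[i] + y[i] * y[j] * (alpha[j] - aj_new)
    alpha = list(alpha)
    alpha[i], alpha[j] = ai_new, aj_new
    return alpha
```

Because both endpoints of the update are computed in closed form, no numerical QP solver is needed in the inner loop — the source of the speedups the abstract reports.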

### Citations

10096 | Statistical Learning Theory
- Vapnik
- 1998
Citation context: ...sparse data sets, SMO can be more than 1000 times faster than the chunking algorithm. 1. INTRODUCTION In the last few years, there has been a surge of interest in Support Vector Machines (SVMs) [19] **[20]** [4]. SVMs have empirically been shown to give good generalization performance on a wide variety of problems such as handwritten character recognition [12], face detection [15], pedestrian detection [...

4204 | Pattern Classification and Scene Analysis
- Duda, Hart
- 1973
Citation context: ...which Lagrange multiplier to optimize is the same as the first choice heuristic described in section 2.2. Fixed-threshold SMO for a linear SVM is similar in concept to the perceptron relaxation rule **[8]**, where the output of a perceptron is adjusted whenever there is an error, so that the output exactly lies on the margin. However, the fixed-threshold SMO algorithm will sometimes reduce the proportio...

2530 | A tutorial on support vector machines for pattern recognition
- Burges
- 1998
Citation context: ...se data sets, SMO can be more than 1000 times faster than the chunking algorithm. 1. INTRODUCTION In the last few years, there has been a surge of interest in Support Vector Machines (SVMs) [19] [20] **[4]**. SVMs have empirically been shown to give good generalization performance on a wide variety of problems such as handwritten character recognition [12], face detection [15], pedestrian detection [14],...

2424 | Support-vector networks
- Cortes, Vapnik
- 1995
Citation context: ...o hyperplane that splits the positive examples from the negative examples. In the formulation above, the non-separable case would correspond to an infinite solution. However, in 1995, Cortes & Vapnik **[7]** suggested a modification to the original optimization statement (3) which allows, but penalizes, the failure of an example to reach the correct margin. That modification is: $\min \tfrac{1}{2}\|\vec{w}\|^2 + \dots$
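The modification truncated at the end of this snippet is the standard soft-margin objective from Cortes & Vapnik (1995): slack variables $\xi_i$ permit margin violations at a per-unit cost $C$. Written out in full (notation assumed to match the paper's):

```latex
\min_{\vec{w},\,b,\,\vec{\xi}} \;\; \frac{1}{2}\|\vec{w}\|^{2} \;+\; C\sum_{i=1}^{N}\xi_{i}
\qquad \text{subject to} \quad
y_{i}\,(\vec{w}\cdot\vec{x}_{i} - b) \;\ge\; 1 - \xi_{i}, \qquad \xi_{i} \ge 0 .
```

Setting $C \to \infty$ recovers the hard-margin formulation, where a non-separable training set has no feasible solution.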

2393 | A Wavelet Tour of Signal Processing
- Mallat
- 1999
Citation context: ...ny quantized or fuzzy-membership-encoded problems will be sparse. Also, optical character recognition [12], handwritten character recognition [1], and wavelet transform coefficients of natural images **[13]** [14] tend to be naturally expressed as sparse data. The second artificial data set stands in stark contrast to the first easy data set. The second set is generated with random 300-dimensional bina...

1897 | Text categorization with support vector machines: Learning with many relevant features
- Joachims
- 1998
Citation context: ...een shown to give good generalization performance on a wide variety of problems such as handwritten character recognition [12], face detection [15], pedestrian detection [14], and text categorization **[9]**. However, the use of SVMs is still limited to a small group of researchers. One possible reason is that training algorithms for SVMs are slow, especially for large problems. Another explanation is th...

1569 | Practical Optimization
- Gill, Murray, et al.
Citation context: ...1"s in the input can be stored, and the dot product will sum the weights corresponding to the position of the "1"s in the input. The chunking algorithm uses the projected conjugate gradient algorithm **[11]** as its QP solver, as suggested by Burges [4]. In order to ensure that the chunking algorithm is a fair benchmark, Burges compared the speed of his chunking code on a 200 MHz Pentium II running Solari...
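The point about sparse binary inputs above — store only the indices of the "1"s, so a dot product reduces to summing the corresponding weights — is easy to sketch (function and parameter names are hypothetical):

```python
def sparse_dot(weights, ones_idx):
    """Dot product of a dense weight vector with a binary input vector,
    where the input is stored only as the list of indices holding a 1."""
    return sum(weights[i] for i in ones_idx)
```

For a linear SVM this makes each evaluation cost proportional to the number of nonzeros in the input rather than the full input dimension, which is why SMO benefits most on sparse data.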

1442 | A training algorithm for optimal margin classifiers
- Boser
- 1992
Citation context: ...simply changes the constraint (5) into a box constraint: $0 \le \alpha_i \le C,\ \forall i.$ (9) The variables $\xi_i$ do not appear in the dual formulation at all. SVMs can be even further generalized to non-linear classifiers **[2]**. The output of a non-linear SVM is explicitly computed from the Lagrange multipliers: $u = \sum_{j=1}^{N} y_j \alpha_j K(\vec{x}_j, \vec{x}) - b,$ (10) where $K$ is a kernel function that measures the similarity or...

841 | Estimation of Dependences Based on Empirical Data
- Vapnik
- 1982
Citation context: ...world sparse data sets, SMO can be more than 1000 times faster than the chunking algorithm. 1. INTRODUCTION In the last few years, there has been a surge of interest in Support Vector Machines (SVMs) **[19]** [20] [4]. SVMs have empirically been shown to give good generalization performance on a wide variety of problems such as handwritten character recognition [12], face detection [15], pedestrian detect...

615 | Training Support Vector Machines: an Application to Face Detection
- Osuna, Girosi
- 1997
Citation context: ...or Machines (SVMs) [19] [20] [4]. SVMs have empirically been shown to give good generalization performance on a wide variety of problems such as handwritten character recognition [12], face detection **[15]**, pedestrian detection [14], and text categorization [9]. However, the use of SVMs is still limited to a small group of researchers. One possible reason is that training algorithms for SVMs are slow, ...

305 | The relaxation method for finding the common point of convex sets and its application to the solution of problems in convex programming
- Bregman
- 1967
Citation context: ...nge multipliers are replaced at every step with new multipliers that are chosen via good heuristics. The SMO algorithm is closely related to a family of optimization algorithms called Bregman methods **[3]** or row-action methods [5]. These methods solve convex programming problems with linear constraints. They are iterative methods where each step projects the current primal point onto each constraint. ...

287 | An improved training algorithm for support vector machines
- Osuna, Freund, et al.
- 1997

238 | Pedestrian detection using wavelet templates
- Oren, Papageorgiou, et al.
- 1997
Citation context: ...] [4]. SVMs have empirically been shown to give good generalization performance on a wide variety of problems such as handwritten character recognition [12], face detection [15], pedestrian detection **[14]**, and text categorization [9]. However, the use of SVMs is still limited to a small group of researchers. One possible reason is that training algorithms for SVMs are slow, especially for large proble...

181 | A resource-allocating network for function interpolation
- Platt
- 1991
Citation context: ...ases the amount of a training input in the weight vector and, hence, is not maximum margin. Fixed-threshold SMO for Gaussian kernels is also related to the resource allocating network (RAN) algorithm **[18]**. When RAN detects certain kinds of errors, it will allocate a kernel to exactly fix the error. SMO will perform similarly. However, SMO/SVM will adjust the heights of the kernels to maximize the margi...

56 | An iterative row-action method for interval convex programming
- Censor, Lent
- 1981
Citation context: ...lem of minimizing the norm of the weight vector $\vec{w}$ over the combined space of all possible weight vectors $\vec{w}$ with thresholds $b$ produces a Bregman D-projection that does not have a unique minimum [3] **[6]**. It is interesting to consider an SVM where the threshold $b$ is held fixed at zero, rather than being solved for. A fixed-threshold SVM would not have a linear equality constraint (6). Therefore, only...

45 | Row-action methods for huge and sparse systems and their applications
- Censor
- 1981
Citation context: ...ed at every step with new multipliers that are chosen via good heuristics. The SMO algorithm is closely related to a family of optimization algorithms called Bregman methods [3] or row-action methods **[5]**. These methods solve convex programming problems with linear constraints. They are iterative methods where each step projects the current primal point onto each constraint. An unmodified Bregman meth...

38 | Learning algorithms for classification: A comparison on handwritten digit recognition
- LeCun, Bottou, et al.
- 1995
Citation context: ...erest in Support Vector Machines (SVMs) [19] [20] [4]. SVMs have empirically been shown to give good generalization performance on a wide variety of problems such as handwritten character recognition **[12]**, face detection [15], pedestrian detection [14], and text categorization [9]. However, the use of SVMs is still limited to a small group of researchers. One possible reason is that training algorithm...

34 | Globally trained handwritten word recognizer using spatial representation, convolutional neural networks, and hidden Markov models
- Bengio, Cun, et al.
- 1993
Citation context: ...word data sets described in section 3.1 and section 3.2, any quantized or fuzzy-membership-encoded problems will be sparse. Also, optical character recognition [12], handwritten character recognition **[1]**, and wavelet transform coefficients of natural images [13] [14] tend to be naturally expressed as sparse data. The second artificial data set stands in stark contrast to the first easy data set. T...

25 | A quadratic programming procedure
- Hildreth
- 1957
Citation context: ...the corresponding dimension. The update rule is $\alpha_1^{\mathrm{new}} = \alpha_1 + \dfrac{y_1 E_1}{K(\vec{x}_1, \vec{x}_1)}.$ (23) This update equation forces the output of the SVM to be $y_1$ (similar to Bregman methods or Hildreth's QP method **[10]**). After the new $\alpha$ is computed, it is clipped to the $[0, C]$ interval (unlike previous methods). The choice of which Lagrange multiplier to optimize is the same as the first choice heuristic described i...
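Taking $E_1 = y_1 - u_1$ as the error on example 1 (a sign convention assumed here; the snippet does not define it), update (23) followed by the clipping step can be sketched as below. The function and parameter names are hypothetical:

```python
def fixed_threshold_update(alpha1, y1, u1, K11, C):
    """Sketch of the fixed-threshold update: adjust alpha_1 so the SVM
    output on example 1 becomes exactly y1, then clip to [0, C].
    E1 = y1 - u1 is the error under the sign convention assumed here."""
    E1 = y1 - u1
    # Since y1 in {-1, +1}, 1/y1 == y1, giving the y1 * E1 form.
    alpha1_new = alpha1 + y1 * E1 / K11
    return min(max(alpha1_new, 0.0), C)
```

Without the clip, the update drives the output exactly onto the target, mirroring the perceptron relaxation rule mentioned earlier; the clip to $[0, C]$ is what distinguishes the SVM variant.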