## A Dual Coordinate Descent Method for Large-scale Linear SVM

Citations: 109 (11 self)

### BibTeX

@MISC{Hsieh_adual,

author = {Cho-Jui Hsieh and Kai-Wei Chang and Chih-Jen Lin and S. Sathiya Keerthi},

title = {A Dual Coordinate Descent Method for Large-scale Linear SVM},

year = {}

}

### Abstract

In many applications, data appear with a huge number of instances as well as features. Linear support vector machines (SVM) are among the most popular tools for dealing with such large-scale sparse data. This paper presents a novel dual coordinate descent method for linear SVM with L1- and L2-loss functions. The proposed method is simple and reaches an ɛ-accurate solution in O(log(1/ɛ)) iterations. Experiments indicate that our method is much faster than state-of-the-art solvers such as Pegasos, TRON, SVM^perf, and a recent primal coordinate descent implementation.
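The abstract does not spell out the update rule, so the following is only a minimal sketch of what a dual coordinate descent solver for linear SVM can look like. It assumes the standard bound-constrained dual without a bias term (0 ≤ αi ≤ U, with U = C for L1-loss and U = ∞ plus a diagonal term 1/(2C) for L2-loss), a random permutation of the sub-problems in each outer pass, and a projected-gradient stopping test. The function name `dual_cd_linear_svm` and the parameters `max_outer`, `tol`, and `seed` are hypothetical, and dense NumPy arrays are used for brevity even though the paper targets sparse data.

```python
import numpy as np

def dual_cd_linear_svm(X, y, C=1.0, loss="l1", max_outer=1000, tol=1e-6, seed=0):
    """Sketch of dual coordinate descent for a linear SVM without bias term."""
    l, n = X.shape
    # Assumed setup: L1-loss uses the box [0, C]; L2-loss uses [0, inf)
    # together with a diagonal shift D = 1/(2C).
    if loss == "l1":
        U, D = C, 0.0
    else:
        U, D = np.inf, 1.0 / (2.0 * C)
    alpha = np.zeros(l)
    w = np.zeros(n)                           # maintained as sum_i alpha_i y_i x_i
    Qii = np.einsum("ij,ij->i", X, X) + D     # diagonal of the dual Hessian
    rng = np.random.default_rng(seed)
    for _ in range(max_outer):
        max_pg = 0.0
        for i in rng.permutation(l):          # random order in each outer pass
            if Qii[i] == 0.0:
                continue                      # zero instance: never changes w
            G = y[i] * (w @ X[i]) - 1.0 + D * alpha[i]
            # Projected gradient with respect to the box constraint.
            if alpha[i] <= 0.0:
                PG = min(G, 0.0)
            elif alpha[i] >= U:
                PG = max(G, 0.0)
            else:
                PG = G
            max_pg = max(max_pg, abs(PG))
            if PG != 0.0:
                new_alpha = min(max(alpha[i] - G / Qii[i], 0.0), U)
                w += (new_alpha - alpha[i]) * y[i] * X[i]
                alpha[i] = new_alpha
        if max_pg < tol:                      # all projected gradients near zero
            break
    return w, alpha

# Tiny usage example on four separable points.
X_demo = np.array([[2.0, 0.0], [1.0, 1.0], [-1.0, -1.0], [-2.0, 0.0]])
y_demo = np.array([1.0, 1.0, -1.0, -1.0])
w_demo, alpha_demo = dual_cd_linear_svm(X_demo, y_demo, C=1.0)
```

Keeping `w` up to date after every coordinate step is the design choice that makes each update cost only one pass over the nonzeros of a single instance, rather than requiring the full gradient to be maintained.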

### Citations

3907 | LIBSVM: A library for support vector machines
- Chang, Lin
- 2001
Citation Context ...pdated. Therefore, using (12), ∇f(α) is easily available. Below we demonstrate a shrinking implementation so that reconstructing the whole ∇f(α) is never needed. Our method is related to what LIBSVM (Chang & Lin, 2001) uses. From the optimality condition of bound-constrained problems, α is optimal for (4) if and only if ∇^P f(α) = 0, where ∇^P f(α) is the projected gradient defined in (8). We then prove the followi... |

1579 | Making Large-Scale SVM Learning Practical
- Joachims
- 1998
Citation Context ...ntains constraints 0 ≤ αi ≤ U. If an αi is 0 or U for many iterations, it may remain the same. To speed up decomposition methods for nonlinear SVM (discussed in Section 4.1), the shrinking technique (Joachims, 1998) reduces the size of the optimization problem without considering some bounded variables. Below we show it is much easier to apply this technique to linear SVM than the nonlinear case. If A is the su... |

1442 | A training algorithm for optimal margin classifiers
- Boser
- 1992
Citation Context ...te that our method is much faster than state of the art solvers such as Pegasos, TRON, SVM^perf, and a recent primal coordinate descent implementation. 1. Introduction Support vector machines (SVM) (Boser et al., 1992) are useful for data classification. Given a set of instance-label pairs (xi, yi), i = 1, ..., l, xi ∈ R^n, yi ∈ {−1, +1}, SVM requires the solution of the following unconstrained optimization pr... |

1126 | Fast training of support vector machines using sequential minimal optimization
- Platt
- 1999
Citation Context ...cient for linear SVM. Therefore, it is important to discuss the relationship between decomposition methods and our method. In early decomposition methods that were first proposed (Osuna et al., 1997; Platt, 1998), variables minimized at an iteration are selected by certain heuristics. However, subsequent developments (Joachims, 1998; Chang & Lin, 2001; Keerthi et al., 2001) all use gradient information to co... |

615 | Training Support Vector Machines: an Application to Face Detection
- Osuna, Freund, Girosi
- 1997
Citation Context ...nate descent is efficient for linear SVM. Therefore, it is important to discuss the relationship between decomposition methods and our method. In early decomposition methods that were first proposed (Osuna et al., 1997; Platt, 1998), variables minimized at an iteration are selected by certain heuristics. However, subsequent developments (Joachims, 1998; Chang & Lin, 2001; Keerthi et al., 2001) all use gradient info... |

356 | Training linear SVMs in linear time
- Joachims
- 2006
Citation Context ...ios. For L1-SVM, Zhang (2004), Shalev-Shwartz et al. (2007), Bottou (2007) propose various stochastic gradient descent methods. Collins et al. (2008) apply an exponentiated gradient method. SVM^perf (Joachims, 2006) uses a cutting plane technique. Smola et al. (2008) apply bundle methods, and view SVM^perf as a special case. For L2-SVM, Keerthi and DeCoste (2005) propose modified Newton methods. A trust region ... |

313 | Pegasos: Primal Estimated sub-GrAdient SOlver for SVM
- Shalev-Shwartz, Singer, et al.
- 2007
Citation Context ...olves (1) with the loss function max(−yi w^T xi, 0), which is different from (2). They do not study data with a large number of features. Next, we discuss the connection to stochastic gradient descent (Shalev-Shwartz et al., 2007; Bottou, 2007). The most important step of this method is the following update of w: w ← w − η∇w(yi, xi), (19) where ∇w(yi, xi) is the sub-gradient of the approximate objective function: w^T w/2 + C ... |

268 | Ultraconservative Online Algorithms for Multiclass Problems
- Crammer, Singer
- 2003
Citation Context ... earlier studies on decomposition methods failed to modify their algorithms in an efficient way like ours for large-scale linear SVM. We also discuss the connection to other linear SVM works such as (Crammer & Singer, 2003; Collins et al., 2008; Shalev-Shwartz et al., 2007). This paper is organized as follows. In Section 2, we describe our proposed algorithm. Implementation issues are investigated in Section 3. Section ... |

207 | Improvements to Platt’s SMO algorithm for SVM classifier design
- Keerthi
- 2001
Citation Context ...hat were first proposed (Osuna et al., 1997; Platt, 1998), variables minimized at an iteration are selected by certain heuristics. However, subsequent developments (Joachims, 1998; Chang & Lin, 2001; Keerthi et al., 2001) all use gradient information to conduct the selection. The main reason is that maintaining the whole gradient does not introduce extra cost. Here we explain the detail by assuming that one variable ... |

100 | The kernel adatron algorithm: a fast and simple learning procedure for support vector machines
- Friess, Cristianini, et al.
- 1998
Citation Context ...ne often has an easier access of values per instance. Solving the dual takes this advantage, so our implementation is simpler than Chang et al. (2008). Early SVM papers (Mangasarian & Musicant, 1999; Friess et al., 1998) have discussed coordinate descent methods for the SVM dual form. However, they do not focus on large data using the linear kernel. Crammer and Singer (2003) proposed an online setting for multiclass... |

89 | A Modified Finite Newton Method for Fast Solution of Large Scale Linear SVMs
- Keerthi, DeCoste
- 2005
Citation Context ... number of variables is restricted to one, a decomposition method is like the online coordinate descent in Section 3.3, but it differs in the way it selects variables for updating. It has been shown (Keerthi & DeCoste, 2005) that, for linear SVM, decomposition methods are inefficient. On the other hand, here we are pointing out that dual coordinate descent is efficient for linear SVM. Therefore, it is important to discus... |

73 | On the convergence of the coordinate descent method for convex differentiable minimization
- Luo, Tseng
- 1992
Citation Context ...ithm 1. The cost per iteration (i.e., from α^k to α^{k+1}) is O(l n̄). The main memory requirement is on storing x1, ..., xl. For the convergence, we prove the following theorem using techniques in (Luo & Tseng, 1992): Theorem 1 For L1-SVM and L2-SVM, {α^{k,i}} generated by Algorithm 1 globally converges to an optimal solution α^∗. The convergence rate is at least linear: there are 0 < µ < 1 and an iteration k0 s... |

71 | Successive overrelaxation for support vector machines
- Mangasarian, Musicant
- 1999
Citation Context ...values. However, in practice one often has an easier access of values per instance. Solving the dual takes this advantage, so our implementation is simpler than Chang et al. (2008). Early SVM papers (Mangasarian & Musicant, 1999; Friess et al., 1998) have discussed coordinate descent methods for the SVM dual form. However, they do not focus on large data using the linear kernel. Crammer and Singer (2003) proposed an online s... |

70 | Trust region Newton method for large-scale logistic regression
- Lin, Weng, et al.
- 2007
Citation Context ... technique. Smola et al. (2008) apply bundle methods, and view SVM^perf as a special case. For L2-SVM, Keerthi and DeCoste (2005) propose modified Newton methods. A trust region Newton method (TRON) (Lin et al., 2008) is proposed for logistic regression and L2-SVM. These algorithms focus on different aspects of the training speed. Some aim at quickly ob... |

66 | Solving large scale linear prediction problems using stochastic gradient descent algorithms - Zhang |

65 | Exponentiated gradient algorithms for conditional random fields and max-margin markov networks
- Collins, Globerson, et al.
- 2008
Citation Context ...mposition methods failed to modify their algorithms in an efficient way like ours for large-scale linear SVM. We also discuss the connection to other linear SVM works such as (Crammer & Singer, 2003; Collins et al., 2008; Shalev-Shwartz et al., 2007). This paper is organized as follows. In Section 2, we describe our proposed algorithm. Implementation issues are investigated in Section 3. Section 4 discusses the connec... |

41 | Bundle methods for machine learning
- Smola, Vishwanathan, et al.
- 2007
Citation Context ...t present its results. We do not compare with another online method Vowpal Wabbit (Langford et al., 2007) either as it has been made available only very recently. Though a code for the bundle method (Smola et al., 2008) is available, we do not include it for comparison due to its closeness to SVM^perf. All sources used for our comparisons are available at http://csie.ntu.edu.tw/~cjlin/liblinear/exp.html. We set t... |

25 | A quadratic programming procedure - Hildreth - 1957 |

20 | Coordinate descent method for large-scale l2-loss linear support vector machines
- Chang, Hsieh, et al.
Citation Context ...plementation Issues 3.1. Random Permutation of Sub-problems In Algorithm 1, the coordinate descent algorithm solves the one-variable sub-problems in the order of α1, ..., αl. Past results such as (Chang et al., 2008) show that solving sub-problems in an arbitrary order may give faster convergence. This inspires us to randomly permute the sub-problems at each outer iteration. Formally, at the kth outer iteration,... |

9 | Decomposition methods for linear support vector machines
- Kao, Chung, et al.
- 2004
Citation Context ...han that with. Hence, the coordinate descent method can be faster than the decomposition method by using many cheap iterations. An earlier attempt to speed up decomposition methods for linear SVM is (Kao et al., 2004). However, it failed to derive our method here because it does not give up maintaining gradients. 4.2. Existing Linear SVM Methods We discussed in Section 1 and other places the difference between ou... |

3 | Stochastic gradient descent examples. http://leon.bottou.org/projects/sgd
- Bottou
Citation Context ...on max(−yi w^T xi, 0), which is different from (2). They do not study data with a large number of features. Next, we discuss the connection to stochastic gradient descent (Shalev-Shwartz et al., 2007; Bottou, 2007). Table 2. On the right training time for a solver to reduce the primal objective value to within 1% of the optimal value; see (20). Time i... |

1 | Vowpal Wabbit. http://hunch.net/~vw
- Langford, Li, et al.
- 2007
Citation Context ...rimal coordinate descent method for L2-SVM (Chang et al., 2008). Since (Bottou, 2007) is related to Pegasos, we do not present its results. We do not compare with another online method Vowpal Wabbit (Langford et al., 2007) either as it has been made available only very recently. Though a code for the bundle method (Smola et al., 2008) is available, we do not include it for comparison due to its closeness to SVM^perf.... |

1 | we check if E has zero column. This situation happens only if xi = 0 and Dii = 0, the case of L1-SVM without a bias term. Let A = {i | xi ≠ 0}. We explain that elements not in this set can be eliminated for consideration, so we still have a matrix E witho - Next - 1992 |

1 | Bundle methods for machine learning. NIPS. - Zhang - 2004 |

1 | Neural Networks, 10, 1032–1037. - Trans - 2010 |