## Core vector machines: Fast SVM training on very large data sets (2005)

### Download Links

- [www.cs.ust.hk]
- [c2inet.sce.ntu.edu.sg]
- [www.cse.ust.hk]
- [jmlr.csail.mit.edu]
- [jmlr.org]
- DBLP

### Other Repositories/Bibliography

Venue: Journal of Machine Learning Research

Citations: 83 (13 self)

### BibTeX

@ARTICLE{Tsang05corevector,
  author  = {Ivor W. Tsang and James T. Kwok and Pak-Ming Cheung},
  title   = {Core vector machines: Fast {SVM} training on very large data sets},
  journal = {Journal of Machine Learning Research},
  year    = {2005},
  volume  = {6},
  pages   = {363--392}
}

(Nello Cristianini was the action editor for this paper, not a co-author, and so is omitted from the author field.)

### Abstract

Standard SVM training has O(m^3) time and O(m^2) space complexities, where m is the training set size. It is thus computationally infeasible on very large data sets. By observing that practical SVM implementations only approximate the optimal solution by an iterative strategy, we scale up kernel methods by exploiting such "approximateness" in this paper. We first show that many kernel methods can be equivalently formulated as minimum enclosing ball (MEB) problems in computational geometry. Then, by adopting an efficient approximate MEB algorithm, we obtain provably approximately optimal solutions with the idea of core sets. Our proposed Core Vector Machine (CVM) algorithm can be used with nonlinear kernels and has a time complexity that is linear in m and a space complexity that is independent of m. Experiments on large toy and real-world data sets demonstrate that the CVM is as accurate as existing SVM implementations, but is much faster and can handle much larger data sets than existing scale-up methods. For example, CVM with the Gaussian kernel produces superior results on the KDDCUP-99 intrusion detection data, which has about five million training patterns, in only 1.4 seconds on a 3.2GHz Pentium-4 PC.
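The core-set idea the abstract refers to can be sketched concretely. Below is a hypothetical NumPy illustration of the Bădoiu–Clarkson (1 + ε)-approximate MEB iteration on raw input points; the CVM applies the same idea in kernel-induced feature space, and the function name and structure here are our own, not the authors' code.

```python
import numpy as np

def meb_coreset(X, eps=0.1):
    """Sketch of the Badoiu-Clarkson (1 + eps)-approximate minimum
    enclosing ball: repeatedly pull the center toward the furthest
    point; the points touched along the way form the core set."""
    c = X[0].astype(float).copy()      # start from an arbitrary point
    core = {0}
    T = int(np.ceil(1.0 / eps ** 2))   # O(1/eps^2) iterations suffice
    for t in range(1, T + 1):
        dists = np.linalg.norm(X - c, axis=1)
        far = int(np.argmax(dists))    # furthest point = new core vector
        core.add(far)
        c += (X[far] - c) / (t + 1)    # step size shrinks over time
    radius = np.linalg.norm(X - c, axis=1).max()
    return c, radius, sorted(core)

# Four points on the unit circle: the optimal MEB has radius 1.
X = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]])
center, radius, core = meb_coreset(X, eps=0.1)
```

After O(1/ε^2) furthest-point steps the returned radius is within a factor (1 + ε) of the optimum, and the number of core vectors touched is independent of how many points X contains, which is the property the CVM exploits for its m-independent space complexity.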

### Citations

10995 | Computers and Intractability: A Guide to the Theory of NP-Completeness - GAREY, JOHNSON - 1979 |

9071 | Statistical learning theory - Vapnik - 1998 |

Citation Context: ...nel matrix may still be too high to be handled efficiently. ©2005 Ivor W. Tsang, James T. Kwok and Pak-Ming Cheung. Another approach to scale up kernel methods is by chunking (Vapnik, 1998) or more sophisticated decomposition methods (Chang and Lin, 2004; Osuna et al., 1997b; Platt, 1999; Vishwanathan et al., 2003). However, chunking needs to optimize the entire set of non-zero Lagrang... |

3496 | LIBSVM : a library for support vector machines - Chang, Lin - 2011 |

2088 | Rapid Object Detection using a Boosted Cascade of Simple Features - Viola, Jones - 2001 |

Citation Context: ...rt face detection performance. Nevertheless, the ability of CVM in handling very large data sets could make it a better base classifier in powerful face detection systems such as the boosted cascade (Viola and Jones, 2001).

| training set | # faces | # nonfaces | total |
|---|---|---|---|
| original | 2,429 | 4,548 | 6,977 |
| set A | 2,429 | 481,914 | 484,343 |
| set B | 19,432 | 481,914 | 501,346 |
| set C | 408,072 | 481,914 | 889,986 |

Table 2: Number of faces and nonfaces in the ... |

2047 | Learning with kernels - Schölkopf, Smola - 2002 |

Citation Context: ...e-up methods include the Kernel Adatron (Friess et al., 1998) and the SimpleSVM (Vishwanathan et al., 2003). For a more complete survey, interested readers may consult (Tresp, 2001) or Chapter 10 of (Schölkopf and Smola, 2002). In practice, state-of-the-art SVM implementations typically have a training time complexity that scales between O(m) and O(m^2.3) (Platt, 1999). This can be further driven down to O(m) with the use... |

1775 | Computational Geometry: An Introduction - Preparata, Shamos - 1985 |

1019 | Fast training of support vector machines using sequential minimal optimization - Platt - 1999 |

Citation Context: ...Another approach to scale up kernel methods is by chunking (Vapnik, 1998) or more sophisticated decomposition methods (Chang and Lin, 2004; Osuna et al., 1997b; Platt, 1999; Vishwanathan et al., 2003). However, chunking needs to optimize the entire set of non-zero Lagrange multipliers that have been identified, and the resultant kernel matrix may still be too large to f... |

904 | Approximation algorithms - Vazirani - 2001 |

Citation Context: ...ield of theoretical computer science, approximation algorithms with provable performance guarantees have been extensively used in tackling computationally difficult problems (Garey and Johnson, 1979; Vazirani, 2001). Let C be the cost of the solution returned by an approximate algorithm, and C∗ be the cost of the optimal solution. An approximate algorithm has approximation ratio ρ(n) for an input size n if max ... |

568 | Training support vector machines: Applications to face detection - Osuna, Freund, et al. - 1997 |

Citation Context: ...Another approach to scale up kernel methods is by chunking (Vapnik, 1998) or more sophisticated decomposition methods (Chang and Lin, 2004; Osuna et al., 1997b; Platt, 1999; Vishwanathan et al., 2003). However, chunking needs to optimize the entire set of non-zero Lagrange multipliers that have been identified, and the resultant kernel matrix may still be ... |

511 | Estimating the support of a high-dimensional distribution, Neural Computation - Schölkopf, Platt, et al. - 2001 |

481 | A tutorial on support vector regression - Smola, Schölkopf - 2004 |

Citation Context: ...in many numerical routines, only approximate the optimal solution by an iterative strategy. Typically, the stopping criterion uses either the precision of the Lagrange multipliers or the duality gap (Smola and Schölkopf, 2004). For example, in SMO, SVMlight (Joachims, 1999) and SimpleSVM, training stops when the Karush-Kuhn-Tucker (KKT) conditions are fulfilled within a tolerance parameter ε. Experience with these softwar... |

471 | Making large-scale support vector machine learning practical - Joachims - 1999 |

Citation Context: ...ution by an iterative strategy. Typically, the stopping criterion uses either the precision of the Lagrange multipliers or the duality gap (Smola and Schölkopf, 2004). For example, in SMO, SVMlight (Joachims, 1999) and SimpleSVM, training stops when the Karush-Kuhn-Tucker (KKT) conditions are fulfilled within a tolerance parameter ε. Experience with these softwares indicate that near-optimal solutions are often... |

439 | The use of the area under the ROC curve in the evaluation of machine learning algorithms, Pattern Recognition 30(7) - Bradley - 1997 |

Citation Context: ...time, number of support vectors, and size of the training set are in log scale. ...commonly used for face detectors. The ROC (receiver operating characteristic) curve (Bradley, 1997) plots TP on the Y-axis and the false positive rate FP = (negatives incorrectly classified) / (total negatives) on the X-axis. Here, faces are treated as positives while non-faces as negatives. The AUC is... |

300 | Choosing multiple parameters for support vector machines - Chapelle, Vapnik, et al. - 2002 |

Citation Context: ...n support vector data description (SVDD) (Tax and Duin, 1999), which will be briefly reviewed in Section 3.1. The MEB problem can also be used to find the radius component of the radius-margin bound (Chapelle et al., 2002; Vapnik, 1998). Thus, Kumar et al. (2003) has pointed out that the MEB problem can be used in support vector clustering and SVM parameter tuning. However, as will be shown in Section 3.2, other kerne... |

284 | Using the Nyström method to speed up kernel machines - Williams, Seeger - 2000 |

Citation Context: ...ommonly encountered in data mining applications. To reduce the time and space complexities, a popular technique is to obtain low-rank approximations on the kernel matrix, by using the Nyström method (Williams and Seeger, 2001), greedy approximation (Smola and Schölkopf, 2000), sampling (Achlioptas et al., 2002) or matrix decompositions (Fine and Scheinberg, 2001). However, on very large data sets, the resulting rank of th... |

256 | An Improved Training Algorithm for Support Vector Machines - Osuna, Freund, et al. - 1997 |

Citation Context: ...Another approach to scale up kernel methods is by chunking (Vapnik, 1998) or more sophisticated decomposition methods (Chang and Lin, 2004; Osuna et al., 1997b; Platt, 1999; Vishwanathan et al., 2003). However, chunking needs to optimize the entire set of non-zero Lagrange multipliers that have been identified, and the resultant kernel matrix may still be ... |

206 | Less is more: Active learning with support vector machines - Schohn, Cohn - 2000 |

Citation Context: ...hese small SVMs. Lee and Mangasarian (2001) proposed the reduced SVM (RSVM), which uses a random rectangular subset of the kernel matrix. Instead of random sampling, one can also use active learning (Schohn and Cohn, 2000; Tong and Koller, 2000), squashing (Pavlov et al., 2000a), editing (Bakir et al., 2005) or even clustering (Boley and Cao, 2004; Yu et al., 2003) to intelligently sample a small number of training da... |

200 | Linear-time algorithms for linear programming in R^3 and related problems - Megiddo - 1983 |

Citation Context: ...oblem in computational geometry. The MEB problem computes the ball of minimum radius enclosing a given set of points (or, more generally, balls). Traditional algorithms for finding exact MEBs (e.g., Megiddo, 1983; Welzl, 1991) do not scale well with the dimensionality d of the points. Consequently, recent attention has shifted to the development of approximation algorithms (Bădoiu and Clarkson, 2002; Kumar e... |

193 | Efficient SVM training using low-rank kernel representations - Fine, Scheinberg - 2001 |

Citation Context: ...he data sets used. For comparison, we also run the following SVM implementations: 1. L2-SVM: LIBSVM implementation (in C++); 2. L2-SVM: LSVM implementation (in MATLAB), with low-rank approximation (Fine and Scheinberg, 2001) of the kernel matrix added; 3. L2-SVM: RSVM (Lee and Mangasarian, 2001) implementation (in MATLAB). The RSVM addresses the scale-up issue by solving a smaller optimization problem that involves a ra... |

177 | Sparse greedy matrix approximation for machine learning - Smola, Schölkopf - 2000 |

Citation Context: ...o reduce the time and space complexities, a popular technique is to obtain low-rank approximations on the kernel matrix, by using the Nyström method (Williams and Seeger, 2001), greedy approximation (Smola and Schölkopf, 2000), sampling (Achlioptas et al., 2002) or matrix decompositions (Fine and Scheinberg, 2001). However, on very large data sets, the resulting rank of the kernel matrix may still be too high to be handle... |

173 | Smallest enclosing disks (balls and ellipsoids), in: New Results and New Trends in Computer Science - Welzl - 1991 |

Citation Context: ...ational geometry. The MEB problem computes the ball of minimum radius enclosing a given set of points (or, more generally, balls). Traditional algorithms for finding exact MEBs (e.g., Megiddo, 1983; Welzl, 1991) do not scale well with the dimensionality d of the points. Consequently, recent attention has shifted to the development of approximation algorithms (Bădoiu and Clarkson, 2002; Kumar et al., 2003; ... |

128 | RSVM: Reduced support vector machines - Lee, Mangasarian - 2001 |

Citation Context: ...focus will be on the 2-norm error. In theory, this could be less robust in the presence of outliers. However, experimentally, its generalization performance is often comparable to that of the L1-SVM (Lee and Mangasarian, 2001; Mangasarian and Musicant, 2001a,b). Besides, the 2-norm error is more advantageous here because a soft-margin L2-SVM can be transformed to a hard-margin one. While the 2-norm error has been used in ... |

118 | Support vector domain description - Tax, Duin - 1999 |

Citation Context: ...e complexities of our algorithm to grow slowly (Section 4.3). 3. MEB Problems and Kernel Methods: The MEB can be easily seen to be equivalent to the hard-margin support vector data description (SVDD) (Tax and Duin, 1999), which will be briefly reviewed in Section 3.1. The MEB problem can also be used to find the radius component of the radius-margin bound (Chapelle et al., 2002; Vapnik, 1998). Thus, Kumar et al. (20... |

111 | Approximate clustering via core-sets - Bădoiu, Har-Peled, et al. - 2002 |

Citation Context: ...e same. Hence, the tth iteration takes O(t^3) time. As probabilistic speedup may not find the furthest point in each iteration, τ may be larger than 2/ε though it can still be bounded by O(1/ε^2) (Bădoiu et al., 2002). Hence, the whole procedure takes T = ∑_{t=1}^{τ} O(t^3) = O(τ^4) = O(1/ε^8). For a fixed ε, it is thus independent of m. The space complexity, which depends only on the number of iterations τ, bec... |

96 | The Kernel-Adatron Algorithm: A fast and simple learning procedure for Support Vector Machines - Friess, Cristianini, et al. - 1998 |

Citation Context: ...ir et al., 2005) or even clustering (Boley and Cao, 2004; Yu et al., 2003) to intelligently sample a small number of training data for SVM training. Other scale-up methods include the Kernel Adatron (Friess et al., 1998) and the SimpleSVM (Vishwanathan et al., 2003). For a more complete survey, interested readers may consult (Tresp, 2001) or Chapter 10 of (Schölkopf and Smola, 2002). In practice, state-of-the-art SV... |

90 | Lagrangian support vector machines - Mangasarian - 2001 |

79 | A parallel mixture of SVMs for very large scale problems - Collobert, Bengio, et al. |

68 | A fast iterative nearest point algorithm for support vector machine classifier design - Keerthi, Shevade, et al. - 2000 |

Citation Context: ...= ρ is the desired hyperplane and C is a user-defined parameter. Unlike the classification LSVM, the bias is not penalized here. Moreover, note that constraints ξi ≥ 0 are not needed for the L2-SVM (Keerthi et al., 2000). The corresponding dual is max_α −α′(K + (1/C)I)α : α ≥ 0, α′1 = 1, (8) where I is the m × m identity matrix. From the Karush-Kuhn-Tucker (KKT) conditions, we can recover w = ... and ξi = αi/C, an... |

63 | Face Detection in Still Gray Images - Heisele, Poggio, et al. - 2000 |

Citation Context: ...K patterns and CVM becomes faster than the other decomposition algorithms. 6.4 Extended MIT Face Data: In this section, we perform face detection using an extended version of the MIT face database (Heisele et al., 2000; Sung, 1996). The original data set has 6,977 training images (with 2,429 faces and 4,548 nonfaces) and 24,045 test images (472 faces and 23,573 nonfaces). The original 19 × 19 grayscale images are f... |

61 | Approximating the diameter, width, smallest enclosing cylinder and minimum-width annulus - Chan - 2000 |

Citation Context: ...problem also belongs to the larger family of shape fitting problems, which attempt to find the shape (such as a slab, cylinder, cylindrical shell or spherical shell) that best fits a given point set (Chan, 2000). Traditional algorithms for finding exact MEBs (such as Megiddo, 1983; Welzl, 1991) are not efficient for problems with d > 30. Hence, as mentioned in Section 1, it is of practical interest to stu... |

58 | Learning and Example Selection for Object and Pattern Detection - Sung - 1996 |

Citation Context: ...omes faster than the other decomposition algorithms. 6.4 Extended MIT Face Data: In this section, we perform face detection using an extended version of the MIT face database (Heisele et al., 2000; Sung, 1996). The original data set has 6,977 training images (with 2,429 faces and 4,548 nonfaces) and 24,045 test images (472 faces and 23,573 nonfaces). The original 19 × 19 grayscale images are first enlarge... |

54 | Classifying large data sets using SVMs with hierarchical clusters - Yu, Yang, et al. - 2003 |

Citation Context: ...ndom sampling, one can also use active learning (Schohn and Cohn, 2000; Tong and Koller, 2000), squashing (Pavlov et al., 2000a), editing (Bakir et al., 2005) or even clustering (Boley and Cao, 2004; Yu et al., 2003) to intelligently sample a small number of training data for SVM training. Other scale-up methods include the Kernel Adatron (Friess et al., 1998) and the SimpleSVM (Vishwanathan et al., 2003). For a... |

44 | Optimal core-sets for balls - Bădoiu, Clarkson - 2002 |

Citation Context: ...g exact MEBs (e.g., Megiddo, 1983; Welzl, 1991) do not scale well with the dimensionality d of the points. Consequently, recent attention has shifted to the development of approximation algorithms (Bădoiu and Clarkson, 2002; Kumar et al., 2003; Nielsen and Nock, 2004). In particular, a breakthrough was obtained by Bădoiu and Clarkson (2002), who showed that a (1 + ε)-approximation of the MEB can be efficiently obtained... |

42 | Efficient kernel machines using the improved fast Gauss transform - Yang, Davis - 2005 |

35 | Approximate minimum enclosing balls in high dimensions using coresets - Kumar, Mitchell, et al. - 2003 |

Citation Context: ...o, 1983; Welzl, 1991) do not scale well with the dimensionality d of the points. Consequently, recent attention has shifted to the development of approximation algorithms (Bădoiu and Clarkson, 2002; Kumar et al., 2003; Nielsen and Nock, 2004). In particular, a breakthrough was obtained by Bădoiu and Clarkson (2002), who showed that a (1 + ε)-approximation of the MEB can be efficiently obtained using core sets. Ge... |

31 | Incremental support vector machine classification - Fung, Mangasarian - 2002 |

Citation Context: ...rnel-related learning problems such as imbalanced learning, ranking and clustering. The iterative recruitment of core vectors is also similar to incremental procedures (Cauwenberghs and Poggio, 2001; Fung and Mangasarian, 2002), and this connection will be further explored. Besides, although the CVM can obtain much fewer support vectors than standard SVM implementations on large data sets, the number of support vectors may... |

28 | Shape fitting with outliers - Har-Peled, Wang - 2004 |

22 | Training support vector machine using adaptive clustering - Boley, Cao - 2004 |

Citation Context: ...matrix. Instead of random sampling, one can also use active learning (Schohn and Cohn, 2000; Tong and Koller, 2000), squashing (Pavlov et al., 2000a), editing (Bakir et al., 2005) or even clustering (Boley and Cao, 2004; Yu et al., 2003) to intelligently sample a small number of training data for SVM training. Other scale-up methods include the Kernel Adatron (Friess et al., 1998) and the SimpleSVM (Vishwanathan et ... |

18 | Breaking SVM Complexity with Cross-Training - Bakır, Bottou, et al. - 2005 |

Citation Context: ...random rectangular subset of the kernel matrix. Instead of random sampling, one can also use active learning (Schohn and Cohn, 2000; Tong and Koller, 2000), squashing (Pavlov et al., 2000a), editing (Bakir et al., 2005) or even clustering (Boley and Cao, 2004; Yu et al., 2003) to intelligently sample a small number of training data for SVM training. Other scale-up methods include the Kernel Adatron (Friess et al., ... |

18 | Scaling-up support vector machines using boosting algorithm, ICPR - Pavlov, Mao, et al. - 2000 |

16 | Finite Newton method for Lagrangian support vector machine classification. Special Issue on Support Vector Machines - Fung, Mangasarian |

16 | A question in the geometry of situation - Sylvester |

11 | Dealing with Large Diagonals in Kernel Matrices - Weston, Schölkopf, et al. - 2002 |

Citation Context: ...The AUC is always between 0 and 1. A perfect face detector will have unit AUC, while random guessing will have an AUC of 0.5. Another performance measure that will be reported is the balanced loss (Weston et al., 2002) ℓ_bal = 1 − (TP + TN)/2, which is also suitable for imbalanced data sets. Here, TP = (positives correctly classified) / (total positives) and TN = (negatives correctly classified) / (total negatives) are the true p... |

10 | Scaling kernel-based systems to large data sets - Tresp - 2001 |

9 | Decomposition methods for linear support vector machines - Chung, Kao, et al. - 2002 |

8 | Approximating smallest enclosing balls - Nielsen, Nock - 2004 |

Citation Context: ...do not scale well with the dimensionality d of the points. Consequently, recent attention has shifted to the development of approximation algorithms (Bădoiu and Clarkson, 2002; Kumar et al., 2003; Nielsen and Nock, 2004). In particular, a breakthrough was obtained by Bădoiu and Clarkson (2002), who showed that a (1 + ε)-approximation of the MEB can be efficiently obtained using core sets. Generally speaking, in an ... |

7 | Active set support vector machine classification - Mangasarian, Musicant - 2001 |

5 | The Bayesian committee support vector machine - Schwaighofer, Tresp - 2001 |

Citation Context: ...eckerboard Data: We first experiment on the 4 × 4 checkerboard data (Figure 2) commonly used for evaluating large-scale SVM implementations (Lee and Mangasarian, 2001; Mangasarian and Musicant, 2001b; Schwaighofer and Tresp, 2001). We use training sets with a maximum of 1 million points and 2,000 independent points for testing. Of course, this problem does not need so many points for training, but it is convenient for illustr... |

4 | DirectSVM: A simple support vector machine perceptron - Roobaert - 2000 |

Citation Context: ...e constants, choosing the pair (xa, xb) that maximizes R0 is then equivalent to choosing the closest pair belonging to opposing classes, which is also the heuristic used in initializing the DirectSVM (Roobaert, 2000) and SimpleSVM (Vishwanathan et al., 2003). 4.1.2 Distance Computations: Steps 2 and 3 involve computing ‖ct − φ̃(zℓ)‖ for zℓ ∈ S. On using ct = ∑_{zi ∈ St} αi φ̃(zi) in (3), we have ‖ct − φ̃(zℓ)‖^2 = ∑ αiα... |