## A Study on Sigmoid Kernels for SVM and the Training of non-PSD Kernels by SMO-type Methods (2003)

Citations: 59 (6 self)

### BibTeX

```bibtex
@TECHREPORT{Lin03astudy,
  author      = {Hsuan-Tien Lin and Chih-Jen Lin},
  title       = {A Study on Sigmoid Kernels for SVM and the Training of non-PSD Kernels by SMO-type Methods},
  institution = {},
  year        = {2003}
}
```

### Abstract

The sigmoid kernel was quite popular for support vector machines due to its origin in neural networks. However, since the kernel matrix may not be positive semidefinite (PSD), it is no longer widely used and its behavior is not well understood. In this paper, we analyze such non-PSD kernels from the point of view of separability. Based on an investigation of the parameters in different ranges, we show that for some parameters the kernel matrix is conditionally positive definite (CPD), a property which explains its practical viability. Experiments are given to illustrate our analysis. Finally, we discuss how to solve the resulting non-convex dual problems by SMO-type decomposition methods. Suitable modifications for any symmetric non-PSD kernel matrix are proposed, with convergence proofs.
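The non-PSD behavior noted above is easy to observe numerically. Below is a minimal NumPy sketch (the data, dimensions, and parameter choice are illustrative, not taken from the paper's experiments): for the sigmoid kernel K(x_i, x_j) = tanh(a x_i^T x_j + r), choosing a < 0 with r = 0 makes every diagonal entry tanh(a‖x_i‖²) negative, so the kernel matrix cannot be PSD.

```python
import numpy as np

# Sigmoid kernel K(xi, xj) = tanh(a * xi^T xj + r); for a < 0 and r = 0
# the diagonal entries are negative, forcing a negative eigenvalue.
rng = np.random.default_rng(0)
X = rng.standard_normal((20, 5))        # 20 points in R^5, illustrative data
a, r = -0.5, 0.0                        # illustrative parameters, not from the paper
K = np.tanh(a * (X @ X.T) + r)
eigvals = np.linalg.eigvalsh(K)         # K is symmetric, so eigvalsh applies
print("min eigenvalue:", eigvals.min()) # negative => K is not PSD
```

For a > 0 with suitably small r, the paper argues the matrix is instead CPD, which is what explains the kernel's practical viability.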

### Citations

8983 | The nature of statistical learning theory
- Vapnik
- 1995

Citation Context: ... Given training vectors x_i ∈ R^n, i = 1, ..., l, in two classes, labeled by the vector y ∈ R^l such that y_i ∈ {1, −1}, the support vector machine (SVM) (Boser, Guyon, and Vapnik 1992; Cortes and Vapnik 1995) tries to separate the training vectors in a φ-mapped (and possibly infinite dimensional) space, with an error cost C > 0: min_{w,b,ξ} (1/2) w^T w + C Σ_{i=1}^l ξ_i subject to y_i (w^T φ(x_i) + b) ≥ 1 − ξ_i ...

3437 | LIBSVM: a library for support vector machines, 2001. Software available at http://www.csie.ntu.edu.tw/∼cjlin/libsvm
- Chang, Lin

Citation Context: ...1998). The first three data sets are linearly scaled so values of each attribute are in [−1, 1]. For a1a, its values of each attribute are in [0, 1] so we do not scale it. We solve (1.2) using LIBSVM (Chang and Lin 2001), with its model selection tool for grid search and contour drawing. Note that LIBSVM, an SMO-type decomposition implementation, uses techniques in Section 6 for solving non-convex optimization probl...

2868 | UCI Repository of machine learning databases
- Blake, Merz
- 1998

Citation Context: ...n log_2 a (−11 to −2 with grid space 1) and log_2 C (−2 to 13 with grid space 1). Four problems are tested: heart, german, diabete, and a1a. They are from (Michie, Spiegelhalter, and Taylor 1994) and (Blake and Merz 1998). The first three data sets are linearly scaled so values of each attribute are in [−1, 1]. For a1a, its values of each attribute are in [0, 1] so we do not scale it. We solve (1.2) using LIBSVM (Cha...

2284 | A tutorial on support vector machines for pattern recognition
- Burges
- 1998

Citation Context: ... is related to neural networks. It was first pointed out in (Vapnik 1995) that its kernel matrix might not be PSD for certain values of the parameters a and r. More discussions are in, for instance, (Burges 1998; Scholkopf and Smola 2002). Without K(x_i, x_j) being the inner product of two vectors, there is no problem (1.1) so it is unclear what kind of classification problems we are solving. Surprisingly, ...

2173 | Support-vector networks
- Cortes, Vapnik
- 1995

Citation Context: ...Introduction Given training vectors x_i ∈ R^n, i = 1, ..., l, in two classes, labeled by the vector y ∈ R^l such that y_i ∈ {1, −1}, the support vector machine (SVM) (Boser, Guyon, and Vapnik 1992; Cortes and Vapnik 1995) tries to separate the training vectors in a φ-mapped (and possibly infinite dimensional) space, with an error cost C > 0: min_{w,b,ξ} (1/2) w^T w + C Σ_{i=1}^l ξ_i subject to y_i (w^T φ(x_i) + b) ≥ 1 − ξ_i ...

2029 | Learning with Kernels
- Schölkopf, Smola
- 2002

Citation Context: ...o neural networks. It was first pointed out in (Vapnik 1995) that its kernel matrix might not be PSD for certain values of the parameters a and r. More discussions are in, for instance, (Burges 1998; Scholkopf and Smola 2002). Without K(x_i, x_j) being the inner product of two vectors, there is no problem (1.1) so it is unclear what kind of classification problems we are solving. Surprisingly, the sigmoid kernel has be...

1441 | Making large-Scale SVM Learning Practical
- Joachims
- 1999

Citation Context: ...ation for non-PSD Kernel Matrices First we discuss how decomposition methods work for PSD kernels and the difficulties for non-PSD cases. The decomposition method (e.g. (Osuna, Freund, and Girosi 1997; Joachims 1998; Platt 1998; Chang and Lin 2001)) is an iterative process. In each step, the index set of variables is partitioned to two sets B and N, where B is the working set. Then in that iteration variables c...

1291 | A training algorithm for optimal margin classifiers - Boser, Guyon, et al.

1011 | Fast training of support vector machines using sequential minimal optimization
- Platt
- 1998

Citation Context: ...SD Kernel Matrices First we discuss how decomposition methods work for PSD kernels and the difficulties for non-PSD cases. The decomposition method (e.g. (Osuna, Freund, and Girosi 1997; Joachims 1998; Platt 1998; Chang and Lin 2001)) is an iterative process. In each step, the index set of variables is partitioned to two sets B and N, where B is the working set. Then in that iteration variables corresponding...

565 | Training Support Vector Machines: An Application to Face Detection - Osuna, Freund, et al. - 1997

342 | What every computer scientist should know about floating-point arithmetic
- Goldberg
- 1991

Citation Context: ...educe d_j to the lower bound. If the kernel matrix is only PSD, it is possible that K_ii − 2K_ij + K_jj = 0, as shown in Figure 3(b). In this case, using the trick under the IEEE floating point standard (Goldberg 1991), we can make sure that (6.6) is −∞, which is still defined. Then, a comparison with L still reduces d_j to the lower bound. Thus, a direct (but careful) use of (6.6) does not cause any problem. Mor...
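The IEEE trick referenced in this context can be shown in isolation: dividing a negative finite numerator by a zero denominator yields −∞ rather than an error, and −∞ still compares correctly with any finite lower bound L. A hedged sketch (the numbers are placeholders; the paper's actual update formula (6.6) is not reproduced here):

```python
import numpy as np

# Under IEEE 754 arithmetic, -1.0 / 0.0 evaluates to -inf instead of failing,
# so a later comparison against a finite lower bound L remains well defined.
with np.errstate(divide="ignore"):      # silence the divide-by-zero warning
    d = np.float64(-1.0) / np.float64(0.0)
print(d)                                # -inf
L = -2.5                                # placeholder lower bound
d_clipped = max(d, L)                   # -inf < L, so the bound L is taken
```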

277 | Interpolation of scattered data: Distance matrices and conditionally positive definite functions
- Micchelli
- 1986

Citation Context: ...^2. If written in matrix products, we can see that the first and last terms form the same diagonal matrices with positive elements. And the middle one is in the form of an RBF kernel matrix. From (Micchelli 1986), if x_i ≠ x_j for i ≠ j, the RBF kernel matrix is PD. Therefore, H is PD. If H_r is not PD after r is small enough, there is an infinite sequence {r_i} with lim_{i→∞} r_i = −∞ and H_{r_i}, ∀i, are ...

191 | Improvements to Platt's SMO algorithm for SVM classifier design, Neural Computation 13 - Keerthi, Shevade, et al. - 2001

150 | Support vector machines, reproducing kernel Hilbert spaces and randomized GACV
- Wahba
- 1998

Citation Context: ...unlike standard SVM, the sparsity is lost. There are other formulations which use a non-PSD kernel matrix but remain convex. For example, we can consider the kernel logistic regression (KLR) (e.g., (Wahba 1998)) and use a convex regularization term: min_{α,b} (1/2) Σ_{r=1}^l α_r^2 + C Σ_{r=1}^l log(1 + e^{ξ_r}) (33), where ξ_r ≡ −y_r (Σ_{j=1}^l α_j K(x_r, x_j) + b). By defin...

136 | Training invariant support vector machines
- DeCoste, Schölkopf

Citation Context: ...e in (Scholkopf 1997). Recently, quite a few kernels specific to different applications are proposed. However, similar to the sigmoid kernel, some of them are not PSD either (e.g. kernel jittering in (DeCoste and Scholkopf 2002) and tangent distance kernels in (Haasdonk and Keysers 2002)). Thus, it is essential to analyze such non-PSD kernels. In Section 2, we discuss them by considering the separability of training data. T...

133 | Harmonic Analysis on Semigroups - Berg, Christensen, et al. - 1984

131 | Support vector learning
- Scholkopf

Citation Context: ...s no problem (1.1) so it is unclear what kind of classification problems we are solving. Surprisingly, the sigmoid kernel has been successfully used in some practical cases. Some explanations are in (Scholkopf 1997). Recently, quite a few kernels specific to different applications are proposed. However, similar to the sigmoid kernel, some of them are not PSD either (e.g. kernel jittering in (DeCoste and Scholkop...

109 | On the Convergence of the Decomposition Method for Support Vector Machines
- Lin
- 2001

Citation Context: ...oss-validation accuracy using problems (1.2) and (5.1). The same four problems as in Table 1 are used, with the same ranges of parameters. We use LIBSVM for solving (1.2), and a modification of BSVM (Hsu and Lin 2002) for (5.1). Results of CV accuracy are presented in Figure 2. Contours of (1.2) are on the left column, and those of (5.1) are on the right. For each contour, the horizontal axis is log_2 C, while...

101 | Asymptotic behaviors of support vector machines with Gaussian kernel
- Keerthi, Lin

Citation Context: ...ndix A. Theorem 8 tells us that when r < 0 is small enough, the separating hyperplanes of (1.2) and (4.5) are almost the same. Similar cross-validation accuracy will be shown in the later experiments. (Keerthi and Lin 2003, Theorem 2) shows that when a → 0, for any given C, the decision value by the SVM using the RBF kernel e^{−a‖x_i − x_j‖^2} with the error cost C/(2a) approaches the decision value of the following linear...

80 | The kernel trick for distances
- Scholkopf
- 2000

Citation Context: ...le, (Berg, Christensen, and Ressel 1984). Then, the use of a CPSD kernel is equivalent to the use of a PSD one as y^T α = 0 in (1.2) plays a similar role of Σ_{i=1}^l v_i = 0 in the definition of CPSD (Scholkopf 2000). Note that here for easier analyses, we will work only on the kernel matrices but not the kernel functions. Therefore, results will be more restricted. The following theorem gives properties which i...
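The CPSD condition quoted in this context (v^T K v ≥ 0 whenever Σ_i v_i = 0) can be probed numerically. A sketch under our own naming (the helper function and tolerance are not from the paper), using the negated squared-Euclidean-distance matrix, a classical example that is CPD but not PSD:

```python
import numpy as np

def cpd_violations(K, trials=200, tol=1e-9, seed=0):
    """Count random zero-sum directions v with v^T K v < -tol.
    An empirical check of conditional PSD-ness, not a proof."""
    rng = np.random.default_rng(seed)
    count = 0
    for _ in range(trials):
        v = rng.standard_normal(K.shape[0])
        v -= v.mean()                   # project onto the subspace sum(v) = 0
        if v @ K @ v < -tol:
            count += 1
    return count

rng = np.random.default_rng(1)
X = rng.standard_normal((15, 4))
D = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # squared distances
K = -D                  # CPD: v^T(-D)v = 2*||X^T v||^2 >= 0 when sum(v) = 0
print(cpd_violations(K))                # 0: no violations on the zero-sum subspace
print(np.linalg.eigvalsh(K).min() < 0)  # True: yet K itself is not PSD
```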

60 | Geometry and invariance in kernel based methods
- Burges
- 1999

Citation Context: ...ted to the RBF kernel when a is fixed and r gets small enough. In Section 4, we will discuss more about the relation between the sigmoid and the RBF kernels. Case 2: a > 0 and r ≥ 0. It was stated in (Burges 1999) that if tanh(a x_i^T x_j + r) is PD, then r ≥ 0 and a ≥ 0. However, the inverse does not hold so for this case, kernels may not be PD and the practical viability is not clear. As Section 2 has show...

56 | Reducing the run-time complexity in support vector machines
- Osuna, Girosi
- 1999

Citation Context: ...= 1, ..., l. It is from substituting w = Σ_{i=1}^l y_i α_i φ(x_i) into (1.1) so that w^T w = α^T Q α and y_i w^T φ(x_i) = (Qα)_i. Note that in (2.1), α_i may be negative. This problem was used in (Osuna and Girosi 1998) and some subsequent work. (Lin and Lin 2003) shows that if Q is symmetric PSD, the optimal solution α of the dual problem (1.2) is also optimal for (2.1). However, the opposite may not be true unles...
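The substitution described in this context can be sanity-checked numerically. A small sketch (a linear kernel so that φ is the identity map; the data and sizes are illustrative):

```python
import numpy as np

# Numerical check of the substitution quoted above: with
# w = sum_i y_i * alpha_i * phi(x_i) and Q_ij = y_i y_j K(x_i, x_j),
# we get w^T w = alpha^T Q alpha and y_i w^T phi(x_i) = (Q alpha)_i.
rng = np.random.default_rng(1)
l, n = 8, 3
X = rng.standard_normal((l, n))
y = rng.choice([-1.0, 1.0], size=l)
alpha = rng.standard_normal(l)          # may be negative, as the context notes
K = X @ X.T                             # linear kernel: K(xi, xj) = xi^T xj
Q = (y[:, None] * y[None, :]) * K
w = (y * alpha) @ X                     # w = sum_i y_i alpha_i x_i
assert np.isclose(w @ w, alpha @ Q @ alpha)
assert np.allclose(y * (X @ w), Q @ alpha)
```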

37 | A Study on Reduced Support Vector Machines
- Lin, Lin
- 2003

Citation Context: ...1 y_i α_i φ(x_i) into (1.1) so that w^T w = α^T Q α and y_i w^T φ(x_i) = (Qα)_i. Note that in (2.1), α_i may be negative. This problem was used in (Osuna and Girosi 1998) and some subsequent work. (Lin and Lin 2003) shows that if Q is symmetric PSD, the optimal solution α of the dual problem (1.2) is also optimal for (2.1). However, the opposite may not be true unless Q is symmetric positive definite (PD). From...

32 | Tangent distance kernels for support vector machines
- Haasdonk, Keysers
- 2002

Citation Context: ...to different applications are proposed. However, similar to the sigmoid kernel, some of them are not PSD either (e.g. kernel jittering in (DeCoste and Scholkopf 2002) and tangent distance kernels in (Haasdonk and Keysers 2002)). Thus, it is essential to analyze such non-PSD kernels. In Section 2, we discuss them by considering the separability of training data. Then in Section 3, we explain the practical viability of the ...

18 | Neural Network FAQ - periodic posting to the Usenet newsgroup comp.ai.neural-nets
- Sarle
- 1997

Citation Context: ...nel, where discussion in Section 3 indicates that parameters with better accuracy tend to be those with CPD kernel matrices. It is well known that Neural Networks have similar problems about local minima (Sarle 1997), and a popular way to prevent trapping in a bad one is multiple random initializations. Here we adapt this method and present an empirical study in Figure 5. We use the heart data set, with the same...

16 | Asymptotic convergence of an SMO algorithm without any assumptions
- Lin
- 2002

Citation Context: ...method using (6.3) for the working set selection and (6.8) for solving the sub-problem, any limit point of {α^k} is a local minimum of (1.2). Proof. If we carefully check the proof in (Lin 2001; Lin 2002), it can be extended to non-PSD Q if (1) (6.10) holds and (2) a local minimum of the sub-problem is obtained in each iteration. Now we have (6.10) from Lemma 3. In addition, d_j = L is essentially one...

16 | Newton's method for large-scale bound constrained problems
- Moré
- 2006

Citation Context: ...s in Section 6 for solving non-convex optimization problems. For KLR, two optimization procedures are compared. The first one, KLR-NT, is a Newton's method implemented by modifying the software TRON (Lin and Moré 1999). The second one, KLR-CG, is a conjugate gradient method (see, for example, (Nash and Sofer 1996)). The stopping criteria for the two procedures are set the same to ensure that the solutions are compar...

13 | Data available at http://www.ncc.up.pt/liacc/ML/statlog/datasets.html - Michie, Spiegelhalter, et al. - 1994


8 | On the convergence of a modified version of SVMlight algorithm
- Palagi, Sciandrone

Citation Context: ...e is not clear. In the whole convergence proof, Lemma 3 is used to obtain ‖α^{k+1} − α^k‖ → 0 as k → ∞. A different way to have this property is by slightly modifying the sub-problem (6.1) as shown in (Palagi and Sciandrone 2002). Then the convergence holds when we exactly solve the new sub-problem. 7 Discussions From the results in Sections 3 and 5, we clearly see the importance of the CPDness which is directly related to t...

2 | The Separability Theory of Hyperbolic Tangent Kernels and Support Vector Machines for Pattern Classification
- Sellathurai, Haykin
- 1999

Citation Context: ...out y^T α = 0, CPD of K for small r is not useful. The experiments fully demonstrate the importance of incorporating constraints of the dual problem into the analysis of the kernel. An earlier paper (Sellathurai and Haykin 1999) on the sigmoid kernel explains that each kernel element (i.e., K_ij) is from a hyperbolic inner product. Thus, a special type of maximal margin still exists. However, as shown in Figure 2, without ...

2 | Automatically tuned linear algebra software and the ATLAS project - Whaley, Petitet, Dongarra - 2000