## An Optimization Perspective on Kernel Partial Least Squares Regression (2003)

Venue: | Advances in Learning Theory: Methods, Models and Applications |

Citations: | 17 - 4 self |

### BibTeX

@INPROCEEDINGS{Bennett03anoptimization,
    author = {K. P. Bennett and M. J. Embrechts},
    title = {An Optimization Perspective on Kernel Partial Least Squares Regression},
    booktitle = {Advances in Learning Theory: Methods, Models and Applications},
    year = {2003},
    pages = {227--250},
    publisher = {Press}
}

### Abstract

This work provides a novel derivation based on optimization for the partial least squares (PLS) algorithm for linear regression and the kernel partial least squares (K-PLS) algorithm for nonlinear regression. This derivation makes the PLS algorithm, popularly and successfully used for chemometrics applications, more accessible to machine learning researchers. The work introduces Direct K-PLS, a novel way to kernelize PLS based on direct factorization of the kernel matrix. Computational results and discussion illustrate the relative merits of K-PLS and Direct K-PLS versus closely related kernel methods such as support vector machines and kernel ridge regression.

∗ This work was supported by NSF grant number IIS-9979860. Many thanks to Roman Rosipal, Nello Cristianini, and Johan Suykens for many helpful discussions on PLS and kernel methods, Sean Ekans from Concurrent Pharmaceutical for providing molecule descriptions for the Albumin data set, Curt Breneman and N. Sukumar for generating descriptors for the Albumin data, and Tony Van Gestel for an efficient Gaussian kernel

### Citations

9002 | The Nature of Statistical Learning Theory
- Vapnik
- 1995
Citation Context ...required, once the kernel matrix has been determined. There are two general approaches for kernelizing PLS. The first approach by Rosipal and Trejo is based on the now classic methodology used in SVM [30, 22]. Each point is mapped nonlinearly to a higher dimensional feature space. A linear regression function is constructed in the mapped space corresponding to a nonlinear function in the original input sp...

1551 |
An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods
- Cristianini, Shawe-Taylor
- 2000
Citation Context ...classic SVM as implemented in SVMTorch [30, 5]. The kernel was not centered for SVM-TORCH. More precisely, the LS-SVM solution is produced by solving the following set of equations (using notation in [3]): $(K + \lambda I)\alpha = y$ (22), to produce the following function: $f(x) = y'(K + \lambda I)^{-1} k$ (23), where $k_i = K(x, x_i)$, $i = 1, \ldots, m$. The LS-RSVM method is constructed by solving the following equations: (K′ ...
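Equations (22) and (23) in the context above can be illustrated with a minimal NumPy sketch. The Gaussian kernel, the helper names (`gaussian_kernel`, `fit_ls_svm`, `predict_ls_svm`), the toy data, and the value of λ are assumptions made for illustration only; they are not taken from the cited chapter.

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    """Gaussian kernel matrix between the rows of A and the rows of B (assumed kernel)."""
    sq = (A**2).sum(axis=1)[:, None] + (B**2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))

def fit_ls_svm(K, y, lam):
    """Solve (K + lam*I) alpha = y, as in equation (22)."""
    return np.linalg.solve(K + lam * np.eye(K.shape[0]), y)

def predict_ls_svm(alpha, K_new_train):
    """Evaluate f(x) = sum_i alpha_i K(x, x_i); row j holds K(x_j, x_i) for all training x_i."""
    return K_new_train @ alpha

# Toy data (hypothetical): 30 points in two dimensions.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=30)

K = gaussian_kernel(X, X)
alpha = fit_ls_svm(K, y, lam=0.1)
y_hat = predict_ls_svm(alpha, K)  # predictions on the training points
```

Because K is symmetric, `alpha @ k` equals $y'(K + \lambda I)^{-1} k$, so the prediction matches the form of equation (23) without forming an explicit inverse.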

1052 | Nonlinear component analysis as a kernel Eigenvalue problem
- Schölkopf, Smola, et al.
- 1998
Citation Context ...test set kernels should be centered in a consistent manner. Kernel centering can be implemented using the following formulas for centering the training kernel and test kernel as suggested by Wu et al [39, 23, 22]. It is important to note that the equation for the test kernel is based on the un-centered training kernel: $K_{\text{train}}^{\text{center}} = (I - \frac{1}{\ell}\mathbf{1}\mathbf{1}') K_{\text{train}} (I - \frac{1}{\ell}\mathbf{1}\mathbf{1}')$, $K_{\text{test}}^{\text{center}} = (K_{\text{test}} - \frac{1}{\ell}\mathbf{1}\mathbf{1}' K_{\text{train}}$ ...
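The centering step described in this context can be sketched as follows. The test-kernel formula is truncated in the snippet, so the sketch assumes the standard completion, in which the test kernel is right-multiplied by the same centering matrix; the function names and the linear-kernel check are hypothetical.

```python
import numpy as np

def center_train_kernel(K_train):
    """Return (I - 11'/l) K_train (I - 11'/l) for an l x l training kernel."""
    l = K_train.shape[0]
    C = np.eye(l) - np.ones((l, l)) / l
    return C @ K_train @ C

def center_test_kernel(K_test, K_train):
    """Center an (n_test x l) test kernel using the UN-centered training kernel.

    Assumed completion of the truncated formula:
    (K_test - (1/l) 1 1' K_train) (I - (1/l) 1 1').
    """
    n_test, l = K_test.shape
    C = np.eye(l) - np.ones((l, l)) / l
    return (K_test - np.ones((n_test, l)) @ K_train / l) @ C

# Sanity check with a linear kernel, where centering the kernel should be
# equivalent to subtracting the training mean from the data.
rng = np.random.default_rng(1)
X_tr = rng.normal(size=(6, 3))
X_te = rng.normal(size=(4, 3))
mu = X_tr.mean(axis=0)
K_tr = X_tr @ X_tr.T
K_te = X_te @ X_tr.T
K_tr_c = center_train_kernel(K_tr)
K_te_c = center_test_kernel(K_te, K_tr)
```

With a linear kernel, both centered matrices agree with the kernels computed from mean-subtracted data, which is a convenient way to check an implementation.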

1045 |
Introduction to Linear and Nonlinear Programming
- Luenberger
- 1973
Citation Context ...$\sum_i \|x_i w\|^2$ s.t. $w'w = 1$. (7) Assuming that $x_i$ has mean 0, this problem is equivalent to Problem (4). The optimal solution for $w$ can be easily constructed using the first order optimality conditions [18]. To derive the optimality conditions for Problem (7), convert the problem to a minimization problem, construct the Lagrangian function, and set the derivative of the Lagrangian with respect to $w$ to z...

493 |
Ridge regression: Biased estimation for nonorthogonal problems
- Hoerl, Kennard
- 1970
Citation Context ...: $\min_w \sum_{i=1}^{\ell} \frac{1}{2}(x_i w - y_i)^2$ (1), where $x_i$ is a row vector representing the $i$th data point. For high-dimensional data, the problem must be regularized to prevent overfitting. In ridge regression [12] this is accomplished by penalizing large values of $\|w\|^2$ to yield: $\min_w \sum_{i=1}^{\ell} \frac{1}{2}(x_i w - y_i)^2 + \frac{\lambda}{2}\|w\|^2$ (2). PLS’s method of regularization or capacity control distinguishes it from other...
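As a hedged illustration of Problem (2) in this context: setting the gradient $X'(Xw - y) + \lambda w$ to zero gives the closed form $w = (X'X + \lambda I)^{-1} X'y$. The sketch below solves that linear system with NumPy; the function name `ridge_fit` and the synthetic data are assumptions for illustration.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Minimize sum_i 0.5*(x_i w - y_i)^2 + 0.5*lam*||w||^2 via the normal equations."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

# Synthetic data (hypothetical): 40 points, 5 features.
rng = np.random.default_rng(2)
X = rng.normal(size=(40, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.normal(size=40)

w_small = ridge_fit(X, y, lam=0.1)   # light regularization
w_large = ridge_fit(X, y, lam=10.0)  # heavier penalty shrinks ||w||
```

Increasing λ shrinks the norm of the solution, which is the capacity-control mechanism the context contrasts with the PLS approach.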

403 | Learning with Kernels: Support Vector - Scholkopf, Smola - 2002 |

300 | A scaled conjugate gradient algorithm for fast supervised learning. NEURAL
- Møller
- 1993
Citation Context ...from the DDASSL website [6]. The Least Squares SVM variants (LS-SVM, LS-RSVM and LS-RSVM lin) apply an extension of scaled conjugate gradient method introduced by Möller (in a very different context) [19] for fast equation solving. The scaled conjugate gradient method is effectively a Krylov method [14] and the computation time for solving a linear set of $\ell$ equations scales roughly as $50\ell^2$, rather than $\ell^3$ for traditional equation solvers. 5.2 Benchmark Cases The benchmark studies comprise four bi...

251 | SVMTorch: Support vector machines for large-scale regression problems
- Collobert, Bengio
- 2001
Citation Context ... (LS-SVM) applied to the centered kernel; vi) the reduced form of Least-Squares Support Vector Machines [10] (LS-RSVM) applied to the centered kernel; and viii) classic SVM as implemented in SVMTorch [30, 5]. The kernel was not centered for SVM-TORCH. More precisely, the LS-SVM solution is produced by solving the following set of equations (using notation in [3]): to produce the following function (K + λ...

193 | Efficient SVM training using low-rank kernel representations
- Fine, Scheinberg
Citation Context ...el matrix. DK-PLS explicitly produces a low rank approximation of the kernel matrix. Thus it is more closely related to other kernel matrix approximation approaches based on sampling or factorization [16, 10, 17, 9, 27, 24]. DK-PLS has the advantage that the kernel does not need to be square. When combined with sampling of the columns of the kernel matrix such as in [17, 16, 10], it is more scalable than the original KP...

119 | Estimation of principal components and related models by iterative least squares. Analysis: Multivariate - Wold - 1966 |

112 | Proximal Support Vector Machine Classifiers
- Fung, Mangasarian
- 2001
Citation Context ...el matrix. DK-PLS explicitly produces a low rank approximation of the kernel matrix. Thus it is more closely related to other kernel matrix approximation approaches based on sampling or factorization [16, 10, 17, 9, 27, 24]. DK-PLS has the advantage that the kernel does not need to be square. When combined with sampling of the columns of the kernel matrix such as in [17, 16, 10], it is more scalable than the original KP...

103 | Kernel Partial Least Squares Regression in Reproducing Kernel Hilbert Spaces
- Rosipal, Trejo
Citation Context .... This ability to do inference in high-dimensional space effectively makes PLS an ideal candidate for a kernel approach. Rosipal and Trejo extended PLS to nonlinear regression using kernel functions [22]. As demonstrated in this chapter, kernel partial least squares (K-PLS) is a very effective general purpose regression approach. The goal of this work is to make PLS and K-PLS more accessible to machi...

91 |
PLS-regression: A basic tool of chemometrics
- Wold, Sjostrom, et al.
- 2001
Citation Context ...his section, we compare several different types of kernel regression methods with kernel PLS. 5.1 Methods Different methods considered in this benchmark study include: i) Linear Partial Least Squares [32, 33, 34, 37] (PLS); ii) Linear Proximal Support Vector Machines [10] (P-SVM Lin); iii) K-PLS algorithm as proposed by Rosipal and Trejo [22] with kernel centering (K-PLS); iv) direct kernel PLS (DK-PLS), which f...

90 |
PLS Regression Methods
- Hoskuldsson
- 1988
Citation Context ... ) has a solution $\bar{v}$, then $v$ is the optimal solution of Problem (17) since the objective will then be zero, the lowest possible value. It turns out that $P'W$ is always a lower triangular matrix (see [22, 13]) and thus nonsingular. Thus we know that $\bar{v}$ exists satisfying $P'Wv = C'$. The solution $\hat{v}$ can be computed efficiently using forward substitution. For notational convenience we will say $\bar{v}$ = (P′ ...

88 | Input space versus feature space in kernel-based methods
- Schölkopf, Mika, et al.
- 1999
Citation Context ...el matrix. DK-PLS explicitly produces a low rank approximation of the kernel matrix. Thus it is more closely related to other kernel matrix approximation approaches based on sampling or factorization [16, 10, 17, 9, 27, 24]. DK-PLS has the advantage that the kernel does not need to be square. When combined with sampling of the columns of the kernel matrix such as in [17, 16, 10], it is more scalable than the original KP...

73 |
Soft modeling: the basic design and some extensions. Systems Under Indirect Observation: Causality
- Wold
- 1982
Citation Context ...his section, we compare several different types of kernel regression methods with kernel PLS. 5.1 Methods Different methods considered in this benchmark study include: i) Linear Partial Least Squares [32, 33, 34, 37] (PLS); ii) Linear Proximal Support Vector Machines [10] (P-SVM Lin); iii) K-PLS algorithm as proposed by Rosipal and Trejo [22] with kernel centering (K-PLS); iv) direct kernel PLS (DK-PLS), which f...

37 | A study on reduced support vector machines
- Lin, Lin

35 |
Beware of q2
- Golbraikh, Tropsha
Citation Context ...ts are presented by the least-mean square error and Q2 defined below. Q2 is the square of the correlation between the actual and predicted response. Since Q2 is independent of the scaling of the data [11], it is generally useful to compare the relative ... Method / Data Set: Boston Housing, Albumin, Abalone; m×n: 506×13, 94×551, 4177×8 ...

16 |
Non-Linear Partial Least Squares Modelling
- Wold
- 1992
Citation Context ...d works very well on high-dimensional collinear data, PLS is a natural candidate for a kernel method. 3 Nonlinear PLS via Kernels While other nonlinear extensions to PLS have previously been proposed [1, 35, 36, 2], PLS has only just recently been extended to nonlinear regression through the use of kernels. KPLS exhibits the elegance that only linear algebra is required, once the kernel matrix has been determin...

13 |
Nonlinear projection to latent structures revisited (the neural network PLS algorithm),” Computers and chemical engineering
- Baffi, Martin, et al.
- 1999
Citation Context ...d works very well on high-dimensional collinear data, PLS is a natural candidate for a kernel method. 3 Nonlinear PLS via Kernels While other nonlinear extensions to PLS have previously been proposed [1, 35, 36, 2], PLS has only just recently been extended to nonlinear regression through the use of kernels. KPLS exhibits the elegance that only linear algebra is required, once the kernel matrix has been determin...

12 |
The Population Biology of Abalone (Haliotis Species) in Tasmania. I. Blacklip Abalone (H Rubra) from the North Coast and the Islands of Bass Strait
- Nash
- 1978
Citation Context ...cases include Boston Housing, Abalone, and Albumin. The Boston Housing data were obtained from the UCI data repository. The Abalone data, which relate to the prediction of the age for horseshoe crabs [21], were obtained from http://ssi.umh.ac.be/abalone.html. The Albumin dataset [4] is a public QSAR drug-design-related dataset and can be obtained from the DDASSL homepage [6]. The aim of the Albumin da...

12 |
Cross-Validatory Estimation of the Number of
- Wold
- 1987
Citation Context ...matrix $X'X$. But this yields only a one-dimensional representation of the data. A series of orthogonal projections can be computed by the NIPALS (Nonlinear Iterative Partial Least Squares) algorithm [34]. The data matrix is reduced in order to account for the part of the data explained by $w$. The data matrix is “deflated” by subtracting away the part explained by $w$. So at the next iteration the method...
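The deflation idea in this context can be sketched for the principal-component case: find a direction $w$ by power iteration, compute the scores $t = Xw$, subtract $t w'$ from the data matrix, and repeat. This is a simplified illustration under assumed names (`nipals_pca`, `X_demo`) and an assumed stopping rule, not the chapter's own algorithm listing.

```python
import numpy as np

def nipals_pca(X, n_components, n_iter=500, tol=1e-10):
    """Compute orthogonal projection directions by power iteration with deflation."""
    X = X - X.mean(axis=0)  # work on a centered copy of the data
    directions = []
    for _k in range(n_components):
        # Start from the column of the current (deflated) matrix with the largest norm.
        t = X[:, np.argmax((X**2).sum(axis=0))].copy()
        for _ in range(n_iter):
            w = X.T @ t
            w /= np.linalg.norm(w)   # unit-length direction
            t_new = X @ w            # scores along w
            converged = np.linalg.norm(t_new - t) < tol
            t = t_new
            if converged:
                break
        directions.append(w)
        X = X - np.outer(t, w)       # deflate: remove the part explained by w
    return np.column_stack(directions)

# Hypothetical data with clearly separated variances per column.
rng = np.random.default_rng(3)
X_demo = rng.normal(size=(50, 4)) * np.array([5.0, 2.0, 1.0, 0.5])
W = nipals_pca(X_demo, 2)
```

After deflation the matrix satisfies $Xw = 0$, so every later direction lies in the orthogonal complement of the earlier ones; this is why the columns of `W` come out orthonormal.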

11 |
RSVM: Reduced Support Vector
- Lee, Mangasarian
- 2001

10 |
Data Strip Mining for the Virtual Design of Pharmaceuticals with Neural Networks
- Kewley, Embrechts
- 2000
Citation Context ...ude in the input space, can further enhance the performance of K-PLS. Analyze/StripMiner has a feature selection method incorporated based on sensitivity analysis as described by Kewley and Embrechts [15]. Sensitivity analysis monitors the response as the features are tweaked one-at-a-time within their allowable range, while holding the other input features constant at their average value. Features th...

9 |
Sparse Greedy Matrix Approximation for
- Smola, Schölkopf
- 2000

7 |
INLR, Implicit Non-Linear Latent Variable Regression
- Berglund, Wold
- 1997
Citation Context ...d works very well on high-dimensional collinear data, PLS is a natural candidate for a kernel method. 3 Nonlinear PLS via Kernels While other nonlinear extensions to PLS have previously been proposed [1, 35, 36, 2], PLS has only just recently been extended to nonlinear regression through the use of kernels. KPLS exhibits the elegance that only linear algebra is required, once the kernel matrix has been determin...

6 |
Regularization Networks and Support Vector
- Evgeniou, Pontil, et al.
- 2000
Citation Context ...which factorizes the centered kernel matrix directly as described above; v) LS-SVM also known as Kernel Ridge Regression [26, 7, 8] (LS-SVM) applied to the centered kernel; vi) the reduced form of Least-Squares Support Vector Machines [10] (LS-RSVM) applied to the centered kernel; and viii) classic SVM as implemented in SVMTorch ...

5 |
Cheminformatic models to predict binding affinities to human serum albumin
- Colmenarejo, Alvarez-Pedraglio, et al.
Citation Context ... obtained from the UCI data repository. The Abalone data, which relate to the prediction of the age for horseshoe crabs [21], were obtained from http://ssi.umh.ac.be/abalone.html. The Albumin dataset [4] is a public QSAR drug-design-related dataset and can be obtained from the DDASSL homepage [6]. The aim of the Albumin dataset is predicting the binding affinities of small molecules to the human seru...

5 | Scalable kernel systems
- Tresp, Schwaighofer
- 2001
Citation Context ...ximation of the kernel matrix, and then uses this approximation to construct the final function. Strategies for improving kernel methods through factorization have been receiving increasing attention [29, 9]. DK-PLS not only computes a factorization very efficiently relative to eigenvector methods, but it produces a low-rank approximation biased for good performance on the regression task. Algorithm 1 is...

3 | The Kernel PCA Algorithm for Wide Data - Wu, Massarat, et al. - 1977 |

2 |
The Idea behind Krylov
- Ipsen, Meyer
- 1998
Citation Context ...an extension of scaled conjugate gradient method introduced by Möller (in a very different context) [19] for fast equation solving. The scaled conjugate gradient method is effectively a Krylov method [14] and the computation time for solving a linear set of $\ell$ equations scales roughly as $50\ell^2$, rather than $\ell^3$ for traditional equation solvers. 5.2 Benchmark Cases The benchmark studies comprise four bi...