## The Kernel Recursive Least Squares Algorithm (2003)

### Download Links

- [www-ee.technion.ac.il]
- [webee.technion.ac.il]

Venue: IEEE Transactions on Signal Processing

Citations: 61 (2 self)

### BibTeX

```bibtex
@ARTICLE{Engel03thekernel,
  author  = {Yaakov Engel and Shie Mannor and Ron Meir},
  title   = {The Kernel Recursive Least Squares Algorithm},
  journal = {IEEE Transactions on Signal Processing},
  year    = {2003},
  volume  = {52},
  pages   = {2275--2285}
}
```

### Abstract

We present a non-linear, kernel-based version of the Recursive Least Squares (RLS) algorithm. Our Kernel-RLS (KRLS) algorithm performs linear regression in the feature space induced by a Mercer kernel, and can therefore be used to recursively construct the minimum mean-squared-error regressor. Sparsity of the solution is achieved by a sequential sparsification process that admits into the kernel representation a new input sample only if its feature-space image cannot be sufficiently well approximated by combining the images of previously admitted samples. This sparsification procedure is crucial to the operation of KRLS, both because it allows the algorithm to operate on-line and because it effectively regularizes its solutions. A theoretical analysis of the sparsification method reveals its close affinity to kernel PCA, and a data-dependent loss bound is presented, quantifying the generalization performance of the KRLS algorithm. We demonstrate the performance and scaling properties of KRLS and compare it to a state-of-the-art Support Vector Regression algorithm, using both synthetic and real data. We additionally test KRLS on two signal processing problems in which the use of traditional least-squares methods is commonplace: time series prediction and channel equalization.
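The sequential sparsification test described in the abstract (approximate linear dependence: a sample is admitted only if its feature-space image is not well approximated by the span of already-admitted images) can be sketched as follows. This is a simplified NumPy illustration of the dictionary-construction step only, not the authors' implementation; the RBF kernel, the threshold `nu`, and all function names are our own choices:

```python
import numpy as np

def rbf(a, b, sigma=1.0):
    """Gaussian (RBF) Mercer kernel between two sample vectors."""
    d = np.asarray(a, float) - np.asarray(b, float)
    return float(np.exp(-d @ d / (2.0 * sigma ** 2)))

def ald_dictionary(X, kernel=rbf, nu=1e-3):
    """Build a sparse dictionary by an approximate-linear-dependence test.

    A sample x is admitted only if
        delta = k(x, x) - k_t^T K^{-1} k_t > nu,
    i.e. phi(x) cannot be approximated well enough by combining the
    feature-space images of previously admitted samples.
    Returns the dictionary and the inverse of its kernel matrix.
    """
    D, Kinv = [], None
    for x in X:
        if not D:
            D.append(x)
            Kinv = np.array([[1.0 / kernel(x, x)]])
            continue
        k = np.array([kernel(d, x) for d in D])
        a = Kinv @ k                   # best approximation coefficients
        delta = kernel(x, x) - k @ a   # squared residual in feature space
        if delta > nu:
            # admit x: rank-1 (Schur complement) update of the inverse
            m = len(D)
            new = np.empty((m + 1, m + 1))
            new[:m, :m] = Kinv + np.outer(a, a) / delta
            new[:m, m] = new[m, :m] = -a / delta
            new[m, m] = 1.0 / delta
            Kinv = new
            D.append(x)
    return D, Kinv
```

Feeding the same point twice yields `delta = 0`, so duplicates (and near-duplicates, up to `nu`) are never admitted; the dictionary size is therefore bounded by the effective dimension of the data in feature space rather than by the number of samples.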

### Citations

8980 | Statistical Learning Theory
- Vapnik
- 1998
Citation Context: ...cost functions in conjunction with an additional regularization term encouraging “flat” solutions by penalizing the squared norm of the weight vector [27]. For SV classification, it has been shown in [38] that the expected number of SVs is bounded below by (t−1)E(perr), where t is the number of training samples and E(perr) is the expectation of the error probability on a test sample. In spite of cla...

2028 | Learning with Kernels
- Scholkopf, Smola
- 2002
Citation Context: ...computed without making direct reference to feature vectors. This idea, commonly known as the “kernel trick”, has been used extensively in recent years, most notably in classification and regression [5, 15, 27]. Focusing on regression, several kernel based algorithms have been proposed, most prominently Support Vector Regression (SVR) [37] and Gaussian Process Regression (GPR) [42]. Kernel methods present a...

2026 | A Wavelet Tour of Signal Processing
- Mallat
- 1999

Citation Context: ...of the kernel machine. Second, sparsity is related to generalization ability, and is considered a desirable property in learning algorithms (see, e.g. [27, 15]) as well as in signal processing (e.g. [19]). The ability of a kernel machine to correctly generalize from its learned experience to new data can be shown to improve as the number of its free variables decreases (as long as the training error...

1273 | Spline models for observational data
- Wahba
- 1990

Citation Context: ...Solutions attained by these methods are non-parametric in nature and are typically of the form f̂(x) = Σ_{i=1}^t α_i k(x_i, x) (1.1), where {x_i}_{i=1}^t are the training data points. The Representer Theorem [40] assures us that in most practical cases, we need not look any further than an expression of the form (1.1). Since the number of tunable parameters in kernel solutions equals the size of the traini...
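The expansion (1.1) quoted in this excerpt is straightforward to evaluate directly. A minimal sketch follows; the RBF kernel and the ridge-regularized fit for the coefficients α are our own illustrative choices, not prescribed by the excerpt:

```python
import numpy as np

def rbf(a, b, sigma=1.0):
    """Gaussian (RBF) kernel between two sample vectors."""
    d = np.asarray(a, float) - np.asarray(b, float)
    return float(np.exp(-d @ d / (2.0 * sigma ** 2)))

def fit_alpha(X, y, lam=1e-6, kernel=rbf):
    """Kernel ridge fit of the expansion coefficients:
    alpha = (K + lam * I)^{-1} y, with K_ij = k(x_i, x_j)."""
    K = np.array([[kernel(a, b) for b in X] for a in X])
    return np.linalg.solve(K + lam * np.eye(len(X)), np.asarray(y, float))

def f_hat(x, X, alpha, kernel=rbf):
    """Evaluate f_hat(x) = sum_i alpha_i * k(x_i, x) -- the form (1.1)."""
    return sum(a * kernel(xi, x) for a, xi in zip(alpha, X))
```

With a small `lam`, `f_hat` nearly interpolates the training targets; the number of tunable parameters is exactly the number of training points, which is the scaling issue the sparsification procedure addresses.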

1231 |
Adaptive Filter Theory
- Haykin
- 1996
Citation Context: ...a solid line. Note that on the first 60 steps the prediction error is hardly noticeable. ... 1 Introduction: The celebrated recursive least-squares (RLS) algorithm (e.g. [16, 14, 25]) is a popular and practical algorithm used extensively in signal processing, communications and control. The algorithm is an efficient on-line method for finding linear predictors minimizing the mean...

1191 |
System Identification Theory for the User
- Ljung
- 1999
Citation Context: ...control. The algorithm is an efficient on-line method for finding linear predictors minimizing the mean squared error over the training data. We consider the classic system identification setup (e.g. [17]), where we assume access to a recorded sequence of input and output samples Z_t = {(x_1, y_1), ..., (x_t, y_t)} arising from some unknown source. In the classic regression (or function approximation)...

1048 | Nonlinear component analysis as a kernel eigenvalue problem
- Schölkopf, Smola, et al.
- 1998
Citation Context: ...principal component analysis (PCA) is known to deliver the optimal unsupervised dimensionality reduction for the mean-squared reconstruction error criterion; it is therefore natural to turn to kernel PCA [28] as a sparsification device. Indeed, many of the unsupervised methods mentioned above are closely related to kernel PCA. In [35] a sparse variant of kernel PCA is proposed, based on a Gaussian generat...

943 | An Introduction to Support Vector Machines
- Cristianini, Shawe-Taylor
- 2000

Citation Context: ...computed without making direct reference to feature vectors. This idea, commonly known as the “kernel trick”, has been used extensively in recent years, most notably in classification and regression [5, 15, 27]. Focusing on regression, several kernel based algorithms have been proposed, most prominently Support Vector Regression (SVR) [37] and Gaussian Process Regression (GPR) [42]. Kernel methods present a...

552 | Sparse Bayesian learning and the relevance vector machine. The
- Tipping
- 2001
Citation Context: ...mean-squared error in regression tasks), while unsupervised sparsification attempts to faithfully reproduce the images of input samples in feature space. Examples of supervised sparsification are [2, 34, 30, 39], of which [34] is unique in that it aims at achieving sparsity by taking a Bayesian approach in which a prior favoring sparse solutions is employed. In [30] a greedy sparsification method is sugge...

425 | Multivariate adaptive regression splines
- Friedman
- 1988
Citation Context: ...of noise. Moreover, in terms of generalization, the KRLS solution is at least as good as the SVR solution. We tested our algorithm on three additional synthetic data-sets, Friedman 1, 2 and 3, due to [12]. Both training and test sets were 1000 samples long, and the introduced noise was zero-mean Gaussian with a standard deviation of 0.1. For these data-sets a simple preprocessing step was performed which c...

316 | Sparse approximate solutions to linear systems
- Natarajan
- 1995

Citation Context: ...line (e.g. [39]), in which case the algorithm is free to choose any one of the training samples at each step of the construction process. Due to the intractability of finding the best subset of samples [22], these algorithms usually resort to employing various greedy selection strategies, in which at each step the sample selected is the one that maximizes the amount of increase (or decrease) its additio...

310 | Neural Network Learning: Theoretical Foundations
- Anthony, Bartlett
- 1999

Citation Context: ...‖φ(z_i) − φ(z*)‖₂² = k(z_i, z_i) + k(z*, z*) − k(z_i, z*) − k(z*, z_i). Since k itself is continuous we have that ‖φ(z_i) − φ(z*)‖₂² → 0, so φ is continuous. We now recall several ideas from functional analysis; see [1] for precise definitions and applications in learning. An ℓ₂-norm based cover of φ(X) at a scale of ε (an ε-cover) is a collection of ℓ₂-balls of radius ε, whose union contains φ(X). The covering nu...

283 | Using the Nyström method to speed up kernel machines
- Williams, Seeger
- 2001
Citation Context: ...also be cast within a probabilistic Bayesian framework, see [32]. ...specific to Gaussian Process regression and is similar to kernel Matching Pursuit [39]. Examples of unsupervised sparsification are [31, 43, 11, 35]. In [31] a randomized-greedy selection strategy is used to reduce the rank of the kernel matrix K while [43] uses a purely random strategy based on the Nyström method to achieve the same goal. In [11...

267 | Regularization networks and support vector machines
- Evgeniou, Pontil, et al.
- 2000
Citation Context: ...this form of regularization is that SVR solutions are typically sparse – meaning that many of the α_i variables vanish in the SVR solution (1.1). In SVR, and more generally in regularization networks [10], sparsity is achieved by elimination. This means that, at the outset, these algorithms consider all training samples as potential contributing members of the expansion (1.1); and upon solving the opt...

252 | SVMTorch: Support Vector Machines for Large-Scale Regression Problems
- Collobert, Bengio
- 2001
Citation Context: ...e, the training algorithm will not be able to take full advantage of this sparsity in terms of efficiency. As a consequence, even the current state-of-the-art SVM algorithm scales super-linearly in t [4]. Second, in SVMs the solution's sparsity depends on the level of noise in the training data; this effect is especially pronounced in the case of regression. Finally, SVM solutions are known to...

225 | Oscillation and chaos in physiological control systems
- Mackey, Glass

Citation Context: ...with the Mackey–Glass chaotic time series. This time series may be generated by numerical integration of a time-delay differential equation that was proposed as a model of white blood cell production [18]: dy/dt = a·y(t − τ) / (1 + y(t − τ)^10) − b·y(t) (5.20), where a = 0.2, b = 0.1. For τ > 16.8 the dynamics become chaotic; we therefore conducted our tests using two values of τ, corresponding to weakly ch...
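The delay differential equation (5.20) quoted above can be integrated numerically to reproduce this benchmark series. A minimal sketch using crude forward-Euler integration; the step size, the constant initial history, and the series length are our own illustrative choices, not taken from the paper:

```python
import numpy as np

def mackey_glass(n, tau=17.0, a=0.2, b=0.1, dt=1.0, y0=1.2):
    """Generate n samples of the Mackey-Glass series by Euler integration of
        dy/dt = a*y(t - tau) / (1 + y(t - tau)**10) - b*y(t).
    The history on [-tau, 0] is taken to be the constant y0."""
    hist = int(round(tau / dt))       # delay expressed in integration steps
    y = np.full(n + hist, y0)         # prepend the constant history buffer
    for t in range(hist, n + hist - 1):
        y_tau = y[t - hist]           # delayed value y(t - tau)
        y[t + 1] = y[t] + dt * (a * y_tau / (1.0 + y_tau ** 10) - b * y[t])
    return y[hist:]                   # drop the history, keep n samples
```

With τ = 17 the trajectory is weakly chaotic, matching the regime mentioned in the excerpt (chaos sets in for τ > 16.8); a finer `dt` gives a more faithful integration at higher cost.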

195 | Prediction with Gaussian processes: From linear regression to linear prediction and beyond
- Williams
- 1999
Citation Context: ...ion and regression [5, 15, 27]. Focusing on regression, several kernel based algorithms have been proposed, most prominently Support Vector Regression (SVR) [37] and Gaussian Process Regression (GPR) [42]. Kernel methods present an alternative to the parametric approach. Solutions attained by these methods are non-parametric in nature and are typically of the form f̂(x) = Σ_{i=1}^t α_i k(x_i, x) (1.1), w...

192 | Support vector method for function approximation, regression estimation, and signal processing
- Vapnik, Golowich, et al.
- 1997
Citation Context: ...n recent years, most notably in classification and regression [5, 15, 27]. Focusing on regression, several kernel based algorithms have been proposed, most prominently Support Vector Regression (SVR) [37] and Gaussian Process Regression (GPR) [42]. Kernel methods present an alternative to the parametric approach. Solutions attained by these methods are non-parametric in nature and are typically of t...

189 | Efficient svm training using low-rank kernel representations
- Fine, Scheinberg
Citation Context: ...also be cast within a probabilistic Bayesian framework, see [32]. ...specific to Gaussian Process regression and is similar to kernel Matching Pursuit [39]. Examples of unsupervised sparsification are [31, 43, 11, 35]. In [31] a randomized-greedy selection strategy is used to reduce the rank of the kernel matrix K while [43] uses a purely random strategy based on the Nyström method to achieve the same goal. In [11...

177 | Sparse greedy matrix approximation for machine learning
- Smola, Schölkopf
- 2000
Citation Context: ...underlying principle, another class of “sparse-greedy” methods aims at greedily constructing a non-redundant set of feature vectors, starting with an initially empty set rather than a full solution [31, 30, 44] (see also Chapter 10 of [27]). These methods are also closely related to the kernel Matching Pursuit algorithm [39]. It should be noted that the reason why greedy strategies are resorted to is due to...

145 | Simplified support vector decision rules
- Burges
- 1996

144 | Improving the accuracy and speed of support vector machines
- Burges, Schölkopf
- 1997
Citation Context: ...the solutions provided by SVMs, both for classification and regression, may often be made significantly sparser, without altering the solutions’ weight vectors. It was also shown in [2] and later in [3] that additional sparsity may be attained by allowing small changes to be made in the SVM solution, with little or no degradation in generalization ability. Burges’ idea is based on using a “reduced-s...

120 | Sparse on-line Gaussian processes
- Csató, Opper
- 2002
Citation Context: ...not as well studied in the kernel-methods community, and is the one we address with our sparsification algorithm. Our method is most closely related to a sparsification method used by Csató and Opper [6, 7] in the context of learning with Gaussian Processes [13, 42]. Csató and Opper’s method also incrementally constructs a dictionary of input samples on which all other data are projected (with the proje...

113 | Input space vs. feature space in kernel-based methods
- Schölkopf, Mika, et al.
- 1999

Citation Context: ...apart from its size, is virtually unconstrained, and therefore the algorithmic complexity of finding reduced-set solutions is rather high, posing a major obstacle to the widespread use of this method. In [26] and [8] it was suggested that restricting the reduced set to be a subset of the training samples would help alleviate the computational cost associated with the original reduced-set method. This was...

108 | Sparse greedy Gaussian process regression
- Smola, Bartlett
- 2000
(Show Context)
Citation Context ... underlying principle, another class, of “sparse-greedy” methods, aim at greedily constructing a non-redundant set of feature vectors starting with an initially empty set, rather than a full solut=-=ion [31, 30, 44]-=- (see also Chapter 10 of [27]). These methods are also closely related to the kernel Matching Pursuit algorithm [39]. It should be noted that the reason why greedy strategies are resorted to is due to... |

72 | Efficient Implementation of Gaussian Processes,” Draft manuscript, available from http://wol.ra.phy. cam.ac.uk/mackay/homepage.html
- Gibbs, MacKay
- 1997
Citation Context: ...s the one we address with our sparsification algorithm. Our method is most closely related to a sparsification method used by Csató and Opper [6, 7] in the context of learning with Gaussian Processes [13, 42]. Csató and Opper’s method also incrementally constructs a dictionary of input samples on which all other data are projected (with the projection performed in the feature space H). However, while in o...

72 | A state-space approach to adaptive RLS filtering
- Sayed, Kailath
- 1994

64 | Incremental Learning with Support Vector Machines
- Syed, Huan, et al.
- 1999
Citation Context: ...number of SVs is bounded below by (t−1)E(perr), where t is the number of training samples and E(perr) is the expectation of the error probability on a test sample. In spite of claims to the contrary [33], it has been shown, both theoretically and empirically [2, 8], that the solutions provided by SVMs are not always maximally sparse. It also stands to reason that once a sufficiently large training se...

61 | Learning kernel classifiers
- Herbrich
- 2002

Citation Context: ...computed without making direct reference to feature vectors. This idea, commonly known as the “kernel trick”, has been used extensively in recent years, most notably in classification and regression [5, 15, 27]. Focusing on regression, several kernel based algorithms have been proposed, most prominently Support Vector Regression (SVR) [37] and Gaussian Process Regression (GPR) [42]. Kernel methods present a...

61 | Kernel matching pursuit
- Vincent, Bengio
- 2002
Citation Context: ...he algorithm starts with an empty representation, in which all coefficients vanish, and gradually adds samples according to some criterion. Constructive sparsification is normally used off-line (e.g. [39]), in which case the algorithm is free to choose any one of the training samples at each step of the construction process. Due to the intractability of finding the best subset of samples [22], these a...

59 | Exact simplification of support vector solutions
- Downs, Gates, et al.
- 2001
Citation Context: ...the number of training samples and E(perr) is the expectation of the error probability on a test sample. In spite of claims to the contrary [33], it has been shown, both theoretically and empirically [2, 8], that the solutions provided by SVMs are not always maximally sparse. It also stands to reason that once a sufficiently large training set has been learned, any additional training samples would not...

56 | Reducing the run-time complexity in support vector machines
- Osuna, Girosi
- 1999
Citation Context: ...Finally, SVM solutions are known to be non-maximally sparse. This is due to the special form of the SVM quadratic optimization problem, in which the constraints limit the level of sparsity attainable [23]. Shortly after the introduction of SVMs to the machine learning community it was realized [2] that the solutions provided by SVMs, both for classification and regression, may often be made significan...

48 | Sparse representation for gaussian process models
- Csató, Opper
Citation Context: ...not as well studied in the kernel-methods community, and is the one we address with our sparsification algorithm. Our method is most closely related to a sparsification method used by Csató and Opper [6, 7] in the context of learning with Gaussian Processes [13, 42]. Csató and Opper’s method also incrementally constructs a dictionary of input samples on which all other data are projected (with the proje...

44 | Bayesian methods for support vector machines: Evidence and predictive class probabilities
- Sollich
- 2002
Citation Context: ...solutions is employed. In [30] a greedy sparsification method is suggested that is specific to Gaussian Process regression and is similar to kernel Matching Pursuit [39]. (Footnote 2: support vector machines may also be cast within a probabilistic Bayesian framework; see [32].) Examples of unsupervised sparsification are [31, 43, 11, 35]. In [31] a randomized-greedy selection strateg...

33 | Sequential greedy approximation for certain convex optimization problems
- Zhang

Citation Context: ...underlying principle, another class of “sparse-greedy” methods aims at greedily constructing a non-redundant set of feature vectors, starting with an initially empty set rather than a full solution [31, 30, 44] (see also Chapter 10 of [27]). These methods are also closely related to the kernel Matching Pursuit algorithm [39]. It should be noted that the reason why greedy strategies are resorted to is due to...

32 | Sparse online greedy support vector regression
- Engel, Mannor, et al.
- 2002
Citation Context: ...tions the algorithm is presented at each time step with a single training sample and a simple dichotomic decision has to be made: either add the next sample into the representation, or discard it. In [9] we proposed a solution to this problem by an on-line constructive sparsification method based on sequentially admitting into the kernel representation only samples that cannot be approximately repres...

30 | Support vector machine techniques for nonlinear equalization
- Sebald, Bucklew
- 2000

Citation Context: ...is claim we test KRLS on two well known and difficult time series prediction problems. Finally, we apply KRLS to a non-linear channel equalization problem, on which SVMs were reported to perform well [29]. All tests were run on a 256Mb, 667MHz Pentium 3 Linux workstation. 5.1 Non-Linear Regression: We report the results of experiments comparing the KRLS algorithm (coded in C) to the state-of-the-art SV...

29 | Sparse kernel principal component analysis
- Tipping
- 2001
Citation Context: ...also be cast within a probabilistic Bayesian framework, see [32]. ...specific to Gaussian Process regression and is similar to kernel Matching Pursuit [39]. Examples of unsupervised sparsification are [31, 43, 11, 35]. In [31] a randomized-greedy selection strategy is used to reduce the rank of the kernel matrix K while [43] uses a purely random strategy based on the Nyström method to achieve the same goal. In [11...

27 | Using support vector machines for time series prediction
- Müller, Smola, et al.
- 1999
Citation Context: ...umerous learning architectures and algorithms have been thrown at this problem with mixed results (see e.g. [41]). One of the more successful general purpose algorithms tested on TSP is again the SVM [21]; however SVMs are inherently limited by their off-line (batch) mode of training, and their poor scaling properties. We argue that KRLS is a more appropriate tool in this domain and to support this cl...

4 | Generalization bounds for Bayesian mixture algorithms
- Meir, Zhang
- 2003

Citation Context: ...ction. While this assumption is often acceptable for classification, this is clearly not the case for regression. Recently a generalization error bound for unbounded loss functions was established in [20], where the boundedness assumption is replaced by a moment condition. Theorem 4.1 below is a slightly revised version of Theorem 8 in [20]. We quote the general theorem, and then apply it to the speci...

1 | Time Series Prediction
- Weigend, Gershenfeld (eds.)
- 1994

Citation Context: ...m to a regression problem with the caveat that samples can no longer be assumed to be IID. Numerous learning architectures and algorithms have been thrown at this problem with mixed results (see e.g. [41]). One of the more successful general purpose algorithms tested on TSP is again the SVM [21]; however SVMs are inherently limited by their off-line (batch) mode of training, and their poor scaling pro...