## Online Learning with Kernels (2003)


### Download Links

- [mlg.anu.edu.au]
- [omega.albany.edu:8008]
- [axiom.anu.edu.au]
- [users.cecs.anu.edu.au]
- [www-2.cs.cmu.edu]
- [books.nips.cc]
- DBLP

Citations: 2044 (128 self)

### BibTeX

```bibtex
@MISC{Kivinen03onlinelearning,
  author = {Jyrki Kivinen and Alexander J. Smola and Robert C. Williamson},
  title  = {Online Learning with Kernels},
  year   = {2003}
}
```



### Abstract

Kernel-based algorithms such as support vector machines have achieved considerable success in various problems in the batch setting, where all of the training data is available in advance. Support vector machines combine the so-called kernel trick with the large margin idea. There has been little use of these methods in an online setting suitable for real-time applications. In this paper we consider online learning in a Reproducing Kernel Hilbert Space. By applying classical stochastic gradient descent within a feature space, together with some straightforward tricks, we develop simple and computationally efficient algorithms for a wide range of problems such as classification, regression, and novelty detection. In addition to allowing the exploitation of the kernel trick in an online setting, we examine the value of large margins for classification in the online setting with a drifting target. We derive worst-case loss bounds, and moreover we show the convergence of the hypothesis to the minimiser of the regularised risk functional. We present experimental results that support the theory and illustrate the power of the new algorithms for online novelty detection.
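The algorithmic idea the abstract describes — stochastic gradient descent on a regularized risk, with the hypothesis kept as a kernel expansion — can be sketched as follows. This is a minimal illustration under assumed choices (Gaussian kernel, hinge loss, fixed learning rate `eta` and regularization `lam`); the class and method names are invented for the sketch and this is not the paper's exact algorithm.

```python
import math

def gaussian_kernel(x, x2, gamma=1.0):
    # RBF kernel k(x, x') = exp(-gamma * ||x - x'||^2); an assumed choice
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, x2)))

class OnlineKernelSGD:
    """Online stochastic gradient descent on a regularized hinge loss,
    keeping the hypothesis as a kernel expansion f(x) = sum_i alpha_i k(x_i, x)."""

    def __init__(self, kernel=gaussian_kernel, eta=0.1, lam=0.01):
        self.kernel, self.eta, self.lam = kernel, eta, lam
        self.support, self.alpha = [], []

    def predict(self, x):
        # Evaluate the current kernel expansion at x (0 before any updates).
        return sum(a * self.kernel(s, x) for s, a in zip(self.support, self.alpha))

    def step(self, x, y):
        # Predict first, then update: the regularizer shrinks every stored
        # coefficient toward zero by the factor (1 - eta*lam)...
        f_x = self.predict(x)
        self.alpha = [(1 - self.eta * self.lam) * a for a in self.alpha]
        # ...and a nonzero loss gradient adds one new expansion term.
        if y * f_x < 1:  # hinge loss l(f, y) = max(0, 1 - y*f) is active
            self.support.append(x)
            self.alpha.append(self.eta * y)
        return f_x
```

Note that the expansion grows by at most one term per observation, matching the linear growth the paper discusses; the shrinkage factor on old coefficients is what its truncation tricks exploit.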

### Citations

9021 | The Nature of Statistical Learning Theory - Vapnik - 1995 |

1304 | A training algorithm for optimal margin classifiers - Boser, Guyon, et al. - 1992 |

1225 | Additive logistic regression: a statistical view of boosting - Friedman, Hastie, et al. |
Citation context: ...learning with the ε-insensitive loss, this goal has proven elusive for other estimators in the standard batch setting. In the online situation, however, such an extension is quite natural (see also [4]). All we need to do is make ε a variable of the optimization problem... Consider now the classification probl...

786 | Theory of reproducing kernels - Aronszajn - 1950 |

641 | Networks for approximation and learning - Poggio, Girosi - 1990 |

510 | Estimating the support of a high-dimensional distribution - Schölkopf, Platt, et al. - 2001 |
Citation context: ...regularized risk R_reg[f] := R_emp[f] + λΩ[f] = (1/m) Σ_i l(f(x_i), y_i) + λΩ[f] for λ > 0 (2). Common loss functions are the soft margin loss function [1] or the logistic loss for classification and novelty detection [14], the quadratic loss, absolute loss, Huber's robust loss [9], or the ε-insensitive loss [16] for regression. We discuss these in Section 3. In some cases the loss function depends on an additional par...

470 | Real Analysis and Probability - Dudley - 1989 |

438 | Graphical models, exponential families, and variational inference - Wainwright, Jordan - 2003 |

369 | Convolution kernels on discrete structures - Haussler - 1999 |

327 | New support vector algorithms - Schölkopf, Smola, et al. |
Citation context: ...ons increases. More specifically, the Representer Theorem [10] implies that the number of kernel functions can grow up to linearly with the number of observations. Depending on the loss function used [15], this will happen in practice in most cases. Thereby the complexity of the estimator used in prediction increases linearly over time (in some restricted situations this can be reduced to logarithmica...

280 | Some results on Tchebycheffian spline functions - Kimeldorf, Wahba - 1971 |
Citation context: ...the Gaussian Process view is taken). Secondly, the functional representation of the estimator becomes more complex as the number of observations increases. More specifically, the Representer Theorem [10] implies that the number of kernel functions can grow up to linearly with the number of observations. Depending on the loss function used [15], this will happen in practice in most cases. Thereby the ...

269 | Regularization networks and support vector machines - Evgeniou, Pontil, et al. - 2000 |

257 | Functions of positive and negative type and their connection with the theory of integral equations - Mercer - 1909 |

210 | Robust linear programming discrimination of two linearly inseparable sets - Bennett, Mangasarian - 1992 |
Citation context: ...an additional regularization term Ω[f]. This sum is known as the regularized risk R_reg[f] := R_emp[f] + λΩ[f] = (1/m) Σ_i l(f(x_i), y_i) + λΩ[f] for λ > 0 (2). Common loss functions are the soft margin loss function [1] or the logistic loss for classification and novelty detection [14], the quadratic loss, absolute loss, Huber's robust loss [9], or the ε-insensitive loss [16] for regression. We discuss these in Sect...

205 | An equivalence between sparse approximation and support vector machines. Neural computation - Girosi - 1998 |

195 | Prediction with Gaussian processes: From linear regression to linear prediction and beyond - Williams - 1999 |

193 | Support vector method for function approximation, regression estimation, and signal processing - Vapnik, Golowich, et al. - 1997 |
Citation context: ...nctions are the soft margin loss function [1] or the logistic loss for classification and novelty detection [14], the quadratic loss, absolute loss, Huber's robust loss [9], or the ε-insensitive loss [16] for regression. We discuss these in Section 3. In some cases the loss function depends on an additional parameter such as the width of the margin or the size of the ε-insensitive zone. One may make...

171 | Incremental and decremental support vector machine learning - Cauwenberghs, Poggio - 2001 |
Citation context: ...duced to logarithmical cost [8]). Finally, training time of batch and/or incremental update algorithms typically increases superlinearly with the number of observations. Incremental update algorithms [2] attempt to overcome this problem but cannot guarantee a bound on the number of operations required per iteration. Projection methods [3], on the other hand, will ensure a limited number of updates per...

163 | On the influence of the kernel on the consistency of support vector machines - Steinwart |

149 | Spline models for observational data, volume 59 - Wahba - 1990 |

148 | The connection between regularization operators and support vector kernels - Smola, Schölkopf, et al. - 1998 |

142 | Functional gradient techniques for combining hypotheses - Mason, Baxter, et al. - 1999 |
Citation context: ...e would like to minimize R_reg[f]. This can be costly if the number of observations is large. Recently several gradient descent algorithms for minimizing such functionals efficiently have been proposed [13, 7]. Below we extend these methods to stochastic gradient descent by approximating R_reg[f] by R_stoch[f] := l(f(x), y) + λΩ[f] (3) and then performing gradient descent with respect to R_stoch[f]. Here ... is either randomly...
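The stochastic approximation quoted in this context yields a simple update rule. Assuming the common choice Ω[f] = ½‖f‖²_H (the ½ factor is a notational assumption here) and using the reproducing property, so that the gradient of f ↦ l(f(x), y) at f is l'(f(x), y) k(x, ·), one gradient step on the instantaneous objective reads:

```latex
% One stochastic gradient step on R_stoch[f] = l(f(x_t), y_t) + (lambda/2) ||f||_H^2
f_{t+1} = f_t - \eta\, \partial_f \Big[\, l(f(x_t), y_t) + \tfrac{\lambda}{2}\|f\|_{\mathcal{H}}^2 \,\Big]_{f = f_t}
        = (1 - \eta\lambda)\, f_t \;-\; \eta\, l'(f_t(x_t), y_t)\, k(x_t, \cdot)
```

The (1 − ηλ) factor shrinks all previously stored expansion coefficients at every step, which is what keeps the hypothesis from drifting away from the regularised-risk minimiser.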

135 | A correspondence between Bayesian estimation on stochastic processes and smoothing by splines - Kimeldorf, Wahba - 1970 |

134 | Harmonic Analysis on Semigroups - Berg, Christensen, et al. - 1984 |

131 | Correcting sample selection bias by unlabeled data - Huang, Smola, et al. - 2007 |

113 | Input space vs. feature space in kernel-based methods - Schölkopf, Mika, et al. - 1999 |

112 | Introduction to Gaussian processes - MacKay - 1998 |

87 | A new approximate maximal margin classification algorithm - Gentile |
Citation context: ...sive since they require one matrix multiplication at each step. The size of the matrix is given by the number of kernel functions required at each step. Recently several algorithms have been proposed [5, 8, 6, 12] performing perceptron-like updates for classification at each step. Some algorithms work only in the noise-free case, others not for moving targets, and yet again others assume an upper bound on the ...

75 | Theory of Reproducing Kernels and its Applications - Saitoh - 1988 |

73 | The relaxed online maximum margin algorithm - Li, Long - 1999 |
Citation context: ...sive since they require one matrix multiplication at each step. The size of the matrix is given by the number of kernel functions required at each step. Recently several algorithms have been proposed [5, 8, 6, 12] performing perceptron-like updates for classification at each step. Some algorithms work only in the noise-free case, others not for moving targets, and yet again others assume an upper bound on the ...

72 | Generalization performance of regularization networks and support vector machines via entropy numbers of compact operators - Williamson, Smola, et al. - 2001 |

71 | Asymptotic analysis of penalized likelihood and related estimators - Cox, O'Sullivan - 1990 |

67 | A kernel method for the two-sample-problem - Gretton, Borgwardt, et al. - 2007 |

60 | Geometry and invariance in kernel based methods - Burges - 1999 |

55 | A Hilbert space embedding for distributions - Smola, Gretton, et al. - 2007 |

52 | Performance guarantees for regularized maximum entropy density estimation - Dudík, Phillips, et al. - 2004 |

49 | Sparse representation for Gaussian process models - Csató, Opper - 2001 |
Citation context: ...with the number of observations. Incremental update algorithms [2] attempt to overcome this problem but cannot guarantee a bound on the number of operations required per iteration. Projection methods [3], on the other hand, will ensure a limited number of updates per iteration. However they can be computationally expensive since they require one matrix multiplication at each step. The size of the matr...
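The growth and projection concerns in this context have a complementary practical side: when the online update multiplies every stored coefficient by a shrinkage factor (1 − ηλ) at each step, a coefficient added τ steps ago has decayed geometrically to at most a factor (1 − ηλ)^τ of its initial size, so stale terms can be dropped to cap the expansion size. A minimal sketch; the function name and the tolerance are illustrative, not from the paper:

```python
def truncate_expansion(support, alpha, tol=1e-6):
    """Drop kernel-expansion terms whose coefficients have decayed below tol.

    Under multiplicative shrinkage alpha_i <- (1 - eta*lam) * alpha_i per step,
    old coefficients decay geometrically, so truncating them loses little
    accuracy while bounding the number of stored kernel functions."""
    kept = [(s, a) for s, a in zip(support, alpha) if abs(a) >= tol]
    return [s for s, _ in kept], [a for _, a in kept]
```

Unlike the projection methods discussed above, this kind of truncation costs only a linear pass over the coefficients rather than a matrix multiplication per step.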

48 | Squashing flat files flatter - DuMouchel, Volinsky, et al. - 1999 |

37 | Unifying divergence minimization and statistical inference via convex duality - Altun, Smola - 2006 |

35 | Robust statistics: a review - Huber - 1972 |
Citation context: ...for λ > 0 (2). Common loss functions are the soft margin loss function [1] or the logistic loss for classification and novelty detection [14], the quadratic loss, absolute loss, Huber's robust loss [9], or the ε-insensitive loss [16] for regression. We discuss these in Section 3. In some cases the loss function depends on an additional parameter such as the width of the margin or the size of the ...

26 | Support Vector Learning. R. Oldenbourg Verlag - Schölkopf - 1997 |

22 | From margin to sparsity - Graepel, Herbrich, et al. - 2000 |
Citation context: ...sive since they require one matrix multiplication at each step. The size of the matrix is given by the number of kernel functions required at each step. Recently several algorithms have been proposed [5, 8, 6, 12] performing perceptron-like updates for classification at each step. Some algorithms work only in the noise-free case, others not for moving targets, and yet again others assume an upper bound on the ...

20 | The motion coherence theory - Yuille, Grzywacz - 1988 |

13 | Sparse multiscale gaussian process regression - Walder, Kim, et al. - 2008 |

10 | A Maximum Margin Miscellany - Williamson, Schölkopf, et al. - 1998 |

6 | A principle for system identification in the behavioural framework - Williamson, Smola, et al. - 1993 |

4 | A maximum entropy kernel density estimator with applications to function interpolation and texture segmentation - Balakrishnan, Schonfeld - 2006 |

4 | Sample complexity of least squares identification of FIR models - Weyer, Williamson, et al. - 1999 |

1 | Norm-based regularization of boosting (manuscript, in preparation) - Guo, Bartlett, et al. - 2000 |
Citation context: ...e would like to minimize R_reg[f]. This can be costly if the number of observations is large. Recently several gradient descent algorithms for minimizing such functionals efficiently have been proposed [13, 7]. Below we extend these methods to stochastic gradient descent by approximating R_reg[f] by R_stoch[f] := l(f(x), y) + λΩ[f] (3) and then performing gradient descent with respect to R_stoch[f]. Here ... is either randomly...

1 | Learning additive models with fast evaluating kernels - Herbster - 2001 |
Citation context: ...will happen in practice in most cases. Thereby the complexity of the estimator used in prediction increases linearly over time (in some restricted situations this can be reduced to logarithmical cost [8]). Finally, training time of batch and/or incremental update algorithms typically increases superlinearly with the number of observations. Incremental update algorithms [2] attempt to overcome this pr...