## Bayesian support vector regression using a unified loss function (2004)

Venue: IEEE Transactions on Neural Networks

Citations: 23 (2 self)

### BibTeX

```bibtex
@ARTICLE{Chu04bayesiansupport,
  author  = {Wei Chu and S. Sathiya Keerthi and Chong Jin Ong},
  title   = {Bayesian support vector regression using a unified loss function},
  journal = {IEEE Transactions on Neural Networks},
  year    = {2004},
  volume  = {15},
  pages   = {29--44}
}
```

### Abstract

In this paper, we use a unified loss function, called the soft insensitive loss function, for Bayesian support vector regression. We follow standard Gaussian processes for regression to set up the Bayesian framework, in which the unified loss function is used in the likelihood evaluation. Under this framework, the maximum a posteriori estimate of the function values corresponds to the solution of an extended support vector regression problem. The overall approach retains the merits of support vector regression, such as convex quadratic programming and sparsity in the solution representation, while also gaining the advantages of Bayesian methods, namely model adaptation and error bars for its predictions. Experimental results on simulated and real-world data sets indicate that the approach works well even on large data sets.
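The soft insensitive loss function (SILF) named in the abstract can be sketched as a piecewise function that is flat inside an insensitive zone, quadratic in a narrow transition band, and linear beyond it. The sketch below assumes the standard piecewise definition with insensitivity parameter `eps` and smoothness parameter `beta` in (0, 1]; the parameter names and default values are illustrative, not the paper's settings for any particular experiment:

```python
def silf(delta, eps=0.1, beta=0.3):
    """Soft insensitive loss (sketch): flat inside an insensitive zone of
    half-width (1 - beta)*eps, quadratic in a transition band of width
    2*beta*eps, and linear (like the epsilon-insensitive loss) beyond."""
    a = abs(delta)
    if a <= (1 - beta) * eps:
        return 0.0                                             # insensitive zone
    if a < (1 + beta) * eps:
        return (a - (1 - beta) * eps) ** 2 / (4 * beta * eps)  # smooth quadratic transition
    return a - eps                                             # linear tail
```

The quadratic band makes the loss continuously differentiable everywhere, which is what distinguishes it from the plain ɛ-insensitive loss and keeps the MAP problem smooth.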

### Citations

10025 | An overview of statistical learning theory
- Vapnik
- 1999
Citation Context: ...compare our method with GPR and SVR for generalization capability and computational cost on some benchmark data. A. Sinc Data The function sinc(x) = |x|⁻¹ sin |x| is commonly used to illustrate SVR [10]. Training and testing data sets were obtained by uniformly sampling data points from the interval [−10, 10]. Eight training data sets with sizes ranging from 50 to 4000 and a single common testing da...
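A minimal sketch of generating such a sinc benchmark set, assuming uniform sampling on [−10, 10] with the convention sinc(0) = 1; the noise level and seed are illustrative, not the paper's exact experimental protocol:

```python
import numpy as np

def make_sinc_data(n, noise=0.1, seed=0):
    """Sample n inputs uniformly from [-10, 10] and evaluate
    sinc(x) = sin|x| / |x| (with sinc(0) = 1), plus Gaussian noise."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-10.0, 10.0, size=n)
    y = np.where(x == 0.0, 1.0, np.sin(np.abs(x)) / np.abs(x))
    return x, y + noise * rng.standard_normal(n)

x_train, y_train = make_sinc_data(500)       # one training set
x_test, y_test = make_sinc_data(3000, seed=1)  # a common test set
```

Note that sin|x|/|x| equals sin(x)/x, since both numerator and denominator flip sign together.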

5396 | Neural Networks for Pattern Recognition
- Bishop
- 1995
Citation Context: ...etermination, Model Selection I. INTRODUCTION The application of Bayesian techniques to neural networks was pioneered by MacKay [1], Neal [2], and Buntine and Weigend [3]. These works are reviewed in [4], [5] and [6]. Unlike standard neural network design, the Bayesian approach considers probability distributions in the weight space of the network. Together with the observed data, prior distributions...

2136 | Robust Statistics
- Huber
- 1981
Citation Context: ...distributions then the solution can be dominated by a very small number of outliers, which is an undesirable result. Techniques that attempt to solve this problem are referred to as robust statistics [22]. Non-quadratic loss functions have been introduced to reduce the sensitivity to the outliers. The three non-quadratic loss functions commonly used in regression problems are: 1) the Laplacian loss fu...

1239 | Practical methods of optimization
- Fletcher
- 1987
Citation Context: ...(ξi*)) + (1/2) f^T Σ^{-1} f (18), subject to yi − f(xi) ≤ (1 − β)ɛ + ξi, f(xi) − yi ≤ (1 − β)ɛ + ξi*, ξi ≥ 0, ξi* ≥ 0 ∀i, with the piecewise term π²/(4βɛ) if π ∈ [0, 2βɛ) and π − βɛ if π ∈ [2βɛ, +∞) (19)-(20). Standard Lagrangian techniques [25] are used to derive the dual problem. Let αi ≥ 0, αi* ≥ 0, γi ≥ 0 and γi* ≥ 0 ∀i be the corresponding Lagrange multipliers for the inequ...

546 | A tutorial on support vector regression
- Smola, Schölkopf
Citation Context: ...(29) is a convex quadratic programming problem. Matrix-based quadratic programming techniques that use the “chunking” idea can be used for its solutions [26]. Popular SMO algorithms for classical SVR [27] [28] can also be adapted for its solution. For more details about the adaptation, refer to [29]. The optimal value of the primal variables f can be obtained from the solution of (29) as f MP = Σ · (α...

429 | A practical Bayesian framework for back propagation networks
- MacKay
- 1992
Citation Context: ..., Gaussian Processes, Non-quadratic loss function, Automatic Relevance Determination, Model Selection I. INTRODUCTION The application of Bayesian techniques to neural networks was pioneered by MacKay [1], Neal [2], and Buntine and Weigend [3]. These works are reviewed in [4], [5] and [6]. Unlike standard neural network design, the Bayesian approach considers probability distributions in the weight sp...

325 | A Limited Memory Algorithm for Bound Constrained Optimization
- Byrd, Lu, et al.
- 1995
Citation Context: ...ven in (15), the variance of tx is therefore σ_t² + σ_n². VII. NUMERICAL EXPERIMENTS In the implementation of our Bayesian approach to support vector regression (BSVR), we used the routine L-BFGS-B [39] as the gradient-based optimization package, and started from the initial values of the hyperparameters to infer the optimal ones. We also implemented standard GPR [9] and classical SVR [10] for co...
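The gradient-based hyperparameter search described here can be sketched with SciPy's interface to the L-BFGS-B routine cited as [39]. The objective below is a deliberately simple placeholder; in BSVR it would be the negative log-evidence −log P(D|θ) over kernel and noise hyperparameters, and the bounds and starting point are illustrative:

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_evidence(log_theta):
    """Placeholder objective over log-hyperparameters; a stand-in for
    -log P(D|theta) in the Bayesian SVR evidence framework."""
    return float(np.sum((log_theta - 1.0) ** 2))

# Box-constrained quasi-Newton search on the log scale, as L-BFGS-B supports.
res = minimize(neg_log_evidence, x0=np.zeros(3), method="L-BFGS-B",
               bounds=[(-5.0, 5.0)] * 3)
print(res.x)  # optimized hyperparameters (here: all near 1.0)
```

Working on the log scale keeps positivity constraints on the hyperparameters implicit, which is a common convention in evidence maximization.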

300 | Interpolation of scattered data: Distance matrices and conditionally positive definite functions, Constructive Approximation 2
- Micchelli
- 1986
Citation Context: ...mization problem: min_f S(f) = C ∑_{i=1}^{n} ℓ(yi − f(xi)) + (1/2) f^T Σ^{-1} f (8). If the covariance is defined using (3), Σ is symmetric and positive definite if {xi} is a set of distinct points in R^d [18]. S(f) is a regularized functional. As for the connection to regularization theory, Evgeniou et al. [19] have given a comprehensive discussion...

295 | Some results on Tchebycheffian spline functions
- Kimeldorf, Wahba
- 1971
Citation Context: ...f unknowns wi ∀i, and w as the column vector containing {wi}. Then f MP can be written as: f MP = Σ · w (9). The elegant form of a minimizer of (8) is also known as the representer theorem [20]. A generalized representer theorem can be found in [21], in which the loss function is merely required to be any strictly monotonically increasing function ℓ : R → [0, +∞)...

230 | Gaussian processes for regression
- Williams, Rasmussen
- 1996
Citation Context: ...Neal [7] observed that a Gaussian prior for the weights approaches a Gaussian process for functions as the number of hidden units approaches infinity. Inspired by Neal’s work, Williams and Rasmussen [8] extended the use of Gaussian process prior to higher dimensional regression problems that have been traditionally tackled with other techniques, such as neural networks, decision trees etc., and good ...

230 | The relevance vector machine
- Tipping
Citation Context: ...that is equal to prediction minus target, and solid curves indicate the error bars ±2√(σ_t² + σ_n²) in the predictive distribution. TABLE V: Comparison with ridge regression [17], relevance vector machine [41], GPR and SVR on price prediction of the Boston housing data set. ASE denotes the average squared test error. Columns: Method, Kernel Type, ASE — Ridge Regression, Polynomial, 10.44; Ridge Regression, S...

224 | Feature selection for SVMs
- Weston, Mukherjee, et al.
- 2000
Citation Context: ...eature selection is an essential part in regression modelling. Recently, Jebara and Jaakkola [32] formalized a kind of feature weighting in maximum entropy discrimination framework, and Weston et al. [33] introduced a method of feature selection for support vector machines by minimizing the bounds on the leave-one-out error. MacKay [34] and Neal [7] proposed automatic relevance determination (ARD) as ...

206 | Prediction with Gaussian processes: From linear regression to linear prediction and beyond
- Williams
- 1999
Citation Context: ...s that have been traditionally tackled with other techniques, such as neural networks, decision trees etc., and good results have been obtained. Regression with Gaussian processes (GPR) is reviewed in [9]. The important advantages of GPR models over other non-Bayesian models are the ability to infer hyperparameters and the provision of confidence intervals of its predictions. The drawback of GPR model...

157 | Probable networks and plausible predictions – a review of practical Bayesian methods for supervised neural networks
- MacKay
- 1995
Citation Context: ...ination, Model Selection I. INTRODUCTION The application of Bayesian techniques to neural networks was pioneered by MacKay [1], Neal [2], and Buntine and Weigend [3]. These works are reviewed in [4], [5] and [6]. Unlike standard neural network design, the Bayesian approach considers probability distributions in the weight space of the network. Together with the observed data, prior distributions are ...

157 | Linear Programming : Foundations and Extensions
- Vanderbei
- 2001
Citation Context: ...and 0 ≤ αi* ≤ C. Obviously, the dual problem (29) is a convex quadratic programming problem. Matrix-based quadratic programming techniques that use the “chunking” idea can be used for its solutions [26]. Popular SMO algorithms for classical SVR [27] [28] can also be adapted for its solution. For more details about the adaptation, refer to [29]. The optimal value of the primal variables f can be obta...

149 | A generalized representer theorem
- Schölkopf, Herbrich, et al.
- 2001
Citation Context: ...the column vector containing {wi}. Then f MP can be written as: f MP = Σ · w (9). The elegant form of a minimizer of (8) is also known as the representer theorem [20]. A generalized representer theorem can be found in [21], in which the loss function is merely required to be any strictly monotonically increasing function ℓ : R → [0, +∞)...

127 | Bayesian back-propagation
- Buntine, Weigend
- 1991
Citation Context: ...s function, Automatic Relevance Determination, Model Selection I. INTRODUCTION The application of Bayesian techniques to neural networks was pioneered by MacKay [1], Neal [2], and Buntine and Weigend [3]. These works are reviewed in [4], [5] and [6]. Unlike standard neural network design, the Bayesian approach considers probability distributions in the weight space of the network. Together with the o...

109 | Ridge regression learning algorithm in dual variables
- Saunders, Gammerman, et al.
- 1998
Citation Context: ...l in SVM, while the second term corresponds to the variance of the bias in classical SVR [10]. Other kernel functions in SVM, such as polynomial kernel, spline kernel [11], ANOVA decomposition kernel [17] etc., or their combinations can also be used in covariance function, but we only focus on Gaussian kernel in the present work. Thus, the prior probability of the functions is a multivariate Gaussian ...

64 | Bayesian methods for backpropagation networks
- MacKay
- 1994
Citation Context: ...ng in maximum entropy discrimination framework, and Weston et al. [33] introduced a method of feature selection for support vector machines by minimizing the bounds on the leave-one-out error. MacKay [34] and Neal [7] proposed automatic relevance determination (ARD) as a hierarchical prior over the weights in neural networks. The weights connected to an irrelevant input can be automatically punished w...

55 | Improvements to SMO algorithm for SVM regression
- Shevade, Keerthi, et al.
- 1999
Citation Context: ...is a convex quadratic programming problem. Matrix-based quadratic programming techniques that use the “chunking” idea can be used for its solutions [26]. Popular SMO algorithms for classical SVR [27] [28] can also be adapted for its solution. For more details about the adaptation, refer to [29]. The optimal value of the primal variables f can be obtained from the solution of (29) as f MP = Σ · (α − α ...

54 | A unified framework for regularization networks and support vector machines
- Evgeniou, Pontil, et al.
- 1999
Citation Context: ...ed using (3), Σ is symmetric and positive definite if {xi} is a set of distinct points in R^d [18]. S(f) is a regularized functional. As for the connection to regularization theory, Evgeniou et al. [19] have given a comprehensive discussion. Let f MP be the optimal solution of (8). If the loss function in (8) is differentiable, the deriv...

52 | Bayesian methods for support vector machines: Evidence and predictive class probabilities
- Sollich
- 2002
Citation Context: ...f SVM. For classification, Kwok [12] built up MacKay’s evidence framework [1] using a weight-space interpretation. Seeger [13] presented a variational Bayesian method for model selection, and Sollich [14] proposed Bayesian methods with normalized evidence and error bar. In SVM for regression (SVR), Law and Kwok [15] applied MacKay’s Bayesian framework to SVR in the weight space. Gao et al. [16] derive...

49 | Bayesian model selection for support vector machines, Gaussian processes and other kernel classi
- Seeger
- 2000
Citation Context: ...these hyperparameters. There is some literature on Bayesian interpretations of SVM. For classification, Kwok [12] built up MacKay’s evidence framework [1] using a weight-space interpretation. Seeger [13] presented a variational Bayesian method for model selection, and Sollich [14] proposed Bayesian methods with normalized evidence and error bar. In SVM for regression (SVR), Law and Kwok [15] applied ...

49 | Advanced Mean Field Methods: Theory and Practice
- Saad, Opper
- 2001
Citation Context: ...mate. Law and Kwok [36] applied the evidence framework [1] to ν-SVR with a particular prior for ɛ, but the dependency on ɛ makes the consequent evidence approximation intractable. Variational methods [37] might be used here to tackle the integral. VI. ERROR BAR IN PREDICTION In this section, we present error bars for predictions on new data points [1] [4]. This ability to provide error bars is one of ...

48 | Feature selection and dualities in maximum entropy discrimination
- Jebara, Jaakkola
- 2000
Citation Context: ...mal solution of (29). Note that the non-SVs are not involved in these evaluations. B. Feature Selection Feature selection is an essential part in regression modelling. Recently, Jebara and Jaakkola [32] formalized a kind of feature weighting in maximum entropy discrimination framework, and Weston et al. [33] introduced a method of feature selection for support vector machines by minimizing the bound...

42 | Hybrid Monte Carlo
- Duane, Kennedy, et al.
- 1987
Citation Context: ...of (29) associated with non-SVs are not at all involved in the prediction process. In a full Bayesian treatment, these hyperparameters θ must be integrated over θ-space. Hybrid Monte Carlo methods [38] [2] can be adopted here to approximate the integral efficiently by using the gradients of P(D|θ) to choose search directions which favor regions of high posterior probability of θ. In numerical exp...

40 | Bayesian training of backpropagation networks by the Hybrid Monte Carlo method
- Neal
- 1992

30 | The Evidence Framework Applied to Support Vector
- Kwok
Citation Context: ...ved. Typically, Bayesian methods are regarded as suitable tools to determine the values of these hyperparameters. There is some literature on Bayesian interpretations of SVM. For classification, Kwok [12] built up MacKay’s evidence framework [1] using a weight-space interpretation. Seeger [13] presented a variational Bayesian method for model selection, and Sollich [14] proposed Bayesian methods with ...

29 | Using support vector machines for time series prediction
- Müller, Smola, et al.
- 1999
Citation Context: ...indicate that our approach gives a performance that is very similar to that given by well-respected techniques. C. Laser Generated Data SVR has been successfully applied to time series prediction [40]. Here we choose the laser data to illustrate the error bar in predictions. The laser data has been used in the Santa Fe Time Series Prediction Analysis Competition. A total of 1000 points of far-i...

21 | Bayesian approach for neural networks — review and case studies
- Lampinen, Vehtari
- 2001
Citation Context: ...Model Selection I. INTRODUCTION The application of Bayesian techniques to neural networks was pioneered by MacKay [1], Neal [2], and Buntine and Weigend [3]. These works are reviewed in [4], [5] and [6]. Unlike standard neural network design, the Bayesian approach considers probability distributions in the weight space of the network. Together with the observed data, prior distributions are converte...

15 | A probabilistic framework for SVM regression and error bar estimation, Machine Learning
- Gao, Gunn, et al.
Citation Context: ...ollich [14] proposed Bayesian methods with normalized evidence and error bar. In SVM for regression (SVR), Law and Kwok [15] applied MacKay’s Bayesian framework to SVR in the weight space. Gao et al. [16] derived the evidence and error bar approximation for SVR along the way proposed by Sollich [14]. In these two approaches, the lack of smoothness of the ɛ-insensitive loss function (ɛ-ILF) in SVR may ...

13 | Bayesian support vector regression
- Law, Kwok
- 2001
Citation Context: ...Seeger [13] presented a variational Bayesian method for model selection, and Sollich [14] proposed Bayesian methods with normalized evidence and error bar. In SVM for regression (SVR), Law and Kwok [15] applied MacKay’s Bayesian framework to SVR in the weight space. Gao et al. [16] derived the evidence and error bar approximation for SVR along the way proposed by Sollich [14]. In these two approache...

10 | Spline Models for Observational Data, ser
- Wahba
- 1990
Citation Context: ...pends on the shape of the kernel function and other hyperparameters that represent the characteristics of the noise distribution in the training data. Re-sampling approaches, such as cross-validation [11], are commonly used in practice to decide values of these hyperparameters, but such approaches are very expensive when a large number of hyperparameters are involved. Typically, Bayesian methods are r...

10 | Support vector machines
- Schölkopf, Dumais, et al.
- 1998
Citation Context: ...for the evidence in the case has been discussed by Gao et al. [16], in which the (left/right) first order derivative at the insensitive tube is used in the evidence approximation. Schölkopf and Smola [35] proposed an interesting variant of SVR, known as ν-SVR, in which the hyperparameter ɛ is optimized in the MAP estimate. Law and Kwok [36] applied the evidence framework [1] to ν-SVR with a particular...

5 | Bayesian Learning for Neural Networks, ser
- Neal
- 1996
Citation Context: ...ers probability distributions in the weight space of the network. Together with the observed data, prior distributions are converted to posterior distributions through the use of Bayes’ theorem. Neal [7] observed that a Gaussian prior for the weights approaches a Gaussian process for functions as the number of hidden units approaches infinity. Inspired by Neal’s work, Williams and Rasmussen [8] exten...

5 | SMO Algorithm for Least Squares
- Keerthi, Shevade
- 2003

4 | A unified loss function in Bayesian framework for support vector regression
- Chu, Keerthi, et al.
- 2001
Citation Context: ...egrating over σi and ti as follows: P(δi) = ∫∫ dσi dti P(δi|σi, ti) λ(ti) µ(σi) (16). The probability (16) can also be evaluated in the form of loss function as (13). Under such settings, it is possible [24] to find a Rayleigh distribution on σi and a specific distribution on ti, such that the evaluations of expression (13) and (16) are equivalent. Therefore, the use of SILF can also be explained as a ge...

3 | Bayesian approach to support vector machines
- Chu
- 2003
Citation Context: ...at use the “chunking” idea can be used for its solutions [26]. Popular SMO algorithms for classical SVR [27] [28] can also be adapted for its solution. For more details about the adaptation, refer to [29]. The optimal value of the primal variables f can be obtained from the solution of (29) as f MP = Σ · (α − α*) (30), where α = [α1, α2, ..., αn]^T and α* = [α1*, α2*, ..., αn*]^T. This ...
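The dual-to-primal recovery f_MP = Σ · (α − α*) quoted in this excerpt can be illustrated with a toy computation; the random positive definite Σ and dense dual vectors below are stand-ins for what a real QP solver would return (where non-support vectors have αi = αi* = 0, making the representation sparse):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5
A = rng.standard_normal((n, n))
Sigma = A @ A.T + n * np.eye(n)         # symmetric positive definite covariance matrix
alpha = rng.uniform(0.0, 1.0, n)        # dual variables from the QP (illustrative)
alpha_star = rng.uniform(0.0, 1.0, n)   # their starred counterparts
f_mp = Sigma @ (alpha - alpha_star)     # MAP estimate of the latent function values
```

Only the rows of Σ paired with nonzero (α − α*) entries contribute, which is where the sparsity advantage over full GPR comes from.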

2 | On the noise model of support vector regression
- Pontil, Mukherjee, et al.
- 1998
Citation Context: ...(15) Remark 1: We now give an interpretation for SILF, which is an extension of that given by Pontil et al. [23] for ɛ-ILF. If we discard the popular assumption that the distribution of the noise variables δi is a zero-mean Gaussian, but assume that the noise variables δi have a Gaussian distribution P(δi|σi, t...

1 | Applying the Bayesian evidence framework to ν-support vector regression
- Law, Kwok
Citation Context: ...be is used in the evidence approximation. Schölkopf and Smola [35] proposed an interesting variant of SVR, known as ν-SVR, in which the hyperparameter ɛ is optimized in the MAP estimate. Law and Kwok [36] applied the evidence framework [1] to ν-SVR with a particular prior for ɛ, but the dependency on ɛ makes the consequent evidence approximation intractable. Variational methods [37] might be used here...