## Probabilistic Outputs for Support Vector Machines and Comparisons to Regularized Likelihood Methods (1999)

Venue: Advances in Large Margin Classifiers

Citations: 701 (0 self)

### BibTeX

```bibtex
@INPROCEEDINGS{Platt99probabilisticoutputs,
  author    = {John C. Platt},
  title     = {Probabilistic Outputs for Support Vector Machines and Comparisons to Regularized Likelihood Methods},
  booktitle = {Advances in Large Margin Classifiers},
  year      = {1999},
  pages     = {61--74},
  publisher = {MIT Press}
}
```


### Abstract

The output of a classifier should be a calibrated posterior probability to enable post-processing. Standard SVMs do not provide such probabilities. One method to create probabilities is to directly train a kernel classifier with a logit link function and a regularized maximum likelihood score. However, training with a maximum likelihood score will produce non-sparse kernel machines. Instead, we train an SVM, then train the parameters of an additional sigmoid function to map the SVM outputs into probabilities. This chapter compares classification error rate and likelihood scores for an SVM plus sigmoid versus a kernel method trained with a regularized likelihood error function. These methods are tested on three data-mining-style data sets. The SVM+sigmoid yields probabilities of comparable quality to the regularized maximum likelihood kernel method, while still retaining the sparseness of the SVM.
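The recipe the abstract describes — train an SVM, then fit a two-parameter sigmoid P(y = 1 | f) = 1/(1 + exp(Af + B)) mapping the unthresholded SVM output f into a probability — can be sketched as follows. This is an illustrative reconstruction, not the chapter's pseudo-code: the paper fits A and B with a model-trust Newton method, whereas this sketch assumes a generic Nelder–Mead optimizer, and the function names are invented.

```python
# Illustrative sketch of Platt scaling: fit P(y=1|f) = 1/(1 + exp(A*f + B))
# to SVM decision values f. Function names are invented; the chapter itself
# uses a model-trust Newton method, not the generic optimizer assumed here.
import numpy as np
from scipy.optimize import minimize

def fit_platt_sigmoid(f, y):
    """f: unthresholded SVM outputs; y: labels in {0, 1}. Returns (A, B)."""
    n_pos = int(np.sum(y == 1))
    n_neg = len(y) - n_pos
    # Bayes-motivated non-binary targets from the chapter:
    # t+ = (N+ + 1)/(N+ + 2) for positives, t- = 1/(N- + 2) for negatives.
    t = np.where(y == 1, (n_pos + 1.0) / (n_pos + 2.0), 1.0 / (n_neg + 2.0))

    def neg_log_likelihood(params):
        A, B = params
        p = 1.0 / (1.0 + np.exp(A * f + B))
        eps = 1e-12  # guard against log(0)
        return -np.sum(t * np.log(p + eps) + (1 - t) * np.log(1 - p + eps))

    res = minimize(neg_log_likelihood, x0=np.array([0.0, 0.0]),
                   method="Nelder-Mead")
    return res.x[0], res.x[1]

def platt_probability(f, A, B):
    """Map an SVM output f to a calibrated probability P(y=1|f)."""
    return 1.0 / (1.0 + np.exp(A * f + B))
```

For well-separated classes the fitted A is negative, so larger decision values map to probabilities near 1. As the chapter stresses, the f values should come from a hold-out set or cross-validation to avoid a biased sigmoid fit.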

### Citations

8973 | Statistical Learning Theory
- Vapnik
- 1998
Citation Context: ...se of a multi-category classifier, choosing the category based on maximal posterior probability over all classes is the Bayes optimal decision for the equal loss case. However, Support Vector Machines [19] (SVMs) produce an uncalibrated value that is not a probability. Let the unthresholded output of an SVM be f(x) = h(x) + b (1), where h(x) = Σ_i y_i α_i k(x_i, x) (2). ...

2865 | UCI repository of machine learning databases
- Blake, Merz
- 1998
Citation Context: ...s tasks were used. The first task is determining the category of a Reuters news article [5, 8]. The second task is the UCI Adult benchmark of estimating the income of a household given census form data [13], where the input vectors are quantized [15]. The third task is determining the category of a web page given key words in the page [15]. The Reuters task is solved using a linear SVM, while the Adult ...

2721 | Learning Internal Representations by Error Propagation
- Rumelhart, Hinton, et al.
- 1986
Citation Context: ...(14) These targets are used instead of {0, 1} for all of the data in the sigmoid fit. These non-binary target values are Bayes-motivated, unlike traditional non-binary targets for neural networks [18]. Furthermore, the non-binary targets will converge to {0, 1} when the training set size approaches infinity, which recovers the maximum likelihood sigmoid fit. The pseudo-code in Appendix 5 shows the op...
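For reference, the Bayes-motivated targets this context refers to (equation (14) in the chapter, reproduced here from the paper) replace the hard labels {0, 1}; with N₊ positive and N₋ negative training examples they are:

```latex
t_{+} = \frac{N_{+} + 1}{N_{+} + 2}, \qquad t_{-} = \frac{1}{N_{-} + 2}
```

These act like Laplace-smoothed label probabilities, so the sigmoid fit is not forced to saturate on separable data; as N₊, N₋ → ∞ the targets recover {0, 1}.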

1696 | Text categorization with support vector machines: learning with many relevant features
- Joachims
- 1998
Citation Context: ...of the results in this chapter are presented using three-fold cross-validation. Even with cross-validated unbiased training data, the sigmoid can still be overfit. For example, in the Reuters data set [5, 8], some of the categories have very few positive examples which are linearly separable from all of the negative examples. Fitting a sigmoid for these SVMs with maximum likelihood will simply drive the ...

1652 | Atomic decomposition by basis pursuit
- Chen, Donoho, et al.
- 2001
Citation Context: ...o note that there are other kernel methods that produce sparse machines without relying on an RKHS. One such class of methods penalizes the ℓ₁ norm of the function h in (3), rather than the RKHS norm [12, 2] (see, for example, [this volume, chapter by Mangasarian]). Fitting a sigmoid after fitting these sparse kernel machines may, in future work, yield reasonable estimates of probabilities. 4 Conclusions...

1440 | Making large-scale SVM learning practical
- Joachims
- 1999
Citation Context: ...all of the other system parameters are determined from the hold-out set, the main SVM can be re-trained on the entire training set. If SVM training scales roughly quadratically with training set size [16, 9], then the hold-out set will be only 1.5 times slower than simply training on the entire data set. Because determining the system parameters is often unavoidable, determining A and B from the hold-out...

1346 | Practical Optimization
- Gill, Murray, et al.
- 1981
Citation Context: ...l examples falling into a bin of width 0.1. The solid line is the best-fit sigmoid to the posterior, using the algorithm described in this chapter. ...performed using a model-trust minimization algorithm [6], whose pseudo-code is shown in Appendix 5. Two issues arise in the optimization of (11): the choice of the sigmoid training set (f_i, y_i), and the method to avoid over-fitting this set. The easiest...

1011 | Fast training of support vector machines using sequential minimal optimization
- Platt
- 1998
Citation Context: ...rametric model can be inspired by looking at empirical data. Figure 1 shows a plot of the class-conditional densities p(f|y = ±1) for a linear SVM trained on a version of the UCI Adult data set (see [15]). The plot shows histograms of the densities (with bins 0.1 wide), derived from three-fold cross-validation. These densities are very far away from Gaussian. There are discontinuities in the derivativ...

528 | Approximate statistical tests for comparing supervised classification learning algorithms
- Dietterich
- 1998
Citation Context: ...an SVM+sigmoid, and a regularized likelihood kernel method. It also lists the negative log likelihood of the test set for SVM+sigmoid and for the regularized likelihood kernel method. McNemar's test [3] was used to find statistically significant differences in classification error rate, while the Wilcoxon signed rank test [14] is used to find significant differences in the log likelihood. Both of these te...

522 | Bayesian interpolation
- Mackay
- 1992
Citation Context: ...cian prior on A. However, there is always one free parameter in the prior distribution (e.g., the variance). This free parameter can be set using cross-validation or Bayesian hyperparameter inference [11], but these methods add complexity to the code. A simpler method is to create a model of out-of-sample data. One model is to assume that the out-of-sample data is simply the training data perturbed wi...

290 | Sequential Minimal Optimization: A Fast Algorithm for Training Support Vector Machines (Microsoft Research technical report)
- Platt
Citation Context: ...all of the other system parameters are determined from the hold-out set, the main SVM can be re-trained on the entire training set. If SVM training scales roughly quadratically with training set size [16, 9], then the hold-out set will be only 1.5 times slower than simply training on the entire data set. Because determining the system parameters is often unavoidable, determining A and B from the hold-out...

275 | Classification by pairwise coupling
- Hastie, Tibshirani
- 1996
Citation Context: ...les. Another method for fitting probabilities to the output of an SVM is to fit Gaussians to the class-conditional densities p(f|y = 1) and p(f|y = −1). This was first proposed by Hastie and Tibshirani in [7], where a single tied variance is estimated for both Gaussians. The posterior probability rule P(y = 1|f) is thus a sigmoid, whose slope is determined by the tied variance. Hastie and Tibshirani [7] ...

150 | Support vector machines, reproducing kernel Hilbert spaces and randomized GACV
- Wahba
- 1998
Citation Context: ...n Classifiers, Alexander J. Smola, Peter Bartlett, Bernhard Schölkopf, Dale Schuurmans, eds., MIT Press, (1999), to appear. ...lies in a Reproducing Kernel Hilbert Space (RKHS) F induced by a kernel k [22]. Training an SVM minimizes an error function that penalizes an approximation to the training misclassification rate plus a term that penalizes the norm of h in the RKHS: C Σ_i (1 − y_i f_i)₊ + (1/2) ||h||²_F ...

140 | Pattern Classification and Scene Analysis
- Duda, Hart
- 1973
Citation Context: ...g a classifier to produce a posterior probability P(class|input) is very useful in practical recognition situations. For example, a posterior probability allows decisions that can use a utility model [4]. Posterior probabilities are also required when a classifier is making a small part of an overall decision, and the classification outputs must be combined for the overall decision. An example of this ...

43 | Using SVMs for text categorization
- Dumais
- 1998
Citation Context: ...of the results in this chapter are presented using three-fold cross-validation. Even with cross-validated unbiased training data, the sigmoid can still be overfit. For example, in the Reuters data set [5, 8], some of the categories have very few positive examples which are linearly separable from all of the negative examples. Fitting a sigmoid for these SVMs with maximum likelihood will simply drive the ...

24 | A bound on the error of cross validation using the approximation and estimation rates, with consequences for the training-test split
- Kearns
- 1997
Citation Context: ...idable, determining A and B from the hold-out set may not incur extra computation with this method. Cross-validation is an even better method than a hold-out set for estimating the parameters A and B [10]. In three-fold cross-validation, the training set is split into three parts. Each of three SVMs is trained on a permutation of two out of three parts, and the f_i are evaluated on the remaining third...
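The three-fold scheme described in this context — train on two of the three parts, evaluate the f_i on the held-out third, repeat — can be sketched generically. This is illustrative code, not from the chapter; `train_fn` and `decision_fn` are hypothetical placeholders standing in for SVM training and the unthresholded decision function f(x) = h(x) + b.

```python
# Illustrative sketch of three-fold cross-validation for collecting
# unbiased decision values f_i. `train_fn`/`decision_fn` are hypothetical
# placeholders for SVM training and the decision function f(x) = h(x) + b.
import numpy as np

def threefold_decision_values(X, y, train_fn, decision_fn):
    """Each example's f_i comes from a model that never saw that example."""
    n = len(y)
    folds = np.array_split(np.arange(n), 3)
    f = np.empty(n)
    for k in range(3):
        held_out = folds[k]
        train = np.concatenate([folds[j] for j in range(3) if j != k])
        model = train_fn(X[train], y[train])
        f[held_out] = decision_fn(model, X[held_out])
    return f
```

The union of the three held-out evaluations yields one f_i per training example; the sigmoid is then fit once to all of them, and the final SVM can be retrained on the full training set.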

23 | Multivariate function and operator estimation, based on smoothing splines and reproducing kernels (in Nonlinear Modeling and Forecasting)
- Wahba
- 1992

18 | A continuous speech recognition system embedding MLP into HMM
- Bourlard, Morgan
- 1990
Citation Context: ...cation outputs must be combined for the overall decision. An example of this combination is using a Viterbi search or HMM to combine recognition results from phoneme recognizers into word recognition [1]. Even in the simple case of a multi-category classifier, choosing the category based on maximal posterior probability over all classes is the Bayes optimal decision for the equal loss case. However, S...

10 | Probability and Statistics
- Mosteller, Rourke, et al.
- 1967
Citation Context: ...r SVM+sigmoid and for the regularized likelihood kernel method. McNemar's test [3] was used to find statistically significant differences in classification error rate, while the Wilcoxon signed rank test [14] is used to find significant differences in the log likelihood. Both of these tests examine the results of a pair of algorithms on every example in the test set. In Table 2, underlined entries are pairw...

3 | The bias-variance tradeoff and the randomized GACV
- Wahba, Lin, et al.
- 1999
Citation Context: ...p(x) of such a machine will be a posterior probability. Minimizing this error function will not directly produce a sparse machine, but a modification to this method can produce sparse kernel machines [21]. This chapter presents modifications to SVMs which yield posterior probabilities, while still maintaining their sparseness. First, the chapter reviews recent work in modifying SVMs to produce probabil...

1 | Linear and nonlinear separation of patterns by linear programming
- Mangasarian
- 1965
Citation Context: ...o note that there are other kernel methods that produce sparse machines without relying on an RKHS. One such class of methods penalizes the ℓ₁ norm of the function h in (3), rather than the RKHS norm [12, 2] (see, for example, [this volume, chapter by Mangasarian]). Fitting a sigmoid after fitting these sparse kernel machines may, in future work, yield reasonable estimates of probabilities. 4 Conclusions...