## An Empirical Comparison of Supervised Learning Algorithms (2006)

Venue: | In Proc. 23rd Intl. Conf. on Machine Learning (ICML’06) |

Citations: | 106 - 6 self |

### BibTeX

@INPROCEEDINGS{Caruana06anempirical,

author = {Rich Caruana and Alexandru Niculescu-Mizil},

title = {An Empirical Comparison of Supervised Learning Algorithms},

booktitle = {Proceedings of the 23rd International Conference on Machine Learning (ICML’06)},

year = {2006},

pages = {161--168}

}

### Abstract

A number of supervised learning methods have been introduced in the last decade. Unfortunately, the last comprehensive empirical evaluation of supervised learning was the Statlog Project in the early 90’s. We present a large-scale empirical comparison between ten supervised learning methods: SVMs, neural nets, logistic regression, naive Bayes, memory-based learning, random forests, decision trees, bagged trees, boosted trees, and boosted stumps. We also examine the effect that calibrating the models via Platt Scaling and Isotonic Regression has on their performance. An important aspect of our study is the use of a variety of performance criteria to evaluate the learning methods.

### Citations

9458 | The nature of statistical learning theory - Vapnik - 1995 |

3181 |
Data Mining: Practical Machine Learning Tools and Techniques
- Witten, Frank
- 2005
Citation Context ...ts. Logistic Regression (LOGREG): we train both unregularized and regularized models, varying the ridge (regularization) parameter by factors of 10 from 10^−8 to 10^4. Naive Bayes (NB): we use Weka (Witten & Frank, 2005) and try all three of the Weka options for handling continuous attributes: modeling them as a single normal, modeling them with kernel estimation, or discretizing them using supervised discretization... |
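The sweep described in this context (ridge parameter varied by factors of 10 from 10^−8 to 10^4, with the winner chosen on a held-out validation set) can be sketched as follows. `val_loss` is a hypothetical callable standing in for "train a model with this ridge value and score it on the validation set"; it is an assumption, not part of the paper:

```python
def ridge_grid():
    """The 13 ridge (regularization) values swept in the paper:
    factors of 10 from 1e-8 up to 1e4."""
    return [10.0 ** e for e in range(-8, 5)]

def select_by_validation(params, val_loss):
    """Model selection as described in the paper: pick the parameter
    whose model achieves the lowest loss on the validation set."""
    return min(params, key=val_loss)

# Hypothetical usage: val_loss here is a toy stand-in that prefers
# ridge values near 1.0; a real one would train and evaluate a model.
best = select_by_validation(ridge_grid(), val_loss=lambda r: abs(r - 1.0))
```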

2963 |
UCI Repository of Machine Learning Databases, University of California, Irvine, CA. www.ics.uci.edu/mlearn/MLRepository.html
- Blake, Merz
- 1998
Citation Context ...ints validation set that will be used for model selection. 2.5 Data Sets We compare the algorithms on 8 binary classification problems. ADULT, COVT and LETTER are from UCI Machine Learning Repository [11]. ADULT is the only problem that has nominal attributes. For ANNs, SVMs and KNNs we transform nominal attributes to boolean. Each DT, BAG-DT, BST-DT, BST-STMP, and NB model is trained twice, once with... |

2609 | Bagging Predictors - Breiman - 1996 |

1552 | Random forests - Breiman - 2001 |

1499 |
Making large-scale SVM Learning Practical
- Joachims
- 1999
Citation Context ...ing algorithm as thoroughly as is computationally feasible. This section summarizes the parameters used for each learning algorithm and may be skipped. SVMs: we use the following kernels in SVMLight [2]: linear, polynomial degree 2 & 3, radial with width {0.001,0.005,0.01,0.05,0.1,0.5,1,2} and vary the regularization parameter by factors of ten from 10^−7 to 10^3 with each kernel. ANNs: we train neura... |

744 | Probabilistic Outputs for Support Vector Machines and Comparisons to Regularized Likelihood Methods, Advances in Large Margin Classifiers
- Platt
- 1999
Citation Context ...A number of methods have been proposed for mapping predictions to posterior probabilities. Platt [6] proposed transforming SVM predictions to posterior probabilities by passing them through a sigmoid. Platt’s method also works well for boosted trees and boosted stumps. A sigmoid, however, might not ... |
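Platt’s calibration fits a sigmoid p(y=1|f) = 1/(1 + exp(A·f + B)) to raw classifier scores on a validation set. A minimal sketch, assuming plain gradient descent on the log loss (Platt’s actual method regularizes the target labels and uses a Newton-style optimizer):

```python
import math

def fit_platt(scores, labels, lr=0.01, steps=5000):
    """Fit A, B so that 1 / (1 + exp(A*f + B)) approximates P(y=1 | score f).
    Plain gradient descent on the negative log-likelihood (a simplification;
    Platt's paper uses regularized targets and a second-order optimizer)."""
    A, B = 0.0, 0.0
    for _ in range(steps):
        gA = gB = 0.0
        for f, y in zip(scores, labels):
            p = 1.0 / (1.0 + math.exp(A * f + B))
            # The gradient of the log loss w.r.t. the sigmoid argument
            # (A*f + B) works out to (y - p).
            gA += (y - p) * f
            gB += (y - p)
        A -= lr * gA
        B -= lr * gB
    return A, B

def platt_prob(f, A, B):
    """Map a raw classifier score to a calibrated probability."""
    return 1.0 / (1.0 + math.exp(A * f + B))
```

On separable toy data (scores −2, −1 for negatives; 1, 2 for positives) the fitted sigmoid pushes the extreme scores toward 0 and 1.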

561 | An empirical comparison of voting classification algorithms: bagging, boosting, and variants. http://robotics.stanford.edu/users/ronnyk
- Bauer, Kohavi
- 1997
Citation Context ...t using both accuracy and an ROC-like metric. Lim et al. [15] perform an empirical comparison of decision trees and other classification methods using accuracy as the main criterion. Bauer and Kohavi [16] present an impressive empirical analysis of ensemble methods such as bagging and boosting. Perlich et al. [17] conduct an empirical comparison between decision trees and logistic regression. Provost... |

331 | The boosting approach to machine learning: An overview, Nonlinear Estimation and Classification - Schapire - 2003 |

271 | Analysis and visualization of classifier performance: Comparison under imprecise class and cost distributions
- Provost, Fawcett
- 1997
Citation Context ...as a summary of model performance across all possible thresholds. The rank metrics we use are area under the ROC curve (ROC), average precision (APR), and precision/recall break even point (BEP). See [5] for a discussion of ROC from a machine learning perspective. The probability metrics are minimized (in expectation) when the predicted value for each case coincides with the true conditional probabil... |
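Of the rank metrics listed in this context, ROC area has a convenient closed form: it equals the probability that a randomly chosen positive case is scored above a randomly chosen negative one. A minimal sketch, assuming 0/1 labels:

```python
def roc_area(scores, labels):
    """Area under the ROC curve via the Mann-Whitney rank identity:
    the fraction of (positive, negative) pairs ordered correctly by
    score, counting ties as half a win. Labels must be 0/1."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

For example, `roc_area([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])` returns 0.75: three of the four positive/negative pairs are ordered correctly.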

242 |
Order Restricted Statistical Inference
- Robertson, Wright, et al.
- 1988
Citation Context ...sted trees and boosted stumps. A sigmoid, however, might not be the correct transformation for all learning algorithms. Zadrozny and Elkan [7,8] used a more general method based on Isotonic Regression [9] to calibrate predictions from SVMs, naive bayes, boosted naive bayes, and decision trees. Isotonic Regression is more general in that the only restriction it makes is that the mapping function be iso... |

176 | A comparison of prediction accuracy, complexity, and training time for thirty–three old and new classification algorithms - Lim, Loh, et al. - 2000 |

136 | Tree induction for probability-based ranking
- Provost, Domingos
- 1992
Citation Context ..., CART, CART0, C4, MML, and SMML. We also generate trees of type C44LS (C4 with no pruning and Laplacian smoothing), C44BS (C44 with Bayesian smoothing), and MMLLS (MML with Laplacian smoothing). See [3] for a description of C44LS. Bagged trees (BAG-DT): we bag 100 trees of each type described above. With boosted trees (BST-DT) we boost each tree type as well. Boosting can overfit, so we consider boo... |

102 | Obtaining calibrated probability estimates from decision trees and naive bayesian classifiers
- Zadrozny, Elkan
- 2001
Citation Context ...them through a sigmoid. Platt’s method also works well for boosted trees and boosted stumps. A sigmoid, however, might not be the correct transformation for all learning algorithms. Zadrozny and Elkan [7,8] used a more general method based on Isotonic Regression [9] to calibrate predictions from SVMs, naive bayes, boosted naive bayes, and decision trees. Isotonic Regression is more general in that the o... |

97 |
An Empirical Distribution Function for Sampling with Incomplete Information
- Ayer, Brunk, et al.
- 1955
Citation Context ... that the mapping function be isotonic (monotonically increasing). A standard algorithm for Isotonic Regression that finds a piecewise constant solution is the pair-adjacent violators (PAV) algorithm [10]. To calibrate models, we use the same 1000 points validation set that will be used for model selection. 2.5 Data Sets We compare the algorithms on 8 binary classification problems. ADULT, COVT and LE... |
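The PAV algorithm referenced in this context is short enough to sketch directly: repeatedly merge adjacent blocks whose means violate monotonicity, then read off the piecewise-constant fit. A minimal version, assuming 0/1 labels:

```python
def pav(scores, labels):
    """Pair-adjacent violators: fit an isotonic (monotonically
    non-decreasing) step function from scores to probabilities.
    Returns one calibrated value per case, in ascending score order."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    merged = []  # blocks of [label_sum, count]; block means stay non-decreasing
    for i in order:
        merged.append([labels[i], 1])
        # Merge backwards while an earlier block's mean exceeds a later one's
        # (cross-multiplied to avoid division).
        while len(merged) > 1 and \
                merged[-2][0] * merged[-1][1] > merged[-1][0] * merged[-2][1]:
            s, n = merged.pop()
            merged[-1][0] += s
            merged[-1][1] += n
    calibrated = []
    for s, n in merged:
        calibrated.extend([s / n] * n)
    return calibrated
```

For example, `pav([0.1, 0.2, 0.3, 0.4], [0, 1, 0, 1])` yields `[0.0, 0.5, 0.5, 1.0]`: the out-of-order pair (a 1 ranked below a 0) is pooled into one block with mean 0.5.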

91 | Comparison of learning algorithms for handwritten digit recognition
- LeCun, Jackel, et al.
- 1995
Citation Context ...]. STATLOG was a very comprehensive study when it was performed, but since then important new learning algorithms have been introduced such as bagging, boosting, SVMs, and random forests. LeCun et al. [13] present a study that compares several learning algorithms (including SVMs) on a handwriting recognition problem using three performance criteria: accuracy, rejection rate, and computational cost. Coo... |

74 | Transforming classifier scores into accurate multiclass probability estimates
- Zadrozny, Elkan
- 2002
Citation Context ...them through a sigmoid. Platt’s method also works well for boosted trees and boosted stumps. A sigmoid, however, might not be the correct transformation for all learning algorithms. Zadrozny and Elkan [7,8] used a more general method based on Isotonic Regression [9] to calibrate predictions from SVMs, naive bayes, boosted naive bayes, and decision trees. Isotonic Regression is more general in that the o... |

65 | Tree Induction vs. Logistic Regression: A Learning-curve Analysis
- Perlich, Provost, et al.
- 2003
Citation Context ... and other classification methods using accuracy as the main criterion. Bauer and Kohavi [16] present an impressive empirical analysis of ensemble methods such as bagging and boosting. Perlich et al. [17] conduct an empirical comparison between decision trees and logistic regression. Provost and Domingos [3] examine the issue of predicting probabilities with decision trees, including smoothed and bag... |

59 | Predicting good probabilities with supervised learning
- Niculescu-Mizil, Caruana
- 2005
Citation Context ...or probabilities. Platt (1999) proposed transforming SVM predictions to posterior probabilities by passing them through a sigmoid. Platt’s method also works well for boosted trees and boosted stumps (Niculescu-Mizil & Caruana, 2005). A sigmoid, however, might not be the correct transformation for all learning algorithms. Zadrozny and Elkan (2002; 2001) used a more general calibration method based on Isotonic Regression (Roberts... |

51 | Statlog: Comparison of Classification Algorithms on Large Real-World Problems
- King, Feng, et al.
- 1995
Citation Context ...models are boosted trees, random forests, and unscaled neural nets. 1 Introduction There are few comprehensive empirical studies comparing learning algorithms. STATLOG is perhaps the best known study [1]. STATLOG was very comprehensive, but since it was performed new learning algorithms have emerged (e.g., bagging, boosting, SVMs, random forests) that have excellent performance. Also, learning algori... |

50 | Data Mining in Metric Space: An Empirical Analysis of Supervised Learning Performance Criteria - Caruana, Niculescu-Mizil - 2004 |

30 | Introduction to IND and recursive partitioning
- Buntine
- 1991
Citation Context ... 1,2,4,6,8,12,16 or 20. Decision trees (DT): we vary the splitting criterion, pruning options, and smoothing (Laplacian or Bayesian smoothing). We use all of the tree models in Buntine’s IND package (Buntine & Caruana, 1991): BAYES, ID3, CART, CART0, C4, MML, and SMML. We also generate trees of type C44LS (C4 with no pruning and Laplacian smoothing), C44BS (C44 with Bayesian smoothing), and MMLLS (MML with Laplacian smo... |

24 | Support vector machine classifiers as applied to AVIRIS data
- Gualtieri, Chettri, et al.
- 1999
Citation Context ... negative, yielding a very unbalanced binary problem. LTR.p2 uses letters A-M as positives and the rest as negatives, yielding a difficult, but well balanced, problem. HS is the IndianPine92 data set [12] where the difficult class Soybean-mintill is the positive class. SLAC is a problem from the Stanford Linear Accelerator. MEDIS and MG are medical data sets. The characteristics of these data sets are... |

16 |
Applied Data Mining
- Giudici
- 2003
Citation Context ...core (FSC) and lift (LFT), it is not important how close a prediction is to a threshold, only if it is above or below threshold. Usually ACC and FSC have a fixed threshold (we use 0.5). For lift (see [4] for a description of lift), often a fixed percent, p, of cases are predicted as positive and the rest as negative (we use p = 25%). The ordering/rank metrics depend only on the ordering of the cases,... |
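The lift computation described in this context (predict the top p = 25% of cases as positive, then compare the positive rate in that slice to the overall base rate) can be sketched as:

```python
def lift(scores, labels, p=0.25):
    """Lift at fraction p: the positive rate among the top p of cases
    (ranked by predicted score) divided by the overall positive rate.
    The paper fixes p = 25%; labels are assumed to be 0/1."""
    n = max(1, round(p * len(scores)))
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    top_rate = sum(y for _, y in ranked[:n]) / n
    base_rate = sum(labels) / len(labels)
    return top_rate / base_rate
```

For example, with scores `[0.9, 0.8, 0.2, 0.1]`, labels `[1, 1, 0, 0]`, and p = 0.5, the top half is all positive against a base rate of 0.5, so the lift is 2.0; a random ranking has an expected lift of 1.0.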

12 |
An evaluation of machine learning methods for predicting pneumonia mortality
- Cooper, Aliferis, et al.
Citation Context ...study that compares several learning algorithms (including SVMs) on a handwriting recognition problem using three performance criteria: accuracy, rejection rate, and computational cost. Cooper et al. [14] present results from a study that evaluates nearly a dozen learning methods on a real medical data set using both accuracy and an ROC-like metric. Lim et al. [15] perform an empirical comparison of d... |

12 | An empirical comparison of decision trees and other classification methods
- Lim, Loh, et al.
- 1997
Citation Context ...nd computational cost. Cooper et al. [14] present results from a study that evaluates nearly a dozen learning methods on a real medical data set using both accuracy and an ROC-like metric. Lim et al. [15] perform an empirical comparison of decision trees and other classification methods using accuracy as the main criterion. Bauer and Kohavi [16] present an impressive empirical analysis of ensemble met... |