Results 1–10 of 161
An Empirical Comparison of Supervised Learning Algorithms
 In Proc. 23rd Intl. Conf. Machine Learning (ICML'06), 2006
"... A number of supervised learning methods have been introduced in the last decade. Unfortunately, the last comprehensive empirical evaluation of supervised learning was the Statlog Project in the early 90’s. We present a largescale empirical comparison between ten supervised learning methods: SVMs, n ..."
Abstract

Cited by 212 (6 self)
A number of supervised learning methods have been introduced in the last decade. Unfortunately, the last comprehensive empirical evaluation of supervised learning was the Statlog Project in the early 90s. We present a large-scale empirical comparison of ten supervised learning methods: SVMs, neural nets, logistic regression, naive Bayes, memory-based learning, random forests, decision trees, bagged trees, boosted trees, and boosted stumps. We also examine the effect that calibrating the models via Platt Scaling and Isotonic Regression has on their performance. An important aspect of our study is the use of a variety of performance criteria to evaluate the learning methods.
Ensemble Selection from Libraries of Models
 In Proceedings of the 21st International Conference on Machine Learning, 2004
"... We present a method for constructing ensembles from libraries of thousands of models. Model libraries are generated using different learning algorithms and parameter settings. Forward stepwise selection is used to add to the ensemble the models that maximize its performance. Ensemble selection allow ..."
Abstract

Cited by 94 (4 self)
We present a method for constructing ensembles from libraries of thousands of models. Model libraries are generated using different learning algorithms and parameter settings. Forward stepwise selection is used to add to the ensemble the models that maximize its performance. Ensemble selection allows ensembles to be optimized to performance metrics such as accuracy, cross entropy, mean precision, or ROC area. Experiments with seven test problems and ten metrics demonstrate the benefit of ensemble selection.
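The forward stepwise selection described in the abstract can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes selection with replacement over a hypothetical library of per-example probability predictions and a simple accuracy metric.

```python
import numpy as np

def ensemble_selection(preds, y, metric, n_steps=5):
    """Greedy forward stepwise selection (with replacement): at each step,
    add the model whose inclusion maximizes the ensemble's score."""
    chosen = []                               # model names picked so far
    running = np.zeros(len(y), dtype=float)   # sum of chosen models' predictions
    for _ in range(n_steps):
        best_name, best_score = None, -np.inf
        for name, p in preds.items():
            cand = (running + p) / (len(chosen) + 1)   # averaged candidate ensemble
            score = metric(y, cand)
            if score > best_score:
                best_name, best_score = name, score
        chosen.append(best_name)
        running += preds[best_name]
    return chosen, running / len(chosen)

# Hypothetical library: each entry is a model's predicted probabilities
y = np.array([0, 1, 0, 1])
library = {
    "good":   np.array([0.1, 0.9, 0.2, 0.8]),
    "random": np.array([0.5, 0.5, 0.5, 0.5]),
    "bad":    np.array([0.9, 0.1, 0.8, 0.2]),
}
accuracy = lambda y, p: np.mean((p >= 0.5) == y)
picked, ens = ensemble_selection(library, y, accuracy)
```

Selecting with replacement lets strong models receive more weight in the averaged ensemble, which is one reason the greedy procedure is robust with large libraries.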
Relational Markov Models and their Application to Adaptive Web Navigation
2002
"... Relational Markov models (RMMs) are a generalization of Markov models where states can be of different types, with each type described by a different set of variables. The domain of each variable can be hierarchically structured, and shrinkage is carried out over the cross product of these hierarchi ..."
Abstract

Cited by 90 (9 self)
Relational Markov models (RMMs) are a generalization of Markov models where states can be of different types, with each type described by a different set of variables. The domain of each variable can be hierarchically structured, and shrinkage is carried out over the cross product of these hierarchies. RMMs make effective learning possible in domains with very large and heterogeneous state spaces, given only sparse data. We apply them to modeling the behavior of web site users, improving prediction in our PROTEUS architecture for personalizing web sites. We present experiments on an e-commerce and an academic web site showing that RMMs are substantially more accurate than alternative methods, and make good predictions even when applied to previously unvisited parts of the site.
Using AUC and accuracy in evaluating learning algorithms
 IEEE Transactions on Knowledge and Data Engineering, 2005
"... The area under the ROC (Receiver Operating Characteristics) curve, or simply AUC, has been recently proposed as an alternative singlenumber measure for evaluating the predictive ability of learning algorithms. However, no formal arguments were given as to why AUC should be preferred over accuracy. ..."
Abstract

Cited by 89 (1 self)
The area under the ROC (Receiver Operating Characteristic) curve, or simply AUC, has recently been proposed as an alternative single-number measure for evaluating the predictive ability of learning algorithms. However, no formal arguments were given as to why AUC should be preferred over accuracy. In this paper, we establish formal criteria for comparing two different measures for learning algorithms, and we show theoretically and empirically that AUC is, in general, a better measure (defined precisely) than accuracy. We then re-evaluate well-established claims in machine learning based on accuracy using AUC, and obtain interesting and surprising new results. We also show that AUC is more directly associated with net profit than accuracy in direct marketing, suggesting that learning algorithms should optimize AUC instead of accuracy in real-world applications. The conclusions drawn in this paper may have a significant impact on machine learning and data mining applications. Note: this paper integrates results from our papers published in IJCAI 2003 [22] and ICDM 2003 [15]. It also includes many new results; for example, the concept of indifferency in Section II-B is new, and Sections III-B, III-C, IV-A, IV-D, and V are all new and unpublished. Index Terms: evaluation of learning algorithms, AUC vs. accuracy, ROC
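AUC can be computed directly as a ranking statistic, without tracing the ROC curve: it equals the Wilcoxon-Mann-Whitney statistic, the probability that a randomly drawn positive is scored above a randomly drawn negative. A minimal sketch with toy data (the scores and labels below are illustrative, not from the paper):

```python
import numpy as np

def auc(y, scores):
    """AUC as the Wilcoxon-Mann-Whitney statistic; ties count half."""
    pos, neg = scores[y == 1], scores[y == 0]
    wins = (pos[:, None] > neg[None, :]).sum()   # positive ranked above negative
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (len(pos) * len(neg))

y = np.array([0, 0, 1, 1])
a = np.array([0.2, 0.6, 0.4, 0.9])   # one negative ranked above a positive
b = np.array([0.2, 0.6, 0.7, 0.9])   # every positive above every negative
```

Here `auc(y, a)` is 0.75 because one of the four positive/negative pairs is mis-ordered, while `auc(y, b)` is 1.0; unlike accuracy, the measure depends only on the ordering of the scores, not on any threshold.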
Predicting Good Probabilities with Supervised Learning
 In Proc. Int. Conf. on Machine Learning (ICML), 2005
"... We examine the relationship between the predictions made by different learning algorithms and true posterior probabilities. We show that maximum margin methods such as boosted trees and boosted stumps push probability mass away from 0 and 1 yielding a characteristic sigmoid shaped distortion i ..."
Abstract

Cited by 89 (7 self)
We examine the relationship between the predictions made by different learning algorithms and true posterior probabilities. We show that maximum margin methods such as boosted trees and boosted stumps push probability mass away from 0 and 1, yielding a characteristic sigmoid-shaped distortion in the predicted probabilities. Models such as Naive Bayes, which make unrealistic independence assumptions, push probabilities toward 0 and 1. Other models, such as neural nets and bagged trees, do not have these biases and predict well-calibrated probabilities. We experiment with two ways of correcting the biased probabilities predicted by some learning methods: Platt Scaling and Isotonic Regression. We qualitatively examine what kinds of distortions these calibration methods are suitable for and quantitatively examine how much data they need to be effective. The empirical results show that after calibration, boosted trees, random forests, and SVMs predict the best probabilities.
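Platt Scaling corrects a sigmoid-shaped distortion by fitting a two-parameter sigmoid that maps raw scores to probabilities. The sketch below fits p = sigmoid(a·f + b) by plain gradient descent on the log loss; Platt's original procedure uses smoothed targets and a Newton-style optimizer, so treat this as an illustrative approximation on hypothetical data, not the method as published.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def platt_fit(scores, y, lr=0.1, steps=5000):
    """Fit p = sigmoid(a*f + b) by gradient descent on the log loss."""
    a, b = 1.0, 0.0
    for _ in range(steps):
        p = sigmoid(a * scores + b)
        g = p - y                      # gradient of log loss w.r.t. the logit
        a -= lr * np.mean(g * scores)
        b -= lr * np.mean(g)
    return lambda f: sigmoid(a * f + b)

# Hypothetical uncalibrated margins with their true labels
scores = np.array([-2.0, -1.0, -0.5, 0.5, 1.0, 2.0])
y      = np.array([ 0,    0,    0,   1,   1,   1  ])
calibrate = platt_fit(scores, y)
```

Because the fitted map is a monotone sigmoid, calibration changes the probability values (and hence squared error and cross entropy) without changing the ordering of the predictions, so ranking metrics such as AUC are unaffected.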
Learning Bayesian network classifiers by maximizing conditional likelihood
 In ICML 2004
"... Bayesian networks are a powerful probabilistic representation, and their use for classification has received considerable attention. However, they tend to perform poorly when learned in the standard way. This is attributable to a mismatch between the objective function used (likelihood or a function ..."
Abstract

Cited by 85 (0 self)
Bayesian networks are a powerful probabilistic representation, and their use for classification has received considerable attention. However, they tend to perform poorly when learned in the standard way. This is attributable to a mismatch between the objective function used (likelihood or a function thereof) and the goal of classification (maximizing accuracy or conditional likelihood). Unfortunately, the computational cost of optimizing structure and parameters for conditional likelihood is prohibitive. In this paper we show that a simple approximation (choosing structures by maximizing conditional likelihood while setting parameters by maximum likelihood) yields good results. On a large suite of benchmark datasets, this approach produces better class probability estimates than naive Bayes, TAN, and generatively trained Bayesian networks.
Finding latent code errors via machine learning over program executions
 In ICSE, 2004
"... This paper proposes a technique for identifying program properties that indicate errors. The technique generates machine learning models of program properties known to result from errors, and applies these models to program properties of userwritten code to classify and rank properties that may lea ..."
Abstract

Cited by 81 (6 self)
This paper proposes a technique for identifying program properties that indicate errors. The technique generates machine learning models of program properties known to result from errors, and applies these models to program properties of user-written code to classify and rank properties that may lead the user to errors. Given a set of properties produced by the program analysis, the technique selects a subset of properties that are most likely to reveal an error. An implementation, the Fault Invariant Classifier, demonstrates the efficacy of the technique. The implementation uses dynamic invariant detection to generate program properties. It uses support vector machine and decision tree learning tools to classify those properties. In our experimental evaluation, the technique increases the relevance (the concentration of fault-revealing properties) by a factor of 50 on average for the C programs, and 4.8 for the Java programs. Preliminary experience suggests that most of the fault-revealing properties do lead a programmer to an error.
Data Mining in Metric Space: An Empirical Analysis of Supervised Learning Performance Criteria
2004
"... Many criteria can be used to evaluate the performance of supervised learning. Different criteria are appropriate in different settings, and it is not always clear which criteria to use. A further complication is that learning methods that perform well on one criterion may not perform well on other c ..."
Abstract

Cited by 69 (3 self)
Many criteria can be used to evaluate the performance of supervised learning. Different criteria are appropriate in different settings, and it is not always clear which criteria to use. A further complication is that learning methods that perform well on one criterion may not perform well on other criteria. For example, SVMs and boosting are designed to optimize accuracy, whereas neural nets typically optimize squared error or cross entropy. We conducted an empirical study using a variety of learning methods (SVMs, neural nets, k-nearest neighbor, bagged and boosted trees, and boosted stumps) to compare nine boolean classification performance metrics: Accuracy, Lift, F-Score, Area under the ROC Curve, Average Precision, Precision/Recall Break-Even Point, Squared Error, Cross Entropy, and Probability Calibration. Multidimensional scaling (MDS) shows that these metrics span a low-dimensional manifold. The three metrics that are appropriate when predictions are interpreted as probabilities (squared error, cross entropy, and calibration) lie in one part of metric space, far away from metrics that depend on the relative order of the predicted values: ROC area, average precision, break-even point, and lift. In between them fall two metrics that depend on comparing predictions to a threshold: accuracy and F-score. As expected, maximum margin methods such as SVMs and boosted trees have excellent performance on metrics like accuracy, but perform poorly on probability metrics such as squared error. What was not expected was that the margin methods have excellent performance on ordering metrics such as ROC area and average precision. We introduce a new metric, SAR, that combines squared error, accuracy, and ROC area into one metric. MDS and correlation analysis show that SAR is centrally located and correlates well with other metrics, suggesting that it is a good general-purpose metric to use when more specific criteria are not known.
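The combined metric can be sketched directly from its three ingredients. This assumes the common formulation SAR = (Accuracy + AUC + (1 − RMS error)) / 3, with each term oriented so higher is better; the example data is illustrative, not from the paper.

```python
import numpy as np

def sar(y, p):
    """SAR = (ACC + AUC + (1 - RMS)) / 3, averaging a threshold metric,
    an ordering metric, and a probability metric into one score."""
    acc = np.mean((p >= 0.5) == y)               # accuracy at threshold 0.5
    rms = np.sqrt(np.mean((p - y) ** 2))         # root mean squared error
    pos, neg = p[y == 1], p[y == 0]
    wins = (pos[:, None] > neg[None, :]).sum()   # Wilcoxon-Mann-Whitney AUC
    ties = (pos[:, None] == neg[None, :]).sum()
    auc = (wins + 0.5 * ties) / (len(pos) * len(neg))
    return (acc + auc + (1.0 - rms)) / 3.0

y = np.array([0, 0, 1, 1])
perfect  = np.array([0.0, 0.0, 1.0, 1.0])
overconf = np.array([0.0, 1.0, 1.0, 1.0])   # one confident mistake
```

A single confident mistake is penalized three times over (it lowers accuracy, the ranking, and the squared error at once), which is why SAR lands centrally among the nine metrics in the MDS analysis.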
Using relational knowledge discovery to prevent securities fraud
 In Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, 2005
"... We describe an application of relational knowledge discovery to a key regulatory mission of the National Association of Securities Dealers (NASD). NASD is the world’s largest privatesector securities regulator, with responsibility for preventing and discovering misconduct among securities brokers. ..."
Abstract

Cited by 65 (16 self)
We describe an application of relational knowledge discovery to a key regulatory mission of the National Association of Securities Dealers (NASD). NASD is the world's largest private-sector securities regulator, with responsibility for preventing and discovering misconduct among securities brokers. Our goal was to help focus NASD's limited regulatory resources on the brokers who are most likely to engage in securities violations. Using statistical relational learning algorithms, we developed models that rank brokers with respect to the probability that they would commit a serious violation of securities regulations in the near future. Our models incorporate organizational relationships among brokers (e.g., past co-worker), which domain experts consider important but have not been easily used before now. The learned models were subjected to an extensive evaluation using more than 18 months of data unseen by the model developers and comprising over two person-weeks of effort by NASD staff. Model predictions were found to correlate highly with the subjective evaluations of experienced NASD examiners. Furthermore, in all performance measures, our models performed as well as or better than the hand-crafted rules that are currently in use at NASD.
Estimation of Conditional Probabilities with Decision Trees and an Application to Fine-Grained POS Tagging
 In COLING 2008
"... We present a HMM partofspeech tagging method which is particularly suited for POS tagsets with a large number of finegrained tags. It is based on three ideas: (1) splitting of the POS tags into attribute vectors and decomposition of the contextual POS probabilities of the HMM into a product of at ..."
Abstract

Cited by 48 (3 self)
We present an HMM part-of-speech tagging method which is particularly suited for POS tagsets with a large number of fine-grained tags. It is based on three ideas: (1) splitting the POS tags into attribute vectors and decomposing the contextual POS probabilities of the HMM into a product of attribute probabilities, (2) estimating the contextual probabilities with decision trees, and (3) using high-order HMMs. In experiments on German and Czech data, our tagger outperformed state-of-the-art POS taggers.