Results 1–10 of 27
Predicting Good Probabilities with Supervised Learning
In Proc. Int. Conf. on Machine Learning (ICML), 2005
Cited by 57 (7 self)
Abstract
We examine the relationship between the predictions made by different learning algorithms and true posterior probabilities. We show that maximum margin methods such as boosted trees and boosted stumps push probability mass away from 0 and 1, yielding a characteristic sigmoid-shaped distortion in the predicted probabilities. Models such as Naive Bayes, which make unrealistic independence assumptions, push probabilities toward 0 and 1. Other models, such as neural nets and bagged trees, do not have these biases and predict well-calibrated probabilities. We experiment with two ways of correcting the biased probabilities predicted by some learning methods: Platt Scaling and Isotonic Regression. We qualitatively examine what kinds of distortions these calibration methods are suitable for and quantitatively examine how much data they need to be effective. The empirical results show that after calibration boosted trees, random forests, and SVMs predict the best probabilities.
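Platt Scaling, one of the two calibration methods this abstract compares, fits a sigmoid to a model's raw scores on held-out data. A minimal sketch in plain Python; the function name and the simple gradient-descent fitter are illustrative stand-ins (Platt's original method uses a regularized maximum-likelihood fit with smoothed targets):

```python
import math

def platt_scale(scores, labels, lr=0.1, iters=2000):
    """Fit p = sigmoid(A*s + B) to (score, label) pairs by plain
    gradient descent on the log-loss.  Returns the fitted (A, B)."""
    A, B = 0.0, 0.0
    n = len(scores)
    for _ in range(iters):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            p = 1.0 / (1.0 + math.exp(-(A * s + B)))
            grad_a += (p - y) * s   # d(log-loss)/dA for this example
            grad_b += (p - y)       # d(log-loss)/dB
        A -= lr * grad_a / n
        B -= lr * grad_b / n
    return A, B

# Calibrate raw margins against 0/1 labels, then map a new score.
A, B = platt_scale([-2.0, -1.0, -0.5, 0.5, 1.0, 2.0], [0, 0, 0, 1, 1, 1])
calibrated = 1.0 / (1.0 + math.exp(-(A * 1.5 + B)))
```

Isotonic Regression, the paper's other method, replaces the sigmoid with a monotone step function, which needs more calibration data but can correct non-sigmoid distortions.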
Data Mining in Metric Space: An Empirical Analysis of Supervised Learning Performance Criteria
2004
Cited by 49 (2 self)
Abstract
Many criteria can be used to evaluate the performance of supervised learning. Different criteria are appropriate in different settings, and it is not always clear which criteria to use. A further complication is that learning methods that perform well on one criterion may not perform well on other criteria. For example, SVMs and boosting are designed to optimize accuracy, whereas neural nets typically optimize squared error or cross entropy. We conducted an empirical study using a variety of learning methods (SVMs, neural nets, k-nearest neighbor, bagged and boosted trees, and boosted stumps) to compare nine boolean classification performance metrics: Accuracy, Lift, F-Score, Area under the ROC Curve, Average Precision, Precision/Recall Break-Even Point, Squared Error, Cross Entropy, and Probability Calibration. Multidimensional scaling (MDS) shows that these metrics span a low-dimensional manifold. The three metrics that are appropriate when predictions are interpreted as probabilities (squared error, cross entropy, and calibration) lie in one part of metric space, far away from metrics that depend on the relative order of the predicted values: ROC area, average precision, break-even point, and lift. In between them fall two metrics that depend on comparing predictions to a threshold: accuracy and F-score. As expected, maximum margin methods such as SVMs and boosted trees have excellent performance on metrics like accuracy, but perform poorly on probability metrics such as squared error. What was not expected was that the margin methods have excellent performance on ordering metrics such as ROC area and average precision. We introduce a new metric, SAR, that combines squared error, accuracy, and ROC area into one metric. MDS and correlation analysis show that SAR is centrally located and correlates well with other metrics, suggesting that it is a good general-purpose metric to use when more specific criteria are not known.
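The SAR metric introduced here averages a squared-error term, accuracy, and ROC area. A sketch assuming the combination SAR = (Accuracy + AUC + (1 − RMSE)) / 3, with accuracy taken at a 0.5 threshold and AUC computed as the Mann–Whitney statistic; the exact normalization is an assumption on my part, so treat this as an approximation of the paper's definition:

```python
def sar(probs, labels):
    """SAR-style score combining accuracy, ROC area and squared error.
    probs: predicted probabilities in [0, 1]; labels: 0/1 outcomes."""
    n = len(probs)
    acc = sum((p >= 0.5) == bool(y) for p, y in zip(probs, labels)) / n
    rmse = (sum((p - y) ** 2 for p, y in zip(probs, labels)) / n) ** 0.5
    # AUC as the probability a random positive outranks a random negative
    pos = [p for p, y in zip(probs, labels) if y == 1]
    neg = [p for p, y in zip(probs, labels) if y == 0]
    wins = sum((pp > pn) + 0.5 * (pp == pn) for pp in pos for pn in neg)
    auc = wins / (len(pos) * len(neg))
    return (acc + auc + (1.0 - rmse)) / 3.0
```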
Probabilistic forecasts, calibration and sharpness
Journal of the Royal Statistical Society, Series B, 2007
Cited by 38 (15 self)
Abstract
Probabilistic forecasts of continuous variables take the form of predictive densities or predictive cumulative distribution functions. We propose a diagnostic approach to the evaluation of predictive performance that is based on the paradigm of maximizing the sharpness of the predictive distributions subject to calibration. Calibration refers to the statistical consistency between the distributional forecasts and the observations, and is a joint property of the predictions and the events that materialize. Sharpness refers to the concentration of the predictive distributions and is a property of the forecasts only. A simple theoretical framework allows us to distinguish between probabilistic calibration, exceedance calibration and marginal calibration. We propose and study tools for checking calibration and sharpness, among them the probability integral transform histogram, marginal calibration plots, the sharpness diagram and proper scoring rules. The diagnostic approach is illustrated by an assessment and ranking of probabilistic forecasts of wind speed at the Stateline wind energy centre in the US Pacific Northwest. In combination with cross-validation or in the time series context, our proposal provides very general, nonparametric alternatives to the use of information criteria for model diagnostics and model selection.
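The probability integral transform (PIT) histogram mentioned above is simple to compute when each forecast is a predictive CDF: evaluate the CDF at the observed value and histogram the results. A sketch assuming Gaussian predictive distributions (the Gaussian form and the choice of ten bins are illustrative assumptions):

```python
import math

def normal_cdf(x, mu, sigma):
    """CDF of a Gaussian N(mu, sigma^2) via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def pit_histogram(obs, forecasts, bins=10):
    """forecasts: one (mu, sigma) Gaussian predictive distribution per
    observation.  Returns bin counts of the PIT values F_t(y_t); a flat
    histogram is the visual signature of probabilistic calibration."""
    counts = [0] * bins
    for y, (mu, sigma) in zip(obs, forecasts):
        u = normal_cdf(y, mu, sigma)
        counts[min(int(u * bins), bins - 1)] += 1
    return counts
```

U-shaped counts indicate underdispersed forecasts; hump-shaped counts indicate overdispersed ones.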
Bayesian Modeling of Uncertainty in Ensembles of Climate Models
2008
Cited by 25 (6 self)
Abstract
Projections of future climate change caused by increasing greenhouse gases depend critically on numerical climate models coupling the ocean and atmosphere (GCMs). However, different models differ substantially in their projections, which raises the question of how the different models can best be combined into a probability distribution of future climate change. For this analysis, we have collected both current and future projected mean temperatures produced by nine climate models for 22 regions of the earth. We also have estimates of current mean temperatures from actual observations, together with standard errors, that can be used to calibrate the climate models. We propose a Bayesian analysis that allows us to combine the different climate models into a posterior distribution of future temperature increase for each of the 22 regions, while allowing the different climate models to have different variances. Two versions of the analysis are proposed: a univariate analysis in which each region is analyzed separately, and a multivariate analysis in which the 22 regions are combined into an overall statistical model. A cross-validation approach is proposed to confirm the reasonableness of our Bayesian predictive distributions. The results of this analysis allow for a quantification of the uncertainty of climate model projections as a Bayesian posterior distribution, substantially extending previous approaches to uncertainty in climate models.
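A drastically simplified sketch of the core combination step: if each climate model reports a projection with a known standard error, a flat-prior Gaussian analysis combines them by precision weighting. This is only an illustration of the idea; the paper's hierarchical model additionally infers the per-model variances and couples the 22 regions:

```python
def precision_weighted_mean(values, sds):
    """Combine model projections x_i with known standard errors s_i:
    under a flat prior and independent Gaussian likelihoods, the
    posterior mean is the precision-weighted average.  The paper's
    hierarchical Bayesian model goes well beyond this toy version."""
    weights = [1.0 / s ** 2 for s in sds]
    return sum(w * x for w, x in zip(weights, values)) / sum(weights)
```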
Obtaining calibrated probabilities from boosting
In Proc. 21st Conference on Uncertainty in Artificial Intelligence (UAI’05), 2005
Cited by 17 (1 self)
Abstract
Boosted decision trees typically yield good accuracy, precision, and ROC area. However, because the outputs from boosting are not well-calibrated posterior probabilities, boosting yields poor squared error and cross-entropy. We empirically demonstrate why AdaBoost predicts distorted probabilities and examine three calibration methods for correcting this distortion: Platt Scaling, Isotonic Regression, and Logistic Correction. We also experiment with boosting using log-loss instead of the usual exponential loss. Experiments show that Logistic Correction and boosting with log-loss work well when boosting weak models such as decision stumps, but yield poor performance when boosting more complex models such as full decision trees. Platt Scaling and Isotonic Regression, however, significantly improve the probabilities predicted by both boosted stumps and boosted trees. After calibration, boosted full decision trees predict better probabilities than other learning methods such as SVMs, neural nets, bagged decision trees, and KNNs, even after these methods are calibrated.
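Logistic Correction, the third method examined here, comes from the statistical view of boosting: if F(x) is the additive score produced by AdaBoost with labels in {-1, +1}, the implied class probability is 1 / (1 + exp(-2 F(x))). A one-line sketch:

```python
import math

def logistic_correction(margin):
    """Map an AdaBoost score F(x) to a probability using the
    logistic link p(y=1 | x) = 1 / (1 + exp(-2 * F(x)))."""
    return 1.0 / (1.0 + math.exp(-2.0 * margin))
```

As the abstract notes, this closed-form link works well for boosted stumps but not for boosted full trees, where the fitted Platt sigmoid or an isotonic map is needed instead.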
An Experimental Comparison of Performance Measures for Classification
2007
Cited by 13 (5 self)
Abstract
Performance metrics in classification are fundamental to assessing the quality of learning methods and learned models. However, many different measures have been defined in the literature with the aim of making better choices in general or for a specific application area. Choices made by one metric are claimed to be different from choices made by other metrics. In this work we experimentally analyse the behaviour of 18 different performance metrics in several scenarios, identifying clusters and relationships between measures. We also perform a sensitivity analysis for all of them in terms of several traits: class threshold choice, separability/ranking quality, calibration performance, and sensitivity to changes in prior class distribution. From the definitions and the experiments, we give a comprehensive analysis of the relationships between metrics, and a taxonomy and arrangement of them according to the previous traits. This can be useful for choosing the most adequate measure (or set of measures) for a specific application. Additionally, the study highlights some niches in which new measures might be defined, and shows that some supposedly innovative measures make the same choices (or almost the same) as existing ones. Finally, this work can also be used as a reference for comparing experimental results in the pattern recognition and machine learning literature when different measures are used.
Estimating and Evaluating Confidence for Forensic Speaker Recognition
Presented at ICASSP, 2005
Cited by 12 (1 self)
Abstract
Estimating and evaluating confidence has become a key aspect of the speaker recognition problem because of the increased use of this technology in forensic applications. We discuss evaluation measures for speaker recognition and some of their properties. We then propose a framework for confidence estimation based upon scores and meta-information, such as utterance duration, channel type, and SNR. The framework uses regression techniques with multilayer perceptrons to estimate confidence with a data-driven methodology. As an application, we show the use of the framework in a speaker comparison task drawn from the NIST 2000 evaluation. A relative comparison of different types of meta-information is given. We demonstrate that the new framework can give substantial improvements over standard distribution methods of estimating confidence.
The Maximum Entropy Approach and Probabilistic IR Models
ACM Transactions on Information Systems, 1998
Cited by 12 (0 self)
Abstract
The Principle of Maximum Entropy is discussed, and two classic probabilistic models of information retrieval, the Binary Independence Model of Robertson and Sparck Jones and the Combination Match Model of Croft and Harper, are derived using the maximum entropy approach. The assumptions on which the classical models are based are not made. In their place, the probability distribution of maximum entropy consistent with a set of constraints is determined. It is argued that this subjectivist approach is more philosophically coherent than the frequentist conceptualization of probability that is often assumed as the basis of probabilistic modeling, and that this philosophical stance has important practical consequences with respect to the realization of information retrieval research.
Inducing models of black-box storage arrays
2004
Cited by 12 (0 self)
Abstract
Keywords: statistical model induction, storage arrays, I/O response time prediction, performance model induction.
This paper applies statistical model-induction techniques to the problem of forecasting response times in storage systems. Our work differs from prior research in several ways: we regard storage systems as black boxes; we automatically induce models rather than constructing them from detailed expert knowledge; we use lightweight passive observations, rather than extensive controlled experiments, to collect input data; we forecast individual response times rather than aggregates or averages; and we focus on large and complex enterprise storage arrays that comprise many RAID groups. We evaluate our methods using a lengthy storage trace collected in a real-world environment, and measure the predictive value of information available when requests are issued. This paper makes several contributions. First, we quantify the potential of a class of statistical methods for the challenging problem of automatic performance model induction. Second, we quantify improvements in accuracy that result when the range of information available to our models increases. Finally, we describe a general, low-cost modeling methodology that can be applied to a wide range of storage arrays.
Emulating DNA: Rigorous quantification of evidential weight in transparent and testable forensic speaker recognition
IEEE Transactions on Audio, Speech and Language Processing, 2007
Cited by 9 (5 self)
Abstract
Forensic DNA profiling is acknowledged as the model for a scientifically defensible approach in forensic identification science, as it meets the most stringent court admissibility requirements demanding transparency in scientific evaluation of evidence and testability of systems and protocols. In this paper, we propose a unified approach to forensic speaker recognition (FSR) oriented to fulfil these admissibility requirements within a framework which is transparent, testable, and understandable, both for scientists and fact-finders. We show how the evaluation of DNA evidence, which is based on a probabilistic similarity–typicality metric in the form of likelihood ratios (LR), can also be generalized to continuous LR estimation, thus providing a common framework for phonetic–linguistic methods and automatic systems. We highlight the importance of calibration, and we exemplify with LRs from diphthongal F-patterns and LRs in NIST SRE06 tasks. The application of the proposed approach in daily casework remains a sensitive issue, and special caution is enjoined. Our objective is to show how traditional and automatic FSR methodologies can be transparent and testable, while remaining conscious of present limitations. We conclude with a discussion of the combined use of traditional and automatic approaches and current challenges for the admissibility of speech evidence.
Index Terms: Admissibility of speech evidence, calibration, Daubert, deoxyribonucleic acid (DNA), forensic speaker recognition (FSR), likelihood ratio (LR).
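The continuous likelihood-ratio evaluation described above reduces, in its simplest form, to comparing the density of an observed recognizer score under the two competing hypotheses. A toy sketch with Gaussian score models; the Gaussian assumption and the parameter values are illustrative, not the paper's actual calibration procedure:

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def likelihood_ratio(score, same_params, diff_params):
    """LR = p(score | same speaker) / p(score | different speakers),
    with each hypothesis modelled as a Gaussian (mu, sigma) over the
    recognizer score.  LR > 1 supports the same-speaker hypothesis."""
    return gaussian_pdf(score, *same_params) / gaussian_pdf(score, *diff_params)
```

Calibration in this setting means that the reported LR values are themselves statistically reliable, which is exactly the property the paper argues must be testable.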