Results 1–10 of 26
Strictly Proper Scoring Rules, Prediction, and Estimation
, 2007
Abstract

Cited by 143 (17 self)
Scoring rules assess the quality of probabilistic forecasts, by assigning a numerical score based on the predictive distribution and on the event or value that materializes. A scoring rule is proper if the forecaster maximizes the expected score for an observation drawn from the distribution F if he or she issues the probabilistic forecast F, rather than G ≠ F. It is strictly proper if the maximum is unique. In prediction problems, proper scoring rules encourage the forecaster to make careful assessments and to be honest. In estimation problems, strictly proper scoring rules provide attractive loss and utility functions that can be tailored to the problem at hand. This article reviews and develops the theory of proper scoring rules on general probability spaces, and proposes and discusses examples thereof. Proper scoring rules derive from convex functions and relate to information measures, entropy functions, and Bregman divergences. In the case of categorical variables, we prove a rigorous version of the Savage representation. Examples of scoring rules for probabilistic forecasts in the form of predictive densities include the logarithmic, spherical, pseudospherical, and quadratic scores. The continuous ranked probability score applies to probabilistic forecasts that take the form of predictive cumulative distribution functions. It generalizes the absolute error and forms a special case of a new and very general type of score, the energy score. Like many other scoring rules, the energy score admits a kernel representation in terms of negative definite functions, with links to inequalities of Hoeffding type, in both univariate and multivariate settings. Proper scoring rules for quantile and interval forecasts are also discussed. We relate proper scoring rules to Bayes factors and to cross-validation, and propose a novel form of cross-validation known as random-fold cross-validation. A case study on probabilistic weather forecasts in the North American Pacific Northwest illustrates the importance of propriety. We note optimum score approaches to point and quantile …
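The kernel representation mentioned in this abstract makes the continuous ranked probability score straightforward to compute for an ensemble forecast: CRPS(F, y) = E|X − y| − ½ E|X − X′|, with X, X′ independent draws from F. A minimal sketch (the function name and NumPy usage are ours, not from the paper):

```python
import numpy as np

def crps_ensemble(members, obs):
    """Continuous ranked probability score for an ensemble forecast,
    via the kernel (energy-score) representation:
        CRPS = E|X - obs| - 0.5 * E|X - X'|,
    with expectations taken over the empirical ensemble distribution."""
    members = np.asarray(members, dtype=float)
    term1 = np.mean(np.abs(members - obs))
    term2 = 0.5 * np.mean(np.abs(members[:, None] - members[None, :]))
    return term1 - term2
```

With a single-member ensemble the second term vanishes, so the score reduces to the absolute error, illustrating the generalization the abstract describes.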
Interpretation Of Rank Histograms For Verifying Ensemble Forecasts
, 2000
Abstract

Cited by 49 (5 self)
Rank histograms are a tool for evaluating ensemble forecasts. They are useful for determining the reliability of ensemble forecasts and for diagnosing errors in their mean and spread. Rank histograms are generated by repeatedly tallying the rank of the verification (usually, an observation) relative to values from an ensemble sorted from lowest to highest. However, an uncritical use of the rank histogram can lead to misinterpretations of the qualities of that ensemble. For example, a flat rank histogram, usually taken as a sign of reliability, can still be generated from unreliable ensembles. Similarly, a U-shaped rank histogram, commonly understood as indicating a lack of variability in the ensemble, can also be a sign of conditional bias. It is also shown that flat rank histograms can be generated for some model variables if the variance of the ensemble is correctly specified, yet if covariances between model grid points are improperly specified, rank histograms for combinations of mo...
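The tallying procedure described in this abstract can be sketched in a few lines (a simplified illustration that ignores ties between the observation and ensemble members; the function name is ours):

```python
import numpy as np

def rank_histogram(ensembles, observations, n_members):
    """Tally the rank of each verifying observation within its sorted
    ensemble. A reliable ensemble yields a roughly flat histogram; a
    U shape suggests under-dispersion or, per the abstract, possibly
    conditional bias."""
    counts = np.zeros(n_members + 1, dtype=int)
    for ens, obs in zip(ensembles, observations):
        # Rank = number of ensemble members falling below the observation.
        rank = int(np.sum(np.sort(np.asarray(ens, dtype=float)) < obs))
        counts[rank] += 1
    return counts
```

An observation that falls below every member lands in bin 0; one that exceeds every member lands in bin n_members, so persistent outliers inflate the histogram's end bins.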
Hypothesis tests for evaluating numerical precipitation forecasts
 Wea. Forecasting
, 1999
Abstract

Cited by 20 (5 self)
When evaluating differences between competing precipitation forecasts, formal hypothesis testing is rarely performed. This may be due to the difficulty in applying common tests given the spatial correlation and non-normality of errors. Possible ways around these difficulties are explored here. Two datasets of precipitation forecasts are evaluated: a set of two competing gridded precipitation forecasts from operational weather prediction models, and sets of competing probabilistic quantitative precipitation forecasts from model output statistics and from an ensemble of forecasts. For each test, data from each competing forecast are collected into one sample for each case day to avoid problems with spatial correlation. Next, several possible hypothesis test methods are evaluated: the paired t test, the nonparametric Wilcoxon signed-rank test, and two resampling tests. The more involved resampling test methodology is the most appropriate when testing threat scores from nonprobabilistic forecasts. The simpler paired t test or Wilcoxon test is appropriate to use in testing the skill of probabilistic forecasts evaluated with the ranked probability score.
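The paired-comparison setup described here, one pooled score per case day per forecast system, maps directly onto standard SciPy routines. A hedged sketch (function name and return format are ours; the paper's exact resampling tests are not reproduced):

```python
from scipy import stats

def compare_forecast_scores(scores_a, scores_b):
    """Paired t test and Wilcoxon signed-rank test on per-case-day
    verification scores (e.g. ranked probability scores) from two
    competing forecast systems. Pooling each day's grid points into one
    score per day sidesteps the spatial-correlation problem noted in
    the abstract."""
    t_stat, t_p = stats.ttest_rel(scores_a, scores_b)
    w_stat, w_p = stats.wilcoxon(scores_a, scores_b)
    return {"t_p": float(t_p), "wilcoxon_p": float(w_p)}
```

A small p-value from either test indicates that the day-to-day score differences between the two systems are unlikely under the null hypothesis of equal skill.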
Simulation of interannual variability of tropical storm frequency in an ensemble of GCM integrations
 J. Climate
, 1997
Abstract

Cited by 16 (4 self)
The present study examines the simulation of the number of tropical storms produced in GCM integrations with a prescribed SST. A nine-member ensemble of 10-yr integrations (1979–88) of a T42 atmospheric model forced by observed SSTs has been produced; each ensemble member differs only in the initial atmospheric conditions. An objective procedure for tracking model-generated tropical storms is applied to this ensemble during the last 9 yr of the integrations (1980–88). The seasonal and monthly variations of tropical storm numbers are compared with observations for each ocean basin. Statistical tools such as the chi-square test, the F test, and the t test are applied to the ensemble number of tropical storms, leading to the conclusion that the potential predictability is particularly strong over the western North Pacific and the eastern North Pacific, and to a lesser extent over the western North Atlantic. A set of tools including the joint probability distribution and the ranked probability score are used to evaluate the skill of this ensemble simulation. The simulation skill over the western North Atlantic basin appears to be exceptionally high, particularly during years of strong potential predictability.
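The ranked probability score used in this and several other abstracts above is, in its common discrete form, the sum of squared differences between cumulative forecast probabilities and the cumulative observation indicator. A minimal sketch (normalization conventions vary across papers, e.g. division by the number of categories minus one; we use the unnormalized sum):

```python
import numpy as np

def ranked_probability_score(probs, obs_category):
    """Ranked probability score for a categorical probabilistic forecast:
    sum of squared differences between the cumulative forecast
    probabilities and the cumulative step function of the observed
    category. Lower is better; 0 is a perfect, confident forecast."""
    probs = np.asarray(probs, dtype=float)
    cum_forecast = np.cumsum(probs)
    # Cumulative observation: 0 below the observed category, 1 at and above it.
    cum_obs = (np.arange(len(probs)) >= obs_category).astype(float)
    return float(np.sum((cum_forecast - cum_obs) ** 2))
```

Because it penalizes distance in category space, not just a miss, the RPS rewards forecasts that concentrate probability near the observed category.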
Evaluation of Eta–RSM ensemble probabilistic precipitation forecasts
 Mon. Wea. Rev
, 1998
Abstract

Cited by 16 (0 self)
The accuracy of short-range probabilistic forecasts of quantitative precipitation (PQPF) from the experimental Eta–Regional Spectral Model ensemble is compared with the accuracy of forecasts from the Nested Grid Model’s model output statistics (MOS) over a set of 13 case days from September 1995 through January 1996. Ensembles adjusted to compensate for deficiencies noted in prior forecasts were found to be more skillful than MOS for all precipitation categories except the basic probability of measurable precipitation. Gamma distributions fit to the corrected ensemble probability distributions provided an additional small improvement. Interestingly, despite the favorable comparison with MOS forecasts, this ensemble configuration showed no ability to “forecast the forecast skill” of precipitation; that is, the ensemble was not able to forecast the variable specificity of the ensemble probability distribution from day to day and location to location. Probability forecasts from gamma distributions developed as a function of the ensemble mean alone were as skillful at PQPF as forecasts from distributions whose specificity varied with the spread of the ensemble. Since forecasters desire information on forecast uncertainty from the ensemble, these results suggest that future ensemble configurations should be checked carefully for their presumed ability to forecast uncertainty.
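The idea of smoothing ensemble PQPF with a fitted gamma distribution can be sketched as follows. This is our own illustration, fitting by the method of moments; the paper's actual fitting procedure and bias corrections may differ:

```python
import numpy as np
from scipy import stats

def gamma_pqpf(ensemble_precip, threshold):
    """Probability that precipitation exceeds `threshold`, from a gamma
    distribution fitted by the method of moments to a (positive)
    precipitation ensemble. A sketch of replacing the raw ensemble
    relative frequency with a smooth fitted distribution."""
    x = np.asarray(ensemble_precip, dtype=float)
    mean, var = x.mean(), x.var(ddof=1)
    shape = mean ** 2 / var   # method-of-moments gamma parameters
    scale = var / mean
    # Survival function = 1 - CDF = exceedance probability.
    return float(stats.gamma.sf(threshold, a=shape, scale=scale))
```

Unlike the raw ensemble relative frequency, the fitted distribution can assign nonzero probability to thresholds beyond the largest ensemble member, which is one motivation for the smoothing.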
Predictive model assessment for count data
, 2007
Abstract

Cited by 9 (1 self)
Summary. We discuss tools for the evaluation of probabilistic forecasts and the critique of statistical models for ordered discrete data. Our proposals include a non-randomized version of the probability integral transform, marginal calibration diagrams, and proper scoring rules, such as the predictive deviance. In case studies, we critique count regression models for patent data, and assess the predictive performance of Bayesian age–period–cohort models for larynx cancer counts in Germany.
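For count data the usual probability integral transform F(y) is not uniform even under a perfect forecast, which is what motivates the non-randomized version mentioned here: a conditional PIT that interpolates across the probability mass at the observed count. A sketch of one common formulation (our own function and argument names):

```python
def nonrandomized_pit(F, y, u):
    """Non-randomized probability integral transform for a count
    observation y with predictive CDF F, evaluated at u in [0, 1]:
        0                              if u <= F(y - 1),
        (u - F(y-1)) / (F(y) - F(y-1)) if F(y-1) < u < F(y),
        1                              if u >= F(y).
    Averaging this function over many observations and plotting it as a
    histogram gives a PIT diagram with no auxiliary randomization."""
    lo = F(y - 1) if y > 0 else 0.0
    hi = F(y)
    if u <= lo:
        return 0.0
    if u >= hi:
        return 1.0
    return (u - lo) / (hi - lo)
```

Under a calibrated forecast the averaged transform is close to the identity, so its histogram is approximately flat, mirroring the role of the continuous PIT.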
The geometry of proper scoring rules
, 2007
Abstract

Cited by 9 (0 self)
A decision problem is defined in terms of an outcome space, an action space and a loss function. Starting from these simple ingredients, we …
Seasonal tropical cyclone forecasts
 WMO Bulletin
, 2007
Abstract

Cited by 8 (2 self)
Seasonal forecasts of tropical cyclone activity in various regions have been developed since the first attempts in the early 1980s by Neville …
Diagnostic verification of the Climate Prediction Center long-lead outlooks
 J. Climate
, 2000
Abstract

Cited by 4 (0 self)
The performance of the Climate Prediction Center’s long-lead forecasts for the period 1995–98 is assessed through a diagnostic verification, which involves examination of the full joint frequency distributions of the forecasts and the corresponding observations. The most striking results of the verifications are the strong cool and dry biases of the outlooks. These seem clearly related to the 1995–98 period being warmer and wetter than the 1961–90 climatological base period. This bias results in the ranked probability score indicating very low skill for both temperature and precipitation forecasts at all leads. However, the temperature forecasts at all leads, and the precipitation forecasts for leads up to a few months, exhibit very substantial resolution: low (high) forecast probabilities are consistently associated with lower (higher) than average relative frequency of event occurrence, even though these relative frequencies differ substantially (because of the unconditional biases) from the forecast probabilities. Conditional biases, related to systematic under- or overconfidence on the part of the forecasters, are also evident in some circumstances.