Greedy Function Approximation: A Gradient Boosting Machine
 Annals of Statistics
, 2000
Cited by 564 (12 self)
Function approximation is viewed from the perspective of numerical optimization in function space, rather than parameter space. A connection is made between stagewise additive expansions and steepest-descent minimization. A general gradient-descent "boosting" paradigm is developed for additive expansions based on any fitting criterion. Specific algorithms are presented for least-squares, least-absolute-deviation, and Huber-M loss functions for regression, and multi-class logistic likelihood for classification. Special enhancements are derived for the particular case where the individual additive components are regression trees, and tools for interpreting such "TreeBoost" models are presented. Gradient boosting of regression trees produces competitive, highly robust, interpretable procedures for both regression and classification, especially appropriate for mining less than clean data. Connections between this approach and the boosting methods of Freund and Schapire 1996, and Frie...
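The least-squares case described in this abstract can be illustrated with a minimal sketch. It assumes 1-D inputs, single-split regression stumps as the additive components, and a fixed shrinkage factor; the names fit_stump and gradient_boost and the default parameter values are hypothetical choices for illustration, not taken from the paper.

```python
def fit_stump(x, r):
    """Least-squares regression stump (one split) fitted to targets r on 1-D inputs x."""
    best = None
    for s in sorted(set(x))[:-1]:  # every split that leaves both sides non-empty
        left = [ri for xi, ri in zip(x, r) if xi <= s]
        right = [ri for xi, ri in zip(x, r) if xi > s]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((ri - lm) ** 2 for ri in left) + sum((ri - rm) ** 2 for ri in right)
        if best is None or sse < best[0]:
            best = (sse, s, lm, rm)
    if best is None:  # all inputs identical: fall back to a constant fit
        mean = sum(r) / len(r)
        return lambda xi: mean
    _, s, lm, rm = best
    return lambda xi: lm if xi <= s else rm


def gradient_boost(x, y, n_stages=50, shrinkage=0.1):
    """Stagewise additive fit for squared-error loss (illustrative sketch)."""
    f0 = sum(y) / len(y)  # initial constant model
    stumps = []
    pred = [f0] * len(y)
    for _ in range(n_stages):
        # For the loss 0.5 * (y - F)^2 the negative gradient is the residual y - F.
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        h = fit_stump(x, residuals)
        stumps.append(h)
        pred = [pi + shrinkage * h(xi) for pi, xi in zip(pred, x)]
    return lambda xi: f0 + shrinkage * sum(h(xi) for h in stumps)
```

Each stage fits a stump to the current residuals, which is the steepest-descent direction in function space for squared error; other losses would replace the residual with their own negative gradient.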
Strictly Proper Scoring Rules, Prediction, and Estimation
, 2007
Cited by 144 (17 self)
Scoring rules assess the quality of probabilistic forecasts, by assigning a numerical score based on the predictive distribution and on the event or value that materializes. A scoring rule is proper if the forecaster maximizes the expected score for an observation drawn from the distribution F if he or she issues the probabilistic forecast F, rather than G ≠ F. It is strictly proper if the maximum is unique. In prediction problems, proper scoring rules encourage the forecaster to make careful assessments and to be honest. In estimation problems, strictly proper scoring rules provide attractive loss and utility functions that can be tailored to the problem at hand. This article reviews and develops the theory of proper scoring rules on general probability spaces, and proposes and discusses examples thereof. Proper scoring rules derive from convex functions and relate to information measures, entropy functions, and Bregman divergences. In the case of categorical variables, we prove a rigorous version of the Savage representation. Examples of scoring rules for probabilistic forecasts in the form of predictive densities include the logarithmic, spherical, pseudospherical, and quadratic scores. The continuous ranked probability score applies to probabilistic forecasts that take the form of predictive cumulative distribution functions. It generalizes the absolute error and forms a special case of a new and very general type of score, the energy score. Like many other scoring rules, the energy score admits a kernel representation in terms of negative definite functions, with links to inequalities of Hoeffding type, in both univariate and multivariate settings. Proper scoring rules for quantile and interval forecasts are also discussed. We relate proper scoring rules to Bayes factors and to cross-validation, and propose a novel form of cross-validation known as random-fold cross-validation.
A case study on probabilistic weather forecasts in the North American Pacific Northwest illustrates the importance of propriety. We note optimum score approaches to point and quantile
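The statement that the continuous ranked probability score generalizes the absolute error can be checked numerically. The sketch below assumes the forecast is an equally weighted ensemble (an empirical distribution) and uses the kernel form E|X - y| - 0.5 E|X - X'|, written here as a loss so that lower is better; the sign convention and the function name crps_ensemble are assumptions of this example.

```python
def crps_ensemble(members, y):
    """CRPS of an equally weighted ensemble forecast against the outcome y.

    Negatively oriented (a loss): E|X - y| - 0.5 * E|X - X'|,
    where X and X' are independent draws from the ensemble.
    """
    m = len(members)
    expected_abs_error = sum(abs(x - y) for x in members) / m
    expected_spread = sum(abs(a - b) for a in members for b in members) / (m * m)
    return expected_abs_error - 0.5 * expected_spread


# A single-member ensemble is a point forecast, and the score is the absolute error.
point_score = crps_ensemble([3.0], 5.0)

# A three-member ensemble centred on the outcome scores better than any single member.
ensemble_score = crps_ensemble([1.0, 2.0, 3.0], 2.0)
```

With one member the spread term vanishes, which is the sense in which the score reduces to the absolute error.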
Estimating Portfolio and Consumption Choice: A Conditional Euler Equations Approach
 JOURNAL OF FINANCE
, 1999
Cited by 119 (11 self)
This paper develops a nonparametric approach to examine how portfolio and consumption choice depends on variables that forecast time-varying investment opportunities. I estimate single-period and multi-period portfolio and consumption rules of an investor with constant relative risk aversion and a one-month to 20-year horizon. The investor allocates wealth to the NYSE index and a 30-day Treasury bill. I find that the portfolio choice varies significantly with the dividend yield, default premium, term premium, and lagged excess return. Furthermore, the optimal decisions depend on the investor’s horizon and rebalancing frequency.
Piecewise linear regularized solution paths
 Ann. Statist
, 2007
Cited by 86 (8 self)
We consider the generic regularized optimization problem β̂(λ) = arg min_β L(y, Xβ) + λJ(β). Recently, Efron et al. (2004) have shown that for the Lasso (that is, if L is squared error loss and J(β) = ‖β‖₁ is the ℓ1 norm of β) the optimal coefficient path is piecewise linear, i.e., ∂β̂(λ)/∂λ is piecewise constant. We derive a general characterization of the properties of (loss L, penalty J) pairs which give piecewise linear coefficient paths. Such pairs allow for efficient generation of the full regularized coefficient paths. We investigate the nature of efficient path following algorithms which arise. We use our results to suggest robust versions of the Lasso for regression and classification, and to develop new, efficient algorithms for existing problems in the literature, including Mammen & van de Geer’s Locally Adaptive Regression Splines.
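The piecewise linearity of the Lasso path is easiest to see in the simplest special case. The sketch below assumes a single coefficient with loss 0.5 * (z - β)² and penalty λ|β| (equivalently, one coordinate of an orthonormal design, where z is the least-squares estimate); in that setting the minimizer is the soft-thresholding rule. The function names are hypothetical, and this is not the general path-following algorithm of the paper.

```python
import math


def soft_threshold(z, lam):
    """Minimizer of 0.5 * (z - b)**2 + lam * abs(b) over b, for lam >= 0."""
    return math.copysign(1.0, z) * max(abs(z) - lam, 0.0)


def lasso_path(z, lambdas):
    """Coefficient value at each penalty level in lambdas."""
    return [soft_threshold(z, lam) for lam in lambdas]


# For z = 3 the path falls with slope -1 until the knot at lam = 3, then stays at 0.
path = lasso_path(3.0, [0.0, 1.0, 2.0, 3.0, 4.0])
```

The derivative of the path with respect to λ is -sign(z) before the knot at λ = |z| and 0 after it, so it is piecewise constant, as the abstract states for the Lasso.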
Asymptotic analysis of stochastic programs
 Annals of Operations Research 30, 169–186
, 1991
Cited by 61 (13 self)
In this paper we discuss a general approach to studying asymptotic properties of statistical estimators in stochastic programming. The approach is based on an extended delta method and appears to be particularly suitable for deriving asymptotics of the optimal value of stochastic programs. Asymptotic analysis of the optimal value will be presented in detail. Asymptotic properties of the corresponding optimal solutions are briefly discussed.
Practical selection of svm parameters and noise estimation for svm regression
 Neural Networks
, 2004
Cited by 57 (0 self)
We investigate practical selection of meta-parameters for SVM regression (that is, ε-insensitive zone and regularization parameter C). The proposed methodology advocates analytic parameter selection directly from the training data, rather than resampling approaches commonly used in SVM applications. Good generalization performance of the proposed parameter selection is demonstrated empirically using several low-dimensional and high-dimensional regression problems. Further, we point out the importance of Vapnik’s ε-insensitive loss for regression problems with finite samples. To this end, we compare generalization performance of SVM regression (with optimally chosen ε) with regression using ‘least-modulus’ loss (ε = 0). These comparisons indicate superior generalization performance of SVM regression, for finite sample settings.
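The two losses compared in this abstract differ only in the width of the zone in which residuals are ignored. A minimal sketch, with a hypothetical function name, makes the relationship explicit: setting ε = 0 recovers the least-modulus (absolute) loss.

```python
def eps_insensitive_loss(residual, eps):
    """Vapnik's epsilon-insensitive loss: zero inside the zone, linear outside it."""
    return max(0.0, abs(residual) - eps)


def least_modulus_loss(residual):
    """Absolute-error loss, the special case eps = 0."""
    return eps_insensitive_loss(residual, 0.0)


residuals = [-1.5, -0.3, 0.0, 0.3, 1.5]
svm_losses = [eps_insensitive_loss(r, 0.5) for r in residuals]
abs_losses = [least_modulus_loss(r) for r in residuals]
```

Residuals smaller in magnitude than ε contribute nothing to the ε-insensitive loss, so only points outside the zone influence the fit.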
Predictive learning via rule ensembles
, 2005
Cited by 53 (2 self)
General regression and classification models are constructed as linear combinations of simple rules derived from the data. Each rule consists of a conjunction of a small number of simple statements concerning the values of individual input variables. These rule ensembles are shown to produce predictive accuracy comparable to the best methods. However, their principal advantage lies in interpretation. Because of its simple form, each rule is easy to understand, as is its influence on individual predictions, selected subsets of predictions, or globally over the entire space of joint input variable values. Similarly, the degree of relevance of the respective input variables can be assessed globally, locally in different regions of the input space, or at individual prediction points. Techniques are presented for automatically identifying those variables that are involved in interactions with other variables, the strength and degree of those interactions, as well as the identities of the other variables with which they interact. Graphical representations are used to visualize both main and interaction effects.
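The model form described here, a linear combination of rules where each rule is a conjunction of simple statements about individual inputs, can be sketched as follows. The rule encoding, the feature names, and the coefficient values are all hypothetical; how the rules are derived from data and how the coefficients are fitted is not shown.

```python
import operator

# Supported comparison statements on a single input variable.
OPS = {"<=": operator.le, ">": operator.gt}


def rule_fires(rule, x):
    """A rule is a list of (feature, op, threshold); it fires only if every statement holds."""
    return all(OPS[op](x[feature], threshold) for feature, op, threshold in rule)


def predict(intercept, weighted_rules, x):
    """Linear combination of rule indicators: intercept + sum of a_k * r_k(x)."""
    return intercept + sum(a * rule_fires(rule, x) for a, rule in weighted_rules)


# Hypothetical two-rule ensemble on made-up features.
ensemble = [
    (2.0, [("age", ">", 30), ("income", "<=", 50)]),
    (-1.0, [("age", "<=", 30)]),
]
score = predict(0.5, ensemble, {"age": 40, "income": 20})
```

Because each rule contributes either its coefficient or nothing, the effect of a rule on an individual prediction can be read off directly, which is the interpretability advantage the abstract emphasizes.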
Variable Kernel Density Estimation
 Annals of Statistics
, 1992
Cited by 51 (2 self)
In this paper, we propose a method for robust kernel density estimation. We interpret a KDE with Gaussian kernel as the inner product between a mapped test point and the centroid of mapped training points in kernel feature space. Our robust KDE replaces the centroid with a robust estimate based on M-estimation [1]. The iteratively reweighted least squares (IRWLS) algorithm for M-estimation depends only on inner products, and can therefore be implemented using the kernel trick. We prove the IRWLS method monotonically decreases its objective value at every iteration for a broad class of robust loss functions. Our proposed method is applied to synthetic data and network traffic volumes, and the results compare favorably to the standard KDE. Index Terms: kernel density estimation, M-estimator, outlier, kernel feature space, kernel trick
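The kernelized IRWLS idea in this abstract can be sketched for 1-D data. The example assumes a Huber loss as one member of the class of robust losses, an unnormalized Gaussian kernel, and arbitrary values for the bandwidth h and the Huber threshold huber_a; the function names and all parameter values are hypothetical and not taken from the paper.

```python
import math


def gaussian_kernel(a, b, h):
    """Unnormalized Gaussian kernel; the normalizing constant only rescales the density."""
    return math.exp(-((a - b) ** 2) / (2.0 * h * h))


def robust_kde_weights(data, h=0.5, huber_a=0.5, n_iter=50):
    """Kernelized IRWLS: reweight points by their feature-space distance to the centroid."""
    n = len(data)
    K = [[gaussian_kernel(a, b, h) for b in data] for a in data]
    w = [1.0 / n] * n  # uniform weights correspond to the standard KDE
    for _ in range(n_iter):
        Kw = [sum(K[i][j] * w[j] for j in range(n)) for i in range(n)]
        wKw = sum(w[i] * Kw[i] for i in range(n))
        # Squared distance ||Phi(x_i) - centroid||^2 expanded with inner products only.
        dist = [math.sqrt(max(K[i][i] - 2.0 * Kw[i] + wKw, 0.0)) for i in range(n)]
        # Huber weight psi(d) / d: 1 inside the threshold, huber_a / d outside it.
        phi = [1.0 if d <= huber_a else huber_a / d for d in dist]
        total = sum(phi)
        w = [p / total for p in phi]
    return w


def weighted_kde(x, data, w, h=0.5):
    """Density estimate (up to the kernel's normalizing constant) at x."""
    return sum(wi * gaussian_kernel(x, xi, h) for wi, xi in zip(w, data))


sample = [0.0, 0.1, -0.1, 0.05, 5.0]  # the last point plays the role of an outlier
weights = robust_kde_weights(sample)
```

The distances are computed entirely from kernel evaluations, which is the kernel-trick step the abstract refers to; points far from the weighted centroid in feature space receive smaller weights in the final density estimate.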
Robust mixture modelling using the t distribution
 Statistics and Computing
Cited by 42 (1 self)
Normal mixture models are being increasingly used to model the distributions of a wide variety of random phenomena and to cluster sets of continuous multivariate data. However, for a set of data containing a group or groups of observations with longer than normal tails or atypical observations, the use of normal components may unduly affect the fit of the mixture model. In this paper, we consider a more robust approach by modelling the data by a mixture of t distributions. The use of the ECM algorithm to fit this t mixture model is described and examples of its use are given in the context of clustering multivariate data in the presence of atypical observations in the form of background noise.