Results 1-10 of 58
Regularized discriminant analysis
J. Amer. Statist. Assoc., 1989
Cited by 340 (2 self)
Linear and quadratic discriminant analysis are considered in the small-sample, high-dimensional setting. Alternatives to the usual maximum likelihood (plug-in) estimates for the covariance matrices are proposed. These alternatives are characterized by two parameters, the values of which are customized to individual situations by jointly minimizing a sample-based estimate of future misclassification risk. Computationally fast implementations are presented, and the efficacy of the approach is examined through simulation studies and application to data. These studies indicate that in many circumstances dramatic gains in classification accuracy can be achieved. Submitted to Journal of the American Statistical Association.
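The two-parameter regularization of the covariance estimates can be sketched as follows. This is a minimal illustration that assumes a commonly used form: one parameter (here `lam`, a hypothetical name) blends each class covariance with the pooled covariance, and a second (`gamma`, also hypothetical) shrinks the result toward a multiple of the identity. The abstract does not state the exact weighting, so any sample-size weights are omitted.

```python
import numpy as np

def regularized_covariance(class_cov, pooled_cov, lam, gamma):
    """Two-parameter shrinkage of a class covariance matrix (illustrative form).

    lam   in [0, 1]: 0 keeps the class covariance, 1 uses the pooled covariance.
    gamma in [0, 1]: 0 applies no further shrinkage, 1 gives a scaled identity.
    """
    # Step 1: blend the class estimate with the pooled estimate.
    blended = (1.0 - lam) * class_cov + lam * pooled_cov
    # Step 2: shrink toward the identity scaled by the average eigenvalue.
    p = blended.shape[0]
    return (1.0 - gamma) * blended + gamma * (np.trace(blended) / p) * np.eye(p)
```

In practice the pair (`lam`, `gamma`) would be chosen on a grid by minimizing an estimate of misclassification risk such as cross-validation error, which is the role the abstract assigns to the sample-based risk estimate.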
The State of Authorship Attribution Studies: Some Problems and Solutions
Computers and the Humanities, 1998
Cited by 43 (0 self)
The statement, “Results of most nontraditional authorship attribution studies are not universally accepted as definitive,” is explicated. A variety of problems in these studies are listed and discussed: studies governed by expediency; a lack of competent research; flawed statistical techniques; corrupted primary data; lack of expertise in allied fields; a dilettantish approach; inadequate treatment of errors. Various solutions are suggested: construct a correct and complete experimental design; educate the practitioners; study style in its totality; identify and educate the gatekeepers; develop a complete theoretical framework; form an association of practitioners.
Two Estimators of the Mean of a Counting Process with Panel Count Data
1998
Cited by 22 (12 self)
We study two estimators of the mean function of a counting process based on "panel count data". The setting for "panel count data" is one in which n independent subjects, each with a counting process with common mean function, are observed at several possibly different times during a study. Following a model proposed by Schick and Yu (1997), we allow the number of observation times, and the observation times themselves, to be random variables. Our goal is to estimate the mean function of the counting process. We show that the estimator of the mean function proposed by Sun and Kalbfleisch (1995) can be viewed as a pseudo-maximum likelihood estimator when a nonhomogeneous Poisson process model is assumed for the counting process. We establish consistency of both the nonparametric pseudo-maximum likelihood estimator of Sun and Kalbfleisch (1995) and the full maximum likelihood estimator, even if the underlying counting process is not a Poisson process. We also derive the asymptotic distribution of both estimators at a fixed time t, and compare the resulting theoretical relative efficiency with finite-sample relative efficiency by way of a limited Monte Carlo study.
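A monotone estimate of the mean function from pooled panel counts can be sketched as follows. The sketch assumes that the pseudo-likelihood estimator reduces to a weighted isotonic regression of the average counts at each distinct observation time, with weights equal to the number of observations at that time. That characterization is an assumption here and is not stated in the abstract. The function name and input layout are hypothetical.

```python
from collections import defaultdict

def isotonic_mean_estimate(observations):
    """Nondecreasing fit to mean counts via pool-adjacent-violators.

    observations: iterable of (time, count) pairs pooled over all subjects.
    Returns a dict mapping each distinct time to the fitted mean.
    """
    totals = defaultdict(float)
    weights = defaultdict(int)
    for t, n in observations:
        totals[t] += n
        weights[t] += 1
    times = sorted(totals)

    # Each block holds [weighted mean, total weight, number of times covered].
    blocks = []
    for t in times:
        blocks.append([totals[t] / weights[t], weights[t], 1])
        # Merge backwards while the monotonicity constraint is violated.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, c2 = blocks.pop()
            m1, w1, c1 = blocks.pop()
            w = w1 + w2
            blocks.append([(m1 * w1 + m2 * w2) / w, w, c1 + c2])

    fitted = []
    for mean, _, count in blocks:
        fitted.extend([mean] * count)
    return dict(zip(times, fitted))
```

Pooling is what enforces the shape constraint: a mean function of a counting process cannot decrease, so adjacent time points whose raw averages decrease are replaced by their weighted average.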
Fitting Tweedie's Compound Poisson Model to Insurance Claims Data: Dispersion Modelling
ASTIN Bulletin, 2002
Cited by 22 (0 self)
We reconsider the problem of producing fair and accurate tariffs based on aggregated insurance data giving numbers of claims and total costs for the claims. Jørgensen and de Souza (Scand. Actuarial J., 1994) assumed Poisson arrival of claims and gamma distributed costs for individual claims. Jørgensen and de Souza (1994) directly modelled the risk or expected cost of claims per insured unit, μ say. They observed that the dependence of the likelihood function on μ is as for a linear exponential family, so that modelling similar to that of generalized linear models is possible. In this paper we observe that, when modelling the cost of insurance claims, it is generally necessary to model the dispersion of the costs as well as their mean. In order to model the dispersion we use the framework of double generalized linear models. Modelling the dispersion increases the precision of the estimated tariffs. The use of double generalized linear models also allows us to handle the case where only the total cost of claims and not the number of claims has been recorded. Keywords: Car insurance, claims data, compound Poisson model, exposure, generalized linear models, dispersion modelling, double generalized linear models, REML, risk theory, tariffication. Address for correspondence: Dr G. K. Smyth, Division of Genetics and Bioinformatics, Walter and Eliza Hall Institute of Medical Research, Post Office, Royal Melbourne Hospital, Parkville, VIC 3050, Australia.
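The compound Poisson setup (Poisson claim arrivals, gamma-distributed individual costs) can be illustrated by simulation. This is a minimal sketch using NumPy's random generator. The function and parameter names are hypothetical, and the sketch only generates data under the stated assumptions; it does not fit the double generalized linear model.

```python
import numpy as np

def simulate_total_claim_cost(rng, claim_rate, gamma_shape, gamma_scale, exposure=1.0):
    """Draw one (number of claims, total cost) pair for an insured unit.

    The claim count is Poisson with mean claim_rate * exposure, and each
    claim cost is gamma distributed with the given shape and scale.
    """
    n_claims = rng.poisson(claim_rate * exposure)
    # With zero claims the cost array is empty and the total is exactly 0.0,
    # which gives the point mass at zero typical of compound Poisson totals.
    total_cost = rng.gamma(gamma_shape, gamma_scale, size=n_claims).sum()
    return n_claims, total_cost
```

The expected total cost per unit of exposure is `claim_rate * gamma_shape * gamma_scale`, which corresponds to the risk that the abstract describes as the expected cost of claims per insured unit.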
On the cost of data analysis
Journal of Computational and Graphical Statistics, 1992
Cited by 22 (2 self)
A regression analysis usually consists of several stages such as variable selection, transformation and residual diagnosis. Inference is often made from the selected model without regard to the model selection methods that preceded it. This can result in overoptimistic and biased inferences. We first characterize data analytic actions as functions acting on regression models. We investigate the extent of the problem and test bootstrap, jackknife and sample splitting methods for ameliorating it. We also demonstrate an interactive LISP-STAT system for assessing the cost of the data analysis while it is taking place.
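Sample splitting, one of the remedies mentioned above, can be sketched as follows. The example is a deliberately simple stand-in: it selects a single predictor by absolute correlation on one half of the data and estimates its slope on the other half, so the estimate is not contaminated by the selection step. The selection rule and all names are assumptions for illustration and are not taken from the paper.

```python
import numpy as np

def split_select_then_fit(X, y, rng):
    """Select a predictor on one half of the data, estimate its slope on the other."""
    n = X.shape[0]
    perm = rng.permutation(n)
    select_idx, infer_idx = perm[: n // 2], perm[n // 2:]

    # Selection half: pick the predictor most correlated with the response.
    corrs = [abs(np.corrcoef(X[select_idx, j], y[select_idx])[0, 1])
             for j in range(X.shape[1])]
    best = int(np.argmax(corrs))

    # Inference half: ordinary least squares with an intercept.
    design = np.column_stack([np.ones(infer_idx.size), X[infer_idx, best]])
    coef, *_ = np.linalg.lstsq(design, y[infer_idx], rcond=None)
    return best, coef[1]
```

The price of the split is that each stage sees only half of the observations, which is one form of the "cost" of data analysis that the abstract refers to.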
Penalized Regression with Model-Based Penalties
2000
Cited by 20 (2 self)
Nonparametric regression techniques such as spline smoothing and local fitting depend implicitly on a parametric model. For instance, the cubic smoothing spline estimate of a regression function $\mu$ based on observations $(t_i, Y_i)$ is the minimizer of $\sum_i \{Y_i - \mu(t_i)\}^2 + \lambda \int (\mu'')^2$. Since $\int (\mu'')^2$ is zero when $\mu$ is a line, the cubic smoothing spline estimate favors the parametric model $\mu(t) = \beta_0 + \beta_1 t$. Here the authors consider replacing $\int (\mu'')^2$ with the more general expression $\int (L\mu)^2$, where $L$ is a linear differential operator with possibly nonconstant coefficients. The resulting estimate of $\mu$ performs well, particularly if $L\mu$ is small. They present an O(n) algorithm for the computation of the estimate. This algorithm is applicable to a wide class of $L$'s. They also suggest a method for the estimation of $L$. They study their estimates via simulation and apply them to several data sets.
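The way a roughness penalty favors a parametric model can be seen in a discrete analogue of the cubic smoothing spline. The sketch below assumes equally spaced design points and replaces the integrated squared second derivative with a sum of squared second differences. It uses a dense linear solve, so it is not the O(n) algorithm mentioned in the abstract, and the function name is hypothetical.

```python
import numpy as np

def second_difference_smoother(y, lam):
    """Minimize ||y - m||^2 + lam * ||D m||^2, with D the second-difference operator."""
    y = np.asarray(y, dtype=float)
    n = y.size
    # D has shape (n - 2, n); each row is (1, -2, 1) shifted along the diagonal.
    D = np.diff(np.eye(n), n=2, axis=0)
    # The minimizer solves the normal equations (I + lam * D'D) m = y.
    return np.linalg.solve(np.eye(n) + lam * D.T @ D, y)
```

Data that lie exactly on a line have zero second differences, so the penalty vanishes and the fit reproduces them for any `lam`. Swapping `D` for a discretization of another operator would favor whatever functions that operator annihilates, which is the idea behind the model-based penalty.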
Manual Controls For High-Dimensional Data Projections
Journal of Computational and Graphical Statistics, 1997
Cited by 19 (13 self)
Projections of high-dimensional data onto low-dimensional subspaces provide insightful views for understanding multivariate relationships. In this paper we discuss how to manually control the variable contributions to the projection. The user has control of the way a particular variable contributes to the viewed projection and can interactively adjust the variable's contribution. These manual controls complement the automatic views provided by a grand tour, or a guided tour, and give greatly improved flexibility to data analysts. 1 Introduction This paper builds on dynamic visualization methods for high-dimensional data using low-dimensional projections. Among these methods, the most familiar are 3D data rotations, generated by displaying a continuous sequence of 2D projections of 3D data. From a statistical perspective it is rare to have data that are strictly 3D, and so, unlike most computer graphics applications, the more useful methods for data analysis show projections from a...
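Adjusting one variable's contribution to a projection can be sketched with a simplified stand-in: scale that variable's row in the projection basis and re-orthonormalize. This is an assumption made for illustration, not the control mechanism of the paper, and all function names are hypothetical.

```python
import numpy as np

def random_projection_basis(p, d, rng):
    """Return a p x d matrix with orthonormal columns (a d-dimensional view of p variables)."""
    q, _ = np.linalg.qr(rng.normal(size=(p, d)))
    return q

def rescale_variable(basis, var_index, factor):
    """Scale one variable's row in the basis, then restore orthonormal columns."""
    adjusted = basis.copy()
    adjusted[var_index, :] *= factor
    q, _ = np.linalg.qr(adjusted)
    return q

def project(data, basis):
    """Project n x p data onto the view defined by the basis, giving n x d coordinates."""
    return data @ basis
```

The row norm of the basis for a variable measures how strongly that variable shows up in the view. Shrinking it with `factor < 1` reduces that variable's influence while the remaining variables absorb the difference after re-orthonormalization.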
Two likelihood-based semiparametric estimation methods for panel count data with covariates
2005
Cited by 17 (7 self)
We consider estimation in a particular semiparametric regression model for the mean of a counting process with “panel count” data. The basic model assumption is that the conditional mean function of the counting process is of the form $E\{N(t) \mid Z\} = \exp(\beta_0^T Z)\,\Lambda_0(t)$, where $Z$ is a vector of covariates and $\Lambda_0$ is the baseline mean function. The “panel count” observation scheme involves observation of the counting process $N$ for an individual at a random number $K$ of random time points; both the number and the locations of these time points may differ across individuals. We study semiparametric maximum pseudolikelihood and maximum likelihood estimators of the unknown parameters $(\beta_0, \Lambda_0)$ derived on the basis of a nonhomogeneous Poisson process assumption. The pseudolikelihood estimator is fairly easy to compute, while the maximum likelihood estimator poses more challenges from the computational perspective. We study asymptotic properties of both estimators assuming that the proportional mean model holds, but dropping the Poisson process assumption used to derive the estimators. In particular we establish asymptotic normality for the estimators of the regression parameter $\beta_0$ under appropriate hypotheses. The results show that our estimation procedures are robust in the sense that the estimators converge to the truth regardless of the underlying counting process.
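The pseudolikelihood idea can be sketched as a log-likelihood that treats every observed count as an independent Poisson variable with mean exp(β'Z)Λ(t), ignoring dependence between counts from the same individual. That working form is an assumption here, since the abstract does not write it out. Terms that do not involve the parameters are dropped, and the function name and input layout are hypothetical.

```python
import math

def pseudo_log_likelihood(beta, baseline_mean, subjects):
    """Poisson working log-likelihood for panel counts, up to an additive constant.

    beta:          list of regression coefficients.
    baseline_mean: function t -> positive baseline mean at time t.
    subjects:      list of (covariates, visits) pairs, where visits is a
                   list of (time, cumulative count) observations.
    """
    total = 0.0
    for covariates, visits in subjects:
        linear = sum(b * z for b, z in zip(beta, covariates))
        for t, count in visits:
            mean = math.exp(linear) * baseline_mean(t)
            # Poisson log-density without the log(count!) term.
            total += count * math.log(mean) - mean
    return total
```

Maximizing this over `beta` and a nondecreasing `baseline_mean` would give a pseudolikelihood-type estimate; the optimization itself is outside the scope of this sketch.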
A Quantification Of Distance-Bias Between Evaluation Metrics In Classification
In Proceedings of the 17th International Conference on Machine Learning, 2000
Cited by 12 (3 self)
This paper provides a characterization of bias for evaluation metrics in classification (e.g., Information Gain, Gini, χ², etc.). Our characterization provides a uniform representation for all traditional evaluation metrics. Such representation leads naturally to a measure for the distance between the bias of two evaluation metrics. We give a practical value to our measure by observing if the distance between the bias of two evaluation metrics correlates with differences in predictive accuracy when we compare two versions of the same learning algorithm that differ in the evaluation metric only. Experiments on real-world domains show how the expectations on accuracy differences generated by the distance-bias measure correlate with actual differences when the learning algorithm is simple (e.g., search for the best single-feature or the best single-rule). The correlation, however, weakens with more complex algorithms (e.g., learning decision trees). Our results sh...
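Two of the evaluation metrics named above, Information Gain and Gini, can be computed for a candidate split as follows. This is a minimal sketch using the standard textbook definitions. It does not implement the paper's distance-bias measure, and the function names are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of the class distribution in labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Gini impurity of the class distribution in labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def split_gain(parent, children, impurity):
    """Impurity of the parent minus the size-weighted impurity of the children."""
    n = len(parent)
    weighted = sum(len(child) / n * impurity(child) for child in children)
    return impurity(parent) - weighted
```

Passing `entropy` as `impurity` gives Information Gain and passing `gini` gives the Gini gain. The two can rank candidate splits differently, which is the kind of difference in bias that the paper sets out to quantify.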
On Locally Uniformly Linearizable High Breakdown Location and Scale Functionals
1998
Cited by 10 (3 self)
this paper and the standard one-model situation of robust statistics. They consider a finite number of models or challenges and look for a procedure which performs well at all of them. The hope is that such a procedure will also perform reasonably well for challenges which lie between. For a given sample a likelihood-based compromise between the two challenges is made. The use of likelihood means that the method of Morgenthaler and Tukey does not satisfy DP5. In Section 6 we show how it is possible to "coarsen" a large class of distributions by reducing them to a finite sample of m points which themselves satisfy DP5. These points can be used to decide between a finite set of challenges and hence to make the weights of the weighted mean depend on the shape of the sample but in a differentiable manner.