Results 1–10 of 159
Logistic Regression in Rare Events Data
1999
Cited by 152 (4 self)
Abstract: We study rare events data, binary dependent variables with dozens to thousands of times fewer ones (events, such as wars, vetoes, cases of political activism, or epidemiological infections) than zeros (“nonevents”). In many literatures, these variables have proven difficult to explain and predict, a problem that seems to have at least two sources. First, popular statistical procedures, such as logistic regression, can sharply underestimate the probability of rare events. We recommend corrections that outperform existing methods and change the estimates of absolute and relative risks by as much as some estimated effects reported in the literature. Second, commonly used data collection strategies are grossly inefficient for rare events data. The fear of collecting data with too few events has led to data collections with huge numbers of observations but relatively few, and poorly measured, explanatory variables, such as in international conflict data with more than a quarter-million dyads, only a few of which are at war. As it turns out, more efficient sampling designs exist for making valid inferences, such as sampling all available events (e.g., wars) and a tiny fraction of nonevents (peace). This enables scholars to save as much as 99% of their (non-fixed) data collection costs or to collect much more meaningful explanatory variables.
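The intercept correction for this sampling design can be illustrated with toy numbers. The formula below is the standard prior correction for choice-based (case-control) sampling; tau, ybar, and the fitted intercept are hypothetical values, not taken from the paper.

```python
import math

# Hypothetical numbers sketching the standard prior correction of the
# logit intercept after sampling all events and a tiny fraction of
# nonevents (choice-based sampling).
tau = 0.01        # assumed true fraction of events in the population
ybar = 0.50       # fraction of events in the subsampled estimation data
b0_sample = -0.2  # intercept estimated on the subsample (hypothetical)

# Under this design the slope coefficients are consistent; only the
# intercept needs to be pulled back to the population scale.
b0_corrected = b0_sample - math.log(((1 - tau) / tau) * (ybar / (1 - ybar)))
print(round(b0_corrected, 3))  # prints -4.795
```

With these toy numbers the naive intercept overstates the event probability by roughly a factor of exp(4.6) on the odds scale, which is the kind of sharp underestimate/overestimate the abstract refers to.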
The Impact of Differential Payroll Tax Subsidies on Minimum Wage Employment
Journal of Public Economics, 2001
Cited by 60 (2 self)
"... for very helpful comments and suggestions. We would also like to thank participants at the CEPR-Brussels Conference on minimum wage, the CNRS winter workshop in ..."
Discriminative learning under covariate shift
The Journal of Machine Learning Research
Cited by 39 (0 self)
Abstract: We address classification problems for which the training instances are governed by an input distribution that is allowed to differ arbitrarily from the test distribution, problems also referred to as classification under covariate shift. We derive a solution that is purely discriminative: neither the training nor the test distribution is modeled explicitly. The problem of learning under covariate shift can be written as an integrated optimization problem. Instantiating the general optimization problem leads to a kernel logistic regression and an exponential model classifier for covariate shift. The optimization problem is convex under certain conditions; our findings also clarify the relationship to the known kernel mean matching procedure. We report on experiments on problems of spam filtering, text classification, and landmine detection.
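As a contrast to the purely discriminative solution described above, here is a minimal generative sketch of covariate-shift correction, assuming (unrealistically) that both input densities are known unit-variance Gaussians; the setup and all numbers are my own toy construction, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D covariate shift: training inputs ~ N(0, 1), test inputs
# ~ N(1, 1); the labeling rule is assumed identical in both domains.
x_train = rng.normal(0.0, 1.0, 500)

def normal_pdf(x, mu):
    # Unit-variance normal density (normalizing constants cancel in the ratio).
    return np.exp(-0.5 * (x - mu) ** 2) / np.sqrt(2.0 * np.pi)

# Importance weights p_test(x) / p_train(x): reweight training points so
# the training loss mimics the test distribution. Here the densities are
# known in closed form; the paper instead estimates the weights
# discriminatively, without modeling either distribution.
w = normal_pdf(x_train, 1.0) / normal_pdf(x_train, 0.0)
```

For these two Gaussians the ratio reduces analytically to exp(x - 1/2), so training points that look like test points receive the largest weights.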
Large Sample Theory for Semiparametric Regression Models with Two-Phase, Outcome-Dependent Sampling
2000
Cited by 30 (9 self)
Abstract: Outcome-dependent, two-phase sampling designs can dramatically reduce the costs of observational studies by judicious selection of the most informative subjects for purposes of detailed covariate measurement. Here we derive asymptotic information bounds and the form of the efficient score and influence functions for the semiparametric regression models studied by Lawless, Kalbfleisch, and Wild (1999) under two-phase sampling designs. We relate the efficient score to the least-favorable parametric submodel by use of formal calculations suggested by Newey (1994). We then proceed to show that the maximum likelihood estimators proposed by Lawless, Kalbfleisch, and Wild (1999) for both the parametric and nonparametric parts of the model are asymptotically normal and efficient, and that the efficient influence function for the parametric part agrees with the more general calculations of Robins, Hsieh, and Newey (1995).
Parametric distributions of complex survey data under informative probability sampling
Statistica Sinica, 1998
Cited by 28 (7 self)
Abstract: The sample distribution is defined as the distribution of the sample measurements given the selected sample. Under informative sampling, this distribution is different from the corresponding population distribution, although for several examples the two distributions are shown to be in the same family, differing only in some or all of the parameters. A general approach to approximating the marginal sample distribution for a given population distribution and first-order sample selection probabilities is discussed and illustrated. Theoretical and simulation results indicate that under common methods of sample selection with unequal probabilities, when the population measurements are independently drawn from some distribution (superpopulation), the sample measurements are asymptotically independent as the population size increases. This asymptotic independence, combined with the approximation of the marginal sample distribution, permits the use of standard methods such as direct likelihood inference or residual analysis for inference on the population distribution.
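The gap between sample and population distributions under informative sampling can be seen in a small simulation. The exponential superpopulation and the y-proportional first-order inclusion probabilities below are my own toy choices, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy superpopulation: 200,000 exponential(1) values, sampled with
# first-order inclusion probability proportional to y (informative
# sampling, since selection depends on the outcome itself).
y_pop = rng.exponential(1.0, 200_000)
pi = y_pop / y_pop.sum() * 20_000  # expected sample size ~ 20,000
keep = rng.random(y_pop.size) < np.clip(pi, 0.0, 1.0)
y_sample = y_pop[keep]

# The sample distribution is the size-biased exponential, i.e. it stays
# in a familiar family (Gamma(2, 1)) but with mean 2 instead of 1,
# matching the abstract's point that the two distributions can lie in
# the same family and differ only in parameters.
```

Ignoring the sampling design here and treating the sample as i.i.d. from the population distribution would overstate the population mean by roughly a factor of two.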
Explaining Rare Events in International Relations
2000
Cited by 21 (2 self)
Abstract: Some of the most important phenomena in international conflict are coded as "rare events data," binary dependent variables with dozens to thousands of times fewer events, such as wars, coups, etc., than "nonevents." Unfortunately, rare events data are difficult to explain and predict, a problem that seems to have at least two sources. First, and most importantly, the data collection strategies used in international conflict are grossly inefficient. The fear of collecting data with too few events has led to data collections with huge numbers of observations but relatively few, and poorly measured, explanatory variables. As it turns out, more efficient sampling designs exist for making valid inferences, such as sampling all available events (e.g., wars) and a tiny fraction of nonevents (peace). This enables scholars to save as much as 99% of their (non-fixed) data collection costs, or to collect much more meaningful explanatory variables. Second, logistic regression, and other commonly ...
Population structure and cryptic relatedness in genetic association studies
Statistical Science, 2009
Shrinkage Estimators for Robust and Efficient Inference in Haplotype-Based Case-Control Studies
Cited by 18 (3 self)
Abstract: Case-control association studies often aim to investigate the role of genes and gene-environment interactions in terms of the underlying haplotypes, i.e., the combinations of alleles at multiple genetic loci along chromosomal regions. The goal of this article is to develop robust but efficient approaches to the estimation of disease odds-ratio parameters associated with haplotypes and haplotype-environment interactions. We consider “shrinkage” estimation techniques that can adaptively relax the model assumptions of Hardy-Weinberg equilibrium and gene-environment independence required by recently proposed efficient “retrospective” methods. Our proposal involves first development of a novel retrospective approach to the analysis of case-control data, one that is robust to the nature of the gene-environment distribution in the underlying population. Next, it involves shrinkage of the robust retrospective estimator towards a more precise, but model-dependent, retrospective estimator using novel empirical Bayes and penalized regression techniques. Methods for variance estimation are proposed based on asymptotic theories. Simulations and two data examples illustrate both the robustness and efficiency of the proposed methods.
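A hedged numeric sketch of the shrinkage idea, not the authors' exact estimator: pull the robust but noisy estimate toward the efficient, model-based one by a weight that compares their disagreement to the robust estimate's sampling noise. All numbers below are hypothetical.

```python
# Hypothetical log odds-ratio estimates for a haplotype effect:
beta_robust = 0.80  # robust retrospective estimate (assumed)
beta_model = 0.50   # HWE / gene-environment-independence estimate (assumed)
var_robust = 0.04   # estimated variance of the robust estimate (assumed)

diff = beta_robust - beta_model
# Adaptive weight: near 1 (keep the robust estimate) when the two
# estimates disagree by much more than the robust estimate's noise,
# near 0 (shrink toward the model) when they agree within noise.
k = diff ** 2 / (diff ** 2 + var_robust)
beta_shrunk = beta_model + k * diff
```

The adaptivity is the point: when the convenient model assumptions hold, the shrunk estimate inherits most of the model-based estimator's efficiency; when they fail badly, it stays close to the robust estimate.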
A semiparametric approach to the one-way layout
Technometrics
Cited by 15 (4 self)
Abstract: We consider m distributions in which the first m − 1 are obtained by multiplicative exponential distortions of the m-th distribution, which is a reference. The combined data from m samples, one from each distribution, are used in the semiparametric large-sample problem of estimating each distortion and the reference distribution and testing the hypothesis that the distributions are identical. The approach generalizes the classical normal-based one-way analysis of variance in the sense that it obviates the need for a completely specified parametric model. An advantage is that the probability density of the reference distribution is estimated from the combined data and not only from the m-th sample. A power comparison with the t and F tests and with two nonparametric tests, obtained by means of a simulation, points to the merit of the present approach. The method is applied to rain-rate data from meteorological instruments.
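The "multiplicative exponential distortion" model can be checked on a standard example: two unit-variance normal densities differ by exactly an exponential tilt in x. This is a textbook identity, worked through numerically as a sketch.

```python
import numpy as np

# For g = N(mu, 1) distorted relative to the reference g_ref = N(0, 1):
#   g(x) / g_ref(x) = exp(alpha + beta * x),  beta = mu,  alpha = -mu**2 / 2,
# since the quadratic terms in the two exponents cancel.
mu = 0.7
x = np.linspace(-3.0, 3.0, 13)

ratio = np.exp(-0.5 * (x - mu) ** 2) / np.exp(-0.5 * x ** 2)
tilt = np.exp(-mu ** 2 / 2.0 + mu * x)
# The two agree pointwise, so the normal one-way layout is a special
# case of the semiparametric density-ratio model described above.
```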
Bayesian Analysis of Case-Control Studies with Categorical Covariates
2000
Cited by 15 (2 self)
Abstract: In this paper, we are not specifically concerned with measurement error. However, Muller & Roeder's exposition is highly relevant to any Bayesian analysis of a case-control study.