Results 1-10 of 65
Logistic Regression in Rare Events Data
, 1999
Cited by 55 (4 self)
We study rare events data, binary dependent variables with dozens to thousands of times fewer ones (events, such as wars, vetoes, cases of political activism, or epidemiological infections) than zeros (“nonevents”). In many literatures, these variables have proven difficult to explain and predict, a problem that seems to have at least two sources. First, popular statistical procedures, such as logistic regression, can sharply underestimate the probability of rare events. We recommend corrections that outperform existing methods and change the estimates of absolute and relative risks by as much as some estimated effects reported in the literature. Second, commonly used data collection strategies are grossly inefficient for rare events data. The fear of collecting data with too few events has led to data collections with huge numbers of observations but relatively few, and poorly measured, explanatory variables, such as in international conflict data with more than a quarter-million dyads, only a few of which are at war. As it turns out, more efficient sampling designs exist for making valid inferences, such as sampling all available events (e.g., wars) and a tiny fraction of nonevents (peace). This enables scholars to save as much as 99% of their (nonfixed) data collection costs or to collect much more meaningful explanatory
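The abstract above mentions sampling all events plus a small fraction of nonevents but does not spell out the accompanying corrections. As a hedged illustration only, the sketch below applies the standard intercept adjustment for a logit model fitted on such an event-enriched sample; it assumes the population event fraction `tau` is known, and the function names are hypothetical rather than taken from the paper. Slope coefficients need no such adjustment under this design.

```python
import math


def prior_corrected_intercept(b0_sample, tau, ybar):
    """Shift a logit intercept fitted on an event-enriched sample
    back to the population scale.

    b0_sample: intercept estimated on the sampled data
    tau:       assumed-known fraction of events in the population
    ybar:      fraction of events in the sample
    """
    if not (0.0 < tau < 1.0 and 0.0 < ybar < 1.0):
        raise ValueError("tau and ybar must lie strictly between 0 and 1")
    return b0_sample - math.log(((1.0 - tau) / tau) * (ybar / (1.0 - ybar)))


def logistic(z):
    """Inverse logit: map a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


# Illustration: events are 0.1% of the population, but the sample keeps
# every event and an equal number of nonevents (ybar = 0.5). An
# intercept-only logit fitted on that sample has intercept logit(0.5) = 0.
b0_corrected = prior_corrected_intercept(0.0, tau=0.001, ybar=0.5)
population_probability = logistic(b0_corrected)
```

Without the adjustment, the fitted model in this illustration would report an event probability of 0.5; with it, the implied probability matches the assumed population fraction.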
Discriminative learning under covariate shift
 The Journal of Machine Learning Research
Cited by 25 (0 self)
We address classification problems for which the training instances are governed by an input distribution that is allowed to differ arbitrarily from the test distribution—problems also referred to as classification under covariate shift. We derive a solution that is purely discriminative: neither training nor test distribution are modeled explicitly. The problem of learning under covariate shift can be written as an integrated optimization problem. Instantiating the general optimization problem leads to a kernel logistic regression and an exponential model classifier for covariate shift. The optimization problem is convex under certain conditions; our findings also clarify the relationship to the known kernel mean matching procedure. We report on experiments on problems of spam filtering, text classification, and landmine detection.
Large Sample Theory for Semiparametric Regression Models with Two-Phase, Outcome Dependent Sampling
, 2000
Cited by 20 (9 self)
Outcome-dependent, two-phase sampling designs can dramatically reduce the costs of observational studies by judicious selection of the most informative subjects for purposes of detailed covariate measurement. Here we derive asymptotic information bounds and the form of the efficient score and influence functions for the semiparametric regression models studied by Lawless, Kalbfleisch, and Wild (1999) under two-phase sampling designs. We relate the efficient score to the least-favorable parametric submodel by use of formal calculations suggested by Newey (1994). We then proceed to show that the maximum likelihood estimators proposed by Lawless, Kalbfleisch, and Wild (1999) for both the parametric and nonparametric parts of the model are asymptotically normal and efficient, and that the efficient influence function for the parametric part agrees with the more general calculations of Robins, Hsieh, and Newey (1995).
Prospective Analysis of Logistic Case-Control Studies
, 1994
Cited by 8 (1 self)
In a classical case-control study, Prentice & Pyke (1979) propose to ignore the study design and instead base estimation and inference upon a random sampling, i.e., prospective, formulation. We generalize this prospective formulation of case-control studies to include multiplicative models, stratification, missing data, measurement error, robustness and other examples. The resulting estimators, which ignore the case-control study aspect and instead are based upon a random-sampling formulation, are typically consistent for nonintercept parameters and are asymptotically normally distributed. We derive the resulting asymptotic covariance matrix of the parameter estimates. The covariance matrix obtained by ignoring the case-control sampling scheme and using prospective formulae instead is shown to be, at worst, asymptotically conservative, and asymptotically correct in a variety of problems; a simple sufficient condition guaranteeing the latter is obtained. Some Key Words: Asymptotics; C...
Application of Convolution Theorems in Semiparametric Models with Non-i.i.d. Data
Cited by 7 (1 self)
A useful approach to asymptotic efficiency for estimators in semiparametric models is the study of lower bounds on asymptotic variances via convolution theorems. Such theorems are often applicable in models in which the classical assumptions of independence and identical distributions fail to hold, but to date, much of the research has focused on semiparametric models with independent and identically distributed (i.i.d.) data because tools are available in the i.i.d. setting for verifying preconditions of the convolution theorems. We develop tools for non-i.i.d. data that are similar in spirit to those for i.i.d. data and also analogous to the approaches used in parametric models with dependent data. This involves extending the notion of the tangent vector figuring so prominently in the i.i.d. theory and providing conditions for smoothness, or differentiability, of the parameter of interest as a function of the underlying probability measures. As a corollary to the differentiability result we obtain sufficient conditions for equivalence, in terms of asymptotic variance bounds, of two models. Regularity and asymptotic linearity of estimators are also discussed.
Bayesian Analysis of Case-Control Studies with Categorical Covariates
, 2000
Cited by 6 (1 self)
this paper, we are not specifically concerned with measurement error. However, Muller & Roeder's exposition is highly relevant to any Bayesian analysis of a case-control study.
Estimating Risk and Rate Levels, Ratios, and Differences in Case-Control Studies
, 2001
Cited by 6 (2 self)
Classic (or "cumulative") case-control sampling designs do not admit inferences about quantities of interest other than risk ratios, and then only by making the rare events assumption. Probabilities, risk differences, and other quantities cannot be computed without knowledge of the population incidence fraction. Similarly, density (or "risk set") case-control sampling designs do not allow inferences about quantities other than the rate ratio. Rates, rate differences, cumulative rates, risks, and other quantities cannot be estimated unless auxiliary information about the underlying cohort such as the number of controls in each full risk set is available. Most scholars who have considered the issue recommend reporting more than just risk and rate ratios, but auxiliary population information needed to do this is not usually available. We address this problem by developing methods that allow valid inferences about all relevant quantities of interest from either type of case-control study when completely ignorant of or only partially knowledgeable about relevant auxiliary population information.
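To make the role of the population incidence fraction concrete, the following minimal sketch works through the textbook case in which that fraction is fully known. It is not the method of the paper above, which addresses partial or absent knowledge. The 2x2 counts, the incidence value `tau`, and all function names are hypothetical.

```python
def odds_ratio(a, b, c, d):
    """Odds ratio from a cumulative case-control 2x2 table.

    a: exposed cases      b: unexposed cases
    c: exposed controls   d: unexposed controls
    """
    return (a * d) / (b * c)


def risks_from_case_control(a, b, c, d, tau):
    """Recover P(event | exposed) and P(event | unexposed) by Bayes' rule,
    assuming the population incidence fraction tau is known."""
    p_exposed_given_case = a / (a + b)
    p_exposed_given_control = c / (c + d)
    risk_exposed = (p_exposed_given_case * tau) / (
        p_exposed_given_case * tau + p_exposed_given_control * (1.0 - tau)
    )
    risk_unexposed = ((1.0 - p_exposed_given_case) * tau) / (
        (1.0 - p_exposed_given_case) * tau
        + (1.0 - p_exposed_given_control) * (1.0 - tau)
    )
    return risk_exposed, risk_unexposed


# Hypothetical table: 100 cases (40 exposed), 100 controls (20 exposed),
# with an assumed population incidence of 1%.
table = (40, 60, 20, 80)
sample_odds_ratio = odds_ratio(*table)
risk_exposed, risk_unexposed = risks_from_case_control(*table, tau=0.01)
risk_ratio = risk_exposed / risk_unexposed
risk_difference = risk_exposed - risk_unexposed
```

With a small `tau`, the odds ratio computed from the table alone is close to the risk ratio, which is the rare events assumption at work; the risk levels and the risk difference, by contrast, depend directly on `tau`.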
Estimation in Choice-Based Sampling With Measurement Error and Bootstrap Analysis
 J. Economet
, 1997
Cited by 5 (4 self)
In this paper we discuss the estimation of a logit binary response model. The sampling is choice-based and is done in two stages. We investigate a likelihood-based estimator which reduces to the usual logistic estimator when there is no measurement error and which takes into account the constraints imposed by the structure of the problem. Estimated standard errors obtained by formulae for prospective analysis are asymptotically correct. A robust estimation procedure is proposed and an asymptotic covariance matrix obtained. Several bootstrap methods are applied to this retrospective problem. Numerical results are presented to illustrate useful properties of the methods. Key words: Binary logit; Bootstrap; Choice-based sampling; Measurement error; Robustness. JEL classification: C13; C25; C35. Correspondence to: C.Y. Wang, Division of Public Health Sciences, Fred Hutchinson Cancer Research Center, MP 1002, Seattle, WA 98104, USA. Email: cywang@mule.fhcrc.org, Fax: (206) 6674142. * The...
Explaining Rare Events in International Relations
, 2000
Cited by 5 (2 self)
Some of the most important phenomena in international conflict are coded as "rare events data," binary dependent variables with dozens to thousands of times fewer events, such as wars, coups, etc., than "nonevents". Unfortunately, rare events data are difficult to explain and predict, a problem that seems to have at least two sources. First, and most importantly, the data collection strategies used in international conflict are grossly inefficient. The fear of collecting data with too few events has led to data collections with huge numbers of observations but relatively few, and poorly measured, explanatory variables. As it turns out, more efficient sampling designs exist for making valid inferences, such as sampling all available events (e.g., wars) and a tiny fraction of nonevents (peace). This enables scholars to save as much as 99% of their (nonfixed) data collection costs, or to collect much more meaningful explanatory variables. Second, logistic regression, and other commonly ...
A Nonparametric Mixture Approach to Case-Control Studies with Errors in Covariables
 Journal of the American Statistical Association
, 1996
Cited by 4 (4 self)
Methods are devised for estimating the parameters of a prospective logistic model in a case-control study with dichotomous response D which depends on a covariate X. For a portion of the sample, both the gold standard X and a surrogate covariate W are available; however, for the greater portion of the data only the surrogate covariate W is available. By using a mixture model, the relationship between the true covariate and the response can be modeled appropriately for both types of data. The likelihood depends on the marginal distribution of X and the measurement error density (W | X; D). The latter is modeled parametrically based on the validation sample. The marginal distribution of the true covariate is modeled using a nonparametric mixture distribution. In this way we can improve the efficiency and reduce the bias of the parameter estimates. The results are sufficiently general that they allow us to provide the first results which allow no validation, if the error distribution is m...