A Probability Model For Census Adjustment
 Mathematical Population Studies
, 2000
Cited by 4 (3 self)
The census can be adjusted using capture-recapture techniques: capture in the census, recapture in a special Post Enumeration Survey (PES) done after the census. The population is estimated using the Dual System Estimator (DSE). Estimates are made separately for demographic groups called post strata; adjustment factors are then applied to these demographic groups within small geographic areas. We offer a probability model for this process, in which several sources of error can be distinguished. In this model, correlation bias arises from behavioral differences between persons counted in the census and persons missed by the census. The first group may on the whole be more likely to respond to the PES: if so, the DSE will be systematically too low, and that is an example of correlation bias. Correlation bias is distinguished from heterogeneity, which occurs if the census has a higher capture rate in some geographic areas than others. Finally, ratio estimator bias and variance are conside...
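As a rough illustration of the dual system idea described in this abstract, the sketch below uses the textbook capture-recapture (Lincoln-Petersen) form: census count times PES count, divided by the number of people matched in both. The function names and the counts are hypothetical, and the estimator used in actual census adjustment includes further corrections that are not modeled here.

```python
def dual_system_estimate(census_count, pes_count, matched_count):
    """Textbook dual system estimate for one post stratum.

    census_count:  people captured by the census
    pes_count:     people captured by the Post Enumeration Survey
    matched_count: people found in both (the "recaptures")
    """
    if matched_count <= 0:
        raise ValueError("at least one matched person is required")
    return census_count * pes_count / matched_count


def adjustment_factor(census_count, pes_count, matched_count):
    """Ratio of the dual system estimate to the raw census count."""
    estimate = dual_system_estimate(census_count, pes_count, matched_count)
    return estimate / census_count


# Hypothetical post stratum: 900 census records, 100 PES records, 90 matches.
print(dual_system_estimate(900, 100, 90))
print(adjustment_factor(900, 100, 90))
```

If people counted in the census are also more likely to respond to the PES, the matched count is inflated relative to independence and the estimate comes out too low, which is the correlation bias the abstract describes.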
Supplement to “A covariate adjustment for zero-truncated approaches to estimating the size of hidden and elusive populations”
, 2008
Cited by 2 (1 self)
In this paper we consider the estimation of population size from one-source capture–recapture data, that is, a list in which individuals can potentially be found repeatedly and where the question is how many individuals are missed by the list. As a typical example, we provide data from a drug user study in Bangkok from 2001 where the list consists of drug users who repeatedly contact treatment institutions. Drug users with 1, 2, 3,... contacts occur, but drug users with zero contacts are not present, requiring the size of this group to be estimated. Statistically, these data can be considered as stemming from a zero-truncated count distribution. We revisit an estimator for the population size suggested by Zelterman that is known to be robust under potential unobserved heterogeneity. We demonstrate that the Zelterman estimator can be viewed as a maximum likelihood estimator for a locally truncated Poisson likelihood which is equivalent to a binomial likelihood. This result allows the extension of the Zelterman estimator by means of logistic regression to include observed heterogeneity in the form of covariates. We also review an estimator proposed by Chao and explain why we are not able to obtain similar results for this estimator. The Zelterman estimator is applied in two case studies, the first a drug user study from Bangkok, the second an illegal immigrant study in the Netherlands. Our results suggest the new estimator should be used, in particular, if substantial unobserved heterogeneity is present.
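The estimators named in this abstract can be sketched in their standard textbook forms, without the covariate extension. Here `f1` and `f2` are the numbers of individuals seen exactly once and exactly twice, and `n_observed` is the number of distinct individuals on the list. The function names and the counts in the example are hypothetical and are not taken from the Bangkok or Netherlands studies.

```python
import math


def zelterman_estimate(f1, f2, n_observed):
    """Zelterman estimate of population size from singleton/doubleton counts."""
    if f1 <= 0 or f2 <= 0:
        raise ValueError("f1 and f2 must both be positive")
    rate = 2.0 * f2 / f1  # Poisson rate estimated from counts of 1 and 2 only
    return n_observed / (1.0 - math.exp(-rate))


def zelterman_rate_from_binomial(f1, f2):
    """Same rate via the binomial view: p = P(count = 2 | count in {1, 2})."""
    p = f2 / (f1 + f2)
    return 2.0 * p / (1.0 - p)


def chao_lower_bound(f1, f2, n_observed):
    """Chao's lower-bound estimate of population size."""
    if f2 <= 0:
        raise ValueError("f2 must be positive")
    return n_observed + f1 * f1 / (2.0 * f2)


# Hypothetical list: 200 distinct people, 100 seen once, 50 seen twice.
print(zelterman_estimate(100, 50, 200))
print(zelterman_rate_from_binomial(100, 50))
print(chao_lower_bound(100, 50, 200))
```

The binomial form is what makes the logistic-regression extension mentioned in the abstract possible: once `p` is modeled as a function of covariates, each individual gets its own rate.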
SPECIES RICHNESS ESTIMATION
Abstract. Various models and estimation procedures for estimating the number of species in a community are reviewed under the following sampling schemes: sampling by continuous-type of efforts, sampling by individuals, and sampling by quadrats (or multiple occasions). Applications and relevant software are briefly reviewed.
unknown title
In many classification problems such as spam detection and network intrusion, a large number of unlabeled test instances are predicted negative by the classifier. However, the high costs as well as constraints on an expert’s time prevent further analysis of the “predicted false” class instances in order to segregate the false negatives from the true negatives. A systematic method is thus required to obtain an estimate of the number of false negatives. A capture-recapture based method can be used to obtain an ML estimate of false negatives when two or more independent classifiers are available. When independence does not hold, we can apply log-linear models to obtain an estimate of false negatives. However, as shown in this paper, the weaker the dependencies among the classifiers, the better the estimate obtained for false negatives. Thus, ideally, independent classifiers should be used to estimate the false negatives in an unlabeled dataset. Experimental results on the spam dataset from the UCI Machine Learning Repository are presented.
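For the two-classifier, independent case, the capture-recapture estimate of positives missed by both classifiers reduces to a simple ratio of cell counts in the 2x2 table. The sketch below assumes the counts refer to confirmed positives and that the two classifiers err independently; the function name and the numbers are hypothetical, and the log-linear treatment of dependent classifiers is not covered.

```python
def estimate_missed_by_both(only_a, only_b, both):
    """Estimate positives flagged by neither classifier.

    only_a: positives flagged by classifier A but not B
    only_b: positives flagged by classifier B but not A
    both:   positives flagged by both classifiers
    Assumes the two classifiers detect positives independently.
    """
    if both <= 0:
        raise ValueError("need at least one positive flagged by both")
    return only_a * only_b / both


# Hypothetical counts: 30 found only by A, 20 only by B, 60 by both.
missed = estimate_missed_by_both(30, 20, 60)
total_positives = 30 + 20 + 60 + missed
print(missed, total_positives)
```

This agrees with the usual two-list estimate: A finds 90, B finds 80, 60 overlap, so the estimated total is 90 * 80 / 60 = 120, of which 110 were observed.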
Injuries
, 2002
This report summarizes the most recent data from the U.S. Consumer Product Safety Commission (CPSC) on amusement ride injury and fatality incidents. The report contains hospital emergency room-treated injury estimates for the period from 1997 to 2001 and fatality data for the period from 1987 to July 2002. For comparison purposes, this report considers both fixed-site and mobile amusement rides. However, CPSC has jurisdiction over mobile rides only. As in previous reports, inflatable rides, such as slides and bounces, are considered separately.
Estimating false negatives for classification problems with cluster structure
Estimating the number of false negatives for a classifier when the true outcome of the classification is ascertained only for a limited number of instances is an important problem, with a wide range of applications from epidemiology to computer/network security. The frequently applied method is random sampling. However, when the target (positive) class of the classification is rare, which is often the case with network intrusions and diseases, this simple method results in excessive sampling. In this paper, we propose an approach that exploits the cluster structure of the data to significantly reduce the amount of sampling needed while guaranteeing an estimation accuracy specified by the user. The basic idea is to cluster the data and divide the clusters into a set of “strata”, such that the proportion of positive instances in the stratum is very low, very high or in between, respectively. By taking advantage of the different characteristics of the strata, more efficient estimation strategies can be applied, thereby significantly reducing the amount of required sampling. We also develop a computationally efficient clustering algorithm – referred to as class-focused partitioning – which uses the (imperfect) labels predicted by the classifier as additional guidance. We evaluated our method on the KDDCup network intrusion data set. Our method achieved better precision and accuracy with a 5% sample than the best trial of simple random sampling with 40% samples.
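To make the stratification idea concrete, here is a minimal sketch of a plain stratified estimate of the number of false negatives among predicted-negative instances. It is not the paper's method: it assumes the strata are already given, uses the same simple proportion estimate in every stratum, and does not implement class-focused partitioning or per-stratum strategies. The `Stratum` type and all counts are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Stratum:
    size: int                 # predicted-negative instances in the stratum
    sampled: int              # instances whose true label was checked
    positives_in_sample: int  # checked instances that were actually positive


def estimate_false_negatives(strata):
    """Sum of per-stratum estimates: size * observed positive proportion."""
    total = 0.0
    for s in strata:
        if s.sampled <= 0 or s.sampled > s.size:
            raise ValueError("each stratum needs 1..size sampled instances")
        total += s.size * s.positives_in_sample / s.sampled
    return total


# Hypothetical strata with very low, very high, and intermediate positive rates.
strata = [
    Stratum(size=10000, sampled=100, positives_in_sample=0),
    Stratum(size=500, sampled=50, positives_in_sample=45),
    Stratum(size=2000, sampled=400, positives_in_sample=100),
]
print(estimate_false_negatives(strata))
```

In this toy allocation the nearly pure strata get small samples and the mixed stratum gets the largest one, which is the intuition behind reducing total sampling effort.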
Research Track Poster Estimating Missed Actual Positives Using Independent Classifiers
Data mining is increasingly being applied in environments with a very high rate of data generation, such as network intrusion detection [7], where routers generate about 300,000 – 500,000 connections every minute. In such rare-class data domains, the cost of missing a rare-class instance is much higher than that of other classes. However, the high cost of manually labeling instances, the high rate at which data is collected, and real-time response constraints do not always allow one to determine the actual classes for the collected unlabeled datasets. In our previous work [9], this problem of missed false negatives was explained in the context of two different domains: “network intrusion detection” and “business opportunity classification”. In such cases, an estimate of the number of such missed high-cost, rare instances will aid in evaluating the performance of the modeling technique (e.g. classification) used.
Collaborative Estimation of the Size of the Used IPv4 and IPv6 Address Spaces
Abstract—In order to better understand how the transition from IPv4 to IPv6 will play out, we need to know how much of the allocated IPv4 space is actively used and how many hosts actually use IPv6. This report describes a collaborative, secure, anonymised scheme for estimating IPv4 and IPv6 address space utilisation based on private datasets of locally observed IP addresses, such as server logs or traffic traces. We are looking for collaborators willing to participate.