Sampling-based estimation of the number of distinct values of an attribute
In Proc. 21st International Conf. on Very Large Data Bases, 1995
Cited by 97 (9 self)
We provide several new sampling-based estimators of the number of distinct values of an attribute in a relation. We compare these new estimators to estimators from the database and statistical literature empirically, using a large number of attribute-value distributions drawn from a variety of real-world databases. This appears to be the first extensive comparison of distinct-value estimators in either the database or statistical literature, and is certainly the first to use highly-skewed data of the sort frequently encountered in database applications. Our experiments indicate that a new “hybrid” estimator yields the highest precision on average for a given sampling fraction. This estimator explicitly takes into account the degree of skew in the data and combines a new “smoothed jackknife” estimator with an estimator due to Shlosser. We investigate how the hybrid estimator behaves as we scale up the size of the database.
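To make the idea of sampling-based distinct-value estimation concrete, here is a minimal sketch of a simple first-order jackknife-style estimator. It is not the hybrid, smoothed jackknife, or Shlosser estimator from the abstract; the function name, its parameters, and the choice of this particular formula are assumptions made for illustration only.

```python
from collections import Counter


def jackknife_distinct_estimate(sample, relation_size):
    """Estimate the number of distinct values in a relation from a sample.

    Illustrative first-order jackknife form (an assumption, not the
    paper's hybrid estimator):
        D_hat = d / (1 - (1 - q) * f1 / n)
    where d is the number of distinct values in the sample, f1 the number
    of values seen exactly once, n the sample size, and q = n / N the
    sampling fraction.
    """
    n = len(sample)
    if n == 0 or relation_size < n:
        raise ValueError("need a non-empty sample no larger than the relation")
    counts = Counter(sample)
    d = len(counts)                                   # distinct values observed
    f1 = sum(1 for c in counts.values() if c == 1)    # values observed exactly once
    q = n / relation_size                             # sampling fraction
    return d / (1.0 - (1.0 - q) * f1 / n)
```

When the sample is the whole relation (q = 1) the estimate reduces to the observed distinct count, and when every sampled value is unique it scales up to the relation size.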
Estimation of Software Reliability by Stratified Sampling
1999
Cited by 17 (5 self)
In this article, we have focused on improving the efficiency of software reliability estimation. However, automated analysis of execution profiles might also be useful for revealing software defects. In this context, it could be used to filter out executions with unusual profiles on the assumption that they are most likely to reveal failures. ACM Transactions on Software Engineering and Methodology, Vol. 8, No. 3, July 1999.
Some Superpopulation Models for Estimating the Number of Population Uniques
1997
Cited by 6 (4 self)
The number of unique individuals in the population is of great importance in evaluating the disclosure risk of a microdata set. We approach this problem by considering some basic superpopulation models including the gamma-Poisson model of Bethlehem et al. (1990). We introduce the Dirichlet-multinomial model, which is closely related to but more basic than the gamma-Poisson model, in the sense that the binomial distribution is more basic than the Poisson distribution. We also discuss the Ewens model and show that it can be obtained from the Dirichlet-multinomial model by a limiting argument similar to the law of small numbers. The multivariate Ewens distribution is a basic mathematical model used in genetics. Estimation of the number of population uniques is particularly simple under the Ewens model. Although these models might not necessarily fit actual populations well, they can be considered as basic mathematical models for our problem, as binomial and Poisson distributions are considered as...
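As a rough illustration of why estimation can be simple under the Ewens model, the sketch below uses two standard facts about the Ewens sampling formula with parameter theta: the expected number of distinct cells in a sample of size n is the sum of theta / (theta + i) for i = 0..n-1, and the expected number of uniques in a population of size N is N * theta / (theta + N - 1). Fitting theta by matching the observed distinct count and then plugging it into the second formula is an assumption made here for illustration, not necessarily the procedure of the paper; all function names are hypothetical.

```python
def expected_distinct(theta, n):
    """Expected number of distinct cells in a sample of size n under Ewens."""
    return sum(theta / (theta + i) for i in range(n))


def fit_theta(num_distinct, n):
    """Solve expected_distinct(theta, n) == num_distinct by bisection.

    The left-hand side increases in theta from 1 to n, so a solution
    exists only when 1 < num_distinct < n.
    """
    if not 1 < num_distinct < n:
        raise ValueError("need 1 < num_distinct < n")
    lo, hi = 1e-12, 1e12
    for _ in range(200):
        mid = (lo * hi) ** 0.5          # bisect on a geometric scale
        if expected_distinct(mid, n) < num_distinct:
            lo = mid
        else:
            hi = mid
    return (lo * hi) ** 0.5


def expected_population_uniques(theta, population_size):
    """Expected number of cells of size one in a population of the given size."""
    return population_size * theta / (theta + population_size - 1)
```

A caller would fit theta from the sample's distinct-cell count and sample size, then evaluate `expected_population_uniques` at the population size.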
Object co-identification on the Semantic Web
In 13th World Wide Web Conference, 2004
Cited by 6 (0 self)
The Semantic Web seeks to integrate data from many different sources. Since different sources often use different names for the same object, we need to map between these names. We first consider the use of keys to do this mapping and discuss some of the associated problems. We introduce the concept of bootstrapping from some shared names to more shared names and discuss some conditions under which this process is guaranteed to be correct. We describe a probabilistic approach to matching and propose approximations to address the issue of requiring a combinatorially large number of joint probabilities. We report on empirical studies for validating this approach in two interesting domains. Finally, we discuss the implications of better matching techniques for privacy.
Variance Estimation for Spatially Balanced Samples of Environmental Resources
Resampling-based Variance Estimation for Labour Force Surveys
Cited by 5 (1 self)
Labour force surveys are conducted to estimate quantities such as the unemployment rate and the number of people in work. Interest is typically both in estimates at a given time and in changes between two successive time-points. Calibration of the sample to force agreement with known population margins results in random weights being assigned to each response, but the usual methods of variance estimation do not account for this. This paper describes how resampling methods (the jackknife, jackknife linearization, balanced repeated replication, and the bootstrap) can be used to do so. We also discuss implementation issues, and compare the methods by simulation based on data from the UK Labour Force Survey. The broad conclusions are these: bootstrap and jackknife linearization are less computer-intensive than the other resampling methods for such applications and give better standard errors; 'standard' methods can be badly biased downwards; and it is essential to take variability of...
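For readers unfamiliar with the jackknife mentioned above, here is a minimal delete-one sketch for a weighted rate such as an unemployment rate. It deliberately omits the calibration step that the abstract says must be accounted for; in a fuller version the weights would be recalibrated inside each replicate. The function names and the simple weighted-rate estimator are assumptions for illustration.

```python
def weighted_rate(flags, weights):
    """Weighted proportion of units with flag == 1 (e.g. unemployed)."""
    return sum(w * f for f, w in zip(flags, weights)) / sum(weights)


def jackknife_variance(flags, weights, estimator=weighted_rate):
    """Delete-one jackknife variance estimate of `estimator`.

    Each replicate drops one respondent and recomputes the estimate.
    A calibrated survey would also recompute the weights at this point;
    that step is left out of this sketch.
    """
    n = len(flags)
    if n < 2:
        raise ValueError("need at least two observations")
    replicates = [
        estimator(flags[:i] + flags[i + 1:], weights[:i] + weights[i + 1:])
        for i in range(n)
    ]
    mean_rep = sum(replicates) / n
    return (n - 1) / n * sum((r - mean_rep) ** 2 for r in replicates)
```

With equal weights the result coincides with the usual variance estimate of a sample mean, s^2 / n, which is a convenient sanity check.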
Outlier detection by sampling with accuracy guarantees
In: KDD (2006)
Cited by 5 (0 self)
An effective approach to detect anomalous points in a data set is distance-based outlier detection. This paper describes a simple sampling algorithm to efficiently detect distance-based outliers in domains where each and every distance computation is very expensive. Unlike any existing algorithms, the sampling algorithm requires a fixed number of distance computations and can return good results with accuracy guarantees. The most computationally expensive aspect of estimating the accuracy of the result is sorting all of the distances computed by the sampling algorithm. This enables interactive-speed performance over the most expensive distance computations. The paper’s algorithms were tested over two domains that require expensive distance functions as well as ten additional real data sets. The experimental study demonstrates both the efficiency and effectiveness of the sampling algorithm in comparison with the state-of-the-art algorithm and the reliability of the accuracy guarantees.
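The general idea of scoring distance-based outliers with a fixed budget of distance computations can be sketched as follows. This is an illustrative reading of the abstract, not the paper's algorithm and without its accuracy guarantees: each point is compared against a fixed-size random sample of the other points, and its score is the distance to its k-th nearest sampled neighbour. The function name, parameters, and the use of Euclidean distance are assumptions.

```python
import math
import random


def sampled_knn_scores(points, sample_size, k, rng=None):
    """Score each point by its k-th nearest neighbour within a random sample.

    Uses exactly len(points) * sample_size distance computations.
    Larger scores suggest more outlying points.
    """
    n = len(points)
    if not 1 <= k <= sample_size <= n - 1:
        raise ValueError("need 1 <= k <= sample_size <= len(points) - 1")
    rng = rng or random.Random(0)
    scores = []
    for i, p in enumerate(points):
        # Draw sample_size indices from the other n - 1 points (skip index i).
        chosen = [j if j < i else j + 1
                  for j in rng.sample(range(n - 1), sample_size)]
        dists = sorted(math.dist(p, points[j]) for j in chosen)
        scores.append(dists[k - 1])
    return scores
```

Ranking points by descending score and taking the top few gives a candidate outlier list; the cost stays fixed regardless of how the data are distributed.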
Sample Designs for Watershed Assessment
1998
Cited by 4 (1 self)
This article describes two probability sampling approaches to sampling water resources. In the first approach, a complete enumeration of eligible sample units (e.g. stream segments, water bodies, drainage basins) is performed using GIS or paper maps to create a list frame. A probability random sample is then selected from this frame. In the second approach, a two-stage sample is selected. In the first stage, area segments are selected using a probability design, such as stratified sampling. Water resources contained in each segment are enumerated and labeled using field visits or aerial photographs. In a second stage, the water resource or related sampling units (e.g. stream segments, water bodies, measurement points) are selected in each of the segments. Examples of both approaches are presented, and their relative strengths and weaknesses are discussed. Key Words: water resource sampling, GIS frame, area sampling, Markov chain sampling. J.D. Opsomer is Assistant Professor and S.M....
Stream sampling for variance-optimal estimation of subset sums
In ACM-SIAM Symposium on Discrete Algorithms, 2009
Cited by 4 (3 self)
From a high volume stream of weighted items, we want to maintain a generic sample of a certain limited size k that we can later use to estimate the total weight of arbitrary subsets. This is the classic context of on-line reservoir sampling, thinking of the generic sample as a reservoir. We present an efficient reservoir sampling scheme, VarOpt_k, that dominates all previous schemes in terms of estimation quality. VarOpt_k provides variance-optimal unbiased estimation of subset sums. More precisely, if we have seen n items of the stream, then for any subset size m, our scheme based on k samples minimizes the average variance over all subsets of size m. In fact, the optimality is against any off-line scheme with k samples tailored for the concrete set of items seen. In addition to optimal average variance, our scheme provides tighter worst-case bounds on the variance of particular subsets than previously possible. It is efficient, handling each new item of the stream in O(log k) time, which is optimal even on the word RAM. Finally, it is particularly well suited for combination of samples from different streams in a distributed setting.
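For context, the classic reservoir-sampling setting the abstract refers to can be sketched with plain uniform reservoir sampling plus a scaled-up subset-sum estimate. This is the simple baseline, not the variance-optimal scheme described above, and it ignores weights when choosing which items to keep. The (key, weight) item layout and both function names are assumptions for illustration.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Keep a uniform random sample of at most k items from a stream.

    Returns the reservoir and the number of items seen.
    """
    rng = rng or random.Random(0)
    reservoir = []
    seen = 0
    for item in stream:
        seen += 1
        if len(reservoir) < k:
            reservoir.append(item)
        else:
            j = rng.randrange(seen)     # uniform index in [0, seen)
            if j < k:
                reservoir[j] = item     # replace with probability k / seen
    return reservoir, seen


def estimate_subset_sum(reservoir, seen, in_subset):
    """Unbiased estimate of the total weight of items whose key is in the subset.

    Each sampled (key, weight) item stands in for seen / len(reservoir) items.
    """
    if not reservoir:
        return 0.0
    scale = seen / len(reservoir)
    return scale * sum(w for key, w in reservoir if in_subset(key))
```

Because inclusion is uniform, a few heavy items can dominate the variance of the estimate, which is the kind of weakness that weight-aware schemes are designed to address.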

