Sampling-based estimation of the number of distinct values of an attribute
- In Proc. 21st International Conf. on Very Large Data Bases, 1995
Cited by 97 (9 self)
We provide several new sampling-based estimators of the number of distinct values of an attribute in a relation. We compare these new estimators to estimators from the database and statistical literature empirically, using a large number of attribute-value distributions drawn from a variety of real-world databases. This appears to be the first extensive comparison of distinct-value estimators in either the database or statistical literature, and is certainly the first to use highly-skewed data of the sort frequently encountered in database applications. Our experiments indicate that a new “hybrid” estimator yields the highest precision on average for a given sampling fraction. This estimator explicitly takes into account the degree of skew in the data and combines a new “smoothed jackknife” estimator with an estimator due to Shlosser. We investigate how the hybrid estimator behaves as we scale up the size of the database.
Random Sampling for Histogram Construction: How much is enough?
- 1998
Cited by 91 (11 self)
Random sampling is a standard technique for constructing (approximate) histograms for query optimization. However, any real implementation in commercial products requires solving the hard problem of determining "How much sampling is enough?" We address this critical question in the context of equi-height histograms used in many commercial products, including Microsoft SQL Server. We introduce a conservative error metric capturing the intuition that for an approximate histogram to have low error, the error must be small in all regions of the histogram. We then present a result establishing an optimal bound on the amount of sampling required for pre-specified error bounds. We also describe an adaptive page sampling algorithm which achieves greater efficiency by using all values in a sampled page but adjusts the amount of sampling depending on clustering of values in pages. Next, we establish that the problem of estimating the number of distinct values is provably difficult, but propose ...
The Power of Sampling in Knowledge Discovery
- 1993
Cited by 51 (2 self)
We consider the problem of approximately verifying the truth of sentences of tuple relational calculus in a given relation M by considering only a random sample of M. We define two different measures for the error of a universal sentence in a relation. For a set of n universal sentences each with at most k universal quantifiers, we give upper and lower bounds for the sample sizes required for having a high probability that all the sentences with error at least $\varepsilon$ can be detected as false by considering the sample. The sample sizes are $O((\ln n)/\varepsilon)$ or $O((|M|^{1-1/k} \ln n)/\varepsilon)$, depending on the error measure used. We also consider universal-existential sentences. Computing Reviews Categories and Subject Descriptors: H.3.3 [Information Systems]: Information Storage and Retrieval -- Information Search and Retrieval; F.2.2 [Theory of Computation]: Analysis of Algorithms and Problem Complexity -- Nonnumerical Algorithms and Problems; G.3 [Mathematics of Computing]: Probability and Sta...
Random Sampling from Databases - A Survey
- Statistics and Computing, 1994
Cited by 20 (0 self)
This paper reviews recent literature on techniques for obtaining random samples from databases. We begin with a discussion of why one would want to include sampling facilities in database management systems. We then review basic sampling techniques used in constructing DBMS sampling algorithms, e.g., acceptance/rejection and reservoir sampling. A discussion of sampling from various data structures follows: B+ trees, hash files, and spatial data structures (including R-trees and quadtrees). Algorithms for sampling from simple relational queries, e.g., single relational operators such as selection, intersection, union, set difference, projection, and join, are then described. We then describe sampling for estimation of aggregates (e.g., the size of query results). Here we discuss both clustered sampling and sequential sampling approaches. Decision theoretic approaches to sampling for query optimization are reviewed. DRAFT of March 22, 1994. 1 Introduction In this paper we sur...
On Sampling and Relational Operators
- Bulletin of the Technical Committee on Data Engineering, 1999
Cited by 3 (2 self)
A major bottleneck in implementing sampling as a primitive relational operation is the inefficiency of sampling the output of a query. We highlight the primary difficulties, summarize the results of some recent work in this area, and indicate directions for future work.
Population estimation with performance guarantees
- In Proceedings of IEEE Symposium on Information Theory, 2007
Cited by 3 (0 self)
We estimate the population size by sampling uniformly from the population. Given an accuracy to which we need to estimate the population with a pre-specified confidence, we provide a simple stopping rule for the sampling process. I. SUMMARY Many applications such as species estimation [1], database sampling [2], and epidemiologic studies [3], [4], [5] call for estimating a population size based on a relatively small sample. We derive a simple, yet nearly optimal, stopping rule for sampling and an estimation formula for alphabet size from uniform samples taken from the population. We will consider an approach outlined for the species estimation problem by Good [6] further on in the summary. For a more complete survey of prior results obtained in the species estimation problem, see [1]. For problems in database sampling see [7], [2]. The results obtained in this paper are also related to capture-recapture problems [3], [4], [5], where the unknown population size is estimated given the number of samples that are recaptured (repetitions) when sampling randomly from the population. Here, we are interested in how many recaptures are necessary to estimate the population to a given accuracy with a specified confidence. Intuitively speaking, the greater the number of recaptures, the better the population size can be estimated. Formally, in an n-element sample let m denote the number of distinct elements. Let r = n − m denote the number of repeated elements. For example, in c, g, c, s, g, c, v, there are n = 7 samples, m = 4 distinct elements (c, g, s, and v), and r = 7 − 4 = 3 repeated elements (one g and two c's). In the following, n independent samples are drawn uniformly from a k-element population, and $M_n^k$ and $R_n^k = n - M_n^k$ are the random numbers of distinct and repeated elements observed. We drop the subscripts and superscripts when there is no ambiguity. A. Good’s approach By linearity of expectations, E(M) = k 1 −
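The counting in the worked example of this abstract (n samples, m distinct elements, r = n − m repeats) can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `count_distinct_and_repeats` is hypothetical and does not come from the paper, and the code covers only the bookkeeping of n, m, and r, not the paper's stopping rule or estimator.

```python
from collections import Counter


def count_distinct_and_repeats(sample):
    """Return (n, m, r) for a sample: size, distinct count, repeat count.

    Hypothetical helper for illustration; r is defined as n - m,
    following the definition given in the abstract.
    """
    counts = Counter(sample)   # occurrences of each observed element
    n = len(sample)            # total number of samples drawn
    m = len(counts)            # number of distinct elements seen
    r = n - m                  # number of repeated elements (recaptures)
    return n, m, r


if __name__ == "__main__":
    # The sample from the abstract: c, g, c, s, g, c, v
    n, m, r = count_distinct_and_repeats(list("cgcsgcv"))
    print(n, m, r)  # 7 4 3
```

Using a `Counter` rather than a plain set keeps the per-element frequencies available, which would be needed if one also wanted to report which elements were recaptured (here, g once and c twice).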
Estimating the output cardinality of partial preaggregation with a measure of clusteredness
- In Proc. Int. Conf. on Very Large Data Bases (VLDB), 2003
Cited by 2 (1 self)
We introduce a new parameter, the clusteredness of data, and show how it can be used for estimating the output cardinality of a partial preaggregation operator. This provides the query optimizer with an important piece of information for deciding whether the application of partial preaggregation is beneficial. Experimental results are very promising, due to the high accuracy of the cardinality estimation based on our measure of clusteredness.
Distinct-values estimation over data streams
- In Data Stream Management: Processing High-Speed Data
Cited by 2 (1 self)
In this chapter, we consider the problem of estimating the number of distinct values in a data stream with repeated values. Distinct-values estimation was one of the first data stream problems studied: In the mid-1980s, Flajolet and Martin gave an effective algorithm that uses only logarithmic space. Recent work has built upon their technique, improving the accuracy guarantees on the estimation, proving lower bounds, and considering other settings such as sliding windows, distributed streams, and sensor networks.

