Sampling-based estimation of the number of distinct values of an attribute
In Proc. 21st International Conf. on Very Large Data Bases, 1995
Cited by 114 (9 self)
We provide several new sampling-based estimators of the number of distinct values of an attribute in a relation. We compare these new estimators to estimators from the database and statistical literature empirically, using a large number of attribute-value distributions drawn from a variety of real-world databases. This appears to be the first extensive comparison of distinct-value estimators in either the database or statistical literature, and is certainly the first to use highly skewed data of the sort frequently encountered in database applications. Our experiments indicate that a new “hybrid” estimator yields the highest precision on average for a given sampling fraction. This estimator explicitly takes into account the degree of skew in the data and combines a new “smoothed jackknife” estimator with an estimator due to Shlosser. We investigate how the hybrid estimator behaves as we scale up the size of the database.
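The abstract above names an estimator due to Shlosser as one ingredient of the hybrid estimator. As a rough illustration only, the sketch below uses the form in which that estimator is commonly quoted: D = d + f_1 * sum_i (1 - q)^i f_i / sum_i i q (1 - q)^(i - 1) f_i, where d is the number of distinct values in the sample, f_i is the number of values that occur exactly i times in the sample, and q is the sampling fraction. That formula is an assumption to be checked against the paper, since the abstract does not state it, and the function name is hypothetical. The smoothed jackknife and the hybrid selection rule are not shown.

```python
from collections import Counter


def shlosser_estimate(sample, q):
    """Shlosser-style estimate of the number of distinct values.

    sample: iterable of attribute values drawn from the relation.
    q: sampling fraction in (0, 1] (assumed known to the caller).
    The formula used here is an assumption; verify it against the paper.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")

    value_counts = Counter(sample)           # value -> occurrences in sample
    d = len(value_counts)                    # distinct values seen
    f = Counter(value_counts.values())       # f[i] = values seen exactly i times

    numerator = sum((1 - q) ** i * f_i for i, f_i in f.items())
    denominator = sum(i * q * (1 - q) ** (i - 1) * f_i for i, f_i in f.items())
    if denominator == 0:
        # Empty sample, or q == 1 with no singletons: nothing to extrapolate.
        return float(d)
    return d + f.get(1, 0) * numerator / denominator
```

For example, `shlosser_estimate(["a", "a", "b", "c"], 0.5)` sees d = 3 values, two of them singletons, and adds a correction for values the sample is likely to have missed.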
Random Sampling for Histogram Construction: How much is enough?
1998
Cited by 106 (11 self)
Random sampling is a standard technique for constructing (approximate) histograms for query optimization. However, any real implementation in commercial products requires solving the hard problem of determining "How much sampling is enough?" We address this critical question in the context of equi-height histograms used in many commercial products, including Microsoft SQL Server. We introduce a conservative error metric capturing the intuition that for an approximate histogram to have low error, the error must be small in all regions of the histogram. We then present a result establishing an optimal bound on the amount of sampling required for prespecified error bounds. We also describe an adaptive page sampling algorithm which achieves greater efficiency by using all values in a sampled page but adjusts the amount of sampling depending on clustering of values in pages. Next, we establish that the problem of estimating the number of distinct values is provably difficult, but propose ...
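To make the idea of an approximate equi-height histogram concrete, here is a minimal sketch that picks bucket boundaries from a uniform row-level sample. It is not the paper's adaptive page sampling algorithm and does not implement its error metric; the function name, parameters, and fixed seed are all hypothetical choices made for illustration.

```python
import random


def equi_height_boundaries(values, num_buckets, sample_size, seed=0):
    """Approximate equi-height bucket boundaries from a random sample.

    Returns num_buckets - 1 separator values, chosen as evenly spaced
    quantiles of the sorted sample. Plain uniform sampling is assumed.
    """
    if num_buckets < 1:
        raise ValueError("num_buckets must be at least 1")

    values = list(values)
    rng = random.Random(seed)
    sample = sorted(rng.sample(values, min(sample_size, len(values))))
    if not sample:
        return []

    # Separator j sits at the j/num_buckets quantile of the sample.
    return [sample[(j * len(sample)) // num_buckets]
            for j in range(1, num_buckets)]
```

When the sample covers the whole column the boundaries are exact quantiles; with a smaller sample they are only estimates, which is the gap the abstract's question "How much sampling is enough?" is about.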
The Power of Sampling in Knowledge Discovery
1993
Cited by 53 (2 self)
We consider the problem of approximately verifying the truth of sentences of tuple relational calculus in a given relation M by considering only a random sample of M. We define two different measures for the error of a universal sentence in a relation. For a set of n universal sentences each with at most k universal quantifiers, we give upper and lower bounds for the sample sizes required for having a high probability that all the sentences with error at least $\varepsilon$ can be detected as false by considering the sample. The sample sizes are $O((\ln n)/\varepsilon)$ or $O((|M|^{1-1/k} \ln n)/\varepsilon)$, depending on the error measure used. We also consider universal-existential sentences. Computing Reviews Categories and Subject Descriptors: H.3.3 [Information Systems]: Information Storage and Retrieval - Information Search and Retrieval; F.2.2 [Theory of Computation]: Analysis of Algorithms and Problem Complexity - Nonnumerical Algorithms and Problems; G.3 [Mathematics of Computing]: Probability and Sta...
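A bound of the shape (ln n)/ε typically comes from a union-bound argument, which can be sketched as follows. The sketch assumes, as a simplification that is not taken from the paper, that each sentence with error at least ε is refuted by a single random draw with probability at least ε. The failure probability `delta` and the function name are hypothetical.

```python
import math


def sample_size(num_sentences, epsilon, delta):
    """Draws needed so that every sentence with error >= epsilon is refuted
    with probability >= 1 - delta, under the stated assumption.

    One bad sentence survives m draws with probability at most
    (1 - epsilon)**m <= exp(-epsilon * m). A union bound over
    num_sentences sentences gives m >= ln(num_sentences / delta) / epsilon.
    """
    if not (0 < epsilon <= 1 and 0 < delta < 1 and num_sentences >= 1):
        raise ValueError("need 0 < epsilon <= 1, 0 < delta < 1, n >= 1")
    return math.ceil(math.log(num_sentences / delta) / epsilon)
```

For fixed `delta` the result grows as (ln n)/ε, the same shape as the first bound quoted in the abstract.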
Random Sampling from Databases - A Survey
Statistics and Computing, 1994
Cited by 22 (0 self)
This paper reviews recent literature on techniques for obtaining random samples from databases. We begin with a discussion of why one would want to include sampling facilities in database management systems. We then review basic sampling techniques used in constructing DBMS sampling algorithms, e.g., acceptance/rejection and reservoir sampling. A discussion of sampling from various data structures follows: B+ trees, hash files, spatial data structures (including R-trees and quadtrees). Algorithms for sampling from simple relational queries, e.g., single relational operators such as selection, intersection, union, set difference, projection, and join are then described. We then describe sampling for estimation of aggregates (e.g., the size of query results). Here we discuss both clustered sampling and sequential sampling approaches. Decision-theoretic approaches to sampling for query optimization are reviewed. DRAFT of March 22, 1994. 1 Introduction In this paper we sur...
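Reservoir sampling, one of the basic techniques named in the abstract above, can be sketched in a few lines. This is the classic single-pass variant often called Algorithm R, shown as a generic illustration and not as the survey's own presentation. The function name and the optional `rng` parameter are hypothetical.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Uniform random sample of up to k items from a stream of unknown length.

    Makes one pass and keeps only k items in memory.
    """
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item i replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir
```

Because the stream length need not be known in advance, the same routine works for scanning a file or the output of a query one record at a time.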
Why go logarithmic if we can go linear? Towards effective distinct counting of search traffic
In Extending Database Technology (EDBT), 2008
Cited by 7 (0 self)
Estimating the number of distinct elements in a large multiset has several applications, and hence has attracted active research in the past two decades. Several sampling and sketching algorithms have been proposed to accurately solve this problem. The goal of the literature has always been to estimate the number of distinct elements while using minimal resources. However, in some modern applications, the accuracy of the estimate is of primary importance, and businesses are willing to trade more resources for better accuracy. Throughout our experience with building a distinct count system at a major search engine, Ask.com, we reviewed the literature of approximating distinct counts, and compared most algorithms in the literature. We deduced that Linear Counting, one of the least used algorithms, has unique and impressive advantages when the accuracy of the distinct count is critical to the business. For other estimators to attain comparable accuracy, they need more space than Linear Counting. We have supported our analytical results through comprehensive experiments. The experimental results highly favor Linear Counting when the number of distinct elements is large and the error tolerance is low.
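For readers unfamiliar with Linear Counting, the following is a minimal sketch of the basic technique: hash each element to one position of a bitmap of m bits, then estimate the distinct count as m * ln(m / z), where z is the number of bits still zero. The use of SHA-256 over `repr(item)`, the one-byte-per-bit bitmap, and the function name are illustrative assumptions, not details of the system described in the paper.

```python
import hashlib
import math


def linear_count(items, num_bits):
    """Linear Counting estimate of the number of distinct items."""
    bitmap = bytearray(num_bits)  # one byte per bit, for clarity not space
    for item in items:
        digest = hashlib.sha256(repr(item).encode("utf-8")).digest()
        position = int.from_bytes(digest[:8], "big") % num_bits
        bitmap[position] = 1      # duplicates hit the same position

    empty = num_bits - sum(bitmap)
    if empty == 0:
        raise ValueError("bitmap is full; use a larger num_bits")
    # Maximum-likelihood estimate based on the fraction of empty bits.
    return num_bits * math.log(num_bits / empty)
```

The bitmap size grows linearly with the number of distinct elements to be supported, which is the space-for-accuracy trade the abstract refers to.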
Distinct-values estimation over data streams
In Data Stream Management: Processing High-Speed Data
Cited by 4 (1 self)
Abstract. In this chapter, we consider the problem of estimating the number of distinct values in a data stream with repeated values. Distinct-values estimation was one of the first data stream problems studied: In the mid-1980s, Flajolet and Martin gave an effective algorithm that uses only logarithmic space. Recent work has built upon their technique, improving the accuracy guarantees on the estimation, proving lower bounds, and considering other settings such as sliding windows, distributed streams, and sensor networks.
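The Flajolet-Martin idea mentioned above can be illustrated with a single-bitmap sketch: record the position of the lowest set bit of each element's hash, find the lowest position R never recorded, and estimate the count as 2^R / 0.77351. This is a simplified illustration, not the chapter's algorithm. The hash choice, the function name, and the use of a single bitmap are assumptions, and practical versions combine many bitmaps to reduce the high variance of a single one.

```python
import hashlib

PHI = 0.77351  # correction constant commonly quoted for this estimator


def fm_estimate(items):
    """Single-bitmap Flajolet-Martin style distinct-count estimate."""
    bitmap = 0
    for item in items:
        digest = hashlib.sha256(repr(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        # Index of the least-significant 1 bit of the 64-bit hash.
        rho = (h & -h).bit_length() - 1 if h else 63
        bitmap |= 1 << rho

    if bitmap == 0:
        return 0.0  # empty stream

    # R is the index of the lowest bit that was never set.
    r = 0
    while bitmap & (1 << r):
        r += 1
    return (2 ** r) / PHI
```

The state is a single 64-bit bitmap regardless of stream length, and repeated values leave it unchanged, which is why duplicates do not affect the estimate.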
On Sampling and Relational Operators
Bulletin of the Technical Committee on Data Engineering, 1999
Cited by 3 (2 self)
A major bottleneck in implementing sampling as a primitive relational operation is the inefficiency of sampling the output of a query. We highlight the primary difficulties, summarize the results of some recent work in this area, and indicate directions for future work.
Population estimation with performance guarantees
In Proceedings of IEEE Symposium on Information Theory, 2007
Cited by 3 (0 self)
Abstract. We estimate the population size by sampling uniformly from the population. Given an accuracy to which we need to estimate the population with a prespecified confidence, we provide a simple stopping rule for the sampling process. I. SUMMARY Many applications such as species estimation [1], database sampling [2], and epidemiologic studies [3], [4], [5] call for estimating a population size based on a relatively small sample. We derive a simple, yet nearly optimal, stopping rule for sampling and an estimation formula for alphabet size from uniform samples taken from the population. We will consider an approach outlined for the species estimation problem by Good [6] further on in the summary. For a more complete survey of prior results obtained in the species estimation problem, see [1]. For problems in database sampling see [7], [2]. The results obtained in this paper are also related to capture-recapture problems [3], [4], [5], where the unknown population size is estimated given the number of samples that are recaptured (repetitions) when sampling randomly from the population. Here, we are interested in how many recaptures are necessary to estimate the population to a given accuracy with a specified confidence. Intuitively speaking, the more recaptures there are, the better the population size can be estimated. Formally, in an n-element sample let m denote the number of distinct elements. Let r = n − m denote the number of repeated elements. For example, in c,g,c,s,g,c,v, there are n = 7 samples, m = 4 distinct elements (c, g, s, and v), and r = 7 − 4 = 3 repeated elements (one g and two c's). In the following, n independent samples are drawn uniformly from a k-element population, and $M^k_n$ and $R^k_n = n - M^k_n$ are the random numbers of distinct and repeated elements observed. We drop the subscripts and superscripts when there is no ambiguity. A. Good’s approach By linearity of expectations, E(M) = k 1 −