Results 1 - 8 of 8
Sampling-based estimation of the number of distinct values of an attribute
 In Proc. 21st International Conf. on Very Large Data Bases
, 1995
Abstract

Cited by 114 (9 self)
We provide several new sampling-based estimators of the number of distinct values of an attribute in a relation. We compare these new estimators to estimators from the database and statistical literature empirically, using a large number of attribute-value distributions drawn from a variety of real-world databases. This appears to be the first extensive comparison of distinct-value estimators in either the database or statistical literature, and is certainly the first to use highly skewed data of the sort frequently encountered in database applications. Our experiments indicate that a new “hybrid” estimator yields the highest precision on average for a given sampling fraction. This estimator explicitly takes into account the degree of skew in the data and combines a new “smoothed jackknife” estimator with an estimator due to Shlosser. We investigate how the hybrid estimator behaves as we scale up the size of the database.
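As a rough illustration of what a sampling-based distinct-value estimator computes, the sketch below tallies how many values occur exactly once or exactly twice in a sample and applies the bias-corrected form of Chao's estimator. This is a simple, well-known estimator chosen for brevity. It is not the hybrid, smoothed jackknife, or Shlosser estimator described in the abstract, and the function name `chao_estimate` is hypothetical.

```python
from collections import Counter


def chao_estimate(sample):
    """Estimate the number of distinct values in a relation from a sample.

    Uses the bias-corrected Chao form d + f1*(f1-1) / (2*(f2+1)), where
    d  = number of distinct values seen in the sample,
    f1 = number of values seen exactly once,
    f2 = number of values seen exactly twice.
    (Illustrative stand-in, not an estimator from the paper.)
    """
    value_counts = Counter(sample)                    # value -> occurrences
    freq_of_freq = Counter(value_counts.values())     # occurrences -> how many values
    d = len(value_counts)
    f1 = freq_of_freq.get(1, 0)
    f2 = freq_of_freq.get(2, 0)
    return d + f1 * (f1 - 1) / (2 * (f2 + 1))
```

Many singletons in the sample suggest many unseen values, so the estimate grows with `f1`; a sample with no singletons returns just the observed distinct count.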
Random Sampling for Histogram Construction: How much is enough?
, 1998
Abstract

Cited by 106 (11 self)
Random sampling is a standard technique for constructing (approximate) histograms for query optimization. However, any real implementation in commercial products requires solving the hard problem of determining "How much sampling is enough?" We address this critical question in the context of equi-height histograms used in many commercial products, including Microsoft SQL Server. We introduce a conservative error metric capturing the intuition that for an approximate histogram to have low error, the error must be small in all regions of the histogram. We then present a result establishing an optimal bound on the amount of sampling required for prespecified error bounds. We also describe an adaptive page sampling algorithm which achieves greater efficiency by using all values in a sampled page but adjusts the amount of sampling depending on clustering of values in pages. Next, we establish that the problem of estimating the number of distinct values is provably difficult, but propose ...
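To make the idea of building an approximate equi-height histogram from a random sample concrete, here is a minimal sketch that draws a uniform row-level sample and uses its quantiles as bucket separators. It does not implement the paper's error metric, sampling bound, or adaptive page sampling; the function name `equi_height_separators` and its parameters are assumptions made for illustration.

```python
import random


def equi_height_separators(values, num_buckets, sample_size, seed=0):
    """Approximate equi-height bucket separators from a uniform random sample.

    values is a list of attribute values. Returns num_buckets - 1 separator
    values, chosen so that each bucket holds roughly the same number of
    sampled values. (Illustrative sketch, not the paper's algorithm.)
    """
    rng = random.Random(seed)
    sample = sorted(rng.sample(values, min(sample_size, len(values))))
    n = len(sample)
    if n == 0:
        return []
    # The i-th separator is the sample value at the i/num_buckets quantile.
    return [sample[(i * n) // num_buckets] for i in range(1, num_buckets)]
```

With a sample that covers the whole column the separators are the exact quantiles; with smaller samples they drift, which is the error the abstract's question "How much sampling is enough?" is about.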
Random Sampling from Databases - A Survey
 Statistics and Computing
, 1994
Abstract

Cited by 22 (0 self)
This paper reviews recent literature on techniques for obtaining random samples from databases. We begin with a discussion of why one would want to include sampling facilities in database management systems. We then review basic sampling techniques used in constructing DBMS sampling algorithms, e.g., acceptance/rejection and reservoir sampling. A discussion of sampling from various data structures follows: B+ trees, hash files, spatial data structures (including R-trees and quadtrees). Algorithms for sampling from simple relational queries, e.g., single relational operators such as selection, intersection, union, set difference, projection, and join are then described. We then describe sampling for estimation of aggregates (e.g., the size of query results). Here we discuss both clustered sampling and sequential sampling approaches. Decision-theoretic approaches to sampling for query optimization are reviewed. DRAFT of March 22, 1994. 1 Introduction In this paper we sur...
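Reservoir sampling, one of the basic techniques the abstract names, can be sketched in a few lines. The version below is the classic single-pass form often called Algorithm R, which keeps a uniform sample of `k` items from a stream of unknown length. The function name and the `seed` parameter are assumptions for this illustration, not something taken from the survey.

```python
import random


def reservoir_sample(stream, k, seed=None):
    """Return a uniform random sample of up to k items from an iterable.

    Makes a single pass and keeps only k items in memory, so the stream
    length does not need to be known in advance.
    """
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item i (0-based) replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # randint is inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir
```

For example, `reservoir_sample(open_cursor, 100)` would keep 100 rows from a scan without materializing the scan, where `open_cursor` stands for any hypothetical row iterator.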
Distinct Values Estimators for Power Law Distributions
, 2006
Abstract

Cited by 5 (0 self)
The number of distinct values in a relation is an important statistic for database query optimization. As databases have grown in size, scalability of distinct values estimators has become extremely important, since a naïve linear scan through the data is no longer feasible. An approach that scales very well involves taking a sample of the data, and performing the estimate on the sample. Unfortunately, it has been shown that obtaining estimators with guaranteed small error bounds requires an extremely large sample size in the worst case. On the other hand, it is typically the case that the data is not worst-case, but follows some form of a Power Law or Zipfian distribution. We exploit data distribution assumptions to devise distinct-values estimators with analytic error guarantees for Zipfian distributions. Our estimators are the first to have the required number of samples depend only on the number of distinct values present, D, and not the database size, n. This allows the estimators to scale well with the size of the database, particularly if the growth is due to multiple copies of the data. In addition to theoretical analysis, we also provide experimental evidence of the effectiveness of our estimators by benchmarking their performance against previously best-known heuristic and analytic estimators on both synthetic and real-world datasets.
Distinct Value Estimation on Peer-to-Peer Networks
Abstract
Peer-to-Peer networks have become very popular on the Internet, with millions of peers all over the world sharing large volumes of data. In the assistive healthcare sector, it is likely that P2P networks will develop that interconnect and allow the controlled sharing of patient databases of various hospitals, clinics, and research laboratories. However, the sheer scale of these networks has made it difficult to gather statistics that could be used for building new features. In this paper, we present a technique to obtain estimations of the number of distinct values matching a query on the network. We evaluate the technique experimentally and provide a set of results that demonstrate its effectiveness, as well as its flexibility in supporting a variety of queries and applications.
THE POPULATION FREQUENCIES OF SPECIES AND THE ESTIMATION OF POPULATION PARAMETERS
On Estimators for Aggregate Relational Algebra Queries
, 1995
Abstract
CASE-DB is a relational database management system that allows users to specify time constraints in queries. For an aggregate query AGG(E) where AGG is one of COUNT, SUM and AVERAGE, and E is a relational algebra expression, CASE-DB uses statistical estimators to approximate the query. This paper extends our earlier work on statistical estimators of CASE-DB with the following features: (a) New statistical estimators for COUNT queries with projection. (b) Extending the methodology for SUM and AVERAGE aggregate queries. (c) New sampling plans based on systematic sampling and stratified sampling. We also present performance evaluation experiments of the estimators with the above extensions using correlated and uncorrelated database instances.
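A stratified sampling plan for a COUNT query can be sketched as follows: sample each stratum separately, scale the hit rate by the stratum size, and sum. This is the textbook stratified estimator for a COUNT over a selection only. It is not the system's own estimator, it does not cover the harder COUNT-with-projection case the abstract mentions, and the function name and parameters are assumptions for illustration.

```python
import random


def stratified_count_estimate(strata, predicate, sample_size, seed=0):
    """Estimate how many rows satisfy predicate using stratified sampling.

    strata is a list of lists of rows; up to sample_size rows (at least 1)
    are drawn without replacement from each stratum. Each stratum contributes
    N_h * (hits_h / n_h), where N_h is its size and n_h its sample size.
    """
    rng = random.Random(seed)
    estimate = 0.0
    for stratum in strata:
        if not stratum:
            continue  # an empty stratum contributes nothing
        n_h = min(sample_size, len(stratum))
        sample = rng.sample(stratum, n_h)
        hits = sum(1 for row in sample if predicate(row))
        estimate += len(stratum) * hits / n_h
    return estimate
```

When the per-stratum sample size reaches the stratum size the estimate becomes the exact count, which gives a simple sanity check for the sketch.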