Sampling-based estimation of the number of distinct values of an attribute
- In Proc. 21st International Conf. on Very Large Data Bases, 1995
Abstract - Cited by 97 (9 self)
We provide several new sampling-based estimators of the number of distinct values of an attribute in a relation. We compare these new estimators to estimators from the database and statistical literature empirically, using a large number of attribute-value distributions drawn from a variety of real-world databases. This appears to be the first extensive comparison of distinct-value estimators in either the database or statistical literature, and is certainly the first to use highly-skewed data of the sort frequently encountered in database applications. Our experiments indicate that a new “hybrid” estimator yields the highest precision on average for a given sampling fraction. This estimator explicitly takes into account the degree of skew in the data and combines a new “smoothed jackknife” estimator with an estimator due to Shlosser. We investigate how the hybrid estimator behaves as we scale up the size of the database.
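As a rough illustration of what a sampling-based distinct-value estimator looks like, the sketch below implements one commonly quoted form of Shlosser's estimator. It is not taken from this listing: the formula, the function name `shlosser_estimate`, and the assumption of a uniform random sample of n rows from a relation of N rows (sampling fraction q = n/N) are all assumptions made for illustration, and the hybrid and smoothed jackknife estimators from the abstract are not reproduced.

```python
from collections import Counter


def shlosser_estimate(sample, population_size):
    """Estimate the number of distinct values in a relation from a sample.

    Assumed form (illustrative, not quoted from the paper):
        D = d + f1 * sum_i (1-q)^i f_i / sum_i i q (1-q)^(i-1) f_i
    where d is the number of distinct values in the sample, f_i is the
    number of values occurring exactly i times in the sample, and q = n/N.
    """
    n = len(sample)
    if n == 0:
        return 0.0
    if population_size < n:
        raise ValueError("population_size must be at least the sample size")

    q = n / population_size
    value_counts = Counter(sample)                 # value -> occurrences in sample
    d = len(value_counts)                          # distinct values seen
    freq_of_freq = Counter(value_counts.values())  # i -> f_i
    f1 = freq_of_freq.get(1, 0)                    # values seen exactly once

    # With no singletons the correction term vanishes; this also avoids a
    # zero denominator when q == 1.
    if f1 == 0:
        return float(d)

    numerator = sum((1 - q) ** i * f for i, f in freq_of_freq.items())
    denominator = sum(i * q * (1 - q) ** (i - 1) * f
                      for i, f in freq_of_freq.items())
    return d + f1 * numerator / denominator
```

For example, `shlosser_estimate([1, 2, 3, 4], 8)` scales an all-distinct sample up to the full relation size, while a sample with no singletons returns the observed distinct count unchanged.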
Random Sampling for Histogram Construction: How much is enough?
- 1998
Abstract - Cited by 91 (11 self)
Random sampling is a standard technique for constructing (approximate) histograms for query optimization. However, any real implementation in commercial products requires solving the hard problem of determining "How much sampling is enough?" We address this critical question in the context of equi-height histograms used in many commercial products, including Microsoft SQL Server. We introduce a conservative error metric capturing the intuition that for an approximate histogram to have low error, the error must be small in all regions of the histogram. We then present a result establishing an optimal bound on the amount of sampling required for pre-specified error bounds. We also describe an adaptive page sampling algorithm which achieves greater efficiency by using all values in a sampled page but adjusts the amount of sampling depending on clustering of values in pages. Next, we establish that the problem of estimating the number of distinct values is provably difficult, but propose ...
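To make the idea of an approximate equi-height histogram concrete, here is a minimal sketch that picks bucket separators from a sample and then measures how far the resulting bucket sizes on the full data stray from the ideal. The function names and the relative maximum-deviation measure are hypothetical stand-ins chosen for illustration; they are not the paper's error metric or its adaptive page sampling algorithm.

```python
from bisect import bisect_right


def equi_height_separators(sample, num_buckets):
    """Return num_buckets - 1 separator values taken at sample quantiles."""
    ordered = sorted(sample)
    n = len(ordered)
    if n == 0 or num_buckets < 1:
        raise ValueError("need a non-empty sample and at least one bucket")
    return [ordered[(i * n) // num_buckets] for i in range(1, num_buckets)]


def max_bucket_deviation(data, separators):
    """Largest relative deviation of any bucket count from the ideal height.

    A value v falls in bucket b when exactly b separators are <= v.
    The result is 0.0 for a perfectly equi-height histogram.
    """
    counts = [0] * (len(separators) + 1)
    for value in data:
        counts[bisect_right(separators, value)] += 1
    ideal = len(data) / len(counts)
    return max(abs(count - ideal) for count in counts) / ideal
```

Taking the maximum over buckets, rather than an average, mirrors the intuition in the abstract that the error must be small in every region of the histogram.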
Random Sampling from Databases - A Survey
- Statistics and Computing, 1994
Abstract - Cited by 20 (0 self)
This paper reviews recent literature on techniques for obtaining random samples from databases. We begin with a discussion of why one would want to include sampling facilities in database management systems. We then review basic sampling techniques used in constructing DBMS sampling algorithms, e.g., acceptance/rejection and reservoir sampling. A discussion of sampling from various data structures follows: B+ trees, hash files, and spatial data structures (including R-trees and quadtrees). Algorithms for sampling from simple relational queries, e.g., single relational operators such as selection, intersection, union, set difference, projection, and join are then described. We then describe sampling for estimation of aggregates (e.g., the size of query results). Here we discuss both clustered sampling and sequential sampling approaches. Decision theoretic approaches to sampling for query optimization are reviewed. DRAFT of March 22, 1994.
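Reservoir sampling, one of the basic techniques named in this abstract, can be sketched in a few lines. The version below is the textbook single-pass variant often called Algorithm R, shown here as an illustration under that assumption; the survey itself may present different variants, and the function name and the optional `rng` parameter are hypothetical.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Draw a uniform random sample of up to k items in one pass.

    The stream length does not need to be known in advance. If the stream
    holds fewer than k items, all of them are returned.
    """
    rng = rng or random.Random()
    reservoir = []
    for index, item in enumerate(stream):
        if index < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Keep the new item with probability k / (index + 1) by drawing
            # a slot uniformly from 0..index (randint is inclusive).
            slot = rng.randint(0, index)
            if slot < k:
                reservoir[slot] = item
    return reservoir
```

A call such as `reservoir_sample(range(1000), 10)` returns ten items, each stream element being equally likely to appear.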
Distinct Values Estimators for Power Law Distributions
- 2006
Abstract - Cited by 4 (0 self)
The number of distinct values in a relation is an important statistic for database query optimization. As databases have grown in size, scalability of distinct values estimators has become extremely important, since a naïve linear scan through the data is no longer feasible. An approach that scales very well involves taking a sample of the data, and performing the estimate on the sample. Unfortunately, it has been shown that obtaining estimators with guaranteed small error bounds requires an extremely large sample size in the worst case. On the other hand, it is typically the case that the data is not worst-case, but follows some form of a Power Law or Zipfian distribution. We exploit data distribution assumptions to devise distinct-values estimators with analytic error guarantees for Zipfian distributions. Our estimators are the first to have the required number of samples depend only on the number of distinct values present, D, and not the database size, n. This allows the estimators to scale well with the size of the database, particularly if the growth is due to multiple copies of the data. In addition to theoretical analysis, we also provide experimental evidence of the effectiveness of our estimators by benchmarking their performance against previously best known heuristic and analytic estimators on both synthetic and real-world datasets.
Distinct Value Estimation on Peer-to-Peer Networks
Abstract
Peer-to-Peer networks have become very popular on the Internet, with millions of peers all over the world sharing large volumes of data. In the assistive healthcare sector, it is likely that P2P networks will develop that interconnect and allow the controlled sharing of patient databases of various hospitals, clinics, and research laboratories. However, the sheer scale of these networks has made it difficult to gather statistics that could be used for building new features. In this paper, we present a technique to obtain estimations of the number of distinct values matching a query on the network. We evaluate the technique experimentally and provide a set of results that demonstrate its effectiveness, as well as its flexibility in supporting a variety of queries and applications.
The Population Frequencies of Species and the Estimation of Population Parameters
On Estimators for Aggregate Relational Algebra Queries
- 1995
Abstract
CASE-DB is a relational database management system that allows users to specify time constraints in queries. For an aggregate query AGG(E) where AGG is one of COUNT, SUM and AVERAGE, and E is a relational algebra expression, CASE-DB uses statistical estimators to approximate the query. This paper extends our earlier work on statistical estimators of CASE-DB with the following features: (a) New statistical estimators for COUNT queries with projection. (b) Extending the methodology for SUM and AVERAGE aggregate queries. (c) New sampling plans based on systematic sampling and stratified sampling. We also present performance evaluation experiments of the estimators with the above extensions using correlated and uncorrelated database instances.

