## Random Sampling for Histogram Construction: How much is enough? (1998)

Citations: | 115 - 11 self |

### BibTeX

@INPROCEEDINGS{Chaudhuri98randomsampling,

author = {Surajit Chaudhuri and Rajeev Motwani and Vivek Narasayya},

title = {Random Sampling for Histogram Construction: How much is enough?},

booktitle = {},

year = {1998},

pages = {436--447}

}

### Years of Citing Articles

### OpenURL

### Abstract

Random sampling is a standard technique for constructing (approximate) histograms for query optimization. However, any real implementation in commercial products requires solving the hard problem of determining "How much sampling is enough?" We address this critical question in the context of equi-height histograms used in many commercial products, including Microsoft SQL Server. We introduce a conservative error metric capturing the intuition that for an approximate histogram to have low error, the error must be small in all regions of the histogram. We then present a result establishing an optimal bound on the amount of sampling required for pre-specified error bounds. We also describe an adaptive page sampling algorithm which achieves greater efficiency by using all values in a sampled page but adjusts the amount of sampling depending on clustering of values in pages. Next, we establish that the problem of estimating the number of distinct values is provably difficult, but propose ...

### Citations

1923 | Randomized Algorithms
- Motwani, Raghavan
- 1995
(Show Context)
Citation Context ...sentially optimal formulas describing the trade-off between the various parameters, notably the number of bucketssk, the error ffi, and the number of random samples r. 3 Refer to Motwani and Raghavan =-=[21]-=- for all basic definitions /results on randomization used in this paper. Theorem 4 Let ffisn k . An equi-height k-histogram for a random sample R of size r from a value set V of size n gives a ffi-dev... |

506 | Access path selection in a relational database management system
- Selinger, Astrahan, et al.
- 1979
(Show Context)
Citation Context ...mber of distinct values is an important subproblem in query optimization, e.g., for estimating projection size or in estimating relative error in join-selectivity estimation formulas used in System R =-=[28]-=-. We are interested in the issue of devising a good estimator for d based on a random sample from V . In the statistics literature, the problem of estimating the number of distinct values (called the ... |

251 | Improved histograms for selectivity estimation of range predicates
- Poosala, Haas, et al.
- 1996
(Show Context)
Citation Context ...e will discuss the relationship between this and our results in Section 3. As regards the problem of estimating the number of distinct values, we have sharpened the intuition obtained in earlier work =-=[10, 26]-=- by providing a negative result which explains why distinct values cannot be approximated reliably. In addition, we present a new estimator that is provably optimal and performs extremely well in our ... |

168 | Fast incremental maintenance of approximate histograms
- Gibbons, Matias, et al.
- 1997
(Show Context)
Citation Context ...s require us to derive a histogram that is reasonably accurate for a large class of queries (preferably all) with high probability. Although the main focus of the work in Gibbons, Matias, and Poosala =-=[8]-=- is on the incremental maintenance of histograms, they do provide a distribution-independent bound on the required amount of sampling. However, our bounds are significantly stronger and lends to ease ... |

156 | Practical selectivity estimation through adaptive sampling
- Lipton, Naughton, et al.
- 1990
(Show Context)
Citation Context ...ing for approximating histograms and distinct values. 1.1 Related Work Random sampling has been proposed and used in many different contexts in databases [23, 24]. In particular, a large body of work =-=[18, 19, 20, 13, 14, 12, 11, 17]-=- has addressed the problem of estimating result size for a given query using random sampling. In contrast, the problem that we have addressed requires us to estimate a histogram using random sampling.... |

149 |
Accurate estimation of the number of tuples satisfying a condition
- Piatatetsky-Shapiro, Connell
- 1984
(Show Context)
Citation Context ...he recommended sampling bounds are based on distribution-specific assumptions and heuristic analysis. The use of random sampling for estimating histogram was proposed by Piatetsky-Shapiro and Connell =-=[27]-=-. They show that given a particular query , only a small sample size is needed to estimate a histogram that is adequate for the query with a high probability. In sharp contrast, the problem of approxi... |

139 | Balancing Histogram Optimality and Practicality for Query Result Size Estimation
- Ioannidis, Poosala
- 1995
(Show Context)
Citation Context ...QL Server. In the full paper [5], we also discuss how the above analysis can be adapted to the related family of compressed histograms. Extending our results to the case of other histogram structures =-=[15, 16]-=- is one of our ongoing research goals. The rest of the paper is organized as follows. In the remainder of this section, we briefly summarize past work in the area. In Section 2, we introduce equi-heig... |

133 | An Efficient, Cost-Driven Index Selection Tool for Microsoft SQL Server - Chaudhuri, Narasayya - 1997 |

122 | Sampling-based estimation of the number of distinct values of an attribute
- Haas, Naughton, et al.
- 1995
(Show Context)
Citation Context ...e will discuss the relationship between this and our results in Section 3. As regards the problem of estimating the number of distinct values, we have sharpened the intuition obtained in earlier work =-=[10, 26]-=- by providing a negative result which explains why distinct values cannot be approximated reliably. In addition, we present a new estimator that is provably optimal and performs extremely well in our ... |

110 | Random Sampling from Databases
- Olken
- 1993
(Show Context)
Citation Context ...ary of our proposal for effective use of sampling for approximating histograms and distinct values. 1.1 Related Work Random sampling has been proposed and used in many different contexts in databases =-=[23, 24]-=-. In particular, a large body of work [18, 19, 20, 13, 14, 12, 11, 17] has addressed the problem of estimating result size for a given query using random sampling. In contrast, the problem that we hav... |

107 |
Non-parametric estimation of the number of classes in a population
- Chao
- 1984
(Show Context)
Citation Context ...okes [10]. As observed by the latter, we obtain fairly poor performance when using the standard statistical estimators for the number of distinct values, e.g., the estimators due to Goodman [9], Chao =-=[4]-=-, and Burnham and Overton [2, 3] which have been used earlier in the database context in the work of Hou, Ozsoyoglu, and Taneja [13, 14] and Ozsoyoglu et al [25]. As Haas et al [10] remark, distinct v... |

90 |
Estimating the number of species: a review
- Bunge, Fitzpatrick
- 1993
(Show Context)
Citation Context ...m sample from V . In the statistics literature, the problem of estimating the number of distinct values (called the problem of estimating the number of species) has received a great deal of attention =-=[1]-=-. Recent work in the database literature includes the results of Naughton and Seshadri [22], and Haas, Naughton, Seshadri, and Stokes [10]. As observed by the latter, we obtain fairly poor performance... |

86 |
Sequential Sampling Procedures for Query Size Estimation
- Haas, Swami
- 1992
(Show Context)
Citation Context ...ing for approximating histograms and distinct values. 1.1 Related Work Random sampling has been proposed and used in many different contexts in databases [23, 24]. In particular, a large body of work =-=[18, 19, 20, 13, 14, 12, 11, 17]-=- has addressed the problem of estimating result size for a given query using random sampling. In contrast, the problem that we have addressed requires us to estimate a histogram using random sampling.... |

83 | Physical Database Design for Relational Databases
- Finkelstein, Schkolnick, et al.
- 1988
(Show Context)
Citation Context ...s of great importance. In addition to greatly enhancing the accuracy of the optimizer's estimates, such statistical information is of great interest in tools that help decide physical database design =-=[7, 6]-=-. In many commercial systems, including Microsoft SQL Server, the optimizer relies on histograms on selected columns to estimate selectivities of queries. In addition to histograms, density as well as... |

69 |
Query size estimation by adaptive sampling
- Lipton, Naughton
- 1995
(Show Context)
Citation Context ...ing for approximating histograms and distinct values. 1.1 Related Work Random sampling has been proposed and used in many different contexts in databases [23, 24]. In particular, a large body of work =-=[18, 19, 20, 13, 14, 12, 11, 17]-=- has addressed the problem of estimating result size for a given query using random sampling. In contrast, the problem that we have addressed requires us to estimate a histogram using random sampling.... |

61 |
Processing aggregate relational queries with hard time constraints
- Hou, Ozsoyoglu, et al.
- 1989
(Show Context)
Citation Context |

56 |
Statistical estimators for relational algebra expressions
- Hou, zsoyoglu, et al.
- 1988
(Show Context)
Citation Context |

41 |
Estimation of the size of a closed population when capture probabilities vary among animals. Biometrika 65: 623–633. 4526
- Burnham, Overton
- 1978
(Show Context)
Citation Context ... latter, we obtain fairly poor performance when using the standard statistical estimators for the number of distinct values, e.g., the estimators due to Goodman [9], Chao [4], and Burnham and Overton =-=[2, 3]-=- which have been used earlier in the database context in the work of Hou, Ozsoyoglu, and Taneja [13, 14] and Ozsoyoglu et al [25]. As Haas et al [10] remark, distinct value estimation is a hard and re... |

38 |
Efficient Sampling Strategies for Relational Database Operations. Theoretical Computer Science
- Lipton, Naughton, et al.
- 1993
(Show Context)
Citation Context |

37 |
Robust estimation of population size when capture probabilities vary among animals. Ecology 60:927–936
- Burnham, Overton
- 1979
(Show Context)
Citation Context ... latter, we obtain fairly poor performance when using the standard statistical estimators for the number of distinct values, e.g., the estimators due to Goodman [9], Chao [4], and Burnham and Overton =-=[2, 3]-=- which have been used earlier in the database context in the work of Hou, Ozsoyoglu, and Taneja [13, 14] and Ozsoyoglu et al [25]. As Haas et al [10] remark, distinct value estimation is a hard and re... |

25 | Random Sampling from Databases - A Survey
- Olken, Rotem
- 1995
(Show Context)
Citation Context ...ary of our proposal for effective use of sampling for approximating histograms and distinct values. 1.1 Related Work Random sampling has been proposed and used in many different contexts in databases =-=[23, 24]-=-. In particular, a large body of work [18, 19, 20, 13, 14, 12, 11, 17] has addressed the problem of estimating result size for a given query using random sampling. In contrast, the problem that we hav... |

24 |
Error-Constrained COUNT Query Evaluation in Relational Databases
- Hou, Özsoyoglu, et al.
- 1991
(Show Context)
Citation Context |

23 | Histogram-based Solutions to Diverse Database Estimation Problems
- Ioannidis, Poosala
(Show Context)
Citation Context ...QL Server. In the full paper [5], we also discuss how the above analysis can be adapted to the related family of compressed histograms. Extending our results to the case of other histogram structures =-=[15, 16]-=- is one of our ongoing research goals. The rest of the paper is organized as follows. In the remainder of this section, we briefly summarize past work in the area. In Section 2, we introduce equi-heig... |

19 |
On Estimating the Size of Projections
- Naughton, Seshadri
- 1990
(Show Context)
Citation Context ...stinct values (called the problem of estimating the number of species) has received a great deal of attention [1]. Recent work in the database literature includes the results of Naughton and Seshadri =-=[22]-=-, and Haas, Naughton, Seshadri, and Stokes [10]. As observed by the latter, we obtain fairly poor performance when using the standard statistical estimators for the number of distinct values, e.g., th... |

15 |
Behaviour and the Principle of Least Effort
- “Human
- 1949
(Show Context)
Citation Context ...For an integer column this translates to 600 bins. We modified the histogram data structure so that the number of bins could be varied. Data Generation: We generated data using the Zipf distributions =-=[29]-=-. The skewness parameter Z was varied. Although we studied several different values of Z varying from 0 to 4, we present results only for three values of Z --- 0, 2, and 4. The distribution is uniform... |

14 |
An evaluation of sampling-based size estimation methods for selections in database systems
- Ling, Sun
- 1995
(Show Context)
Citation Context |

13 |
On the estimation of the number of classes in a population
- Goodman
- 1949
(Show Context)
Citation Context ...ri, and Stokes [10]. As observed by the latter, we obtain fairly poor performance when using the standard statistical estimators for the number of distinct values, e.g., the estimators due to Goodman =-=[9]-=-, Chao [4], and Burnham and Overton [2, 3] which have been used earlier in the database context in the work of Hou, Ozsoyoglu, and Taneja [13, 14] and Ozsoyoglu et al [25]. As Haas et al [10] remark, ... |

10 |
Using random sampling for histogram construction: how much is enough
- Chaudhuri, Motwani, et al.
- 1998
(Show Context)
Citation Context ... focus on equi-height (or, equi-depth) histograms. We chose equi-height histograms for our analysis since they are commonly used in many commercial optimizers, including SQL Server. In the full paper =-=[5]-=-, we also discuss how the above analysis can be adapted to the related family of compressed histograms. Extending our results to the case of other histogram structures [15, 16] is one of our ongoing r... |

10 |
On Estimating COUNT, SUM, and AVERAGE Relational Algebra Queries
- Özsoyoglu, Du, et al.
- 1991
(Show Context)
Citation Context ...the estimators due to Goodman [9], Chao [4], and Burnham and Overton [2, 3] which have been used earlier in the database context in the work of Hou, Ozsoyoglu, and Taneja [13, 14] and Ozsoyoglu et al =-=[25]-=-. As Haas et al [10] remark, distinct value estimation is a hard and relatively unsolved problem, and few analytic results are available. In this section, we first establish that no estimator for d ca... |