## Improved Histograms for Selectivity Estimation of Range Predicates (1996)

Citations: | 236 - 20 self |

### BibTeX

@INPROCEEDINGS{Poosala96improvedhistograms,

author = {Viswanath Poosala and Peter J. Haas and Yannis E. Ioannidis and Eugene J. Shekita},

title = {Improved Histograms for Selectivity Estimation of Range Predicates},

booktitle = {},

year = {1996},

pages = {294--305}

}

### Years of Citing Articles

### OpenURL

### Abstract

Many commercial database systems maintain histograms to summarize the contents of relations and permit efficient estimation of query result sizes and access plan costs. Although several types of histograms have been proposed in the past, there has never been a systematic study of all histogram aspects, the available choices for each aspect, and the impact of such choices on histogram effectiveness. In this paper, we provide a taxonomy of histograms that captures all previously proposed histogram types and indicates many new possibilities. We introduce novel choices for several of the taxonomy dimensions, and derive new histogram types by combining choices in effective ways. We also show how sampling techniques can be used to reduce the cost of histogram construction. Finally, we present results from an empirical study of the proposed histogram types used in selectivity estimation of range predicates and identify the histogram types that have the best overall performance. 1 Introduction...

### Citations

1497 | Probability inequalities for sums of bounded random variables
- Hoeffding
- 1963
(Show Context)
Citation Context ... to achieve the above accuracy. The above guidelines are conservative in that fewer samples are actually needed in practice to achieve a given degree of accuracy. (For example, Theorem 1 in Hoeffding =-=[Hoe63]-=- implies that only about 190 samples are actually needed to achieve the above accuracy for an equality predicate X = v i in the specific case f(v i ) = 0:2.) In our experiments, we used a sample size ... |

1093 |
A Practical Guide to Splines
- Boor
- 1978
(Show Context)
Citation Context ...-linear approximations is known in numerical analysis as the optimal knot placement problem, which unfortunately, has no efficient solution [dB95]. We have adapted a heuristic algorithm due to deBoor =-=[dB78]-=-. Although rather complicated, the algorithm has very low time and space complexity; a detailed description appears elsewhere [dB78]. 4.5 V-Optimal(F,F) V-optimal(F,F) histograms group contiguous sets... |

324 |
Human Behaviour and the Principle of Least Effort: an Introduction to Human Ecology
- Zipf
- 1949
(Show Context)
Citation Context ...frequency of the value will be computed exactly. It is shown in [HS95], for example, that if the attribute values of a relation containing 10 6 tuples are distributed according to a Zipf distribution =-=[Zip49] with para-=-meter z = 0:86 (roughly, an "80-20" law), then with a probability of approximately 99.9% the 10 most frequent values will all appear in a sample of 1000 tuples (i.e., in a 0.1% sample). The ... |

260 | Random sampling with a reservoir
- VITTER
- 1985
(Show Context)
Citation Context ...the quantile values for the sample are used as estimates of the corresponding quantile values for the entire relation. To obtain the random sample, we use a reservoir sampling algorithm due to Vitter =-=[Vit85]-=-. This algorithm (called Algorithm X in Vitter's paper) obtains a random sample of size n during a single pass through the relation. The number of tuples in the relation does not need to be known befo... |

183 | Equi-depth histograms for estimating selectivity factors for multidimensional queries - Muralikrishna, DeWitt - 1988 |

155 | Randomized algorithms for optimizing large join queries - Ioannidis, Kang - 1990 |

154 | Practical selectivity estimation through adaptive sampling - Lipton, Naughton, et al. - 1990 |

144 |
Accurate estimation of the number of tuples satisfying a condition
- Piatetsky-Shapiro, Connell
- 1984
(Show Context)
Citation Context ...ption requires storage of the lowest and highest value in each bucket. Note that, for singleton buckets, this requires storing only one attribute value. Another approach is the point value assumption =-=[PSC84]-=-, where only one attribute value is assumed to be present (usually the lowest among those actually in the bucket). This assumption requires storage of this single attribute value. Finally, the histogr... |

133 | Balancing histogram optimality and practicality for query result size estimation
- Ioannidis, Poosala
- 1995
(Show Context)
Citation Context ...ttribute value is assumed to be present (usually the lowest among those actually in the bucket). This assumption requires storage of this single attribute value. Finally, the histograms considered in =-=[IP95]-=- record every distinct attribute value that appears in each bucket (i.e., no assumptions are made). Such histograms require an auxiliary index for efficient access when estimating the result size of a... |

118 |
On the Propagation of Errors in the Size of Join Results
- Ioannidis, Christodoulakis
- 1991
(Show Context)
Citation Context ...of the optimizer's decisions or render the user interface application unreliable. Earlier work has shown that errors in query result size estimates may increase exponentially with the number of joins =-=[IC91]-=-. This result, in conjunction with the increasing complexity of queries, demonstrates the critical importance of accurate estimation. Several techniques have been proposed in the literature to estimat... |

114 | Sampling-based estimation of the number of distinct values of an attribute
- Haas, Naughton, et al.
- 1995
(Show Context)
Citation Context ... absence of these values does not introduce serious errors into the final querysize estimate. In future work, we will investigate the utility of more sophisticated estimation methods such as those in =-=[HNSS95]-=-. 7.4 Spreads Histograms that use the area as a source and/or sort parameter require computation of the spreads of different attribute values. As with frequencies, spreads are expensive to calculate e... |

104 | Adaptive selectivity estimation using query feedback - Chen, Roussopoulos - 1994 |

97 | Optimization of large join queries - Swami, Gupta - 1988 |

80 | Universality of serial histograms - Ioannidis - 1993 |

78 |
The optimization of queries in relational databases
- Kooi
- 1980
(Show Context)
Citation Context ...omplexity of queries, demonstrates the critical importance of accurate estimation. Several techniques have been proposed in the literature to estimate query result sizes [MCS88], including histograms =-=[Koo80], sampling-=- [LNS90, HS95], and parametric techniques [CR94, SLRD93]. Of these, histograms approximate the frequency distribution of an attribute by grouping attribute values into "buckets" (subsets) an... |

74 | Statistical profile estimation in database systems
- Mannino, Chu, et al.
- 1988
(Show Context)
Citation Context ...junction with the increasing complexity of queries, demonstrates the critical importance of accurate estimation. Several techniques have been proposed in the literature to estimate query result sizes =-=[MCS88]-=-, including histograms [Koo80], sampling [LNS90, HS95], and parametric techniques [CR94, SLRD93]. Of these, histograms approximate the frequency distribution of an attribute by grouping attribute valu... |

55 | Optimal histograms for limiting worst-case error propagation in the size of join results - Ioannidis, Christodoulakis - 1993 |

52 |
The P 2 Algorithm for Dynamic Calculation of Quantiles and Histograms without Storing Observations
- Jain, Chlamtac
- 1985
(Show Context)
Citation Context ...while external algorithms [GS91] incur multiple scans. We therefore focus on algorithms that compute approximate quantiles. One well-known technique is the P 2 algorithm proposed by Jain and Chlamtac =-=[JC85]. The basi-=-c idea behind this one-pass algorithm is to maintain a set of five "markers" that approximate the quantile of interest, the minimum value, the maximum value, and the two additional quantiles... |

28 | An instant and accurate size estimation method for joins and selections in a retrieval-intensive environment - Sun, Ling, et al. - 1993 |

23 |
Confidence limits for an unknown distribution function
- Kolmogorov
- 1941
(Show Context)
Citation Context ...mber of samples needed. A conservative estimate of the sample size required to control the relative sampling error to a desired level can be based on the following bound, originally due to Kolmogorov =-=[Kol41]-=-. Fix ns0 and let U 1 ; U 2 ; : : : ; Un be a collection of independent and identically distributed random variables uniformly distributed on [0; 1]. For 0sxs1, denote by b F (U) n (x) the (random) fr... |

17 |
Sampling-based Selectivity Estimation for Joins using Augmented Frequent Value Statistics
- Haas, Swami
- 1995
(Show Context)
Citation Context ...tuations, it may be possible to obtain a very small "pilot" random sample of the tuples in the relation prior to the complete scan of the relation. Then, by adapting a technique due to Haas =-=and Swami [HS95]-=-, the frequencies of the most frequent values can be obtained exactly with high probability. The idea is to obtain the pilot sample and observe the distinct attribute values that appear in the sample.... |

5 |
The Kolmogorov-Smirnov test for goodness-of-fit
- Massey
- 1951
(Show Context)
Citation Context ...ve that Gn does not depend on either N , the size of the relation, or F , the form of the attribute value distribution function. The distribution Gn has been tabulated for small values of n by Massey =-=[Mas51]-=-; for large values of n (? 100), Gn (x) is well-approximated by G(n 1=2 x), where G(x) = 1 \Gamma 2 exp(\Gamma2x 2 ); cf [Mas51]. Consider a range predicate of the form v isXsv j , where v i ; v j 2 V... |

3 |
Simultaneous estimation of several percentiles. Simulation, 49:159--164
- Raatikainen
- 1987
(Show Context)
Citation Context ...read, the value of each marker is updated using a piecewiseparabolic curve-fitting technique. After the last data value is read, the middle marker is used as the estimate of the quantile. Raatikainen =-=[Raa87]-=- generalized the P 2 procedure to permit simultaneous estimation of more than one quantile during a single pass. As observed both in [Raa87] and in our own work, the accuracy of the procedure can be i... |

1 |
Equidepth partitioning of a data set basedon finding its medians
- Gurajada, Srivastava
- 1991
(Show Context)
Citation Context ...e and exact, it is too expensive to be used for the large relations typically encountered in practice. Internal sorting algorithms require too much main memory and CPU time, while external algorithms =-=[GS91]-=- incur multiple scans. We therefore focus on algorithms that compute approximate quantiles. One well-known technique is the P 2 algorithm proposed by Jain and Chlamtac [JC85]. The basic idea behind th... |