## Combining Histograms and Parametric Curve Fitting for Feedback-Driven Query Result-Size Estimation (1999)

### Cached

### Download Links

- [www.vldb.org]
- [www.acm.org]
- [paris.cs.uni-sb.de]
- DBLP

### Other Repositories/Bibliography

Venue: | VLDB CONFERENCE |

Citations: | 30 - 1 self |

### BibTeX

@INPROCEEDINGS{König99combininghistograms,

author = {Arnd Christian König and Gerhard Weikum},

title = {Combining Histograms and Parametric Curve Fitting for Feedback-Driven Query Result-Size Estimation},

booktitle = {VLDB CONFERENCE},

year = {1999},

pages = {423--434},

publisher = {}

}

### Years of Citing Articles

### OpenURL

### Abstract

This paper aims to improve the accuracy of query result-size estimations in query optimizers by leveraging the dynamic feedback obtained from observations on the executed query workload. To this end, an approximate "synopsis" of data-value distributions is devised that combines histograms with parametric curve fitting, leading to a specific class of linear splines. The approach reconciles the benefits of histograms, simplicity and versatility, with those of parametric techniques especially the adaptivity to statistically biased and dynamically evolving query workloads. The paper

### Citations

1091 | A Practical Guide to Splines - Boor - 1978 |

236 | Improved Histograms for Selectivity Estimation of Range Predicates
- Poosala, Ioannidis, et al.
- 1996
(Show Context)
Citation Context ...n accuracy and practicability is made by MaxDi (V,A) histograms [24] orV-optimal using Sort Parameters other than Frequency [17] . Histograms can be e ciently constructed by sampling-based techniques =-=[25, 10]-=-. Another promising approach are Wavelet-based histograms [21], which so far have only been examined in the context of range queries, however. Parametric Techniques. Parametric techniques (also known ... |

210 | Wavelet-based histograms for selectivity estimation
- Matias, Vitter, et al.
- 1998
(Show Context)
Citation Context ...4] orV-optimal using Sort Parameters other than Frequency [17] . Histograms can be e ciently constructed by sampling-based techniques [25, 10]. Another promising approach are Wavelet-based histograms =-=[21]-=-, which so far have only been examined in the context of range queries, however. Parametric Techniques. Parametric techniques (also known as curve- tting or regression techniques) approximate value di... |

205 |
New sampling-based summary statistics for improving approximate query answers
- Gibbons, Matias
- 1998
(Show Context)
Citation Context .... However, since the sampling itself is typically carried out at the time of the approximation, the resulting overhead prohibits the use of sampling for query optimization. Therefore, in recent works =-=[10, 8]-=- techniques for incremental maintenance of random samples have been developed.sSince our main concern is a compact representation of data, and not its acquisition, we will now concentrate on the prope... |

157 | Fast incremental maintenance of approximate histograms
- Gibbons, Matias, et al.
- 2002
(Show Context)
Citation Context ...ively static representation in that they do not easily adapt themselves to dynamically evolving value distributions (the only exceptional work being the methods for incrementalshistograms proposed by =-=[10], whi-=-ch, however, need to maintain a \backing sample" in addition to the histogram itself and are thus not exactly light-weight either). Furthermore, a histogram is a statistically unbiased representa... |

137 | Optimal histograms with quality guarantees
- JAGADISH, KOUDAS, et al.
- 1998
(Show Context)
Citation Context ... in a bucket, which is impractical in real systems. A good compromise between accuracy and practicability is made by MaxDi (V,A) histograms [24] orV-optimal using Sort Parameters other than Frequency =-=[17]-=- . Histograms can be e ciently constructed by sampling-based techniques [25, 10]. Another promising approach are Wavelet-based histograms [21], which so far have only been examined in the context of r... |

137 | Efficient mid-query re-optimization of suboptimal query execution plans
- Kabra, DeWitt
- 1998
(Show Context)
Citation Context ... our knowledge, have so far been restricted to parametric techniques. 3 Architectural Assumptions and Notation 3.1 Feedback-driven Architecture Following the earlier proposals by [4] and, especially, =-=[18]-=- for adaptive selectivity estimation and dynamic re-optimization of query execution plans, we assume that the database system monitors the sizes (i.e., cardinalities in the sense of bags) of intermedi... |

133 | Balancing histogram optimality and practicality for query result size estimation
- Ioannidis, Poosala
- 1995
(Show Context)
Citation Context ...Oracle, Microsoft SQL Server, Sybase and Teradata. The histogram accuracy depends on the type of histogram used: While V-optimal(F,F) histograms have been proven optimal for equi-joins and selections =-=[15, 13, 16]-=-, they require a list of all attribute values in a bucket, which is impractical in real systems. A good compromise between accuracy and practicability is made by MaxDi (V,A) histograms [24] orV-optima... |

118 |
On the Propagation of Errors in the Size of Join Results
- Ioannidis, Christodoulakis
- 1991
(Show Context)
Citation Context ...6 tionale for this consideration is that estimation errors for the early operators are the most troublesome in that they are the biggest factors in the error propagation for the entire execution plan =-=[14]-=-. In this paper, we will focus on feedback from exact-match selections (typically from index lookups, i.e., not necessarily table scans) and range selections. More general notions of feedback are the ... |

116 | An overview of query optimization in relational systems
- Chaudhuri
(Show Context)
Citation Context ...mials of a given maximum degree), and query-speci c sampling that take samples from the database to statistically estimate the intermediate result sizes for the query at hand (for survey material see =-=[20, 1, 24, 2]-=-). From a generalized perspective, all these approaches can be viewed as constructing an approximate representation, or synopsis [11, 9], of the data for the purpose of estimation (or even giving appr... |

108 | Synopsis data structures for massive data sets
- Gibbons, Matias
- 1999
(Show Context)
Citation Context ...ult sizes for the query at hand (for survey material see [20, 1, 24, 2]). From a generalized perspective, all these approaches can be viewed as constructing an approximate representation, or synopsis =-=[11, 9]-=-, of the data for the purpose of estimation (or even giving approximative query answers, which isnot considered in this paper, however). With modern OLAP tools and other forms of decision-support quer... |

107 | Random sampling for histogram construction: How much is enough
- Chaudhuri, Motwani, et al.
- 1998
(Show Context)
Citation Context ...e sampling on a per query basis. Sampling is still very useful and, in fact, the method of choice for constructing histograms as a query-independent synopsis of the database's datavalue distributions =-=[3]-=-. Histograms are traditionally a relatively static representation in that they do not easily adapt themselves to dynamically evolving value distributions (the only exceptional work being the methods f... |

104 | Adaptive selectivity estimation using query feedback
- Chen, Roussopoulos
- 1994
(Show Context)
Citation Context ...echniques can naturally accommodate biased observations from the actually executed queries, and can be adapted to dynamic shifts in the query patterns relatively easily by recomputing the curve tting =-=[4]-=-. However, this does not mean that parametric techniques are an adequate representation in the rst place. In fact, they seem to be suitable only for data-value distributions that resemble a closed-for... |

81 | The New Jersey data reduction report
- Barbarà, DuMouchel, et al.
- 1997
(Show Context)
Citation Context ...mials of a given maximum degree), and query-speci c sampling that take samples from the database to statistically estimate the intermediate result sizes for the query at hand (for survey material see =-=[20, 1, 24, 2]-=-). From a generalized perspective, all these approaches can be viewed as constructing an approximate representation, or synopsis [11, 9], of the data for the purpose of estimation (or even giving appr... |

80 |
Universality of serial histograms
- Ioannidis
- 1993
(Show Context)
Citation Context ...Oracle, Microsoft SQL Server, Sybase and Teradata. The histogram accuracy depends on the type of histogram used: While V-optimal(F,F) histograms have been proven optimal for equi-joins and selections =-=[15, 13, 16]-=-, they require a list of all attribute values in a bucket, which is impractical in real systems. A good compromise between accuracy and practicability is made by MaxDiff(V,A) histograms [24] or V-opti... |

70 | Histogram-based estimation techniques in databases
- Poosala
- 1997
(Show Context)
Citation Context ...mials of a given maximum degree), and query-speci c sampling that take samples from the database to statistically estimate the intermediate result sizes for the query at hand (for survey material see =-=[20, 1, 24, 2]-=-). From a generalized perspective, all these approaches can be viewed as constructing an approximate representation, or synopsis [11, 9], of the data for the purpose of estimation (or even giving appr... |

68 |
Recursive Estimation and Time Series Analysis
- Young
- 1984
(Show Context)
Citation Context ...iderable overhead when \ tting" long series of query feedbacks. However, since the feedback arrives incrementally, we can use an iterative tting technique know astherecursive least squares regres=-=sion [29]-=-. For this incremental approach, we only need to maintain two m m matrices, as opposed to a k m matrix. These matrices are updated with each feedback (for a detailed description of recursive least squ... |

55 |
Optimal histograms for limiting worst-case error propagation in the size of join results
- Ioannidis, Christodoulakis
- 1993
(Show Context)
Citation Context ...Oracle, Microsoft SQL Server, Sybase and Teradata. The histogram accuracy depends on the type of histogram used: While V-optimal(F,F) histograms have been proven optimal for equi-joins and selections =-=[15, 13, 16]-=-, they require a list of all attribute values in a bucket, which is impractical in real systems. A good compromise between accuracy and practicability is made by MaxDi (V,A) histograms [24] orV-optima... |

39 | LEDA: A Library of Efficient Data Types and Algorithms
- Mehlhorn, Näher
- 1989
(Show Context)
Citation Context ...s initially, and m upon termination), and executes 4 different types of operations: 1. Removing an item from the priority queue Q. We use an implementation of priority queues based on Fibonacci heaps =-=[22]-=-, allowing the removal of an item in a queue of size n in O(log 2 n) time. 2. Merging two buckets, requiring constant time. 3. Calculating the error resulting from a merge, requiring constant time. 4.... |

38 |
Selectivity and cost estimation for joins based on random sampling
- Haas, Naughton, et al.
- 1996
(Show Context)
Citation Context ...ore accurate estimations for frequently queried values or value ranges. Sampling. These techniques compute their estimates by collecting and processing random samples of the data. Sampling techniques =-=[12, 3, 7, 19]-=- offer high accuracy and probabilistic guarantees on the quality of the estimation. However, since the sampling itself is typically carried out at the time of the approximation, the resulting overhead... |

32 |
Dynamic Programming
- Sniedovich
- 1991
(Show Context)
Citation Context ...error E 0 = spline err [lowl,1;highl,1) + P m i=l spline err [low 0 i ;hig0 hi) <E. Because of this property, the problem of nding an optimal partitioning can be seen as a dynamic programming problem =-=[27]-=-. This allows us to formulate a recursive rede nition of formula 6: De ne opt errlow;m := the optimal overall error for tting [vlow;vn) by m buckets. err [low;high) := the approximation error spline e... |

28 |
An instant and accurate size estimation method for joins and selections in a retrieval-intensive environment
- Sun, Ling, et al.
- 1993
(Show Context)
Citation Context ...ribution, this provides an accurate and compact approximation; however, since the shape of the distribution is usually not known beforehand, this is often not the case. To overcome this in exibility, =-=[28, 4]-=- use a general polynomial function and apply least-squares tting to choose its coe cients. [4] additionally uses query feedback; hereby, the approximation is able to adapt to changes in the value dist... |

22 | Aqua: System and techniques for approximate query answering
- Gibbons, Poosala, et al.
- 1998
(Show Context)
Citation Context ...ult sizes for the query at hand (for survey material see [20, 1, 24, 2]). From a generalized perspective, all these approaches can be viewed as constructing an approximate representation, or synopsis =-=[11, 9]-=-, of the data for the purpose of estimation (or even giving approximative query answers, which isnot considered in this paper, however). With modern OLAP tools and other forms of decision-support quer... |

21 |
E cient mid-query re-optimization of sub-optimal query execution plans
- Kabra, DeWitt
(Show Context)
Citation Context ... our knowledge, have so far been restricted to parametric techniques. 3 Architectural Assumptions and Notation 3.1 Feedback-driven Architecture Following the earlier proposals by [4] and, especially, =-=[18]-=- for adaptive selectivity estimation and dynamic re-optimization of query execution plans, we assume that the database system monitors the sizes (i.e., cardinalities in the sense of bags) of intermedi... |

19 |
Curve and surface fitting with splines, Monographs on Numerical Analysis
- Dierckx
- 1993
(Show Context)
Citation Context ... attribute value and a continuous parametric representation with m buckets (m <n, and often even m n) and k parameters to be tted for each bucket, this leads to the so-called \knot placement problem&q=-=uot; [6]-=- which has been intensively studied in numerical mathematics and is known to have intractable complexity in full generality. Our restriction to linear splines and the relaxation to allow discontinuiti... |

13 |
An evaluation of sampling-based size estimation methods for selections in database systems
- Ling, Sun
- 1995
(Show Context)
Citation Context ...ore accurate estimations for frequently queried values or value ranges. Sampling. These techniques compute their estimates by collecting and processing random samples of the data. Sampling techniques =-=[12, 3, 7, 19]-=- offer high accuracy and probabilistic guarantees on the quality of the estimation. However, since the sampling itself is typically carried out at the time of the approximation, the resulting overhead... |

1 |
Bifocal sampling for skew-resistant join-size estimation
- Ganguly, Gibbons, et al.
- 1996
(Show Context)
Citation Context ...ore accurate estimations for frequently queried values or value ranges. Sampling. These techniques compute their estimates by collecting and processing random samples of the data. Sampling techniques =-=[12, 3, 7, 19]-=- offer high accuracy and probabilistic guarantees on the quality of the estimation. However, since the sampling itself is typically carried out at the time of the approximation, the resulting overhead... |

1 |
Univeratility ofSerial Histograms
- Ioannidis
- 1993
(Show Context)
Citation Context ...Oracle, Microsoft SQL Server, Sybase and Teradata. The histogram accuracy depends on the type of histogram used: While V-optimal(F,F) histograms have been proven optimal for equi-joins and selections =-=[15, 13, 16]-=-, they require a list of all attribute values in a bucket, which is impractical in real systems. A good compromise between accuracy and practicability is made by MaxDi (V,A) histograms [24] orV-optima... |

1 |
Statistical Prole Estimation in Databbase Systems
- Mannino, Chu, et al.
- 1988
(Show Context)
Citation Context |

1 |
LEDA: Library of E cient Data types and Algorithms. available via ftp:mpi-sb.mpg.de
- Mehlhorn, Naher, et al.
- 1997
(Show Context)
Citation Context ...ini2 tially, and m upon termination), and executes 4 different types of operations: 1. Removing an item from the priority queue Q. We use an implementation of priority queues based on Fibonacci heaps =-=[22]-=-, allowing the removal of an item in a queue of size n in O(log 2 n) time. 2. Merging two buckets, requiring constant time. 3. Calculating the error resulting from a merge, requiring constant time. 4.... |

1 |
Numerical Receipes in C, The Art of Scienti c Computing
- Press, Teukolsky, et al.
- 1996
(Show Context)
Citation Context ...ing (see Subsection 4.2). Using a linear function f(x) = a1 x + a0 to approximate the frequencies for the values x in a bucket leads to an improvement in accuracy, depending on the linear correlation =-=[26]-=- of the data within a bucket. P 1 high,1 First, we de ne v [low;high) := high,low l=low vl as the average attribute value within [vlow;vhigh); analogously, we de ne the average frequency f [low;high) ... |

1 |
Statistical Profile Estimation in Databbase Systems
- Mannino, Chu, et al.
- 1988
(Show Context)
Citation Context ...als of a given maximum degree) , and query-specific sampling that take samples from the database to statistically estimate the intermediate result sizes for the query at hand (for survey material see =-=[20, 1, 24, 2]-=-). From a generalized perspective, all these approaches can be viewed as constructing an approximate representation, or synopsis [11, 9], of the data for the purpose of estimation (or even giving appr... |