## How to Avoid Building DataBlades That Know the Value of Everything and the Cost of Nothing (1999)


Venue: Proc. of SSDBM

Citations: 12 - 0 self

### BibTeX

@INPROCEEDINGS{Aoki99howto,

author = {Paul M. Aoki},

title = {How to Avoid Building DataBlades That Know the Value of Everything and the Cost of Nothing},

booktitle = {Proc. of SSDBM},

year = {1999},

pages = {122--133}

}

### Abstract

The object-relational database management system (ORDBMS) offers many potential benefits for scientific, multimedia and financial applications. However, work remains in the integration of domain-specific class libraries (data cartridges, extenders, DataBlades®) into ORDBMS query processing. A major problem is that the standard mechanisms for query selectivity estimation, taken from relational database systems, rely on properties specific to the standard data types; creation of new mechanisms remains extremely difficult because the software interfaces provided by vendors are relatively low-level. In this paper, we discuss extensions of the generalized search tree, or GiST, to support a higher-level but less type-specific approach. Specifically, we discuss the computation of selectivity estimates with confidence intervals using a variety of index-based approaches and present results from an experimental comparison of these methods with several estimators from the literature.
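To make "selectivity estimates with confidence intervals" concrete, the sketch below shows one simple baseline of the kind the paper compares against: uniform random sampling with a conservative (Hoeffding) interval. It is an illustration only, not the paper's index-based method; the function names, the in-memory `records` list, and the default `delta` are all assumptions made for this example.

```python
import math
import random


def hoeffding_half_width(n, delta=0.05):
    """Half-width eps with P(|p_hat - p| >= eps) <= delta for n i.i.d. samples in [0, 1]."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie in (0, 1)")
    # Two-sided Hoeffding bound: 2 * exp(-2 * n * eps^2) = delta, solved for eps.
    return math.sqrt(math.log(2.0 / delta) / (2.0 * n))


def estimate_selectivity(records, predicate, n, delta=0.05, seed=0):
    """Estimate the fraction of records satisfying predicate from n samples.

    Returns (p_hat, low, high), where [low, high] is a conservative
    (1 - delta) confidence interval clipped to [0, 1].
    """
    rng = random.Random(seed)  # hypothetical fixed seed, for repeatability only
    # Sample with replacement so that the draws are independent.
    sample = [rng.choice(records) for _ in range(n)]
    p_hat = sum(1 for r in sample if predicate(r)) / n
    eps = hoeffding_half_width(n, delta)
    return p_hat, max(0.0, p_hat - eps), min(1.0, p_hat + eps)


if __name__ == "__main__":
    data = list(range(1000))
    p_hat, low, high = estimate_selectivity(data, lambda x: x < 100, n=500)
    print(f"estimate={p_hat:.3f}, interval=[{low:.3f}, {high:.3f}]")
```

The half-width depends only on the sample size and `delta`, not on the data, which is why such intervals are called conservative: they are valid for small samples but wider than intervals based on the central limit theorem.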

### Citations

2355 | R-Trees: A Dynamic Index Structure for Spatial Searching: Memorandum No
- Guttman, Stonebraker
- 1983
Citation Context ...r, ptr. The subtrees recursively partition the data records. However, they do not necessarily partition the data space. GiST can therefore model unordered, non-space-partitioning trees (e.g., R-trees [19]) as well as ordered, space-partitioning trees (e.g., B+-trees). The original GiST framework consists of a set of common template methods provided by GiST and a set of extension methods provided by ...

1551 | Probability inequalities of Sums of Bounded Random Variables
- Hoeffding
- 1963
Citation Context ...ious subsection). The main difference is that our decision process must somehow model the expected benefit of sampling. A simple strategy is to base this decision on conservative confidence intervals [HOEF63]. As traversal proceeds, we compare the most recent decrease in confidence interval width to an approximation of the corresponding decrease for a conservative sampling estimator. (Put another way, we ...

939 | The Art of Computer Systems Performance Analysis - Jain - 1991 |

593 | The ubiquitous B-Tree
- Comer
- 1979
Citation Context ...y do not necessarily partition the data space. GiST can therefore model unordered, non-space-partitioning trees (e.g., R-trees [GUTT84]) as well as ordered, space-partitioning trees (e.g., B+-trees [COME79]). The original GiST framework consists of a set of common template methods provided by GiST and a set of extension methods provided by the user. The template methods generally correspond to the funct...

396 | Multivariate Density Estimation
- Scott
- 1992
Citation Context ...ata increases. We expect an effect like this due to the “curse of dimensionality,” and analogous degradations have been widely documented elsewhere in the statistics and computer science literature [41]; the index-based technique works better with lower-rank projections of the same data set. Third, random sampling performs quite poorly unless the selectivity is large (i.e., >> 1%). This is due to a c...

326 | Online Aggregation
- Hellerstein, Haas, et al.
- 1997
Citation Context ...into these extensions rather than writing new ones. For both heap and index sampling, we implemented a variety of running interval estimators for the mean. These estimators were based on conservative [HELL97a], central limit theorem (CLT) [HAAS97], and non-parametric BCa bootstrap confidence intervals [DICI96]. Conservative techniques are more appropriate than those based on CLTs for the sample sizes unde...

270 | Efficient Color Histogram Indexing for Quadratic Form Distance Functions
- Hafner, Sawhney, et al.
- 1995
Citation Context ...ibrary Project's Blobworld system (D = 20) [CARS97]. The 20 dimensions result from applying the singular value decomposition to 256-bin histogram values in the CIE LUV color space and then truncating [HAFN95]. This represents high-dimensional image database workloads. For each real data set, we also generated uniform random data sets of the same dimensionality and cardinality. Uniform data has two specifi...

267 | Sampling techniques. 3rd ed - COCHRAN - 1977 |

226 | On Packing R-trees
- Kamel, Faloutsos
- 1993
Citation Context ...hich represented a class of related algorithms: insertion-load using randomly-ordered records, 9 insertion-load using (Hilbert-)clustered records [JAGA90], bulk-load using (Hilbert-)clustered records [KAME93], bulk-load using (STR-)tiled records [LEUT97]. Estimators. The traversal and aggregation interfaces of [AOKI98a] allow us to implement estimation using prioritized traversal, breadth-first or level-a...

220 | Wavelet-based histograms for selectivity estimation
- Matias, Vitter, et al.
- 1998
Citation Context ...partitioning multidimensional histograms [40, Ch. 9]. (When details and implementations become available, comparisons with more parsimonious non-parametric methods such as wavelet-encoded histograms [35] should be instructive.) 7 Random-centered queries establish the base location (e.g., center point) of a query shape from a probability distribution defined on the underlying space. Object-centered qu...

219 | Generalized search trees for database systems
- Hellerstein, Naughton, et al.
- 1995
Citation Context ... widely studied, but rarely in terms of a general framework for extensible database management systems. We describe a set of approaches based on a modification of the generalized search tree, or GiST [22], which supports flexible tree traversal [5]. Each approach uses approximate cardinality metadata, stored in the index nodes, to produce incrementally-refined selectivity estimates with confidence int...

202 | Linear Clustering of Objects with Multiple Attributes
- Jagadish
- 1990
Citation Context ...ss of an index. We used a variety of loading algorithms, each of which represented a class of related algorithms: insertion-load using randomly-ordered records, insertion-load using Hilbert-clustered [29] records, bulk-load using Hilbert-clustered records, bulk-load using STR-clustered [33] records. Estimators. The traversal and aggregation interfaces allow us to implement estimation using prioritized ...

197 | Shoring up persistent applications
- Carey, DeWitt, et al.
- 1994
Citation Context ...s.berkeley.edu/. libgist 1.0 implements primary access methods (data records stored in the leaf nodes) on top of a simple storage manager that can be replaced by the SHORE recoverable storage manager [CARE94] at compile-time. Section 3, the worst-case effect of pseudo-ranking on our interval estimates has easily-computed bounds.) Loading algorithm. Loading has a strong effect on the effectiveness of an in...

191 | Equi-depth histograms for estimating selectivity factors for multi-dimensional queries
- Dewitt
- 1988
Citation Context ... limited. • Histograms. Now well-established [40], conventional histograms rely on space-partitioning schemes. Various forms of indexed main-memory multidimensional histograms have also been proposed [36]. Secondary memory histograms and hierarchical estimation are not considered in this work; neither are the problems of space-partitioning. • Index-assisted statistics. Several researchers have noted ...

162 | Beyond Uniformity and Independence : Analysis of R-trees Using the Concept of Fractal Dimension
- Faloutsos, Kamel
- 1995
Citation Context ...o random-centered and object-centered window queries [39]. 7 For random-centered queries, we implemented and compared estimators based on the uniformity assumption, the Hausdorff fractal dimension D0 [15] and density (expected stabbing number) [43]. For object-centered queries, we also used an estimator based on the correlation fractal dimension D2 [8]. We chose not to compare our techniques with non...

151 | A model for the prediction of R-tree performance
- Theodoridis, Sellis
- 1996
Citation Context ...queries [39]. 7 For random-centered queries, we implemented and compared estimators based on the uniformity assumption, the Hausdorff fractal dimension D0 [15] and density (expected stabbing number) [43]. For object-centered queries, we also used an estimator based on the correlation fractal dimension D2 [8]. We chose not to compare our techniques with nonparametric estimators based on space-partiti...

145 | Optimal histograms with quality guarantees
- Jagadish, Koudas, et al.
- 1998
Citation Context ... histograms rely on space-partitioning schemes. When they can be applied, they constitute a very attractive option because of the many recent results on the generation of high-quality histograms (e.g., [JAGA98]). Various forms of indexed main-memory multidimensional histograms have also been proposed (e.g., [MURA88, GROS93]). Secondary memory histograms and hierarchical estimation are not considered in this...

126 | Estimating the selectivity of spatial queries using the ‘correlation’ fractal dimension
- Belussi, Faloutsos
- 1995
Citation Context ...assumption, the Hausdorff fractal dimension D0 [15] and density (expected stabbing number) [43]. For object-centered queries, we also used an estimator based on the correlation fractal dimension D2 [8]. We chose not to compare our techniques with nonparametric estimators based on space-partitioning for a simple reason: these techniques require summary data that is exponential in the embedding dimen...

125 | Better bootstrap confidence intervals
- Efron
- 1987
Citation Context ... were based on conservative [23], central limit theorem (CLT) [20], and non-parametric BCa bootstrap confidence intervals [14]. Conservative techniques are more appropriate than those based on CLTs for the sample sizes under study but provide weaker bounds; in terms of useful sample sizes, we have empirically observed that t...

121 | STR: A simple and efficient algorithm for R-tree packing
- Leutenegger, Lopez, et al.
- 1997
Citation Context ...lass of related algorithms: insertion-load using randomly-ordered records, insertion-load using Hilbert-clustered [29] records, bulk-load using Hilbert-clustered records, bulk-load using STR-clustered [33] records. Estimators. The traversal and aggregation interfaces allow us to implement estimation using prioritized traversal, breadth-first traversal, and A/R index sampling in about 500 lines of C++. ...

111 | Estimating the efficiency of backtrack programs - Knuth - 1975 |

110 | Region-Based Image Querying
- Carson, Belongie, et al.
- 1997
Citation Context ...4) [21]. GTSPP is a bathythermograph (ocean temperature) database and represents Earth science workloads. • Image feature vectors from the Berkeley Digital Library Project’s Blobworld system (D = 20) [10]. The 20 dimensions result from applying the singular value decomposition to 256-bin histogram values in the CIE LUV color space and then truncating. This represents multimedia workloads. For each rea...

110 | Adaptive selectivity estimation using query feedback
- Chen, Roussopoulos
- 1994
Citation Context ...he extender has already created and optimized.) Second, space-partitioning schemes require storage exponential in D. • Model-fitting techniques. Methods based on regression, wavelets and neural nets [11, 32, 35] have been used to summarize attribute frequency distributions. The proposed techniques have some additional disadvantages. First, like the parametric estimators discussed in this paper, they are all ...

109 | Random Sampling from Databases
- Olken
- 1993
Citation Context ...histogram” from a tree index is to augment every non-leaf node entry with a cardinality count (i.e., the total number of leaf records in the specified subtree). Such counts are commonly called ranks [38]. Inserting or deleting a record results in node modifications from leaf to root because any such update changes the cardinality of every subtree containing that record. This is generally considered t...

109 | The Sequoia 2000 storage benchmark
- Stonebraker, Frew, et al.
- 1993
Citation Context ...parate real data sets of varying embedding dimensionality, D: • Geographic coordinates from the USGS GNIS data set (D = 2) [USGS95]. This is a "national" version of the Sequoia 2000 storage benchmark [STON93] and represents GIS workloads. • Spatial coordinates plus time from the NOAA GTSPP data set (D = 4) [HAMI94]. GTSPP is a bathythermograph (ocean temperature) database and represents common Earth scien...

84 | The New Jersey data reduction report - Barbara, DuMouchel, et al. - 1997 |

81 | Towards an analysis of indexing schemes
- Hellerstein, Papadimitriou, et al.
- 1997
Citation Context ...f an arbitrary data set at an arbitrary resolution. That is, the index recursively divides the indexed data into clusters; these clusters support efficient search, assuming that the data is indexable [24] and the index design is effective. Efficient indexed search over a given workload means that we examine a minimal number of extraneous objects over that workload. Second, in the process of implement...

80 | Towards an analysis of range query performance in spatial data structures
- Pagel, Six, et al.
- 1993
Citation Context ...parison As our "benchmarks," we selected several parametric point estimators from the literature on spatial databases. Different estimators apply to random-centered and object-centered window queries [PAGE93]. 10 For random-centered queries, we implemented and compared estimators based on the uniformity assumption, the Hausdorff fractal dimension D0 [FALO94] and density (expected stabbing number) [THEO96...

79 | Statistical profile estimation in database systems
- Mannino, Chu, et al.
- 1988
Citation Context ...literature. Specifically, we discuss non-parametric selectivity estimation techniques (which relate to tree traversal), estimation using random sampling, and tree condensation. We refer the reader to [MANN88] for background information about selectivity estimation; the references given here are generally incremental with respect to that survey. Additional references are given in [AOKI98b]. Extensible esti...

71 | Histogram-based estimation techniques in databases
- Poosala
- 1997
Citation Context ... they are all point estimators and provide no interval bounds. Second, with a few exceptions, the ability to perform dynamic updates of the summary data is limited. • Histograms. Now well-established [40], conventional histograms rely on space-partitioning schemes. Various forms of indexed main-memory multidimensional histograms have also been proposed [36]. Secondary memory histograms and hierarchica...

66 | Widmayer: Towards an Analysis of Range Query Performance
- Pagel, Six, et al.
- 1993
Citation Context ...arison As our “benchmarks,” we selected several parametric point estimators from the literature on spatial databases. Different estimators apply to random-centered and object-centered window queries [39]. 7 For random-centered queries, we implemented and compared estimators based on the uniformity assumption, the Hausdorff fractal dimension D0 [15] and density (expected stabbing number) [43]. For ob...

63 | The Art of Computer Programming. Vol. III, Sorting and Searching - Knuth - 1973 |

61 | Processing aggregate relational queries with hard time constraints
- Hou, Ozsoyoglu, et al.
- 1989
Citation Context ...earch structures than they are to write non-trivial selectivity estimators. From an algorithmic viewpoint, the theme of this work (which is closely related to work on sampling-based estimation, e.g., [26]) is the “best effort” use of an explicit, limited I/O budget in the creation of interval estimates. It contains three main contributions. First, we provide a broad discussion of the “GiST as histo...

57 | Simple random sampling from relational databases - Olken, Rotem - 1986 |

39 | Dynamic query optimization in Rdb/VMS
- Antoshenkov
- 1993
Citation Context ...uncertainty regions (e.g., leaf nodes), while breadth-first may have visited the children of nodes that were fully subsumed by the query (and therefore had low uncertainty). The split-level heuristic [3], described in more detail in Appendix B of the full paper, also uses PRIORITY = node depth. By stopping descent when the query predicate is CONSISTENT with more than one node entry in the current nod...

37 | Large-sample and deterministic confidence intervals for online aggregation
- Haas
- 1997
Citation Context ... were based on conservative [23], central limit theorem (CLT) [20], and non-parametric BCa bootstrap confidence intervals [14]. Conservative techniques are more appropriate than those based on CLTs for the sample sizes under study but provide weaker bounds; in term...

28 | Random Sampling from Pseudo-Ranked B+ Trees
- Antoshenkov
- 1992
Citation Context ...trees are one of the example applications enabled by the GiST extensions of [AOKI98a]. Here, we define pseudo-ranking and explain its relevant properties. Much of this discussion and notation follows [ANTO92]. An easy and intuitive way to construct a "hierarchical histogram" from a tree index is to augment every non-leaf node entry with a cardinality count (i.e., the total number of leaf records in the sp...

26 | Generalizing “search” in generalized search trees
- Aoki
- 1998
Citation Context ...eral framework for extensible database management systems. We describe a set of approaches based on a modification of the generalized search tree, or GiST [22], which supports flexible tree traversal [5]. Each approach uses approximate cardinality metadata, stored in the index nodes, to produce incrementally-refined selectivity estimates with confidence intervals. Although our approaches apply classi...

25 | Heuristic sampling: a method for predicting the performance of tree searching programs - Chen - 1992 |

25 | Random Sampling from B+ Trees
- Olken, Rotem
- 1989
Citation Context ...ta warehouses, can reduce this cost). A lower-cost alternative is to compute upper bounds on the corresponding subtree's cardinality using a node's height within the tree and simple fanout statistics [OLKE89]. Such bounds may be imprecise if the tree is not full [ROSE93]. Pseudo-ranking balances the cost of rank maintenance against bound imprecision. The amortized space and time costs of pseudo-ranking ar...

22 | Dynamic Maintenance of Data Distribution for Selectivity Estimation - Whang, Kim, et al. - 1994 |

17 | Selectivity Estimation
- Acharya, Poosala, et al.
- 1999
Citation Context ...f geographic, Earth science and multimedia data sets) between our techniques and many of the proposed parametric multidimensional estimators. This is the only comparative study (concurrent work aside [1]) that compares these estimators to anything except the trivial estimator (based on the uniformity assumption). The remainder of the paper is organized as follows. In Section 2, we provide a brief ove...

14 | Generalizing "Search" in Generalized Search Trees
- Aoki
- 1998
Citation Context ... framework for extensible database management systems. We describe a set of approaches based on a modification of the generalized search tree, or GiST [HELL95], which supports flexible tree traversal [AOKI98a]. Although we apply a collection of classic techniques (e.g., sampling), previous work in this area has been designed with different assumptions in mind or for different goals. As we will see, these d...

11 | A Bayesian approach to database query optimization
- Seppi, Barnes, et al.
- 1993
Citation Context ...e 1. Figure 1(b) shows an example of the prioritized traversal algorithm running to completion on the pseudo-ranked tree depicted in Figure 1(a). The nodes are visited 5 A decision-theoretic framework [42] might well be possible, but it is not immediately clear how to compute expected utility in a query optimization context. in an order corresponding to the uncertainty, u, of their corresponding parent...

10 | A Tree Based Access Method (TBSAM) for Fast Processing of Aggregate Queries - Srivastava, Lum - 1988 |

7 | Random Sampling from B
- Olken, D
- 1989
Citation Context ...ta warehouses, can reduce this cost). A lower-cost alternative is to compute upper bounds on the corresponding subtree’s cardinality using a node’s height within the tree and simple fanout statistics [37]. Such bounds may be imprecise if the tree is not full. Pseudo-ranking balances the cost of rank maintenance against bound imprecision. The amortized space and time costs of pseudo-ranking are low eno...

6 | SIAM: Statistics Information Access Method
- Ghosh
- 1986
Citation Context ...tatistics. Several researchers have noted that balanced tree structures can be viewed as a hierarchy of (approximately) equidepth histograms [3]. Others have used access methods to compute aggregate [18] or density [25] functions. This work does not generally trade off precision against cost. Tree traversal: An enormous literature exists on heuristic tree search in artificial intelligence. Much of t...

6 | AMASE: An Object-Oriented Meta-database Catalog for Accessing Multi-Mission Astrophysics Data
- Cheung, Leisawitz, et al.
- 1995
Citation Context ...ng is the ability to integrate libraries of domain-specific data types 1 into the database engine; research ORDBMS applications have already been developed in scientific areas as diverse as astronomy [CHEU95], bioinformatics [FLAN98], ocean and atmospheric sciences [FARR94, CHI97] and high-energy physics [ATHA97]. In addition, commercial type libraries for "mainstream" but scientifically important types su...

4 | Algorithms for index-assisted selectivity estimation
- Aoki
- 1998
Citation Context ...e, procedures and results. Section 6 reviews related work. We conclude in Section 7. Additional algorithmic issues, experimental results and discussions of future work are contained in the full paper [4]. 2. Background and assumptions In this section, we briefly review the concepts and assumptions that underlie our approach. First, we give an overview of the approach. We then describe the specific in...

4 | Sampling the Leaves of a Tree with Equal Probabilities
- Rosenbaum
- 1993
Citation Context ...is to compute upper bounds on the corresponding subtree's cardinality using a node's height within the tree and simple fanout statistics [OLKE89]. Such bounds may be imprecise if the tree is not full [ROSE93]. Pseudo-ranking balances the cost of rank maintenance against bound imprecision. The amortized space and time costs of pseudo-ranking are low enough that it has been incorporated in a high-performanc...