Results 1  10
of
135
Distance Browsing in Spatial Databases
, 1999
"... Two different techniques of browsing through a collection of spatial objects stored in an Rtree spatial data structure on the basis of their distances from an arbitrary spatial query object are compared. The conventional approach is one that makes use of a knearest neighbor algorithm where k is kn ..."
Abstract

Cited by 293 (19 self)
 Add to MetaCart
Two different techniques of browsing through a collection of spatial objects stored in an Rtree spatial data structure on the basis of their distances from an arbitrary spatial query object are compared. The conventional approach is one that makes use of a knearest neighbor algorithm where k is known prior to the invocation of the algorithm. Thus if m#kneighbors are needed, the knearest neighbor algorithm needs to be reinvoked for m neighbors, thereby possibly performing some redundant computations. The second approach is incremental in the sense that having obtained the k nearest neighbors, the k +1 st neighbor can be obtained without having to calculate the k +1nearest neighbors from scratch. The incremental approach finds use when processing complex queries where one of the conditions involves spatial proximity (e.g., the nearest city to Chicago with population greater than a million), in which case a query engine can make use of a pipelined strategy. A general incremental nearest neighbor algorithm is presented that is applicable to a large class of hierarchical spatial data structures. This algorithm is adapted to the Rtree and its performance is compared to an existing knearest neighbor algorithm for Rtrees [45]. Experiments show that the incremental nearest neighbor algorithm significantly outperforms the knearest neighbor algorithm for distance browsing queries in a spatial database that uses the Rtree as a spatial index. Moreover, the incremental nearest neighbor algorithm also usually outperforms the knearest neighbor algorithm when applied to the knearest neighbor problem for the Rtree, although the improvement is not nearly as large as for distance browsing queries. In fact, we prove informally that, at any step in its execution, the incremental...
Improved Histograms for Selectivity Estimation of Range Predicates
, 1996
"... Many commercial database systems maintain histograms to summarize the contents of relations and permit efficient estimation of query result sizes and access plan costs. Although several types of histograms have been proposed in the past, there has never been a systematic study of all histogram aspec ..."
Abstract

Cited by 237 (20 self)
 Add to MetaCart
Many commercial database systems maintain histograms to summarize the contents of relations and permit efficient estimation of query result sizes and access plan costs. Although several types of histograms have been proposed in the past, there has never been a systematic study of all histogram aspects, the available choices for each aspect, and the impact of such choices on histogram effectiveness. In this paper, we provide a taxonomy of histograms that captures all previously proposed histogram types and indicates many new possibilities. We introduce novel choices for several of the taxonomy dimensions, and derive new histogram types by combining choices in effective ways. We also show how sampling techniques can be used to reduce the cost of histogram construction. Finally, we present results from an empirical study of the proposed histogram types used in selectivity estimation of range predicates and identify the histogram types that have the best overall performance. 1 Introduction...
WaveletBased Histograms for Selectivity Estimation
 in SIGMOD
, 1998
"... Query optimization is an integral part of relational database management systems. One important task in query optimization is selectivity estimation, that is, given a query P , we need to estimate the fraction of records in the database that satisfy P . Many commercial database systems maintain hist ..."
Abstract

Cited by 212 (16 self)
 Add to MetaCart
Query optimization is an integral part of relational database management systems. One important task in query optimization is selectivity estimation, that is, given a query P , we need to estimate the fraction of records in the database that satisfy P . Many commercial database systems maintain histograms to approximate the frequency distribution of values in the attributes of relations. In this paper, we present a technique based upon a multiresolution wavelet decomposition for building histograms on the underlying data distributions, with applications to databases, statistics, and simulation. Histograms built on the cumulative data distributions give very good approximations with limited space usage. We give fast algorithms for constructing histograms and using them in an online fashion for selectivity estimation. Our histograms also provide quick approximate answers to OLAP queries when the exact answers are not required. Our method captures the joint distribution of multiple attri...
Selectivity Estimation Without the Attribute Value Independence Assumption
, 1997
"... The result size of a query that involves multiple attributes from the same relation depends on these attributes’joinr data distribution, i.e., the frequencies of all combinations of attribute values. To simplify the estimation of that size, most commercial systems make the artribute value independen ..."
Abstract

Cited by 198 (12 self)
 Add to MetaCart
The result size of a query that involves multiple attributes from the same relation depends on these attributes’joinr data distribution, i.e., the frequencies of all combinations of attribute values. To simplify the estimation of that size, most commercial systems make the artribute value independenceassumption and maintain statistics (typically histograms) on individual attributes only. In reality, this assumption is almost always wrong and the resulting estimations tend to be highly inaccurate. In this paper, we propose two main alternatives to effectively approximate (multidimensional) joint data distributions. (a) Using a multidimensional histogram, (b) Using the Singular Value Decomposition (SVD) technique from linear algebra. An extensive set of experiments demonstrates the advantages and disadvantages of the two approaches and the benefits of both compared to the independence assumption. 1
Beyond uniformity and independence: Analysis of rtrees using the concept of fractal dimension
 In Proc. PODS
, 1994
"... We propose the concept of fractal dimension of a set of points, in order to quantify the deviation from the uniformity distribution. Using measurements on real data sets (road intersections of U.S. counties, star coordinates from NASA’s InfraredUltraviolet Explorer etc.) we provide evidence that re ..."
Abstract

Cited by 157 (19 self)
 Add to MetaCart
We propose the concept of fractal dimension of a set of points, in order to quantify the deviation from the uniformity distribution. Using measurements on real data sets (road intersections of U.S. counties, star coordinates from NASA’s InfraredUltraviolet Explorer etc.) we provide evidence that real data indeed are skewed, and, moreover, we show that they behave as mathematical fractals, with a measurable, noninteger fract al dimension. Armed with this tool, we then show its practical use in predicting the performance of spatial access methods, and specifically of the Rtrees. We provide the jirst analysis of Rtrees for skewed distributions of points: We develop a formula that estimates the number of disk accesses for range queries, given only the fractal dimension of the point set, and its count. Experiments on real data sets show that the formula is very accurate: the relative error is usually below 5%, and it rarely exceeds 10%. We believe that the fractal dimension will help replace the uniformity and independence assumptions, allowing more accurate analysis for any spatial access method, as well as better estimates for query optimization on multiattribute queries. 1
PracticaJ selectivity estimation through adaptive sampling
 In Proc. ,4CM SIGMOD International Conf on Management of Data
, 1990
"... Recently we have proposed an adaptive, random sampling algorithm for general query size estlmatlon In earlier work we analyzed the asymptotic ef’l?clency and accuracy of the algorithm, m this paper we mvestlgate Its practlcahty as applied to selects and Jams First, we extend our previous analysis to ..."
Abstract

Cited by 154 (6 self)
 Add to MetaCart
Recently we have proposed an adaptive, random sampling algorithm for general query size estlmatlon In earlier work we analyzed the asymptotic ef’l?clency and accuracy of the algorithm, m this paper we mvestlgate Its practlcahty as applied to selects and Jams First, we extend our previous analysis to provide agmficantly improved bounds on the amount of samplmg necessary for a given level of accuracy Next, we provide “sanity bounds ” to deal with queries for which the underlying data 1s extremely skewed or the query result 1s very small Finally, we report on the performance of the estlmatlon algorithm as amplemented m a host language on a commercial relational system The results are encouraging, even with this loose couplmg between the estlmatlon algorithm and the DBMS 1
Evaluating Topk Selection Queries
 In VLDB
, 1999
"... In many applications, users specify target values for certain attributes, without requiring exact matches to these values in return. Instead, the result to such queries is typically a rank of the "top k" tuples that best match the given attribute values. In this paper, we study the advantages and li ..."
Abstract

Cited by 140 (4 self)
 Add to MetaCart
In many applications, users specify target values for certain attributes, without requiring exact matches to these values in return. Instead, the result to such queries is typically a rank of the "top k" tuples that best match the given attribute values. In this paper, we study the advantages and limitations of processing a topk query by translating it into a single range query that traditional relational DBMSs can process e#ciently. In particular, we study how to determine a range query to evaluate a topk query by exploiting the statistics available to a relational DBMS, and the impact of the quality of these statistics on the retrieval e#ciency of the resulting scheme. 1 Introduction Internet Search engines rank the objects in the results of selection queries according to how well these objects match the original selection condition. For such engines, query results are not flat sets of objects that match a given condition. Instead, query results are ranked starting ...
Balancing Histogram Optimality and Practicality for Query Result Size Estimation
, 1995
"... Many current database systems use histograms to approximate the frequency distribution of values in the attributes of relations and based on them estimate query result sizes and access plan costs. In choosing among the various histograms, one has to balance between two conflicting goals: optimality, ..."
Abstract

Cited by 134 (14 self)
 Add to MetaCart
Many current database systems use histograms to approximate the frequency distribution of values in the attributes of relations and based on them estimate query result sizes and access plan costs. In choosing among the various histograms, one has to balance between two conflicting goals: optimality, so that generated estimates have the least error, and practicality, so that histograms can be constructed and maintained efficiently. In this paper, we present both theoretical and experimental results on several issues related to this tradeoff. Our overall conclusion is that the most effective approach is to focus on the class of histograms that accurately maintain the frequencies of a few attribute values and assume the uniform distribution for the rest, and choose for each relation the histogram in that class that is optimal for a selfjoin query. 1 Introduction Query optimizers of relational database systems decide on the most efficient access plan for a given query based on a variety...
DataStreams and Histograms
, 2001
"... Histograms have been used widely to capture data distribution, to represent the data by a small number of step functions. Dynamic programming algorithms which provide optimal construction of these histograms exist, albeit running in quadratic time and linear space. In this paper we provide linear ti ..."
Abstract

Cited by 130 (8 self)
 Add to MetaCart
Histograms have been used widely to capture data distribution, to represent the data by a small number of step functions. Dynamic programming algorithms which provide optimal construction of these histograms exist, albeit running in quadratic time and linear space. In this paper we provide linear time construction of 1 + epsilon approximation of optimal histograms, running in polylogarithmic space. Our results extend to the context of datastreams, and in fact generalize to give 1 + epsilon approximation of several problems in datastreams which require partitioning the index set into intervals. The only assumptions required are that the cost of an interval is monotonic under inclusion (larger interval has larger cost) and that the cost can be computed or approximated in small space. This exhibits a nice class of problems for which we can have near optimal datastream algorithms.
Selftuning histograms: Building histograms without looking at data
, 1999
"... In this paper, we introduce selftuning histograms. Although similar in structure to traditional histograms, these histograms infer data distributions not by examining the data or a sample thereof, but by using feedback from the query execution engine about the actual selectivity of range selection ..."
Abstract

Cited by 130 (13 self)
 Add to MetaCart
In this paper, we introduce selftuning histograms. Although similar in structure to traditional histograms, these histograms infer data distributions not by examining the data or a sample thereof, but by using feedback from the query execution engine about the actual selectivity of range selection operators to progressively refine the histogram. Since the cost of building and maintaining selftuning histograms is independent of the data size, selftuning histograms provide a remarkably inexpensive way to construct histograms for large data sets with little upfront costs. Selftuning histograms are particularly attractive as an alternative to multidimensional traditional histograms that capture dependencies between attributes but are prohibitively expensive to build and maintain. In this paper, we describe the techniques for initializing and refining selftuning histograms. Our experimental results show that selftuning histograms provide a lowcost alternative to traditional multidimensional histograms with little loss of accuracy for data distributions with low to moderate skew. 1.