Results 1-10 of 30
Online Aggregation
, 1997
Cited by 311 (44 self)
Aggregation in traditional database systems is performed in batch mode: a query is submitted, the system processes a large volume of data over a long period of time, and, eventually, the final answer is returned. This archaic approach is frustrating to users and has been abandoned in most other areas of computing. In this paper we propose a new online aggregation interface that permits users to both observe the progress of their aggregation queries and control execution on the fly. After outlining usability and performance requirements for a system supporting online aggregation, we present a suite of techniques that extend a database system to meet these requirements. These include methods for returning the output in random order, for providing control over the relative rate at which different aggregates are computed, and for computing running confidence intervals. Finally, we report on an initial implementation of online aggregation in POSTGRES.
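As a rough illustration of the "running confidence interval" idea in the abstract above, the following minimal sketch scans values in random order and periodically reports a running average with an interval half-width. It assumes a plain normal-approximation interval with no finite-population correction; the function name `running_avg_with_ci` and its parameters are hypothetical, and this is not necessarily the estimator used in the paper.

```python
import math
import random
from statistics import NormalDist


def running_avg_with_ci(values, confidence=0.95, report_every=1000, seed=0):
    """Yield (rows_seen, running_mean, half_width) while scanning in random order.

    Assumption: a large-sample normal approximation, chosen for illustration.
    """
    rng = random.Random(seed)
    order = list(values)
    rng.shuffle(order)  # random-order delivery makes every prefix a random sample
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    n, mean, m2 = 0, 0.0, 0.0  # Welford accumulators for mean and variance
    for x in order:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
        if n % report_every == 0 or n == len(order):
            variance = m2 / (n - 1) if n > 1 else 0.0
            yield n, mean, z * math.sqrt(variance / n)


if __name__ == "__main__":
    for rows, avg, half_width in running_avg_with_ci(range(10000)):
        print(f"{rows:6d} rows: AVG = {avg:.1f} +/- {half_width:.1f}")
```

A user watching this output could stop the scan as soon as the interval is narrow enough for their purposes, which is the kind of control the abstract describes.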
Predicate Migration: Optimizing Queries with Expensive Predicates
, 1993
Cited by 142 (7 self)
The traditional focus of relational query optimization schemes has been on the choice of join methods and join orders. Restrictions have typically been handled in query optimizers by "predicate pushdown" rules, which apply restrictions in some random order before as many joins as possible. These rules work under the assumption that restriction is essentially a zero-time operation. However, today's extensible and object-oriented database systems allow users to define time-consuming functions, which may be used in a query's restriction and join predicates. Furthermore, SQL has long supported subquery predicates, which may be arbitrarily time-consuming to check. Thus restrictions should not be considered zero-time operations, and the model of query optimization must be enhanced. In this paper we develop a theory for moving expensive predicates in a query plan so that the total cost of the plan (including the costs of both joins and restrictions) is minimal. We present an algorithm to implement the theory, as well as results of our implementation in POSTGRES. Our experience with the newly enhanced POSTGRES query optimizer demonstrates that correctly optimizing queries with expensive predicates often produces plans that are orders of magnitude faster than plans generated by a traditional query optimizer. The additional complexity of considering expensive predicates during optimization is found to be manageably small.
Random Sampling for Histogram Construction: How much is enough?
, 1998
Cited by 106 (11 self)
Random sampling is a standard technique for constructing (approximate) histograms for query optimization. However, any real implementation in commercial products requires solving the hard problem of determining "How much sampling is enough?" We address this critical question in the context of equi-height histograms used in many commercial products, including Microsoft SQL Server. We introduce a conservative error metric capturing the intuition that for an approximate histogram to have low error, the error must be small in all regions of the histogram. We then present a result establishing an optimal bound on the amount of sampling required for pre-specified error bounds. We also describe an adaptive page sampling algorithm which achieves greater efficiency by using all values in a sampled page but adjusts the amount of sampling depending on clustering of values in pages. Next, we establish that the problem of estimating the number of distinct values is provably difficult, but propose ...
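To make the setting of the abstract above concrete, here is a minimal sketch of building approximate equi-height bucket boundaries from a uniform row-level sample. The function name and the sample size are illustrative assumptions; the abstract's sampling bound and its adaptive page-sampling algorithm are not reproduced here.

```python
import random


def equi_height_boundaries(values, num_buckets, sample_size, seed=0):
    """Approximate equi-height bucket boundaries from a uniform random sample.

    Assumption: simple row-level sampling without replacement, for illustration.
    """
    rng = random.Random(seed)
    sample = sorted(rng.sample(values, sample_size))
    # Each boundary is the sample value at the i/num_buckets quantile.
    return [sample[(i * sample_size) // num_buckets] for i in range(1, num_buckets)]


if __name__ == "__main__":
    column = list(range(100000))
    print(equi_height_boundaries(column, num_buckets=4, sample_size=2000))
```

How large `sample_size` must be for a given error tolerance is exactly the question the paper addresses; the value used here is arbitrary.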
Optimization techniques for queries with expensive methods
 ACM Transactions on Database Systems (TODS)
, 1998
Cited by 56 (3 self)
Object-Relational database management systems allow knowledgeable users to define new data types, as well as new methods (operators) for the types. This flexibility produces an attendant complexity, which must be handled in new ways for an Object-Relational database management system to be efficient. In this paper we study techniques for optimizing queries that contain time-consuming methods. The focus of traditional query optimizers has been on the choice of join methods and orders; selections have been handled by "pushdown" rules. These rules apply selections in an arbitrary order before as many joins as possible, using the assumption that selection takes no time. However, users of Object-Relational systems can embed complex methods in selections. Thus selections may take significant amounts of time, and the query optimization model must be enhanced. In this paper, we carefully define a query cost framework that incorporates both selectivity and cost estimates for selections. We develop an algorithm called Predicate Migration, and prove that it produces optimal plans for queries with expensive methods. We then describe our implementation of Predicate Migration in the commercial Object-Relational database management system Illustra, and discuss practical issues that affect our earlier assumptions. We compare Predicate Migration to a variety of simpler optimization techniques, and demonstrate that Predicate Migration is the best general solution to date. The alternative techniques we present may be useful for constrained workloads.
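The cost framework mentioned in the abstract above combines a selectivity and a per-tuple cost for each selection. As a minimal sketch of why ordering matters, the code below orders independent selections on a single input by the rank (selectivity - 1) / cost, a commonly cited rule for this restricted case. The abstract does not spell out this metric, so treat it as an assumption; the predicate names and numbers are hypothetical, and the full Predicate Migration algorithm (which also moves predicates across joins) is not shown.

```python
from itertools import permutations


def expected_cost(preds):
    """Expected per-tuple cost of applying (name, selectivity, cost) predicates in order."""
    total, surviving = 0.0, 1.0
    for _name, selectivity, cost in preds:
        total += surviving * cost   # only surviving tuples pay for this predicate
        surviving *= selectivity    # fraction of tuples that pass on
    return total


def order_by_rank(preds):
    """Sort predicates by ascending rank = (selectivity - 1) / cost."""
    return sorted(preds, key=lambda p: (p[1] - 1.0) / p[2])


if __name__ == "__main__":
    # Hypothetical predicates: (name, selectivity, cost per tuple).
    preds = [
        ("expensive_selective", 0.1, 100.0),
        ("cheap_unselective", 0.9, 1.0),
        ("moderate", 0.5, 10.0),
    ]
    ranked = order_by_rank(preds)
    best = min(permutations(preds), key=expected_cost)
    print([p[0] for p in ranked], expected_cost(ranked), expected_cost(best))
```

Comparing against a brute-force search over all orderings is only feasible for a handful of predicates; it is included here as a sanity check on the sort.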
Least expected cost query optimization: An exercise in utility
 In Proceedings of the ACM SIGMOD/SIGACT Conference on Principles of Database Systems (PODS)
, 1999
Cited by 36 (1 self)
We identify two unreasonable, though standard, assumptions made by database query optimizers that can adversely affect the quality of the chosen evaluation plans. One assumption is that it is enough to optimize for the expected case, that is, the case where various parameters (like available memory) take on their expected value. The other assumption is that the parameters are constant throughout the execution of the query. We present an algorithm based on the "System R"-style query optimization algorithm that does not rely on these assumptions. The algorithm we present chooses the plan of the least expected cost instead of the plan of least cost given some fixed value of the parameters. In execution environments that exhibit a high degree of variability, our techniques should result in better performance.
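The distinction drawn in the abstract above, between the plan of least expected cost and the plan of least cost at the expected parameter value, can be seen in a small sketch. The two plans, their cost functions, and the memory distribution below are all hypothetical, and the code only compares finished plans rather than implementing the paper's System R-style dynamic program.

```python
def least_expected_cost_plan(plans, distribution):
    """Pick the plan minimizing expected cost over a discrete parameter distribution.

    plans: name -> cost function of the parameter (e.g. available memory)
    distribution: list of (parameter_value, probability) pairs
    """
    def expected(cost_fn):
        return sum(prob * cost_fn(value) for value, prob in distribution)
    return min(plans, key=lambda name: expected(plans[name]))


def least_cost_at_expected_value(plans, distribution):
    """Pick the plan that is cheapest when the parameter equals its mean."""
    mean_value = sum(prob * value for value, prob in distribution)
    return min(plans, key=lambda name: plans[name](mean_value))


if __name__ == "__main__":
    # Hypothetical cost models: the hash join is cheap only with enough memory.
    plans = {
        "hash_join": lambda mem: 100.0 if mem >= 50 else 1000.0,
        "sort_merge": lambda mem: 300.0,
    }
    memory = [(10, 0.5), (110, 0.5)]  # mean is 60, but memory is never actually 60
    print(least_cost_at_expected_value(plans, memory))
    print(least_expected_cost_plan(plans, memory))
```

With these made-up numbers the two criteria disagree: optimizing at the mean memory picks `hash_join`, while its expected cost (550) is higher than that of `sort_merge` (300).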
Bifocal Sampling for Skew-Resistant Join Size Estimation
 In Proceedings of the 1996 ACM SIGMOD Intl. Conf. on Management of Data
, 1996
Cited by 32 (6 self)
This paper introduces bifocal sampling, a new technique for estimating the size of an equi-join of two relations. Bifocal sampling classifies tuples in each relation into two groups, sparse and dense, based on the number of tuples with the same join value. Distinct estimation procedures are employed that focus on various combinations for joining tuples (e.g., for estimating the number of joining tuples that are dense in both relations). This combination of estimation procedures overcomes some well-known problems in previous schemes, enabling good estimates with no a priori knowledge about the data distribution. The estimate obtained by the bifocal sampling algorithm is proven to lie with high probability within a small constant factor of the actual join size, regardless of the skew, as long as the join size is $\Omega(n \lg n)$, for relations consisting of $n$ tuples. The algorithm requires a sample of size at most $O(\sqrt{n} \lg n)$. By contrast, previous algorithms using a sample of similar size may require the join size to be $\Omega(n\sqrt{n})$ to guarantee an accurate estimate. Experimental results support the theoretical claims and show that bifocal sampling is practical and effective.
Cost-Based Optimization for Magic: Algebra and Implementation
 In Proc. of ACM SIGMOD
, 1996
Cited by 24 (1 self)
Magic sets rewriting is a well-known optimization heuristic for complex decision-support queries. There can be many variants of this rewriting even for a single query, which differ greatly in execution performance. We propose cost-based techniques for selecting an efficient variant from the many choices. Our first contribution is a practical scheme that models magic sets rewriting as a special join method that can be added to any cost-based query optimizer. We derive cost formulas that allow an optimizer to choose the best variant of the rewriting and to decide whether it is beneficial. The order of complexity of the optimization process is preserved by limiting the search space in a reasonable manner. We have implemented this technique in IBM's DB2 C/S V2 database system. Our performance measurements demonstrate that the cost-based magic optimization technique performs well, and that without it, several poor decisions could be made. Our second contribution is a formal algebraic model of ...
AQUA: System and techniques for approximate query answering
, 1998
Cited by 22 (5 self)
In large data recording and warehousing environments, it is often advantageous to provide fast, approximate answers to queries. The goal is to provide an estimated response in orders of magnitude less time than the time to compute an exact answer, by avoiding or minimizing the number of accesses to the base data. This paper presents the Approximate QUery Answering (AQUA) System, for fast, highly accurate approximate answers to queries. Aqua provides approximate answers using small, precomputed synopses (samples, counts, etc.) of the underlying base data. An important feature of Aqua is that it provides accuracy guarantees without any a priori assumptions on either the data distribution, the order in which the base data is loaded, or the layout of the data on the disks. Currently, the system provides fast approximate answers for queries with selects, aggregates, group bys and/or joins (especially, the multiway foreign key joins that are popular in OLAP). We present several new techniques for improving the accuracy of approximate query answers for this class of queries. We show how join sampling can significantly improve the approximation quality. We describe how biased sampling can be used to overcome the problem of group size disparities
Selectivity Estimation for XML Twigs
, 2004
Cited by 20 (2 self)
Twig queries represent the building blocks of declarative query languages over XML data. A twig query describes a complex traversal of the document graph and generates a set of element tuples based on the intertwined evaluation (i.e., join) of multiple path expressions. Estimating the result cardinality of twig queries or, equivalently, the number of tuples in such a structural (path-based) join, is a fundamental problem that arises in the optimization of declarative queries over XML. It is crucial, therefore, to develop concise synopsis structures that summarize the document graph and enable such selectivity estimates within the time and space constraints of the optimizer. In this paper, we propose novel summarization and estimation techniques for estimating the selectivity of twig queries with complex XPath expressions over tree-structured data. Our approach is based on the XSKETCH model, augmented with new types of distribution information for capturing complex correlation patterns across structural joins. Briefly, the key idea is to represent joins as points in a multidimensional space of path counts that capture aggregate information on the contents of the resulting element tuples. We develop a systematic framework that combines distribution information with appropriate statistical assumptions in order to provide selectivity estimates for twig queries over concise XSKETCH synopses and we describe an efficient algorithm for constructing an accurate summary for a given space budget. Implementation results with both synthetic and real-life data sets verify the effectiveness of our approach and demonstrate its benefits over earlier techniques.