Online Aggregation
, 1997
Cited by 284 (45 self)
Aggregation in traditional database systems is performed in batch mode: a query is submitted, the system processes a large volume of data over a long period of time, and, eventually, the final answer is returned. This archaic approach is frustrating to users and has been abandoned in most other areas of computing. In this paper we propose a new online aggregation interface that permits users to both observe the progress of their aggregation queries and control execution on the fly. After outlining usability and performance requirements for a system supporting online aggregation, we present a suite of techniques that extend a database system to meet these requirements. These include methods for returning the output in random order, for providing control over the relative rate at which different aggregates are computed, and for computing running confidence intervals. Finally, we report on an initial implementation of online aggregation in Postgres.
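The running confidence intervals mentioned in this abstract can be illustrated with a minimal sketch. It assumes rows arrive in random order and uses a normal (central limit theorem) approximation for the interval; the function name `running_avg_with_ci` and the default `z=1.96` are illustrative choices, not the paper's estimator or interface.

```python
import math


def running_avg_with_ci(values, z=1.96):
    """Yield (rows_seen, running_mean, half_width) after each row.

    Assumption: `values` arrive in random order, so the prefix seen so far
    behaves like a random sample of the whole input. The half-width is a
    normal approximation and is only meaningful after a few dozen rows.
    """
    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations (Welford's method)
    for x in values:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
        if n > 1:
            sample_variance = m2 / (n - 1)
            half_width = z * math.sqrt(sample_variance / n)
        else:
            half_width = float("inf")  # no spread estimate from one row
        yield n, mean, half_width


if __name__ == "__main__":
    for rows_seen, mean, half_width in running_avg_with_ci([2.0, 4.0, 6.0]):
        print(rows_seen, mean, half_width)
```

A user interface could redraw the estimate and its interval after every batch of rows and let the user stop the query once the interval is narrow enough.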
Random Sampling for Histogram Construction: How much is enough?
, 1998
Cited by 91 (11 self)
Random sampling is a standard technique for constructing (approximate) histograms for query optimization. However, any real implementation in commercial products requires solving the hard problem of determining "How much sampling is enough?" We address this critical question in the context of equi-height histograms used in many commercial products, including Microsoft SQL Server. We introduce a conservative error metric capturing the intuition that for an approximate histogram to have low error, the error must be small in all regions of the histogram. We then present a result establishing an optimal bound on the amount of sampling required for pre-specified error bounds. We also describe an adaptive page sampling algorithm which achieves greater efficiency by using all values in a sampled page but adjusts the amount of sampling depending on clustering of values in pages. Next, we establish that the problem of estimating the number of distinct values is provably difficult, but propose ...
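As a rough illustration of building an approximate equi-height histogram from a random sample, the sketch below draws a uniform sample of individual values and reads bucket separators off its quantiles. It is not the paper's adaptive page sampling algorithm and does not implement its sampling bound; `equi_height_boundaries` and its parameters are hypothetical names.

```python
import random


def equi_height_boundaries(values, num_buckets, sample_size, seed=0):
    """Return num_buckets - 1 separators estimated from a uniform sample.

    Assumption: `values` is an in-memory list and the sample is drawn
    value by value (not page by page) without replacement.
    """
    rng = random.Random(seed)
    sample = sorted(rng.sample(values, sample_size))
    # Separator i is the sample value at the i/num_buckets quantile, so each
    # bucket holds roughly the same number of sampled values.
    return [sample[(i * sample_size) // num_buckets]
            for i in range(1, num_buckets)]


if __name__ == "__main__":
    data = list(range(1000))
    print(equi_height_boundaries(data, num_buckets=4, sample_size=100))
```

The smaller the sample, the further the estimated separators can drift from the true quantiles, which is the "how much is enough" question the abstract raises.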
Overcoming Limitations of Sampling for Aggregation Queries
- In ICDE
, 2001
Cited by 36 (6 self)
We study the problem of approximately answering aggregation queries using sampling. We observe that uniform sampling performs poorly when the distribution of the aggregated attribute is skewed. To address this issue, we introduce a technique called outlier-indexing. Uniform sampling is also ineffective for queries with low selectivity. We rely on weighted sampling based on workload information to overcome this shortcoming. We demonstrate that a combination of outlier-indexing with weighted sampling can be used to answer aggregation queries with significantly reduced approximation error compared to either uniform sampling or weighted sampling alone. We discuss the implementation of these techniques on Microsoft’s SQL Server, and present experimental results that demonstrate the merits of our techniques.
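The idea of handling skew by treating outliers separately can be sketched as follows: aggregate a small set of extreme values exactly, and estimate the remainder from a uniform sample. The selection rule here (the values farthest from the median) is a simplifying assumption and is not the paper's outlier-selection procedure; `estimate_sum_with_outliers` is a hypothetical name.

```python
import random
import statistics


def estimate_sum_with_outliers(values, num_outliers, sample_size, seed=0):
    """Estimate SUM(values): exact over outliers, sampled over the rest.

    Assumption: outliers are simply the `num_outliers` values farthest
    from the median, chosen for illustration only.
    """
    rng = random.Random(seed)
    median = statistics.median(values)
    ranked = sorted(values, key=lambda v: abs(v - median), reverse=True)
    outliers, rest = ranked[:num_outliers], ranked[num_outliers:]

    exact_part = sum(outliers)  # outliers are aggregated exactly
    if not rest:
        return float(exact_part)

    k = min(sample_size, len(rest))
    sample = rng.sample(rest, k)
    # Scale the sample sum up to the size of the non-outlier remainder.
    estimated_part = sum(sample) * len(rest) / k
    return exact_part + estimated_part


if __name__ == "__main__":
    skewed = [1] * 99 + [10000]
    print(estimate_sum_with_outliers(skewed, num_outliers=1, sample_size=10))
```

With the single large value handled exactly, the sampled remainder has no variance in this toy input, whereas a plain uniform sample of 10 rows would usually miss the value 10000 entirely.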
Bifocal Sampling for Skew-Resistant Join Size Estimation
- In Proceedings of the 1996 ACM SIGMOD Intl. Conf. on Management of Data
, 1996
Cited by 32 (7 self)
This paper introduces bifocal sampling, a new technique for estimating the size of an equi-join of two relations. Bifocal sampling classifies tuples in each relation into two groups, sparse and dense, based on the number of tuples with the same join value. Distinct estimation procedures are employed that focus on various combinations for joining tuples (e.g., for estimating the number of joining tuples that are dense in both relations). This combination of estimation procedures overcomes some well-known problems in previous schemes, enabling good estimates with no a priori knowledge about the data distribution. The estimate obtained by the bifocal sampling algorithm is proven to lie with high probability within a small constant factor of the actual join size, regardless of the skew, as long as the join size is Ω(n lg n), for relations consisting of n tuples. The algorithm requires a sample of size at most O(√n lg n). By contrast, previous algorithms using a sample of similar size may require the join size to be Ω(n√n) to guarantee an accurate estimate. Experimental results support the theoretical claims and show that bifocal sampling is practical and effective.
AQUA: System and Techniques for Approximate Query Answering
, 1998
Cited by 22 (5 self)
In large data recording and warehousing environments, it is often advantageous to provide fast, approximate answers to queries. The goal is to provide an estimated response in orders of magnitude less time than the time to compute an exact answer, by avoiding or minimizing the number of accesses to the base data. This paper presents the Approximate QUery Answering (AQUA) System, for fast, highly accurate approximate answers to queries. Aqua provides approximate answers using small, precomputed synopses (samples, counts, etc.) of the underlying base data. An important feature of Aqua is that it provides accuracy guarantees without any a priori assumptions on either the data distribution, the order in which the base data is loaded, or the layout of the data on the disks. Currently, the system provides fast approximate answers for queries with selects, aggregates, group bys and/or joins (especially, the multi-way foreign key joins that are popular in OLAP). We present several ne...
Aqua project white paper
, 1997
Cited by 16 (10 self)
Viswanath Poosala. In large data recording and warehousing environments, it is often advantageous to provide fast, approximate answers to queries, whenever possible. The goal is to provide an estimated response in orders of magnitude less time than the time to compute an exact answer, by avoiding or minimizing the number of accesses to the base data. This white paper describes the Approximate QUery Answering (AQUA) Project underway in the Information Sciences Research Center at Bell Labs. We present a framework for an approximate query engine that observes new data as it arrives and maintains small synopsis data structures on that data. These data structures are used to provide fast, approximate answers to a broad class of queries. We describe metrics for evaluating approximate query answers. We also present new synopsis data structures, and new techniques for approximate query answers. We report on the goals and status of the Aqua project, and plans for future work.
Selectivity Estimation for Joins Using Systematic Sampling
- In Proceedings of International Workshop on Database and Expert System Applications
, 1997
Cited by 4 (0 self)
We propose a new approach to the estimation of join selectivity. The technique, which we have called "systematic sampling", is a novel variant of the sampling-based approach. Systematic sampling works as follows: Given a relation R of N tuples, with a join attribute that can be accessed in ascending/descending order via an index, if n is the number of tuples to be sampled from R, select a tuple at random from the first k = ⌈N/n⌉ tuples of R and every kth tuple thereafter. We first develop a theoretical foundation for systematic sampling which suggests that the method gives a more representative sample than the traditional simple random sampling. Subsequent experimental analysis on a range of synthetic relations confirms that the quality of sample relations (participating in a join) yielded by systematic sampling is higher than those produced by the traditional simple random sampling. To ensure that the sample relations produced by the systematic sampling indeed assist in computat...
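The sampling step described in this abstract can be sketched directly: compute k as the ceiling of N/n, pick a random start among the first k tuples, then take every kth tuple. The sketch assumes the relation is already available as a list ordered on the join attribute (standing in for index-ordered access); `systematic_sample` is a hypothetical name, and the join-selectivity estimation built on top of the sample is not shown.

```python
import math
import random


def systematic_sample(sorted_tuples, n, seed=None):
    """Return a systematic sample of roughly n tuples.

    Assumption: `sorted_tuples` is a list already ordered on the join
    attribute, standing in for access through an index.
    """
    total = len(sorted_tuples)  # N in the abstract's notation
    if n <= 0 or total == 0:
        return []
    k = math.ceil(total / n)  # sampling interval k
    start = random.Random(seed).randrange(k)  # random tuple among first k
    return sorted_tuples[start::k]  # that tuple and every kth thereafter


if __name__ == "__main__":
    relation = list(range(100))
    print(systematic_sample(relation, n=10, seed=1))
```

When N is not a multiple of n, the sample can contain fewer than n tuples, depending on the random start.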
On Sampling and Relational Operators
- Bulletin of the Technical Committee on Data Engineering
, 1999
Cited by 3 (2 self)
A major bottleneck in implementing sampling as a primitive relational operation is the inefficiency of sampling the output of a query. We highlight the primary difficulties, summarize the results of some recent work in this area, and indicate directions for future work.
Semantic Cardinality Estimation for Queries over Objects
- In Persistent Object Systems: Principles and Practice. The Seventh Int'l Workshop on Persistent Object Systems, Cape May, NJ
, 1997
Cited by 1 (1 self)
In this paper, we address the problem of estimating cardinalities of queries over sets of objects. We base our estimates on a knowledge of subset relationships between sets (e.g., Students form a subset of People). Previous work on cardinality estimation has assumed that every subset is a representative sample of the set it is taken from. We maintain (offline) information for the cases where the subset is not a uniform representative, and use this information to improve the accuracy of our cardinality estimates. We present cardinality estimation techniques for image sets and select sets. Image sets are created by applying a function to an existing set. Select sets contain the elements that satisfy some predicate in an existing set. Empirically we show that we can obtain estimates within a factor of four for reasonable image sets on our test database while maintaining offline statistics requiring space linear in the number of named sets in the database and the number of attributes defin...

