Online Aggregation
, 1997
Abstract

Cited by 311 (44 self)
Aggregation in traditional database systems is performed in batch mode: a query is submitted, the system processes a large volume of data over a long period of time, and, eventually, the final answer is returned. This archaic approach is frustrating to users and has been abandoned in most other areas of computing. In this paper we propose a new online aggregation interface that permits users to both observe the progress of their aggregation queries and control execution on the fly. After outlining usability and performance requirements for a system supporting online aggregation, we present a suite of techniques that extend a database system to meet these requirements. These include methods for returning the output in random order, for providing control over the relative rate at which different aggregates are computed, and for computing running confidence intervals. Finally, we report on an initial implementation of online aggregation in postgres.
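A minimal sketch of the running-estimate-plus-confidence-interval idea this abstract describes, assuming tuples are scanned in random order and using a CLT-style interval via Welford's online variance (the function name and interval choice are mine, not the paper's):

```python
import math
import random

def online_avg(stream, z=1.96):
    """Yield (running mean, ~95% CI half-width) after each tuple, so a
    user can watch the estimate converge and stop the query early."""
    n, mean, m2 = 0, 0.0, 0.0  # Welford's online mean/variance
    for x in stream:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
        half = z * math.sqrt(m2 / (n - 1) / n) if n > 1 else float("inf")
        yield mean, half

random.seed(0)
data = [random.gauss(50.0, 10.0) for _ in range(100_000)]
random.shuffle(data)  # the interval assumes tuples arrive in random order
for i, (mean, half) in enumerate(online_avg(data), 1):
    if i in (100, 10_000, 100_000):
        print(f"after {i:>6} tuples: AVG ~ {mean:.2f} +/- {half:.2f}")
```

The interval shrinks like 1/sqrt(n), which is why a useful answer appears long before the scan finishes.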
Random Sampling for Histogram Construction: How much is enough?
, 1998
Abstract

Cited by 106 (11 self)
Random sampling is a standard technique for constructing (approximate) histograms for query optimization. However, any real implementation in commercial products requires solving the hard problem of determining "How much sampling is enough?" We address this critical question in the context of equi-height histograms used in many commercial products, including Microsoft SQL Server. We introduce a conservative error metric capturing the intuition that for an approximate histogram to have low error, the error must be small in all regions of the histogram. We then present a result establishing an optimal bound on the amount of sampling required for prespecified error bounds. We also describe an adaptive page sampling algorithm that achieves greater efficiency by using all values in a sampled page but adjusts the amount of sampling depending on the clustering of values in pages. Next, we establish that the problem of estimating the number of distinct values is provably difficult, but propose ...
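The core construction in this abstract, building an equi-height (equi-depth) histogram from a uniform sample, can be sketched as follows; the function name is mine, and the paper's contribution is bounding the sample size, not this construction itself:

```python
import random

def equiheight_from_sample(values, sample_size, num_buckets, seed=0):
    """Approximate an equi-height histogram: sort a uniform sample and
    use its quantiles as bucket boundaries, so each bucket covers
    roughly len(values) / num_buckets rows."""
    rng = random.Random(seed)
    sample = sorted(rng.sample(values, sample_size))
    return [sample[i * sample_size // num_buckets] for i in range(1, num_buckets)]

random.seed(1)
col = [int(random.expovariate(0.01)) for _ in range(50_000)]  # skewed column
bounds = equiheight_from_sample(col, sample_size=2_000, num_buckets=10)
print("approximate bucket boundaries:", bounds)
```

The question the paper answers is how large `sample_size` must be before every boundary is close to its true quantile, not just most of them.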
Overcoming Limitations of Sampling for Aggregation Queries
 In ICDE
, 2001
Abstract

Cited by 42 (6 self)
We study the problem of approximately answering aggregation queries using sampling. We observe that uniform sampling performs poorly when the distribution of the aggregated attribute is skewed. To address this issue, we introduce a technique called outlier-indexing. Uniform sampling is also ineffective for queries with low selectivity. We rely on weighted sampling based on workload information to overcome this shortcoming. We demonstrate that a combination of outlier-indexing with weighted sampling can be used to answer aggregation queries with significantly reduced approximation error compared to either uniform sampling or weighted sampling alone. We discuss the implementation of these techniques on Microsoft’s SQL Server, and present experimental results that demonstrate the merits of our techniques.
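A toy sketch of the outlier-indexing idea for SUM: aggregate the heavy outliers exactly from a small side index and scale a uniform sample of the remaining rows (the threshold, names, and parameters here are my illustrative assumptions, not the paper's method for choosing them):

```python
import random

def outlier_index_sum(values, threshold, sample_frac, seed=0):
    """Estimate SUM(values): sum the outliers exactly from an 'outlier
    index', and scale a uniform sample of the non-outlier rows."""
    rng = random.Random(seed)
    outliers = [v for v in values if abs(v) > threshold]  # kept exactly
    rest = [v for v in values if abs(v) <= threshold]     # sampled
    k = max(1, int(len(rest) * sample_frac))
    sample = rng.sample(rest, k)
    return sum(outliers) + sum(sample) * (len(rest) / k)

random.seed(2)
vals = [random.randint(1, 100) for _ in range(99_000)] + [10**6] * 10  # skew
est = outlier_index_sum(vals, threshold=10_000, sample_frac=0.01)
print(f"estimate {est:.0f} vs exact {sum(vals)}")
```

With the outliers removed, the sampled remainder has low variance, which is exactly why the combined estimator beats uniform sampling on skewed data.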
Bifocal Sampling for Skew-Resistant Join Size Estimation
 In Proceedings of the 1996 ACM SIGMOD Intl. Conf. on Management of Data
, 1996
Abstract

Cited by 32 (6 self)
This paper introduces bifocal sampling, a new technique for estimating the size of an equijoin of two relations. Bifocal sampling classifies tuples in each relation into two groups, sparse and dense, based on the number of tuples with the same join value. Distinct estimation procedures are employed that focus on various combinations of joining tuples (e.g., for estimating the number of joining tuples that are dense in both relations). This combination of estimation procedures overcomes some well-known problems in previous schemes, enabling good estimates with no a priori knowledge about the data distribution. The estimate obtained by the bifocal sampling algorithm is proven to lie with high probability within a small constant factor of the actual join size, regardless of the skew, as long as the join size is Ω(n lg n), for relations consisting of n tuples. The algorithm requires a sample of size at most O(√(n lg n)). By contrast, previous algorithms using a sample of similar size may require the join size to be Ω(n√n) to guarantee an accurate estimate. Experimental results support the theoretical claims and show that bifocal sampling is practical and effective.
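As a loose illustration of the dense-dense case only (the paper pairs this with separate estimators for sparse tuples, and its cutoffs and scaling are more careful than this toy; every identifier here is my assumption):

```python
import random
from collections import Counter

def dense_dense_estimate(R, S, frac, cutoff, seed=0):
    """Toy sub-estimator in the spirit of bifocal sampling: for join
    values that look dense (frequent) in BOTH samples, scale the product
    of sample frequencies up to estimate their share of |R join S|."""
    rng = random.Random(seed)
    sr = Counter(rng.sample(R, int(len(R) * frac)))
    ss = Counter(rng.sample(S, int(len(S) * frac)))
    est = 0.0
    for v, c in sr.items():
        if c >= cutoff and ss.get(v, 0) >= cutoff:
            est += (c / frac) * (ss[v] / frac)  # scale both frequencies
    return est
```

Frequent join values show up reliably in both samples, so their frequency products can be estimated directly; the sparse cases are what require the separate procedures the abstract mentions.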
AQUA: System and techniques for approximate query answering
, 1998
Abstract

Cited by 22 (5 self)
In large data recording and warehousing environments, it is often advantageous to provide fast, approximate answers to queries. The goal is to provide an estimated response in orders of magnitude less time than the time to compute an exact answer, by avoiding or minimizing the number of accesses to the base data. This paper presents the Approximate QUery Answering (AQUA) System, for fast, highly accurate approximate answers to queries. Aqua provides approximate answers using small, precomputed synopses (samples, counts, etc.) of the underlying base data. An important feature of Aqua is that it provides accuracy guarantees without any a priori assumptions on the data distribution, the order in which the base data is loaded, or the layout of the data on the disks. Currently, the system provides fast approximate answers for queries with selects, aggregates, group-bys and/or joins (especially the multi-way foreign key joins that are popular in OLAP). We present several new techniques for improving the accuracy of approximate query answers for this class of queries. We show how join sampling can significantly improve the approximation quality. We describe how biased sampling can be used to overcome the problem of group size disparities.
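A minimal sketch of how biased sampling addresses group size disparities: sample up to a fixed number of rows from every group so small groups are not drowned out by large ones (Aqua's actual congressional samples are more elaborate; the function name and cap are mine):

```python
import random
from collections import defaultdict

def groupby_avg_biased(rows, per_group, seed=0):
    """Stratified sketch: draw up to `per_group` rows from each group,
    then answer AVG per group. AVG within a stratum needs no
    reweighting; SUM or COUNT estimates would."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for g, v in rows:
        groups[g].append(v)
    answers = {}
    for g, vals in groups.items():
        samp = rng.sample(vals, min(per_group, len(vals)))
        answers[g] = sum(samp) / len(samp)
    return answers
```

Under uniform sampling a tiny group might contribute zero sampled rows and get no answer at all; capping per group guarantees every group is represented.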
Aqua project white paper
, 1997
Abstract

Cited by 18 (10 self)
In large data recording and warehousing environments, it is often advantageous to provide fast, approximate answers to queries, whenever possible. The goal is to provide an estimated response in orders of magnitude less time than the time to compute an exact answer, by avoiding or minimizing the number of accesses to the base data. This white paper describes the Approximate QUery Answering (AQUA) Project underway in the Information Sciences Research Center at Bell Labs. We present a framework for an approximate query engine that observes new data as it arrives and maintains small synopsis data structures on that data. These data structures are used to provide fast, approximate answers to a broad class of queries. We describe metrics for evaluating approximate query answers. We also present new synopsis data structures, and new techniques for approximate query answers. We report on the goals and status of the Aqua project, and plans for future work.
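The simplest synopsis of the kind described here, one that "observes new data as it arrives" in a single pass, is a reservoir sample; this classic algorithm (Vitter's Algorithm R) is a representative sketch, not the white paper's specific data structures:

```python
import random

def reservoir(stream, k, seed=0):
    """Maintain a uniform random size-k sample of a stream in one pass
    and O(k) memory, updating as each new item arrives."""
    rng = random.Random(seed)
    res = []
    for i, x in enumerate(stream):
        if i < k:
            res.append(x)            # fill the reservoir first
        else:
            j = rng.randrange(i + 1)  # keep x with probability k/(i+1)
            if j < k:
                res[j] = x
    return res
```

Any query that can be answered from a uniform sample can then be answered from the reservoir without revisiting the base data.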
Why go logarithmic if we can go linear? Towards effective distinct counting of search traffic
 In Extending Database Technology (EDBT)
, 2008
Abstract

Cited by 7 (0 self)
Estimating the number of distinct elements in a large multiset has several applications, and hence has attracted active research in the past two decades. Several sampling and sketching algorithms have been proposed to accurately solve this problem. The goal of the literature has always been to estimate the number of distinct elements while using minimal resources. However, in some modern applications, the accuracy of the estimate is of paramount importance, and businesses are willing to trade more resources for better accuracy. While building a distinct count system at a major search engine, Ask.com, we reviewed the literature of approximating distinct counts, and compared most algorithms in the literature. We deduced that Linear Counting, one of the least used algorithms, has unique and impressive advantages when the accuracy of the distinct count is critical to the business. For other estimators to attain comparable accuracy, they need more space than Linear Counting. We have supported our analytical results through comprehensive experiments. The experimental results highly favor Linear Counting when the number of distinct elements is large and the error tolerance is low.
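Linear Counting itself is short enough to show in full: hash every item into an m-slot bitmap and estimate the distinct count from the fraction of slots left at zero (the hash choice and byte-per-slot bitmap are my simplifications; a real implementation packs bits):

```python
import hashlib
import math

def linear_count(items, m):
    """Linear Counting: hash items into an m-slot bitmap, then estimate
    the distinct count as n_hat = -m * ln(V), where V is the fraction
    of slots still zero."""
    bitmap = bytearray(m)  # one byte per slot, for clarity
    for x in items:
        h = int.from_bytes(hashlib.md5(str(x).encode()).digest()[:8], "big")
        bitmap[h % m] = 1
    zeros = m - sum(bitmap)
    if zeros == 0:
        return float("inf")  # bitmap saturated: m was chosen too small
    return -m * math.log(zeros / m)

stream = [i % 1000 for i in range(100_000)]  # 1000 distinct values
print(f"linear-counting estimate: {linear_count(stream, m=10_000):.1f}")
```

The space is linear in the number of distinct elements (hence the title), but at low load factors the estimate is extremely accurate, which is the trade the abstract argues for.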
Selectivity Estimation for Joins Using Systematic Sampling
 In Proceedings of International Workshop on Database and Expert System Applications
, 1997
Abstract

Cited by 4 (0 self)
We propose a new approach to the estimation of join selectivity. The technique, which we have called "systematic sampling", is a novel variant of the sampling-based approach. Systematic sampling works as follows: given a relation R of N tuples, with a join attribute that can be accessed in ascending/descending order via an index, if n is the number of tuples to be sampled from R, select a tuple at random from the first k = ⌈N/n⌉ tuples of R and every k-th tuple thereafter. We first develop a theoretical foundation for systematic sampling which suggests that the method gives a more representative sample than traditional simple random sampling. Subsequent experimental analysis on a range of synthetic relations confirms that the quality of sample relations (participating in a join) yielded by systematic sampling is higher than that produced by traditional simple random sampling. To ensure that the sample relations produced by systematic sampling indeed assist in computat...
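The procedure in the abstract translates almost directly to code; a minimal sketch, assuming the relation is already materialized in index order as a Python list:

```python
import random

def systematic_sample(indexed_rel, n, seed=0):
    """Systematic sampling: with k = ceil(N / n), pick a random start
    among the first k tuples of the index-ordered relation, then take
    every k-th tuple after it."""
    rng = random.Random(seed)
    N = len(indexed_rel)
    k = -(-N // n)  # ceil(N / n)
    start = rng.randrange(k)
    return indexed_rel[start::k]
```

Because the sample is evenly spaced over the sorted join attribute, it covers every region of the value domain, which is the intuition behind the representativeness claim.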
On Sampling and Relational Operators
 In Bulletin of the Technical Committee on Data Engineering
, 1999
Abstract

Cited by 3 (2 self)
A major bottleneck in implementing sampling as a primitive relational operation is the inefficiency of sampling the output of a query. We highlight the primary difficulties, summarize the results of some recent work in this area, and indicate directions for future work.