Results 1  10
of
130
Online Aggregation
, 1997
"... Aggregation in traditional database systems is performed in batch mode: a query is submitted, the system processes a large volume of data over a long period of time, and, eventually, the final answer is returned. This archaic approach is frustrating to users and has been abandoned in most other area ..."
Abstract

Cited by 378 (44 self)
 Add to MetaCart
(Show Context)
Aggregation in traditional database systems is performed in batch mode: a query is submitted, the system processes a large volume of data over a long period of time, and, eventually, the final answer is returned. This archaic approach is frustrating to users and has been abandoned in most other areas of computing. In this paper we propose a new online aggregation interface that permits users to both observe the progress of their aggregation queries and control execution on the fly. After outlining usability and performance requirements for a system supporting online aggregation, we present a suite of techniques that extend a database system to meet these requirements. These include methods for returning the output in random order, for providing control over the relative rate at which different aggregates are computed, and for computing running confidence intervals. Finally, we report on an initial implementation of online aggregation in postgres. 1 Introduction Aggregation is an incre...
An Algorithm for MultiRelational Discovery of Subgroups
, 1997
"... We consider the problem of finding statistically unusual subgroups in a multirelation database, and extend previous work on singlerelation subgroup discovery. We give a precise definition of the multirelation subgroup discovery task, propose a specific form of declarative bias based on foreign ..."
Abstract

Cited by 191 (8 self)
 Add to MetaCart
We consider the problem of finding statistically unusual subgroups in a multirelation database, and extend previous work on singlerelation subgroup discovery. We give a precise definition of the multirelation subgroup discovery task, propose a specific form of declarative bias based on foreign links as a means of specifying the hypothesis space, and show how propositional evaluation functions can be adapted to the multirelation setting. We then describe an algorithm for this problem setting that uses optimistic estimate and minimal support pruning, an optimal refinement operator and sampling to ensure efficiency and can easily be parallelized.
Ripple Joins for Online Aggregation
"... We present a new family of join algorithms, called ripple joins, for online processing of multitable aggregation queries in a relational database management system (dbms). Such queries arise naturally in interactive exploratory decisionsupport applications. Traditional offline join algorithms are ..."
Abstract

Cited by 182 (11 self)
 Add to MetaCart
We present a new family of join algorithms, called ripple joins, for online processing of multitable aggregation queries in a relational database management system (dbms). Such queries arise naturally in interactive exploratory decisionsupport applications. Traditional offline join algorithms are designed to minimize the time to completion of the query. In contrast, ripple joins are designed to minimize the time until an acceptably precise estimate of the query result is available, as measured by the length of a confidence interval. Ripple joins are adaptive, adjusting their behavior during processing in accordance with the statistical properties of the data. Ripple joins also permit the user to dynamically trade off the two key performance factors of online aggregation: the time between successive updates of the running aggregate, and the amount by which the confidenceinterval length decreases at each update. We show how ripple joins can be implemented in an existing dbms using iterators, and we give an overview of the methods used to compute confidence intervals and to adaptively optimize the ripple join "aspectratio" parameters. In experiments with an initial implementation of our algorithms in the postgres dbms, the time required to produce reasonably precise online estimates was up to two orders of magnitude smaller than the time required for the best offline join algorithms to produce exact answers.
Computing on Data Streams
, 1998
"... In this paper we study the space requirement of algorithms that make only one (or a small number of) pass(es) over the input data. We study such algorithms under a model of data streams that we introduce here. We give a number of upper and lower bounds for problems stemming from queryprocessing, ..."
Abstract

Cited by 181 (3 self)
 Add to MetaCart
In this paper we study the space requirement of algorithms that make only one (or a small number of) pass(es) over the input data. We study such algorithms under a model of data streams that we introduce here. We give a number of upper and lower bounds for problems stemming from queryprocessing, invoking in the process tools from the area of communication complexity.
Computing iceberg queries efficiently
 In Proc. of the 24th VLDB Conf
, 1998
"... Many applications compute aggregate functions... ..."
SamplingBased Estimation of the Number of Distinct Values of an Attribute
 In Proc. 21st International Conference on Very Large Databases
, 1995
"... ..."
Random Sampling for Histogram Construction: How much is enough?
, 1998
"... Random sampling is a standard technique for constructing (approximate) histograms for query optimization. However, any real implementation in commercial products requires solving the hard problem of determining "How much sampling is enough?" We address this critical question in the context ..."
Abstract

Cited by 125 (11 self)
 Add to MetaCart
(Show Context)
Random sampling is a standard technique for constructing (approximate) histograms for query optimization. However, any real implementation in commercial products requires solving the hard problem of determining "How much sampling is enough?" We address this critical question in the context of equiheight histograms used in many commercial products, including Microsoft SQL Server. We introduce a conservative error metric capturing the intuition that for an approximate histogram to have low error, the error must be small in all regions of the histogram. We then present a result establishing an optimal bound on the amount of sampling required for prespecified error bounds. We also describe an adaptive page sampling algorithm which achieves greater efficiency by using all values in a sampled page but adjusts the amount of sampling depending on clustering of values in pages. Next, we establish that the problem of estimating the number of distinct values is provably difficult, but propose ...
Tracking join and selfjoin sizes in limited storage
, 2002
"... This paper presents algorithms for tracking (approximate) join and selfjoin sizes in limited storage, in the presence of insertions and deletions to the data set(s). Such algorithms detect changes in join and selfjoin sizes without an expensive recomputation from the base data, and without the lar ..."
Abstract

Cited by 120 (0 self)
 Add to MetaCart
This paper presents algorithms for tracking (approximate) join and selfjoin sizes in limited storage, in the presence of insertions and deletions to the data set(s). Such algorithms detect changes in join and selfjoin sizes without an expensive recomputation from the base data, and without the large space overhead required to maintain such sizes exactly. Query optimizers rely on fast, highquality estimates of join sizes in order to select between various join plans, and estimates of selfjoin sizes are used to indicate the degree of skew in the data. For selfjoins, we considertwo approaches proposed in [Alon, Matias, and Szegedy. The Space Complexity of Approximating the Frequency Moments. JCSS, vol. 58, 1999, p.137147], which we denote tugofwar and samplecount. Wepresent fast algorithms for implementing these approaches, and extensions to handle deletions as well as insertions. We also report on the rst experimental study of the two approaches, on a range of synthetic and realworld data sets. Our study shows that tugofwar provides more accurate estimates for a given storage limit than samplecount, which in turn is far more accurate than a standard samplingbased approach. For example, tugofwar needed only 4{256 memory words, depending on the data set, in order to estimate the selfjoin size
Synopsis Data Structures for Massive Data Sets
"... Abstract. Massive data sets with terabytes of data are becoming commonplace. There is an increasing demand for algorithms and data structures that provide fast response times to queries on such data sets. In this paper, we describe a context for algorithmic work relevant to massive data sets and a f ..."
Abstract

Cited by 116 (13 self)
 Add to MetaCart
Abstract. Massive data sets with terabytes of data are becoming commonplace. There is an increasing demand for algorithms and data structures that provide fast response times to queries on such data sets. In this paper, we describe a context for algorithmic work relevant to massive data sets and a framework for evaluating such work. We consider the use of "synopsis" data structures, which use very little space and provide fast (typically approximated) answers to queries. The design and analysis of effective synopsis data structures o er many algorithmic challenges. We discuss a number of concrete examples of synopsis data structures, and describe fast algorithms for keeping them uptodate in the presence of online updates to the data sets.
BOAT  Optimistic Decision Tree Construction
, 1999
"... Classification is an important data mining problem. Given a training database of records, each tagged with a class label, the goal of classification is to build a concise model that can be used to predict the class label of future, unlabeled records. A very popular class of classifiers are decision ..."
Abstract

Cited by 114 (2 self)
 Add to MetaCart
(Show Context)
Classification is an important data mining problem. Given a training database of records, each tagged with a class label, the goal of classification is to build a concise model that can be used to predict the class label of future, unlabeled records. A very popular class of classifiers are decision trees. All current algorithms to construct decision trees, including all mainmemory algorithms, make one scan over the training database per level of the tree. We introduce a new algorithm (BOAT) for decision tree construction that improves upon earlier algorithms in both performance and functionality. BOAT constructs several levels of the tree in only two scans over the training database, resulting in an average performance gain of 300% over previous work. The key to this performance improvement is a novel optimistic approach to tree construction in which we construct an initial tree using a small subset of the data and refine it to arrive at the final tree. We guarantee that any differen...