Results 11 - 20
of
230
Join synopses for approximate query answering
- In SIGMOD
, 1999
"... In large data warehousing environments, it is often advantageous to provide fast, approximate answers to complex aggregate queries based on statistical summaries of the full data. In this paper, we demonstrate the difficulty of providing good approximate answers for join-queries using only statistic ..."
Abstract
-
Cited by 128 (10 self)
- Add to MetaCart
In large data warehousing environments, it is often advantageous to provide fast, approximate answers to complex aggregate queries based on statistical summaries of the full data. In this paper, we demonstrate the difficulty of providing good approximate answers for join-queries using only statistics (in particular, samples) from the base relations. We propose join synopses (join samples) as an effective solution for this problem and show how precomputing just one join synopsis for each relation suffices to significantly improve the quality of approximate answers for arbitrary queries with foreign key joins. We present optimal strategies for allocating the available space among the various join synopses when the query work load is known and identify heuristics for the common case when the work load is not known. We also present efficient algorithms for incrementally maintaining join synopses in the presence of updates to the base relations. One of our key contributions is a detailed analysis of the error bounds obtained for approximate answers that demonstrates the trade-offs in various methods, as well as the advantages in certain scenarios of a new subsampling method we propose. Our extensive set of experiments on the TPC-D benchmark database show the effectiveness of join synopses and various other techniques proposed in this paper. 1
Computing iceberg queries efficiently
- In Proc. of the 24th VLDB Conf
, 1998
"... Many applications compute aggregate functions... ..."
Supporting Aggregate Queries over Ad-Hoc Wireless Sensor Networks
- In Workshop on Mobile Computing and Systems Applications
, 2002
"... We show how the database community's notion of a generic query interface for data aggregation can be applied to ad-hoc networks of sensor devices. As has been noted in the sensor network literature, aggregation is important as a data-reduction tool; networking approaches, however, have focused on ap ..."
Abstract
-
Cited by 117 (4 self)
- Add to MetaCart
We show how the database community's notion of a generic query interface for data aggregation can be applied to ad-hoc networks of sensor devices. As has been noted in the sensor network literature, aggregation is important as a data-reduction tool; networking approaches, however, have focused on application specific solutions, whereas our innetwork aggregation approach is driven by a general purpose, SQL-style interface that can execute queries over any type of sensor data while providing opportunities for significant optimization. We present a variety of techniques to improve the reliability and performance of our solution. We also show how grouped aggregates can be efficiently computed and offer a comparison to related systems and database projects.
Finding interesting associations without support pruning
- In ICDE
, 2000
"... Abstract Association-rule mining has heretofore relied on the condition of high support to do its work efficiently. In particular, the well-known a-priori algorithm is only effective when the only rules of interest are relationships that occur very frequently. However, there are a number of applicat ..."
Abstract
-
Cited by 111 (13 self)
- Add to MetaCart
Abstract Association-rule mining has heretofore relied on the condition of high support to do its work efficiently. In particular, the well-known a-priori algorithm is only effective when the only rules of interest are relationships that occur very frequently. However, there are a number of applications, such as data mining, identification of similar web documents, clustering, and collaborative filtering, where the rules of interest have comparatively few instances in the data. In these cases, we must look for highly correlated items, or possibly even causal relationships between infrequent items. We develop a family of algorithms for solving this problem, employing a combination of random sampling and hashing techniques. We provide analysis of the algorithms developed, and conduct experiments on real and synthetic data to obtain a comparative performance analysis.
Incremental Distance Join Algorithms for Spatial Databases
, 1998
"... Two new spatial join operations, distance join and distance semijoin, are introduced where the join output is ordered by the distance between the spatial attribute values of the joined tuples. Incremental algorithms are presented for computing these operations, which can be used in a pipelined fashi ..."
Abstract
-
Cited by 97 (9 self)
- Add to MetaCart
Two new spatial join operations, distance join and distance semijoin, are introduced where the join output is ordered by the distance between the spatial attribute values of the joined tuples. Incremental algorithms are presented for computing these operations, which can be used in a pipelined fashion, thereby obviating the need to wait for their completion when only a few tuples are needed. The algorithms can be used with a large class of hierarchical spatial data structures and arbitrary spatial data types in any dimensions. In addition, any distance metric may be employed. A performance study using Rtrees shows that the incremental algorithms outperform non-incremental approaches by an order of magnitude if only a small part of the result is needed, while the penalty, if any, for the incremental processing is modest if the entire join result is required.
Synopsis Data Structures for Massive Data Sets
"... Abstract. Massive data sets with terabytes of data are becoming commonplace. There is an increasing demand for algorithms and data structures that provide fast response times to queries on such data sets. In this paper, we describe a context for algorithmic work relevant to massive data sets and a f ..."
Abstract
-
Cited by 96 (13 self)
- Add to MetaCart
Abstract. Massive data sets with terabytes of data are becoming commonplace. There is an increasing demand for algorithms and data structures that provide fast response times to queries on such data sets. In this paper, we describe a context for algorithmic work relevant to massive data sets and a framework for evaluating such work. We consider the use of "synopsis" data structures, which use very little space and provide fast (typically approximated) answers to queries. The design and analysis of effective synopsis data structures o er many algorithmic challenges. We discuss a number of concrete examples of synopsis data structures, and describe fast algorithms for keeping them up-to-date in the presence of online updates to the data sets.
Tracking join and self-join sizes in limited storage
, 2002
"... This paper presents algorithms for tracking (approximate) join and self-join sizes in limited storage, in the presence of insertions and deletions to the data set(s). Such algorithms detect changes in join and self-join sizes without an expensive recomputation from the base data, and without the lar ..."
Abstract
-
Cited by 89 (0 self)
- Add to MetaCart
This paper presents algorithms for tracking (approximate) join and self-join sizes in limited storage, in the presence of insertions and deletions to the data set(s). Such algorithms detect changes in join and self-join sizes without an expensive recomputation from the base data, and without the large space overhead required to maintain such sizes exactly. Query optimizers rely on fast, high-quality estimates of join sizes in order to select between various join plans, and estimates of self-join sizes are used to indicate the degree of skew in the data. For self-joins, we considertwo approaches proposed in [Alon, Matias, and Szegedy. The Space Complexity of Approximating the Frequency Moments. JCSS, vol. 58, 1999, p.137-147], which we denote tug-of-war and sample-count. Wepresent fast algorithms for implementing these approaches, and extensions to handle deletions as well as insertions. We also report on the rst experimental study of the two approaches, on a range of synthetic and real-world data sets. Our study shows that tug-of-war provides more accurate estimates for a given storage limit than sample-count, which in turn is far more accurate than a standard sampling-based approach. For example, tug-of-war needed only 4{256 memory words, depending on the data set, in order to estimate the self-join size
Design and Evaluation of a Conit-Based Continuous Consistency Model for Replicated Services
- ACM TRANSACTIONS ON COMPUTER SYSTEMS
, 2002
"... ..."
Offering a Precision-Performance Tradeoff for Aggregation Queries over Replicated Data
, 2000
"... Strict consistency of replicated data is infeasible or not required by many distributed applications, so current systems often permit stale replication,inwhich cached copies of data values are allowed to become out of date. Queries over cached data return an answer quickly, but the stale answer ..."
Abstract
-
Cited by 80 (8 self)
- Add to MetaCart
Strict consistency of replicated data is infeasible or not required by many distributed applications, so current systems often permit stale replication,inwhich cached copies of data values are allowed to become out of date. Queries over cached data return an answer quickly, but the stale answer may be unboundedly imprecise. Alternatively, queries over remote master data return a precise answer, but with potentially poor performance. To bridge the gap between these two extremes, we propose a new class of replication systems called TRAPP (Tradeoff in Replication Precision and Performance). TRAPP systems give each user fine-grained control over the tradeoff between precision and performance: Caches store ranges that are guaranteed to bound the current data values, instead of storing stale exact values. Users supply a quantitative precision constraint along with each query. To answer a query, TRAPP systems automatically select a combination of locally cached bounds and exact master data stored remotely to deliver a bounded answer consisting of a range that is no wider than the specified precision constraint, that is guaranteed to contain the precise answer, and that is computed as quickly as possible. This paper defines the architecture of TRAPP replication systems and covers some mechanics of caching data ranges. It then focuses on queries with aggregation, presenting optimization algorithms for answering queries with precision constraints, and reporting on performance experiments that demonstrate the fine-grained control of the precision-performance tradeoff offered by TRAPP systems.

