Results 1 - 10
of
36
Trio: a system for integrated management of data, accuracy, and lineage
- PRESENTED AT CIDR 2005
, 2005
"... Trio is a new database system that manages not only data, butalsotheaccuracy and lineage of the data. Inexact (uncertain, probabilistic, fuzzy, approximate, incomplete, and imprecise!) databases have been proposed in the past, and the lineage problem also has been studied. The goals of the Trio proj ..."
Abstract
-
Cited by 174 (11 self)
- Add to MetaCart
Trio is a new database system that manages not only data, butalsotheaccuracy and lineage of the data. Inexact (uncertain, probabilistic, fuzzy, approximate, incomplete, and imprecise!) databases have been proposed in the past, and the lineage problem also has been studied. The goals of the Trio project are to combine and distill previous work into a simple and usable model, design a query language as an understandable extension to SQL, and most importantly build a working system—a system that augments conventional data management with both accuracy and lineage as an integral part of the data. This paper provides numerous motivating applications for Trio and lays out preliminary plans for the data model, query language, and prototype system.
Join synopses for approximate query answering
- In SIGMOD
, 1999
"... In large data warehousing environments, it is often advantageous to provide fast, approximate answers to complex aggregate queries based on statistical summaries of the full data. In this paper, we demonstrate the difficulty of providing good approximate answers for join-queries using only statistic ..."
Abstract
-
Cited by 128 (10 self)
- Add to MetaCart
In large data warehousing environments, it is often advantageous to provide fast, approximate answers to complex aggregate queries based on statistical summaries of the full data. In this paper, we demonstrate the difficulty of providing good approximate answers for join-queries using only statistics (in particular, samples) from the base relations. We propose join synopses (join samples) as an effective solution for this problem and show how precomputing just one join synopsis for each relation suffices to significantly improve the quality of approximate answers for arbitrary queries with foreign key joins. We present optimal strategies for allocating the available space among the various join synopses when the query work load is known and identify heuristics for the common case when the work load is not known. We also present efficient algorithms for incrementally maintaining join synopses in the presence of updates to the base relations. One of our key contributions is a detailed analysis of the error bounds obtained for approximate answers that demonstrates the trade-offs in various methods, as well as the advantages in certain scenarios of a new subsampling method we propose. Our extensive set of experiments on the TPC-D benchmark database show the effectiveness of join synopses and various other techniques proposed in this paper. 1
Data-Streams and Histograms
, 2001
"... Histograms have been used widely to capture data distribution, to represent the data by a small number of step functions. Dynamic programming algorithms which provide optimal construction of these histograms exist, albeit running in quadratic time and linear space. In this paper we provide linear ti ..."
Abstract
-
Cited by 121 (8 self)
- Add to MetaCart
Histograms have been used widely to capture data distribution, to represent the data by a small number of step functions. Dynamic programming algorithms which provide optimal construction of these histograms exist, albeit running in quadratic time and linear space. In this paper we provide linear time construction of 1 + epsilon approximation of optimal histograms, running in polylogarithmic space. Our results extend to the context of data-streams, and in fact generalize to give 1 + epsilon approximation of several problems in data-streams which require partitioning the index set into intervals. The only assumptions required are that the cost of an interval is monotonic under inclusion (larger interval has larger cost) and that the cost can be computed or approximated in small space. This exhibits a nice class of problems for which we can have near optimal data-stream algorithms.
Distinct sampling for highly-accurate answers to distinct values queries and event reports
- In Proceedings of the 27th International Conference on Very Large Data Bases
"... Estimating the number of distinct values is a wellstudied problem, due to its frequent occurrence in queries and its importance in selecting good query plans. Previous work has shown powerful negative results on the quality of distinct-values estimates based on sampling (or other techniques that exa ..."
Abstract
-
Cited by 73 (5 self)
- Add to MetaCart
Estimating the number of distinct values is a wellstudied problem, due to its frequent occurrence in queries and its importance in selecting good query plans. Previous work has shown powerful negative results on the quality of distinct-values estimates based on sampling (or other techniques that examine only part of the input data). We present an approach, called distinct sampling, that collects a specially tailored sample over the distinct values in the input, in a single scan of the data. In contrast to the previous negative results, our small Distinct Samples are guaranteed to accurately estimate the number of distinct values. The samples can be incrementally maintained up-to-date in the presence of data insertions and deletions, with minimal time and memory overheads, so that the full scan may be performed only once. Moreover, a stored Distinct Sample can be used to accurately estimate the number of distinct values within any range specified by the query, or within any other subset of the data satisfying a query predicate. We present an extensive experimental study of distinct sampling. Using synthetic and real-world data sets, we show that distinct sampling gives distinct-values estimates to within 0%–10 % relative error, whereas previous methods typically incur 50%–250 % relative error. Next, we show how distinct sampling can provide fast, highlyaccurate approximate answers for “report ” queries in high-volume, session-based event recording environments, such as IP networks, customer service call centers, etc. For a commercial call center environment, we show that a 1 % Distinct Sample
ICICLES: Self-tuning Samples for Approximate Query Answering
- VLDB
, 2000
"... Approximate query answering systems provide very fast alternatives to OLAP systems when applications are tolerant to small errors in query answers. ..."
Abstract
-
Cited by 44 (0 self)
- Add to MetaCart
Approximate query answering systems provide very fast alternatives to OLAP systems when applications are tolerant to small errors in query answers.
Approximating a Data Stream for Querying and Estimation: Algorithms and Performance Evaluation
- In ICDE
, 2002
"... Obtaining fast and good quality approximations to data distributions is a problem of central interest to database management. A variety of popular database applications including, approximate querying, similarity searching and data mining in most application domains, rely on such good quality approx ..."
Abstract
-
Cited by 43 (5 self)
- Add to MetaCart
Obtaining fast and good quality approximations to data distributions is a problem of central interest to database management. A variety of popular database applications including, approximate querying, similarity searching and data mining in most application domains, rely on such good quality approximations. Histogram based approximation is a very popular method in database theory and practice to succinctly represent a data distribution in a space efficient manner. In this paper, we place the problem of histogram construction into perspective and we generalize it by raising the requirement of a finite data set and/or known data set size. We consider the case of an infinite data set on which data arrive continuously forming an infinite data stream. In this context, we present the first single pass algorithms capable of constructing histograms of provable good quality. We present algorithms for the fixed window variant of the basic histogram construction problem, supporting incremental maintenance of the histograms. The proposed algorithms trade accuracy for speed and allow for a graceful tradeoff between the two, based on application requirements. In the case of approximate queries on infinite data streams, we present a detailed experimental evaluation comparing our algorithms with other applicable techniques using real data sets, demonstrating the superiority of our proposal. 1
Crossing the Structure Chasm
- IN CIDR
, 2003
"... It has frequently been observed that most of the world's data lies outside database systems. The reason is that database systems focus on structured data, leaving the unstructured realm to others. The world of unstructured data has several very appealing properties, such as ease of authoring, query ..."
Abstract
-
Cited by 42 (15 self)
- Add to MetaCart
It has frequently been observed that most of the world's data lies outside database systems. The reason is that database systems focus on structured data, leaving the unstructured realm to others. The world of unstructured data has several very appealing properties, such as ease of authoring, querying and data sharing. In contrast, authoring, querying and sharing structured data require significant effort, albeit with the benefit of rich query languages and exact answers. We argue
Beyond Independence: Probabilistic Models for Query Approximation on Binary Transaction Data
, 2001
"... We investigate the problem of generating fast approximate answers to queries for large sparse binary data sets. We focus in particular on probabilistic model-based approaches to this problem and develop a number of techniques that are significantly more accurate than a baseline independence model. I ..."
Abstract
-
Cited by 34 (6 self)
- Add to MetaCart
We investigate the problem of generating fast approximate answers to queries for large sparse binary data sets. We focus in particular on probabilistic model-based approaches to this problem and develop a number of techniques that are significantly more accurate than a baseline independence model. In particular, we introduce a novel technique for building probabilistic models from frequent itemsets. The itemsets are treated as constraints on the distribution of the query variables and the maximum entropy principle is used online to build a joint probability model for attributes in the query. We show that the resulting probability model defines a Markov random field (MRF) and that the time taken to answer a query scales exponentially as a function of the induced width of the associated MRF graph. We empirically compare the MRF model to other probabilistic models, such as the independence model, the Chow-Liu tree model, the Bernoulli mixture model, and the ADTree model. Experimental resu...
REHIST: Relative error histogram construction algorithms
- In VLDB
, 2004
"... Histograms and Wavelet synopses provide useful tools in query optimization and approximate query answering. Traditional histogram construction algorithms, such as V-Optimal, optimize absolute error measures for which the error in estimating a true value of 10 by 20 has the same effect of estimating ..."
Abstract
-
Cited by 26 (4 self)
- Add to MetaCart
Histograms and Wavelet synopses provide useful tools in query optimization and approximate query answering. Traditional histogram construction algorithms, such as V-Optimal, optimize absolute error measures for which the error in estimating a true value of 10 by 20 has the same effect of estimating a true value of 1000 by 1010. However, several researchers have recently pointed out the drawbacks of such schemes and proposed wavelet based schemes to minimize relative error measures. None of these schemes provide satisfactory guarantees – and we provide evidence that the difficulty may lie in the choice of wavelets as the representation scheme. In this paper, we consider histogram construction for the known relative error measures. We develop optimal as well as fast approximation algorithms. We provide a comprehensive theoretical analysis and demonstrate the effectiveness of these algorithms in providing significantly more accurate answers through synthetic and real life data sets.
Optimal and Approximate Computation of Summary Statistics for Range Aggregates
- In Proc. of ACM PODS
, 2001
"... Fast estimates for aggregate queries are useful in database query optimization, approximate query answering and online query processing. Hence, there has been a lot of focus on "selectivity estimation", that is, computing summary statistics on the underlying data and using that to answer aggrega ..."
Abstract
-
Cited by 25 (2 self)
- Add to MetaCart
Fast estimates for aggregate queries are useful in database query optimization, approximate query answering and online query processing. Hence, there has been a lot of focus on "selectivity estimation", that is, computing summary statistics on the underlying data and using that to answer aggregate queries fast and to a reasonable approximation. We present two sets of results for range aggregate queries, which are amongst the most common queries. First, we focus on a histogram as summary statistics and present algorithms for constructing histograms that are provably optimal (or provably approximate) for range queries; these algorithms take (pseudo-) polynomial time. These are the first known optimality or approximation results for arbitrary range queries; previously known results were optimal only for restricted range queries (such as equality queries, hierarchical or prefix range queries). Second, we focus on wavelet-based representations as summary statistics and present fast algorithms for picking wavelet statistics that are provably optimal for range queries. No previously-knownwavelet-based methods have this property. We perform an experimental study of the various summary representations show the benefits of our algorithms over the known methods. AT&T Labs---Research, Florham Park, NJ. fagilbert, kotidis, muthu, mstraussg@research.att.com. 1

