Results 1-10 of 10
Join synopses for approximate query answering
 In SIGMOD
, 1999
Cited by 142 (9 self)
In large data warehousing environments, it is often advantageous to provide fast, approximate answers to complex aggregate queries based on statistical summaries of the full data. In this paper, we demonstrate the difficulty of providing good approximate answers for join queries using only statistics (in particular, samples) from the base relations. We propose join synopses (join samples) as an effective solution for this problem and show how precomputing just one join synopsis for each relation suffices to significantly improve the quality of approximate answers for arbitrary queries with foreign key joins. We present optimal strategies for allocating the available space among the various join synopses when the query workload is known and identify heuristics for the common case when the workload is not known. We also present efficient algorithms for incrementally maintaining join synopses in the presence of updates to the base relations. One of our key contributions is a detailed analysis of the error bounds obtained for approximate answers that demonstrates the tradeoffs in various methods, as well as the advantages in certain scenarios of a new subsampling method we propose. Our extensive set of experiments on the TPC-D benchmark database shows the effectiveness of join synopses and various other techniques proposed in this paper.
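The idea of a join synopsis described above can be illustrated with a minimal sketch. It assumes a toy star schema with a hypothetical `orders` fact table and `customers` dimension table; these names, the helper functions, and the scaling estimator are illustrative assumptions, not the paper's implementation. The sketch samples the fact table uniformly, joins each sampled row with its single foreign-key match, and scales a count over the sample up to the full join size.

```python
import random

def join_synopsis(fact_rows, dim_by_key, fk, sample_size, seed=0):
    """Uniform sample of fact rows, each joined with its foreign-key match."""
    rng = random.Random(seed)
    sample = rng.sample(fact_rows, sample_size)
    # A foreign key guarantees exactly one matching dimension row per fact row.
    return [{**row, **dim_by_key[row[fk]]} for row in sample]

def estimate_count(synopsis, total_fact_rows, predicate):
    """Scale the fraction of qualifying sampled rows up to the full join size."""
    hits = sum(1 for row in synopsis if predicate(row))
    return total_fact_rows * hits / len(synopsis)

# Hypothetical toy data: 100 customers, 10,000 orders.
customers = {c: {"region": "EU" if c % 2 == 0 else "US"} for c in range(100)}
orders = [{"order_id": i, "cust_id": i % 100, "amount": 10 + i % 7}
          for i in range(10_000)]

synopsis = join_synopsis(orders, customers, "cust_id", sample_size=500)
approx = estimate_count(synopsis, len(orders), lambda r: r["region"] == "EU")
exact = sum(1 for o in orders if customers[o["cust_id"]]["region"] == "EU")
print(f"approximate count: {approx:.0f}, exact count: {exact}")
```

Because the sample is taken before the join, every sampled fact row survives the join, which is what sampling the base relations independently cannot guarantee.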
Synopsis Data Structures for Massive Data Sets
Cited by 108 (13 self)
Massive data sets with terabytes of data are becoming commonplace. There is an increasing demand for algorithms and data structures that provide fast response times to queries on such data sets. In this paper, we describe a context for algorithmic work relevant to massive data sets and a framework for evaluating such work. We consider the use of "synopsis" data structures, which use very little space and provide fast (typically approximated) answers to queries. The design and analysis of effective synopsis data structures offer many algorithmic challenges. We discuss a number of concrete examples of synopsis data structures, and describe fast algorithms for keeping them up-to-date in the presence of online updates to the data sets.
AQUA: System and techniques for approximate query answering
, 1998
Cited by 22 (5 self)
In large data recording and warehousing environments, it is often advantageous to provide fast, approximate answers to queries. The goal is to provide an estimated response in orders of magnitude less time than the time to compute an exact answer, by avoiding or minimizing the number of accesses to the base data. This paper presents the Approximate QUery Answering (AQUA) System, for fast, highly accurate approximate answers to queries. Aqua provides approximate answers using small, precomputed synopses (samples, counts, etc.) of the underlying base data. An important feature of Aqua is that it provides accuracy guarantees without any a priori assumptions on either the data distribution, the order in which the base data is loaded, or the layout of the data on the disks. Currently, the system provides fast approximate answers for queries with selects, aggregates, group-bys and/or joins (especially, the multi-way foreign key joins that are popular in OLAP). We present several new techniques for improving the accuracy of approximate query answers for this class of queries. We show how join sampling can significantly improve the approximation quality. We describe how biased sampling can be used to overcome the problem of group size disparities
Aqua project white paper
, 1997
Cited by 18 (10 self)
Viswanath Poosala
In large data recording and warehousing environments, it is often advantageous to provide fast, approximate answers to queries, whenever possible. The goal is to provide an estimated response in orders of magnitude less time than the time to compute an exact answer, by avoiding or minimizing the number of accesses to the base data. This white paper describes the Approximate QUery Answering (AQUA) Project underway in the Information Sciences Research Center at Bell Labs. We present a framework for an approximate query engine that observes new data as it arrives and maintains small synopsis data structures on that data. These data structures are used to provide fast, approximate answers to a broad class of queries. We describe metrics for evaluating approximate query answers. We also present new synopsis data structures, and new techniques for approximate query answers. We report on the goals and status of the Aqua project, and plans for future work.
Optimal static range reporting in one dimension
 IN PROC. 33RD ACM SYMPOSIUM ON THEORY OF COMPUTING (STOC'01)
, 2001
Approximate Indexed Lists
 Journal of Algorithms
, 1998
Cited by 10 (0 self)
Let the position of a list element in a list be the number of elements preceding it plus one. An indexed list supports the following operations on a list: insert; delete; return the position of an element; and return the element at a certain position. The order in which the elements appear in the list is completely determined by where the insertions take place; we do not require the presence of any keys that induce the ordering. We consider approximate indexed lists, and show that a tiny relaxation in precision of the query operations allows a considerable improvement in time complexity. The new data structure has applications in two other problems; namely, list labeling and subset rank.
1 Introduction
An indexed list [5] is a list abstract data type that supports the following operations: Insert(x, y): Insert list element y immediately after list element x, which may be a list header; Delete(x): Delete list element x; Pos(x): Return the position of list element x, that is, one plu...
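The semantics of the four indexed-list operations can be pinned down with a naive reference implementation. This is only a sketch of the exact abstract data type, assuming distinct elements; it takes linear time per operation and is not the approximate data structure the paper proposes. The class name `NaiveIndexedList`, the `HEAD` sentinel, and the method names are illustrative assumptions.

```python
class NaiveIndexedList:
    """Exact indexed list backed by a Python list (O(n) per operation)."""

    HEAD = object()  # sentinel standing in for the list header

    def __init__(self):
        self._items = []

    def insert(self, x, y):
        """Insert element y immediately after element x (or after the header)."""
        i = 0 if x is self.HEAD else self._items.index(x) + 1
        self._items.insert(i, y)

    def delete(self, x):
        """Delete element x."""
        self._items.remove(x)

    def pos(self, x):
        """Position of x: the number of elements preceding it, plus one."""
        return self._items.index(x) + 1

    def element_at(self, position):
        """Element stored at the given 1-based position."""
        return self._items[position - 1]


lst = NaiveIndexedList()
lst.insert(NaiveIndexedList.HEAD, "a")  # list: a
lst.insert("a", "b")                    # list: a, b
lst.insert("a", "c")                    # list: a, c, b
print(lst.pos("b"), lst.element_at(2))
```

Note that the order is determined only by where insertions happen, with no keys involved, which matches the description in the abstract.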
Optimal Parallel Approximation Algorithms for Prefix Sums and Integer Sorting (Extended Abstract)
Cited by 8 (5 self)
Parallel prefix computation is perhaps the most frequently used subroutine in parallel algorithms today. Its time complexity on the CRCW PRAM is $\Theta(\lg n / \lg \lg n)$ using a polynomial number of processors, even in a randomized setting. Nevertheless, there are a number of nontrivial applications that have been shown to be solvable using only an approximate version of the prefix sums problem. In this paper we resolve the issue of approximating parallel prefix by introducing an algorithm that runs in $O(\lg^* n)$ time with very high probability, using $n / \lg^* n$ processors, which is optimal in terms of both work and running time. Our approximate prefix sums are guaranteed to come within a factor of $(1 + \epsilon)$ of the values of the true sums in a "consistent fashion", where $\epsilon$ is $o(1)$. We achieve this result through the use of a number of interesting new techniques, such as over-certification and estimate-focusing, as well ...
Workload-based wavelet synopses
, 2003
Cited by 7 (3 self)
This paper introduces workload-based wavelet synopses, which exploit query workload information to significantly boost accuracy in approximate query processing. We show that wavelet synopses can adapt effectively to workload information, and that they have significant advantages over previous approaches. An important aspect of our approach is optimizing synopsis construction toward error metrics defined by workload information, rather than based on some uniform metrics. We present an adaptive greedy algorithm which is simple and efficient. It is run-time competitive with previous, non-workload-based algorithms, and constructs workload-based wavelet synopses that are significantly more accurate than previous synopses. The algorithm also obtains improved accuracy for the non-workload case when the error metric is the mean relative error. We also present a self-tuning algorithm that adapts the workload-based synopses to changes in the workload. All algorithms are extended to workload-based multidimensional wavelet synopses with improved performance over previous algorithms. Experimental results demonstrate the effectiveness of workload-based wavelet synopses for different types of data sets and query workloads, and show significant improvement in accuracy even with very small training sets.
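For context, a plain (non-workload) wavelet synopsis can be sketched as follows. This is a standard Haar decomposition with coefficient thresholding, not the adaptive greedy algorithm of the paper. The function names and the `weight` parameter are illustrative assumptions; `weight` only marks the place where a workload-derived importance score could be substituted for the default support-based normalization.

```python
import math

def haar_decompose(data):
    """Unnormalized Haar transform; len(data) must be a power of two."""
    coeffs, current = [], list(data)
    while len(current) > 1:
        pairs = range(0, len(current), 2)
        avgs = [(current[i] + current[i + 1]) / 2 for i in pairs]
        diffs = [(current[i] - current[i + 1]) / 2 for i in pairs]
        coeffs = diffs + coeffs  # coarser detail coefficients go in front
        current = avgs
    return current + coeffs      # [overall average, details coarse to fine]

def haar_reconstruct(coeffs):
    """Invert haar_decompose."""
    current, pos = coeffs[:1], 1
    while pos < len(coeffs):
        diffs = coeffs[pos:pos + len(current)]
        pos += len(current)
        current = [v for a, d in zip(current, diffs) for v in (a + d, a - d)]
    return current

def support_weight(index, n):
    """Default weight: square root of the number of points a coefficient affects."""
    level = 0 if index == 0 else index.bit_length() - 1
    return math.sqrt(n / 2 ** level)

def build_synopsis(data, budget, weight=support_weight):
    """Keep the `budget` highest-scoring coefficients and zero out the rest."""
    coeffs = haar_decompose(data)
    n = len(coeffs)
    ranked = sorted(range(n), key=lambda i: abs(coeffs[i]) * weight(i, n),
                    reverse=True)
    keep = set(ranked[:budget])
    return [c if i in keep else 0.0 for i, c in enumerate(coeffs)]

data = [2, 2, 0, 2, 3, 5, 4, 4]
synopsis = build_synopsis(data, budget=4)
approx = haar_reconstruct(synopsis)
print(approx)
```

With a budget of four coefficients, the reconstruction is exact on the second half of this toy input and smooths the first half.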
New Sampling-Based Summary Statistics for Improving Approximate Query Answers
, 1998
In large data recording and warehousing environments, it is often advantageous to provide fast, approximate answers to queries, whenever possible. Before DBMSs providing highly accurate approximate answers can become a reality, many new techniques for summarizing data and for estimating answers from summarized data must be developed. This paper introduces two new sampling-based summary statistics, concise samples and counting samples, and presents new techniques for their fast incremental maintenance regardless of the data distribution. We quantify their advantages over standard sample views in terms of the number of additional sample points for the same view size, and hence in providing more accurate query answers. Finally, we consider their application to providing fast approximate answers to hot list queries. Our algorithms maintain their accuracy in the presence of ongoing insertions to the data warehouse.
1 Introduction
In large data recording and warehousing environments, it is ...
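A rough sketch of incremental maintenance for a concise sample is given below. It assumes the usual description of a concise sample as a uniform sample in which repeated values are stored as (value, count) pairs, maintained with an entry threshold that is raised whenever the space bound is exceeded. The class name, the footprint accounting, and the 10% threshold increase are assumptions made for illustration, not details taken from the abstract.

```python
import random

class ConciseSample:
    """Uniform sample with duplicates stored as value -> count."""

    def __init__(self, max_footprint, seed=0):
        self.max_footprint = max_footprint
        self.threshold = 1.0  # each insert enters with probability 1/threshold
        self.counts = {}
        self.rng = random.Random(seed)

    def footprint(self):
        """Space used: 1 unit per singleton value, 2 per (value, count) pair."""
        return sum(1 if c == 1 else 2 for c in self.counts.values())

    def sample_size(self):
        """Number of sample points represented, counting duplicates."""
        return sum(self.counts.values())

    def insert(self, value):
        if self.rng.random() < 1.0 / self.threshold:
            self.counts[value] = self.counts.get(value, 0) + 1
            while self.footprint() > self.max_footprint:
                self._raise_threshold()

    def _raise_threshold(self):
        """Raise the threshold; keep each sample point with probability old/new."""
        new_threshold = self.threshold * 1.1  # assumed growth factor
        keep = self.threshold / new_threshold
        for value in list(self.counts):
            kept = sum(1 for _ in range(self.counts[value])
                       if self.rng.random() < keep)
            if kept:
                self.counts[value] = kept
            else:
                del self.counts[value]
        self.threshold = new_threshold

    def hot_list(self, k):
        """The k values with the largest sampled counts."""
        ranked = sorted(self.counts.items(), key=lambda kv: kv[1], reverse=True)
        return [value for value, _ in ranked[:k]]


cs = ConciseSample(max_footprint=50)
for v in [i % 5 for i in range(5000)] + list(range(1000, 1500)):
    cs.insert(v)
print(cs.footprint(), cs.sample_size(), cs.hot_list(3))
```

On skewed data a few frequent values absorb many sample points at a cost of two units each, which is why a concise sample can represent more sample points than a plain sample of the same size.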
Computing Discriminating and Generic Words
Author manuscript, published in "String Processing and Information Retrieval, Cartagena de Indias: Colombia (2012)". DOI: 10.1007/978-3-642-34109-0_32
, 2013
We study the following three problems of computing generic or discriminating words for a given collection of documents. Given a pattern P and a threshold d, we want to report (i) all longest extensions of P which occur in at least d documents, (ii) all shortest extensions of P which occur in fewer than d documents, and (iii) all shortest extensions of P which occur only in d selected documents. For these problems, we propose efficient algorithms based on suffix trees and using advanced data structure techniques. For problem (i), we propose an optimal solution with constant running time per output word.