Results 1 - 10 of 29
Gordian: efficient and scalable discovery of composite keys
- In: VLDB
, 2006
"... ..."
(Show Context)
Consistent selectivity estimation via maximum entropy
, 2007
"... Cost-based query optimizers need to estimate the selectivity of conjunctive predicates when comparing alternative query execution plans. To this end, advanced optimizers use multivariate statistics to improve information about the joint distribution of attribute values in a table. The joint distrib ..."
Abstract
-
Cited by 22 (3 self)
- Add to MetaCart
Cost-based query optimizers need to estimate the selectivity of conjunctive predicates when comparing alternative query execution plans. To this end, advanced optimizers use multivariate statistics to improve information about the joint distribution of attribute values in a table. The joint distribution for all columns is almost always too large to store completely, and the resulting use of partial distribution information raises the possibility that multiple, non-equivalent selectivity estimates may be available for a given predicate. Current optimizers use cumbersome ad hoc methods to ensure that selectivities are estimated in a consistent manner. These methods ignore valuable information and tend to bias the optimizer toward query plans for which the least information is available, often yielding poor results. In this paper we present a novel method for consistent selectivity estimation based on the principle of maximum entropy.
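The maximum-entropy idea can be pictured with a small numerical sketch. The snippet below is only an illustration of entropy maximization via iterative proportional fitting, not the paper's algorithm; the three predicates, their selectivities, and the iteration count are assumptions made for the example.

```python
# A minimal sketch (not the paper's algorithm) of maximum-entropy selectivity
# estimation via iterative proportional fitting. The predicates p1..p3, their
# selectivities, and the iteration count are illustrative assumptions.
import itertools

# Atoms: truth assignments to (p1, p2, p3).
atoms = list(itertools.product([0, 1], repeat=3))
prob = {a: 1.0 / len(atoms) for a in atoms}          # start from uniform (max entropy)

# Known selectivities: each constraint is (indicator over atoms, target probability).
constraints = [
    (lambda a: a[0] == 1,               0.10),       # s(p1)
    (lambda a: a[1] == 1,               0.20),       # s(p2)
    (lambda a: a[0] == 1 and a[1] == 1, 0.05),       # s(p1 AND p2), from a 2-D statistic
    (lambda a: a[2] == 1,               0.40),       # s(p3)
]

for _ in range(200):                                  # cyclic scaling until convergence
    for ind, target in constraints:
        mass = sum(p for a, p in prob.items() if ind(a))
        for a in prob:
            if ind(a):
                prob[a] *= target / mass
            else:
                prob[a] *= (1.0 - target) / (1.0 - mass)

# Consistent estimate for the full conjunction implied by the MaxEnt distribution.
s_all = prob[(1, 1, 1)]
print(f"MaxEnt estimate for s(p1 AND p2 AND p3): {s_all:.4f}")
```

With these constraints the fitted distribution treats p3 as independent of p1 and p2, so the printed estimate converges to 0.05 * 0.4 = 0.02.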
A Sampling-Based Approach to Information Recovery
"... Abstract — There has been a recent resurgence of interest in research on noisy and incomplete data. Many applications require information to be recovered from such data. Ideally, an approach for information recovery should have the following features. First, it should be able to incorporate prior kn ..."
Abstract
-
Cited by 16 (5 self)
- Add to MetaCart
(Show Context)
There has been a recent resurgence of interest in research on noisy and incomplete data. Many applications require information to be recovered from such data. Ideally, an approach for information recovery should have the following features. First, it should be able to incorporate prior knowledge about the data, even if such knowledge is in the form of complex distributions and constraints for which no closed-form solutions exist. Second, it should be able to capture complex correlations and quantify the degree of uncertainty in the recovered data, and further support queries over such data. The database community has developed a number of approaches for information recovery, but none is general enough to offer all of the above features. To overcome these limitations, we take a significantly more general approach to information recovery based on sampling. We apply sequential importance sampling, a technique from statistics that works for complex distributions and dramatically outperforms naive sampling when the data is constrained. We illustrate the generality and efficiency of this approach in two application scenarios: cleansing RFID data, and recovering information from published data that has been summarized and randomized for privacy.
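To make sequential importance sampling concrete, here is a minimal sketch under assumed conditions: three unobserved counts, each with a uniform prior on 0..10, whose published sum is 12. The code is not taken from the paper; it only shows how sampling the values one at a time from their feasible ranges, while carrying importance weights, recovers the constrained posterior that naive rejection sampling would reach far more slowly.

```python
# A minimal sketch of sequential importance sampling for constrained data recovery,
# in the spirit of the approach above but not taken from the paper. The setup
# (three hidden counts in 0..10 whose published sum is 12, uniform prior) is an
# assumption made purely for illustration.
import random

CAP, TOTAL, N = 10, 12, 50_000
samples, weights = [], []

for _ in range(N):
    remaining, xs, weight = TOTAL, [], 1.0
    for i in range(3):
        slots_left = 2 - i                            # counts still to be drawn after this one
        lo = max(0, remaining - CAP * slots_left)     # feasible range given the sum constraint
        hi = min(CAP, remaining)
        if i < 2:
            x = random.randint(lo, hi)                # proposal: uniform over the feasible range
            weight *= (hi - lo + 1)                   # importance weight = 1 / proposal probability
        else:
            x = remaining                             # last value is forced by the constraint
        xs.append(x)
        remaining -= x
    samples.append(xs)
    weights.append(weight)

# Weighted posterior estimate of the first hidden count (and its uncertainty).
wsum = sum(weights)
mean_x1 = sum(w * s[0] for w, s in zip(weights, samples)) / wsum
var_x1 = sum(w * (s[0] - mean_x1) ** 2 for w, s in zip(weights, samples)) / wsum
print(f"E[x1 | sum = {TOTAL}] ~ {mean_x1:.2f}, sd ~ {var_x1 ** 0.5:.2f}")
```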
Collecting and Maintaining Just-in-Time Statistics
- In: Proceedings of ICDE 2007
, 2007
"... Traditional DBMSs decouple statistics collection and query optimization both in space and time. Decoupling in time may lead to outdated statistics. Decoupling in space may cause statistics not to be available at the desired granularity needed to optimize a particular query, or some important statist ..."
Abstract
-
Cited by 9 (4 self)
- Add to MetaCart
Traditional DBMSs decouple statistics collection and query optimization both in space and time. Decoupling in time may lead to outdated statistics. Decoupling in space may cause statistics not to be available at the desired granularity needed to optimize a particular query, or some important statistics may not be available at all. Overall, this decoupling often leads to large cardinality estimation errors and, in consequence, to the selection of suboptimal plans for query execution. In this paper, we present JITS, a system for proactively collecting query-specific statistics during query compilation. The system employs a lightweight sensitivity analysis to choose which statistics to collect by making use of previously collected statistics and database activity patterns. The collected statistics are materialized and incrementally updated for later reuse. We present the basic concepts, architecture, and key features of JITS. We demonstrate its benefits through an extensive experimental study on a prototype inside the IBM DB2 engine.
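As a loose illustration of why choosing which statistics to collect matters, the sketch below ranks candidate columns by how much their selectivity uncertainty could swing a cardinality estimate and collects statistics for the top few within a budget. This is only a hypothetical prioritization heuristic, not the sensitivity analysis used by JITS; the columns, selectivity guesses, and budget are invented.

```python
# A rough sketch of prioritizing statistics collection at query compile time.
# NOT the JITS sensitivity analysis from the paper; all numbers are assumptions.

# For each predicate column without a fresh statistic, record an optimistic and a
# pessimistic selectivity guess (e.g. from defaults or stale statistics).
candidates = {
    "orders.status":   (0.01, 0.50),   # (low guess, high guess)
    "orders.date":     (0.10, 0.30),
    "customer.region": (0.20, 0.25),
}
TABLE_ROWS = 1_000_000
BUDGET = 2                              # how many statistics we can afford to collect now

def swing(bounds):
    """Potential cardinality swing if the optimizer works with the wrong guess."""
    low, high = bounds
    return TABLE_ROWS * (high - low)

# Collect statistics for the columns whose uncertainty could move the estimate the most.
ranked = sorted(candidates, key=lambda c: swing(candidates[c]), reverse=True)
print("Collect query-specific statistics for:", ranked[:BUDGET])
```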
Understanding Cardinality Estimation using Entropy Maximization
"... Cardinality estimation is the problem of estimating the number of tuples returned by a query; it is a fundamentally important task in data management, used in query optimization, progress estimation, and resource provisioning. We study cardinality estimation in a principled framework: given a set of ..."
Abstract
-
Cited by 8 (0 self)
- Add to MetaCart
(Show Context)
Cardinality estimation is the problem of estimating the number of tuples returned by a query; it is a fundamentally important task in data management, used in query optimization, progress estimation, and resource provisioning. We study cardinality estimation in a principled framework: given a set of statistical assertions about the number of tuples returned by a fixed set of queries, predict the number of tuples returned by a new query. We model this problem using the probability space, over possible worlds, that satisfies all provided statistical assertions and maximizes entropy. We call this the Entropy Maximization model for statistics (MaxEnt). In this paper we develop the mathematical techniques needed to use the MaxEnt model for predicting the cardinality of conjunctive queries.
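One simple special case helps to see what the MaxEnt model predicts. If the only assertions are the cardinalities of R(x, y) and S(y, z) over a domain of size n, maximizing entropy makes tuples independent and uniform, so the expected size of the join on y is |R|*|S|/n. The sketch below checks this with made-up numbers and a small simulation; it illustrates the modeling assumption and is not code or a formula quoted from the paper.

```python
# A worked special case of a MaxEnt-style prediction, under assumed numbers
# (domain size and relation sizes chosen for illustration). With only per-relation
# cardinality assertions, tuples are independent and uniform, so
# E[|R(x,y) JOIN S(y,z)|] = |R| * |S| / n.
import random
from collections import Counter

n, R_SIZE, S_SIZE, TRIALS = 100, 500, 800, 200
print(f"Prediction for |R JOIN S|: {R_SIZE * S_SIZE / n:.1f}")

# Monte Carlo check: draw uniform random instances and average the true join size.
total = 0
for _ in range(TRIALS):
    R = [(random.randrange(n), random.randrange(n)) for _ in range(R_SIZE)]
    S = [(random.randrange(n), random.randrange(n)) for _ in range(S_SIZE)]
    s_by_y = Counter(y for (y, _) in S)               # S tuples grouped by the join attribute
    total += sum(s_by_y[y] for (_, y) in R)
print(f"Average join size over {TRIALS} random instances: {total / TRIALS:.1f}")
```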
Worst-case Optimal Join Algorithms
- PODS'12
, 2012
"... Efficient join processing is one of the most fundamental and wellstudied tasks in database research. In this work, we examine algorithms for natural join queries over many relations and describe a novel algorithm to process these queries optimally in terms of worst-case data complexity. Our result b ..."
Abstract
-
Cited by 7 (2 self)
- Add to MetaCart
Efficient join processing is one of the most fundamental and well-studied tasks in database research. In this work, we examine algorithms for natural join queries over many relations and describe a novel algorithm to process these queries optimally in terms of worst-case data complexity. Our result builds on recent work by Atserias, Grohe, and Marx, who gave bounds on the size of a full conjunctive query in terms of the sizes of the individual relations in the body of the query. These bounds, however, are not constructive: they rely on Shearer’s entropy inequality, which is information-theoretic. Thus, the previous results leave open the question of whether there exist algorithms whose running times achieve these optimal bounds. An answer to this question may be interesting to database practice, as we show in this paper that any project-join plan is polynomially slower than the optimal bound for some queries. We construct an algorithm whose running time is worst-case optimal for all natural join queries. Our result may be of independent interest, as our algorithm also yields a constructive proof of the general fractional cover bound by Atserias, Grohe, and Marx without using Shearer’s inequality. In addition, we show that this bound is equivalent to a geometric inequality by Bollobás and Thomason, one of whose special cases is the famous Loomis-Whitney inequality. Hence, our results algorithmically prove these inequalities as well. Finally, we discuss how our algorithm can be used to compute a relaxed notion of joins.
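The flavor of an attribute-at-a-time join can be seen on the triangle query Q(a, b, c) :- R(a, b), S(b, c), T(a, c). The sketch below is a simplified illustration in that spirit, not the paper's algorithm: instead of joining two relations at a time, it binds one attribute at a time by intersecting the candidate sets from every relation that mentions it. The tiny relations are made up.

```python
# A minimal sketch of an attribute-at-a-time ("generic") join for the triangle
# query Q(a,b,c) :- R(a,b), S(b,c), T(a,c), in the spirit of worst-case optimal
# join algorithms. The relations are invented for illustration.
from collections import defaultdict

R = {(1, 2), (1, 3), (2, 3)}
S = {(2, 3), (3, 1), (3, 4)}
T = {(1, 3), (2, 1), (1, 4)}

# Hash indexes: for each relation, values of the second attribute grouped by the first.
R_by_a, S_by_b, T_by_a = defaultdict(set), defaultdict(set), defaultdict(set)
for a, b in R:
    R_by_a[a].add(b)
for b, c in S:
    S_by_b[b].add(c)
for a, c in T:
    T_by_a[a].add(c)

def triangles():
    # Bind one attribute at a time, intersecting the candidates contributed by
    # every relation that mentions that attribute.
    for a in R_by_a.keys() & T_by_a.keys():          # candidates for a
        for b in R_by_a[a] & S_by_b.keys():          # candidates for b given a
            for c in S_by_b[b] & T_by_a[a]:          # candidates for c given a, b
                yield (a, b, c)

print(sorted(triangles()))
```

On skewed inputs, any pairwise project-join plan for this query can materialize intermediate results asymptotically larger than the final output, which is exactly the gap the attribute-at-a-time strategy avoids.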
A Black-Box Approach to Query Cardinality Estimation
- In: Proc. CIDR
, 2007
"... We present a “black-box” approach to estimating query cardinality that has no knowledge of query execution plans and data distribution, yet provides accurate estimates. It does so by grouping queries into syntactic families and learning the cardinality distribution of that group directly from points ..."
Abstract
-
Cited by 6 (3 self)
- Add to MetaCart
We present a “black-box” approach to estimating query cardinality that has no knowledge of query execution plans and data distribution, yet provides accurate estimates. It does so by grouping queries into syntactic families and learning the cardinality distribution of each group directly from points in a high-dimensional input space constructed from the query’s attributes, operators, function arguments, aggregates, and constants. We envision an increasing need for such an approach in applications in which query cardinality is required for resource optimization and decision-making at locations that are remote from the data sources. Our primary case study is the Open SkyQuery federation of Astronomy archives, which uses a scheduling and caching mechanism at the mediator for execution of federated queries at remote sources. Experiments using real workloads show that the black-box approach produces accurate estimates and is frugal in its use of space and computational resources. It also provides dramatic improvements in the performance of caching in Open SkyQuery.
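A toy version of the black-box idea, under assumed details: within a single syntactic family, past queries are encoded by their constants, and a simple k-nearest-neighbor estimator predicts the cardinality of a new query from the feedback of its neighbors. The family, feature encoding, training numbers, and choice of k-NN are all illustrative assumptions, not the paper's actual learning machinery.

```python
# A toy sketch of learning cardinality within one syntactic query family from past
# (features, cardinality) feedback. The family, numbers, and k-NN estimator are
# assumptions for illustration only.
import numpy as np

# Family: SELECT COUNT(*) FROM stars WHERE mag < ?1 AND dec BETWEEN ?2 AND ?3
# Each training point: (?1, ?2, ?3) -> observed result cardinality.
X_train = np.array([
    [18.0, -10.0, 10.0],
    [19.5, -10.0, 10.0],
    [18.0, -30.0, 30.0],
    [21.0, -30.0, 30.0],
    [20.0,   0.0, 40.0],
], dtype=float)
y_train = np.array([12_000, 45_000, 33_000, 310_000, 150_000], dtype=float)

def estimate(query_consts, k=3):
    """k-NN estimate of cardinality for a new query in the same family."""
    q = np.asarray(query_consts, dtype=float)
    # Scale features so magnitude and declination ranges contribute comparably.
    scale = X_train.std(axis=0) + 1e-9
    dists = np.linalg.norm((X_train - q) / scale, axis=1)
    nearest = np.argsort(dists)[:k]
    # Average in log space, since cardinalities span orders of magnitude.
    return float(np.exp(np.mean(np.log(y_train[nearest]))))

print(f"Estimated cardinality: {estimate([19.0, -20.0, 20.0]):.0f}")
```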
Cost-based Optimization of Complex Scientific Queries
- In: Proc. SSDBM 2007
"... High energy physics scientists analyze large amounts of data looking for interesting events when particles collide. These analyses are easily expressed using complex queries that filter events. We developed a cost model for aggregation operators and other functions used in such queries and show that ..."
Abstract
-
Cited by 3 (1 self)
- Add to MetaCart
(Show Context)
High-energy physics scientists analyze large amounts of data looking for interesting events when particles collide. These analyses are easily expressed as complex queries that filter events. We develop a cost model for aggregation operators and other functions used in such queries and show that it substantially improves performance. However, the query optimizer still produces suboptimal plans because of estimation errors. Furthermore, optimization is very slow because of the large query size. We improve optimization with a profiled grouping strategy in which the scientific query is first automatically fragmented into subqueries based on application knowledge. Each fragment is then independently profiled on a sample of events to measure real execution cost and cardinality. An optimized fragmented query is shown to execute faster than a query optimized with the cost model alone. Furthermore, the total optimization time, including fragmentation and profiling, is substantially improved.
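The profile-then-optimize step can be pictured with a rough sketch: run each fragment over a small sample of events, measure its real selectivity and per-event cost, and scale up. The event model, fragments, and sample size below are invented for illustration; the actual system fragments and profiles declarative queries inside the engine rather than Python filters.

```python
# A rough sketch of profiling query fragments on a sample of events and scaling the
# measured selectivity to the full event count. All fragments and numbers are
# invented for illustration.
import random
import time

TOTAL_EVENTS = 5_000_000
SAMPLE_SIZE = 10_000
sample = [{"pt": random.expovariate(1 / 20), "ntracks": random.randint(0, 60)}
          for _ in range(SAMPLE_SIZE)]

fragments = {
    "high_pt":     lambda e: e["pt"] > 40,
    "many_tracks": lambda e: e["ntracks"] > 30,
}

for name, frag in fragments.items():
    start = time.perf_counter()
    passed = sum(1 for e in sample if frag(e))
    elapsed = time.perf_counter() - start
    selectivity = passed / SAMPLE_SIZE
    print(f"{name}: selectivity {selectivity:.3f}, "
          f"estimated output {selectivity * TOTAL_EVENTS:,.0f} events, "
          f"profiled cost {elapsed / SAMPLE_SIZE * 1e6:.2f} us/event")
```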
Detecting Attribute Dependencies from Query Feedback
"... Real-world datasets exhibit a complex dependency structure among the data attributes. Learning this structure is a key task in automatic statistics configuration for query optimizers, as well as in data mining, metadata discovery, and system management. In this paper, we provide a new method for dis ..."
Abstract
-
Cited by 3 (0 self)
- Add to MetaCart
(Show Context)
Real-world datasets exhibit a complex dependency structure among the data attributes. Learning this structure is a key task in automatic statistics configuration for query optimizers, as well as in data mining, metadata discovery, and system management. In this paper, we provide a new method for discovering dependent attribute pairs based on query feedback. Our approach avoids the problem of searching through a combinatorially large space of candidate attribute pairs, automatically focusing system resources on those pairs of demonstrable interest to users. Unlike previous methods, our technique combines all of the pertinent feedback for a specified pair of attributes in a principled and robust manner, while being simple and fast enough to be incorporated into current commercial products. The method is similar in spirit to the CORDS algorithm, which proactively collects frequencies of data values and computes a chi-squared statistic from the resulting contingency table. In the reactive query-feedback setting, many entries of the contingency table are missing, and a key contribution of this paper is a variant of classical chi-squared theory that handles this situation. Because we typically discover a large number of dependent attribute pairs, we provide novel methods for ranking the pairs by degree of dependency. Such ranking information enables a database system, for example, to stay within the space budget for the system catalog by storing only the currently most important multivariate statistics. Experiments indicate that our dependency rankings are stable even in the presence of relatively few feedback records.
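For reference, the chi-squared computation that CORDS-style dependency detection builds on looks like the sketch below, which works over a fully observed contingency table; the paper's contribution, a chi-squared variant for tables with many missing entries built from query feedback, is not reproduced here. The table values are made up.

```python
# A small sketch of the chi-squared independence statistic over a fully observed
# contingency table for two attributes (e.g. Make x Color). Counts are invented.
import numpy as np

observed = np.array([
    [120,  30,  10],
    [ 20, 150,  25],
    [ 15,  20,  90],
], dtype=float)

row = observed.sum(axis=1, keepdims=True)
col = observed.sum(axis=0, keepdims=True)
total = observed.sum()
expected = row @ col / total                         # counts expected under independence

chi2 = ((observed - expected) ** 2 / expected).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)
# A chi2 value large relative to dof suggests the attribute pair is dependent and
# worth covering with a multivariate statistic.
print(f"chi-squared = {chi2:.1f} with {dof} degrees of freedom")
```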
Consistent Histograms in the Presence of Distinct Value Counts
"... Self-tuning histograms have been proposed in the past as an attempt to leverage feedback from query execution. However, the focus thus far has been on histograms that only store cardinalities. In this paper, we study consistent histogram construction from query feedback that also takes distinct valu ..."
Abstract
-
Cited by 3 (1 self)
- Add to MetaCart
Self-tuning histograms have been proposed in the past as an attempt to leverage feedback from query execution. However, the focus thus far has been on histograms that only store cardinalities. In this paper, we study consistent histogram construction from query feedback that also takes distinct value counts into account. We first show how the entropy maximization (EM) principle can be leveraged to identify a distribution that approximates the data given the execution feedback while making the least additional assumptions. This EM model takes both distinct value counts and cardinalities into account; however, we find that it is computationally prohibitively expensive. We thus consider an alternative formulation of consistency: for a given query workload, the goal is to minimize the L2 distance between the true and estimated cardinalities. This approach also handles both cardinalities and distinct value counts. We propose an efficient one-pass algorithm, with several theoretical properties, modeling this formulation. Our experiments show that this approach produces improvements in accuracy similar to the EM-based approach while being computationally significantly more efficient.
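The L2 formulation for plain cardinalities can be pictured with a small least-squares sketch: choose bucket totals so that the histogram's predicted cardinalities for the feedback queries are as close as possible, in L2 distance, to the observed ones. This is not the paper's one-pass algorithm and it ignores the distinct-value counts the paper also handles; the buckets and feedback records are assumptions for illustration.

```python
# A minimal sketch of fitting histogram bucket totals to query feedback by
# minimizing the L2 distance between predicted and observed cardinalities.
# Not the paper's one-pass algorithm; buckets and feedback are invented.
import numpy as np

bucket_edges = [0, 25, 50, 75, 100]                  # 4 equi-width buckets over [0, 100)

def coverage(lo, hi):
    """Fraction of each bucket covered by the range [lo, hi), assuming uniformity."""
    row = []
    for left, right in zip(bucket_edges[:-1], bucket_edges[1:]):
        overlap = max(0, min(hi, right) - max(lo, left))
        row.append(overlap / (right - left))
    return row

# Query feedback: (range predicate, observed cardinality).
feedback = [((0, 50), 900), ((25, 75), 1400), ((50, 100), 1700), ((0, 100), 2600)]

A = np.array([coverage(lo, hi) for (lo, hi), _ in feedback], dtype=float)
c = np.array([card for _, card in feedback], dtype=float)

buckets, *_ = np.linalg.lstsq(A, c, rcond=None)      # minimize ||A b - c||_2
buckets = np.clip(buckets, 0, None)                  # keep bucket frequencies non-negative
print("Fitted bucket frequencies:", np.round(buckets, 1))
print("Predicted vs observed:", np.round(A @ buckets, 1), c)
```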
Self-tuning histograms have been proposed in the past as an attempt to leverage feedback from query execution. However, the focus thus far has been on histograms that only store cardinalities. In this paper, we study consistent histogram construction from query feedback that also takes distinct value counts into account. We first show how the entropy maximization (EM) principle can be leveraged to identify a distribution that approximates the data given the execution feedback making the least additional assumptions. This EM model that takes both distinct value counts and cardinalities into account. However, we find that it is computationally prohibitively expensive. We thus consider an alternative formulation for consistency – for a given query workload, the goal is to minimize the L2 distance between the true and estimated cardinalities. This approach also handles both cardinalities and distinct values counts. We propose an efficient one-pass algorithm with several theoretical properties modeling this formulation. Our experiments show that this approach produces similar improvements in accuracy as the EM based approach while being computationally significantly more efficient. 1.