Results 1–10 of 17
ReadOnce Functions and Query Evaluation in Probabilistic Databases
Abstract

Cited by 19 (2 self)
Probabilistic databases hold promise of being a viable means for large-scale uncertainty management, increasingly needed in a number of real-world application domains. However, query evaluation in probabilistic databases remains a computational challenge. Prior work on efficient exact query evaluation in probabilistic databases has largely concentrated on query-centric formulations (e.g., safe plans, hierarchical queries), in that they only consider characteristics of the query and not the data in the database. It is easy to construct examples where a supposedly hard query run on an appropriate database gives rise to a tractable query evaluation problem. In this paper, we develop efficient query evaluation techniques that leverage characteristics of both the query and the data in the database. We focus on tuple-independent databases, where the query evaluation problem is equivalent to computing marginal probabilities of Boolean formulas associated with the result tuples. Query evaluation is easy if the Boolean formulas can be factorized into a form in which every variable appears at most once (called read-once); this suggests a naive approach that incorporates previously developed Boolean formula factorization algorithms into the query evaluation. We then develop novel, more efficient factorization algorithms that work for a large subclass of queries (specifically, conjunctive queries without self-joins) by exploiting the unique structure of the result tuple Boolean formulas. We empirically demonstrate that our proposed techniques are (1) orders of magnitude faster than generic inference algorithms when used to evaluate general read-once functions, and (2) for the special case of hierarchical queries, they rival the efficiency of prior techniques specifically designed to handle such queries.
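The tractability claim above is easy to see in code: once a result tuple's lineage is in read-once form, sibling subtrees share no variables and are therefore independent, so the marginal probability follows from two local rules. A minimal sketch (the tree encoding is illustrative, not the paper's representation):

```python
def read_once_prob(node):
    """Exact marginal probability of a read-once Boolean formula.

    node is ('var', p), ('and', children) or ('or', children).
    Read-once means every variable appears at most once, so sibling
    subtrees are independent: AND multiplies marginals, and OR
    combines them as 1 - prod(1 - p).
    """
    op = node[0]
    if op == 'var':
        return node[1]
    probs = [read_once_prob(child) for child in node[1]]
    if op == 'and':
        result = 1.0
        for p in probs:
            result *= p
        return result
    # op == 'or'
    miss = 1.0  # probability that no disjunct is true
    for p in probs:
        miss *= 1.0 - p
    return 1.0 - miss

# Lineage (x AND y) OR z with P(x)=0.5, P(y)=0.4, P(z)=0.1.
lineage = ('or', [('and', [('var', 0.5), ('var', 0.4)]), ('var', 0.1)])
```

Here the answer is 1 − (1 − 0.5·0.4)(1 − 0.1) = 0.28, computed in time linear in the size of the factorized formula; a formula with no read-once factorization cannot be evaluated by these local rules alone.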
The Analytical Bootstrap: a New Method for Fast Error Estimation in Approximate Query Processing
Abstract

Cited by 9 (2 self)
Sampling is one of the most commonly used techniques in Approximate Query Processing (AQP)—an area of research that is now made more critical by the need for timely and cost-effective analytics over “Big Data”. Assessing the quality (i.e., estimating the error) of approximate answers is essential for meaningful AQP, and the two main approaches used in the past to address this problem are based on either (i) analytic error quantification or (ii) the bootstrap method. The first approach is extremely efficient but lacks generality, whereas the second is quite general but suffers from its high computational overhead. In this paper, we introduce a probabilistic relational model for the bootstrap process, along with rigorous semantics and a unified error model, which bridges the gap between these two traditional approaches. Based on our probabilistic framework, we develop efficient algorithms to predict the error distribution of the approximation results. These enable the computation of any bootstrap-based quality measure for a large class of SQL queries via a single-round evaluation of a slightly modified query. Extensive experiments on both synthetic and real-world datasets show that our method has superior prediction accuracy for bootstrap-based quality measures, and is several orders of magnitude faster than the bootstrap.
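For contrast with the single-round analytical prediction described above, the classical simulation-based bootstrap it aims to replace can be sketched as follows (the sample data and the use of a standard-error metric are illustrative):

```python
import random
import statistics

def bootstrap_error(sample, agg=statistics.mean, trials=1000, seed=0):
    """Classical simulation-based bootstrap: resample the sample with
    replacement many times and use the spread of the aggregate across
    resamples as an error estimate for the approximate answer."""
    rng = random.Random(seed)
    n = len(sample)
    estimates = [
        agg([sample[rng.randrange(n)] for _ in range(n)])
        for _ in range(trials)
    ]
    return statistics.stdev(estimates)

# AQP setting: answer AVG(price) from a small sample, plus an error bar.
sample = [12.0, 15.0, 11.0, 14.0, 13.0, 16.0, 12.5, 14.5]
approx_avg = statistics.mean(sample)   # approximate answer: 13.5
err = bootstrap_error(sample)          # its estimated standard error
```

Each of the `trials` resamples re-executes the aggregate, which is exactly the per-query overhead that predicting the error distribution analytically in a single evaluation avoids.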
Aggregation in Probabilistic Databases via Knowledge Compilation
Abstract

Cited by 7 (2 self)
This paper presents a query evaluation technique for positive relational algebra queries with aggregates on a representation system for probabilistic data based on the algebraic structures of semirings and semimodules. The core of our evaluation technique is a procedure that compiles semimodule and semiring expressions into so-called decomposition trees, for which the computation of the probability distribution can be done in polynomial time in the size of the tree and of the distributions represented by its nodes. We give syntactic characterisations of tractable queries with aggregates by exploiting the connection between query tractability and polynomial-time decomposition trees. A prototype of the technique is incorporated in the probabilistic database engine SPROUT. We report on performance experiments with custom datasets and TPC-H data.
Querying uncertain data with aggregate constraints
 in SIGMOD Conference, 2011
Abstract

Cited by 5 (1 self)
Data uncertainty arises in many situations. A common approach to query processing over uncertain data is to sample many “possible worlds” from the uncertain data and to run queries against the possible worlds. However, sampling is not a trivial task, as a randomly sampled possible world may not satisfy known constraints imposed on the data. In this paper, we focus on an important category of constraints, the aggregate constraints. An aggregate constraint is placed on a set of records instead of on a single record, and a real-life system usually has a large number of aggregate constraints. It is a challenging task to find qualified possible worlds in this scenario, since tuple-by-tuple sampling is extremely inefficient because it rarely leads to a qualified possible world. We introduce two approaches for querying uncertain data with aggregate constraints: constraint-aware sampling and MCMC sampling. Our experiments show that the new approaches lead to high-quality query results with reasonable cost.
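The inefficiency that motivates the paper can be illustrated with a small sketch: draw possible worlds tuple by tuple from a tuple-independent table and reject any world that violates an aggregate constraint. The table and constraint below are hypothetical, and the paper's constraint-aware and MCMC samplers are designed precisely to avoid this rejection loop:

```python
import random

def sample_world(table, rng):
    """One possible world of a tuple-independent table: each
    (value, probability) tuple is included independently."""
    return [v for v, p in table if rng.random() < p]

def constrained_worlds(table, constraint, n_worlds, max_tries=100000, seed=0):
    """Naive rejection sampling: keep only worlds satisfying the
    aggregate constraint. With selective constraints most draws are
    rejected, which is why smarter samplers are needed."""
    rng = random.Random(seed)
    worlds, tries = [], 0
    while len(worlds) < n_worlds and tries < max_tries:
        tries += 1
        world = sample_world(table, rng)
        if constraint(world):
            worlds.append(world)
    return worlds, tries

# Hypothetical uncertain table of (sales_amount, existence_probability),
# with the aggregate constraint SUM(sales_amount) >= 180.
table = [(100, 0.9), (250, 0.5), (80, 0.7), (300, 0.3)]
worlds, tries = constrained_worlds(table, lambda w: sum(w) >= 180, n_worlds=5)
```

The rejection rate grows with the selectivity of the constraint; with many aggregate constraints, almost every naively sampled world is discarded.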
Conditioning and Aggregating Uncertain Data Streams: Going Beyond Expectations
Abstract

Cited by 3 (0 self)
Uncertain data streams are increasingly common in real-world deployments, and monitoring applications require the evaluation of complex queries on such streams. In this paper, we consider complex queries involving conditioning (e.g., selections and group-by’s) and aggregation operations on uncertain data streams. To characterize the uncertainty of answers to these queries, one generally has to compute the full probability distribution of each operation used in the query. Computing distributions of aggregates given conditioned tuple distributions is a hard, unsolved problem. Our work employs a new evaluation framework that includes a general data model, approximation metrics, and approximate representations. Within this framework we design fast data-stream algorithms, both deterministic and randomized, for returning approximate distributions with bounded errors as answers to those complex queries. Our experimental results demonstrate the accuracy and efficiency of our approximation techniques and offer insights into the strengths and limitations of deterministic and randomized algorithms.
Queries with Difference on Probabilistic Databases
Abstract

Cited by 3 (1 self)
We study the feasibility of the exact and approximate computation of the probability of relational queries with difference on tuple-independent databases. We show that even the difference between two “safe” conjunctive queries without self-joins is “unsafe” for exact computation. We turn to approximation and design an FPRAS for a large class of relational queries with difference, limited by how difference is nested and by the nature of the subtracted subqueries. We give examples of inapproximable queries outside this class.
A dichotomy for non-repeating queries with negation in probabilistic databases
, 2014
Abstract

Cited by 2 (0 self)
This paper shows that any non-repeating conjunctive relational query with negation has either polynomial-time or #P-hard data complexity on tuple-independent probabilistic databases. This result extends a dichotomy by Dalvi and Suciu for non-repeating conjunctive queries to queries with negation. The tractable queries with negation are precisely the hierarchical ones and can be recognised efficiently.
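The hierarchical property mentioned above is simple to test: for every pair of query variables, the sets of atoms containing them must be nested or disjoint. A simplified sketch for Boolean non-repeating queries, treating all variables as existentially quantified (the paper's exact definition may differ in details such as head variables and negated atoms):

```python
def is_hierarchical(query):
    """Simplified hierarchical test for a Boolean non-repeating query.

    query is a list of atoms (relation_name, variables). The query is
    hierarchical iff for every pair of variables x, y the sets of
    atoms containing x and containing y are nested or disjoint.
    """
    atoms_of = {}
    for i, (_, variables) in enumerate(query):
        for v in variables:
            atoms_of.setdefault(v, set()).add(i)
    names = list(atoms_of)
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            a, b = atoms_of[names[i]], atoms_of[names[j]]
            if not (a <= b or b <= a or not (a & b)):
                return False  # overlap without nesting: non-hierarchical
    return True

# R(x), S(x, y): atoms(y) is nested inside atoms(x) -> hierarchical.
q_safe = [('R', ('x',)), ('S', ('x', 'y'))]
# R(x), S(x, y), T(y): atoms(x) and atoms(y) overlap without nesting,
# the classic #P-hard pattern.
q_hard = [('R', ('x',)), ('S', ('x', 'y')), ('T', ('y',))]
```

The check runs in polynomial time in the query size, matching the claim that the tractable queries can be recognised efficiently.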
Anytime approximation in probabilistic databases
, 2013
Abstract

Cited by 1 (1 self)
This article describes an approximation algorithm for computing the probability of propositional formulas over discrete random variables. It incrementally refines lower and upper bounds on the probability of the formulas until the desired absolute or relative error guarantee is reached. This algorithm is used by the SPROUT query engine to approximate the probabilities of results to relational algebra queries on expressive probabilistic databases.
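The lower/upper-bound refinement idea can be illustrated on DNF lineage over independent variables: the clauses processed so far yield an exact lower bound, and a union bound over the unprocessed clauses yields an upper bound that shrinks as more clauses are incorporated. This brute-force sketch is only illustrative; the article's algorithm refines bounds incrementally on compiled representations rather than by enumeration:

```python
import itertools

def dnf_prob(clauses, probs):
    """Exact probability of a DNF (list of tuples of variable names)
    over independent Boolean variables, by brute force over all
    assignments to the variables that occur in it."""
    names = sorted({v for clause in clauses for v in clause})
    total = 0.0
    for bits in itertools.product([False, True], repeat=len(names)):
        world = dict(zip(names, bits))
        if any(all(world[v] for v in clause) for clause in clauses):
            weight = 1.0
            for v, present in world.items():
                weight *= probs[v] if present else 1.0 - probs[v]
            total += weight
    return total

def anytime_bounds(clauses, probs, k):
    """Bounds after processing only the first k clauses: the processed
    prefix gives an exact lower bound; a union bound over the
    remaining clauses gives the upper bound."""
    lower = dnf_prob(clauses[:k], probs) if k else 0.0
    remainder = 0.0
    for clause in clauses[k:]:
        p = 1.0
        for v in clause:
            p *= probs[v]
        remainder += p
    return lower, min(1.0, lower + remainder)

clauses = [('x', 'y'), ('y', 'z')]
probs = {'x': 0.5, 'y': 0.5, 'z': 0.5}
```

After one clause the bounds are [0.25, 0.5]; with both clauses processed they meet at the exact probability 0.375, so refinement can stop as soon as the gap satisfies the requested absolute or relative error.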
Similarity Query Processing for Probabilistic Sets
Abstract
Evaluating similarity between sets is a fundamental task in computer science. However, there are many applications in which elements in a set may be uncertain due to various reasons. Existing work on modeling such probabilistic sets and computing their similarities suffers from huge model sizes or significant similarity evaluation cost, and hence is only applicable to small probabilistic sets. In this paper, we propose a simple yet expressive model that supports many applications where one probabilistic set may have thousands of elements. We define two types of similarities between two probabilistic sets using the possible-worlds semantics; they complement each other in capturing the similarity distributions in the cross product of possible worlds. We design efficient dynamic-programming-based algorithms to calculate both types of similarities. Novel individual and batch pruning techniques based on upper bounding the similarity values are also proposed. To accommodate extremely large probabilistic sets, we also design sampling-based approximate query processing methods with strong probabilistic guarantees. We have conducted extensive experiments using both synthetic and real datasets, and demonstrated the effectiveness and efficiency of our proposed methods.
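Under the possible-worlds semantics used above, an expected set similarity averages a similarity function over the cross product of worlds of the two sets. A Monte Carlo sketch for expected Jaccard similarity (the element probabilities are illustrative; the paper's dynamic-programming algorithms compute such quantities exactly rather than by sampling):

```python
import random

def sample_set(pset, rng):
    """One possible world of a probabilistic set: each element is
    present independently with its membership probability."""
    return {e for e, p in pset.items() if rng.random() < p}

def expected_jaccard(pa, pb, samples=20000, seed=0):
    """Monte Carlo estimate of expected Jaccard similarity over the
    cross product of possible worlds of two probabilistic sets."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(samples):
        a, b = sample_set(pa, rng), sample_set(pb, rng)
        union = a | b
        # Convention: two empty sets count as identical (J = 1).
        total += len(a & b) / len(union) if union else 1.0
    return total / samples

# Hypothetical probabilistic sets: element -> membership probability.
A = {'x': 0.9, 'y': 0.5}
B = {'x': 0.8, 'z': 0.4}
```

The estimate converges at the usual Monte Carlo rate of O(1/sqrt(samples)); exact algorithms trade this sampling error for computation that is polynomial in the set sizes.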