Efficient top-k query evaluation on probabilistic data
In ICDE, 2007
Cited by 139 (26 self)
Modern enterprise applications are forced to deal with unreliable, inconsistent and imprecise information. Probabilistic databases can model such data naturally, but SQL query evaluation on probabilistic databases is difficult: previous approaches have either restricted the SQL queries, or computed approximate probabilities, or did not scale, and it was shown recently that precise query evaluation is theoretically hard. In this paper we describe a novel approach, which efficiently computes and ranks the top-k answers to a SQL query on a probabilistic database. The restriction to top-k answers is natural, since imprecisions in the data often lead to a large number of answers of low quality, and users are interested only in the answers with the highest probabilities. The idea in our algorithm is to run several Monte Carlo simulations in parallel, one for each candidate answer, and approximate each probability only to the extent needed to compute the top-k answers correctly. The algorithm is in a certain sense provably optimal and scales to large databases: we have measured running times of 5 to 50 seconds for complex SQL queries over a large database (10M tuples, of which 6M are probabilistic). Additional contributions of the paper include several optimization techniques, and a simple data model for probabilistic data that achieves completeness by using SQL views.
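The idea of simulating each candidate answer only until the top-k set is settled can be illustrated with a minimal sketch. This is not the paper's algorithm: it assumes each answer's lineage is a DNF over independent tuple variables, uses naive sampling, gives every candidate the same number of trials, and stops on a Hoeffding-style interval test that ignores the correction for repeated checking. The names `sample_lineage`, `top_k`, `candidates` and `probs` are hypothetical.

```python
import math
import random


def sample_lineage(dnf, probs, rng):
    """Sample one possible world lazily and test whether the DNF lineage holds."""
    world = {}
    for clause in dnf:
        satisfied = True
        for var in clause:
            if var not in world:  # each tuple variable is drawn once per world
                world[var] = rng.random() < probs[var]
            if not world[var]:
                satisfied = False
                break
        if satisfied:
            return True
    return False


def top_k(candidates, probs, k, delta=0.05, batch=200, max_rounds=500, seed=0):
    """Simulate all candidates in rounds until the k-th and (k+1)-th separate.

    Assumes k >= 1. Simplification: every candidate gets the same number of
    trials, whereas the abstract describes spending effort only where needed.
    """
    rng = random.Random(seed)
    hits = dict.fromkeys(candidates, 0)
    ranked = list(candidates)
    trials = 0
    for _ in range(max_rounds):
        for name, dnf in candidates.items():
            hits[name] += sum(sample_lineage(dnf, probs, rng) for _ in range(batch))
        trials += batch
        # Half-width of a Hoeffding interval with a union bound over candidates.
        eps = math.sqrt(math.log(2 * len(candidates) / delta) / (2 * trials))
        ranked = sorted(candidates, key=lambda c: hits[c], reverse=True)
        top, rest = ranked[:k], ranked[k:]
        # Stop once the weakest top-k interval lies above the best of the rest.
        if not rest or (hits[top[-1]] - hits[rest[0]]) / trials > 2 * eps:
            break
    return ranked[:k]


# Hypothetical data: true probabilities are a = 0.95, b = 0.25, c = 0.6.
probs = {"x1": 0.9, "x2": 0.5, "x3": 0.5, "x4": 0.5, "x5": 0.6}
candidates = {"a": [["x1"], ["x2"]], "b": [["x3", "x4"]], "c": [["x5"]]}
print(top_k(candidates, probs, k=2))
```

Because the stopping test only compares the k-th and (k+1)-th estimates, well-separated answers are resolved after few trials, which is the intuition behind approximating each probability only as far as the ranking requires.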
A Unified Approach to Ranking in Probabilistic Databases
Cited by 38 (2 self)
The dramatic growth in the number of application domains that naturally generate probabilistic, uncertain data has resulted in a need for efficiently supporting complex querying and decision-making over such data. In this paper, we present a unified approach to ranking and top-k query processing in probabilistic databases by viewing it as a multi-criteria optimization problem, and by deriving a set of features that capture the key properties of a probabilistic dataset that dictate the ranked result. We contend that a single, specific ranking function may not suffice for probabilistic databases, and we instead propose two parameterized ranking functions, called PRF^ω and PRF^e, that generalize or can approximate many of the previously proposed ranking functions. We present novel generating-functions-based algorithms for efficiently ranking large datasets according to these ranking functions, even if the datasets exhibit complex correlations modeled using probabilistic and/xor trees or Markov networks. We further propose that the parameters of the ranking function be learned from user preferences, and we develop an approach to learn those parameters. Finally, we present a comprehensive experimental study that illustrates the effectiveness of our parameterized ranking functions, especially PRF^e, at approximating other ranking functions and the scalability of our proposed algorithms for exact or approximate ranking.
On the expressiveness of probabilistic XML models
2009
Cited by 36 (23 self)
Various known models of probabilistic XML can be represented as instantiations of the abstract notion of p-documents. In addition to ordinary nodes, p-documents have distributional nodes that specify the possible worlds and their probabilistic distribution. Particular families of p-documents are determined by the types of distributional nodes that can be used as well as by the structural constraints on the placement of those nodes in a p-document. Some of the resulting families provide natural extensions and combinations of previously studied probabilistic XML models. The focus of the paper is on the expressive power of families of p-documents. In particular, two main issues are studied.
World-set decompositions: Expressiveness and efficient algorithms
In Proc. ICDT, 2007
Cited by 35 (12 self)
Uncertain information is commonplace in real-world data management scenarios. The ability to represent large sets of possible instances (worlds) while supporting efficient storage and processing is an important challenge in this context. The recent formalism of world-set decompositions (WSDs) provides a space-efficient representation for uncertain data that also supports scalable processing. WSDs are complete for finite world-sets in that they can represent any finite set of possible worlds. For possibly infinite world-sets, we show that a natural generalization of WSDs precisely captures the expressive power of c-tables. We then show that several important problems are efficiently solvable on WSDs while they are NP-hard on c-tables. Finally, we give a polynomial-time algorithm for factorizing WSDs, i.e., an efficient algorithm for minimizing such representations.
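A rough sketch of why a decomposition is space-efficient: if a world-set factors into independent components, the worlds are the Cartesian product of the components' alternatives, so storage grows with the sum of the component sizes while the number of worlds grows with their product. The representation below (a list of components, each a list of partial field assignments) and the field names are assumptions made for illustration; it omits details of the actual formalism such as tuple absence and probabilities.

```python
from itertools import product


def worlds(components):
    """Enumerate every world encoded by a product of independent components."""
    for choice in product(*components):
        world = {}
        for part in choice:  # merge one alternative from each component
            world.update(part)
        yield world


# Hypothetical decomposition: 2 + 2 stored alternatives encode 2 * 2 worlds.
components = [
    [{"r1.name": "Smith", "r1.ssn": 185}, {"r1.name": "Smith", "r1.ssn": 785}],
    [{"r2.status": "single"}, {"r2.status": "married"}],
]
all_worlds = list(worlds(components))
print(len(all_worlds))
```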
From Complete to Incomplete Information and Back
In Proc. SIGMOD
Cited by 31 (11 self)
Incomplete information arises naturally in numerous data management applications. Recently, several researchers have studied query processing in the context of incomplete information. Most work has combined the syntax of a traditional query language like relational algebra with a nonstandard semantics such as certain or ranked possible answers. There are now also languages with special features to deal with uncertainty. However, by the standards of the data management community, to date no language proposal has been made that can be considered a natural analog to SQL or relational algebra for the case of incomplete information. In this paper we propose such a language, World-set Algebra, which satisfies the robustness criteria and analogies to relational algebra that we expect. The language supports the contemplation of alternatives and can thus map from a complete database to an incomplete one comprising several possible worlds. We show that World-set Algebra is conservative over relational algebra in the sense that any query that maps from a complete database to a complete database (a complete-to-complete query) is equivalent to a relational algebra query. Moreover, we give an efficient algorithm for effecting this translation. We then study algebraic query optimization of such queries. We argue that query languages with explicit constructs for handling uncertainty allow for the more natural and simple expression of many real-world decision support queries. The results of this paper not only suggest a language for specifying queries in this way, but also allow for their efficient evaluation in any relational database management system.
and J. Huang. “Using OBDDs for Efficient Query Evaluation on Probabilistic Databases”
In Proc. SUM, 2008
Cited by 30 (13 self)
We consider the problem of query evaluation for tuple-independent probabilistic databases and Boolean conjunctive queries with inequalities but without self-joins. We approach this problem as a construction problem for ordered binary decision diagrams (OBDDs): given a query q and a probabilistic database D, we construct in polynomial time an OBDD such that the probability of q(D) can be computed linearly in the size of that OBDD. This approach is applicable to a large class of queries, including the hierarchical queries, i.e., the Boolean conjunctive queries without self-joins that admit PTIME evaluation on any tuple-independent probabilistic database, hierarchical queries extended with inequalities, and non-hierarchical queries on restricted databases.
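The claim that the probability can be computed linearly in the size of the OBDD follows from evaluating each node once with the Shannon expansion P(node) = p(var) · P(high) + (1 − p(var)) · P(low). The sketch below shows only this evaluation step, on a hand-built diagram; it does not cover the paper's construction of the OBDD from a query. The node encoding and the names `obdd_probability`, `nodes` and `probs` are assumptions.

```python
def obdd_probability(nodes, root, probs):
    """Probability that an OBDD over independent variables evaluates to true.

    `nodes` maps a node id (a string) to (variable, low child, high child);
    the terminals are the Python booleans True and False.
    """
    cache = {True: 1.0, False: 0.0}

    def visit(node):
        if node not in cache:  # memoization: each shared node is evaluated once
            var, low, high = nodes[node]
            cache[node] = probs[var] * visit(high) + (1.0 - probs[var]) * visit(low)
        return cache[node]

    return visit(root)


# Hand-built OBDD for (x AND y) OR z with variable order x, y, z.
nodes = {
    "nz": ("z", False, True),
    "ny": ("y", "nz", True),
    "nx": ("x", "nz", "ny"),
}
probs = {"x": 0.5, "y": 0.5, "z": 0.5}
print(obdd_probability(nodes, "nx", probs))  # 1 - (1 - 0.25) * (1 - 0.5) = 0.625
```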
Probabilistic data exchange
In Proc. ICDT, 2010
Cited by 28 (5 self)
The work reported here lays the foundations of data exchange in the presence of probabilistic data. This requires rethinking the very basic concepts of traditional data exchange, such as solution, universal solution, and the certain answers of target queries. We develop a framework for data exchange over probabilistic databases, and make a case for its coherence and robustness. This framework applies to arbitrary schema mappings, and finite or countably infinite probability spaces on the source and target instances. After establishing this framework and formulating the key concepts, we study the application of the framework to a concrete and practical setting where probabilistic databases are compactly encoded by means of annotations formulated over random Boolean variables. In this setting, we study the problems of testing for the existence of solutions and universal solutions, materializing such solutions, and evaluating target queries (for unions of conjunctive queries) in both the exact sense and the approximate sense. For each of the problems, we carry out a complexity analysis based on properties of the annotation, in various classes of dependencies. Finally, we show that the framework and results easily and completely generalize to allow not only the data, but also the schema mapping itself to be probabilistic.
Materialized views in probabilistic databases for information exchange and query optimization
In Proceedings of VLDB, 2007
Cited by 27 (9 self)
Views over probabilistic data contain correlations between tuples, and the current approach is to capture these correlations using explicit lineage. In this paper we propose an alternative approach to materializing probabilistic views, by giving conditions under which a view can be represented by a block-independent disjoint (BID) table. Not all views can be represented as BID tables, and so we propose a novel partial representation that can represent all views but may not define a unique probability distribution. We then give conditions on when a query’s value on a partial representation will be uniquely defined. We apply our theory to two applications: query processing using views and information exchange using views. In query processing on probabilistic data, we can ignore the lineage and use materialized views to answer queries more efficiently. By contrast, if the view has explicit lineage, the query evaluation must reprocess the lineage to compute the query, resulting in dramatically slower execution. The second application is information exchange when we do not wish to disclose the entire lineage, which otherwise may result in shipping the entire database. The paper contains several theoretical results that completely solve the problem of deciding whether a conjunctive view can be represented as a BID table and whether a query on a partial representation is uniquely determined. We validate our approach experimentally, showing that representable views exist in real and synthetic workloads, and show over three orders of magnitude of improvement in query processing versus a lineage-based approach.
Efficient evaluation of HAVING queries on probabilistic databases
In Proceedings of DBPL, 2007
Cited by 24 (6 self)
We study the evaluation of positive conjunctive queries with Boolean aggregate tests (similar to HAVING queries in SQL) on probabilistic databases. Our motivation is to handle aggregate queries over imprecise data resulting from information integration or information extraction. More precisely, we study conjunctive queries with predicate aggregates using MIN, MAX, COUNT, SUM, AVG or COUNT(DISTINCT) on probabilistic databases. Computing the precise output probabilities for positive conjunctive queries (without HAVING) is #P-hard, but is in P for a restricted class of queries called safe queries. Further, for queries without self-joins, either a query is safe or its data complexity is #P-hard, which shows that safe queries exactly capture tractable queries without self-joins. In this paper, for each aggregate above, we find a class of queries that exactly capture efficient evaluation for HAVING queries without self-joins. Our algorithms use a novel technique to compute the marginal distributions of elements in a semiring, which may be of independent interest.
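To make the notion of a safe query concrete, the sketch below evaluates one standard safe query without HAVING, q = ∃x ∃y R(x) ∧ S(x, y), on a tuple-independent database in time linear in the data, and checks the result against brute-force enumeration of all possible worlds. It does not illustrate the paper's aggregate or semiring techniques. The relations, probabilities and function names are hypothetical.

```python
from itertools import product


def safe_probability(R, S):
    """P(exists x, y: R(x) and S(x, y)), using independence between x-groups."""
    miss_all = 1.0
    for x, p_r in R.items():
        no_s = 1.0  # probability that no S(x, y) tuple is present for this x
        for (x2, _y), p_s in S.items():
            if x2 == x:
                no_s *= 1.0 - p_s
        miss_all *= 1.0 - p_r * (1.0 - no_s)
    return 1.0 - miss_all


def brute_force_probability(R, S):
    """Same probability by summing the weights of all 2^n possible worlds."""
    tuples = [("R", key) for key in R] + [("S", key) for key in S]
    probs = [R[key] for key in R] + [S[key] for key in S]
    total = 0.0
    for bits in product([False, True], repeat=len(tuples)):
        weight = 1.0
        present = set()
        for tup, p, bit in zip(tuples, probs, bits):
            weight *= p if bit else 1.0 - p
            if bit:
                present.add(tup)
        if any(("R", x) in present and ("S", (x, y)) in present for (x, y) in S):
            total += weight
    return total


# Hypothetical tuple-independent relations with marginal probabilities.
R = {"a": 0.5, "b": 0.4}
S = {("a", 1): 0.5, ("a", 2): 0.5, ("b", 1): 0.25}
print(safe_probability(R, S), brute_force_probability(R, S))
```

The closed form works here because tuples for different values of x are independent of each other; for queries lacking this hierarchical structure no such factorization is available in general, which is where the hardness results in the abstract apply.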
Approximate Lineage for Probabilistic Databases
Cited by 22 (7 self)
In probabilistic databases, lineage is fundamental to both query processing and understanding the data. Current systems such as Trio or Mystiq use a complete approach in which the lineage for a tuple t is a Boolean formula which represents all derivations of t. In large databases lineage formulas can become huge: in one public database (the Gene Ontology) we often observed 10MB of lineage (provenance) data for a single tuple. In this paper we propose to use approximate lineage, which is a much smaller formula keeping track of only the most important derivations, which the system can use to process queries and provide explanations. We discuss in detail two specific kinds of approximate lineage: (1) a conservative approximation called sufficient lineage that records the most important derivations for each tuple, and (2) polynomial lineage, which is more aggressive and can provide higher compression ratios, and which is based on Fourier approximations of Boolean expressions. In this paper we define approximate lineage formally, describe algorithms to compute approximate lineage and formally prove their error bounds, and validate our approach experimentally on a real data set.