Results 1 - 10
of
43
Top-k query processing in uncertain databases
- In ICDE
, 2007
"... Top-k processing in uncertain databases is semantically and computationally different from traditional top-k processing. The interplay between score and uncertainty makes traditional techniques inapplicable. We introduce new probabilistic formulations for top-k queries. Our formulations are based on ..."
Abstract
-
Cited by 70 (8 self)
- Add to MetaCart
Top-k processing in uncertain databases is semantically and computationally different from traditional top-k processing. The interplay between score and uncertainty makes traditional techniques inapplicable. We introduce new probabilistic formulations for top-k queries. Our formulations are based on “marriage ” of traditional top-k semantics and possible worlds semantics. In the light of these formulations, we construct a framework that encapsulates a state space model and efficient query processing techniques to tackle the challenges of uncertain data settings. We prove that our techniques are optimal in terms of the number of accessed tuples and materialized search states. Our experiments show the efficiency of our techniques under different data distributions with orders of magnitude improvement over naïve materialization of possible worlds. 1
Fast and Simple Relational Processing of Uncertain Data
"... Abstract — This paper introduces U-relations, a succinct and purely relational representation system for uncertain databases. U-relations support attribute-level uncertainty using vertical partitioning. If we consider positive relational algebra extended by an operation for computing possible answer ..."
Abstract
-
Cited by 50 (4 self)
- Add to MetaCart
Abstract — This paper introduces U-relations, a succinct and purely relational representation system for uncertain databases. U-relations support attribute-level uncertainty using vertical partitioning. If we consider positive relational algebra extended by an operation for computing possible answers, a query on the logical level can be translated into, and evaluated as, a single relational algebra query on the U-relational representation. The translation scheme essentially preserves the size of the query in terms of number of operations and, in particular, number of joins. Standard techniques employed in off-the-shelf relational database management systems are effective for optimizing and processing queries on U-relations. In our experiments we show that query evaluation on U-relations scales to large amounts of data with high degrees of uncertainty.
10^(10^6) Worlds and Beyond: Efficient Representation and Processing of Incomplete Information
, 2006
"... Current systems and formalisms for representing incomplete information generally suffer from at least one of two weaknesses. Either they are not strong enough for representing results of simple queries, or the handling and processing of the data, e.g. for query evaluation, is intractable. In this pa ..."
Abstract
-
Cited by 46 (6 self)
- Add to MetaCart
Current systems and formalisms for representing incomplete information generally suffer from at least one of two weaknesses. Either they are not strong enough for representing results of simple queries, or the handling and processing of the data, e.g. for query evaluation, is intractable. In this paper, we present a decomposition-based approach to addressing this problem. We introduce world-set decompositions (WSDs), a space-efficient formalism for representing any finite set of possible worlds over relational databases. WSDs are therefore a strong representation system for any relational query language. We study the problem of efficiently evaluating relational algebra queries on sets of worlds represented by WSDs. We also evaluate our technique experimentally in a large census data scenario and show that it is both scalable and efficient.
World-set decompositions: Expressiveness and efficient algorithms
- In Proc. ICDT
, 2007
"... Abstract. Uncertain information is commonplace in real-world data management scenarios. The ability to represent large sets of possible instances (worlds) while supporting efficient storage and processing is an important challenge in this context. The recent formalism of worldset decompositions (WSD ..."
Abstract
-
Cited by 32 (12 self)
- Add to MetaCart
Abstract. Uncertain information is commonplace in real-world data management scenarios. The ability to represent large sets of possible instances (worlds) while supporting efficient storage and processing is an important challenge in this context. The recent formalism of worldset decompositions (WSDs) provides a space-efficient representation for uncertain data that also supports scalable processing. WSDs are complete for finite world-sets in that they can represent any finite set of possible worlds. For possibly infinite world-sets, we show that a natural generalization of WSDs precisely captures the expressive power of c-tables. We then show that several important problems are efficiently solvable on WSDs while they are NP-hard on c-tables. Finally, we give a polynomial-time algorithm for factorizing WSDs, i.e. an efficient algorithm for minimizing such representations. 1
Conditioning Probabilistic Databases
"... Past research on probabilistic databases has studied the problem of answering queries on a static database. Application scenarios of probabilistic databases however often involve the conditioning of a database using additional information in the form of new evidence. The conditioning problem is thus ..."
Abstract
-
Cited by 32 (13 self)
- Add to MetaCart
Past research on probabilistic databases has studied the problem of answering queries on a static database. Application scenarios of probabilistic databases however often involve the conditioning of a database using additional information in the form of new evidence. The conditioning problem is thus to transform a probabilistic database of priors into a posterior probabilistic database which is materialized for subsequent query processing or further refinement. It turns out that the conditioning problem is closely related to the problem of computing exact tuple confidence values. It is known that exact confidence computation is an NPhard problem. This has lead researchers to consider approximation techniques for confidence computation. However, neither conditioning nor exact confidence computation can be solved using such techniques. In this paper we present efficient techniques for both problems. We study several problem decomposition methods and heuristics that are based on the most successful search techniques from constraint satisfaction, such as the variable elimination rule of the Davis-Putnam algorithm. We complement this with a thorough experimental evaluation of the algorithms proposed. Our experiments show that our exact algorithms scale well to realistic database sizes and can in some scenarios compete with the most efficient previous approximation algorithms.
From Complete to Incomplete Information and Back
- In Proc. SIGMOD
"... Incomplete information arises naturally in numerous data management applications. Recently, several researchers have studied query processing in the context of incomplete information. Most work has combined the syntax of a traditional query language like relational algebra with a nonstandard semanti ..."
Abstract
-
Cited by 27 (11 self)
- Add to MetaCart
Incomplete information arises naturally in numerous data management applications. Recently, several researchers have studied query processing in the context of incomplete information. Most work has combined the syntax of a traditional query language like relational algebra with a nonstandard semantics such as certain or ranked possible answers. There are now also languages with special features to deal with uncertainty. However, to the standards of the data management community, to date no language proposal has been made that can be considered a natural analog to SQL or relational algebra for the case of incomplete information. In this paper we propose such a language, World-set Algebra, which satisfies the robustness criteria and analogies to relational algebra that we expect. The language supports the contemplation on alternatives and can thus map from a complete database to an incomplete one comprising several possible worlds. We show that World-set Algebra is conservative over relational algebra in the sense that any query that maps from a complete database to a complete database (a complete-to-complete query) is equivalent to a relational algebra query. Moreover, we give an efficient algorithm for effecting this translation. We then study algebraic query optimization of such queries. We argue that query languages with explicit constructs for handling uncertainty allow for the more natural and simple expression of many real-world decision support queries. The results of this paper not only suggest a language for specifying queries in this way, but also allow for their efficient evaluation in any relational database management system.
Probabilistic Top-k and Ranking-Aggregate Queries
, 2008
"... Ranking and aggregation queries are widely used in data exploration, data analysis, and decision-making scenarios. While most of the currently proposed ranking and aggregation techniques focus on deterministic data, several emerging applications involve data that is unclean or uncertain. Ranking and ..."
Abstract
-
Cited by 12 (0 self)
- Add to MetaCart
Ranking and aggregation queries are widely used in data exploration, data analysis, and decision-making scenarios. While most of the currently proposed ranking and aggregation techniques focus on deterministic data, several emerging applications involve data that is unclean or uncertain. Ranking and aggregating uncertain (probabilistic) data raises new challenges in query semantics and processing, making conventional methods inapplicable. Furthermore, uncertainty imposes probability as a new ranking dimension that does not exist in the traditional settings. In this article we introduce new probabilistic formulations for top-k and ranking-aggregate queries in probabilistic databases. Our formulations are based on marriage of traditional top-k semantics with possible worlds semantics. In the light of these formulations, we construct a generic processing framework supporting both query types, and leveraging existing query processing and indexing capabilities in current RDBMSs. The framework encapsulates a state space model and efficient search algorithms to compute query answers. Our proposed techniques minimize the number of accessed tuples and the size of materialized search space to compute query answers. Our experimental study shows the efficiency of our techniques under different data distributions with
Query Processing over Incomplete Autonomous Databases
"... Incompleteness due to missing attribute values (aka “null values”) is very common in autonomous web databases, on which user accesses are usually supported through mediators. Traditional query processing techniques that focus on the strict soundness of answer tuples often ignore tuples with critical ..."
Abstract
-
Cited by 10 (1 self)
- Add to MetaCart
Incompleteness due to missing attribute values (aka “null values”) is very common in autonomous web databases, on which user accesses are usually supported through mediators. Traditional query processing techniques that focus on the strict soundness of answer tuples often ignore tuples with critical missing attributes, even if they wind up being relevant to a user query. Ideally we would like the mediator to retrieve such possible answers and gauge their relevance by accessing their likelihood of being pertinent answers to the query. The autonomous nature of web databases poses several challenges in realizing this objective. Such challenges include the restricted access privileges imposed on the data, the limited support for query patterns, and the bounded pool of database and network resources in the web environment. We introduce a novel query rewriting and optimization framework QPIAD that tackles these challenges. Our technique involves reformulating the user query based on mined correlations among the database attributes. The reformulated queries are aimed at retrieving the relevant possible answers in addition to the certain answers. QPIAD is able to gauge the relevance of such queries allowing tradeoffs in reducing the costs of database query processing and answer transmission. To support this framework, we develop methods for mining attribute correlations (in terms of Approximate Functional Dependencies), value distributions (in the form of Naïve Bayes Classifiers), and selectivity estimates. We present empirical studies to demonstrate that our approach is able to effectively retrieve relevant possible answers with high precision, high recall, and manageable cost. 1.
Source-aware entity matching: A compositional approach
- Dogmatix tracks down duplicates in XML. In SIGMOD-05. [35
, 2006
"... Entity matching (a.k.a. record linkage) plays a crucial role in integrating multiple data sources, and numerous matching solutions have been developed. However, the solutions have largely exploited only information available in the mentions and employed a single matching technique. We show how to ex ..."
Abstract
-
Cited by 10 (3 self)
- Add to MetaCart
Entity matching (a.k.a. record linkage) plays a crucial role in integrating multiple data sources, and numerous matching solutions have been developed. However, the solutions have largely exploited only information available in the mentions and employed a single matching technique. We show how to exploit information about data sources to significantly improve matching accuracy. In particular, we observe that different sources often vary substantially in their level of semantic ambiguity, thus requiring different matching techniques. In addition, it is often beneficial to group and match mentions in related sources first, before considering other sources. These observations lead to a large space of matching strategies, analogous to the space of query evaluation plans considered by a relational optimizer. We propose viewing entity matching as a composition of basic steps into a “match execution plan”. We analyze formal properties of the plan space, and show how to find a good match plan. To do so, we employ ideas from social network analysis to infer the ambiguity and relatedness of data sources. We conducted extensive experiments on several real-world data sets on the Web and in the domain of personal information management (PIM). The results show that our solution significantly outperforms current best matching methods. 1.
Query processing over incomplete autonomous databases
- In Proc. ICDE
, 2006
"... Incompleteness due to missing attribute values (aka “null values”) is very common in autonomous web databases, on which user accesses are usually supported through mediators. Traditional query processing techniques that focus on the strict soundness of answer tuples often ignore tuples with critical ..."
Abstract
-
Cited by 6 (5 self)
- Add to MetaCart
Incompleteness due to missing attribute values (aka “null values”) is very common in autonomous web databases, on which user accesses are usually supported through mediators. Traditional query processing techniques that focus on the strict soundness of answer tuples often ignore tuples with critical missing attributes, even if they wind up being relevant to a user query. Ideally we would like the mediator to retrieve such relevant uncertain answers and gauge their relevance by accessing their likelihood of being relevant answers to the query. The autonomous nature of the databases poses several challenges in realizing this idea. Such challenges include the restricted access privileges, limited query patterns and sensitivity of database and network resource consumption in the web environment. We introduce a novel query rewriting and optimization framework that tackles these challenges. Our technique involves reformulating the user query based on approximate functional dependencies (AFDs) among the database attributes. The reformulated queries are aimed at retrieving the relevant uncertain answers in addition to the certain answers. Our query processing framework QPIAD is able to gauge the relevance of such reformulated queries to manage the cost of database query processing and answer transmission. To support this framework, we develop methods for mining attribute correlations (in terms of AFDs) and value distributions(using Naïve Bayes Classifiers). We present empirical studies to demonstrate that our approach is effective in retrieving relevant uncertain answers with high precision, high recall and manageable cost. 1

