Results 11 - 20
of
241
Clean answers over dirty databases: A probabilistic approach
- In Proc. ICDE
, 2006
"... The detection of duplicate tuples, corresponding to the same real-world entity, is an important task in data integration and cleaning. While many techniques exist to identify such tuples, the merging or elimination of duplicates can be a difficult task that relies on ad-hoc and often manual solution ..."
Abstract
-
Cited by 49 (2 self)
- Add to MetaCart
The detection of duplicate tuples, corresponding to the same real-world entity, is an important task in data integration and cleaning. While many techniques exist to identify such tuples, the merging or elimination of duplicates can be a difficult task that relies on ad-hoc and often manual solutions. We propose a complementary approach that permits declarative query answering over duplicated data, where each duplicate is associated with a probability of being in the clean database. We rewrite queries over a database containing duplicates to return each answer with the probability that the answer is in the clean database. Our rewritten queries are sensitive to the semantics of duplication and help a user understand which query answers are most likely to be present in the clean database. The semantics that we adopt is independent of the way the probabilities are produced, but is able to effectively exploit them during query answering. In the absence of external knowledge that associates each database tuple with a probability, we offer a technique, based on tuple summaries, that automates this task. We experimentally study the performance of our rewritten queries. Our studies show that the rewriting does not introduce a significant overhead in query execution time. This work is done in the context of the ConQuer project at the University of Toronto, which focuses on the efficient management of inconsistent and dirty databases. 1
Ranking Queries on Uncertain Data: A Probabilistic Threshold Approach
- Computer Science Department, Florida State University
, 2008
"... Uncertain data is inherent in a few important applications such as environmental surveillance and mobile object tracking. Top-k queries (also known as ranking queries) are often natural and useful in analyzing uncertain data in those applications. In this paper, we study the problem of answering pro ..."
Abstract
-
Cited by 47 (10 self)
- Add to MetaCart
Uncertain data is inherent in a few important applications such as environmental surveillance and mobile object tracking. Top-k queries (also known as ranking queries) are often natural and useful in analyzing uncertain data in those applications. In this paper, we study the problem of answering probabilistic threshold top-k queries on uncertain data, which computes uncertain records taking a probability of at least p to be in the top-k list where p is a user specified probability threshold. We present an efficient exact algorithm, a fast sampling algorithm, and a Poisson approximation based algorithm. An empirical study using real and synthetic data sets verifies the effectiveness of probabilistic threshold top-k queries and the efficiency of our methods.
10^(10^6) Worlds and Beyond: Efficient Representation and Processing of Incomplete Information
, 2006
"... Current systems and formalisms for representing incomplete information generally suffer from at least one of two weaknesses. Either they are not strong enough for representing results of simple queries, or the handling and processing of the data, e.g. for query evaluation, is intractable. In this pa ..."
Abstract
-
Cited by 46 (6 self)
- Add to MetaCart
Current systems and formalisms for representing incomplete information generally suffer from at least one of two weaknesses. Either they are not strong enough for representing results of simple queries, or the handling and processing of the data, e.g. for query evaluation, is intractable. In this paper, we present a decomposition-based approach to addressing this problem. We introduce world-set decompositions (WSDs), a space-efficient formalism for representing any finite set of possible worlds over relational databases. WSDs are therefore a strong representation system for any relational query language. We study the problem of efficiently evaluating relational algebra queries on sets of worlds represented by WSDs. We also evaluate our technique experimentally in a large census data scenario and show that it is both scalable and efficient.
Data integration with uncertainties
- In Proc. of VLDB
, 2007
"... This paper reports our first set of results on managing uncertainty in data integration. We posit that data-integration systems need to handle uncertainty at three levels, and do so in a principled fashion. First, the semantic mappings between the data sources and the mediated schema may be approxim ..."
Abstract
-
Cited by 41 (2 self)
- Add to MetaCart
This paper reports our first set of results on managing uncertainty in data integration. We posit that data-integration systems need to handle uncertainty at three levels, and do so in a principled fashion. First, the semantic mappings between the data sources and the mediated schema may be approximate because there may be too many of them to be created and maintained or because in some domains (e.g., bioinformatics) it is not clear what the mappings should be. Second, queries to the system may be posed with keywords rather than in a structured form. Third, the data from the sources may be extracted using information extraction techniques and so may yield imprecise data. As a first step to building such a system, we introduce the concept of probabilistic schema mappings and analyze their formal foundations. We show that there are two possible semantics for such mappings: by-table semantics assumes that there exists a correct mapping but we don’t know what it is; by-tuple semantics assumes that the correct mapping may depend on the particular tuple in the source data. We present the query complexity and algorithms for answering queries in the presence of approximate schema mappings, and we describe an algorithm for efficiently computing the top-k answers to queries in such a setting. 1.
Community information management
, 2006
"... We introduce Cimple, a joint project between the University of Illinois and the University of Wisconsin. Cimple aims to develop a software platform that can be rapidly deployed and customized to manage data-rich online communities. We first describe the envisioned working of such a software platform ..."
Abstract
-
Cited by 39 (9 self)
- Add to MetaCart
We introduce Cimple, a joint project between the University of Illinois and the University of Wisconsin. Cimple aims to develop a software platform that can be rapidly deployed and customized to manage data-rich online communities. We first describe the envisioned working of such a software platform and our prototype, DBLife, which is a community portal being developed for the database research community. We then describe the technical challenges in Cimple and our solution approach. Finally, we discuss managing uncertainty and provenance, a crucial task in making our software platform practical. 1
Probabilistic skylines on uncertain data
- In Proceedings of the 33rd International Conference on Very Large Data Bases (VLDB’07), Viena
, 2007
"... Uncertain data are inherent in some important applications. Although a considerable amount of research has been dedicated to modeling uncertain data and answering some types of queries on uncertain data, how to conduct advanced analysis on uncertain data remains an open problem at large. In this pap ..."
Abstract
-
Cited by 39 (10 self)
- Add to MetaCart
Uncertain data are inherent in some important applications. Although a considerable amount of research has been dedicated to modeling uncertain data and answering some types of queries on uncertain data, how to conduct advanced analysis on uncertain data remains an open problem at large. In this paper, we tackle the problem of skyline analysis on uncertain data. We propose a novel probabilistic skyline model where an uncertain object may take a probability to be in the skyline, and a p-skyline contains all the objects whose skyline probabilities are at least p. Computing probabilistic skylines on large uncertain data sets is challenging. We develop two efficient algorithms. The bottom-up algorithm computes the skyline probabilities of some selected instances of uncertain objects, and uses those instances to prune other instances and uncertain objects effectively. The top-down algorithm recursively partitions the instances of uncertain objects into subsets, and prunes subsets and objects aggressively. Our experimental results on both the real NBA player data set and the benchmark synthetic data sets show that probabilistic skylines are interesting and useful, and our two algorithms are efficient on large data sets, and complementary to each other in performance. 1.
The dichotomy of conjunctive queries on probabilistic structures
- In PODS
, 2007
"... We show that for every conjunctive query, the complexity of evaluating it on a probabilistic database is either PTIME or #P-complete, and we give an algorithm for deciding whether a given conjunctive query is PTIME or #P-complete. The dichotomy property is a fundamental result on query evaluation on ..."
Abstract
-
Cited by 35 (12 self)
- Add to MetaCart
We show that for every conjunctive query, the complexity of evaluating it on a probabilistic database is either PTIME or #P-complete, and we give an algorithm for deciding whether a given conjunctive query is PTIME or #P-complete. The dichotomy property is a fundamental result on query evaluation on probabilistic databases and it gives a complete classification of the complexity of conjunctive queries. 1. PROBLEM STATEMENT Fix a relational vocabulary R1,..., Rk, denoted R. A tuple-independent probabilistic structure is a pair (A, p) where A = (A, R A 1,..., R A k) is first order structure and p is a function that associates to each tuple t in A a rational number p(t) ∈ [0, 1]. A probabilistic structure (A,p) induces a probability distribution on the set of substructures B of A by: p(B) = kY ( Y p(t) × i=1 t∈RB i Y t∈R A i −RB i (1 − p(t))) (1) where B ⊆ A, more precisely B = (A, R B 1,..., B B k) is s.t. R B i ⊆ R A i for i = 1, k. A conjunctive query, q, is a sentence of the form ∃¯x.(ϕ1 ∧... ∧ϕm), where each ϕi is a positive atomic predicate R(t), called a sub-goal, and the tuple t consists of variables and/or constants. As usual, we drop the existential quantifiers and the ∧, writing q = ϕ1, ϕ2,..., ϕm. A conjunctive property is a property on structures defined by a conjunctive query q, and its probability on a probabilistic structure (A, p) is defined as: p(q) = X p(B) (2)
Online filtering, smoothing and probabilistic modeling of streaming data
- in ICDE
, 2008
"... In this paper, we address the problem of extending a relational database system to facilitate efficient real-time application of dynamic probabilistic models to streaming data. We use the recently proposed abstraction of model-based views for this purpose, by allowing users to declaratively specify ..."
Abstract
-
Cited by 35 (3 self)
- Add to MetaCart
In this paper, we address the problem of extending a relational database system to facilitate efficient real-time application of dynamic probabilistic models to streaming data. We use the recently proposed abstraction of model-based views for this purpose, by allowing users to declaratively specify the model to be applied, and by presenting the output of the models to the user as a probabilistic database view. We support declarative querying over such views using an extended version of SQL that allows for querying probabilistic data. Underneath we use particle filters, a class of sequential Monte Carlo algorithms commonly used to implement dynamic probabilistic models, to represent the present and historical states of the model as sets of weighted samples (particles) that are kept up-to-date as new readings arrive. We develop novel techniques to convert the queries on the model-based view directly into queries over particle tables, enabling highly efficient query processing. Finally, we present experimental evaluation of our prototype implementation over sensor data from the Intel Lab dataset that demonstrates the feasibility of online modeling of streaming data using our system and establishes the advantages of such tight integration between dynamic probabilistic models and database systems. 1
AVATAR information extraction system
- IEEE Data Engineering Bulletin
, 2006
"... Abstract The AVATAR Information Extraction System (IES) at the IBM Almaden Research Center enables high-precision, rule-based, information extraction from text-documents. Drawing from our experience we propose the use of probabilistic database techniques as the formal underpinnings of information ex ..."
Abstract
-
Cited by 32 (2 self)
- Add to MetaCart
Abstract The AVATAR Information Extraction System (IES) at the IBM Almaden Research Center enables high-precision, rule-based, information extraction from text-documents. Drawing from our experience we propose the use of probabilistic database techniques as the formal underpinnings of information extrac-tion systems so as to maintain high precision while increasing recall. This involves building a framework where rule-based annotators can be mapped to queries in a database system. We use examplesfrom AVATAR IES to describe the challenges in achieving this goal. Finally, we show that deriving precision estimates in such a database system presents a significant challenge for probabilistic databasesystems.
World-set decompositions: Expressiveness and efficient algorithms
- In Proc. ICDT
, 2007
"... Abstract. Uncertain information is commonplace in real-world data management scenarios. The ability to represent large sets of possible instances (worlds) while supporting efficient storage and processing is an important challenge in this context. The recent formalism of worldset decompositions (WSD ..."
Abstract
-
Cited by 32 (12 self)
- Add to MetaCart
Abstract. Uncertain information is commonplace in real-world data management scenarios. The ability to represent large sets of possible instances (worlds) while supporting efficient storage and processing is an important challenge in this context. The recent formalism of worldset decompositions (WSDs) provides a space-efficient representation for uncertain data that also supports scalable processing. WSDs are complete for finite world-sets in that they can represent any finite set of possible worlds. For possibly infinite world-sets, we show that a natural generalization of WSDs precisely captures the expressive power of c-tables. We then show that several important problems are efficiently solvable on WSDs while they are NP-hard on c-tables. Finally, we give a polynomial-time algorithm for factorizing WSDs, i.e. an efficient algorithm for minimizing such representations. 1

