## Representing and querying correlated tuples in probabilistic databases (2007)

### Other Repositories/Bibliography

Venue: ICDE

Citations: 123 (11 self)

### BibTeX

@INPROCEEDINGS{Sen07representingand,
  author = {Prithviraj Sen and Amol Deshpande},
  title = {Representing and querying correlated tuples in probabilistic databases},
  booktitle = {ICDE},
  year = {2007}
}

### Abstract

Probabilistic databases have received considerable attention recently due to the need for storing uncertain data produced by many real-world applications. The widespread use of probabilistic databases is hampered by two limitations: (1) current probabilistic databases make simplistic assumptions about the data (e.g., complete independence among tuples) that make it difficult to use them in applications that naturally produce correlated data, and (2) most probabilistic databases can only answer a restricted subset of the queries that can be expressed using traditional query languages. We address both these limitations by proposing a framework that can represent not only probabilistic tuples, but also correlations that may be present among them. Our proposed framework naturally lends itself to the possible world semantics, thus preserving the precise query semantics extant in current probabilistic databases. We develop an efficient strategy for query evaluation over such probabilistic databases by casting the query processing problem as an inference problem in an appropriately constructed probabilistic graphical model. We present several optimizations specific to probabilistic databases that enable efficient query evaluation. We validate our approach by presenting an experimental evaluation that illustrates the effectiveness of our techniques at answering various queries using real and synthetic datasets.
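As a rough illustration of the possible world semantics mentioned above, the following sketch enumerates all worlds over two probabilistic tuples, once under tuple independence and once with a mutual-exclusion correlation encoded as a single joint factor. The tuple names `s1` and `s2`, the probabilities, and the helper functions are hypothetical and chosen only for illustration; brute-force enumeration is used here for clarity and is not the inference-based evaluation strategy the paper develops.

```python
from itertools import product


def possible_worlds(tuples, factors):
    """Return a list of (world, probability) pairs.

    tuples  -- list of tuple identifiers (hypothetical names)
    factors -- list of (scope, fn) pairs; fn maps the 0/1 existence
               values of the tuples in scope to a non-negative weight
    """
    worlds = []
    for bits in product([0, 1], repeat=len(tuples)):
        world = dict(zip(tuples, bits))
        weight = 1.0
        for scope, fn in factors:
            weight *= fn(*(world[t] for t in scope))
        worlds.append((world, weight))
    total = sum(w for _, w in worlds)
    # Normalize so that the world probabilities sum to one.
    return [(world, w / total) for world, w in worlds]


def query_probability(worlds, predicate):
    """Sum the probabilities of the worlds in which the predicate holds."""
    return sum(p for world, p in worlds if predicate(world))


tuples = ["s1", "s2"]

# Independence: one single-variable factor per tuple.
independent = [
    (("s1",), lambda x: 0.6 if x else 0.4),
    (("s2",), lambda x: 0.4 if x else 0.6),
]

# Mutual exclusion: one joint factor; exactly one of the tuples exists.
mutex = [
    (("s1", "s2"), lambda x, y: {(1, 0): 0.6, (0, 1): 0.4}.get((x, y), 0.0)),
]


def at_least_one(world):
    return world["s1"] == 1 or world["s2"] == 1


def s1_exists(world):
    return world["s1"] == 1


worlds_ind = possible_worlds(tuples, independent)
worlds_mutex = possible_worlds(tuples, mutex)

p_ind = query_probability(worlds_ind, at_least_one)
p_mutex = query_probability(worlds_mutex, at_least_one)
```

Both models give `s1` the same marginal existence probability, yet the query "at least one tuple exists" receives different answers, which is the kind of discrepancy that motivates representing correlations explicitly.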

### Citations

7556 |
Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference
- Pearl
- 1988
Citation Context ...three dependent random variables each with a binary domain: (i) factored representation (ii) resulting joint probability distribution (iii) graphical model representation. correlated random variables [31, 10]. The key idea underlying these approaches is the use of factored representations for modeling the correlations. Let X denote a random variable with a domain dom(X) and let Pr(X) denote a probability... |

872 | An introduction to variational methods for graphical models
- Jordan, Ghahramani, et al.
- 1998
Citation Context ... variety of approximate inference algorithms that perform well in practice in a variety of cases each varying in speed and accuracy e.g., Markov Chain Monte Carlo techniques [20], Variational Methods [26] etc. Depending on the user’s requirements we can easily switch between inference algorithms in our approach. 5 Experimental Study In this section, we present an experimental evaluation demonstrating ... |

673 |
Probabilistic Networks and Expert Systems: Exact Computational Methods for Bayesian Networks
- Cowell, Dawid, et al.
- 2007
Citation Context ...three dependent random variables each with a binary domain: (i) factored representation (ii) resulting joint probability distribution (iii) graphical model representation. correlated random variables [31, 10]. The key idea underlying these approaches is the use of factored representations for modeling the correlations. Let X denote a random variable with a domain dom(X) and let Pr(X) denote a probability... |

610 |
The computational complexity of probabilistic inference using Bayesian belief networks
- Cooper
- 1990
Citation Context ...tion efficiently. After that we discuss ways to store probabilistic databases with correlated tuples. 4.1 Inference in Graphical Models Exact probabilistic inference is known to be NP-hard in general [9]. However, many applications provide graphical models with a graph structure that allow efficient probabilistic computation [37]. Variable elimination (VE), also known as bucket elimination, [37, 15] ... |

539 | Learning probabilistic relational models
- Getoor, Friedman, et al.
- 2001
Citation Context ... data arises naturally, this area has seen renewed interest in recent years (see [20] for a survey of the ongoing research). From the uncertainty in artificial intelligence community, Friedman et. al [18] (PRM) address the problem of learning a probabilistic model from a given database. While PRMs can represent uncertainty in databases, Getoor et. al. [21] explore the application of PRMs to answering ... |

376 | Efficient query evaluation on probabilistic databases
- Dalvi, Suciu
- 2006
Citation Context ...tore and retrieve; they have to help the user sift through the uncertainty and find the results most likely to be the answer. Numerous approaches have been proposed to handle uncertainty in databases [2, 7, 16, 6, 17, 19, 12, 29]. Among these, tuple-level uncertainty models [6, 17, 19, 12, 29], that associate existence probabilities with tuples, are considered more attractive for various reasons: (a) they typically result in ... |

370 | Model-Driven Data Acquisition in Sensor Networks
- Deshpande, Guestrin, et al.
- 2004
Citation Context ...tore and retrieve; they have to help the user sift through the uncertainty and find the results most likely to be the answer. Numerous approaches have been proposed to handle uncertainty in databases [2, 7, 16, 6, 17, 19, 12, 29]. Among these, tuple-level uncertainty models [6, 17, 19, 12, 29], that associate existence probabilities with tuples, are considered more attractive for various reasons: (a) they typically result in ... |

364 |
Incomplete Information in Relational Databases
- Imieliński, Lipski
- 1984
Citation Context ...problem that we plan to study in future. 6 Related Work The database community has seen a lot of work on managing probabilistic, uncertain, incomplete, and/or fuzzy data in database systems (see e.g. [32, 27, 2, 29, 30, 24, 19, 5, 7, 12, 4]). With a rapid increase in the number of application domains such as data integration, pervasive computing etc., where uncertain data arises naturally, this area has seen renewed interest in recent y... |

335 | External memory algorithms and data structures: Dealing with MASSIVE data
- Vitter
Citation Context ...cted from the various partitions and each reference is connected to all other references within the same partition via edges. As part of our future work we aim to use external memory graph algorithms [36] for these tasks. When exact probabilistic inference turns out to be too expensive we have the flexibility of switching to approximate inference techniques depending on the user’s requirements. Just l... |

293 | Bucket elimination: A unifying framework for probabilistic inference algorithms
- Dechter
- 1996
Citation Context ...neral [9]. However, many applications provide graphical models with a graph structure that allow efficient probabilistic computation [37]. Variable elimination (VE), also known as bucket elimination, [37, 15] is an exact inference algorithm that has the ability to exploit this structure. VE can be used to compute the marginal probabilities of a single random variable from a joint distribution. The main ad... |

262 | Uldbs: Databases with uncertainty and lineage
- Benjelloun, Sarma, et al.
- 2006
Citation Context ...encoding of tuple interdependences. Cheng et al [7] associate (continuous) probability distributions with attributes, and propose several query evaluation and indexing techniques over such data. Trio [14, 3] aims to provide a unified treatment of data uncertainty by studying the issues of completeness and closure under various alternative models for representing uncertainty. Their recent work [3] combine... |

250 | CiteSeer: An automatic citation indexing system
- Giles, Bollacker, et al.
- 1998
Citation Context ...ne assuming complete independence among tuples (IND DB) and another that models the dependencies (MUTEX DB). We ran the query on an extraction of 860 publications from the real-world CiteSeer dataset [19]. We report results across various settings of σ. Figure 10 shows the top three results obtained from the two databases at three different settings of σ (we also list the author names to aid the reade... |

235 | Trio: A System for Integrated Management of Data, Accuracy, and Lineage
- Widom
- 2005
Citation Context ...ation may result in relations containing duplicate tuples that refer to the same entity; such tuples must be modeled as mutually exclusive [4, 1]. Real-world datasets such as the Christmas Bird Count [34, 12] naturally contain complex correlations among tuples. Data generated by sensor networks is typically highly correlated, both in time and space [14]. Data produced through use of machine learning techn... |

233 | Evaluation of probabilistic queries over imprecise data in constantly-evolving environments
- Cheng, Kalashnikov, et al.
Citation Context ...tore and retrieve; they have to help the user sift through the uncertainty and find the results most likely to be the answer. Numerous approaches have been proposed to handle uncertainty in databases [2, 7, 16, 6, 17, 19, 12, 29]. Among these, tuple-level uncertainty models [6, 17, 19, 12, 29], that associate existence probabilities with tuples, are considered more attractive for various reasons: (a) they typically result in ... |

229 |
The management of probabilistic data
- BARBARA, GARCIA-MOLINA, et al.
- 1992 |

190 | A probabilistic relational algebra for the integration of information retrieval and database systems
- Fuhr, Rölleke
- 1997 |

182 | ProbView: A flexible probabilistic database system
- Lakshmanan, Leone, et al.
- 1997 |

168 | Approximate string matching with q-grams and maximal matches - Ukkonen - 1992 |

160 | Exploiting causal independence in bayesian network inference
- Zhang, Poole
- 1996
Citation Context ...of factors involving a large number of random variables. Projection and aggregate operations can produce large factors but we can easily reduce the size of these factors by exploiting decomposability [38, 33]. This allows us to break any large projection factor into numerous (linear in the number of tuples involved in the projection) constant-sized 3-argument factors. Figure 8 (ii) shows the pictorial rep... |

152 |
Working models for uncertain data
- Sarma, Benjelloun, et al.
- 2006
Citation Context ...ation may result in relations containing duplicate tuples that refer to the same entity; such tuples must be modeled as mutually exclusive [6, 1]. Real-world datasets such as the Christmas Bird Count [14] naturally contain complex correlations among tuples. Data generated by sensor networks is typically highly correlated, both in time and space [16]. Data produced through use of machine learning techn... |

136 | An algebra for probabilistic databases
- Pittarelli
- 1994 |

96 |
A probabilistic relational model and algebra
- DEY, SARKAR
- 1996 |

87 |
Generalizing database relational algebra for the treatment of incomplete or uncertain information and vague queries
- Prade, Testemal
- 1984
Citation Context ...problem that we plan to study in future. 6 Related Work The database community has seen a lot of work on managing probabilistic, uncertain, incomplete, and/or fuzzy data in database systems (see e.g. [32, 27, 2, 29, 30, 24, 19, 5, 7, 12, 4]). With a rapid increase in the number of application domains such as data integration, pervasive computing etc., where uncertain data arises naturally, this area has seen renewed interest in recent y... |

83 | D.: Selectivity estimation using probabilistic models
- Getoor, Taskar, et al.
- 2001
Citation Context ...icial intelligence community, Friedman et. al [18] (PRM) address the problem of learning a probabilistic model from a given database. While PRMs can represent uncertainty in databases, Getoor et. al. [21] explore the application of PRMs to answering selectivity estimation queries but not queries expressed in standard database languages. We will briefly discuss some of the closely related work in the a... |

82 | A simple approach to Bayesian network computations
- Zhang, Poole
- 1994
Citation Context ...l Models Exact probabilistic inference is known to be NP-hard in general [9]. However, many applications provide graphical models with a graph structure that allow efficient probabilistic computation [37]. Variable elimination (VE), also known as bucket elimination, [37, 15] is an exact inference algorithm that has the ability to exploit this structure. VE can be used to compute the marginal probabili... |

54 | PXML: A probabilistic semistructured data model and algebra
- Hung, Getoor, et al.
- 2003
Citation Context ...h their notion of uncertainty does not currently include probabilities. In recent years, there has also been renewed interest in probabilistic extensions of object-relational [16] and XML data models [29, 24]. 7 Conclusions There is an increasing need for database solutions for efficiently managing and querying uncertain data exhibiting complex correlation patterns. In this paper, we presented a simple an... |

50 | ProTDB: Probabilistic data in XML
- Nierman, Jagadish
- 2002
Citation Context ...h their notion of uncertainty does not currently include probabilities. In recent years, there has also been renewed interest in probabilistic extensions of object-relational [16] and XML data models [29, 24]. 7 Conclusions There is an increasing need for database solutions for efficiently managing and querying uncertain data exhibiting complex correlation patterns. In this paper, we presented a simple an... |

32 |
Clean Answers over Dirty Databases
- Andritsos, Fuxman, et al.
- 2006
Citation Context ... naturally produce correlated data. For instance, data integration may result in relations containing duplicate tuples that refer to the same entity; such tuples must be modeled as mutually exclusive [6, 1]. Real-world datasets such as the Christmas Bird Count [14] naturally contain complex correlations among tuples. Data generated by sensor networks is typically highly correlated, both in time and spac... |

31 |
An extended Relational Database Model for Uncertain and Imprecise Information
- Lee
- 1992
Citation Context ...problem that we plan to study in future. 6 Related Work The database community has seen a lot of work on managing probabilistic, uncertain, incomplete, and/or fuzzy data in database systems (see e.g. [32, 27, 2, 29, 30, 24, 19, 5, 7, 12, 4]). With a rapid increase in the number of application domains such as data integration, pervasive computing etc., where uncertain data arises naturally, this area has seen renewed interest in recent y... |

25 |
A Fuzzy Model for Relational Databases
- Buckles, Petry
- 1982 |

20 |
Approximate probabilistic reasoning in bayesian belief networks is np-hard
- Dagum, Luby
- 1993
Citation Context ...o expensive we have the flexibility of switching to approximate inference techniques depending on the user’s requirements. Just like exact inference, approximate inference is also known to be NP-hard [11]. However there exist a fairly large variety of approximate inference algorithms that perform well in practice in a variety of cases each varying in speed and accuracy e.g., Markov Chain Monte Carlo t... |

20 | A data model and algebra for probabilistic complex values
- Eiter, Lukasiewicz, et al.
- 2001
Citation Context ...enting uncertainty, though their notion of uncertainty does not currently include probabilities. In recent years, there has also been renewed interest in probabilistic extensions of object-relational [16] and XML data models [29, 24]. 7 Conclusions There is an increasing need for database solutions for efficiently managing and querying uncertain data exhibiting complex correlation patterns. In this pa... |

15 |
Tables - An Efficient Tool for Handling Incomplete Information in Databases
- Horn
- 1989 |

11 |
Efficient Reasoning in Graphical Models
- Rish
- 1999
Citation Context ...e required result. The complexity of VE depends on some natural parameters relating to the connectivity of the graph underlying the graphical model corresponding to the joint probability distribution [33]. The inference problem is easy if the graphical model is or closely resembles a tree and the problem becomes progressively harder as the graphical model deviates more from being a tree. Interestingly... |

10 |
About projection-selection-join queries addressed to possibilistic relational databases
- Bosc, Pivert
- 2005 |

7 |
Indexing continuously changing data with mean-variance tree
- Xia, Prabhakar, et al.
Citation Context ...ls with tuples in their ProbView system. Their model also supports various conjunction and disjunction strategies that allow a limited encoding of tuple interdependences. Cheng et al. [5], Xia et al. [35] associate (continuous) probability distributions with attributes, and propose several query evaluation and indexing techniques over such data. Trio [34, 12] aims to provide a unified treatment of dat... |

5 |
Handling uncertainty and ignorance in databases: A rule to combine dependent data
- Choenni, Blok, et al.
- 2006
Citation Context ... resulting relations are not in 1NF and further the semantics of some of the query operators are messy, both of which seriously limit the applicability of their approach. More recently, Choenni et al [8] discuss conceptually how to make the query semantics more consistent through use of Dempster-Schafer theory. Cavallo et al [6] and Dey et al [17] propose and study tuple-level uncertainty models that... |

4 |
An analysis of first-order logics for reasoning about probability
- Halpern
- 1990
Citation Context ...bability 0.4). Such a probabilistic database can be interpreted as a probability distribution over the set of all possible deterministic database instances, called possible worlds (denoted by pwd(D)) [25, 19, 12]. Each deterministic instance (world) contains a subset of the tuples present in the probabilistic database, and the probability associated with it can be calculated directly using the independence as... |

3 |
Query answering using statistics and probabilistic views
- Dalvi, Suciu
- 2005
Citation Context ...bilities to model uncertain data in information retrieval context, and present the intensional and extensional query evaluation techniques discussed in Section 2. Extending this work, Dalvi and Suciu [12, 13] define safe query plans to be those for which extensional and intensional query evaluation produces identical results, and show how to generate a safe query plan for a query if one exists. Tuple inde... |

2 | Driven Data Acquisition in Sensor Networks - Hong - 2004 |

1 | Nevin Lianwen Zhang and David Poole. A simple approach to bayesian network computations - Kaufmann - 1988 |

1 |
The Apache Derby Project
- orgderby
Citation Context ...of our techniques at modeling such correlations and evaluating queries over them. This evaluation was done using a prototype system that we are currently building on top of the Apache Derby Java DBMS [26]. Our system supports the query execution strategies discussed in the previous section. Following Fuhr et al [19] and Dalvi et al [12], we generate probabilistic relations by issuing similarity predic... |