## Efficient top-k query evaluation on probabilistic data (2007)

Venue: ICDE

Citations: 138 (26 self)

### BibTeX

@INPROCEEDINGS{Ré07efficienttop-k,
  author    = {Christopher Ré and Nilesh Dalvi and Dan Suciu},
  title     = {Efficient top-k query evaluation on probabilistic data},
  booktitle = {ICDE},
  year      = {2007},
  pages     = {886--895}
}

### Abstract

Modern enterprise applications are forced to deal with unreliable, inconsistent, and imprecise information. Probabilistic databases can model such data naturally, but SQL query evaluation on probabilistic databases is difficult: previous approaches have either restricted the SQL queries, computed approximate probabilities, or failed to scale, and it was recently shown that exact query evaluation is theoretically hard. In this paper we describe a novel approach that efficiently computes and ranks the top-k answers to a SQL query on a probabilistic database. The restriction to top-k answers is natural, since imprecision in the data often leads to a large number of low-quality answers, and users are interested only in the answers with the highest probabilities. The idea in our algorithm is to run several Monte Carlo simulations in parallel, one for each candidate answer, and to approximate each probability only to the extent needed to compute the top-k answers correctly. The algorithm is in a certain sense provably optimal and scales to large databases: we have measured running times of 5 to 50 seconds for complex SQL queries over a large database (10M tuples, of which 6M are probabilistic). Additional contributions of the paper include several optimization techniques, and a simple data model for probabilistic data that achieves completeness by using SQL views.
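The multisimulation idea described in the abstract can be sketched concretely. The code below is an illustrative simplification, not the paper's actual algorithm: `top_k_by_simulation`, its Hoeffding-style intervals, and the batching policy are all assumptions, and each candidate answer is modeled as a function returning one Bernoulli sample of its event.

```python
def top_k_by_simulation(candidates, k, eps=0.01, batch=100):
    """Refine per-candidate Monte Carlo estimates only until the
    top-k set is separated from the remaining candidates.
    candidates: dict mapping answer id -> zero-argument function that
    returns one Bernoulli sample (0 or 1) of the answer's event."""
    stats = {a: [0, 0] for a in candidates}  # successes, trials

    def interval(a):
        # Crude Hoeffding-style confidence interval (an assumption;
        # the real guarantees depend on the chosen bound).
        s, n = stats[a]
        if n == 0:
            return 0.0, 1.0
        p, half = s / n, (1.0 / (2 * n)) ** 0.5
        return max(0.0, p - half), min(1.0, p + half)

    while True:
        # Rank candidates by the upper bounds of their intervals.
        order = sorted(candidates, key=lambda a: interval(a)[1], reverse=True)
        topk, rest = order[:k], order[k:]
        # Stop when every top-k lower bound clears every other upper bound.
        if not rest or min(interval(a)[0] for a in topk) >= \
                max(interval(a)[1] for a in rest) - eps:
            return topk
        # Otherwise simulate the two candidates at the k boundary further.
        for a in (topk[-1], rest[0]):
            for _ in range(batch):
                stats[a][0] += candidates[a]()
                stats[a][1] += 1
```

Even in this toy version the abstract's point survives: low-probability answers are abandoned after a few samples, and simulation effort concentrates on the candidates straddling the k-th rank.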

### Citations

421 |
The Complexity of Enumeration and Reliability Problems
- Valiant
- 1979
Citation Context ... case P(t.E) = 0). Next, partition ET by the GROUP-BY attributes B: ET = G1 ∪ G2 ∪ . . . ∪ Gn. For each group G ∈ {G1, . . . , Gn} define the following DNF boolean expression: G.E = ⋁_{t∈G} t.E (3). Valiant [17] has shown that computing the probability P(G.E) of a DNF formula like (3) is #P-complete in general. For a group G, denote G.B the tuple b̄ = t.B for some t ∈ G (it is independent of the choice of t)...
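The #P-hardness quoted here can be made concrete: exact evaluation of a DNF's probability over independent tuple events amounts, naively, to summing over all 2^n truth assignments. A brute-force sketch (illustrative only; `clauses` is a list of sets of event names and `probs` their marginal probabilities, both hypothetical inputs):

```python
from itertools import product

def dnf_probability_exact(clauses, probs):
    """Exact P(C1 v ... v Cm) over independent Boolean events by
    enumerating every truth assignment -- exponential in the number
    of events, which is what #P-hardness says we cannot avoid in
    general."""
    vars_ = sorted(probs)
    total = 0.0
    for bits in product([False, True], repeat=len(vars_)):
        world = dict(zip(vars_, bits))
        # Does some clause (conjunction) hold in this world?
        if any(all(world[v] for v in clause) for clause in clauses):
            weight = 1.0
            for v in vars_:
                weight *= probs[v] if world[v] else 1 - probs[v]
            total += weight
    return total
```

For the group expression G.E in (3), each clause would be a single tuple event t.E; already at a few dozen tuples per group this enumeration is hopeless, which motivates the Monte Carlo approach.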

397 |
A Theory for Record Linkage
- Fellegi, Sunter
- 1969
Citation Context ...The Three Stooges: Who Done it. The problem of detecting when two representations denote the same object has been intensively studied, and referred to as deduplication, record linkage, or merge-purge [18, 19, 10, 7, 11, 9, 1, 3]. Perfect object matching is sometimes impossible, and when it is possible it is often very costly, since it requires specialized, domain specific algorithms. Our approach is to rely on existing domai...

347 | Efficient query evaluation on probabilistic databases
- Dalvi, Suciu
- 2004
Citation Context ...y tuple is thus a probabilistic event and tuples may be correlated events. The major difficulty in probabilistic databases consists of evaluating the queries correctly and efficiently. Dalvi and Suciu [4] have shown recently that most SQL queries have #P-complete data complexity, which rules out efficient algorithms for exact probabilities. Our approach in this paper is to combine top-k style queries ...

335 | Model-driven data acquisition in sensor networks
- Deshpande, Guestrin, et al.
- 2004
Citation Context ... be expressed much more concisely. We also do not handle continuous attribute values, e.g. we cannot handle the case where the attribute temperature has a normal distribution with mean 40, as done in [6]. Organization Sec. 2 describes the basic probabilistic data model; the MS algorithm is described in Sec. 3 and the two optimizations are in Sec. 4. We extend the data model in Sec. 5, report expe...

332 |
Incomplete information in relational databases
- Imielinski, Jr
- 1984
Citation Context ...e model used by Dalvi and Suciu [4] is a strict subset of Barbara’s (allows only for independent tuples) and thus is also incomplete. Fuhr and Roellke [8] describe a complete model, based on c-tables [12], but it is complex and impractical. Widom et al. [5] advance the hypothesis that every representation formalism is either incomplete or unintuitive (hence impractical), and propose a two-layered app...

300 | The Merge/Purge Problem for Large Databases
- Hernandez, Stolfo
- 1995
Citation Context ...The Three Stooges: Who Done it. The problem of detecting when two representations denote the same object has been intensively studied, and referred to as deduplication, record linkage, or merge-purge [18, 19, 10, 7, 11, 9, 1, 3]. Perfect object matching is sometimes impossible, and when it is possible it is often very costly, since it requires specialized, domain specific algorithms. Our approach is to rely on existing domai...

221 |
The management of probabilistic data
- Barbara, Garcia-Molina, et al.
- 1992
Citation Context ...A complete representation formalism for probabilistic databases is one that can represent any probabilistic data, according to Def. 2.1. The formalism we described in Sec. 2 is based on Barbara et al. [2], and is incomplete, because it represents only independent or exclusive tuples. Other formalisms described in the literature are either incomplete, or too complex to be practical. For example the mod...

220 | Evaluation of probabilistic queries over imprecise data in constantly-evolving environments
- Cheng, Kalashnikov, et al.
Citation Context ...; in other cases complete removal is not even possible. A recent approach to manage imprecisions is with a probabilistic database, which uses probabilities to represent the uncertainty about the data [5, 6, 7, 8, 21]. A simplistic definition is that every tuple belongs to the database with some probability, whose value is between 0 and 1, and, as a consequence, every tuple returned by a SQL query will have some p...

218 | The state of record linkage and current research problems
- Winkler
- 1999
Citation Context ...The Three Stooges: Who Done it. The problem of detecting when two representations denote the same object has been intensively studied, and referred to as deduplication, record linkage, or merge-purge [18, 19, 10, 7, 11, 9, 1, 3]. Perfect object matching is sometimes impossible, and when it is possible it is often very costly, since it requires specialized, domain specific algorithms. Our approach is to rely on existing domai...

212 | Trio: A System for Integrated Management of Data Accuracy
- Widom
- 2005
Citation Context ...; in other cases complete removal is not even possible. A recent approach to manage imprecisions is with a probabilistic database, which uses probabilities to represent the uncertainty about the data [5, 6, 7, 8, 21]. A simplistic definition is that every tuple belongs to the database with some probability, whose value is between 0 and 1, and, as a consequence, every tuple returned by a SQL query will have some p...

174 | A probabilistic relational algebra for the integration of information retrieval and database systems
- Fuhr, Rölleke
- 1997
Citation Context ...al degenerates to [0, 1] for most directors. One can still use this method in order to rank the outputs (by ordering them based on their intervals’ midpoints), but this results in low precision. Fuhr [8] uses an exponential time algorithm that essentially iterates over all possible worlds that support a given answer. This is again impractical in our setting. Finally Dalvi [4] only considers “safe” qu...

170 | Probview: A flexible probabilistic database system
- Lakshmanan, Leone, et al.
- 1997
Citation Context ... for each pair of movies that satisfies the criterion: in our example each of the 1415 directors would have occurred on average 234.8 times. This makes it impossible to rank the directors. Lakshmanan [14] computes probability intervals instead of exact probabilities. However, unlike Luby and Karp’s algorithm, which can approximate the probabilities to an arbitrary precision, the precision in [14] canno...

155 | Robust and efficient fuzzy match for online data cleaning
- Chaudhuri, Ganjam, et al.
- 2006
Citation Context ...probability that the two objects match. The similarity scores are stored in the table TitleMatchp, Fig. 2. There is a rich literature on record linkage (also known as de-duplication, or merge-purge) [1, 4, 9, 11, 14, 15, 22, 23], that offers an excellent collection of techniques for computing these similarity scores. However, the traditional way of using these scores is to compare them to a threshold and classify objects int...

139 | Working Models for Uncertain Data
- Sarma, Benjelloun, et al.
- 2006

122 |
Monte-carlo algorithms for enumeration and reliability problems
- Karp, Luby
- 1983
Citation Context ...a number k, and we have to return to the user the k highest ranked answers sorted by their output probabilities. To compute the probabilities we use Luby and Karp’s Monte Carlo simulation algorithm [13] (MC), which can compute an approximation to any desired precision. A naive application of MC would be to run it a sufficiently large number of steps on each query answer and compute its probability w...
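The Karp-Luby estimator referenced here admits a short sketch: pick a clause with probability proportional to its weight, sample a possible world conditioned on that clause being true, and count the sample only when the picked clause is the first satisfied one, so each satisfying world is counted exactly once on average. The sketch below assumes independent Boolean variables; the function and argument names are illustrative.

```python
import random

def karp_luby_dnf(clauses, probs, trials=20000, rng=random):
    """Unbiased Monte Carlo estimate of P(C1 v ... v Cm) for a DNF
    over independent Boolean variables (coverage-style estimator).
    clauses: list of sets of variable names; probs: marginals."""
    # P(Ci) = product of the probabilities of Ci's variables.
    p_clause = []
    for c in clauses:
        p = 1.0
        for v in c:
            p *= probs[v]
        p_clause.append(p)
    total = sum(p_clause)
    if total == 0.0:
        return 0.0
    hits = 0
    for _ in range(trials):
        # Pick clause i proportionally to P(Ci).
        r = rng.random() * total
        i = 0
        while r > p_clause[i]:
            r -= p_clause[i]
            i += 1
        # Sample a world conditioned on Ci being satisfied.
        world = {v: True for v in clauses[i]}
        for v in probs:
            if v not in world:
                world[v] = rng.random() < probs[v]
        # Count only if no earlier clause is satisfied, so each
        # satisfying world contributes exactly once.
        if not any(all(world[v] for v in clauses[j]) for j in range(i)):
            hits += 1
    return total * hits / trials
```

Unlike the exponential exact computation, the number of samples needed for a given precision grows only polynomially, which is what makes per-answer simulation feasible.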

111 | Eliminating fuzzy duplicates in data warehouses
- Ananthakrishna, Chaudhuri, et al.
- 2002

100 | Declarative Data Cleaning: Language, Model and Algorithms
- Galhardas, Florescu, et al.
- 2001
Citation Context ...probability that the two objects match. The similarity scores are stored in the table TitleMatchp, Fig. 2. There is a rich literature on record linkage (also known as de-duplication, or merge-purge) [1, 4, 9, 11, 14, 15, 22, 23], that offers an excellent collection of techniques for computing these similarity scores. However, the traditional way of using these scores is to compare them to a threshold and classify objects int...

95 | Top-k query processing in uncertain databases
- Soliman, Ilyas, et al.
- 2009
Citation Context ...or ranking purposes. Fuhr [10] uses an exponential time algorithm that essentially iterates over all possible worlds that support a given answer. This is again impractical in our setting. Recent work [19] handles the problem of computing Top-K queries when the uncertain data is specified by “generation rules”; in our case the complexity comes from the complexity of queries over the uncertain data. Finally,...

93 | A Hybrid Discriminative/Generative Approach for Modelling Human Activities
- Lester, Choudhury, et al.
Citation Context ...a) the movie and (b) the rating. The imprecisions here were generated by information extraction tools. In the third application we used human activity recognition data obtained from body-worn sensors [15]. The data was first collected from eight different sensors (accelerometer, audio, IR/visible light, high-frequency light, barometric pressure, humidity, temperature, and compass) in a shoulder mounte...

62 | Models for incomplete and probabilistic information
- Green, Tannen
- 2006
Citation Context ...(since our n is in the range of thousands and k = 10 . . . 50). 2 Preliminaries 2.1 Probabilistic Databases We introduce here a basic probabilistic data model. It corresponds to ?-sets and or-sets in [7, 13]. Possible Worlds Fix a relational schema S, consisting of relation names R1, R2, . . . , Rm, a set of attributes Attr(Ri) and a key Key(Ri) ⊆ Attr(Ri) for each i...
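The possible-worlds semantics this context refers to has a compact reading: with n independent probabilistic tuples, each world keeps some subset of the tuples, and the 2^n world probabilities sum to 1. A toy sketch (illustrative only; real systems never materialize the worlds, and exclusive tuples would need an extra grouping step):

```python
from itertools import product

def possible_worlds(tuples):
    """Enumerate all worlds induced by independent probabilistic
    tuples. tuples: list of (value, probability) pairs; returns a
    list of (world, probability) pairs whose probabilities sum to 1."""
    worlds = []
    for mask in product([False, True], repeat=len(tuples)):
        # A world keeps exactly the tuples selected by the mask.
        world = tuple(t for (t, _), keep in zip(tuples, mask) if keep)
        p = 1.0
        for (_, prob), keep in zip(tuples, mask):
            p *= prob if keep else 1.0 - prob
        worlds.append((world, p))
    return worlds
```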

57 |
Selecting the best system
- KIM, NELSON
Citation Context ...[8, 5]. Other Related Work The statistics literature has considered the statistical selection problem, where the problem is to find the “best” (i.e. highest mean) of a finite set of alternatives: see [3, 12] for recent surveys: this corresponds to our setting with k = 1 and small n (say 2 . . . 5). The focus of that work is on tight probabilistic guarantees for a variety of concrete probabilistic distrib...

36 | Record linkage: Current practice and future directions
- Gu, Baxter, et al.
- 2003

31 |
A theory for record linkage
- Fellegi, Sunter
- 1969
Citation Context ...probability that the two objects match. The similarity scores are stored in the table TitleMatchp, Fig. 2. There is a rich literature on record linkage (also known as de-duplication, or merge-purge) [1, 4, 9, 11, 14, 15, 22, 23], that offers an excellent collection of techniques for computing these similarity scores. However, the traditional way of using these scores is to compare them to a threshold and classify objects int...

17 |
Robust and efficient fuzzy match for online data cleaning
- Chaudhuri, Ganjam, Ganti, Motwani
- 2003

13 |
New developments in ranking and selection: an empirical comparison of the three main approaches
- Branke, Chick, et al.
Citation Context ...[8, 5]. Other Related Work The statistics literature has considered the statistical selection problem, where the problem is to find the “best” (i.e. highest mean) of a finite set of alternatives: see [3, 12] for recent surveys: this corresponds to our setting with k = 1 and small n (say 2 . . . 5). The focus of that work is on tight probabilistic guarantees for a variety of concrete probabilistic distrib...

11 |
Declarative data cleaning: Language, model, and algorithms
- Galhardas, Florescu, Simon, Saita
- 2001

5 |
A comparison of string distance metrics for name-matching tasks
- Cohen, Ravikumar
- 2003

1 |
Evaluation of having queries on probabilistic databases
- Ré, Dalvi, Suciu
- 2006
Citation Context ...), . . . (1) FROM R WHERE C GROUP-BY B The aggregate operators can be sum, count (which is sum(1)), min and max; we do not support avg. We do not discuss here a HAVING clause: we study that elsewhere [16]. Semantics We define now the meaning of the query q on a probabilistic database ({W1, . . . , Wn}, P). Intuitively, the answer to the query is a table like this: B1 B2 . . . agg1(A1) agg2(A2) . . . p b1...