Results 1-10 of 122
Efficient top-k query evaluation on probabilistic data
In ICDE, 2007
Cited by 172 (31 self)
Modern enterprise applications are forced to deal with unreliable, inconsistent and imprecise information. Probabilistic databases can model such data naturally, but SQL query evaluation on probabilistic databases is difficult: previous approaches have either restricted the SQL queries, or computed approximate probabilities, or did not scale, and it was shown recently that precise query evaluation is theoretically hard. In this paper we describe a novel approach, which efficiently computes and ranks the top-k answers to a SQL query on a probabilistic database. The restriction to top-k answers is natural, since imprecisions in the data often lead to a large number of answers of low quality, and users are interested only in the answers with the highest probabilities. The idea in our algorithm is to run in parallel several Monte Carlo simulations, one for each candidate answer, and approximate each probability only to the extent needed to compute correctly the top-k answers. The algorithm is in a certain sense provably optimal and scales to large databases: we have measured running times of 5 to 50 seconds for complex SQL queries over a large database (10M tuples of which 6M probabilistic). Additional contributions of the paper include several optimization techniques, and a simple data model for probabilistic data that achieves completeness by using SQL views.
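The idea of refining each candidate's probability only as far as the top-k decision requires can be illustrated with a small sketch. This is not the paper's algorithm: the Bernoulli samplers stand in for per-answer simulations, the Hoeffding-style confidence intervals are an assumption made here for simplicity, and the names `top_k_by_simulation`, `bernoulli`, `delta` and `max_rounds` are hypothetical.

```python
import math
import random


def bernoulli(p, rng):
    """Hypothetical stand-in for one simulation step of a candidate answer."""
    return lambda: rng.random() < p


def top_k_by_simulation(samplers, k, delta=0.01, max_rounds=100_000):
    """Return indices of the k candidates with the highest estimated probability.

    Each candidate keeps a confidence interval that shrinks as it is sampled.
    Sampling stops once the k best lower bounds clear every other upper bound.
    """
    n = len(samplers)
    hits = [0] * n
    trials = [0] * n

    def interval(i):
        if trials[i] == 0:
            return 0.0, 1.0
        mean = hits[i] / trials[i]
        # Hoeffding radius (an assumption of this sketch, not the paper's bound).
        radius = math.sqrt(math.log(2 * n / delta) / (2 * trials[i]))
        return max(0.0, mean - radius), min(1.0, mean + radius)

    top = list(range(min(k, n)))
    for _ in range(max_rounds):
        bounds = [interval(i) for i in range(n)]
        ranked = sorted(range(n), key=lambda i: bounds[i][0], reverse=True)
        top, rest = ranked[:k], ranked[k:]
        if not rest:
            break
        kth_lower = min(bounds[i][0] for i in top)
        best_rest_upper = max(bounds[i][1] for i in rest)
        if kth_lower >= best_rest_upper:
            break  # the top-k set is separated from the rest
        # Only candidates whose interval overlaps the undecided region are refined.
        for i in range(n):
            low, high = bounds[i]
            if high > kth_lower and low < best_rest_upper:
                trials[i] += 1
                hits[i] += 1 if samplers[i]() else 0
    # If max_rounds is exhausted, this is the current best guess.
    return sorted(top)
```

Candidates whose intervals already lie clearly above or below the undecided region receive no further samples, which is where the savings over estimating every probability to full precision would come from.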
A Survey of Top-k Query Processing Techniques in Relational Database Systems
Cited by 153 (6 self)
Efficient processing of top-k queries is a crucial requirement in many interactive environments that involve massive amounts of data. In particular, efficient top-k processing in domains such as the Web, multimedia search and distributed systems has shown a great impact on performance. In this survey, we describe and classify top-k processing techniques in relational databases. We discuss different design dimensions in the current techniques including query models, data access methods, implementation levels, data and query certainty, and supported scoring functions. We show the implications of each dimension on the design of the underlying techniques. We also discuss top-k queries in the XML domain, and show their connections to relational approaches.
Probabilistic skylines on uncertain data
In Proceedings of the 33rd International Conference on Very Large Data Bases (VLDB’07), Vienna, 2007
Cited by 94 (20 self)
Uncertain data are inherent in some important applications. Although a considerable amount of research has been dedicated to modeling uncertain data and answering some types of queries on uncertain data, how to conduct advanced analysis on uncertain data remains an open problem at large. In this paper, we tackle the problem of skyline analysis on uncertain data. We propose a novel probabilistic skyline model where an uncertain object may take a probability to be in the skyline, and a p-skyline contains all the objects whose skyline probabilities are at least p. Computing probabilistic skylines on large uncertain data sets is challenging. We develop two efficient algorithms. The bottom-up algorithm computes the skyline probabilities of some selected instances of uncertain objects, and uses those instances to prune other instances and uncertain objects effectively. The top-down algorithm recursively partitions the instances of uncertain objects into subsets, and prunes subsets and objects aggressively. Our experimental results on both the real NBA player data set and the benchmark synthetic data sets show that probabilistic skylines are interesting and useful, and our two algorithms are efficient on large data sets, and complementary to each other in performance.
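To make the p-skyline definition concrete, here is a brute-force sketch. It is not the bottom-up or top-down algorithm from the abstract. It assumes that each uncertain object is a list of equally likely instances, that objects are independent, and that smaller values are better in every dimension; the function names are hypothetical.

```python
def dominates(a, b):
    """True if point a is no worse than b everywhere and strictly better somewhere."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def skyline_probability(objects, name):
    """Probability that the object `name` is not dominated by any other object."""
    instances = objects[name]
    total = 0.0
    for u in instances:
        not_dominated = 1.0
        for other, other_instances in objects.items():
            if other == name:
                continue
            # Fraction of the other object's instances that dominate u.
            dominating = sum(1 for v in other_instances if dominates(v, u))
            not_dominated *= 1.0 - dominating / len(other_instances)
        total += not_dominated
    return total / len(instances)


def p_skyline(objects, p):
    """All objects whose skyline probability is at least p."""
    return sorted(n for n in objects if skyline_probability(objects, n) >= p)


objects = {"A": [(1, 1)], "B": [(2, 2), (0, 3)]}
print(p_skyline(objects, 0.6))
```

In this toy data set, A is never dominated, while B's instance (2, 2) is always dominated by A and its instance (0, 3) never is, so B has skyline probability 0.5 and falls outside the 0.6-skyline.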
A Unified Approach to Ranking in Probabilistic Databases
Cited by 63 (3 self)
The dramatic growth in the number of application domains that naturally generate probabilistic, uncertain data has resulted in a need for efficiently supporting complex querying and decision-making over such data. In this paper, we present a unified approach to ranking and top-k query processing in probabilistic databases by viewing it as a multi-criteria optimization problem, and by deriving a set of features that capture the key properties of a probabilistic dataset that dictate the ranked result. We contend that a single, specific ranking function may not suffice for probabilistic databases, and we instead propose two parameterized ranking functions, called PRF^ω and PRF^e, that generalize or can approximate many of the previously proposed ranking functions. We present novel generating-functions-based algorithms for efficiently ranking large datasets according to these ranking functions, even if the datasets exhibit complex correlations modeled using probabilistic and/xor trees or Markov networks. We further propose that the parameters of the ranking function be learned from user preferences, and we develop an approach to learn those parameters. Finally, we present a comprehensive experimental study that illustrates the effectiveness of our parameterized ranking functions, especially PRF^e, at approximating other ranking functions and the scalability of our proposed algorithms for exact or approximate ranking.
Semantics of ranking queries for probabilistic data and expected ranks
In Proc. of ICDE’09, 2009
Cited by 61 (1 self)
When dealing with massive quantities of data, top-k queries are a powerful technique for returning only the k most relevant tuples for inspection, based on a scoring function. The problem of efficiently answering such ranking queries has been studied and analyzed extensively within traditional database settings. The importance of the top-k is perhaps even greater in probabilistic databases, where a relation can encode exponentially many possible worlds. There have been several recent attempts to propose definitions and algorithms for ranking queries over probabilistic data. However, these all lack many of the intuitive properties of a top-k over deterministic data. Specifically, we define a number of fundamental properties, including exact-k, containment, unique-rank, value-invariance, and stability, which are all satisfied by ranking queries on certain data. We argue that all these conditions should also be fulfilled by any reasonable definition for ranking uncertain data. Unfortunately, none of the existing definitions is able to achieve this. To remedy this shortcoming, this work proposes an intuitive new approach of expected rank. This uses the well-founded notion of the expected rank of each tuple across all possible worlds as the basis of the ranking. We are able to prove that, in contrast to all existing approaches, the expected rank satisfies all the required properties for a ranking query. We provide efficient solutions to compute this ranking across the major models of uncertain data, such as attribute-level and tuple-level uncertainty. For an uncertain relation of N tuples, the processing cost is O(N log N), no worse than simply sorting the relation. In settings where there is a high cost for generating each tuple in turn, we provide pruning techniques based on probabilistic tail bounds that can terminate the search early and guarantee that the top-k has been found. Finally, a comprehensive experimental study confirms the effectiveness of our approach.
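The expected-rank idea can be sketched for a simplified tuple-level setting. The assumptions here are not taken from the abstract: tuples exist independently with a given probability, scores are distinct, a present tuple's rank is the number of present tuples that outscore it, and an absent tuple is ranked after every present one. Under those assumptions a single sorted pass gives every expected rank, and a brute-force enumeration of possible worlds serves as a cross-check. All names are hypothetical.

```python
from itertools import product


def expected_ranks(tuples):
    """tuples: list of (score, probability). Returns expected ranks in O(N log N)."""
    order = sorted(range(len(tuples)), key=lambda i: tuples[i][0], reverse=True)
    total_mass = sum(p for _, p in tuples)
    ranks = [0.0] * len(tuples)
    higher_mass = 0.0  # sum of probabilities of tuples with a higher score
    for i in order:
        p = tuples[i][1]
        # Present: expected count of higher-scoring tuples that are also present.
        # Absent: expected size of the world, which excludes this tuple.
        ranks[i] = p * higher_mass + (1.0 - p) * (total_mass - p)
        higher_mass += p
    return ranks


def expected_ranks_bruteforce(tuples):
    """Same quantity by enumerating all 2^N possible worlds (small N only)."""
    n = len(tuples)
    ranks = [0.0] * n
    for world in product([False, True], repeat=n):
        weight = 1.0
        for present, (_, p) in zip(world, tuples):
            weight *= p if present else 1.0 - p
        size = sum(world)
        for i in range(n):
            if world[i]:
                rank = sum(1 for j in range(n)
                           if world[j] and tuples[j][0] > tuples[i][0])
            else:
                rank = size
            ranks[i] += weight * rank
    return ranks
```

A lower expected rank is better, so a top-k answer under this sketch would be the k tuples with the smallest values returned by `expected_ranks`.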
Efficient processing of top-k queries on uncertain databases
2007
Cited by 60 (7 self)
This work introduces novel polynomial-time algorithms for processing top-k queries in uncertain databases, under the generally adopted model of x-relations. An x-relation consists of a number of x-tuples, and each x-tuple randomly instantiates into one tuple from one or more alternatives. Our results significantly improve the best known algorithms for top-k query processing in uncertain databases, in terms of both running time and memory usage. Focusing on the single-alternative case, the new algorithms are orders of magnitude faster.
Monochromatic and Bichromatic Reverse Skyline Search over Uncertain Databases
2008
Cited by 54 (2 self)
Reverse skyline queries over uncertain databases have many important applications such as sensor data monitoring and business planning. Due to the existence of uncertainty in many real-world data, answering reverse skyline queries accurately and efficiently over uncertain data has become increasingly important. In this paper, we model the probabilistic reverse skyline query on uncertain data, in both monochromatic and bichromatic cases, and propose effective pruning methods to reduce the search space of query processing. Moreover, we present efficient query procedures that seamlessly integrate the proposed pruning methods. Extensive experiments have demonstrated the efficiency and effectiveness of our proposed approach with various experimental settings.
Efficient Search for the Top-k Probable Nearest Neighbors in Uncertain Databases
Cited by 54 (1 self)
Uncertainty pervades many domains in our lives. Current real-life applications, e.g., location tracking using GPS devices or cell phones, multimedia feature extraction, and sensor data management, deal with different kinds of uncertainty. Finding the nearest neighbor objects to a given query point is an important query type in these applications. In this paper, we study the problem of finding objects with the highest marginal probability of being the nearest neighbors to a query object. We adopt a general uncertainty model allowing for data and query uncertainty. Under this model, we define new query semantics, and provide several efficient evaluation algorithms. We analyze the cost factors involved in query evaluation, and present novel techniques to address the tradeoffs among these factors. We give multiple extensions to our techniques including handling dependencies among data objects, and answering threshold queries. We conduct an extensive experimental study to evaluate our techniques on both real and synthetic data.
Probabilistic frequent itemset mining in uncertain databases
In KDD, 2009
Cited by 45 (7 self)
Probabilistic frequent itemset mining in uncertain transaction databases semantically and computationally differs from traditional techniques applied to standard “certain” transaction databases. The consideration of existential uncertainty of item(sets), indicating the probability that an item(set) occurs in a transaction, makes traditional techniques inapplicable. In this paper, we introduce new probabilistic formulations of frequent itemsets based on possible world semantics. In this probabilistic context, an itemset X is called frequent if the probability that X occurs in at least minSup transactions is above a given threshold τ. To the best of our knowledge, this is the first approach addressing this problem under possible worlds semantics. In consideration of the probabilistic formulations, we present a framework which is able to solve the Probabilistic Frequent Itemset Mining (PFIM) problem efficiently. An extensive experimental evaluation investigates the impact of our proposed techniques and shows that our approach is orders of magnitude faster than straightforward approaches.
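The frequentness test described above, whether the probability that X occurs in at least minSup transactions exceeds τ, can be sketched with a small dynamic program. The sketch assumes that transactions are independent and that the per-transaction probability of containing X is already known; it is not necessarily the framework the paper proposes, and the function names are hypothetical.

```python
def prob_support_at_least(probs, min_sup):
    """Probability that an itemset occurs in at least min_sup transactions.

    probs[t] is the probability that transaction t contains the itemset.
    """
    if min_sup <= 0:
        return 1.0
    # dp[j] for j < min_sup: probability of exactly j occurrences so far.
    # dp[min_sup]: probability of min_sup or more occurrences so far.
    dp = [1.0] + [0.0] * min_sup
    for p in probs:
        # Go downwards so dp[j - 1] still holds the previous step's value.
        dp[min_sup] += dp[min_sup - 1] * p
        for j in range(min_sup - 1, 0, -1):
            dp[j] = dp[j] * (1.0 - p) + dp[j - 1] * p
        dp[0] *= 1.0 - p
    return dp[min_sup]


def is_probabilistic_frequent(probs, min_sup, tau):
    """True if the support reaches min_sup with probability above tau."""
    return prob_support_at_least(probs, min_sup) > tau


print(is_probabilistic_frequent([0.9, 0.8, 0.1], min_sup=2, tau=0.7))
```

The table has only minSup + 1 entries, so the cost is proportional to the number of transactions times minSup rather than to the number of possible worlds.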
On the semantics and evaluation of top-k queries in probabilistic databases
In ICDE Workshops
Cited by 40 (1 self)
We study here fundamental issues involved in top-k query evaluation in probabilistic databases. We consider simple probabilistic databases in which probabilities are associated with individual tuples, and general probabilistic databases in which, additionally, exclusivity relationships between tuples can be represented. In contrast to other recent research in this area, we do not limit ourselves to injective scoring functions. We formulate three intuitive postulates for the semantics of top-k queries in probabilistic databases, and introduce a new semantics, Global-Topk, that satisfies those postulates to a large degree. We also show how to evaluate queries under the Global-Topk semantics. For simple databases we design dynamic-programming-based algorithms. For general databases we show polynomial-time reductions to the simple cases, and provide effective heuristics to speed up the computation in practice. For example, we demonstrate that for a fixed k the time complexity of top-k query evaluation is as low as linear, under the assumption that probabilistic databases are simple and scoring functions are injective.
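For the simple case (independent tuples, injective scoring), one way to picture a dynamic-programming evaluation is to compute, for each tuple, the probability that it appears among the k highest-scoring tuples of a random possible world. The sketch below rests on that reading and is not claimed to be the paper's algorithm; it takes Global-Topk to mean returning the k tuples with the highest such probability, which is an assumption, and all names are hypothetical.

```python
def topk_membership_probs(tuples, k):
    """tuples: list of (score, probability) with distinct scores; requires k >= 1.

    Returns, for each tuple, the probability that it exists and that fewer
    than k higher-scoring tuples exist alongside it.
    """
    order = sorted(range(len(tuples)), key=lambda i: tuples[i][0], reverse=True)
    # dp[j]: probability that exactly j higher-scoring tuples are present (j < k).
    dp = [1.0] + [0.0] * (k - 1)
    result = [0.0] * len(tuples)
    for i in order:
        p = tuples[i][1]
        result[i] = p * sum(dp)
        # Fold this tuple into the counts seen by lower-scoring tuples.
        for j in range(k - 1, 0, -1):
            dp[j] = dp[j] * (1.0 - p) + dp[j - 1] * p
        dp[0] *= 1.0 - p
    return result


def global_topk(tuples, k):
    """Indices of the k tuples most likely to be in the top-k (ties by index)."""
    probs = topk_membership_probs(tuples, k)
    return sorted(range(len(tuples)), key=lambda i: (-probs[i], i))[:k]
```

After the initial sort, each tuple costs O(k) work, so for a fixed k the pass over the sorted tuples is linear in their number.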