Results 1 - 10
of
22
Semantics of ranking queries for probabilistic data and expected ranks
- In Proc. of ICDE’09
, 2009
"... Abstract — When dealing with massive quantities of data, topk queries are a powerful technique for returning only the k most relevant tuples for inspection, based on a scoring function. The problem of efficiently answering such ranking queries has been studied and analyzed extensively within traditi ..."
Abstract
-
Cited by 28 (1 self)
- Add to MetaCart
Abstract — When dealing with massive quantities of data, topk queries are a powerful technique for returning only the k most relevant tuples for inspection, based on a scoring function. The problem of efficiently answering such ranking queries has been studied and analyzed extensively within traditional database settings. The importance of the top-k is perhaps even greater in probabilistic databases, where a relation can encode exponentially many possible worlds. There have been several recent attempts to propose definitions and algorithms for ranking queries over probabilistic data. However, these all lack many of the intuitive properties of a top-k over deterministic data. Specifically, we define a number of fundamental properties, including exact-k, containment, unique-rank, value-invariance, and stability, which are all satisfied by ranking queries on certain data. We argue that all these conditions should also be fulfilled by any reasonable definition for ranking uncertain data. Unfortunately, none of the existing definitions is able to achieve this. To remedy this shortcoming, this work proposes an intuitive new approach of expected rank. This uses the well-founded notion of the expected rank of each tuple across all possible worlds as the basis of the ranking. We are able to prove that, in contrast to all existing approaches, the expected rank satisfies all the required properties for a ranking query. We provide efficient solutions to compute this ranking across the major models of uncertain data, such as attribute-level and tuple-level uncertainty. For an uncertain relation of N tuples, the processing cost is O(N log N)—no worse than simply sorting the relation. In settings where there is a high cost for generating each tuple in turn, we provide pruning techniques based on probabilistic tail bounds that can terminate the search early and guarantee that the top-k has been found. Finally, a comprehensive experimental study confirms the effectiveness of our approach. I.
Exploiting Shared Correlations in Probabilistic Databases
, 2008
"... There has been a recent surge in work in probabilistic databases, propelled in large part by the huge increase in noisy data sources — from sensor data, experimental data, data from uncurated sources, and many others. There is a growing need for database management systems that can efficiently repre ..."
Abstract
-
Cited by 23 (5 self)
- Add to MetaCart
There has been a recent surge in work in probabilistic databases, propelled in large part by the huge increase in noisy data sources — from sensor data, experimental data, data from uncurated sources, and many others. There is a growing need for database management systems that can efficiently represent and query such data. In this work, we show how data characteristics can be leveraged to make the query evaluation process more efficient. In particular, we exploit what we refer to as shared correlations where the same uncertainties and correlations occur repeatedly in the data. Shared correlations occur mainly due to two reasons: (1) Uncertainty and correlations usually come from general statistics and rarely vary on a tuple-to-tuple basis; (2) The query evaluation procedure itself tends to re-introduce the same correlations. Prior work has shown that the query evaluation problem on probabilistic databases is equivalent to a probabilistic inference problem on an appropriately constructed probabilistic graphical model (PGM). We leverage this by introducing a new data structure, called the random variable elimination graph (rv-elim graph) that can be built from the PGM obtained from query evaluation. We develop techniques based on bisimulation that can be used to compress the rv-elim graph exploiting the presence of shared correlations in the PGM, the compressed rv-elim graph can then be used to run inference. We validate our methods by evaluating them empirically and show that even with a few shared correlations significant speed-ups are possible.
Database Support for Probabilistic Attributes and Tuples
- In IEEE 24th Intl. Conference on Data Engineering
, 2008
"... Abstract — The inherent uncertainty of data present in numerous applications such as sensor databases, text annotations, and information retrieval motivate the need to handle imprecise data at the database level. Uncertainty can be at the attribute or tuple level and is present in both continuous an ..."
Abstract
-
Cited by 17 (6 self)
- Add to MetaCart
Abstract — The inherent uncertainty of data present in numerous applications such as sensor databases, text annotations, and information retrieval motivate the need to handle imprecise data at the database level. Uncertainty can be at the attribute or tuple level and is present in both continuous and discrete data domains. This paper presents a model for handling arbitrary probabilistic uncertain data (both discrete and continuous) natively at the database level. Our approach leads to a natural and efficient representation for probabilistic data. We develop a model that is consistent with possible worlds semantics and closed under basic relational operators. This is the first model that accurately and efficiently handles both continuous and discrete uncertainty. The model is implemented in a real database system (PostgreSQL) and the effectiveness and efficiency of our approach is validated experimentally. I.
Finding Frequent Items in Probabilistic Data
, 2008
"... Computing statistical information on probabilistic data has attracted a lot of attention recently, as the data generated from a wide range of data sources are inherently fuzzy or uncertain. In this paper, we study an important statistical query on probabilistic data: finding the frequent items. One ..."
Abstract
-
Cited by 14 (5 self)
- Add to MetaCart
Computing statistical information on probabilistic data has attracted a lot of attention recently, as the data generated from a wide range of data sources are inherently fuzzy or uncertain. In this paper, we study an important statistical query on probabilistic data: finding the frequent items. One straightforward approach to identify the frequent items in a probabilistic data set is to simply compute the expected frequency of an item and decide if it exceeds a certain fraction of the expected size of the whole data set. However, this simple definition misses important information about the internal structure of the probabilistic data and the interplay among all the uncertain entities. Thus, we propose a new definition based on the possible world semantics that has been widely adopted for many query types in uncertain data management, trying to find all the items that are likely to be frequent in a randomly generated possible world. Our approach naturally leads to the study of ranking frequent items based on confidence as well. Finding likely frequent items in probabilistic data turns out to be much more difficult. We first propose exact algorithms for offline data that run in either quadratic or cubic time. Next, we design novel sampling-based algorithms for streaming data to find all approximately likely frequent items with theoretically guaranteed high probability and accuracy. Our sampling schemes consume sublinear memory and exhibit excellent scalability. Finally, we verify the effectiveness and efficiency of the developed algorithms using both real and synthetic data sets with extensive experimental evaluations.
Orion 2.0: Native Support for Uncertain Data
, 2008
"... Orion is a state-of-the-art uncertain database management system with built-in support for probabilistic data as first class data types. In contrast to other uncertain databases, Orion supports both attribute and tuple uncertainty with arbitrary correlations. This enables the database engine to hand ..."
Abstract
-
Cited by 14 (1 self)
- Add to MetaCart
Orion is a state-of-the-art uncertain database management system with built-in support for probabilistic data as first class data types. In contrast to other uncertain databases, Orion supports both attribute and tuple uncertainty with arbitrary correlations. This enables the database engine to handle both discrete and continuous pdfs in a natural and accurate manner. The underlying model is closed under the basic relational operators and is consistent with Possible Worlds Semantics. We demonstrate how Orion simplifies the design and enhances the capabilities of two example applications: managing sensor data (continuous uncertainty) and inferring missing values (discrete uncertainty).
Efficiently Answering Probabilistic Threshold Top-k Queries on Uncertain Data
, 2007
"... Uncertain data is inherent in a few important applications such as environmental surveillance and mobile object tracking. Top-k queries (or as known as ranking queries) are often natural and useful in analyzing uncertain data in those applications. In this paper, we study the problem of answering pr ..."
Abstract
-
Cited by 12 (3 self)
- Add to MetaCart
Uncertain data is inherent in a few important applications such as environmental surveillance and mobile object tracking. Top-k queries (or as known as ranking queries) are often natural and useful in analyzing uncertain data in those applications. In this paper, we study the problem of answering probabilistic threshold top-k queries on uncertain data, which computes uncertain records taking a probability of at least p to be in the top-k list where p is a user specified probability threshold. We present an efficient exact algorithm and a fast sampling algorithm. An empirical study using real and synthetic data sets verifies the effectiveness of probabilistic threshold top-k queries and the efficiency of our methods. 1
Top-k Spatial Joins of Probabilistic Objects
- ICDE
, 2008
"... Probabilistic data have recently become popular in applications such as scientific and geospatial databases. For images and other spatial datasets, probabilistic values can capture the uncertainty in extent and class of the objects in the images. Relating one such dataset to another by spatial joins ..."
Abstract
-
Cited by 11 (0 self)
- Add to MetaCart
Probabilistic data have recently become popular in applications such as scientific and geospatial databases. For images and other spatial datasets, probabilistic values can capture the uncertainty in extent and class of the objects in the images. Relating one such dataset to another by spatial joins is an important operation for data management systems. We propose probabilistic spatial join (PSJ) queries, which rank the results according to a score that incorporates both the uncertainties associated with the objects and the distances between them. We present the first algorithms for two kinds of PSJ queries: Threshold PSJ queries, which return all pairs that score above a given threshold, and top-k PSJ queries, which return the k top-scoring pairs. For threshold PSJ queries, we propose a plane sweep algorithm that, because it exploits the special structure of the problem, runs in O(n (log n + k)) time, where n is the number of points and k is the number of results. We extend the algorithms to multidimensional data and to top-k PSJ queries. To further speed up top-k PSJ queries, we develop a scheduling technique that estimates the scores at the level of blocks, then hands the blocks to the plane sweep algorithm. By finding high-scoring pairs early, the scheduling allows a large portion of the datasets to be pruned. Experiments demonstrate speed-ups of two orders of magnitude.
Indexing Correlated Probabilistic Databases
"... With large amounts of correlated probabilistic data being generated in a wide range of application domains including sensor networks, information extraction, event detection etc., effectively managing and querying them has become an important research direction. While there is an exhaustive body of ..."
Abstract
-
Cited by 11 (1 self)
- Add to MetaCart
With large amounts of correlated probabilistic data being generated in a wide range of application domains including sensor networks, information extraction, event detection etc., effectively managing and querying them has become an important research direction. While there is an exhaustive body of literature on querying independent probabilistic data, supporting efficient queries over large-scale, correlated databases remains a challenge. In this paper, we develop efficient data structures and indexes for supporting inference and decision support queries over such databases. Our proposed hierarchical data structure is suitable both for inmemory and disk-resident databases. We represent the correlations in the probabilistic database using a junction tree over the tuple-existence or attribute-value random variables, and use tree partitioning techniques to build an index structure over it. We show how to efficiently answer inference and aggregation queries using such an index, resulting in orders of magnitude performance benefits in most cases. In addition, we develop novel algorithms for efficiently keeping the index structure up-to-date as changes (inserts, updates) are made to the probabilistic database. We present a comprehensive experimental study illustrating the benefits of our approach to query processing in probabilistic databases.
Cleaning Uncertain Data with Quality Guarantees
, 2008
"... Uncertain or imprecise data are pervasive in applications like location-based services, sensor monitoring, and data collection and integration. For these applications, probabilistic databases can be used to store uncertain data, and querying facilities are provided to yield answers with statistical ..."
Abstract
-
Cited by 6 (1 self)
- Add to MetaCart
Uncertain or imprecise data are pervasive in applications like location-based services, sensor monitoring, and data collection and integration. For these applications, probabilistic databases can be used to store uncertain data, and querying facilities are provided to yield answers with statistical confidence. Given that a limited amount of resources is available to “clean” the database (e.g., by probing some sensor data values to get their latest values), we address the problem of choosing the set of uncertain objects to be cleaned, in order to achieve the best improvement in the quality of query answers. For this purpose, we present the PWS-quality metric, which is a universal measure that quantifies the ambiguity of query answers under the possible world semantics. We study how PWS-quality can be efficiently evaluated for two major query classes: (1) queries that examine the satisfiability of tuples independent of other tuples (e.g., range queries); and (2) queries that require the knowledge of the relative ranking of the tuples (e.g., MAX queries). We then propose a polynomial-time solution to achieve an optimal improvement in PWS-quality. Other fast heuristics are presented as well. Experiments, performed on both real and synthetic datasets, show that the PWS-quality metric can be evaluated quickly, and that our cleaning algorithm provides an optimal solution with high efficiency. To our best knowledge, this is the first work that develops a quality metric for a probabilistic database, and investigates how such a metric can be used for data cleaning purposes.
Representing Tuple and Attribute Uncertainty in Probabilistic Databases
"... There has been a recent surge in work in probabilistic databases, propelled in large part by the huge increase in noisy data sources — sensor data, experimental data, data from uncurated sources, and many others. There is a growing need to be able to flexibly represent the uncertainties in the data, ..."
Abstract
-
Cited by 3 (0 self)
- Add to MetaCart
There has been a recent surge in work in probabilistic databases, propelled in large part by the huge increase in noisy data sources — sensor data, experimental data, data from uncurated sources, and many others. There is a growing need to be able to flexibly represent the uncertainties in the data, and to efficiently query the data. Building on existing probabilistic database work, we present a unifying framework which allows a flexible representation of correlated tuple and attribute level uncertainties. An important capability of our representation is the ability to represent shared correlation structures in the data. We provide motivating examples to illustrate when such shared correlation structures are likely to exist. Representing shared correlations structures allows the use of sophisticated inference techniques based on lifted probabilistic inference that, in turn, allows us to achieve significant speedups while computing probabilities for results of user-submitted queries. 1

