Consensus answers for queries over probabilistic databases
- In PODS, 2009
Cited by 9 (1 self)
We address the problem of finding a “best” deterministic query answer to a query over a probabilistic database. For this purpose, we propose the notion of a consensus world (or a consensus answer), which is a deterministic world (answer) that minimizes the expected distance to the possible worlds (answers). This problem can be seen as a generalization of the well-studied inconsistent information aggregation problems (e.g. rank aggregation) to probabilistic databases. We consider this problem for various types of queries including SPJ queries, Top-k ranking queries, group-by aggregate queries, and clustering. For different distance metrics, we obtain polynomial time optimal or approximation algorithms for computing the consensus answers (or prove NP-hardness). Most of our results are for a general probabilistic database model, called the and/xor tree model, which significantly generalizes previous probabilistic database models like x-tuples and block-independent disjoint models, and is of independent interest.
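To make the consensus-answer idea in the abstract above concrete, here is a minimal sketch for one simple setting. It assumes the distance is the symmetric difference between tuple sets and that any subset of tuples is an admissible answer; under those assumptions, linearity of expectation means the expected distance is minimized by keeping exactly the tuples whose marginal probability exceeds 0.5. The tuple names and probabilities are hypothetical, and the sketch does not reproduce the paper's and/xor tree algorithms.

```python
def expected_sym_diff(world, marginals):
    """Expected symmetric-difference distance from `world` to a random possible world.

    A tuple kept in `world` costs 1 whenever it is absent (probability 1 - p);
    a tuple left out costs 1 whenever it is present (probability p).
    """
    return sum((1.0 - p) if t in world else p for t, p in marginals.items())


def consensus_world(marginals):
    """Keep each tuple whose marginal probability exceeds 0.5 (assumed setting)."""
    return {t for t, p in marginals.items() if p > 0.5}


# Hypothetical marginal probabilities of three tuples.
marginals = {"t1": 0.9, "t2": 0.4, "t3": 0.6}
best = consensus_world(marginals)
best_cost = expected_sym_diff(best, marginals)
```

Only the marginals matter here, so the same rule applies whether or not the tuples are correlated, as long as every subset is allowed as an answer.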
Provenance for aggregate queries
- In PODS, 2011. Available at http://arxiv.org/abs/1101.1110
Cited by 7 (3 self)
doi:10.1145/1989284.1989302 © ACM, 2011. This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The
k-Nearest Neighbors in Uncertain Graphs
Cited by 5 (1 self)
Complex networks, such as biological, social, and communication networks, often entail uncertainty, and thus, can be modeled as probabilistic graphs. Similar to the problem of similarity search in standard graphs, a fundamental problem for probabilistic graphs is to efficiently answer k-nearest neighbor queries (k-NN), which is the problem of computing the k closest nodes to some specific node. In this paper we introduce a framework for processing k-NN queries in probabilistic graphs. We propose novel distance functions that extend well-known graph concepts, such as shortest paths. In order to compute them in probabilistic graphs, we design algorithms based on sampling. During k-NN query processing we efficiently prune the search space using novel techniques. Our experiments indicate that our distance functions outperform previously used alternatives in identifying true neighbors in real-world biological data. We also demonstrate that our algorithms scale for graphs with tens of millions of edges.
Ranking Continuous Probabilistic Datasets
Cited by 3 (0 self)
Ranking is a fundamental operation in data analysis and decision support, and plays an even more crucial role if the dataset being explored exhibits uncertainty. This has led to much work in understanding how to rank uncertain datasets in recent years. In this paper, we address the problem of ranking when the tuple scores are uncertain, and the uncertainty is captured using continuous probability distributions (e.g. Gaussian distributions). We present a comprehensive solution to compute the values of a parameterized ranking function (PRF) [18] for arbitrary continuous probability distributions (and thus rank the uncertain dataset); PRF can be used to simulate or approximate many other ranking functions proposed in prior work. We develop exact polynomial time algorithms for some continuous probability distribution classes, and efficient approximation schemes with provable guarantees for arbitrary probability distributions. Our algorithms can also be used for exact or approximate evaluation of k-nearest neighbor queries over uncertain objects, whose positions are modeled using continuous probability distributions. Our experimental evaluation over several datasets illustrates the effectiveness of our approach at efficiently ranking uncertain datasets with continuous attribute uncertainty.
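The ranking problem described in the abstract above can be illustrated with a small sketch. It assumes the PRF value of a tuple is a weighted sum of its rank probabilities, i.e. the sum over ranks i of weight(i) times Pr(rank = i), and it estimates that sum by plain Monte Carlo sampling over independent Gaussian scores. This is a stand-in for illustration only, not the exact or approximation algorithms the paper develops; the function `estimate_prf`, the tuple names, and the distribution parameters are all hypothetical.

```python
import random


def estimate_prf(tuples, weight, samples=20000, seed=0):
    """Monte Carlo estimate of sum_i weight(i) * Pr(rank(t) = i) for every tuple t.

    `tuples` maps a name to the (mean, stddev) of an independent Gaussian score.
    Rank 1 is the highest sampled score.
    """
    rng = random.Random(seed)
    totals = {name: 0.0 for name in tuples}
    for _ in range(samples):
        scores = {name: rng.gauss(mu, sigma) for name, (mu, sigma) in tuples.items()}
        ranked = sorted(scores, key=scores.get, reverse=True)
        for rank, name in enumerate(ranked, start=1):
            totals[name] += weight(rank)
    return {name: total / samples for name, total in totals.items()}


# Hypothetical uncertain scores: (mean, standard deviation).
tuples = {"a": (10.0, 1.0), "b": (9.0, 3.0), "c": (5.0, 0.5)}

# With weight 1 for rank 1 and 0 otherwise, the value is the probability of ranking first.
top1 = estimate_prf(tuples, lambda rank: 1.0 if rank == 1 else 0.0)
ordering = sorted(top1, key=top1.get, reverse=True)
```

Swapping in a different weight function (for example, one that is 1 for ranks up to k) changes which ranking semantics the same estimator approximates.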
Relevance and ranking in online dating systems
- In SIGIR, 2010
Cited by 3 (1 self)
Match-making systems refer to systems where users want to meet other individuals to satisfy some underlying need. Examples of match-making systems include dating services, resume/job bulletin boards, community based question answering, and consumer-to-consumer marketplaces. One fundamental component of a match-making system is the retrieval and ranking of candidate matches for a given user. We present the first in-depth study of information retrieval approaches applied to match-making systems. Specifically, we focus on retrieval for a dating service. This domain offers several unique problems not found in traditional information retrieval tasks. These include two-sided relevance, very subjective relevance, extremely few relevant matches, and structured queries. We propose a machine learned ranking function that makes use of features extracted from the uniquely rich user profiles that consist of both structured and unstructured attributes. An extensive evaluation carried out using data gathered from a real online dating service shows the benefits of our proposed methodology with respect to traditional match-making baseline systems. Our analysis also provides deep insights into the aspects of match-making that are particularly important for producing highly relevant matches.
Dissociation and Propagation for Efficient Query Evaluation over Probabilistic Databases
Cited by 2 (1 self)
Queries over probabilistic databases are either safe, in which case they can be evaluated entirely in a relational database engine, or unsafe, in which case they need to be evaluated with a general-purpose inference engine at a high cost. This paper proposes a new approach by which every query is evaluated like a safe query inside the database engine, by using a new method called dissociation. A dissociated query is obtained by adding extraneous variables to some atoms until the query becomes safe. We show that the probability of the original query and that of the dissociated query correspond to two well-known scoring functions on graphs, namely graph reliability (which is #P-hard), and the propagation score (which is related to PageRank and is in PTIME): When restricted to graphs, standard query probability is graph reliability, while the dissociated probability is the propagation score. We define a propagation score for conjunctive queries without self-joins and prove (i) that it is always an upper bound for query reliability, and (ii) that both scores coincide for all safe queries. Given the widespread and successful use of graph propagation methods in practice, we argue for the dissociation method as a good and efficient way to rank probabilistic query results, especially for those queries which are highly intractable for exact probabilistic inference.
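The contrast between graph reliability and the propagation score drawn in the abstract above can be seen on a tiny graph. The sketch below assumes a common form of the propagation score on a DAG, score(v) = 1 - product over incoming edges (u, v) of (1 - p(u, v) * score(u)), with the source fixed at 1; that definition, the helper names, and the example graph are assumptions made for illustration rather than the paper's exact formulation. Reliability is computed by brute-force enumeration of possible worlds, which is only feasible for very small graphs.

```python
from itertools import product


def reliability(edges, source, target):
    """Exact source-to-target reliability by enumerating every possible world."""
    items = list(edges.items())
    total = 0.0
    for present in product([False, True], repeat=len(items)):
        prob = 1.0
        adj = {}
        for keep, ((u, v), p) in zip(present, items):
            prob *= p if keep else 1.0 - p
            if keep:
                adj.setdefault(u, []).append(v)
        # Depth-first search over the edges present in this world.
        seen, stack = {source}, [source]
        while stack:
            for nxt in adj.get(stack.pop(), []):
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        if target in seen:
            total += prob
    return total


def propagation(edges, source, topo_order):
    """Propagation score on a DAG, treating incoming paths as if independent."""
    score = {source: 1.0}
    for node in topo_order:
        if node == source:
            continue
        miss = 1.0
        for (u, v), p in edges.items():
            if v == node:
                miss *= 1.0 - p * score.get(u, 0.0)
        score[node] = 1.0 - miss
    return score


# Hypothetical graph: both paths from s to t share the uncertain edge (s, a).
edges = {("s", "a"): 0.5, ("a", "b"): 1.0, ("a", "c"): 1.0,
         ("b", "t"): 1.0, ("c", "t"): 1.0}
rel = reliability(edges, "s", "t")
prop = propagation(edges, "s", ["s", "a", "b", "c", "t"])
```

On this graph the propagation score counts the shared edge once per path, so it overestimates the reliability, which is consistent with the upper-bound property stated in the abstract.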
Nearest-neighbor Queries in Probabilistic Graphs
Cited by 1 (0 self)
Large probabilistic graphs arise in various domains spanning from social networks to biological and communication networks. An important query in these graphs is the k-nearest-neighbor query, which involves finding and reporting the k closest nodes to a specific node. This query assumes the existence of a measure of the “proximity” or the “distance” between any two nodes in the graph. To that end, we propose various novel distance functions that extend well-known notions of classical graph theory, such as shortest paths and random walks. We argue that many meaningful distance functions are computationally intractable to compute exactly. Thus, in order to process nearest-neighbor queries, we resort to Monte Carlo sampling and exploit novel graph-transformation ideas and pruning opportunities. In our extensive experimental analysis, we explore the trade-offs of our approximation algorithms and demonstrate that they scale well on real-world probabilistic graphs with tens of millions of edges.
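The Monte Carlo approach mentioned in the abstract above can be sketched as follows. The sketch assumes one particular distance function, the median shortest-path distance over sampled possible worlds, with unweighted undirected edges that exist independently; this choice, the function names, and the example graph are illustrative assumptions. It omits the graph-transformation and pruning techniques the paper describes and simply samples whole worlds.

```python
import random
from collections import deque
from statistics import median_low


def sample_distances(edges, source, rng):
    """Sample one possible world and return BFS hop distances from `source`."""
    adj = {}
    for (u, v), p in edges.items():
        if rng.random() < p:  # the edge exists in this world with probability p
            adj.setdefault(u, []).append(v)
            adj.setdefault(v, []).append(u)
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in adj.get(node, []):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


def knn_median(edges, source, k, samples=2000, seed=0):
    """Return the k nodes with the smallest sampled median distance from `source`."""
    rng = random.Random(seed)
    nodes = {n for edge in edges for n in edge} - {source}
    observed = {n: [] for n in nodes}
    for _ in range(samples):
        dist = sample_distances(edges, source, rng)
        for n in nodes:
            # Unreachable nodes get infinite distance in this world.
            observed[n].append(dist.get(n, float("inf")))
    med = {n: median_low(vals) for n, vals in observed.items()}
    return sorted(nodes, key=lambda n: (med[n], n))[:k]


# Hypothetical probabilistic graph: edge -> existence probability.
edges = {("s", "a"): 0.9, ("a", "b"): 0.9, ("s", "c"): 0.2, ("c", "d"): 0.9}
neighbors = knn_median(edges, "s", k=2)
```

Using the median rather than the mean keeps the distance finite for nodes that are reachable in most worlds, even if they are disconnected in some.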
Efficient and Effective Similarity Search over Probabilistic Data based on Earth Mover’s Distance
- 2010
Cited by 1 (0 self)
Probabilistic data is arriving as a new deluge, driven by technical advances in geographical tracking, multimedia processing, sensor networks, and RFID. While similarity search is an important functionality for manipulating probabilistic data, it raises new challenges for traditional relational databases. The problem stems from the limited effectiveness of the distance metrics supported by existing database systems. On the other hand, some more complex distance operators have proven their value through better distinguishing ability in the probabilistic domain. In this paper, we discuss the similarity search problem with the Earth Mover’s Distance, which is the most successful distance metric on probabilistic histograms and an expensive operator with cubic complexity. We present a new database approach to answer range queries and k-nearest neighbor queries on probabilistic data, on the basis of Earth Mover’s Distance. Our solution utilizes the primal-dual theory in linear programming and deploys B+ tree index structures for effective candidate pruning. Extensive experiments show that our proposal dramatically improves the scalability of probabilistic databases.
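As a small illustration of the distance discussed in the abstract above, the sketch below computes the Earth Mover's Distance in a special case only: one-dimensional histograms with equal total mass and ground distance |i - j| between bins, where the distance reduces to the sum of absolute differences of the cumulative sums. The general case is a transportation problem, and the k-NN search here is a plain linear scan rather than the primal-dual and B+ tree pruning the paper proposes. Function names and histograms are hypothetical.

```python
from itertools import accumulate


def emd_1d(p, q):
    """Earth Mover's Distance between two 1-D histograms of equal total mass.

    Assumes ground distance |i - j| between bins i and j. The running sum of
    (p - q) is the mass that must cross each bin boundary.
    """
    if len(p) != len(q):
        raise ValueError("histograms must have the same number of bins")
    carried = accumulate(a - b for a, b in zip(p, q))
    return sum(abs(c) for c in carried)


def knn_by_emd(query, histograms, k):
    """Linear-scan k-nearest-neighbor search under `emd_1d` (no index, no pruning)."""
    return sorted(histograms, key=lambda name: emd_1d(query, histograms[name]))[:k]


# Hypothetical probabilistic histograms over three bins.
query = [0.5, 0.5, 0.0]
histograms = {
    "h1": [0.5, 0.5, 0.0],
    "h2": [0.0, 0.5, 0.5],
    "h3": [0.0, 1.0, 0.0],
}
nearest = knn_by_emd(query, histograms, k=2)
```

For example, turning `query` into `h2` requires moving 0.5 units of mass across two bins, so the distance is 1.0, while `h3` only needs 0.5 units moved across one bin.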
A Generic Framework for Handling Uncertain Data with Local Correlations
Cited by 1 (0 self)
Data uncertainty is ubiquitous in many real-world applications such as sensor/RFID data analysis. In this paper, we investigate uncertain data that exhibit local correlations, that is, each uncertain object is only locally correlated with a small subset of data, while being independent of others. We propose a generic framework for dealing with this kind of uncertain and locally correlated data, in which we investigate a classical spatial query, nearest neighbor query, on uncertain data with local correlations (namely LC-PNN). Most importantly, to enable fast LC-PNN query processing, we propose a novel filtering technique via offline pre-computations to reduce the query search space. We demonstrate through extensive experiments the efficiency and effectiveness of our approaches.
Probabilities and Sets in Preference Querying
- 2010
I am deeply indebted to my adviser Dr. Jan Chomicki for his continuous encouragement, guidance and support in every stage of my PhD study. His enthusiasm, knowledge and diligence have been a constant source of inspiration for me. Through him, I see what a great researcher should be. His vision and experience guided me consistently through the dissertation writing and made this experience enjoyable. I am very grateful to the other brilliant faculty members who served on my committee: Dr. Hung Q. Ngo and Dr. Michalis Petropoulos. Dr. Hung Q. Ngo, whose lectures were by far my favorites, enlightened me on numerous math puzzles, one of which later turned out to be very useful in this dissertation. Dr. Michalis Petropoulos provided warm encouragement and valuable comments throughout the course of writing this dissertation. I would also like to thank the distinguished database researchers at AT&T Labs, Dr. Graham Cormode, Dr. Lukasz Golab, Dr. Flip Korn and Dr. Divesh Srivastava, with whom I had the honor to work. Their passion, curiosity and spirits are contagious. They generously shared their vision and provided insightful comments on early drafts of my work and other research problems. This dissertation is dedicated to my beloved husband Chao Chen, my most loving parents Ling Wei and Zhemin Zhang, and my most loyal friends Lan, Jing, Michelle, Sixia and Slawek. It simply would not have been possible without their unconditional love and support.

