Results 1–10 of 22
UV-Diagram: A Voronoi Diagram for Uncertain Data
2009
Cited by 14 (5 self)
The Voronoi diagram is an important technique for answering nearest-neighbor queries for spatial databases. In this paper, we study how the Voronoi diagram can be used on uncertain data, which are inherent in scientific and business applications. In particular, we propose the Uncertain-Voronoi Diagram (or UV-diagram for short). Conceptually, the data space is divided into distinct “UV-partitions”, where each UV-partition P is associated with a set S of objects; any point q located in P has each object in S as its nearest neighbor with nonzero probability. The UV-diagram facilitates queries that ask for the objects having a nonzero chance of being the nearest neighbor of a given query point. It also allows analysis of nearest-neighbor information, e.g., finding out how many objects can be the nearest neighbor in a given area. However, a UV-diagram requires exponential construction and storage costs. To tackle these problems, we devise an alternative representation for UV-partitions, and develop an adaptive index for the UV-diagram. This index can be constructed in polynomial time. We examine how it can be extended to support other related queries. We also perform extensive experiments to validate the effectiveness of our approach.
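The “nonzero probability of being the nearest neighbor” semantics can be illustrated with a brute-force possible-worlds sampler, which is exactly what the UV-diagram index is designed to avoid. The sketch below is hypothetical (the helper `nn_probabilities` and the instance-set representation are assumptions, not the paper's construction); it estimates each uncertain object's chance of being the nearest neighbor of a query point:

```python
import numpy as np

def nn_probabilities(objects, q, trials=10000, rng=None):
    """Estimate, for each uncertain object, the probability that it is the
    nearest neighbor of query point q.  Each object is an array of sampled
    instances (rows); each trial draws one instance per object."""
    if rng is None:
        rng = np.random.default_rng(0)
    counts = np.zeros(len(objects))
    for _ in range(trials):
        # draw one possible world: a random instance of every object
        pts = np.array([o[rng.integers(len(o))] for o in objects])
        d = np.linalg.norm(pts - q, axis=1)
        counts[np.argmin(d)] += 1
    return counts / trials

# two uncertain objects in 2-D: A near the origin, B far away
A = np.array([[0.0, 0.0], [0.1, 0.1]])
B = np.array([[5.0, 5.0], [5.1, 5.0]])
p = nn_probabilities([A, B], q=np.array([0.0, 0.0]))
# here A is the nearest neighbor in every sampled world
```

An object belongs to the UV-partition of q exactly when its estimated probability is nonzero; the cost of this sampler per query is what motivates the adaptive index.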
Mining Frequent Itemsets over Uncertain Databases
Cited by 13 (1 self)
In recent years, due to the wide applications of uncertain data, mining frequent itemsets over uncertain databases has attracted much attention. In uncertain databases, the support of an itemset is a random variable instead of a fixed occurrence count of this itemset. Thus, unlike the corresponding problem in deterministic databases, where the frequent itemset has a unique definition, the frequent itemset under uncertain environments has two different definitions so far. The first definition, referred to as the expected support-based frequent itemset, employs the expectation of the support of an itemset to measure whether this itemset is frequent. The second definition, referred to as the probabilistic frequent itemset, uses the probability of the support of an itemset to measure its frequency. Thus, existing work on mining frequent itemsets over uncertain databases is divided into two different groups, and no study has been conducted to comprehensively compare the two definitions. In addition, since no uniform experimental platform exists, current solutions for the same definition even generate inconsistent results. In this paper, we first aim to clarify the relationship between the two definitions. Through extensive experiments, we verify that the two definitions have a tight connection and can be unified when the size of the data is large enough. Second, we provide baseline implementations of eight existing representative algorithms and test their performance fairly with uniform measures. Finally, according to the fair tests over many different benchmark data sets, we clarify several existing inconsistent conclusions and discuss some new findings.
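The expected support-based definition is easy to state in code. The sketch below assumes the common model in which each transaction assigns an independent existence probability to each item; `expected_support` is a hypothetical helper, not taken from any of the surveyed implementations:

```python
def expected_support(itemset, db):
    """Expected support of an itemset in an uncertain transaction database.
    db is a list of transactions, each mapping item -> existence probability.
    Under item independence, a transaction contains the itemset with
    probability equal to the product of its items' probabilities, and the
    expected support is the sum of these probabilities over all transactions."""
    total = 0.0
    for t in db:
        p = 1.0
        for item in itemset:
            p *= t.get(item, 0.0)  # absent item: probability 0
        total += p
    return total

db = [{"a": 0.9, "b": 0.5},
      {"a": 0.4},
      {"a": 1.0, "b": 1.0}]
expected_support({"a", "b"}, db)  # 0.9*0.5 + 0 + 1.0 = 1.45
```

The probabilistic definition instead requires the full distribution of the support random variable (e.g., via dynamic programming over transactions), which is why the two groups of algorithms differ so much in cost.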
Clustering Large Probabilistic Graphs
Cited by 6 (0 self)
Abstract—We study the problem of clustering probabilistic graphs. Similar to the problem of clustering standard graphs, probabilistic graph clustering has numerous applications, such as finding complexes in probabilistic protein-protein interaction networks and discovering groups of users in affiliation networks. We extend the edit-distance based definition of graph clustering to probabilistic graphs. We establish a connection between our objective function and correlation clustering to propose practical approximation algorithms for our problem. A benefit of our approach is that our objective function is parameter-free. Therefore, the number of clusters is part of the output. We also develop methods for testing the statistical significance of the output clustering and study the case of noisy clusterings. Using a real protein-protein interaction network and ground-truth data, we show that our methods discover the correct number of clusters and identify established protein relationships. Finally, we show the practicality of our techniques using a large social network of Yahoo! users consisting of one billion edges.
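An edit-distance objective extends to probabilistic graphs by taking expectations edge by edge. A minimal sketch under an edge-independence assumption (the helper `expected_edit_cost` is hypothetical, not the paper's exact formulation): a pair inside a cluster should be an edge, so its expected cost is 1 − p; a pair across clusters should be a non-edge, so its expected cost is p:

```python
from itertools import combinations

def expected_edit_cost(nodes, edge_prob, clustering):
    """Expected edit distance between a probabilistic graph (independent
    edges, probabilities in edge_prob keyed by frozenset pairs) and the
    cluster graph induced by `clustering` (a dict node -> cluster id)."""
    cost = 0.0
    for u, v in combinations(nodes, 2):
        p = edge_prob.get(frozenset((u, v)), 0.0)
        if clustering[u] == clustering[v]:
            cost += 1.0 - p  # edge missing inside a cluster with prob 1 - p
        else:
            cost += p        # edge present across clusters with prob p
    return cost

nodes = ["a", "b", "c"]
edge_prob = {frozenset(("a", "b")): 0.9, frozenset(("b", "c")): 0.2}
cost = expected_edit_cost(nodes, edge_prob, {"a": 0, "b": 0, "c": 1})
# (1 - 0.9) for a-b inside, 0.2 for b-c across, 0.0 for a-c across = 0.3
```

Because the per-pair costs are linear in the edge probabilities, this objective reduces to a weighted correlation-clustering instance, which is the connection the paper exploits.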
Voronoi-based Nearest Neighbor Search for Multi-Dimensional Uncertain Databases
ENFrame: A Platform for Processing Probabilistic Data
Cited by 3 (1 self)
This paper introduces ENFrame, a unified data processing platform for querying and mining probabilistic data. Using ENFrame, users can write programs in a fragment of Python with constructs such as bounded-range loops, list comprehension, aggregate operations on lists, and calls to external database engines. The program is then interpreted probabilistically by ENFrame. The realisation of ENFrame required novel contributions along several directions. We propose an event language that is expressive enough to succinctly encode arbitrary correlations, trace the computation of user programs, and allow for computation of discrete probability distributions of program variables. We exemplify ENFrame on three clustering algorithms: k-means, k-medoids, and Markov clustering. We introduce sequential and distributed algorithms for computing the probability of interconnected events exactly or approximately with error guarantees. Experiments with k-medoids clustering of sensor readings from energy networks show orders-of-magnitude improvements of exact clustering using ENFrame over naïve clustering in each possible world, of approximate over exact, and of distributed over sequential algorithms.
Outlier detection on uncertain data: Objects, instances, and inferences
In ICDE, 2011
Cited by 2 (0 self)
Abstract—This paper studies the problem of outlier detection on uncertain data. We start with a comprehensive model considering both uncertain objects and their instances. An uncertain object has some inherent attributes and consists of a set of instances which are modeled by a probability density distribution. We detect outliers at both the instance level and the object level. To detect outlier instances, it is a prerequisite to know normal instances. By assuming that uncertain objects with similar properties tend to have similar instances, we learn the normal instances for each uncertain object using the instances of objects with similar properties. Consequently, outlier instances can be detected by comparing against normal ones. Furthermore, we can detect outlier objects most of whose instances are outliers. Technically, we use a Bayesian inference algorithm to solve the problem, and develop an approximation algorithm and a filtering algorithm to speed up the computation. An extensive empirical study on both real data and synthetic data verifies the effectiveness and efficiency of our algorithms.
Uncertain Centroid-based Partitional Clustering of Uncertain Data
Cited by 2 (0 self)
Clustering uncertain data has emerged as a challenging task in uncertain data management and mining. Thanks to a computational complexity advantage over other clustering paradigms, partitional clustering has been particularly studied and a number of algorithms have been developed. While existing proposals differ mainly in the notions of cluster centroid and clustering objective function, little attention has been given to an analysis of their characteristics and limits. In this work, we theoretically investigate major existing methods of partitional clustering, and alternatively propose a well-founded approach to clustering uncertain data based on a novel notion of cluster centroid. A cluster centroid is seen as an uncertain object defined in terms of a random variable whose realizations are derived based on all deterministic representations of the objects to be clustered. As demonstrated theoretically and experimentally, this allows for better representing a cluster of uncertain objects, thus supporting a consistently improved clustering performance while maintaining comparable efficiency with existing partitional clustering algorithms.
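For contrast with the uncertain-centroid proposal above, a common partitional baseline (in the spirit of UK-means, not this paper's method) keeps centroids deterministic and assigns each uncertain object, represented here by samples, to the centroid with the smallest expected squared distance; `assign_clusters` is a hypothetical sketch:

```python
import numpy as np

def assign_clusters(objects, centroids):
    """Assign each uncertain object (an array of sampled instances) to the
    deterministic centroid minimizing the expected squared Euclidean
    distance, estimated as the mean squared distance over the samples."""
    labels = []
    for samples in objects:
        # expected squared distance from this object to each centroid
        costs = [np.mean(np.sum((samples - c) ** 2, axis=1))
                 for c in centroids]
        labels.append(int(np.argmin(costs)))
    return labels

objs = [np.array([[0.0, 0.0], [0.2, 0.0]]),   # object near the origin
        np.array([[4.0, 4.0], [4.2, 4.1]])]   # object near (4, 4)
cents = [np.array([0.0, 0.0]), np.array([4.0, 4.0])]
labels = assign_clusters(objs, cents)  # [0, 1]
```

The paper's criticism of such schemes is precisely that a deterministic centroid discards the distributional shape of the cluster, which its uncertain centroid, itself a random variable, retains.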
Characterizing Uncertain Data using Compression
Cited by 2 (1 self)
Motivated by sensor networks, mobility data, biology and life sciences, the area of mining uncertain data has recently received a great deal of attention. While various papers have focused on efficiently mining frequent patterns from uncertain data, the problem of discovering a small set of interesting patterns that provide an accurate and condensed description of a probabilistic database is still unexplored. In this paper we study the problem of discovering characteristic patterns in uncertain data through information-theoretic lenses. Adopting the possible-worlds interpretation of probabilistic data and a compression scheme based on the MDL principle, we formalize the problem of mining patterns that compress the database well in expectation. Despite its huge search space, we show that this problem can be accurately approximated. In particular, we devise a sequence of three methods, where each new method improves the memory requirements by orders of magnitude compared to its predecessor, while giving up only a little in terms of approximation accuracy. We empirically compare our methods on both synthetic data and real data from life science. Results show that from a probabilistic matrix with more than one million rows and columns, we can extract a small set of meaningful patterns that accurately characterize the data distribution of any probable world.
Feature Selection with Mutual Information for Uncertain Data
Cited by 1 (1 self)
Abstract. In many real-world situations, the data cannot be assumed to be precise. Indeed, uncertain data are often encountered, due for example to the imprecision of measurement devices or to continuously moving objects for which the exact position is impossible to obtain. One way to model this uncertainty is to represent each data value as a probability distribution function; recent works show that adequately taking the uncertainty into account generally leads to improved classification performance. Working with such a representation, this paper proposes to achieve feature selection based on mutual information. Experiments on 8 UCI data sets show that the proposed approach is effective at selecting relevant features.
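Mutual-information-based selection ranks features by the empirical MI between each (discretized) feature and the class label. The plain-Python sketch below shows the underlying quantity; `mutual_information` is a hypothetical helper, not the authors' uncertainty-aware estimator:

```python
from collections import Counter
from math import log2

def mutual_information(xs, ys):
    """Empirical mutual information I(X;Y) in bits between two discrete
    variables, estimated from paired observations."""
    n = len(xs)
    px, py = Counter(xs), Counter(ys)
    pxy = Counter(zip(xs, ys))
    # sum over observed joint outcomes: p(x,y) * log2(p(x,y) / (p(x) p(y)))
    return sum((c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

# a feature perfectly predictive of a binary class carries 1 bit
feature = [0, 0, 1, 1]
label = ["a", "a", "b", "b"]
mutual_information(feature, label)  # 1.0
```

The paper's contribution is to evaluate this kind of criterion when each feature value is itself a probability distribution rather than a single observation.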
A Survey of Clustering Uncertain Data Based Probability Distribution Similarity
Abstract—Clustering is one of the major tasks in the field of data mining. The main aim of clustering is to group similar objects into one group, based on the similarity between the objects found in their data.