Distance-constraint reachability computation in uncertain graphs
 PVLDB
Cited by 12 (5 self)
Driven by emerging network applications, querying and mining uncertain graphs has become increasingly important. In this paper, we investigate a fundamental problem concerning uncertain graphs, which we call the distance-constraint reachability (DCR) problem: Given two vertices s and t, what is the probability that the distance from s to t is less than or equal to a user-defined threshold d in the uncertain graph? Since this problem is #P-complete, we focus on efficiently and accurately approximating DCR online. Our main results include two new estimators for probabilistic reachability. One is a Horvitz-Thompson type estimator based on an unequal probability sampling scheme, and the other is a novel recursive sampling estimator, which effectively combines a deterministic recursive computational procedure with a sampling process to boost the estimation accuracy. Both estimators can produce much smaller variance than the direct sampling estimator, which considers each trial to be either 1 or 0. We also present methods to make these estimators more computationally efficient. A comprehensive experimental evaluation on both real and synthetic datasets demonstrates the efficiency and accuracy of our new estimators.
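As a point of reference for the abstract above, the direct sampling estimator it mentions (each trial scores 1 or 0) can be sketched as follows. This is a minimal illustration of the baseline only, not the paper's Horvitz-Thompson or recursive estimators. The function names, the `(u, v, p)` edge-triple representation, and the assumption that every edge has unit length are all hypothetical choices made for this sketch.

```python
import random
from collections import deque


def sample_world(edges, rng):
    """Draw one possible world: keep each directed edge (u, v, p) with probability p."""
    adj = {}
    for u, v, p in edges:
        if rng.random() < p:
            adj.setdefault(u, []).append(v)
    return adj


def within_distance(adj, s, t, d):
    """Return True if t is reachable from s in at most d hops (unit edge lengths assumed)."""
    if s == t:
        return True
    seen = {s}
    frontier = deque([(s, 0)])
    while frontier:
        node, dist = frontier.popleft()
        if dist == d:
            continue  # expanding this node would exceed the distance threshold
        for nxt in adj.get(node, []):
            if nxt == t:
                return True
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, dist + 1))
    return False


def estimate_dcr(edges, s, t, d, trials=10000, seed=0):
    """Direct sampling: the fraction of sampled worlds in which dist(s, t) <= d."""
    rng = random.Random(seed)
    hits = sum(
        within_distance(sample_world(edges, rng), s, t, d) for _ in range(trials)
    )
    return hits / trials
```

Because every trial contributes only 0 or 1, the variance of this estimate is p(1 - p) / trials for true probability p, which is the baseline the two proposed estimators are said to improve on.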
On Graph Query Optimization in Large Networks
Cited by 12 (2 self)
The dramatic proliferation of sophisticated networks has resulted in a growing need for effective querying and mining methods over such large-scale graph-structured data. At the core of many advanced network operations lies a common and critical graph query primitive: how to search graph structures efficiently within a large network? Unfortunately, the graph query is hard due to the NP-complete nature of subgraph isomorphism. It becomes even more challenging when the network examined is large and diverse. In this paper, we present a high-performance graph indexing mechanism, SPath, to address the graph query problem on large networks. SPath leverages decomposed shortest paths around vertex neighborhoods as basic indexing units, which prove to be both effective in graph search space pruning and highly scalable in index construction and deployment. Via SPath, a graph query is processed and optimized beyond the traditional vertex-at-a-time fashion to a more efficient path-at-a-time way: the query is first decomposed into a set of shortest paths, among which a subset of candidates with good selectivity is picked by a query plan optimizer; candidate paths are then joined together to recover the query graph and finalize the graph query processing. We evaluate SPath against the state-of-the-art GraphQL on both real and synthetic data sets. Our experimental studies demonstrate the effectiveness and scalability of SPath, which proves to be a more practical and efficient indexing method for addressing graph queries on large networks.
Fast and accurate estimation of shortest paths in large graphs
 In Proceedings of the Conference on Information and Knowledge Management (CIKM), 2010
Cited by 9 (0 self)
Computing shortest paths between two given nodes is a fundamental operation over graphs, but is known to be nontrivial over large disk-resident instances of graph data. While a number of techniques exist for answering reachability queries and approximating node distances efficiently, determining actual shortest paths (i.e., the sequence of nodes involved) is often neglected. However, in applications arising in massive online social networks, biological networks, and knowledge graphs, it is often essential to find many, if not all, shortest paths between two given nodes. In this paper, we address this problem and present a scalable sketch-based index structure that not only supports estimation of node distances, but also computes the corresponding shortest paths themselves. Generating the actual path information allows for further improvements to the estimation accuracy of distances (and paths), leading to near-exact shortest-path approximations in real-world graphs. We evaluate our techniques, implemented within a fully functional RDF graph database system, over large real-world social and biological networks of sizes ranging from tens of thousands to millions of nodes and edges. Experiments on several datasets show that we can achieve query response times providing several orders of magnitude speedup over traditional path computations while keeping the estimation errors between 0% and 1% on average.
An Optimal Labeling Scheme for Workflow Provenance Using Skeleton Labels
Cited by 8 (5 self)
We develop a compact and efficient reachability labeling scheme for answering provenance queries on workflow runs that conform to a given specification. Even though a workflow run can be structurally more complex and can be arbitrarily larger than the specification due to fork (parallel) and loop executions, we show that a compact reachability labeling for a run can be efficiently computed using the fact that it originates from a fixed specification. Our labeling scheme is optimal in the sense that it uses labels of logarithmic length, runs in linear time, and answers any reachability query in constant time. Our approach is based on using the reachability labeling for the specification as an effective skeleton for designing the reachability labeling for workflow runs. We also demonstrate empirically the effectiveness of our skeleton-based labeling approach.
Graph Data Management and Mining: A Survey of Algorithms and Applications
, 2010
Cited by 4 (0 self)
Graph mining and management has become a popular area of research in recent years because of its numerous applications in a wide variety of practical fields, including computational biology, software bug localization and computer networking. Different applications result in graphs of different sizes and complexities. Correspondingly, the applications have different requirements for the underlying mining algorithms. In this chapter, we will provide a survey of different kinds of graph mining and management algorithms. We will also discuss a number of applications, which are dependent upon graph representations. We will discuss how the different graph mining algorithms can be adapted for different applications. Finally, we will discuss important avenues of future research in the area.
Labeling Recursive Workflow Executions On-the-Fly
, 2011
Cited by 3 (2 self)
This paper presents a compact labeling scheme for answering reachability queries over workflow executions. In contrast to previous work, our scheme allows nodes (processes and data) in the execution graph to be labeled on-the-fly, i.e., in a dynamic fashion. In this way, reachability queries can be answered as soon as the relevant data is produced. We first show that, in general, for workflows that contain recursion, dynamic labeling of executions requires long (linear-size) labels. Fortunately, most real-life scientific workflows are linear recursive, and for this natural class we show that dynamic, yet compact (logarithmic-size) labeling is possible. Moreover, our scheme labels the executions in linear time, and answers any reachability query in constant time. We also show that linear recursive workflows are, in some sense, the largest class of workflows that allow compact, dynamic labeling schemes. Interestingly, the empirical evaluation, performed over both real and synthetic workflows, shows that our proposed dynamic scheme outperforms the state-of-the-art static scheme for large executions, and creates labels that are shorter by a factor of almost 3.
Densest Subgraph in Streaming and MapReduce
Cited by 3 (1 self)
The problem of finding locally dense components of a graph is an important primitive in data analysis, with wide-ranging applications from community mining to spam detection and the discovery of biological network modules. In this paper we present new algorithms for finding the densest subgraph in the streaming model. For any $\epsilon > 0$, our algorithms make $O(\log_{1+\epsilon} n)$ passes over the input and find a subgraph whose density is guaranteed to be within a factor $2(1 + \epsilon)$ of the optimum. Our algorithms are also easily parallelizable and we illustrate this by realizing them in the MapReduce model. In addition we perform extensive experimental evaluation on massive real-world graphs showing the performance and scalability of our algorithms in practice.
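The abstract above does not spell out the algorithm, so the following is only a sketch of one threshold-peeling procedure that is consistent with the stated pass bound and approximation factor. It is an assumption of this sketch, not a statement of the paper's method, that each pass removes every node whose degree is at most 2(1 + ε) times the current density |E(S)| / |S|. The function name and the in-memory edge list are hypothetical, and each loop iteration stands in for one streaming pass.

```python
def densest_subgraph(edges, eps=0.1):
    """Threshold peeling over an undirected simple graph given as (u, v) pairs.

    Returns (node_set, density), where density = edges / nodes of the best
    intermediate subgraph seen during peeling.
    """
    cur_edges = [(u, v) for u, v in edges if u != v]  # ignore self-loops
    alive = {u for e in cur_edges for u in e}
    best, best_density = set(alive), 0.0
    while alive:
        # One "pass": count degrees and edges within the surviving subgraph.
        deg = dict.fromkeys(alive, 0)
        for u, v in cur_edges:
            deg[u] += 1
            deg[v] += 1
        density = len(cur_edges) / len(alive)
        if density > best_density:
            best, best_density = set(alive), density
        # The average degree is 2 * density, so at least one node is at or
        # below the threshold and the loop always makes progress.
        threshold = 2 * (1 + eps) * density
        alive -= {u for u in alive if deg[u] <= threshold}
        cur_edges = [(u, v) for u, v in cur_edges if u in alive and v in alive]
    return best, best_density
```

Since more than a 1/(1 + ε) fraction of the surviving nodes cannot have degree above the threshold, each iteration shrinks the node set by at least that factor, which is where a logarithmic pass count of this shape comes from.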
K-Reach: Who is in your small world
 PVLDB
Cited by 3 (2 self)
We study the problem of answering k-hop reachability queries in a directed graph, i.e., whether there exists a directed path of length k, from a source query vertex to a target query vertex in the input graph. The problem of k-hop reachability is a generalization of the classic reachability problem (where k = ∞). Existing indexes for processing classic reachability queries, as well as for processing shortest path queries, are not applicable or not efficient for processing k-hop reachability queries. We propose an index for processing k-hop reachability queries, which is simple in design and efficient to construct. Our experimental results on a wide range of real datasets show that our index is more efficient than the state-of-the-art indexes even for processing classic reachability queries, for which these indexes are primarily designed. We also show that our index is efficient in answering k-hop reachability queries.
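To make the query semantics concrete, here is an index-free baseline: a breadth-first search cut off after k levels. It is not the index proposed in the paper. The sketch reads "path of length k" as "at most k hops", which is an assumption chosen so that k = ∞ reduces to classic reachability, and the function name and adjacency-dictionary input are hypothetical.

```python
import math


def k_hop_reachable(adj, s, t, k=math.inf):
    """Return True if t can be reached from s using at most k directed edges.

    adj maps each vertex to an iterable of its out-neighbors. With the
    default k = inf this is ordinary reachability.
    """
    if s == t:
        return True
    seen = {s}
    frontier = [s]
    hops = 0
    while frontier and hops < k:
        hops += 1  # every vertex discovered in this round is `hops` edges from s
        next_frontier = []
        for u in frontier:
            for v in adj.get(u, ()):
                if v == t:
                    return True
                if v not in seen:
                    seen.add(v)
                    next_frontier.append(v)
        frontier = next_frontier
    return False
```

Each query costs time proportional to the part of the graph within k hops of s, which is the per-query cost an index is meant to avoid.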