All-distances sketches, revisited: HIP estimators for massive graphs analysis
Proc. 33rd ACM Symposium on Principles of Database Systems, ACM, 2014
Cited by 7 (4 self)
Graph datasets with billions of edges, such as social and Web graphs, are prevalent. To be feasible, computation on such large graphs should scale linearly with graph size. All-distances sketches (ADSs) are emerging as a powerful tool for scalable computation of some basic properties of individual nodes or the whole graph. ADSs were first proposed two decades ago (Cohen 1994) and more recent algorithms include ANF (Palmer, Gibbons, and Faloutsos 2002) and hyperANF (Boldi, Rosa, and Vigna 2011). A sketch of logarithmic size is computed for each node in the graph and the computation in total requires only a near-linear number of edge relaxations. From the ADS of a node, we can estimate its neighborhood cardinalities (the number of nodes within some query distance) and closeness centrality. More generally we can estimate the distance distribution, effective diameter, similarities, and other parameters of the full graph. We make several contributions which facilitate a more effective use of ADSs for scalable analysis of massive graphs. We provide, for the first time, a unified exposition of ADS algorithms and applications. We present the Historic Inverse Probability (HIP) estimators which are applied to the ADS of a node to estimate a large natural class of queries including neighborhood cardinalities and closeness centralities. We show that our HIP estimators have at most half the variance of previous neighborhood cardinality estimators and that this is essentially optimal. Moreover, HIP obtains a polynomial improvement for more general queries and the estimators are simple, flexible, unbiased, and elegant. We apply HIP for approximate distinct counting on streams by comparing HIP and the original estimators applied to the HyperLogLog MinHash sketches (Flajolet et al. 2007). We demonstrate significant improvement in estimation quality for this state-of-the-art practical algorithm and also illustrate the ease of applying HIP.
Finally, we study the quality of ADS estimation of distance ranges, generalizing the near-linear time factor-2 approximation of the diameter.
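The HIP idea for distinct counting on a stream can be illustrated with a small sketch. This is not the authors' code: for brevity it uses a bottom-k MinHash sketch rather than the HyperLogLog sketch mentioned in the abstract, and the class name HipDistinctCounter, the helper _unit_hash, and the parameters k and seed are all hypothetical. Each time an item modifies the sketch, the counter grows by the inverse of the probability (given the history so far) that a new item would have modified it.

```python
import hashlib
import heapq


def _unit_hash(item, seed=0):
    """Map an item to a pseudo-random value in (0, 1] (illustrative helper)."""
    digest = hashlib.blake2b(f"{seed}:{item}".encode(), digest_size=8).digest()
    return (int.from_bytes(digest, "big") + 1) / 2**64


class HipDistinctCounter:
    """Bottom-k MinHash sketch with a HIP-style running estimate.

    Assumption: a bottom-k sketch stands in for HyperLogLog here.
    """

    def __init__(self, k=64, seed=0):
        self.k = k
        self.seed = seed
        self._heap = []        # negated hashes: max-heap over the k smallest
        self._members = set()  # hashes currently held in the sketch
        self.estimate = 0.0    # running HIP estimate of the distinct count

    def add(self, item):
        h = _unit_hash(item, self.seed)
        if h in self._members:
            return  # repeat of an item already in the sketch
        if len(self._heap) < self.k:
            tau = 1.0  # sketch not full: any new item enters with probability 1
        else:
            tau = -self._heap[0]  # current k-th smallest hash
            if h >= tau:
                return  # sketch unchanged, so the estimate is unchanged
        # The item entered the sketch; tau was its inclusion probability.
        self.estimate += 1.0 / tau
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, -h)
        else:
            evicted = -heapq.heappushpop(self._heap, -h)
            self._members.discard(evicted)
        self._members.add(h)


counter = HipDistinctCounter(k=64)
for value in ["a", "b", "a", "c", "b"]:
    counter.add(value)
# Fewer than k distinct items were seen, so every update added exactly 1.
print(counter.estimate)
```

Because the threshold tau only decreases over time, an item that was rejected or evicted earlier can never re-enter the sketch, so repeated items do not inflate the estimate.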
Scalable similarity estimation in social networks: closeness, node labels, and random edge lengths
In COSN, 2013
Cited by 6 (4 self)
Similarity estimation between nodes based on structural properties of graphs is a basic building block used in the analysis of massive networks for diverse purposes such as link prediction, product recommendations, advertisement, collaborative filtering, and community discovery. While local similarity measures, based on properties of immediate neighbors, are easy to compute, those relying on global properties have better recall. Unfortunately, this better quality comes with a computational price tag. Aiming for both accuracy and scalability, we make several contributions. First, we define closeness similarity, a natural measure that compares two nodes based on the similarity of their relations to all other nodes. Second, we show how the all-distances sketch (ADS) node labels, which are efficient to compute, can support the estimation of closeness similarity and shortest-path (SP) distances in logarithmic query time. Third, we propose the randomized edge lengths (REL) technique and define the corresponding REL distance, which captures both path length and path multiplicity and therefore improves over the SP distance as a similarity measure. The REL distance can also be the basis of closeness similarity and can be estimated using SP computation or the ADS labels. We demonstrate the effectiveness of our measures and the accuracy of our estimates through experiments on social networks with up to tens of millions of nodes.
Hop Doubling Label Indexing for Point-to-Point Distance Querying on Scale-Free Networks
Cited by 5 (0 self)
We study the problem of point-to-point distance querying for massive scale-free graphs, which is important for numerous applications. Given a directed or undirected graph, we propose to build an index for answering such queries based on a novel hop-doubling labeling technique. We derive bounds on the index size, the computation costs and I/O costs based on the properties of unweighted scale-free graphs. We show that our method is much more efficient and effective than the state-of-the-art techniques, in terms of both querying time and indexing costs. Our empirical study shows that our method can handle graphs that are orders of magnitude larger than those existing methods can handle.
Hub labels: Theory and practice
In SEA, 2014
Cited by 4 (0 self)
The Hub Labeling algorithm (HL) is an exact shortest path algorithm with excellent query performance on some classes of problems. It precomputes some auxiliary information (stored as a label) for each vertex, and its query performance depends only on the label size. While there are polynomial-time approximation algorithms to find labels of approximately optimal size, practical solutions use hierarchical hub labels (HHL), which are faster to compute but offer no guarantee on the label size. We improve the theoretical and practical performance of the HL approximation algorithms, enabling us to compute such labels for moderately large problems. Our comparison shows that HHL algorithms scale much better and find labels that usually are not much bigger than the theoretically justified HL labels.
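Why query time depends only on label size can be seen from a minimal sketch of a hub-label distance query. It assumes an undirected graph whose labels are stored as dictionaries mapping each hub to its distance from the vertex; the function name hub_label_distance and the toy labels are hypothetical, and label construction (the hard part addressed by the paper) is not shown.

```python
import math


def hub_label_distance(label_u, label_v):
    """Return min over shared hubs h of d(u, h) + d(h, v), or inf if none.

    Each label is a dict {hub: distance}. For directed graphs one would
    pass the forward label of u and the backward label of v instead.
    """
    # Scan the smaller label and probe the larger one.
    if len(label_u) > len(label_v):
        label_u, label_v = label_v, label_u
    best = math.inf
    for hub, dist_u in label_u.items():
        dist_v = label_v.get(hub)
        if dist_v is not None:
            best = min(best, dist_u + dist_v)
    return best


# Toy labels for the unit-weight path a - b - c plus an isolated vertex d.
labels = {
    "a": {"a": 0, "b": 1},
    "b": {"b": 0},
    "c": {"c": 0, "b": 1},
    "d": {"d": 0},
}
print(hub_label_distance(labels["a"], labels["c"]))  # 2, via hub b
```

The answer is exact only if the labels satisfy the cover property: for every connected pair, some vertex on a shortest path between them appears in both labels.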
Dynamic and historical shortest-path distance queries on large evolving networks by pruned landmark labeling
In WWW, 2014
Cited by 2 (0 self)
We propose two dynamic indexing schemes for shortest-path and distance queries on large time-evolving graphs, which are useful in a wide range of important applications such as real-time network-aware search and network evolution analysis. To the best of our knowledge, these methods are the first practical exact indexing methods to efficiently process distance queries and dynamic graph updates. We first propose a dynamic indexing scheme for queries on the last snapshot. The scalability and efficiency of its offline indexing algorithm and query algorithm are competitive even with previous static methods. Meanwhile, the method is dynamic, that is, it can incrementally update indices as the graph changes over time. Then, we further design another dynamic indexing scheme that can also answer two kinds of historical queries with regard to not only the latest snapshot but also previous snapshots. Through extensive experiments on real and synthetic evolving networks, we show the scalability and efficiency of our methods. Specifically, they can construct indices from large graphs with millions of vertices, answer queries in microseconds, and update indices in milliseconds.
Efficient Densest Subgraph Computation in Evolving Graphs
2015
Cited by 1 (0 self)
Densest subgraph computation has emerged as an important primitive in a wide range of data analysis tasks such as community and event detection. Social media such as Facebook and Twitter are highly dynamic with new friendship links and tweets being generated incessantly, calling for efficient algorithms that can handle very large and highly dynamic input data. While either scalable or dynamic algorithms for finding densest subgraphs have been proposed, a viable and satisfactory solution for addressing both the dynamic aspect of the input data and its large size is still missing. We study the densest subgraph problem in the dynamic graph model, for which we present the first scalable algorithm with provable guarantees. In our model, edges are added adversarially while they are removed uniformly at
Robust distance queries on massive networks
In Proceedings of the 22nd Annual European Symposium on Algorithms (ESA’14), Lecture Notes in Computer Science, 2014
Cited by 1 (0 self)
We present a versatile and scalable algorithm for computing exact distances on real-world networks with tens of millions of arcs in real time. Unlike existing approaches, preprocessing and queries are practical on a wide variety of inputs, such as social, communication, sensor, and road networks. We achieve this by providing a unified approach based on the concept of 2-hop labels, improving upon existing methods. In particular, we introduce a fast sampling-based algorithm to order vertices by importance, as well as effective compression techniques.
Optimal Enumeration: Efficient Top-k Tree Matching
Driven by many real applications, graph pattern matching has attracted a great deal of attention recently. Consider that a twig-pattern matching may result in an extremely large number of matches in a graph; this may not only confuse users by providing too many results but also lead to high computational costs. In this paper, we study the problem of top-k tree pattern matching; that is, given a rooted tree T, compute its top-k matches in a directed graph G based on the twig-pattern matching semantics. We first present a novel and optimal enumeration paradigm based on the principle of Lawler’s procedure. We show that our enumeration algorithm runs in O(n_T + log k) time in each round, where n_T is the number of nodes in T. Considering that the time complexity to output a match of T is O(n_T) and n_T ≥ log k in practice, our enumeration technique is optimal. Moreover, the cost of generating the top-1 match of T in our algorithm is O(m_R), where m_R is the number of edges in the transitive closure of a data graph G involving all relevant nodes to T. O(m_R) is also optimal in the worst case without pre-knowledge of G. Consequently, our algorithm is optimal with the running time O(m_R + k(n_T + log k)), in contrast to the time complexity O(m_R log k + k n_T (log k + d_T)) of the existing technique, where d_T is the maximal node degree in T. Secondly, a novel priority-based access technique is proposed, which greatly reduces the number of edges accessed and results in a significant performance improvement. Finally, we apply our techniques to the general form of the top-k graph pattern matching problem (i.e., the query is a graph) to improve the existing techniques. Comprehensive empirical studies demonstrate that our techniques may improve the existing techniques by orders of magnitude.
Simple, Fast, and Scalable Reachability Oracle
A reachability oracle (or hop labeling) assigns each vertex v two sets of vertices: L_out(v) and L_in(v), such that u reaches v iff L_out(u) ∩ L_in(v) ≠ ∅. Despite their simplicity and elegance, reachability oracles have failed to achieve efficiency in more than ten years since their introduction: The main problem is high construction cost, which stems from a set-cover framework and the need to materialize transitive closure. In this paper, we present two simple and efficient labeling algorithms, HierarchicalLabeling and DistributionLabeling, which can work on massive real-world graphs: Their construction time is an order of magnitude faster than the set-cover based labeling approach, and transitive closure materialization is not needed. On large graphs, their index sizes and their query performance can now beat the state-of-the-art transitive closure compression and online search approaches.
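As an illustration of the query side only, the following sketch answers a reachability query from precomputed labels, under the usual 2-hop convention that u reaches v exactly when the out-label of u and the in-label of v share at least one vertex. The function name reaches and the hand-built labels for a three-vertex chain are hypothetical; the paper's HierarchicalLabeling and DistributionLabeling construction algorithms are not reproduced here.

```python
def reaches(l_out_u, l_in_v):
    """True iff the out-label of u and the in-label of v intersect."""
    return not l_out_u.isdisjoint(l_in_v)


# Hand-built labels for the chain a -> b -> c (illustrative only).
# L_out(x) holds vertices reachable from x; L_in(x) holds vertices reaching x.
l_out = {"a": {"a", "b"}, "b": {"b"}, "c": {"c"}}
l_in = {"a": {"a"}, "b": {"b"}, "c": {"b", "c"}}

print(reaches(l_out["a"], l_in["c"]))  # True: b is a common hop
print(reaches(l_out["c"], l_in["a"]))  # False: the labels are disjoint
```

With labels kept as sorted arrays instead of hash sets, the same test becomes a linear merge, which is one common way to keep the index compact.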
Shortest Paths in Microseconds
Computing shortest paths is a fundamental primitive for several social network applications including socially-sensitive ranking, location-aware search, social auctions and social network privacy. Since these applications compute paths in response to a user query, the goal is to minimize latency while maintaining feasible memory requirements. We present ASAP, a system that achieves this goal by exploiting the structure of social networks. ASAP preprocesses a given network to compute and store a partial shortest path tree (PSPT) for each node. The PSPTs have the property that for any two nodes, each edge along the shortest path is with high probability contained in the PSPT of at least one of the nodes. We show that the structure of social networks enables the PSPT of each node to be an extremely small fraction of the entire network; hence, PSPTs can be stored efficiently and each shortest path can be computed extremely quickly. For a real network with 5 million nodes and 69 million edges, ASAP computes a shortest path for most node pairs in less than 49 microseconds per pair. ASAP, unlike any previous technique, also computes hundreds of paths (along with corresponding distances) between any node pair in less than 100 microseconds. Finally, ASAP admits efficient implementation on distributed programming frameworks like MapReduce.