Results 1–10 of 15
PATRIC: A Parallel Algorithm for Counting Triangles and Computing Clustering Coefficients in Massive Networks
, 2012
"... We present MPI-based parallel algorithms for counting triangles and computing clustering coefficients in massive networks. � A triangle in a graph G(V, E) is a set of three nodes u, v, w ∊V such that there is an edge between each pair of nodes. The number of triangles incident on node v, with adjace ..."
Abstract
-
Cited by 16 (5 self)
- Add to MetaCart
(Show Context)
We present MPI-based parallel algorithms for counting triangles and computing clustering coefficients in massive networks. A triangle in a graph G(V, E) is a set of three nodes u, v, w ∊ V such that there is an edge between each pair of nodes. The number of triangles incident on node v, with adjacency list N(v), is defined as T(v) = |{(u, w) ∊ E : u, w ∊ N(v)}|. Counting triangles is important in the analysis of various networks, e.g., social, biological, and web networks. Emerging massive networks do not fit in the main memory of a single machine and are very challenging to work with. Our distributed-memory parallel algorithm allows us to deal with such massive networks in a time- and space-efficient manner. We were able to count triangles in a graph with 2 billion nodes and 50 billion edges in 10 minutes. The clustering coefficient (CC) of a node v ∊ V with degree d_v is defined as C_v = 2T(v) / (d_v(d_v − 1)).
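Both definitions are easy to check on a toy graph; a minimal brute-force sketch in Python (illustration only; the paper's contribution is the MPI parallelization, not this sequential loop):

```python
from itertools import combinations

def triangle_counts(adj):
    """T(v) = |{(u, w) in E : u, w in N(v)}|, by brute force."""
    return {v: sum(1 for u, w in combinations(adj[v], 2) if w in adj[u])
            for v in adj}

def clustering_coefficients(adj, T):
    """C_v = 2 T(v) / (d_v (d_v - 1)); taken as 0 when d_v < 2."""
    cc = {}
    for v in adj:
        d = len(adj[v])
        cc[v] = 2 * T[v] / (d * (d - 1)) if d >= 2 else 0.0
    return cc

# A 4-cycle with one chord: triangles {0,1,2} and {0,2,3}.
adj = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1, 3}, 3: {0, 2}}
T = triangle_counts(adj)
print(T)                                # {0: 2, 1: 1, 2: 2, 3: 1}
print(clustering_coefficients(adj, T))  # C_0 = 2/3, C_1 = 1.0, ...
```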
Counting Triangles in Massive Graphs with MapReduce
, 2013
"... Graphs and networks are used to model interactions in a variety of contexts. There is a growing need to quickly assess the characteristics of a graph in order to understand its underlying structure. Some of the most useful metrics are triangle-based and give a measure of the connectedness of mutual ..."
Abstract
-
Cited by 12 (4 self)
- Add to MetaCart
(Show Context)
Graphs and networks are used to model interactions in a variety of contexts. There is a growing need to quickly assess the characteristics of a graph in order to understand its underlying structure. Some of the most useful metrics are triangle-based and give a measure of the connectedness of mutual friends. This is often summarized in terms of clustering coefficients, which measure the likelihood that two neighbors of a node are themselves connected. Computing these measures exactly for large-scale networks is prohibitively expensive in both memory and time. However, a recent wedge sampling algorithm has proved successful in efficiently and accurately estimating clustering coefficients. In this paper, we describe how to implement this approach in MapReduce to deal with extremely massive graphs. We show results on publicly available networks, the largest of which has 132M nodes and 4.7B edges, as well as artificially generated networks (using the Graph500 benchmark), the largest of which has 240M nodes and 8.5B edges. We can estimate the clustering coefficient by degree bin (e.g., using exponential binning) and the number of triangles per bin, as well as the global clustering coefficient and total number of triangles, in an average of 0.33 sec. per million edges plus overhead (approximately 225 sec. total for our configuration). The technique can also be used to study triangle statistics such as the ratio of the highest and lowest degree, and we highlight differences between social and non-social networks. To the best of our knowledge, these are the largest triangle-based graph computations published to date.
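The wedge-sampling estimator that this work parallelizes is itself compact; a minimal sequential sketch (the MapReduce decomposition, which is the paper's actual subject, is deliberately omitted; `k` and the seed are illustrative):

```python
import random

def wedge_sample_gcc(adj, k=10000, seed=0):
    """Estimate the global clustering coefficient by uniform wedge sampling.
    A wedge is a path u-v-w centered at v; it is 'closed' if (u, w) is an
    edge. GCC = (closed wedges) / (all wedges), and a node of degree d_v
    holds d_v*(d_v-1)/2 wedges."""
    rng = random.Random(seed)
    nodes = [v for v in adj if len(adj[v]) >= 2]
    # Weight each center by its wedge count so wedges are sampled uniformly.
    weights = [len(adj[v]) * (len(adj[v]) - 1) // 2 for v in nodes]
    closed = 0
    for _ in range(k):
        v = rng.choices(nodes, weights=weights)[0]
        u, w = rng.sample(sorted(adj[v]), 2)
        closed += w in adj[u]
    return closed / k
```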
Scaling Techniques for Massive Scale-Free Graphs in Distributed (External) Memory
"... Abstract—We present techniques to process large scale-free graphs in distributed memory. Our aim is to scale to trillions of edges, and our research is targeted at leadership class supercomputers and clusters with local non-volatile memory, e.g., NAND Flash. We apply an edge list partitioning techni ..."
Abstract
-
Cited by 7 (1 self)
- Add to MetaCart
(Show Context)
We present techniques to process large scale-free graphs in distributed memory. Our aim is to scale to trillions of edges, and our research is targeted at leadership-class supercomputers and clusters with local non-volatile memory, e.g., NAND Flash. We apply an edge-list partitioning technique designed to accommodate high-degree vertices (hubs) that create scaling challenges when processing scale-free graphs. In addition to partitioning hubs, we use ghost vertices to represent the hubs to reduce communication hotspots. We present a scaling study with three important graph algorithms: Breadth-First Search (BFS), K-Core decomposition, and Triangle Counting. We also demonstrate scalability on BG/P Intrepid by comparing to the best known Graph500 results [1]. We show results on two clusters with local NVRAM storage that are capable of traversing trillion-edge scale-free graphs. By leveraging node-local NAND Flash, our approach can process thirty-two times larger datasets with only a 39% performance degradation in Traversed Edges Per Second (TEPS). Keywords: parallel algorithms; graph algorithms; big data; distributed computing.
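As a rough illustration of the partitioning idea, a toy single-process sketch; the round-robin spreading of hub edges, the `hub_threshold` cutoff, and hash-based ownership here are assumptions for exposition, not the paper's actual scheme:

```python
def partition_edges(edges, degree, num_ranks, hub_threshold):
    """Toy 1D edge-list partition. Edges of low-degree vertices go to the
    owner rank of their source; edges of hubs (degree >= hub_threshold)
    are spread across all ranks, and each such rank keeps a 'ghost' copy
    of the hub so updates can be aggregated locally before communicating."""
    parts = [[] for _ in range(num_ranks)]
    ghosts = [set() for _ in range(num_ranks)]
    for i, (u, v) in enumerate(edges):
        if degree[u] >= hub_threshold:
            r = i % num_ranks          # spread hub edges round-robin
            ghosts[r].add(u)           # rank r holds a ghost of hub u
        else:
            r = hash(u) % num_ranks    # low-degree: owner rank of u
        parts[r].append((u, v))
    return parts, ghosts
```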
Graph Sample and Hold: A Framework for Big-Graph Analytics
"... Sampling is a standard approach in big-graph analytics; the goal is to efficiently estimate the graph properties by consulting a sample of the whole population. A perfect sample is assumed to mirror ev-ery property of the whole population. Unfortunately, such a perfect sample is hard to collect in c ..."
Abstract
-
Cited by 5 (1 self)
- Add to MetaCart
(Show Context)
Sampling is a standard approach in big-graph analytics; the goal is to efficiently estimate graph properties by consulting a sample of the whole population. A perfect sample is assumed to mirror every property of the whole population. Unfortunately, such a perfect sample is hard to collect in complex populations such as graphs (e.g., web graphs, social networks), where an underlying network connects the units of the population. Therefore, a good sample will be representative in the sense that graph properties of interest can be estimated with a known degree of accuracy. While previous work focused particularly on sampling schemes to estimate certain graph properties (e.g., triangle count), much less is known for the case when we need to estimate various graph properties with the same sampling scheme. In this paper, we propose a generic stream sampling framework for big-graph analytics.
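A hedged sketch of what a sample-and-hold edge sampler can look like: an arriving edge that touches an already-held node is kept with one probability, a fresh edge with another, and recording each kept edge's inclusion probability enables Horvitz-Thompson style estimates. The parameters `p` and `q` and the estimator below are illustrative, not the paper's exact design:

```python
import random

def graph_sample_and_hold(edge_stream, p, q, seed=0):
    """Sample-and-hold over an edge stream (illustrative reading): an
    edge incident to an already-held node is kept with prob q ('hold'),
    otherwise with prob p ('sample'). Each kept edge is returned with
    the probability it was kept."""
    rng = random.Random(seed)
    held_nodes, sample = set(), []
    for u, v in edge_stream:
        prob = q if (u in held_nodes or v in held_nodes) else p
        if rng.random() < prob:
            sample.append(((u, v), prob))
            held_nodes.update((u, v))
    return sample

def estimate_edge_count(sample):
    # Horvitz-Thompson: weight each kept edge by 1 / Pr(it was kept).
    return sum(1.0 / prob for _, prob in sample)
```

Holding with a separate probability q lets the sampler preferentially retain edges that can complete structures (e.g., triangles) around nodes it has already seen, which is what makes one scheme serve several graph properties at once.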
Finding the Hierarchy of Dense Subgraphs using Nucleus Decompositions
"... Finding dense substructures in a graph is a fundamental graph mining operation, with applications in bioinformatics, social networks, and visualization to name a few. Yet most standard formulations of this problem (like clique, quasi-clique, k-densest subgraph) are NP-hard. Furthermore, the goal is ..."
Abstract
-
Cited by 3 (1 self)
- Add to MetaCart
(Show Context)
Finding dense substructures in a graph is a fundamental graph mining operation, with applications in bioinformatics, social networks, and visualization, to name a few. Yet most standard formulations of this problem (like clique, quasi-clique, k-densest subgraph) are NP-hard. Furthermore, the goal is rarely to find the “true optimum”, but to identify many (if not all) dense substructures, understand their distribution in the graph, and ideally determine relationships among them. Current dense subgraph finding algorithms usually optimize some objective, and only find a few such subgraphs without providing any structural relations. We define the nucleus decomposition of a graph, which represents the graph as a forest of nuclei. Each nucleus is a subgraph where smaller cliques are present in many larger cliques. The forest of nuclei is a hierarchy by containment, where the edge density increases as we proceed towards leaf nuclei. Sibling nuclei can have limited intersections, which enables discovering overlapping dense subgraphs. With the right parameters, the nucleus decomposition generalizes the classic notions of k-core and k-truss decompositions. We give provably efficient algorithms for nucleus decompositions, and empirically evaluate their behavior in a variety of real graphs. The tree of nuclei consistently gives a global, hierarchical snapshot of dense substructures, and outputs dense subgraphs of higher quality than other state-of-the-art solutions. Our algorithm can process graphs with tens of millions of edges in less than an hour.
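Since the decomposition generalizes k-cores, the classic k-core peeling routine is a useful reference point; a minimal sketch of that standard algorithm (not the paper's nucleus algorithm):

```python
import heapq

def core_numbers(adj):
    """Minimal k-core peeling: repeatedly remove a minimum-degree node.
    A node's core number is the largest k such that it still belongs to
    a subgraph in which every node has degree >= k."""
    deg = {v: len(ns) for v, ns in adj.items()}
    heap = [(d, v) for v, d in deg.items()]
    heapq.heapify(heap)
    removed, core, k = set(), {}, 0
    while heap:
        d, v = heapq.heappop(heap)
        if v in removed or d != deg[v]:
            continue                    # stale heap entry, skip it
        k = max(k, d)                   # core number never decreases
        core[v] = k
        removed.add(v)
        for u in adj[v]:
            if u not in removed:
                deg[u] -= 1
                heapq.heappush(heap, (deg[u], u))
    return core
```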
Wedge sampling for computing clustering coefficients and triangle counts on large graphs
- Statistical Analysis and Data Mining
, 2014
"... Graphs are used to model interactions in a variety of contexts, and there is a growing need to quickly assess the structure of such graphs. Some of the most useful graph metrics are based on triangles, such as those measuring social cohesion. Algorithms to compute them can be extremely expensive, ev ..."
Abstract
-
Cited by 3 (0 self)
- Add to MetaCart
(Show Context)
Graphs are used to model interactions in a variety of contexts, and there is a growing need to quickly assess the structure of such graphs. Some of the most useful graph metrics are based on triangles, such as those measuring social cohesion. Algorithms to compute them can be extremely expensive, even for moderately sized graphs with only millions of edges. Previous work has considered node and edge sampling; in contrast, we consider wedge sampling, which provides faster and more accurate approximations than competing techniques. Additionally, wedge sampling enables estimation of local clustering coefficients, degree-wise clustering coefficients, uniform triangle sampling, and directed triangle counts. Our methods come with provable and practical probabilistic error estimates for all computations. We provide extensive results that show our methods are both more accurate and faster than state-of-the-art alternatives.
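The probabilistic error estimates for an estimator of this shape typically follow from Hoeffding's inequality applied to the wedge-closure indicators; the standard bound reads (a textbook statement, not quoted from the paper):

```latex
\[
  \Pr\bigl(\,|\hat{c} - c| \ge \varepsilon\,\bigr)
    \le 2\exp\!\left(-2k\varepsilon^{2}\right),
  \qquad
  \hat{c} = \frac{1}{k}\sum_{i=1}^{k} X_i ,
\]
% where X_i indicates that the i-th uniformly sampled wedge is closed and
% c is the true clustering coefficient; hence k = ln(2/\delta)/(2\varepsilon^2)
% samples suffice for error \varepsilon with probability at least 1 - \delta,
% independent of the size of the graph.
```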
Multicore Triangle Computations Without Tuning
"... Abstract—Triangle counting and enumeration has emerged as a basic tool in large-scale network analysis, fueling the development of algorithms that scale to massive graphs. Most of the existing algorithms, however, are designed for the distributed-memory setting or the external-memory setting, and ca ..."
Abstract
-
Cited by 2 (1 self)
- Add to MetaCart
(Show Context)
Triangle counting and enumeration has emerged as a basic tool in large-scale network analysis, fueling the development of algorithms that scale to massive graphs. Most of the existing algorithms, however, are designed for the distributed-memory setting or the external-memory setting, and cannot take full advantage of a multicore machine, whose capacity has grown to accommodate even the largest of real-world graphs. This paper describes the design and implementation of simple and fast multicore parallel algorithms for exact, as well as approximate, triangle counting and other triangle computations that scale to billions of nodes and edges. Our algorithms are provably cache-friendly, easy to implement in a language that supports dynamic parallelism, such as Cilk Plus or OpenMP, and do not require parameter tuning. On a 40-core machine with two-way hyper-threading, our parallel exact global and local triangle counting algorithms obtain speedups of 17–50x on a set of real-world and synthetic graphs, and are faster than previous parallel exact triangle counting algorithms. We can compute the exact triangle count of the Yahoo Web graph (over 6 billion edges) in under 1.5 minutes. In addition, for approximate triangle counting, we are able to approximate the count for the Yahoo graph to within 99.6% accuracy in under 10 seconds, and for a given accuracy we are much faster than existing parallel approximate triangle counting implementations.
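The sequential baseline that exact triangle counters of this kind parallelize is intersection counting over a degree-ordered orientation; a minimal sketch showing only that counting logic (none of the paper's parallel or cache-friendly machinery):

```python
def triangle_count(adj):
    """Exact global triangle count via adjacency-list intersection.
    Orienting each edge from lower to higher (degree, id) rank means
    every triangle is counted exactly once, at its lowest-ranked node."""
    rank = {v: (len(adj[v]), v) for v in adj}
    out = {v: sorted((u for u in adj[v] if rank[u] > rank[v]),
                     key=lambda u: rank[u]) for v in adj}
    total = 0
    for v in adj:
        for u in out[v]:
            # Count common out-neighbors of v and u with a sorted merge.
            a, b, i, j = out[v], out[u], 0, 0
            while i < len(a) and j < len(b):
                if rank[a[i]] == rank[b[j]]:
                    total += 1
                    i += 1
                    j += 1
                elif rank[a[i]] < rank[b[j]]:
                    i += 1
                else:
                    j += 1
    return total
```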
A Space-efficient Parallel Algorithm for Counting Exact Triangles in Massive Networks
"... Abstract—Finding the number of triangles in a network (graph) is an important problem in mining and analysis of complex networks. Massive networks emerging from numerous application areas pose a significant challenge in network analytics since these networks consist of millions, or even billions, of ..."
Abstract
- Add to MetaCart
(Show Context)
Finding the number of triangles in a network (graph) is an important problem in the mining and analysis of complex networks. Massive networks emerging from numerous application areas pose a significant challenge in network analytics, since these networks consist of millions, or even billions, of nodes and edges. Such massive networks necessitate the development of efficient parallel algorithms. There exist several MapReduce-based algorithms and a single MPI (Message Passing Interface) based distributed-memory parallel algorithm for counting triangles. MapReduce-based algorithms generate prohibitively large intermediate data. The MPI-based algorithm can work on quite large networks; however, the overlapping partitions employed by the algorithm limit its capability to deal with very massive networks. In this paper, we present a space-efficient MPI-based parallel algorithm for counting the exact number of triangles in massive networks. The algorithm divides the network into non-overlapping partitions. Our results demonstrate up to 25-fold space saving over the algorithm with overlapping partitions. This space efficiency allows the algorithm to deal with networks that are 25 times larger. We present a novel approach that reduces communication cost drastically (up to 90%), leading to both a space- and runtime-efficient algorithm. Our adaptation of a parallel partitioning scheme by computing a novel weight function adds further to the efficiency of the algorithm. Denoting the average degree of nodes and the number of partitions by d̄ and P, respectively, our algorithm achieves up to an O(P²)-factor space efficiency over existing MapReduce-based algorithms and up to an O(d̄)-factor over the algorithm with overlapping partitioning. Keywords: counting triangles; parallel algorithms; massive networks; social networks; graph mining; space efficiency.
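The abstract mentions a partitioning scheme driven by a computed weight function; purely as an illustration of weight-balanced, non-overlapping partitioning, a greedy sketch in which the weight function itself (e.g., w(v) = degree of v) is an assumption, not the paper's cost model:

```python
def balanced_partitions(nodes_in_order, weight, P):
    """Greedy contiguous partitioning sketch: split an ordered node list
    into P non-overlapping parts of roughly equal total weight."""
    total = sum(weight(v) for v in nodes_in_order)
    target = total / P
    parts, cur, acc = [], [], 0.0
    for v in nodes_in_order:
        cur.append(v)
        acc += weight(v)
        if acc >= target and len(parts) < P - 1:
            parts.append(cur)          # close this partition
            cur, acc = [], 0.0
    parts.append(cur)                  # remainder forms the last partition
    return parts
```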
Tracking Triadic Cardinality Distributions for Burst Detection in Social Activity Streams
"... Abstract—In online social networks (OSNs), we often observe abnormally frequent interactions among people before or during some important day, e.g., we receive/send more greetings from/to friends on Christmas Day than usual. We also often observe some viral videos suddenly become worldwide popular t ..."
Abstract
- Add to MetaCart
(Show Context)
In online social networks (OSNs), we often observe abnormally frequent interactions among people before or during some important day; e.g., we receive/send more greetings from/to friends on Christmas Day than usual. We also often observe some viral videos suddenly become popular worldwide through one night of diffusion in OSNs. Do these seemingly different phenomena share a common structure? All these phenomena are related to sudden surges of user activity in OSNs, and are referred to as bursts in this work. We find that the emergence of a burst is accompanied by the formation of new triangles in networks. This finding suggests a new method for detecting bursts in OSNs. We first introduce a new measure, named the triadic cardinality distribution, which measures the fraction of nodes with a certain number of triangles, i.e., a certain triadic cardinality, in a network. The distribution changes when a burst occurs, and is by construction resistant to spamming social bots. Hence, by tracking triadic cardinality distributions, we are able to detect bursts in OSNs. To relieve the burden of handling the huge activity data generated by OSN users, we then develop a carefully designed sample-estimate solution to estimate the triadic cardinality distribution efficiently from sampled data. Extensive experiments conducted on real data demonstrate the usefulness of this triadic cardinality distribution and the effectiveness of our sample-estimate solution.
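Computing the triadic cardinality distribution on a static snapshot is straightforward; a brute-force sketch (the paper's point is the streaming sample-estimate version, which this omits), where burst detection would then compare the distributions of successive snapshots:

```python
from collections import Counter
from itertools import combinations

def triadic_cardinality_distribution(adj):
    """Fraction of nodes whose triadic cardinality (number of triangles
    the node participates in) equals j, for each observed j."""
    tri = {v: sum(1 for u, w in combinations(adj[v], 2) if w in adj[u])
           for v in adj}
    counts = Counter(tri.values())
    n = len(adj)
    return {j: c / n for j, c in sorted(counts.items())}
```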