Triadic Measures on Graphs: The Power of Wedge Sampling
, 2012
Abstract

Cited by 15 (3 self)
Graphs are used to model interactions in a variety of contexts, and there is a growing need to quickly assess the structure of a graph. Some of the most useful graph metrics, especially those measuring social cohesion, are based on triangles. Despite the importance of these triadic measures, associated algorithms can be extremely expensive. We propose a new method based on wedge sampling. This versatile technique allows for the fast and accurate approximation of all current variants of clustering coefficients and enables rapid uniform sampling of the triangles of a graph. Our methods come with provable and practical time-approximation tradeoffs for all computations. We provide extensive results that show our methods are orders of magnitude faster than the state-of-the-art, while providing nearly the accuracy of full enumeration. Our results will enable more wide-scale adoption of triadic measures for analysis of extremely large graphs, as demonstrated on several real-world examples.
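The core estimator described in this abstract, sampling wedges (two-edge paths) uniformly and counting the fraction that close into triangles, can be sketched in a few lines. The adjacency-dict representation and the function name below are illustrative choices, not taken from the paper:

```python
import random

def global_clustering_wedge_sampling(adj, num_samples=10000, seed=0):
    """Estimate the global clustering coefficient by sampling wedges.

    adj: dict mapping each vertex to the set of its neighbors.
    A wedge is a path u-v-w centered at v; it is 'closed' if (u, w)
    is also an edge. The global clustering coefficient is the
    fraction of all wedges that are closed.
    """
    rng = random.Random(seed)
    # Each vertex v centers C(deg(v), 2) wedges; weight it accordingly
    # so that wedges are sampled uniformly over the whole graph.
    vertices = [v for v in adj if len(adj[v]) >= 2]
    weights = [len(adj[v]) * (len(adj[v]) - 1) // 2 for v in vertices]
    closed = 0
    for _ in range(num_samples):
        v = rng.choices(vertices, weights=weights, k=1)[0]
        u, w = rng.sample(sorted(adj[v]), 2)  # two distinct neighbors
        if w in adj[u]:
            closed += 1
    return closed / num_samples
```

Weighting each center by its wedge count C(deg, 2) is what makes the closed fraction an unbiased estimate of the global clustering coefficient.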
COUNTING TRIANGLES IN MASSIVE GRAPHS WITH MAPREDUCE
, 2013
Abstract

Cited by 12 (4 self)
Graphs and networks are used to model interactions in a variety of contexts. There is a growing need to quickly assess the characteristics of a graph in order to understand its underlying structure. Some of the most useful metrics are triangle-based and give a measure of the connectedness of mutual friends. This is often summarized in terms of clustering coefficients, which measure the likelihood that two neighbors of a node are themselves connected. Computing these measures exactly for large-scale networks is prohibitively expensive in both memory and time. However, a recent wedge sampling algorithm has proved successful in efficiently and accurately estimating clustering coefficients. In this paper, we describe how to implement this approach in MapReduce to deal with extremely massive graphs. We show results on publicly available networks, the largest of which has 132M nodes and 4.7B edges, as well as artificially generated networks (using the Graph500 benchmark), the largest of which has 240M nodes and 8.5B edges. We can estimate the clustering coefficient by degree bin (e.g., we use exponential binning) and the number of triangles per bin, as well as the global clustering coefficient and total number of triangles, in an average of 0.33 sec. per million edges plus overhead (approximately 225 sec. total for our configuration). The technique can also be used to study triangle statistics such as the ratio of the highest and lowest degree, and we highlight differences between social and non-social networks. To the best of our knowledge, these are the largest triangle-based graph computations published to date.
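The degree-binned estimate mentioned in the abstract can be imitated on a single machine by grouping vertices into exponential degree bins and running a wedge-sampling estimator per bin. The MapReduce distribution itself is not sketched here, and all names below are illustrative assumptions:

```python
import math
import random

def binned_clustering(adj, samples_per_bin=1000, seed=0):
    """Per-bin clustering coefficients with exponential degree bins.

    A vertex of degree d falls in bin floor(log2(d)); within each bin,
    wedges are sampled with centers weighted by their wedge count
    C(d, 2), and the closed fraction estimates that bin's clustering
    coefficient.
    """
    rng = random.Random(seed)
    bins = {}
    for v, nbrs in adj.items():
        if len(nbrs) >= 2:  # degree-0/1 vertices center no wedges
            bins.setdefault(int(math.log2(len(nbrs))), []).append(v)
    result = {}
    for b, verts in bins.items():
        weights = [len(adj[v]) * (len(adj[v]) - 1) // 2 for v in verts]
        closed = 0
        for _ in range(samples_per_bin):
            v = rng.choices(verts, weights=weights, k=1)[0]
            u, w = rng.sample(sorted(adj[v]), 2)
            closed += w in adj[u]
        result[b] = closed / samples_per_bin
    return result
```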
On the Streaming Complexity of Computing Local Clustering Coefficients
, 2013
Abstract

Cited by 5 (1 self)
Due to a large number of applications, the problem of estimating the number of triangles in graphs revealed as a stream of edges, and the closely related problem of estimating the graph’s clustering coefficient, have received considerable attention in the last decade. Both efficient algorithms and impossibility results have shed light on the computational complexity of the problem. Motivated by applications in Web mining, Becchetti et al. presented new algorithms for the estimation of the local number of triangles, i.e., the number of triangles incident to individual vertices. The algorithms are shown, both theoretically and experimentally, to efficiently handle the problem. However, at least two passes over the data are needed and thus the algorithms are not suitable for real streaming scenarios. In the present work, we consider the problem of estimating the clustering coefficient of individual vertices in a graph over n vertices revealed as a stream of m edges. As a first result we show that any one pass randomized streaming algorithm that can distinguish a graph with no triangles from a graph having a vertex of degree d with clustering coefficient > 1/2 must use Ω(m/d) bits of space in expectation. Our second result is a new randomized one pass algorithm estimating the local clustering coefficient of each vertex with degree at least d. The space requirement of our algorithm is within a logarithmic factor of the lower bound, thus our approach is close to optimal. We also extend the algorithm to local triangle counting and report experimental results on its performance on real-life graphs.
Triangle counting in dynamic graph streams
Abstract

Cited by 4 (0 self)
Estimating the number of triangles in graph streams using a limited amount of memory has become a popular topic in the last decade. Different variations of the problem have been studied depending on whether the graph edges are provided in arbitrary order or as incidence lists. However, with a few exceptions, the algorithms have considered insert-only streams. We present a new algorithm estimating the number of triangles in dynamic graph streams where edges can be both inserted and deleted. We show that our algorithm achieves better time and space complexity than previous solutions for various graph classes, for example sparse graphs with a relatively small number of triangles. Also, for graphs with constant transitivity coefficient, a common situation in real graphs, this is the first algorithm achieving constant processing time per edge.
Wedge sampling for computing clustering coefficients and triangle counts on large graphs
 Statistical Analysis and Data Mining
, 2014
Abstract

Cited by 3 (0 self)
Graphs are used to model interactions in a variety of contexts, and there is a growing need to quickly assess the structure of such graphs. Some of the most useful graph metrics are based on triangles, such as those measuring social cohesion. Algorithms to compute them can be extremely expensive, even for moderately sized graphs with only millions of edges. Previous work has considered node and edge sampling; in contrast, we consider wedge sampling, which provides faster and more accurate approximations than competing techniques. Additionally, wedge sampling enables estimation of local clustering coefficients, degree-wise clustering coefficients, uniform triangle sampling, and directed triangle counts. Our methods come with provable and practical probabilistic error estimates for all computations. We provide extensive results that show our methods are both more accurate and faster than state-of-the-art alternatives.
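One capability this abstract highlights is uniform triangle sampling: since each triangle closes exactly three wedges, a uniformly drawn closed wedge yields a uniformly drawn triangle. A minimal rejection-sampling sketch (hypothetical names, not a specific procedure from the paper):

```python
import random

def sample_uniform_triangle(adj, rng=None, max_tries=100000):
    """Draw one triangle uniformly at random via wedge sampling.

    Repeatedly draw a uniform wedge; the first closed one found is
    returned as a triangle. Because every triangle closes exactly
    three wedges, the returned triangle is uniform over all triangles.
    """
    rng = rng or random.Random()
    vertices = [v for v in adj if len(adj[v]) >= 2]
    weights = [len(adj[v]) * (len(adj[v]) - 1) // 2 for v in vertices]
    for _ in range(max_tries):
        v = rng.choices(vertices, weights=weights, k=1)[0]
        u, w = rng.sample(sorted(adj[v]), 2)
        if w in adj[u]:  # wedge is closed: (u, v, w) is a triangle
            return tuple(sorted((u, v, w)))
    return None  # no closed wedge found (graph may be triangle-free)
```

The expected number of tries is the reciprocal of the global clustering coefficient, so this is fast exactly when triangles are plentiful.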
The K-clique Densest Subgraph Problem
Abstract

Cited by 2 (1 self)
Numerous graph mining applications rely on detecting subgraphs which are large near-cliques. Since formulations that are geared towards finding large near-cliques are NP-hard and frequently inapproximable due to connections with the Maximum Clique problem, the poly-time solvable densest subgraph problem, which maximizes the average degree over all possible subgraphs, “lies at the core of large scale data mining” [10]. However, frequently the densest subgraph problem fails in detecting large near-cliques in networks. In this work, we introduce the k-clique densest subgraph problem, k ≥ 2. This generalizes the well-studied densest subgraph problem, which is obtained as a special case for k = 2. For k = 3 we obtain a novel formulation which we refer to as the triangle densest subgraph problem: given a graph G(V, E), find a subset of vertices S* such that τ(S*) = max_{S⊆V} t(S)
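The triangle-density objective in this abstract, τ(S) = t(S)/|S| (triangles induced by S per vertex of S), is easy to state in code. The brute-force maximizer below is only for building intuition on toy graphs; the paper's contribution is a polynomial-time exact algorithm, which this sketch does not reproduce:

```python
from itertools import combinations

def triangle_density(adj, S):
    """tau(S) = t(S)/|S|: triangles induced by S per vertex of S."""
    S = set(S)
    t = sum(1 for a, b, c in combinations(sorted(S), 3)
            if b in adj[a] and c in adj[a] and c in adj[b])
    return t / len(S) if S else 0.0

def densest_triangle_subgraph_bruteforce(adj):
    """Exhaustively maximize tau over all vertex subsets.

    Exponential time: usable only on tiny graphs, purely to illustrate
    the objective being optimized.
    """
    best, best_tau = None, -1.0
    verts = sorted(adj)
    for k in range(1, len(verts) + 1):
        for S in combinations(verts, k):
            tau = triangle_density(adj, S)
            if tau > best_tau:
                best, best_tau = set(S), tau
    return best, best_tau
```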
Streaming Graph Partitioning in the Planted Partition Model
Abstract
The sheer increase in the size of graph data has created a lot of interest in developing efficient distributed graph processing frameworks. Popular existing frameworks such as GraphLab and Pregel rely on balanced graph partitioning in order to minimize communication and achieve work balance. In this work we contribute to the recent research line of streaming graph partitioning [30, 31, 34], which computes an approximately balanced k-partitioning of the vertex set of a graph in a single pass over the graph stream using degree-based criteria. This graph partitioning framework is well tailored to processing large-scale and dynamic graphs. In this work we introduce the use of higher length walks for streaming graph partitioning and show that their use incurs a minor computational cost which can significantly improve the quality of the graph partition. We perform an average case analysis of our algorithm using the planted partition model [7, 25]. We complement the recent results of Stanton [30] by showing that our proposed method recovers the true partition with high probability even when the gap of the model tends to zero as the size of the graph grows. Furthermore, among the wide number of choices for the length of the walks we show that the proposed length is optimal. Finally, we perform simulations which indicate that our asymptotic results hold even for small graph sizes.
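A generic one-pass degree-based streaming partitioner, of the kind this line of work builds on, can be sketched as follows. This is a simplified greedy heuristic for illustration only, not the higher-length-walk method the abstract proposes, and all names are assumptions:

```python
def streaming_partition(stream, k, capacity):
    """One-pass greedy streaming partitioner.

    Each arriving vertex is assigned to the partition holding the most
    of its already-placed neighbors, subject to a hard capacity cap;
    ties break toward the least-loaded partition to keep the
    k-partitioning approximately balanced.

    stream: iterable of (vertex, neighbor_list) in arrival order.
    Returns a dict vertex -> partition index in [0, k).
    """
    assignment = {}
    sizes = [0] * k
    for v, nbrs in stream:
        scores = [0] * k
        for u in nbrs:
            if u in assignment:
                scores[assignment[u]] += 1
        # Full partitions are disqualified with a score of -1.
        best = max(range(k),
                   key=lambda p: (scores[p] if sizes[p] < capacity else -1,
                                  -sizes[p]))
        assignment[v] = best
        sizes[best] += 1
    return assignment
```

On a graph with two disjoint clusters streamed cluster by cluster, the heuristic keeps each cluster inside one partition, which is the behavior streaming partitioners aim for.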
Scalable Large Near-Clique Detection in Large-Scale Networks via Sampling
Abstract
Extracting dense subgraphs from large graphs is a key primitive in a variety of graph mining applications, ranging from mining social networks and the Web graph to bioinformatics [41]. In this paper we focus on a family of poly-time solvable formulations, known as the k-clique densest subgraph problem (kCliqueDSP) [57]. When k = 2, the problem becomes the well-known densest subgraph problem (DSP) [22, 31, 33, 39]. Our main contribution is a sampling scheme that gives a densest subgraph sparsifier, yielding a randomized algorithm that produces high-quality approximations while providing significant speedups and improved space complexity. We also extend this family of formulations to bipartite graphs by introducing the (p, q)-biclique densest subgraph problem ((p,q)BicliqueDSP), and devise an ex