A scalable generative graph model with community structure
, 2014
Cited by 9 (0 self)
Abstract. Network data is ubiquitous and growing, yet we lack realistic generative network models that can be calibrated to match real-world data. The recently proposed Block Two-Level Erdős-Rényi (BTER) model can be tuned to capture two fundamental properties: degree distribution and clustering coefficients. The latter is particularly important for reproducing graphs with community structure, such as social networks. In this paper, we compare BTER to other scalable models and show that it gives a better fit to real data. We provide a scalable implementation that requires only O(d_max) storage, where d_max is the maximum number of neighbors for a single node. The generator is trivially parallelizable, and we show results for a Hadoop MapReduce implementation for modeling a real-world web graph with over 4.6 billion edges. We propose that the BTER model can be used as a graph generator for benchmarking purposes and provide idealized degree distributions and clustering coefficient profiles that can be tuned for user specifications. Key words. graph generator, network data, block two-level Erdős-Rényi (BTER) model, large-scale graph benchmarks
Graph Sample and Hold: A Framework for Big-Graph Analytics
Cited by 5 (1 self)
Sampling is a standard approach in big-graph analytics; the goal is to efficiently estimate the graph properties by consulting a sample of the whole population. A perfect sample is assumed to mirror every property of the whole population. Unfortunately, such a perfect sample is hard to collect in complex populations such as graphs (e.g., web graphs, social networks), where an underlying network connects the units of the population. Therefore, a good sample will be representative in the sense that graph properties of interest can be estimated with a known degree of accuracy. While previous work focused particularly on sampling schemes to estimate certain graph properties (e.g., triangle count), much less is known for the case when we need to estimate various graph properties with the same sampling scheme. In this paper, we propose a generic stream sampling framework for big-graph analytics,
Wedge sampling for computing clustering coefficients and triangle counts on large graphs
 Statistical Analysis and Data Mining
, 2014
Cited by 3 (0 self)
Graphs are used to model interactions in a variety of contexts, and there is a growing need to quickly assess the structure of such graphs. Some of the most useful graph metrics are based on triangles, such as those measuring social cohesion. Algorithms to compute them can be extremely expensive, even for moderately sized graphs with only millions of edges. Previous work has considered node and edge sampling; in contrast, we consider wedge sampling, which provides faster and more accurate approximations than competing techniques. Additionally, wedge sampling enables estimation of local clustering coefficients, degree-wise clustering coefficients, uniform triangle sampling, and directed triangle counts. Our methods come with provable and practical probabilistic error estimates for all computations. We provide extensive results that show our methods are both more accurate and faster than state-of-the-art alternatives.
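As a rough illustration of the wedge sampling idea named in this abstract, the sketch below estimates the global clustering coefficient (transitivity) of an undirected graph by sampling wedges (paths of length two) uniformly at random and checking whether each one is closed by a third edge. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the adjacency-dictionary representation, and the `toy` graph are all hypothetical and chosen only for this example.

```python
import random
from itertools import combinations

def exact_transitivity(adj):
    """Fraction of wedges (paths of length 2) that are closed into triangles."""
    closed = total = 0
    for v, nbrs in adj.items():
        for a, b in combinations(sorted(nbrs), 2):
            total += 1
            closed += b in adj[a]
    return closed / total if total else 0.0

def wedge_sample_transitivity(adj, k, rng):
    """Estimate transitivity from k uniformly sampled wedges."""
    centers = [v for v in adj if len(adj[v]) >= 2]
    # A vertex of degree d is the center of d*(d-1)/2 wedges, so picking
    # centers with these weights and then a uniform neighbor pair yields
    # a uniformly random wedge.
    weights = [len(adj[v]) * (len(adj[v]) - 1) // 2 for v in centers]
    closed = 0
    for v in rng.choices(centers, weights=weights, k=k):
        a, b = rng.sample(sorted(adj[v]), 2)
        closed += b in adj[a]
    return closed / k

# Hypothetical toy graph: triangle 0-1-2 with a pendant vertex 3 attached to 2.
toy = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
```

On the toy graph, three of the five wedges are closed, so the exact value is 0.6; the sampled estimate approaches it as `k` grows, at a cost that depends on `k` rather than on the number of wedges in the graph.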
Multicore Triangle Computations Without Tuning
Cited by 2 (1 self)
Abstract. Triangle counting and enumeration has emerged as a basic tool in large-scale network analysis, fueling the development of algorithms that scale to massive graphs. Most of the existing algorithms, however, are designed for the distributed-memory setting or the external-memory setting, and cannot take full advantage of a multicore machine, whose capacity has grown to accommodate even the largest of real-world graphs. This paper describes the design and implementation of simple and fast multicore parallel algorithms for exact, as well as approximate, triangle counting and other triangle computations that scale to billions of nodes and edges. Our algorithms are provably cache-friendly, easy to implement in a language that supports dynamic parallelism, such as Cilk Plus or OpenMP, and do not require parameter tuning. On a 40-core machine with two-way hyper-threading, our parallel exact global and local triangle counting algorithms obtain speedups of 17–50x on a set of real-world and synthetic graphs, and are faster than previous parallel exact triangle counting algorithms. We can compute the exact triangle count of the Yahoo Web graph (over 6 billion edges) in under 1.5 minutes. In addition, for approximate triangle counting, we are able to approximate the count for the Yahoo graph to within 99.6% accuracy in under 10 seconds, and for a given accuracy we are much faster than existing parallel approximate triangle counting implementations.
Parallel Triangle Counting and Enumeration using Matrix Algebra
Cited by 1 (1 self)
Abstract. Triangle counting and enumeration are important kernels that are used to characterize graphs. They are also used to compute important statistics such as clustering coefficients. We provide a simple exact algorithm that is based on operations on sparse adjacency matrices. By parallelizing the individual sparse matrix operations, we achieve a parallel algorithm for triangle counting. The algorithm is generalizable to triangle enumeration by modifying the semiring that underlies the matrix algebra. We present a new primitive, masked matrix multiplication, that can be beneficial especially for the enumeration case. We provide results from an initial implementation for the counting case along with various optimizations for communication reduction and load balance.
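To make the matrix view of triangle counting concrete, here is a small serial sketch of one common formulation, which is an assumption for illustration and not necessarily the formulation used in the paper: take the strictly lower-triangular part L of the adjacency matrix, compute the product L·L only at positions where L itself is nonzero (the mask), and sum those entries. Each triangle i > k > j contributes exactly once, through L[i][k], L[k][j], and the mask entry L[i][j]. The function name and the row-to-column-set storage are hypothetical.

```python
def count_triangles_masked(edges):
    """Count triangles as the sum of the entries of (L * L) masked by L."""
    # L: strictly lower-triangular adjacency, stored sparsely as
    # row index -> set of column indices holding a 1.
    L = {}
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops
        hi, lo = max(u, v), min(u, v)
        L.setdefault(hi, set()).add(lo)

    total = 0
    for i, row in L.items():
        for k in row:  # L[i][k] = 1
            # Row k of L lists every j with L[k][j] = 1. Intersecting with
            # row i applies the mask: only j with L[i][j] = 1 are kept.
            total += len(L.get(k, set()) & row)
    return total

# Hypothetical inputs: the complete graph on 4 vertices, and a triangle
# with one pendant edge.
k4_edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
pendant_edges = [(0, 1), (1, 2), (0, 2), (2, 3)]
```

Because the mask discards every product entry that cannot close a triangle, no intermediate matrix denser than L is ever materialized, which is the motivation for treating the masked product as a single primitive.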
Compiled Plans for In-Memory Path-Counting Queries
, 2013
Cited by 1 (0 self)
Dissatisfaction with relational databases for large-scale graph processing has motivated a new class of graph databases that offer fast graph processing but sacrifice the ability to express basic relational idioms. However, we hypothesize that the performance benefits amount to implementation details, not a fundamental limitation of the relational model. To evaluate this hypothesis, we are exploring code generation to produce fast in-memory algorithms and data structures for graph patterns that are inaccessible to conventional relational optimizers. In this paper, we present preliminary results for this approach on path-counting queries, which include triangle counting as a special case. We compile Datalog queries into main-memory pipelined hash-join plans in C++, and show that the resulting programs easily outperform PostgreSQL on real graphs with different degrees of skew. We then produce analogous parallel programs for Grappa, a runtime system for distributed memory architectures. Grappa is a good target for building a parallel query system as its shared memory programming model and communication mechanisms provide productivity and performance when building communication-intensive applications. Our experiments suggest that Grappa programs using hash joins have competitive performance with queries executed on Greenplum, a commercial parallel database. We find preliminary evidence that a code generation approach simplifies the design of a query engine for graph analysis and improves performance over conventional relational databases.
A Space-efficient Parallel Algorithm for Counting Exact Triangles in Massive Networks
Abstract. Finding the number of triangles in a network (graph) is an important problem in mining and analysis of complex networks. Massive networks emerging from numerous application areas pose a significant challenge in network analytics since these networks consist of millions, or even billions, of nodes and edges. Such massive networks necessitate the development of efficient parallel algorithms. There exist several MapReduce-based algorithms and only one MPI (Message Passing Interface) based distributed-memory parallel algorithm for counting triangles. MapReduce-based algorithms generate prohibitively large intermediate data. The MPI-based algorithm can work on quite large networks; however, the overlapping partitions employed by the algorithm limit its capability to deal with very massive networks. In this paper, we present a space-efficient MPI-based parallel algorithm for counting the exact number of triangles in massive networks. The algorithm divides the network into non-overlapping partitions. Our results demonstrate up to 25-fold space saving over the algorithm with overlapping partitions. This space efficiency allows the algorithm to deal with networks which are 25 times larger. We present a novel approach that reduces communication cost drastically (up to 90%), leading to both a space- and runtime-efficient algorithm. Our adaptation of a parallel partitioning scheme by computing a novel weight function adds further to the efficiency of the algorithm. Denoting the average degree of nodes and the number of partitions by d̄ and P, respectively, our algorithm achieves up to O(P^2)-factor space efficiency over existing MapReduce-based algorithms and up to O(d̄)-factor over the algorithm with overlapping partitioning. Keywords: counting triangles, parallel algorithms, massive networks, social networks, graph mining, space efficiency.
Path Sampling: A Fast and Provable Method for Estimating 4-Vertex Subgraph Counts
Counting the frequency of small subgraphs is a fundamental technique in network analysis across various domains, most notably in bioinformatics and social networks. The special case of triangle counting has received much attention. Getting results for 4-vertex patterns is highly challenging, and there are few practical results known that can scale to massive sizes. Indeed, even a highly tuned enumeration code takes more than a day on a graph with millions of edges. Most previous work that runs for truly massive graphs employs clusters and massive parallelization. We provide a sampling algorithm that provably and accurately approximates the frequencies of all 4-vertex pattern subgraphs. Our algorithm is based on a novel technique of 3-path sampling and a special pruning scheme to decrease
Madhav Jha, Sandia National Laboratories; C. Seshadhri, Sandia National Laboratories
"... space efficient streaming algorithm for estimating transitivity and ..."