A scalable generative graph model with community structure, 2014
Cited by 9 (0 self)
Network data is ubiquitous and growing, yet we lack realistic generative network models that can be calibrated to match real-world data. The recently proposed Block Two-Level Erdős-Rényi (BTER) model can be tuned to capture two fundamental properties: degree distribution and clustering coefficients. The latter is particularly important for reproducing graphs with community structure, such as social networks. In this paper, we compare BTER to other scalable models and show that it gives a better fit to real data. We provide a scalable implementation that requires only O(d_max) storage, where d_max is the maximum number of neighbors for a single node. The generator is trivially parallelizable, and we show results for a Hadoop MapReduce implementation for modeling a real-world web graph with over 4.6 billion edges. We propose that the BTER model can be used as a graph generator for benchmarking purposes and provide idealized degree distributions and clustering coefficient profiles that can be tuned for user specifications. Key words. graph generator, network data, block two-level Erdős-Rényi (BTER) model, large-scale graph benchmarks
Graph Sample and Hold: A Framework for Big-Graph Analytics
Cited by 5 (1 self)
Sampling is a standard approach in big-graph analytics; the goal is to efficiently estimate the graph properties by consulting a sample of the whole population. A perfect sample is assumed to mirror every property of the whole population. Unfortunately, such a perfect sample is hard to collect in complex populations such as graphs (e.g. web graphs, social networks), where an underlying network connects the units of the population. Therefore, a good sample will be representative in the sense that graph properties of interest can be estimated with a known degree of accuracy. While previous work focused particularly on sampling schemes to estimate certain graph properties (e.g. triangle count), much less is known for the case when we need to estimate various graph properties with the same sampling scheme. In this paper, we propose a generic stream sampling framework for big-graph analytics,
Wedge sampling for computing clustering coefficients and triangle counts on large graphs - Statistical Analysis and Data Mining, 2014
Cited by 3 (0 self)
Graphs are used to model interactions in a variety of contexts, and there is a growing need to quickly assess the structure of such graphs. Some of the most useful graph metrics are based on triangles, such as those measuring social cohesion. Algorithms to compute them can be extremely expensive, even for moderately-sized graphs with only millions of edges. Previous work has considered node and edge sampling; in contrast, we consider wedge sampling, which provides faster and more accurate approximations than competing techniques. Additionally, wedge sampling enables estimation of local clustering coefficients, degree-wise clustering coefficients, uniform triangle sampling, and directed triangle counts. Our methods come with provable and practical probabilistic error estimates for all computations. We provide extensive results that show our methods are both more accurate and faster than state-of-the-art alternatives.
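The wedge sampling idea in the abstract above can be illustrated with a minimal sketch. It assumes the standard formulation in which a wedge is a path of length two, the global clustering coefficient is the fraction of closed wedges, and the triangle count is that fraction times the total wedge count divided by three. The function name, the adjacency-dict input, and the fixed seed are hypothetical choices for illustration, not the authors' implementation.

```python
import random


def wedge_sample_transitivity(adj, k, seed=0):
    """Estimate global clustering coefficient and triangle count.

    adj  -- dict mapping each vertex to the set of its neighbors
            (undirected, no self-loops); an assumed input format
    k    -- number of wedges to sample
    seed -- seed for reproducibility (illustrative default)
    """
    rng = random.Random(seed)
    # A vertex of degree d is the center of d*(d-1)/2 wedges.
    centers = [v for v in adj if len(adj[v]) >= 2]
    weights = [len(adj[v]) * (len(adj[v]) - 1) // 2 for v in centers]
    total_wedges = sum(weights)
    if total_wedges == 0:
        return 0.0, 0.0
    closed = 0
    # Choosing a center in proportion to its wedge count, then a uniform
    # pair of its neighbors, yields a uniformly random wedge.
    for v in rng.choices(centers, weights=weights, k=k):
        a, b = rng.sample(sorted(adj[v]), 2)
        if b in adj[a]:  # the wedge is closed if its endpoints are adjacent
            closed += 1
    kappa = closed / k
    # Every triangle contains exactly three closed wedges.
    return kappa, kappa * total_wedges / 3
```

On a complete graph with four vertices every wedge is closed, so the sketch returns a coefficient of 1 and a triangle estimate of 4 regardless of the sample size.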
Multicore Triangle Computations Without Tuning
Cited by 2 (1 self)
Triangle counting and enumeration has emerged as a basic tool in large-scale network analysis, fueling the development of algorithms that scale to massive graphs. Most of the existing algorithms, however, are designed for the distributed-memory setting or the external-memory setting, and cannot take full advantage of a multicore machine, whose capacity has grown to accommodate even the largest of real-world graphs. This paper describes the design and implementation of simple and fast multicore parallel algorithms for exact, as well as approximate, triangle counting and other triangle computations that scale to billions of nodes and edges. Our algorithms are provably cache-friendly, easy to implement in a language that supports dynamic parallelism, such as Cilk Plus or OpenMP, and do not require parameter tuning. On a 40-core machine with two-way hyper-threading, our parallel exact global and local triangle counting algorithms obtain speedups of 17–50x on a set of real-world and synthetic graphs, and are faster than previous parallel exact triangle counting algorithms. We can compute the exact triangle count of the Yahoo Web graph (over 6 billion edges) in under 1.5 minutes. In addition, for approximate triangle counting, we are able to approximate the count for the Yahoo graph to within 99.6% accuracy in under 10 seconds, and for a given accuracy we are much faster than existing parallel approximate triangle counting implementations.
Parallel Triangle Counting and Enumeration using Matrix Algebra
Cited by 1 (1 self)
Triangle counting and enumeration are important kernels that are used to characterize graphs. They are also used to compute important statistics such as clustering coefficients. We provide a simple exact algorithm that is based on operations on sparse adjacency matrices. By parallelizing the individual sparse matrix operations, we achieve a parallel algorithm for triangle counting. The algorithm is generalizable to triangle enumeration by modifying the semiring that underlies the matrix algebra. We present a new primitive, masked matrix multiplication, that can be beneficial especially for the enumeration case. We provide results from an initial implementation for the counting case along with various optimizations for communication reduction and load balance.
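To make the matrix view of triangle counting concrete, here is a small serial sketch. It uses one well-known formulation, which may differ from the one in the paper: with L the strictly lower-triangular part of the adjacency matrix, the number of triangles is the sum of the entries of L·L taken only at positions where L itself is nonzero. Restricting the product to those positions is the role a mask plays. The function name and the edge-list input are hypothetical.

```python
def count_triangles_masked(edges):
    """Count triangles in an undirected graph given as (u, v) pairs.

    Computes sum over (i, j) with L[i][j] != 0 of (L·L)[i][j], where L is
    the strictly lower-triangular adjacency pattern. This is an assumed,
    illustrative formulation, evaluated serially.
    """
    # Sparse pattern of L: row i -> set of columns j with j < i.
    lower = {}
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops
        i, j = max(u, v), min(u, v)
        lower.setdefault(i, set()).add(j)

    total = 0
    for i, row in lower.items():
        for j in row:  # the mask: only entries where L[i][j] is nonzero
            # (L·L)[i][j] = number of k with L[i][k] and L[k][j] nonzero,
            # which forces j < k < i, so each triangle is counted once.
            total += sum(1 for k in row if j in lower.get(k, ()))
    return total
```

Because the mask skips every product entry that cannot belong to a triangle, no intermediate result larger than the edge set is ever materialized in this sketch.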
Compiled Plans for In-Memory Path-Counting Queries, 2013
Cited by 1 (0 self)
Dissatisfaction with relational databases for large-scale graph processing has motivated a new class of graph databases that offer fast graph processing but sacrifice the ability to express basic relational idioms. However, we hypothesize that the performance benefits amount to implementation details, not a fundamental limitation of the relational model. To evaluate this hypothesis, we are exploring code generation to produce fast in-memory algorithms and data structures for graph patterns that are inaccessible to conventional relational optimizers. In this paper, we present preliminary results for this approach on path-counting queries, which include triangle counting as a special case. We compile Datalog queries into main-memory pipelined hash-join plans in C++, and show that the resulting programs easily outperform PostgreSQL on real graphs with different degrees of skew. We then produce analogous parallel programs for Grappa, a runtime system for distributed memory architectures. Grappa is a good target for building a parallel query system as its shared memory programming model and communication mechanisms provide productivity and performance when building communication-intensive applications. Our experiments suggest that Grappa programs using hash joins have competitive performance with queries executed on Greenplum, a commercial parallel database. We find preliminary evidence that a code generation approach simplifies the design of a query engine for graph analysis and improves performance over conventional relational databases.
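As a rough illustration of what a hash-join plan for a path-counting query looks like, the sketch below evaluates two such queries over a directed edge list: a two-hop path count and the cyclic triangle query. It is written in Python for brevity rather than the C++ the abstract mentions, and the function names and input format are hypothetical; it is not the generated code described in the paper.

```python
from collections import defaultdict


def build_by_source(edges):
    """Build phase of the hash join: index edges by their source vertex."""
    by_src = defaultdict(list)
    for src, dst in edges:
        by_src[src].append(dst)
    return by_src


def count_two_paths(edges):
    """Count bindings of Path(x, z) :- E(x, y), E(y, z)."""
    by_src = build_by_source(edges)
    # Probe phase: each edge (x, y) joins with every edge leaving y.
    return sum(len(by_src.get(dst, ())) for _, dst in edges)


def count_directed_triangles(edges):
    """Count bindings of T(x, y, z) :- E(x, y), E(y, z), E(z, x).

    Each directed 3-cycle contributes three bindings, one per rotation.
    """
    by_src = build_by_source(edges)
    edge_set = set(edges)  # hash lookup for the closing edge E(z, x)
    count = 0
    for x, y in edges:                 # scan E(x, y)
        for z in by_src.get(y, ()):    # probe the index for E(y, z)
            if (z, x) in edge_set:     # check the closing edge
                count += 1
    return count
```

The nested loops pass each partial binding straight to the next probe without storing the intermediate join result, which is the pipelining the abstract refers to.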
A Space-efficient Parallel Algorithm for Counting Exact Triangles in Massive Networks
Finding the number of triangles in a network (graph) is an important problem in mining and analysis of complex networks. Massive networks emerging from numerous application areas pose a significant challenge in network analytics since these networks consist of millions, or even billions, of nodes and edges. Such massive networks necessitate the development of efficient parallel algorithms. There exist several MapReduce based algorithms and only one MPI (Message Passing Interface) based distributed-memory parallel algorithm for counting triangles. MapReduce based algorithms generate prohibitively large intermediate data. The MPI based algorithm can work on quite large networks; however, the overlapping partitions employed by the algorithm limit its capability to deal with very massive networks. In this paper, we present a space-efficient MPI based parallel algorithm for counting the exact number of triangles in massive networks. The algorithm divides the network into non-overlapping partitions. Our results demonstrate up to 25-fold space saving over the algorithm with overlapping partitions. This space efficiency allows the algorithm to deal with networks which are 25 times larger. We present a novel approach that reduces communication cost drastically (up to 90%), leading to both a space- and runtime-efficient algorithm. Our adaptation of a parallel partitioning scheme by computing a novel weight function adds further to the efficiency of the algorithm. Denoting average degree of nodes and the number of partitions by d̄ and P, respectively, our algorithm achieves up to O(P^2)-factor space efficiency over existing MapReduce based algorithms and up to O(d̄)-factor over the algorithm with overlapping partitioning. Keywords: counting triangles, parallel algorithms, massive networks, social networks, graph mining, space efficiency.
Path Sampling: A Fast and Provable Method for Estimating 4-Vertex Subgraph Counts
Counting the frequency of small subgraphs is a fundamental technique in network analysis across various domains, most notably in bioinformatics and social networks. The special case of triangle counting has received much attention. Getting results for 4-vertex patterns is highly challenging, and there are few practical results known that can scale to massive sizes. Indeed, even a highly tuned enumeration code takes more than a day on a graph with millions of edges. Most previous work that runs for truly massive graphs employs clusters and massive parallelization. We provide a sampling algorithm that provably and accurately approximates the frequencies of all 4-vertex pattern subgraphs. Our algorithm is based on a novel technique of 3-path sampling and a special pruning scheme to decrease
MADHAV JHA, Sandia National Laboratories C. SESHADHRI, Sandia National Laboratories
"... space efficient streaming algorithm for estimating transitivity and ..."