Results 1  10
of
32
GraphChi: Largescale Graph Computation On just a PC
 In Proceedings of the 10th USENIX conference on Operating Systems Design and Implementation, OSDI’12
, 2012
"... Current systems for graph computation require a distributed computing cluster to handle very large realworld problems, such as analysis on social networks or the web graph. While distributed computational resources have become more accessible, developing distributed graph algorithms still remains c ..."
Abstract

Cited by 109 (6 self)
 Add to MetaCart
(Show Context)
Current systems for graph computation require a distributed computing cluster to handle very large realworld problems, such as analysis on social networks or the web graph. While distributed computational resources have become more accessible, developing distributed graph algorithms still remains challenging, especially to nonexperts. In this work, we present GraphChi, a diskbased system for computing efficiently on graphs with billions of edges. By using a wellknown method to break large graphs into small parts, and a novel parallel sliding windows method, GraphChi is able to execute several advanced data mining, graph mining, and machine learning algorithms on very large graphs, using just a single consumerlevel computer. We further extend GraphChi to support graphs that evolve over time, and demonstrate that, on a single computer, GraphChi can process over one hundred thousand graph updates per second, while simultaneously performing computation. We show, through experiments and theoretical analysis, that GraphChi performs well on both SSDs and rotational hard drives. By repeating experiments reported for existing distributed systems, we show that, with only fraction of the resources, GraphChi can solve the same problems in very reasonable time. Our work makes largescale graph computation available to anyone with a modern PC. 1
XStream: Edgecentric Graph Processing using Streaming Partitions
"... XStream is a system for processing both inmemory and outofcore graphs on a single sharedmemory machine. While retaining the scattergather programming model with state stored in the vertices, XStream is novel in (i) using an edgecentric rather than a vertexcentric implementation of this mod ..."
Abstract

Cited by 31 (2 self)
 Add to MetaCart
(Show Context)
XStream is a system for processing both inmemory and outofcore graphs on a single sharedmemory machine. While retaining the scattergather programming model with state stored in the vertices, XStream is novel in (i) using an edgecentric rather than a vertexcentric implementation of this model, and (ii) streaming completely unordered edge lists rather than performing random access. This design is motivated by the fact that sequential bandwidth for all storage media (main memory, SSD, and magnetic disk) is substantially larger than random access bandwidth. We demonstrate that a large number of graph algorithms can be expressed using the edgecentric scattergather model. The resulting implementations scale well in terms of number of cores, in terms of number of I/O devices, and across different storage media. XStream competes favorably with existing systems for graph processing. Besides sequential access, we identify as one of the main contributors to better performance the fact that XStream does not need to sort edge lists during preprocessing. 1
Distributed Memory BreadthFirst Search Revisited: Enabling BottomUp Search
"... Abstract—Breadthfirst search (BFS) is a fundamental graph primitive frequently used as a building block for many complex graph algorithms. In the worst case, the complexity of BFS is linear in the number of edges and vertices, and the conventional topdown approach always takes as much time as the ..."
Abstract

Cited by 7 (1 self)
 Add to MetaCart
(Show Context)
Abstract—Breadthfirst search (BFS) is a fundamental graph primitive frequently used as a building block for many complex graph algorithms. In the worst case, the complexity of BFS is linear in the number of edges and vertices, and the conventional topdown approach always takes as much time as the worst case. A recently discovered bottomup approach manages to cut down the complexity all the way to the number of vertices in the best case, which is typically at least an order of magnitude less than the number of edges. The bottomup approach is not always advantageous, so it is combined with the topdown approach to make the directionoptimizing algorithm which adaptively switches from topdown to bottomup as the frontier expands. We present a scalable distributedmemory parallelization of this challenging algorithm and show up to an order of magnitude speedups compared to an earlier purely topdown code. Our approach also uses a 2D decomposition of the graph that has previously been shown to be superior to a 1D decomposition. Using the default parameters of the Graph500 benchmark, our new algorithm achieves a performance rate of over 240 billion edges per second on 115 thousand cores of a Cray XE6, which makes it over 7 × faster than a conventional topdown algorithm using the same set of optimizations and data distribution. I.
Scalable and High Performance Betweenness Centrality on the GPU
 in Proceedings of the 26th ACM/IEEE International Conference on High Performance Computing, Networking, Storage, and Analysis (SC
, 2014
"... Abstract—Graphs that model social networks, numerical simulations, and the structure of the Internet are enormous and cannot be manually inspected. A popular metric used to analyze these networks is betweenness centrality, which has applications in community detection, power grid contingency analys ..."
Abstract

Cited by 6 (4 self)
 Add to MetaCart
(Show Context)
Abstract—Graphs that model social networks, numerical simulations, and the structure of the Internet are enormous and cannot be manually inspected. A popular metric used to analyze these networks is betweenness centrality, which has applications in community detection, power grid contingency analysis, and the study of the human brain. However, these analyses come with a high computational cost that prevents the examination of large graphs of interest. Prior GPU implementations suffer from large local data structures and inefficient graph traversals that limit scalability and performance. Here we present several hybrid GPU implementations, providing good performance on graphs of arbitrary structure rather than just scalefree graphs as was done previously. We achieve up to 13x speedup on highdiameter graphs and an average of 2.71x speedup overall over the best existing GPU algorithm. We observe near linear speedup and performance exceeding tens of GTEPS when running betweenness centrality on 192 GPUs. Keywords—GPUs, Graph Algorithms, Parallel Algorithms I.
Parallel graph decomposition using random shifts
 In ACM Symposium on Parallelism in Algorithms and Architectures (SPAA
, 2013
"... We show an improved parallel algorithm for decomposing an undirected unweighted graph into small diameter pieces with a small fraction of the edges in between. These decompositions form critical subroutines in a number of graph algorithms. Our algorithm builds upon the shifted shortest path approach ..."
Abstract

Cited by 5 (1 self)
 Add to MetaCart
(Show Context)
We show an improved parallel algorithm for decomposing an undirected unweighted graph into small diameter pieces with a small fraction of the edges in between. These decompositions form critical subroutines in a number of graph algorithms. Our algorithm builds upon the shifted shortest path approach introduced in [Blelloch,Gupta,Koutis,Miller,Peng,Tangwongsan, SPAA 2011]. By combining various stages of the previous algorithm, we obtain a significantly simpler algorithm with the same asymptotic guarantees as the best sequential algorithm. 1
Flashgraph: processing billionnode graphs on an array of commodity ssds
"... Abstract—Graph analysis performs many random reads and writes, thus these workloads are typically performed in memory. Traditionally, analyzing large graphs requires a cluster of machines so the aggregate memory exceeds the size of the graph. We demonstrate that a multicore server can process graph ..."
Abstract

Cited by 3 (0 self)
 Add to MetaCart
(Show Context)
Abstract—Graph analysis performs many random reads and writes, thus these workloads are typically performed in memory. Traditionally, analyzing large graphs requires a cluster of machines so the aggregate memory exceeds the size of the graph. We demonstrate that a multicore server can process graphs of billions of vertices and hundreds of billions of edges, utilizing commodity SSDs without much performance loss. We do so by implementing a graphprocessing engine within a userspace SSD file system designed for high IOPS and extreme parallelism. This allows us to localize computation to cached data in a nonuniform memory architecture and hide latency by overlapping computation with I/O. Our semiexternal memory graph engine, called FlashGraph, stores vertex state in memory and adjacency lists on SSDs. FlashGraph exposes a general and flexible programming interface that can express a variety of graph algorithms and their optimizations. FlashGraph in semiexternal memory performs many algorithms up to 20 times faster than PowerGraph, a generalpurpose, inmemory graph engine. Even breadthfirst search, which generates many small random I/Os, runs significantly faster in FlashGraph. I.
NUMAaware graphstructured analytics
 In ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP
, 2015
"... Graphstructured analytics has been widely adopted in a number of big data applications such as social computation, websearch and recommendation systems. Though much prior research focuses on scaling graphanalytics on distributed environments, the strong desire on performance per core, dollar and ..."
Abstract

Cited by 3 (0 self)
 Add to MetaCart
(Show Context)
Graphstructured analytics has been widely adopted in a number of big data applications such as social computation, websearch and recommendation systems. Though much prior research focuses on scaling graphanalytics on distributed environments, the strong desire on performance per core, dollar and joule has generated considerable interests of processing largescale graphs on a single serverclass machine, which may have several terabytes of RAM and 80 or more cores. However, prior graphanalytics systems are largely neutral to NUMA characteristics and thus have suboptimal performance. This paper presents a detailed study of NUMA characteristics and their impact on the efficiency of graphanalytics. Our study uncovers two insights: 1) either random or interleaved allocation of graph data will significantly hamper data locality and parallelism; 2) sequential internode (i.e., remote) memory accesses have much higher bandwidth than both intra and internode random ones. Based on them, this paper describes Polymer, a NUMAaware graphanalytics system on multicore with two key design decisions. First, Polymer differentially allocates and places topology data, applicationdefined data and mutable runtime states of a graph system according to their access patterns to minimize remote accesses. Second, for some remaining random accesses, Polymer carefully converts random remote accesses into sequential remote accesses, by using lightweight replication of vertices across NUMA nodes. To improve load balance and vertex convergence, Polymer is further built with a hierarchical barrier to boost parallelism and locality, an edgeoriented balanced partitioning for skewed graphs, and adaptive data structures according to the proportion of active vertices. A detailed evaluation on an 80core machine shows that Polymer often outperforms the stateoftheart singlemachine graphanalytics systems, including Ligra, XStream and Galois, for a set of popular realworld and synthetic graphs.
Incremental Algorithms for Network Management and Analysis based on Closeness Centrality
, 1303
"... Analyzing networks requires complex algorithms to extract meaningful information. Centrality metrics have shown to be correlated with the importance and loads of the nodes in network traffic. Here, we are interested in the problem of centralitybased network management. The problem has many applicat ..."
Abstract

Cited by 2 (1 self)
 Add to MetaCart
(Show Context)
Analyzing networks requires complex algorithms to extract meaningful information. Centrality metrics have shown to be correlated with the importance and loads of the nodes in network traffic. Here, we are interested in the problem of centralitybased network management. The problem has many applications such as verifying the robustness of the networks and controlling or improving the entity dissemination. It can be defined as finding a small set of topological network modifications which yield a desired closeness centrality configuration. As a fundamental building block to tackle that problem, we propose incremental algorithms which efficiently update the closeness centrality values upon changes in network topology, i.e., edge insertions and deletions. Our algorithms are proven to be efficient on many reallife networks, especially on smallworld networks, which have a small diameter and a spikeshaped shortest distance distribution. In addition to closeness centrality, they can also be a great arsenal for the shortestpathbased management and analysis of the networks. We experimentally validate the efficiency of our algorithms on large networks and show that they update the closeness centrality values of the temporal DBLPcoauthorship network of 1.2 million users 460 times faster than it would take to compute them from scratch. To the best of our knowledge, this is the first work which can yield practical largescale network management based on closeness centrality values.
Graphchidb: Simple design for a scalable graph database system  on just a PC
 CoRR
"... We propose a new data structure, Parallel Adjacency Lists (PAL), for efficiently managing graphs with billions of edges on disk. The PAL structure is based on the graph storage model of GraphChi [14], but we extend it to enable online database features such as queries and fast insertions. In additio ..."
Abstract

Cited by 2 (0 self)
 Add to MetaCart
(Show Context)
We propose a new data structure, Parallel Adjacency Lists (PAL), for efficiently managing graphs with billions of edges on disk. The PAL structure is based on the graph storage model of GraphChi [14], but we extend it to enable online database features such as queries and fast insertions. In addition, we extend the model with edge and vertex attributes. Compared to previous data structures, PAL can store graphs more compactly while allowing fast access to both the incoming and the outgoing edges of a vertex, without duplicating data. Based on PAL, we design a graph database management system, GraphChiDB, which can also execute powerful analytical graph computation. We evaluate our design experimentally and demonstrate that GraphChiDB achieves stateoftheart performance on graphs that are much larger than the available memory. GraphChiDB enables anyone with just a laptop or a PC to work with extremely large graphs. 1.
Hardware/software vectorization for closeness centrality on multi/manycore architectures
 In 28th International Parallel and Distributed Processing Symposium Workshops, Workshop on Multithreaded Architectures and Applications (MTAAP
, 2014
"... Abstract—Centrality metrics have shown to be highly correlated with the importance and loads of the nodes in a network. Given the scale of today’s social networks, it is essential to use efficient algorithms and high performance computing techniques for their fast computation. In this work, we expl ..."
Abstract

Cited by 2 (0 self)
 Add to MetaCart
(Show Context)
Abstract—Centrality metrics have shown to be highly correlated with the importance and loads of the nodes in a network. Given the scale of today’s social networks, it is essential to use efficient algorithms and high performance computing techniques for their fast computation. In this work, we exploit hardware and software vectorization in combination with finegrain parallelization to compute the closeness centrality values. The proposed vectorization approach enables us to do concurrent breadthfirst search operations and significantly increases the performance. We provide a comparison of different vectorization schemes and experimentally evaluate our contributions with respect to the existing parallel CPUbased solutions on cuttingedge hardware. Our implementations achieve to be 11 times faster than the stateoftheart implementation for a graph with 234 million edges. The proposed techniques are beneficial to show how the vectorization can be efficiently utilized to execute other graph kernels that require multiple traversals over a largescale network on cuttingedge architectures. KeywordsCentrality, closeness centrality, vectorization, breadthfirst search, Intel Xeon Phi. I.