Results 1-10 of 27
Scalable GPU graph traversal
 In 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP’12, 2012
Cited by 64 (1 self)
Breadth-first search (BFS) is a core primitive for graph traversal and a basis for many higher-level graph analysis algorithms. It is also representative of a class of parallel computations whose memory accesses and work distribution are both irregular and data-dependent. Recent work has demonstrated the plausibility of GPU sparse graph traversal, but has tended to focus on asymptotically inefficient algorithms that perform poorly on graphs with non-trivial diameter. We present a BFS parallelization focused on fine-grained task management constructed from efficient prefix sum that achieves an asymptotically optimal O(V+E) work complexity. Our implementation delivers excellent performance on diverse graphs, achieving traversal rates in excess of 3.3 billion and 8.3 billion traversed edges per second using single and quad-GPU configurations, respectively. This level of performance is several times faster than state-of-the-art implementations on both CPU and GPU platforms.
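The abstract above ties linear-work BFS to prefix sums. The following is a minimal sequential sketch of that idea, not the paper's GPU implementation: it assumes a graph in CSR form, and the names `bfs_prefix_sum`, `row_offsets`, and `col_indices` are hypothetical. An exclusive prefix sum over the frontier's vertex degrees gives each vertex a private range of output slots, so neighbor gathering needs no locks or atomics; each vertex is expanded once and each edge gathered once, which keeps total work at O(V+E).

```python
from itertools import accumulate


def bfs_prefix_sum(row_offsets, col_indices, source):
    """Level-synchronous BFS over a CSR graph.

    Returns hop distances from `source`; unreachable vertices get -1.
    Sequential emulation only: the loops marked "parallel" are the ones
    a GPU implementation would distribute across threads (an assumption
    of this sketch, not a description of the paper's kernels).
    """
    n = len(row_offsets) - 1
    dist = [-1] * n
    dist[source] = 0
    frontier = [source]
    level = 0
    while frontier:
        level += 1
        # Each frontier vertex needs degree(v) output slots.
        degrees = [row_offsets[v + 1] - row_offsets[v] for v in frontier]
        # Exclusive prefix sum: offsets[i] is the first slot owned by
        # frontier[i]; offsets[-1] is the total number of slots.
        offsets = [0] + list(accumulate(degrees))
        gathered = [0] * offsets[-1]
        for i, v in enumerate(frontier):  # parallel over frontier vertices
            start = row_offsets[v]
            for k in range(degrees[i]):
                gathered[offsets[i] + k] = col_indices[start + k]
        # Keep only unvisited neighbors; they form the next frontier.
        next_frontier = []
        for u in gathered:  # parallel filter plus compaction on a GPU
            if dist[u] == -1:
                dist[u] = level
                next_frontier.append(u)
        frontier = next_frontier
    return dist


# Edges: 0->1, 0->2, 1->3, 2->3; vertex 4 is isolated.
row_offsets = [0, 2, 3, 4, 4, 4]
col_indices = [1, 2, 3, 3]
distances = bfs_prefix_sum(row_offsets, col_indices, 0)
```

The same prefix-sum pattern can compact the filtered neighbors into the next frontier; the sketch uses a plain list append there to stay short.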
Parallel algorithms for evaluating centrality indices in real-world networks
 In Proceedings of the International Conference on Parallel Processing (ICPP), 2006
Cited by 54 (11 self)
This paper discusses fast parallel algorithms for evaluating several centrality indices frequently used in complex network analysis. These algorithms have been optimized to exploit properties typically observed in real-world large-scale networks, such as the low average distance, high local density, and heavy-tailed power law degree distributions. We test our implementations on real datasets such as the web graph, protein-interaction networks, movie-actor and citation networks, and report impressive parallel performance for evaluation of the computationally intensive centrality metrics (betweenness and closeness centrality) on high-end shared memory symmetric multiprocessor and multithreaded architectures. To our knowledge, these are the first parallel implementations of these widely-used social network analysis metrics. We demonstrate that it is possible to rigorously analyze networks three orders of magnitude larger than instances that can be handled by existing network analysis (SNA) software packages. For instance, we compute the exact betweenness centrality value for each vertex in a large US patent citation network (3 million patents, 16 million citations) in 42 minutes on 16 processors, utilizing 20GB RAM of the IBM p5 570. Current SNA packages, on the other hand, cannot handle graphs with more than a hundred thousand edges.
Scalable Graph Exploration on Multicore Processors
Cited by 45 (1 self)
Many important problems in computational sciences, social network analysis, security, and business analytics are data-intensive and lend themselves to graph-theoretical analyses. In this paper we investigate the challenges involved in exploring very large graphs by designing a breadth-first search (BFS) algorithm for advanced multi-core processors that are likely to become the building blocks of future exascale systems. Our new methodology for large-scale graph analytics combines a high-level algorithmic design that captures the machine-independent aspects, to guarantee portability with performance to future processors, with an implementation that embeds processor-specific optimizations. We present an experimental study that uses state-of-the-art Intel Nehalem EP and EX processors and up to 64 threads in a single system. Our performance on several benchmark problems representative of the power-law graphs found in real-world problems reaches processing rates that are competitive with supercomputing results in the recent literature. In the experimental evaluation we prove that our graph exploration algorithm running on a 4-socket Nehalem EX is (1) 2.4 times faster than a Cray XMT with 128 processors when exploring a random graph with 64 million vertices and 512 million edges, (2) capable of processing 550 million edges per second with an R-MAT graph with 200 million vertices and 1 billion edges, comparable to the performance of a similar graph on a Cray MTA-2 with 40 processors, and (3) 5 times faster than 256 BlueGene/L processors on a graph with average degree 50.
An experimental study of a parallel shortest path algorithm for solving large-scale graph instances
 Ninth Workshop on Algorithm Engineering and Experiments (ALENEX 2007), 2007
Cited by 17 (3 self)
We present an experimental study of the single source shortest path problem with non-negative edge weights (NSSP) on large-scale graphs using the $\Delta$-stepping parallel algorithm. We report performance results on the Cray MTA-2, a multithreaded parallel computer. The MTA-2 is a high-end shared memory system offering two unique features that aid the efficient parallel implementation of irregular algorithms: the ability to exploit fine-grained parallelism, and low-overhead synchronization primitives. Our implementation exhibits remarkable parallel speedup when compared with competitive sequential algorithms, for low-diameter sparse graphs. For instance, $\Delta$-stepping on a directed scale-free graph of 100 million vertices and 1 billion edges takes less than ten seconds on 40 processors of the MTA-2, with a relative speedup of close to 30. To our knowledge, these are the first performance results of a shortest path problem on realistic graph instances in the order of billions of vertices and edges.
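For readers unfamiliar with Δ-stepping, here is a small sequential sketch of the general technique as usually described (distance buckets of width Δ, with light edges relaxed repeatedly inside a bucket and heavy edges relaxed once when the bucket empties). It is an illustration under stated assumptions, not the MTA-2 implementation studied in the paper: the adjacency-list layout and the names `delta_stepping`, `adj`, and `relax` are hypothetical, and the relaxation batches that a parallel version would process concurrently are plain loops here.

```python
import math


def delta_stepping(adj, source, delta):
    """Single-source shortest paths with non-negative weights.

    adj[u] is a list of (v, weight) pairs; delta > 0 is the bucket width.
    Returns a list of distances (math.inf for unreachable vertices).
    """
    dist = [math.inf] * len(adj)
    buckets = {}  # bucket index -> set of vertices with tentative distance in that range

    def relax(v, d):
        # Move v to the bucket matching its improved tentative distance.
        if d < dist[v]:
            if dist[v] != math.inf:
                buckets.get(int(dist[v] // delta), set()).discard(v)
            dist[v] = d
            buckets.setdefault(int(d // delta), set()).add(v)

    relax(source, 0)
    while buckets:
        i = min(buckets)  # smallest non-settled bucket
        settled = set()
        # Light edges (weight <= delta) may reinsert vertices into bucket i,
        # so keep draining until the bucket stays empty.
        while buckets.get(i):
            current = buckets.pop(i)
            settled |= current
            requests = [(v, dist[u] + w)
                        for u in current for v, w in adj[u] if w <= delta]
            for v, d in requests:  # one parallel relaxation phase
                relax(v, d)
        buckets.pop(i, None)
        # Heavy edges (weight > delta) always land in a later bucket, so
        # they are relaxed once, after bucket i is final.
        requests = [(v, dist[u] + w)
                    for u in settled for v, w in adj[u] if w > delta]
        for v, d in requests:
            relax(v, d)
    return dist


# Small directed example; vertex 4 is unreachable from vertex 0.
adj = [
    [(1, 2), (2, 10)],
    [(2, 3), (3, 7)],
    [(3, 1)],
    [],
    [],
]
shortest = delta_stepping(adj, 0, 3)
```

The choice of Δ trades work for parallelism: a very small Δ behaves like Dijkstra's algorithm with little concurrency per bucket, while a very large Δ approaches Bellman-Ford with more re-relaxations.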
SNAP, Small-world Network Analysis and Partitioning: an open-source parallel graph framework for the exploration of large-scale networks
Cited by 16 (0 self)
We present SNAP (Small-world Network Analysis and Partitioning), an open-source graph framework for exploratory study and partitioning of large-scale networks. To illustrate the capability of SNAP, we discuss the design, implementation, and performance of three novel parallel community detection algorithms that optimize modularity, a popular measure for clustering quality in social network analysis. In order to achieve scalable parallel performance, we exploit typical network characteristics of small-world networks, such as the low graph diameter, sparse connectivity, and skewed degree distribution. We conduct an extensive experimental study on real-world graph instances and demonstrate that our parallel schemes, coupled with aggressive algorithm engineering for small-world networks, give significant running time improvements over existing modularity-based clustering heuristics, with little or no loss in clustering quality. For instance, our divisive clustering approach based on approximate edge betweenness centrality is more than two orders of magnitude faster than a competing greedy approach, for a variety of large graph instances on the Sun Fire T2000 multicore system. SNAP also contains parallel implementations of fundamental graph-theoretic kernels and topological analysis metrics (e.g., breadth-first search, connected components, vertex and edge centrality) that are optimized for small-world networks. The SNAP framework is extensible; the graph kernels are modular, portable across shared memory multicore and symmetric multiprocessor systems, and simplify the design of high-level domain-specific applications.
Parallel Shortest Path Algorithms for Solving . . .
, 2006
Cited by 15 (3 self)
We present an experimental study of the single source shortest path problem with non-negative edge weights (NSSP) on large-scale graphs using the ∆-stepping parallel algorithm. We report performance results on the Cray MTA-2, a multithreaded parallel computer. The MTA-2 is a high-end shared memory system offering two unique features that aid the efficient parallel implementation of irregular algorithms: the ability to exploit fine-grained parallelism, and low-overhead synchronization primitives. Our implementation exhibits remarkable parallel speedup when compared with competitive sequential algorithms, for low-diameter sparse graphs. For instance, ∆-stepping on a directed scale-free graph of 100 million vertices and 1 billion edges takes less than ten seconds on 40 processors of the MTA-2, with a relative speedup of close to 30. To our knowledge, these are the first performance results of a shortest path problem on realistic graph instances in the order of billions of vertices and edges.
Fast and Scalable List Ranking on the GPU
Cited by 11 (3 self)
General-purpose programming on graphics processing units (GPGPU) has received a lot of attention in the parallel computing community as it promises to offer the highest performance per dollar. GPUs have been used extensively on regular problems that can be easily parallelized. In this paper, we describe two implementations of List Ranking, a traditional irregular algorithm that is difficult to parallelize on such massively multithreaded hardware. We first present an implementation of Wyllie’s algorithm based on pointer jumping. This technique does not scale well to large lists due to the suboptimal work done. We then present a GPU-optimized, Recursive Helman-JáJá (RHJ) algorithm. Our RHJ implementation can rank a random list of 32 million elements in about a second and achieves a speedup of about 8-9 over a CPU implementation as well as a speedup of 3-4 over the best reported implementation on the Cell Broadband Engine. We also discuss the practical issues relating to the implementation of irregular algorithms on massively multithreaded architectures like that of the GPU. Regular or coalesced memory access patterns and balanced load are critical to achieving good performance on the GPU.
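The pointer-jumping idea behind Wyllie's algorithm, mentioned in the abstract above, can be sketched in a few lines. This is a sequential emulation under assumptions of this note, not the paper's GPU code: the list is given as a successor array whose tail points to itself, and the names `wyllie_list_rank` and `succ` are hypothetical. Every round updates all n nodes, and about log2(n) rounds are needed, so the total work is O(n log n) rather than the O(n) of a sequential traversal, which is the suboptimal work the abstract refers to.

```python
def wyllie_list_rank(succ):
    """Rank every node of a linked list by pointer jumping.

    succ[i] is the successor of node i; the tail satisfies succ[t] == t.
    Returns rank[i] = number of links from node i to the tail.
    """
    n = len(succ)
    nxt = list(succ)
    # The tail is 0 links from itself; every other node starts 1 link
    # away from its current successor.
    rank = [0 if nxt[i] == i else 1 for i in range(n)]
    # After k rounds each pointer spans up to 2**k links, so
    # (n - 1).bit_length() rounds cover the longest possible list.
    rounds = max(1, (n - 1).bit_length())
    for _ in range(rounds):
        # Both comprehensions read only the previous round's arrays,
        # which mirrors a synchronous parallel step over all nodes.
        new_rank = [rank[i] + rank[nxt[i]] for i in range(n)]
        new_nxt = [nxt[nxt[i]] for i in range(n)]
        rank, nxt = new_rank, new_nxt
    return rank


# List order: 3 -> 0 -> 4 -> 1 -> 2, with node 2 as the tail.
succ = [4, 2, 2, 0, 1]
ranks = wyllie_list_rank(succ)
```

Writing into fresh arrays each round avoids read-after-write hazards; an in-place parallel version would need a barrier or double buffering to get the same effect.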
Models for Advancing PRAM and Other Algorithms into Parallel Programs for a PRAM-on-Chip Platform
High performance combinatorial algorithm design on the Cell Broadband Engine processor
, 2007
Cited by 8 (0 self)
The Sony–Toshiba–IBM Cell Broadband Engine (Cell/B.E.) is a heterogeneous multicore architecture that consists of a traditional microprocessor (PPE) with eight SIMD co-processing units (SPEs) integrated on-chip. While the Cell/B.E. processor is architected for multimedia applications with regular processing requirements, we are interested in its performance on problems with non-uniform memory access patterns. In this article, we present two case studies to illustrate the design and implementation of parallel combinatorial algorithms on Cell/B.E.: we discuss list ranking, a fundamental kernel for graph problems, and zlib, a data compression and decompression library. List ranking is a particularly challenging problem to parallelize on current cache-based and distributed memory architectures due to its low computational intensity and irregular memory access patterns. To tolerate memory latency on the Cell/B.E. processor, we decompose work into several independent tasks and coordinate computation using the novel idea of Software-Managed threads (SM-Threads). We apply this generic SPE work-partitioning technique to efficiently implement list ranking, and demonstrate substantial speedup in comparison to traditional cache-based microprocessors. For instance, on a 3.2 GHz IBM QS20 Cell/B.E. blade, for a random linked list of 1 million nodes, we achieve an overall speedup of 8.34 over a PPE-only implementation. Our second case study, zlib, is a data compression/decompression library that is extensively used in both scientific and general-purpose computing. The core kernels in the zlib library are the LZ77 longest subsequence matching algorithm
High Performance and Scalable GPU Graph Traversal
, 2011
Cited by 3 (1 self)
Breadth-first search (BFS) is a core primitive for graph traversal and a basis for many higher-level graph analysis algorithms. It is also representative of a class of parallel computations whose memory accesses and work distribution are both irregular and data-dependent. Recent work has demonstrated the plausibility of GPU sparse graph traversal, but has tended to focus on asymptotically inefficient algorithms that perform poorly on graphs with non-trivial diameter. We present a BFS parallelization focused on fine-grained task management that achieves an asymptotically optimal O(V+E) work complexity. Our implementation delivers excellent performance on diverse graphs, achieving traversal rates in excess of 3.3 billion and 8.3 billion traversed edges per second using single and quad-GPU configurations, respectively. This level of performance is several times faster than state-of-the-art implementations on both CPU and GPU platforms.