## Efficient parallel graph exploration for multi-core CPU and GPU (2011)

Venue: | In IEEE PACT |

Citations: | 13 - 1 self |

### BibTeX

@INPROCEEDINGS{Hong11efficientparallel,

author = {Sungpack Hong and Tayo Oguntebi and Kunle Olukotun},

title = {Efficient parallel graph exploration for multi-core cpu and gpu},

booktitle = {In IEEE PACT},

year = {2011}

}

### Abstract

Graphs are a fundamental data representation that has been used extensively in various domains. In graph-based applications, a systematic exploration of the graph such as a breadth-first search (BFS) often serves as a key component in the processing of their massive data sets. In this paper, we present a new method for implementing the parallel BFS algorithm on multi-core CPUs which exploits a fundamental property of randomly shaped real-world graph instances. By utilizing memory bandwidth more efficiently, our method shows improved performance over the current state-of-the-art implementation and increases its advantage as the size of the graph increases. We then propose a hybrid method which, for each level of the BFS algorithm, dynamically chooses the best implementation from: a sequential execution, two different methods of multi-core execution, and a GPU execution. Such a hybrid approach provides the best performance for each graph size while avoiding poor worst-case performance on high-diameter graphs. Finally, we study the effects of the underlying architecture on BFS performance by comparing multiple CPU and GPU systems; a high-end GPU system performed as well as a quad-socket high-end CPU system.
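The per-level selection described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: everything runs sequentially in Python, the function names (`build_csr`, `choose_strategy`, `bfs_levels`) are hypothetical, the thresholds in `choose_strategy` are placeholder values rather than the paper's tuned parameters, and the two multi-core variants are collapsed into a single label. The sketch only shows the shape of a level-synchronous BFS over a CSR graph that records, for each level, which execution strategy a frontier-size rule would pick.

```python
def build_csr(num_nodes, edges):
    """Build a CSR adjacency (row offsets + column indices) from a directed edge list."""
    degree = [0] * num_nodes
    for src, _ in edges:
        degree[src] += 1
    offsets = [0] * (num_nodes + 1)
    for v in range(num_nodes):
        offsets[v + 1] = offsets[v] + degree[v]
    targets = [0] * len(edges)
    cursor = offsets[:-1]  # slice copy: next free slot for each source node
    for src, dst in edges:
        targets[cursor[src]] = dst
        cursor[src] += 1
    return offsets, targets


def choose_strategy(frontier_size, small=64, large=2048):
    """Hypothetical selector; thresholds and labels are illustrative assumptions."""
    if frontier_size < small:
        return "sequential"
    if frontier_size < large:
        return "multi-core"
    return "gpu"


def bfs_levels(offsets, targets, root):
    """Level-synchronous BFS: returns each node's level and the per-level plan."""
    num_nodes = len(offsets) - 1
    level = [-1] * num_nodes  # -1 marks an unvisited (or unreachable) node
    level[root] = 0
    frontier = [root]
    depth = 0
    plan = []  # (level, frontier size, strategy label) for every BFS level
    while frontier:
        plan.append((depth, len(frontier), choose_strategy(len(frontier))))
        next_frontier = []
        for u in frontier:
            for i in range(offsets[u], offsets[u + 1]):
                v = targets[i]
                if level[v] == -1:
                    level[v] = depth + 1
                    next_frontier.append(v)
        frontier = next_frontier
        depth += 1
    return level, plan
```

In a real hybrid system the label returned per level would dispatch to a different kernel; here it is only recorded, so the plan can be inspected alongside the computed levels.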

### Citations

9158 | Introduction to Algorithms
- Cormen, Leiserson, et al.
- 1998
Citation Context ...ves as a building block for many other algorithms including betweenness centrality calculation [4], connected component identification [7], community structure detection [8], and max-flow computation =-=[9]-=-. Benchmark suites targeting graph applications perennially include BFS as a primary element [10], [11]. Due to such importance, significant research has been conducted to efficiently implement a para... |

2198 |
Collective dynamics of ‘small-world’ networks
- Watts, Strogatz
- 1998
Citation Context ...instances that are irregularly shaped by nature. This is because it has been observed that the diameters of real-world graphs are small even for large graph instances, i.e. the small world phenomenon =-=[22]-=-. Consequently, the overhead of level-wise synchronization is tolerable since the number of synchronization events– the diameter of the graph– is small. Similarly, because of the small world phenomeno... |

857 |
Finding and evaluating community structure in networks
- Newman, Girvan
Citation Context ...aph algorithms, because it serves as a building block for many other algorithms including betweenness centrality calculation [4], connected component identification [7], community structure detection =-=[8]-=-, and max-flow computation [9]. Benchmark suites targeting graph applications perennially include BFS as a primary element [10], [11]. Due to such importance, significant research has been conducted t... |

335 | A Faster Algorithm for Betweenness Centrality
- Brandes
- 2001
Citation Context ...oration is one important example of such problems. Graphs are a fundamental data representation widely used in numerous fields such as intelligence analysis [2], robotics [3], social network analysis =-=[4]-=-, and computational biology [5]. These applications have traditionally required long periods of processing time due to their massive data-set sizes. Parallelism has usually failed to alleviate matters... |

213 | Pregel: a system for large-scale graph processing
- Malewicz, Austern, et al.
- 2010
Citation Context .... There are a few frameworks or libraries which aim to simplify graph processing in distributed environments. PBGL [27] is a message-passing implementation of the classic Boost Graph Library. Pregel =-=[28]-=- is a distributed framework that encapsulates message passing and fault-tolerance in a similar manner to the MapReduce framework; traditional graph algorithms should be expressed in a description suit... |

167 |
The Algorithm Design Manual
- Skiena
- 1997
Citation Context ...idered one of the most important graph algorithms, because it serves as a building block for many other algorithms including betweenness centrality calculation [4], connected component identification =-=[7]-=-, community structure detection [8], and max-flow computation [9]. Benchmark suites targeting graph applications perennially include BFS as a primary element [10], [11]. Due to such importance, signif... |

154 | R-MAT: A recursive model for graph mining
- Chakrabarti, Zhan, et al.
- 2004
Citation Context ...As an illustration, Table I shows the number of nodes in each BFS level, obtained from a typical BFS execution on a synthetic graph with 32 million nodes and 256 million edges generated by the RMAT model =-=[23]-=-. (See Section IV for more discussion about our graph generation models.) From the table, one can observe that the maximum BFS level is small (7) for such a large graph and that most of the nodes belo... |

79 | Accelerating large graph algorithms on the GPU using
- Harish, Narayanan
- 2007
Citation Context ... order. Also, in order to accommodate such a wide range of applications, many BFS implementations simply store the BFS level (i.e. hop distance from the root) of each node as their final output [12], =-=[15]-=-, [19]. Two different strategies have been proposed for parallel (and distributed) execution of BFS. The first method, known as the fixed-point algorithm, continuously updates the BFS level of every no... |

57 |
Designing multithreaded algorithms for breadth-first search and st-connectivity on the Cray MTA-2
- Bader, Madduri
- 2006
Citation Context ...sing [29], which features high memory bandwidth, huge memory capacity, and many cores that are heavily multi-threaded. Graph algorithms, including BFS, showed impressive performance on these machines =-=[13]-=-, [16]. Unfortunately, such machines are rare and costly. Some researchers [15], [19] used GPU to accelerate graph algorithms, because GPU shares many architectural properties of aforementioned superc... |

52 |
Graph-based technologies for intelligence analysis
- Coffman, Greenblatt, et al.
- 2004
Citation Context ...tions have yet to be identified. Graph exploration is one important example of such problems. Graphs are a fundamental data representation widely used in numerous fields such as intelligence analysis =-=[2]-=-, robotics [3], social network analysis [4], and computational biology [5]. These applications have traditionally required long periods of processing time due to their massive data-set sizes. Parallel... |

42 | Global A-optimal robot exploration in SLAM
- Sim, Roy
- 2005
Citation Context ... to be identified. Graph exploration is one important example of such problems. Graphs are a fundamental data representation widely used in numerous fields such as intelligence analysis [2], robotics =-=[3]-=-, social network analysis [4], and computational biology [5]. These applications have traditionally required long periods of processing time due to their massive data-set sizes. Parallelism has usuall... |

42 | A scalable distributed parallel breadth-first search algorithm on BlueGene/L
- Yoo, Chow, et al.
- 2005
Citation Context ... perennially include BFS as a primary element [10], [11]. Due to such importance, significant research has been conducted to efficiently implement a parallel BFS for a wide array of computing systems =-=[12]-=-–[19]. Two recent results particularly draw our attention. One is Agarwal et al’s work [18] which presented a state-of-the-art BFS implementation for multi-core systems. Their implementation utilized ... |

41 | The Parallel BGL: A generic library for distributed graph computations
- Gregor, Lumsdaine
- 2005
Citation Context ...ributed processing is mandatory: the graph does not fit in a single machine’s memory. There are a few frameworks or libraries which aim to simplify graph processing in distributed environments. PBGL =-=[27]-=- is a message-passing implementation of the classic Boost Graph Library. Pregel [28] is a distributed framework that encapsulates message passing and fault-tolerance in a similar manner to the MapRedu... |

39 |
The GPU computing era
- Nickolls, Dally
Citation Context ...eld devices. The idea of using the graphics processor for general purpose computation has also become popular, since this approach has yielded tremendous performance when applied to suitable problems =-=[1]-=-. Such proliferation of parallelism (multiple threads on a CPU or GPU) and heterogeneity (simultaneous use of a CPU and GPU) has succeeded in greatly improving the performance of many traditional comp... |

39 |
Debunking the 100x GPU vs. CPU myth: an evaluation of throughput computing on CPU and GPU
- Lee, Kim, et al.
- 2010
Citation Context ...heir architectural effects. According to our observation, random memory access bandwidth was most critical to BFS performance. We refer the readers to recent papers regarding CPU vs. GPU debates [1], =-=[30]-=-. VII. CONCLUSION In this paper, we propose new methods for parallel breadth-first search (BFS) implementations. Our multi-core CPU methodology is simple to apply yet efficient in utilizing memory band... |

26 | Design and Implementation of the HPCS Graph Analysis Benchmark on Symmetric Multiprocessors
- Bader, Madduri
- 2005
Citation Context ...4], connected component identification [7], community structure detection [8], and max-flow computation [9]. Benchmark suites targeting graph applications perennially include BFS as a primary element =-=[10]-=-, [11]. Due to such importance, significant research has been conducted to efficiently implement a parallel BFS for a wide array of computing systems [12]–[19]. Two recent results particularly draw ou... |

23 | Challenges in parallel graph processing
- Lumsdaine, Gregor, et al.
Citation Context ...ed to alleviate matters, because parallel speedup of these applications is severely limited by the random nature of their memory access patterns, a fundamental property of graph processing algorithms =-=[6]-=-. Breadth-first search (BFS) is a fundamental graph algorithm that systematically explores the nodes in the graph. BFS is typically considered one of the most important graph algorithms, because it se... |

21 | Accelerating CUDA graph algorithms at maximum warp
- Hong, Kim, et al.
- 2011
Citation Context ... non-critical levels. We used $(T_1, T_2, T_3) = (64, \max(2^{18}, N \times 0.01), 2048)$ and $(\alpha, \beta) = (2.0, 2.0)$. We now observe that the idea of the hybrid method can enhance the previous GPU BFS implementation =-=[19]-=- using the same principles. Although not explicitly mentioned in its paper, this GPU implementation does suffer from the same non-critical level inefficiency issue as the Read-based method, since it a... |

17 | Multithreaded asynchronous graph traversal for in-memory and semi-external memory
- Pearce, Gokhale, et al.
Citation Context ...nt strategies. Hassaan et al [25] compared fixed-point and level synchronous strategies and confirmed that level synchronous strategy allows for a sufficient degree of parallelism in BFS. Pearce et al =-=[26]-=- applied a fixed-point strategy for various graph algorithms including BFS, focusing on reducing synchronization overhead. Finally, Yoo et al. [12] adopted a fixed-point strategy in order to implemen... |

13 |
Efficient Breadth-First Search on the Cell/BE Processor
- Scarpazza, Villa, et al.
- 2008
Citation Context ...cache coherence traffic between CPU sockets. When executed on a high-end multi-core system, their implementation outperformed previous proposals, even those including other architectures such as Cell =-=[14]-=-, clusters [12], and shared memory supercomputers [13], [16]. The other proposal is a BFS implementation for GPUs by Hong et al [19], reported around the same time as the Agarwal et al paper. Hong et ... |

11 | small-world network analysis and partitioning: An open-source parallel graph framework for the exploration of large-scale networks - Bader, Madduri, et al. - 2008 |

10 |
A combined experimental and computational strategy to define protein interaction networks for peptide recognition modules
- Tong, Drees, et al.
- 2002
Citation Context ...e of such problems. Graphs are a fundamental data representation widely used in numerous fields such as intelligence analysis [2], robotics [3], social network analysis [4], and computational biology =-=[5]-=-. These applications have traditionally required long periods of processing time due to their massive data-set sizes. Parallelism has usually failed to alleviate matters, because parallel speedup of t... |

7 | SNAP: small-world network analysis and partitioning
- Bader, Madduri
- 2010
Citation Context ...ed on the systems. TABLE III THE SPECIFICATION OF MACHINES USED IN OUR EXPERIMENTS AND THE PREVIOUS WORK [18] fractal community structure. Both of our generators came from a graph library called SNAP =-=[24]-=-. For the parameters of the RMAT graph, we used default values in the SNAP library: (a,b,c)=(0.45,0.25,0.15). As for graph representation, we used the CSR (Compressed Sparse Row) format which merges t... |

5 |
Better benchmarking for supercomputers
- Anderson
- 2011
Citation Context ...nnected component identification [7], community structure detection [8], and max-flow computation [9]. Benchmark suites targeting graph applications perennially include BFS as a primary element [10], =-=[11]-=-. Due to such importance, significant research has been conducted to efficiently implement a parallel BFS for a wide array of computing systems [12]–[19]. Two recent results particularly draw our atte... |

4 | Ordered vs Unordered: A Comparison of Parallelism and Work-efficiency
- Hassaan, Burtscher, et al.
- 2011
Citation Context ...ulti-GPU systems in our experiment. However, considering the random access nature of this problem, the additional benefit of using a multi-GPU system is unclear. fixed-point strategies. Hassaan et al =-=[25]-=- compared fixed-point and level synchronous strategies and confirmed that level synchronous strategy allows for a sufficient degree of parallelism in BFS. Pearce et al [26] applied a fixed-point strate... |

2 | Intel Microarchitecture, Codenamed Nehalem. http://www.intel.com/technology/architecture-silicon/next-gen/, Accessed - Intel - 2009 |

1 |
Early experience with out-of-core applications
- Chavarria-Miranda, Marquez, et al.
- 2008
Citation Context ...n a high-end multi-core system, their implementation outperformed previous proposals, even those including other architectures such as Cell [14], clusters [12], and shared memory supercomputers [13], =-=[16]-=-. The other proposal is a BFS implementation for GPUs by Hong et al [19], reported around the same time as the Agarwal et al paper. Hong et al solved the workload imbalance issue when processing irreg... |

1 |
Next generation CUDA architecture, code named Fermi. http://www.nvidia.com/object/fermi_architecture.html
- NVIDIA
Citation Context ...rmance improvement compared to multi-core CPU implementations. However, their comparison included neither Agarwal’s work nor more recent architectures such as the Nehalem CPU family [20] or Fermi GPU =-=[21]-=-. In this study, we build upon ideas from both previous works and incorporate them into a universal solution that utilizes both the CPU and GPU on a heterogeneous system. Specifically, we first propos... |