Results 1  10
of
33
Scalable gpu graph traversal
 In 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP’12
, 2012
"... Breadthfirst search (BFS) is a core primitive for graph traversal and a basis for many higherlevel graph analysis algorithms. It is also representative of a class of parallel computations whose memory accesses and work distribution are both irregular and datadependent. Recent work has demonstrate ..."
Abstract

Cited by 17 (0 self)
 Add to MetaCart
Breadthfirst search (BFS) is a core primitive for graph traversal and a basis for many higherlevel graph analysis algorithms. It is also representative of a class of parallel computations whose memory accesses and work distribution are both irregular and datadependent. Recent work has demonstrated the plausibility of GPU sparse graph traversal, but has tended to focus on asymptotically inefficient algorithms that perform poorly on graphs with nontrivial diameter. We present a BFS parallelization focused on finegrained task management constructed from efficient prefix sum that achieves an asymptotically optimal O(V+E) work complexity. Our implementation delivers excellent performance on diverse graphs, achieving traversal rates in excess of 3.3 billion and 8.3 billion traversed edges per second using single and quadGPU configurations, respectively. This level of performance is several times faster than stateoftheart implementations both CPU and GPU platforms.
Accelerating CUDA graph algorithms at maximum warp
 In PPoPP
, 2011
"... Graphs are powerful data representations favored in many computational domains. Modern GPUs have recently shown promising results in accelerating computationally challenging graph problems but their performance suffers heavily when the graph structure is highly irregular, as most realworld graphs t ..."
Abstract

Cited by 15 (2 self)
 Add to MetaCart
Graphs are powerful data representations favored in many computational domains. Modern GPUs have recently shown promising results in accelerating computationally challenging graph problems but their performance suffers heavily when the graph structure is highly irregular, as most realworld graphs tend to be. In this study, we first observe that the poor performance is caused by work imbalance and is an artifact of a discrepancy between the GPU programming model and the underlying GPU architecture. We then propose a novel virtual warpcentric programming method that exposes the traits of underlying GPU architectures to users. Our method significantly improves the performance of applications with heavily imbalanced workloads, and enables tradeoffs between workload imbalance and ALU underutilization for finetuning the performance. Our evaluation reveals that our method exhibits up to 9x speedup over previous GPU algorithms and 12x over single thread CPU execution on irregular graphs. When properly configured, it also yields up to 30 % improvement over previous GPU algorithms on regular graphs. In addition to performance gains on graph algorithms, our programming method achieves 1.3x to 15.1x speedup on a set of GPU benchmark applications. Our study also confirms that the performance gap between GPUs and other multithreaded CPU graph implementations is primarily due to the large difference in memory bandwidth.
Parallel breadthfirst search on distributed memory systems
, 2011
"... Dataintensive, graphbased computations are pervasive in several scientific applications, and are known to to be quite challenging to implement on distributed memory systems. In this work, we explore the design space of parallel algorithms for BreadthFirst Search (BFS), a key subroutine in several ..."
Abstract

Cited by 14 (8 self)
 Add to MetaCart
Dataintensive, graphbased computations are pervasive in several scientific applications, and are known to to be quite challenging to implement on distributed memory systems. In this work, we explore the design space of parallel algorithms for BreadthFirst Search (BFS), a key subroutine in several graph algorithms. We present two highlytuned parallel approaches for BFS on large parallel systems: a levelsynchronous strategy that relies on a simple vertexbased partitioning of the graph, and a twodimensional sparse matrix partitioningbased approach that mitigates parallel communication overhead. For both approaches, we also present hybrid versions with intranode multithreading. Our novel hybrid twodimensional algorithm reduces communication times by up to a factor of 3.5, relative to a common vertex based approach. Our experimental study identifies execution regimes in which these approaches will be competitive, and we demonstrate extremely high performance on leading distributedmemory parallel systems. For instance, for a 40,000core parallel execution on Hopper, an AMD MagnyCours based system, we achieve a BFS performance rate of 17.8 billion edge visits per second on an undirected graph of 4.3 billion vertices and 68.7 billion edges with skewed degree distribution. 1.
A workefficient parallel breadthfirst search algorithm (or how to cope with the nondeterminism of reducers
 In SPAA ’10: Proceedings of the 22nd ACM symposium on Parallelism in algorithms and architectures
, 2010
"... We have developed a multithreaded implementation of breadthfirst search (BFS) of a sparse graph using the Cilk++ extensions to C++. Our PBFS program on a single processor runs as quickly as a standard C++ breadthfirst search implementation. PBFS achieves high workefficiency by using a novel imple ..."
Abstract

Cited by 13 (3 self)
 Add to MetaCart
We have developed a multithreaded implementation of breadthfirst search (BFS) of a sparse graph using the Cilk++ extensions to C++. Our PBFS program on a single processor runs as quickly as a standard C++ breadthfirst search implementation. PBFS achieves high workefficiency by using a novel implementation of a multiset data structure, called a “bag, ” in place of the FIFO queue usually employed in serial breadthfirst search algorithms. For a variety of benchmark input graphs whose diameters are significantly smaller than the number of vertices — a condition met by many realworld graphs — PBFS demonstrates good speedup with the number of processing cores. Since PBFS employs a nonconstanttime “reducer ” — a “hyperobject” feature of Cilk++ — the work inherent in a PBFS execution depends nondeterministically on how the underlying workstealing scheduler loadbalances the computation. We provide a general method for analyzing nondeterministic programs that use reducers. PBFS also is nondeterministic in that it contains benign races which affect its performance but not its correctness. Fixing these races with mutualexclusion locks slows down PBFS empirically, but it makes the algorithm amenable to analysis. In particular, we show that for a graph G =(V,E) with diameter D and bounded outdegree, this dataracefree version of PBFS algorithm runs in time O((V + E)/P + Dlg3 (V /D)) on P processors, which means that it attains nearperfect linear speedup if P ≪ (V + E)/Dlg3 (V /D).
1 Multithreaded Asynchronous Graph Traversal for InMemory and SemiExternal Memory
"... Abstract—Processing large graphs is becoming increasingly important for many computational domains. Unfortunately, many algorithms and implementations do not scale with the demand for increasing graph sizes. As a result, researchers have attempted to meet the growing data demands using parallel and ..."
Abstract

Cited by 13 (0 self)
 Add to MetaCart
Abstract—Processing large graphs is becoming increasingly important for many computational domains. Unfortunately, many algorithms and implementations do not scale with the demand for increasing graph sizes. As a result, researchers have attempted to meet the growing data demands using parallel and external memory techniques. Our work, targeted to chip multiprocessors, takes a highly parallel asynchronous approach to hide the high data latency due to both poor locality and delays in the underlying graph data storage. We present a novel asynchronous approach to compute Breadth First Search (BFS), Single Source Shortest Path (SSSP), and Connected Components (CC) for large graphs in shared memory. We present an experimental study applying our technique to both InMemory (IM) and SemiExternal Memory (SEM) graphs utilizing multicore processors and solidstate memory devices. Our experiments using both synthetic and realworld datasets show that our asynchronous approach is able to overcome data latencies and provide significant speedup over alternative approaches. I.
Scalable Graph Exploration on Multicore Processors
"... Abstract—Many important problems in computational sciences, social network analysis, security, and business analytics, are dataintensive and lend themselves to graphtheoretical analyses. In this paper we investigate the challenges involved in exploring very large graphs by designing a breadthfirs ..."
Abstract

Cited by 10 (1 self)
 Add to MetaCart
Abstract—Many important problems in computational sciences, social network analysis, security, and business analytics, are dataintensive and lend themselves to graphtheoretical analyses. In this paper we investigate the challenges involved in exploring very large graphs by designing a breadthfirst search (BFS) algorithm for advanced multicore processors that are likely to become the building blocks of future exascale systems. Our new methodology for largescale graph analytics combines a highlevel algorithmic design that captures the machineindependent aspects, to guarantee portability with performance to future processors, with an implementation that embeds processorspecific optimizations. We present an experimental study that uses stateoftheart Intel Nehalem EP and EX processors and up to 64 threads in a single system. Our performance on several benchmark problems representative of the powerlaw graphs found in realworld problems reaches processing rates that are competitive with supercomputing results in the recent literature. In the experimental evaluation we prove that our graph exploration algorithm running on a 4socket Nehalem EX is (1) 2.4 times faster than a Cray XMT with 128 processors when exploring a random graph with 64 million vertices and 512 millions edges, (2) capable of processing 550 million edges per second with an RMAT graph with 200 million vertices and 1 billion edges, comparable to the performance of a similar graph on a Cray MTA2 with 40 processors and (3) 5 times faster than 256 BlueGene/L processors on a graph with average degree 50. I.
On the Representation and Multiplication of Hypersparse Matrices
, 2008
"... Multicore processors are marking the beginning of a new era of computing where massive parallelism is available and necessary. Slightly slower but easy to parallelize kernels are becoming more valuable than sequentially faster kernels that are unscalable when parallelized. In this paper, we focus on ..."
Abstract

Cited by 9 (7 self)
 Add to MetaCart
Multicore processors are marking the beginning of a new era of computing where massive parallelism is available and necessary. Slightly slower but easy to parallelize kernels are becoming more valuable than sequentially faster kernels that are unscalable when parallelized. In this paper, we focus on the multiplication of sparse matrices (SpGEMM). We first present the issues with existing sparse matrix representations and multiplication algorithms that make them unscalable to thousands of processors. Then, we develop and analyze two new algorithms that overcome these limitations. We consider our algorithms first as the sequential kernel of a scalable parallel sparse matrix multiplication algorithm and second as part of a polyalgorithm for SpGEMM that would execute different kernels depending on the sparsity of the input matrices. Such a sequential kernel requires a new data structure that exploits the hypersparsity of the individual submatrices owned by a single processor after the 2D partitioning. We experimentally evaluate the performance and characteristics of our algorithms and show that they scale significantly better than existing kernels.
Polymorphic OnChip Networks
"... As the number of cores per die increases, be they processors, memory blocks, or custom accelerators, the onchip interconnect the cores use to communicate gains importance. We begin this study with an areaperformance analysis of the interconnect design space. We find that there is no single network ..."
Abstract

Cited by 8 (0 self)
 Add to MetaCart
As the number of cores per die increases, be they processors, memory blocks, or custom accelerators, the onchip interconnect the cores use to communicate gains importance. We begin this study with an areaperformance analysis of the interconnect design space. We find that there is no single network design that yields optimal performance across a range of traffic patterns. This indicates that there is an opportunity to gain performance by customizing the interconnect to a particular application or workload. We propose polymorphic onchip networks to enable perapplication network customization. This network can be configured prior to application runtime, to have the topology and buffering of arbitrary network designs. This paper proposes one such polymorphic network architecture. We demonstrate its modes of configurability, and evaluate the polymorphic network architecture design space, producing polymorphic fabrics that minimize the network area overhead. Finally, we expand the network on chip design space to include a polymorphic network design, showing that a single polymorphic network is capable of implementing all of the pareto optimal fixednetwork designs. 1
Efficient parallel graph exploration for multicore cpu and gpu
 In IEEE PACT
, 2011
"... Abstract—Graphs are a fundamental data representation that have been used extensively in various domains. In graphbased applications, a systematic exploration of the graph such as a breadthfirst search (BFS) often serves as a key component in the processing of their massive data sets. In this pape ..."
Abstract

Cited by 8 (1 self)
 Add to MetaCart
Abstract—Graphs are a fundamental data representation that have been used extensively in various domains. In graphbased applications, a systematic exploration of the graph such as a breadthfirst search (BFS) often serves as a key component in the processing of their massive data sets. In this paper, we present a new method for implementing the parallel BFS algorithm on multicore CPUs which exploits a fundamental property of randomly shaped realworld graph instances. By utilizing memory bandwidth more efficiently, our method shows improved performance over the current stateoftheart implementation and increases its advantage as the size of the graph increases. We then propose a hybrid method which, for each level of the BFS algorithm, dynamically chooses the best implementation from: a sequential execution, two different methods of multicore execution, and a GPU execution. Such a hybrid approach provides the best performance for each graph size while avoiding poor worstcase performance on highdiameter graphs. Finally, we study the effects of the underlying architecture on BFS performance by comparing multiple CPU and GPU systems; a highend GPU system performed as well as a quadsocket highend CPU system. I.