Scalable GPU graph traversal
In 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP '12, 2012
Cited by 62 (1 self)
Breadth-first search (BFS) is a core primitive for graph traversal and a basis for many higher-level graph analysis algorithms. It is also representative of a class of parallel computations whose memory accesses and work distribution are both irregular and data-dependent. Recent work has demonstrated the plausibility of GPU sparse graph traversal, but has tended to focus on asymptotically inefficient algorithms that perform poorly on graphs with non-trivial diameter. We present a BFS parallelization focused on fine-grained task management constructed from efficient prefix sum that achieves an asymptotically optimal O(V+E) work complexity. Our implementation delivers excellent performance on diverse graphs, achieving traversal rates in excess of 3.3 billion and 8.3 billion traversed edges per second using single and quad-GPU configurations, respectively. This level of performance is several times faster than state-of-the-art implementations on both CPU and GPU platforms.
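To illustrate how a prefix sum can organize frontier expansion with O(V+E) total work, here is a minimal sequential sketch in Python. It is not the paper's GPU implementation: the function name `bfs_prefix_sum`, the CSR argument names, and the simple visited filter are assumptions made for illustration. The exclusive prefix sum over frontier degrees gives every frontier vertex a private, non-overlapping slot range in the output buffer, which is what would let parallel threads write neighbors without contention.

```python
from itertools import accumulate


def bfs_prefix_sum(row_offsets, col_indices, source):
    """Level-synchronous BFS over a CSR graph (hypothetical helper).

    Returns the BFS depth of every vertex, or -1 if it is unreachable.
    """
    n = len(row_offsets) - 1
    depth = [-1] * n
    depth[source] = 0
    frontier = [source]
    level = 0
    while frontier:
        # Degree of each frontier vertex = number of neighbors it will emit.
        degrees = [row_offsets[v + 1] - row_offsets[v] for v in frontier]
        # Exclusive prefix sum: offsets[i] is where vertex i starts writing.
        offsets = [0] + list(accumulate(degrees))
        gathered = [0] * offsets[-1]
        # On a parallel machine each iteration of this loop could run
        # independently, because the slot ranges never overlap.
        for i, v in enumerate(frontier):
            start = row_offsets[v]
            for k in range(degrees[i]):
                gathered[offsets[i] + k] = col_indices[start + k]
        # Filter: keep only vertices seen for the first time.
        level += 1
        next_frontier = []
        for u in gathered:
            if depth[u] == -1:
                depth[u] = level
                next_frontier.append(u)
        frontier = next_frontier
    return depth
```

Each vertex enters the frontier at most once, so each adjacency list is read at most once, which is where the O(V+E) bound comes from.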
X-Stream: Edge-centric Graph Processing using Streaming Partitions
Cited by 31 (2 self)
X-Stream is a system for processing both in-memory and out-of-core graphs on a single shared-memory machine. While retaining the scatter-gather programming model with state stored in the vertices, X-Stream is novel in (i) using an edge-centric rather than a vertex-centric implementation of this model, and (ii) streaming completely unordered edge lists rather than performing random access. This design is motivated by the fact that sequential bandwidth for all storage media (main memory, SSD, and magnetic disk) is substantially larger than random access bandwidth. We demonstrate that a large number of graph algorithms can be expressed using the edge-centric scatter-gather model. The resulting implementations scale well in terms of number of cores, in terms of number of I/O devices, and across different storage media. X-Stream competes favorably with existing systems for graph processing. Besides sequential access, we identify as one of the main contributors to better performance the fact that X-Stream does not need to sort edge lists during preprocessing.
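The edge-centric scatter-gather idea can be sketched with a small in-memory example. The code below is an illustrative assumption, not X-Stream's API: the function name `edge_centric_components` and the choice of connected components by minimum-label propagation are ours. The point it shows is that both phases only stream sequentially, first over an unordered edge list and then over the generated updates, while all state lives in the vertices.

```python
def edge_centric_components(num_vertices, edges):
    """Connected components by min-label propagation (hypothetical sketch).

    `edges` is an unordered list of undirected (u, v) pairs; it is never
    sorted or indexed by vertex, only streamed from start to end.
    """
    label = list(range(num_vertices))  # vertex state
    changed = True
    while changed:
        # Scatter: stream the edge list and emit one update per direction.
        updates = []
        for u, v in edges:
            updates.append((v, label[u]))
            updates.append((u, label[v]))
        # Gather: stream the updates and fold them into vertex state.
        changed = False
        for dst, candidate in updates:
            if candidate < label[dst]:
                label[dst] = candidate
                changed = True
    return label
```

A vertex-centric version would instead look up each vertex's adjacency list, which requires sorted or indexed edges and random access.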
Distributed Memory Breadth-First Search Revisited: Enabling Bottom-Up Search
Cited by 7 (1 self)
Breadth-first search (BFS) is a fundamental graph primitive frequently used as a building block for many complex graph algorithms. In the worst case, the complexity of BFS is linear in the number of edges and vertices, and the conventional top-down approach always takes as much time as the worst case. A recently discovered bottom-up approach manages to cut down the complexity all the way to the number of vertices in the best case, which is typically at least an order of magnitude less than the number of edges. The bottom-up approach is not always advantageous, so it is combined with the top-down approach to make the direction-optimizing algorithm, which adaptively switches from top-down to bottom-up as the frontier expands. We present a scalable distributed-memory parallelization of this challenging algorithm and show up to an order of magnitude speedups compared to an earlier purely top-down code. Our approach also uses a 2D decomposition of the graph that has previously been shown to be superior to a 1D decomposition. Using the default parameters of the Graph500 benchmark, our new algorithm achieves a performance rate of over 240 billion edges per second on 115 thousand cores of a Cray XE6, which makes it over 7× faster than a conventional top-down algorithm using the same set of optimizations and data distribution.
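The switch between the two directions can be sketched on a single machine as follows. This is a simplified illustration, not the distributed algorithm from the paper: the function name `direction_optimizing_bfs` and the `switch_fraction` threshold (go bottom-up once the frontier holds more than that fraction of all vertices) are assumptions chosen for brevity. The graph is assumed undirected, given as adjacency lists.

```python
def direction_optimizing_bfs(adj, source, switch_fraction=0.25):
    """BFS that picks top-down or bottom-up per level (hypothetical sketch).

    Returns a parent array; unreachable vertices keep parent -1.
    """
    n = len(adj)
    parent = [-1] * n
    parent[source] = source
    frontier = {source}
    while frontier:
        next_frontier = set()
        if len(frontier) > switch_fraction * n:
            # Bottom-up: every unvisited vertex looks for any neighbor in
            # the frontier and stops at the first hit, skipping the rest
            # of its edges.
            for v in range(n):
                if parent[v] == -1:
                    for u in adj[v]:
                        if u in frontier:
                            parent[v] = u
                            next_frontier.add(v)
                            break
        else:
            # Top-down: every frontier vertex examines all of its edges.
            for u in frontier:
                for v in adj[u]:
                    if parent[v] == -1:
                        parent[v] = u
                        next_frontier.add(v)
        frontier = next_frontier
    return parent
```

The early `break` in the bottom-up branch is the source of the savings: when the frontier is large, most unvisited vertices find a parent after checking only a few edges.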
SIMD parallelization of applications that traverse irregular data structures
In CGO, 2013
Cited by 4 (0 self)
Fine-grained data parallelism is increasingly common in mainstream processors in the form of longer vectors and on-chip GPUs. This paper develops support for exploiting such data parallelism for a class of non-numeric, non-graphic applications, which perform computations while traversing many independent, irregular data structures. While the traversal of any one irregular data structure does not give opportunity for parallelization, traversing a set of these does. However, mapping such parallelism to SIMD units is non-trivial and not addressed in prior work. We address this problem by developing an intermediate language for specifying such traversals, followed by a runtime scheduler that maps traversals to SIMD units. A key idea in our runtime scheme is converting branches to arithmetic operations, which then allows us to use SIMD hardware. In order to make our approach fast, we demonstrate several optimizations including a stream compaction method that aids with control flow in SIMD, a set of layouts that reduce memory latency, and a tiling approach that enables more effective prefetching. Using our approach, we demonstrate significant increases in single-core performance over optimized baselines for two applications.
High Performance and Scalable GPU Graph Traversal
2011
Cited by 3 (1 self)
Breadth-first search (BFS) is a core primitive for graph traversal and a basis for many higher-level graph analysis algorithms. It is also representative of a class of parallel computations whose memory accesses and work distribution are both irregular and data-dependent. Recent work has demonstrated the plausibility of GPU sparse graph traversal, but has tended to focus on asymptotically inefficient algorithms that perform poorly on graphs with non-trivial diameter. We present a BFS parallelization focused on fine-grained task management that achieves an asymptotically optimal O(V+E) work complexity. Our implementation delivers excellent performance on diverse graphs, achieving traversal rates in excess of 3.3 billion and 8.3 billion traversed edges per second using single and quad-GPU configurations, respectively. This level of performance is several times faster than state-of-the-art implementations on both CPU and GPU platforms.
PrefEdge: SSD prefetcher for large-scale graph traversal
In Proceedings of the 7th International Systems and Storage Conference, SYSTOR '14, 2014
Cited by 3 (0 self)
Mining large graphs has now become an important aspect of multiple diverse applications and a number of computer systems have been proposed to provide runtime support. Recent interest in this area has led to the construction of single-machine graph computation systems that use solid state drives (SSDs) to store the graph. This approach reduces the cost and simplifies the implementation of graph algorithms, making computations on large graphs available to the average user. However, SSDs are slower than main memory, and making full use of their bandwidth is crucial for executing graph algorithms in a reasonable amount of time. In this paper, we present PrefEdge, a prefetcher for graph algorithms that parallelises requests to derive maximum throughput from SSDs. PrefEdge combines a judicious distribution of graph state between main memory and SSDs with an innovative read-ahead algorithm to prefetch needed data in parallel. This is in contrast to existing approaches that depend on multi-threading the graph algorithms to saturate available bandwidth. Our experiments on graph algorithms using random access show that PrefEdge not only is capable of maximising the throughput from SSDs but is also able to almost hide the effect of I/O latency. The improvement in runtime for graph algorithms is up to 14× when compared to a single-threaded baseline. When compared to multi-threaded implementations, PrefEdge performs up to 80% faster without the program complexity and the programmer effort needed for multi-threaded graph algorithms.
Task-based Parallel Breadth-First Search in Heterogeneous Environments
Cited by 2 (0 self)
Breadth-first search (BFS) is an essential graph traversal strategy widely used in many computing applications. Because of its irregular data access patterns, BFS has become a non-trivial problem that is hard to parallelize efficiently. In this paper, we introduce a parallelization strategy that allows the load balancing of computation resources as well as the execution of graph traversals in hybrid environments composed of CPUs and GPUs. To achieve that goal, we use a fine-grained task-based parallelization scheme and the OmpSs programming model. We obtain processing rates up to 2.8 billion traversed edges per second with a single GPU and a multi-core processor. Our study shows high processing rates are achievable with hybrid environments despite the GPU communication latency and memory coherence.
CRONO: A Benchmark Suite for Multithreaded Graph Algorithms Executing on Futuristic Multicores
In IEEE International Symposium on Workload Characterization (IISWC), 2015
Cited by 1 (0 self)
Algorithms operating on a graph setting are known to be highly irregular and unstructured. This leads to workload imbalance and data locality challenges when these algorithms are parallelized and executed on the evolving multicore processors. Previous parallel benchmark suites for shared memory multicores have focused on various workload domains, such as scientific, graphics, vision, financial and media processing. However, these suites lack graph applications that must be evaluated in the context of architectural design space exploration for futuristic multicores. This paper presents CRONO, a benchmark suite composed of multi-threaded graph algorithms for shared memory multicore processors. We analyze and characterize these benchmarks using a multicore simulator, as well as a real multicore machine setup. CRONO uses both synthetic and real-world graphs. Our characterization shows that graph benchmarks are diverse and challenging in the context of scaling efficiency. They exhibit low locality due to unstructured memory access patterns, and incur fine-grain communication between threads. Energy overheads also occur due to non-deterministic memory and synchronization patterns on network connections. Our characterization reveals that these challenges remain in state-of-the-art graph algorithms, and in this context CRONO can be used to identify, analyze and develop novel architectural methods to mitigate their efficiency bottlenecks in futuristic multicore processors.
Fast Diameter Computation of Large Sparse Graphs using GPUs
Cited by 1 (0 self)
In this paper we propose a highly parallel GPU-based bounding algorithm for computing the exact diameter of large real-world sparse graphs. The diameter is defined as the length of the longest shortest path between vertices in the graph, and serves as a relevant property of all types of graphs that are nowadays frequently studied. Examples include social networks, webgraphs and routing networks. We verify the performance of our parallel approach on a set of large graphs comprised of millions of vertices, and using a CUDA GPU observe an increase in performance of up to 21.1× compared to a CPU algorithm using the same strategy. Based on these results, we provide a characterization of the types of graphs that are well-suited for traversal by means of our parallel diameter algorithm. We furthermore include a comparison of different GPU algorithms for single-source shortest path computations, which is not only a crucial step in computing the diameter, but also relevant in many other distance and neighborhood-based algorithms. Keywords: diameter; graph traversal; eccentricity; CUDA; sparse graphs
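The definition of diameter as the longest shortest path can be made concrete with a brute-force baseline. The sketch below is not the paper's bounding algorithm: it simply runs one BFS per vertex and takes the largest eccentricity, which is the cost that bounding techniques aim to reduce. The function names `eccentricity` and `diameter` are hypothetical, and the graph is assumed to be unweighted, undirected, and connected.

```python
from collections import deque


def eccentricity(adj, source):
    """Largest BFS distance from `source` to any reachable vertex."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return max(dist.values())


def diameter(adj):
    """Exact diameter as the maximum eccentricity over all vertices.

    Assumes a connected graph given as adjacency lists; one BFS per
    vertex, so the cost is O(V * (V + E)).
    """
    return max(eccentricity(adj, s) for s in range(len(adj)))
```

For a path of four vertices, `diameter([[1], [0, 2], [1, 3], [2]])` evaluates to 3, the distance between the two endpoints.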
MachSuite: Benchmarks for Accelerator Design and Customized Architectures
Cited by 1 (0 self)
Recent high-level synthesis and accelerator-related architecture papers show a great disparity in workload selection. To improve standardization within the accelerator research community, we present MachSuite, a collection of 19 benchmarks for evaluating high-level synthesis tools and accelerator-centric architectures. MachSuite spans a broad application space, captures a variety of different program behaviors, and provides implementations tailored towards the needs of accelerator designers and researchers, including support for high-level synthesis. We illustrate these aspects by characterizing each benchmark along five different dimensions, highlighting trends and salient features.