A Scalable Distributed Parallel Breadth-First Search Algorithm on BlueGene/L (0)

by A Yoo
Venue: ACM/IEEE SC 2005 Conference (SC’05)

Results 1 - 10 of 14

Software and Algorithms for Graph Queries on Multithreaded Architectures

by Jonathan W. Berry, Bruce Hendrickson, Simon Kahan, Petr Konecny
Cited by 8 (1 self)

Accelerating CUDA graph algorithms at maximum warp

by Sungpack Hong, Sang Kyun Kim, Tayo Oguntebi, Kunle Olukotun - In PPoPP, 2011
Cited by 6 (2 self)
Graphs are powerful data representations favored in many computational domains. Modern GPUs have recently shown promising results in accelerating computationally challenging graph problems but their performance suffers heavily when the graph structure is highly irregular, as most real-world graphs tend to be. In this study, we first observe that the poor performance is caused by work imbalance and is an artifact of a discrepancy between the GPU programming model and the underlying GPU architecture. We then propose a novel virtual warp-centric programming method that exposes the traits of underlying GPU architectures to users. Our method significantly improves the performance of applications with heavily imbalanced workloads, and enables trade-offs between workload imbalance and ALU underutilization for fine-tuning the performance. Our evaluation reveals that our method exhibits up to 9x speedup over previous GPU algorithms and 12x over single-thread CPU execution on irregular graphs. When properly configured, it also yields up to 30% improvement over previous GPU algorithms on regular graphs. In addition to performance gains on graph algorithms, our programming method achieves 1.3x to 15.1x speedup on a set of GPU benchmark applications. Our study also confirms that the performance gap between GPUs and other multi-threaded CPU graph implementations is primarily due to the large difference in memory bandwidth.

Polymorphic On-Chip Networks

by Martha Mercaldi Kim, John D. Davis, Mark Oskin, Todd Austin
Cited by 5 (0 self)
As the number of cores per die increases, be they processors, memory blocks, or custom accelerators, the on-chip interconnect the cores use to communicate gains importance. We begin this study with an area-performance analysis of the interconnect design space. We find that there is no single network design that yields optimal performance across a range of traffic patterns. This indicates that there is an opportunity to gain performance by customizing the interconnect to a particular application or workload. We propose polymorphic on-chip networks to enable per-application network customization. This network can be configured prior to application runtime, to have the topology and buffering of arbitrary network designs. This paper proposes one such polymorphic network architecture. We demonstrate its modes of configurability, and evaluate the polymorphic network architecture design space, producing polymorphic fabrics that minimize the network area overhead. Finally, we expand the network-on-chip design space to include a polymorphic network design, showing that a single polymorphic network is capable of implementing all of the Pareto-optimal fixed-network designs.

Multithreaded Asynchronous Graph Traversal for In-Memory and Semi-External Memory

by Roger Pearce, Maya Gokhale, Nancy M. Amato
Cited by 5 (0 self)
Processing large graphs is becoming increasingly important for many computational domains. Unfortunately, many algorithms and implementations do not scale with the demand for increasing graph sizes. As a result, researchers have attempted to meet the growing data demands using parallel and external memory techniques. Our work, targeted to chip multi-processors, takes a highly parallel asynchronous approach to hide the high data latency due to both poor locality and delays in the underlying graph data storage. We present a novel asynchronous approach to compute Breadth First Search (BFS), Single Source Shortest Path (SSSP), and Connected Components (CC) for large graphs in shared memory. We present an experimental study applying our technique to both In-Memory (IM) and Semi-External Memory (SEM) graphs utilizing multi-core processors and solid-state memory devices. Our experiments using both synthetic and real-world datasets show that our asynchronous approach is able to overcome data latencies and provide significant speedup over alternative approaches.

Challenges in Mapping Graph Exploration Algorithms on Advanced Multi-core Processors

by Oreste Villa, Daniele Paolo Scarpazza, Fabrizio Petrini, Juan Fernández Peinador - In: Proc. 21st Intl. Parallel and Distr. Processing Symp. (IPDPS 2007), 2007
Cited by 4 (0 self)
Multi-core processors are a shift of paradigm in computer architecture that promises a dramatic increase in performance. But multi-core processors also bring an unprecedented level of complexity in algorithmic design and software development. In this paper we describe the challenges and design choices involved in parallelizing a breadth-first search (BFS) algorithm on a state-of-the-art multi-core processor, the Cell Broadband Engine (Cell BE). Our experiments obtained on a pre-production Cell BE board running at 3.2 GHz show almost linear speedups when using multiple synergistic processing units, and an impressive level of performance when compared to other processors. The Cell BE is typically an order of magnitude faster than conventional processors, such as the AMD Opteron and the Intel Pentium 4 and Woodcrest, an order of magnitude faster than the MTA-2 multi-threaded processor, and two orders of magnitude faster than a BlueGene/L processor.

On the Representation and Multiplication of Hypersparse Matrices

by Aydın Buluç, John R. Gilbert, 2008
Cited by 4 (4 self)
Multicore processors are marking the beginning of a new era of computing where massive parallelism is available and necessary. Slightly slower but easy to parallelize kernels are becoming more valuable than sequentially faster kernels that are unscalable when parallelized. In this paper, we focus on the multiplication of sparse matrices (SpGEMM). We first present the issues with existing sparse matrix representations and multiplication algorithms that make them unscalable to thousands of processors. Then, we develop and analyze two new algorithms that overcome these limitations. We consider our algorithms first as the sequential kernel of a scalable parallel sparse matrix multiplication algorithm and second as part of a polyalgorithm for SpGEMM that would execute different kernels depending on the sparsity of the input matrices. Such a sequential kernel requires a new data structure that exploits the hypersparsity of the individual submatrices owned by a single processor after the 2D partitioning. We experimentally evaluate the performance and characteristics of our algorithms and show that they scale significantly better than existing kernels.

A work-efficient parallel breadth-first search algorithm (or how to cope with the nondeterminism of reducers)

by Charles E. Leiserson, Tao B. Schardl - In SPAA ’10: Proceedings of the 22nd ACM symposium on Parallelism in algorithms and architectures, 2010
Cited by 4 (1 self)
We have developed a multithreaded implementation of breadth-first search (BFS) of a sparse graph using the Cilk++ extensions to C++. Our PBFS program on a single processor runs as quickly as a standard C++ breadth-first search implementation. PBFS achieves high work-efficiency by using a novel implementation of a multiset data structure, called a “bag,” in place of the FIFO queue usually employed in serial breadth-first search algorithms. For a variety of benchmark input graphs whose diameters are significantly smaller than the number of vertices — a condition met by many real-world graphs — PBFS demonstrates good speedup with the number of processing cores. Since PBFS employs a nonconstant-time “reducer” — a “hyperobject” feature of Cilk++ — the work inherent in a PBFS execution depends nondeterministically on how the underlying work-stealing scheduler load-balances the computation. We provide a general method for analyzing nondeterministic programs that use reducers. PBFS also is nondeterministic in that it contains benign races which affect its performance but not its correctness. Fixing these races with mutual-exclusion locks slows down PBFS empirically, but it makes the algorithm amenable to analysis. In particular, we show that for a graph $G = (V, E)$ with diameter $D$ and bounded out-degree, this data-race-free version of the PBFS algorithm runs in time $O((V + E)/P + D \lg^3(V/D))$ on $P$ processors, which means that it attains near-perfect linear speedup if $P \ll (V + E)/(D \lg^3(V/D))$.
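
The layer-by-layer structure that lets PBFS swap the FIFO queue for an unordered collection can be sketched serially as follows. This is a minimal illustration only, not the authors' Cilk++ code: the function name `layered_bfs`, the adjacency-dict input, and the use of a plain Python list in place of the paper's "bag" are all assumptions made for the example.

```python
def layered_bfs(adj, source):
    """Return a dict mapping each reachable vertex to its BFS depth.

    adj is assumed to be a dict: vertex -> iterable of out-neighbours.
    """
    dist = {source: 0}
    frontier = [source]  # unordered layer; stands in for the paper's "bag"
    depth = 0
    while frontier:
        next_frontier = []
        # Vertices within one layer do not depend on each other, so the
        # order of this loop does not affect the depths. A parallel
        # runtime could split the layer; here it runs serially.
        for u in frontier:
            for v in adj.get(u, ()):
                if v not in dist:
                    dist[v] = depth + 1
                    next_frontier.append(v)
        frontier = next_frontier
        depth += 1
    return dist
```

Because each layer is fully expanded before the next begins, the result does not depend on the order in which a layer's vertices are visited, which is the property that makes an unordered multiset a valid replacement for the queue.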

Scalable Graph Exploration on Multicore Processors

by Virat Agarwal, Fabrizio Petrini, Davide Pasetto, David A. Bader
Cited by 3 (0 self)
Many important problems in computational sciences, social network analysis, security, and business analytics, are data-intensive and lend themselves to graph-theoretical analyses. In this paper we investigate the challenges involved in exploring very large graphs by designing a breadth-first search (BFS) algorithm for advanced multi-core processors that are likely to become the building blocks of future exascale systems. Our new methodology for large-scale graph analytics combines a high-level algorithmic design that captures the machine-independent aspects, to guarantee portability with performance to future processors, with an implementation that embeds processor-specific optimizations. We present an experimental study that uses state-of-the-art Intel Nehalem EP and EX processors and up to 64 threads in a single system. Our performance on several benchmark problems representative of the power-law graphs found in real-world problems reaches processing rates that are competitive with supercomputing results in the recent literature. In the experimental evaluation we prove that our graph exploration algorithm running on a 4-socket Nehalem EX is (1) 2.4 times faster than a Cray XMT with 128 processors when exploring a random graph with 64 million vertices and 512 million edges, (2) capable of processing 550 million edges per second with an R-MAT graph with 200 million vertices and 1 billion edges, comparable to the performance of a similar graph on a Cray MTA-2 with 40 processors and (3) 5 times faster than 256 BlueGene/L processors on a graph with average degree 50.

Efficient parallel graph exploration for multi-core CPU and GPU

by Sungpack Hong, Tayo Oguntebi, Kunle Olukotun - In IEEE PACT, 2011
Cited by 3 (1 self)
Graphs are a fundamental data representation that have been used extensively in various domains. In graph-based applications, a systematic exploration of the graph such as a breadth-first search (BFS) often serves as a key component in the processing of their massive data sets. In this paper, we present a new method for implementing the parallel BFS algorithm on multi-core CPUs which exploits a fundamental property of randomly shaped real-world graph instances. By utilizing memory bandwidth more efficiently, our method shows improved performance over the current state-of-the-art implementation and increases its advantage as the size of the graph increases. We then propose a hybrid method which, for each level of the BFS algorithm, dynamically chooses the best implementation from: a sequential execution, two different methods of multicore execution, and a GPU execution. Such a hybrid approach provides the best performance for each graph size while avoiding poor worst-case performance on high-diameter graphs. Finally, we study the effects of the underlying architecture on BFS performance by comparing multiple CPU and GPU systems; a high-end GPU system performed as well as a quad-socket high-end CPU system.

Highly Parallel Sparse Matrix-Matrix Multiplication

by Aydın Buluç, John R. Gilbert
Cited by 1 (1 self)
Generalized sparse matrix-matrix multiplication is a key primitive for many high performance graph algorithms as well as some linear solvers such as multigrid. We present the first parallel algorithms that achieve increasing speedups for an unbounded number of processors. Our algorithms are based on two-dimensional block distribution of sparse matrices where serial sections use a novel hypersparse kernel for scalability. We give a state-of-the-art MPI implementation of one of our algorithms. Our experiments show scaling up to thousands of processors on a variety of test scenarios.