Results 1 - 7 of 7
A work-efficient parallel breadth-first search algorithm (or how to cope with the nondeterminism of reducers)
 In SPAA ’10: Proceedings of the 22nd ACM Symposium on Parallelism in Algorithms and Architectures, 2010
Abstract

Cited by 13 (3 self)
We have developed a multithreaded implementation of breadth-first search (BFS) of a sparse graph using the Cilk++ extensions to C++. Our PBFS program on a single processor runs as quickly as a standard C++ breadth-first search implementation. PBFS achieves high work-efficiency by using a novel implementation of a multiset data structure, called a “bag,” in place of the FIFO queue usually employed in serial breadth-first search algorithms. For a variety of benchmark input graphs whose diameters are significantly smaller than the number of vertices — a condition met by many real-world graphs — PBFS demonstrates good speedup with the number of processing cores. Since PBFS employs a non-constant-time “reducer” — a “hyperobject” feature of Cilk++ — the work inherent in a PBFS execution depends nondeterministically on how the underlying work-stealing scheduler load-balances the computation. We provide a general method for analyzing nondeterministic programs that use reducers. PBFS also is nondeterministic in that it contains benign races which affect its performance but not its correctness. Fixing these races with mutual-exclusion locks slows down PBFS empirically, but it makes the algorithm amenable to analysis. In particular, we show that for a graph G = (V, E) with diameter D and bounded out-degree, this data-race-free version of PBFS runs in time O((V + E)/P + D lg^3(V/D)) on P processors, which means that it attains near-perfect linear speedup if P ≪ (V + E)/(D lg^3(V/D)).
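The abstract's central idea — replacing the serial FIFO queue with an unordered per-layer “bag” — can be illustrated with a serial, level-synchronous sketch. This is only a minimal illustration in Python, not the authors' Cilk++ code: plain sets stand in for the paper's bag data structure and reducer hyperobjects, and `layer_bfs` and the adjacency-dict encoding are names chosen here for the example.

```python
def layer_bfs(adj, source):
    """Level-synchronous BFS: each frontier is an unordered 'bag'
    (here a plain set) instead of a FIFO queue. In the paper's PBFS
    the bag is split recursively so the loop over the frontier can
    run in parallel under a work-stealing scheduler."""
    dist = {source: 0}
    frontier = {source}          # current layer; order is irrelevant
    d = 0
    while frontier:
        next_bag = set()
        for u in frontier:       # in PBFS this loop runs in parallel
            for v in adj[u]:
                if v not in dist:       # the check-then-set here is the
                    dist[v] = d + 1     # benign race the abstract mentions
                    next_bag.add(v)
        frontier = next_bag
        d += 1
    return dist
```

Because vertices within one layer all receive the same distance, the order in which a layer's bag is consumed does not affect correctness — which is why an unordered multiset suffices where serial BFS needs a queue.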
Parallel sparse matrix-vector and matrix-transpose-vector multiplication using compressed sparse blocks
 In SPAA, 2009
Abstract

Cited by 13 (1 self)
This paper introduces a storage format for sparse matrices, called compressed sparse blocks (CSB), which allows both Ax and A^T x to be computed efficiently in parallel, where A is an n × n sparse matrix with nnz ≥ n nonzeros and x is a dense n-vector. Our algorithms use Θ(nnz) work (serial running time) and Θ(√n lg n) span (critical-path length), yielding a parallelism of Θ(nnz / (√n lg n)), which is amply high for virtually any large matrix. The storage requirement for CSB is essentially the same as that for the more standard compressed-sparse-rows (CSR) format, for which computing Ax in parallel is easy but A^T x is difficult. Benchmark results indicate that on one processor, the CSB algorithms for Ax and A^T x run just as fast as the CSR algorithm for Ax, but the CSB algorithms also scale up linearly with processors until limited by off-chip memory bandwidth.
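The asymmetry the abstract describes — Ax is easy to parallelize under CSR but A^T x is not — can be seen in a minimal serial sketch. This is an illustration of plain CSR only, not the paper's CSB format; the function names and toy encoding are chosen here for the example.

```python
def csr_spmv(n, rowptr, colind, vals, x):
    """y = A x in CSR. Each output y[i] depends only on row i,
    so iterations of the outer loop are independent and could
    safely run in parallel."""
    y = [0.0] * n
    for i in range(n):
        for k in range(rowptr[i], rowptr[i + 1]):
            y[i] += vals[k] * x[colind[k]]
    return y

def csr_spmv_t(n, rowptr, colind, vals, x):
    """y = A^T x using the same CSR arrays. Each row i now
    scatters into shared entries y[colind[k]], so a naive parallel
    outer loop would race on y -- the difficulty CSB's blocked
    layout is designed to avoid."""
    y = [0.0] * n
    for i in range(n):
        for k in range(rowptr[i], rowptr[i + 1]):
            y[colind[k]] += vals[k] * x[i]
    return y
```

CSB sidesteps the asymmetry by storing the matrix in square blocks with no row/column bias, so the same data structure serves both products.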
On the Representation and Multiplication of Hypersparse Matrices
, 2008
Abstract

Cited by 9 (7 self)
Multicore processors are marking the beginning of a new era of computing where massive parallelism is available and necessary. Slightly slower but easy-to-parallelize kernels are becoming more valuable than sequentially faster kernels that are unscalable when parallelized. In this paper, we focus on the multiplication of sparse matrices (SpGEMM). We first present the issues with existing sparse-matrix representations and multiplication algorithms that make them unscalable to thousands of processors. Then, we develop and analyze two new algorithms that overcome these limitations. We consider our algorithms first as the sequential kernel of a scalable parallel sparse matrix multiplication algorithm, and second as part of a polyalgorithm for SpGEMM that would execute different kernels depending on the sparsity of the input matrices. Such a sequential kernel requires a new data structure that exploits the hypersparsity of the individual submatrices owned by a single processor after 2D partitioning. We experimentally evaluate the performance and characteristics of our algorithms and show that they scale significantly better than existing kernels.
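The hypersparse regime the abstract targets is one where a processor's submatrix has fewer nonzeros than rows (nnz < n), so any kernel with Θ(n) per-matrix overhead — such as a CSR row-pointer array — dominates the useful work. A minimal sketch of an SpGEMM kernel whose cost depends only on the nonzeros can use a dictionary of nonzeros; this is an illustration of the principle only (the paper's kernels use a compressed format, not hash maps, and `hypersparse_spgemm` is a name chosen here):

```python
from collections import defaultdict

def hypersparse_spgemm(A, B):
    """C = A * B where each matrix is a {(row, col): value} dict of
    nonzeros. All loops iterate over nonzeros, never over the full
    dimension n, so the work is governed by nnz -- the property a
    hypersparse (nnz < n) submatrix kernel needs."""
    # Index B's nonzeros by row once: row k -> [(col j, value), ...]
    B_by_row = defaultdict(list)
    for (k, j), v in B.items():
        B_by_row[k].append((j, v))
    C = defaultdict(float)
    for (i, k), a in A.items():     # each nonzero A[i,k] pairs with row k of B
        for j, b in B_by_row[k]:
            C[(i, j)] += a * b
    return dict(C)
```

The same flop count as a dense triple loop over the nonzero structure, but with zero cost for empty rows and columns — which is exactly what matters for the tiny, sparse submatrices produced by 2D partitioning across thousands of processors.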
Highly Parallel Sparse Matrix-Matrix Multiplication
, 2010
Abstract

Cited by 6 (3 self)
Generalized sparse matrix-matrix multiplication is a key primitive for many high-performance graph algorithms as well as some linear solvers such as multigrid. We present the first parallel algorithms that achieve increasing speedups for an unbounded number of processors. Our algorithms are based on two-dimensional block distribution of sparse matrices, where serial sections use a novel hypersparse kernel for scalability. We give a state-of-the-art MPI implementation of one of our algorithms. Our experiments show scaling up to thousands of processors on a variety of test scenarios.
Elixir: A system for synthesizing concurrent graph programs
, 2012
Abstract

Cited by 1 (0 self)
Algorithms in new application areas like machine learning and network analysis use “irregular” data structures such as graphs, trees, and sets. Writing efficient parallel code in these problem domains is very challenging because it requires the programmer to make many choices: a given problem can usually be solved by several algorithms, each algorithm may have many implementations, and the best choice of algorithm and implementation can depend not only on the characteristics of the parallel platform but also on properties of the input data such as the structure of the graph. One solution is to permit the application programmer to experiment with different algorithms and implementations without writing every variant from scratch. Autotuning to find the best variant is a more ambitious solution. These ...
Parallel Algorithms for Graph Problems
Abstract
This work demonstrates good speedups for the Scalable Synthetic Compact Application 2 (SSCA2) and algebraic connectivity for small irregular graphs on the Explicit Multi-Threading (XMT) many-core architecture. Previous studies of these algorithms have focused on using high-performance computing architectures to solve large instances of the problems, but little work has been done to establish the performance of these parallel algorithms on problem sizes solvable on a smaller-scale system. Additionally, analysis of the algorithms shows how the ease-of-programming approach of XMT creates a clear path for developers to utilize multicore resources without having to completely recreate an algorithm for a specific architecture, by using the well-studied Parallel Random Access Machine (PRAM) model of algorithmic thinking. The compiler provided by the UMD XMT research team further makes it possible to trivially parallelize many segments of code and achieve acceptable speedup for the developer time invested. Future innovations from the UMD XMT research team will make this process even simpler, with larger performance gains.
Measurement, Performance
Abstract
The explosion of graph data in social and biological networks, recommendation systems, provenance databases, etc. makes graph storage and processing of paramount importance. We present a performance introspection framework for graph databases, PIG, which provides both a toolset and methodology for understanding graph database performance. PIG consists of a hierarchical collection of benchmarks that compose to produce performance models; the models provide a way to illuminate the strengths and weaknesses of a particular implementation. The suite has three layers of benchmarks: primitive operations, composite access patterns, and graph algorithms. While the framework could be used to compare different graph database systems, its primary goal is to help explain the observed performance of a particular system. Such introspection allows one to evaluate the degree to which systems exploit their knowledge of graph access patterns. We present both the PIG methodology and infrastructure and then demonstrate its efficacy by analyzing the popular Neo4j and DEX graph databases.