Results 1 - 9 of 9
The Combinatorial BLAS: Design, Implementation, and Applications
, 2010
Abstract

Cited by 21 (9 self)
This paper presents a scalable high-performance software library to be used for graph analysis and data mining. Large combinatorial graphs appear in many applications of high-performance computing, including computational biology, informatics, analytics, web search, dynamical systems, and sparse matrix methods. Graph computations are difficult to parallelize using traditional approaches due to their irregular nature and low operational intensity. Many graph computations, however, contain sufficient coarse-grained parallelism for thousands of processors, which can be uncovered by using the right primitives. We describe the Parallel Combinatorial BLAS, which consists of a small but powerful set of linear algebra primitives specifically targeting graph and data mining applications. We provide an extensible library interface and some guiding principles for future development. The library is evaluated using two important graph algorithms, in terms of both performance and ease of use. The scalability and raw performance of the example applications, using the Combinatorial BLAS, are unprecedented on distributed memory clusters.
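To make the "linear algebra primitives for graphs" idea concrete, here is a minimal illustrative sketch (plain Python, not the Combinatorial BLAS API, whose actual interface is C++) of how one BFS expansion step can be phrased as a matrix-vector product over the Boolean (OR, AND) semiring:

```python
# Sketch: one BFS frontier expansion as a sparse matrix-vector product
# over the Boolean (OR, AND) semiring. Function names are illustrative.

def semiring_spmv(adj, frontier):
    """adj: dict vertex -> list of out-neighbors (sparse adjacency);
    frontier: set of vertices with a 'true' entry in the input vector.
    Returns the set of vertices reachable in exactly one hop."""
    nxt = set()
    for u in frontier:               # AND: u is in frontier and edge (u, v) exists
        nxt.update(adj.get(u, ()))   # OR: any incoming contribution suffices
    return nxt

def bfs_levels(adj, source):
    """Level-by-level BFS built entirely from the semiring primitive."""
    level = {source: 0}
    frontier = {source}
    d = 0
    while frontier:
        d += 1
        reached = semiring_spmv(adj, frontier)
        frontier = {v for v in reached if v not in level}  # mask out visited
        for v in frontier:
            level[v] = d
    return level
```

The point of the abstraction is that the same primitive, instantiated with different semirings, also expresses shortest paths, connected components, and other traversals.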
Parallel breadth-first search on distributed memory systems
, 2011
Abstract

Cited by 14 (8 self)
Data-intensive, graph-based computations are pervasive in several scientific applications, and are known to be quite challenging to implement on distributed memory systems. In this work, we explore the design space of parallel algorithms for breadth-first search (BFS), a key subroutine in several graph algorithms. We present two highly tuned parallel approaches for BFS on large parallel systems: a level-synchronous strategy that relies on a simple vertex-based partitioning of the graph, and a two-dimensional sparse matrix partitioning-based approach that mitigates parallel communication overhead. For both approaches, we also present hybrid versions with intra-node multithreading. Our novel hybrid two-dimensional algorithm reduces communication times by up to a factor of 3.5, relative to a common vertex-based approach. Our experimental study identifies execution regimes in which these approaches will be competitive, and we demonstrate extremely high performance on leading distributed-memory parallel systems. For instance, for a 40,000-core parallel execution on Hopper, an AMD Magny-Cours based system, we achieve a BFS performance rate of 17.8 billion edge visits per second on an undirected graph of 4.3 billion vertices and 68.7 billion edges with skewed degree distribution.
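The difference between the two partitionings compared above can be sketched by the owner functions alone (illustrative Python; the block-interval mapping is a simplification of what a real implementation would use):

```python
# 1D: each of p processors owns a contiguous range of vertices together
# with all of their edges, so a frontier exchange may touch all p peers.
def owner_1d(u, p, n):
    return u * p // n

# 2D: the n x n adjacency matrix is tiled over a pr x pc processor grid,
# so edge (u, v) lives on one grid cell, and expanding a frontier only
# requires communication along a single grid row or column.
def owner_2d(u, v, pr, pc, n):
    return (u * pr // n, v * pc // n)
```

Shrinking each collective from p participants to pr (or pc) participants is the source of the communication reduction the abstract reports.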
Parallel sparse matrix-vector and matrix-transpose-vector multiplication using compressed sparse blocks
 In SPAA
, 2009
Abstract

Cited by 13 (1 self)
This paper introduces a storage format for sparse matrices, called compressed sparse blocks (CSB), which allows both Ax and A^T x to be computed efficiently in parallel, where A is an n × n sparse matrix with nnz ≥ n nonzeros and x is a dense n-vector. Our algorithms use Θ(nnz) work (serial running time) and Θ(√n lg n) span (critical-path length), yielding a parallelism of Θ(nnz / √n lg n), which is amply high for virtually any large matrix. The storage requirement for CSB is essentially the same as that for the more standard compressed-sparse-rows (CSR) format, for which computing Ax in parallel is easy but A^T x is difficult. Benchmark results indicate that on one processor, the CSB algorithms for Ax and A^T x run just as fast as the CSR algorithm for Ax, but the CSB algorithms also scale up linearly with processors until limited by off-chip memory bandwidth.
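A minimal serial sketch of the CSB idea (illustrative Python, not the paper's implementation): nonzeros are grouped into β × β blocks, and within a block they are kept as coordinate triples, so the very same structure serves both y = Ax and y = A^T x. The paper's contribution, the work-stealing parallel traversal of blocks, is omitted here.

```python
# Build a block map: (block row, block col) -> list of (local i, local j, value).
def csb_build(triples, beta):
    blocks = {}
    for i, j, v in triples:
        blocks.setdefault((i // beta, j // beta), []).append(
            (i % beta, j % beta, v))
    return blocks

# One kernel computes either Ax or A^T x: transposing just swaps the
# reconstructed global row and column indices, with no second copy of A.
def csb_spmv(blocks, x, n, beta, transpose=False):
    y = [0.0] * n
    for (bi, bj), entries in blocks.items():
        for li, lj, v in entries:
            i, j = bi * beta + li, bj * beta + lj
            if transpose:
                i, j = j, i
            y[i] += v * x[j]
    return y
```

In CSR, by contrast, A^T x forces either a column-wise traversal (scattered writes) or an explicit transpose, which is exactly the asymmetry CSB removes.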
Reduced-bandwidth multithreaded algorithms for sparse matrix-vector multiplication
 In Proc. IPDPS
, 2011
Abstract

Cited by 7 (0 self)
Abstract—On multicore architectures, the ratio of peak memory bandwidth to peak floating-point performance (byte:flop ratio) is decreasing as core counts increase, further limiting the performance of bandwidth-limited applications. Multiplying a sparse matrix (as well as its transpose in the unsymmetric case) with a dense vector is the core of sparse iterative methods. In this paper, we present a new multithreaded algorithm for the symmetric case which potentially cuts the bandwidth requirements in half while exposing lots of parallelism in practice. We also give a new data structure transformation, called bitmasked register blocks, which promises significant reductions in bandwidth requirements by reducing the number of indexing elements without introducing additional fill-in zeros. Our work shows how to incorporate this transformation into existing parallel algorithms (both symmetric and unsymmetric) without limiting their parallel scalability. Experimental results indicate that the combined benefits of bitmasked register blocks and the new symmetric algorithm can be as high as a factor of 3.5x in multicore performance over an already scalable parallel approach. We also provide a model that accurately predicts the performance of the new methods, showing that even larger performance gains are expected in future multicore systems as current trends (decreasing byte:flop ratio and larger sparse matrices) continue.
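The bandwidth-halving idea for symmetric A can be shown in a few lines (a serial sketch, not the paper's algorithm): only the upper triangle is stored, and each stored a_ij contributes to both y[i] and y[j], so the matrix is streamed from memory once instead of twice. The paper's actual contribution is doing this in parallel without races on y, which this sketch omits.

```python
# upper_triples: list of (i, j, value) with i <= j, upper triangle only.
def sym_spmv(upper_triples, x, n):
    y = [0.0] * n
    for i, j, v in upper_triples:
        y[i] += v * x[j]
        if i != j:              # off-diagonal entry also acts as a_ji
            y[j] += v * x[i]
    return y
```

Since SpMV is typically memory-bound, reading half the nonzeros translates almost directly into the potential 2x bandwidth saving the abstract describes.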
Highly Parallel Sparse Matrix-Matrix Multiplication
, 2010
Abstract

Cited by 6 (3 self)
Generalized sparse matrix-matrix multiplication is a key primitive for many high-performance graph algorithms as well as some linear solvers such as multigrid. We present the first parallel algorithms that achieve increasing speedups for an unbounded number of processors. Our algorithms are based on a two-dimensional block distribution of sparse matrices where serial sections use a novel hypersparse kernel for scalability. We give a state-of-the-art MPI implementation of one of our algorithms. Our experiments show scaling up to thousands of processors on a variety of test scenarios.
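For reference, the serial core of sparse matrix-matrix multiplication is Gustavson's row-by-row algorithm, sketched below in illustrative Python (dict-of-dicts storage). The parallel algorithms in the paper distribute 2D blocks of A and B and run a hypersparse variant of such a kernel on each block pair; that distribution layer is not shown here.

```python
# Matrices as dict: row index -> {col index: value}; absent rows are empty.
def spgemm(A, B):
    C = {}
    for i, arow in A.items():
        crow = {}
        for k, a_ik in arow.items():            # expand row i of A
            for j, b_kj in B.get(k, {}).items():
                crow[j] = crow.get(j, 0) + a_ik * b_kj  # accumulate C[i][j]
        if crow:
            C[i] = crow
    return C
```

Note that the flop count is proportional to the number of nonzero products, not n^3, which is why, as the ICPP paper below also observes, there is little arithmetic available to hide communication behind.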
Gaussian Elimination Based Algorithms on the GPU
, 2008
Abstract

Cited by 1 (0 self)
We implemented and evaluated several Gaussian elimination based algorithms on Graphics Processing Units (GPUs). These algorithms, LU decomposition without pivoting, all-pairs shortest-paths, and transitive closure, all have similar data access patterns. The impressive computational power and memory bandwidth of the GPU make it an attractive platform to run such computationally intensive algorithms. Although improvements over CPU implementations have previously been achieved for those algorithms in terms of raw speed, the utilization of the underlying computational resources was quite low. We implemented a recursively partitioned all-pairs shortest-paths algorithm that harnesses the power of GPUs better than existing implementations. The alternate schedule of path computations allowed us to cast almost all operations into matrix-matrix multiplications on a semiring. Since matrix-matrix multiplication is highly optimized and has a high ratio of computation to communication, our implementation does not suffer from the premature saturation of bandwidth resources as iterative algorithms do. By increasing temporal locality, our implementation runs more than two orders of magnitude faster on an NVIDIA 8800 GPU than on an Opteron. Our work provides evidence that programmers should rethink algorithms instead of directly porting them to the GPU.
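The semiring view mentioned above can be illustrated compactly (plain Python, not the GPU code): "multiplying" a distance matrix by itself over (min, +) doubles the maximum path length considered, so repeated squaring yields all-pairs shortest paths, and each squaring is a dense, cache-friendly matrix product rather than an iterative sweep.

```python
INF = float("inf")  # marks an absent edge

# One (min, +) semiring "square": D'[i][j] = min_k (D[i][k] + D[k][j]).
def minplus_square(D):
    n = len(D)
    return [[min(D[i][k] + D[k][j] for k in range(n))
             for j in range(n)] for i in range(n)]

# All-pairs shortest paths by repeated squaring: O(log n) products.
def apsp(D):
    n, hops = len(D), 1
    while hops < n:
        D = minplus_square(D)
        hops *= 2
    return D
```

The recursively partitioned algorithm in the paper goes further, but this is the underlying reason the work could be cast as matrix-matrix multiplication at all.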
Modeling the Locality in Graph Traversals
Abstract
Abstract—An increasing number of applications in the physical and social sciences require the analysis of large graphs. The efficiency of these programs strongly depends on their memory usage, especially the locality of graph data access. Intuitively, the locality in computation should reflect the locality in graph topology. Existing locality models, however, operate either at the program level for regular loops and arrays or at the trace level for arbitrary access streams. They are not sufficient to characterize the relation between locality and connectivity. This paper presents a new metric called the vertex distance and uses it to model the locality in breadth-first graph traversal (BFS). It shows three models that use the average node degree and the edge distribution to predict the number of BFS levels and the reuse distance distribution of BFS. Finally, it evaluates the new models using random and non-random graphs. Keywords: locality; graph traversals; vertex distance; reuse distance.
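The reuse distance being modeled is the classical LRU stack distance: for each access, the number of distinct data items touched since the previous access to the same item (infinite on a first access). A deliberately simple quadratic sketch (efficient analyzers use a tree over the trace):

```python
def reuse_distances(trace):
    """Return the LRU stack distance of each access in the trace."""
    last = {}   # item -> index of its previous access
    out = []
    for t, x in enumerate(trace):
        if x in last:
            # Distinct items touched strictly between the two accesses to x.
            out.append(len(set(trace[last[x] + 1:t])))
        else:
            out.append(float("inf"))  # cold miss: first access to x
        last[x] = t
    return out
```

The paper's models predict the distribution of these distances for BFS from graph statistics (average degree, edge distribution) instead of measuring a full trace.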
Challenges and Advances in Parallel Sparse Matrix-Matrix Multiplication
 In 37th International Conference on Parallel Processing
Abstract
We identify the challenges that are special to parallel sparse matrix-matrix multiplication (PSpGEMM). We show that sparse algorithms are not as scalable as their dense counterparts, because in general there are not enough nontrivial arithmetic operations to hide the communication costs as well as the sparsity overheads. We analyze the scalability of 1D and 2D algorithms for PSpGEMM. While the 1D algorithm is a variant of existing implementations, the 2D algorithms presented are completely novel. Most of these algorithms are based on previous research on parallel dense matrix multiplication. We also provide results from preliminary experiments with the 2D algorithms.
Algebraic Domain Decomposition Methods for Highly Heterogeneous Problems
, 2013
Abstract
We consider the solving of linear systems arising from porous media flow simulations with high heterogeneities. Using a Newton algorithm to handle the nonlinearity leads to the solving of a sequence of linear systems with different but similar matrices and right-hand sides. The parallel solver is a Schwarz domain decomposition method. The unknowns are partitioned with a criterion based on the entries of the input matrix. This leads to substantial gains compared to a partition based only on the adjacency graph of the matrix. From the information generated during the solving of the first linear system, it is possible to build a coarse space for a two-level domain decomposition algorithm that leads to an acceleration of the convergence of the subsequent linear systems. We compare two coarse spaces: a classical approach and a new one adapted to parallel
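For readers unfamiliar with the method family, here is a minimal sketch (illustrative Python on a tiny 1D Laplacian, not the paper's solver) of one damped step of one-level overlapping additive Schwarz: each subdomain solves its local restriction of the residual equation exactly, and the corrections are summed. The two-level coarse space the paper constructs is omitted.

```python
def solve_dense(A, b):
    """Tiny Gaussian elimination with partial pivoting for the local solves."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def schwarz_step(A, b, x, domains, damp=0.5):
    """One damped additive Schwarz iteration; domains are overlapping index sets."""
    n = len(b)
    r = [b[i] - sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
    x = x[:]
    for dom in domains:
        Aloc = [[A[i][j] for j in dom] for i in dom]   # restrict A to the subdomain
        corr = solve_dense(Aloc, [r[i] for i in dom])  # exact local solve
        for k, i in enumerate(dom):
            x[i] += damp * corr[k]                     # sum damped corrections
    return x
```

Without a coarse space, convergence of such one-level methods degrades as the number of subdomains grows, which is the motivation for the two-level variants the paper compares.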