Results 1–10 of 10
Parallel sparse matrix-vector and matrix-transpose-vector multiplication using compressed sparse blocks
In SPAA, 2009
Abstract

Cited by 17 (1 self)
This paper introduces a storage format for sparse matrices, called compressed sparse blocks (CSB), which allows both Ax and Aᵀx to be computed efficiently in parallel, where A is an n × n sparse matrix with nnz ≥ n nonzeros and x is a dense n-vector. Our algorithms use Θ(nnz) work (serial running time) and Θ(√n lg n) span (critical-path length), yielding a parallelism of Θ(nnz / √n lg n), which is amply high for virtually any large matrix. The storage requirement for CSB is essentially the same as that for the more standard compressed-sparse-rows (CSR) format, for which computing Ax in parallel is easy but Aᵀx is difficult. Benchmark results indicate that on one processor, the CSB algorithms for Ax and Aᵀx run just as fast as the CSR algorithm for Ax, but the CSB algorithms also scale up linearly with processors until limited by off-chip memory bandwidth.
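The CSR asymmetry the abstract refers to can be seen in a minimal serial sketch (illustrative code, not the paper's CSB implementation): computing Ax reads each row independently, while computing Aᵀx from the same row-oriented layout scatters updates into shared output entries.

```python
# Hedged sketch of CSR sparse matrix-vector multiply. CSR stores a matrix as
# row_ptr (row start offsets), col_idx (column of each nonzero), vals (values).

def csr_spmv(row_ptr, col_idx, vals, x):
    """y = A @ x: each row's dot product writes only y[i], so rows can be
    processed in parallel with no write conflicts."""
    n = len(row_ptr) - 1
    y = [0.0] * n
    for i in range(n):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += vals[k] * x[col_idx[k]]
    return y

def csr_spmv_transpose(row_ptr, col_idx, vals, x):
    """y = A.T @ x from the same CSR data: each row scatters into y[col_idx[k]],
    so a naive row-parallel version races on y -- the difficulty CSB removes."""
    n = len(row_ptr) - 1
    y = [0.0] * n
    for i in range(n):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[col_idx[k]] += vals[k] * x[i]
    return y
```

For A = [[1, 2], [0, 3]] (CSR: row_ptr=[0, 2, 3], col_idx=[0, 1, 1], vals=[1, 2, 3]) and x = [1, 1], the two routines return Ax = [3, 3] and Aᵀx = [1, 5].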
A work-efficient parallel breadth-first search algorithm (or how to cope with the nondeterminism of reducers)
In SPAA ’10: Proceedings of the 22nd ACM Symposium on Parallelism in Algorithms and Architectures, 2010
Abstract

Cited by 15 (3 self)
We have developed a multithreaded implementation of breadth-first search (BFS) of a sparse graph using the Cilk++ extensions to C++. Our PBFS program on a single processor runs as quickly as a standard C++ breadth-first search implementation. PBFS achieves high work-efficiency by using a novel implementation of a multiset data structure, called a “bag,” in place of the FIFO queue usually employed in serial breadth-first search algorithms. For a variety of benchmark input graphs whose diameters are significantly smaller than the number of vertices — a condition met by many real-world graphs — PBFS demonstrates good speedup with the number of processing cores. Since PBFS employs a non-constant-time “reducer” — a “hyperobject” feature of Cilk++ — the work inherent in a PBFS execution depends nondeterministically on how the underlying work-stealing scheduler load-balances the computation. We provide a general method for analyzing nondeterministic programs that use reducers. PBFS is also nondeterministic in that it contains benign races which affect its performance but not its correctness. Fixing these races with mutual-exclusion locks slows down PBFS empirically, but it makes the algorithm amenable to analysis. In particular, we show that for a graph G = (V, E) with diameter D and bounded out-degree, this data-race-free version of PBFS runs in time O((V + E)/P + D lg³(V/D)) on P processors, which means that it attains near-perfect linear speedup if P ≪ (V + E)/(D lg³(V/D)).
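The level-synchronous structure that PBFS parallelizes can be sketched serially (a simplified analogue, not the Cilk++ code): each round expands the current frontier into the next one, and it is this frontier container that the paper replaces with an unordered "bag" reducer so vertices can be expanded in parallel.

```python
def bfs_levels(adj, source):
    """Level-synchronous BFS over an adjacency-list dict; returns distances.
    The frontier list plays the role that PBFS gives to its parallel 'bag':
    within one level, expansion order does not matter."""
    dist = {source: 0}
    frontier = [source]
    while frontier:
        next_frontier = []
        for u in frontier:
            for v in adj[u]:
                if v not in dist:            # first visit fixes v's level
                    dist[v] = dist[u] + 1
                    next_frontier.append(v)
        frontier = next_frontier             # advance one BFS level
    return dist
```

On the diamond graph {0: [1, 2], 1: [3], 2: [3], 3: []} from source 0, this yields distances {0: 0, 1: 1, 2: 1, 3: 2}.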
On the Representation and Multiplication of Hypersparse Matrices
2008
Abstract

Cited by 11 (7 self)
Multicore processors are marking the beginning of a new era of computing where massive parallelism is available and necessary. Slightly slower but easily parallelizable kernels are becoming more valuable than sequentially faster kernels that are unscalable when parallelized. In this paper, we focus on the multiplication of sparse matrices (SpGEMM). We first present the issues with existing sparse matrix representations and multiplication algorithms that make them unscalable to thousands of processors. Then, we develop and analyze two new algorithms that overcome these limitations. We consider our algorithms first as the sequential kernel of a scalable parallel sparse matrix multiplication algorithm, and second as part of a polyalgorithm for SpGEMM that would execute different kernels depending on the sparsity of the input matrices. Such a sequential kernel requires a new data structure that exploits the hypersparsity of the individual submatrices owned by a single processor after 2D partitioning. We experimentally evaluate the performance and characteristics of our algorithms and show that they scale significantly better than existing kernels.
Highly Parallel Sparse Matrix-Matrix Multiplication
2010
Abstract

Cited by 7 (3 self)
Generalized sparse matrix-matrix multiplication is a key primitive for many high-performance graph algorithms as well as some linear solvers such as multigrid. We present the first parallel algorithms that achieve increasing speedups for an unbounded number of processors. Our algorithms are based on two-dimensional block distribution of sparse matrices where serial sections use a novel hypersparse kernel for scalability. We give a state-of-the-art MPI implementation of one of our algorithms. Our experiments show scaling up to thousands of processors on a variety of test scenarios.
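A common serial SpGEMM kernel of the kind such algorithms distribute over a 2D processor grid is Gustavson's row-wise formulation; a minimal sketch (dict-of-dicts sparse representation, illustrative only, not the paper's hypersparse kernel):

```python
def spgemm(A, B):
    """Row-wise sparse C = A @ B (Gustavson's algorithm): for each nonzero
    A[i][k], accumulate A[i][k] * (row k of B) into row i of C.
    Matrices are dicts mapping row index -> {col index: value}."""
    C = {}
    for i, row in A.items():
        acc = {}                              # sparse accumulator for row i of C
        for k, a_ik in row.items():
            for j, b_kj in B.get(k, {}).items():
                acc[j] = acc.get(j, 0.0) + a_ik * b_kj
        if acc:                               # keep C itself sparse
            C[i] = acc
    return C
```

For A = [[1, 2], [0, 3]] and B = [[1, 0], [4, 1]] this produces C = [[9, 2], [12, 3]] in sparse form.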
Elixir: A system for synthesizing concurrent graph programs
2012
Abstract

Cited by 1 (0 self)
Algorithms in new application areas like machine learning and network analysis use “irregular” data structures such as graphs, trees and sets. Writing efficient parallel code in these problem domains is very challenging because it requires the programmer to make many choices: a given problem can usually be solved by several algorithms, each algorithm may have many implementations, and the best choice of algorithm and implementation can depend not only on the characteristics of the parallel platform but also on properties of the input data such as the structure of the graph. One solution is to permit the application programmer to experiment with different algorithms and implementations without writing every variant from scratch. Autotuning to find the best variant is a more ambitious solution.
Measurement, Performance
Abstract
The explosion of graph data in social and biological networks, recommendation systems, provenance databases, etc. makes graph storage and processing of paramount importance. We present a performance introspection framework for graph databases, PIG, which provides both a toolset and methodology for understanding graph database performance. PIG consists of a hierarchical collection of benchmarks that compose to produce performance models; the models provide a way to illuminate the strengths and weaknesses of a particular implementation. The suite has three layers of benchmarks: primitive operations, composite access patterns, and graph algorithms. While the framework could be used to compare different graph database systems, its primary goal is to help explain the observed performance of a particular system. Such introspection allows one to evaluate the degree to which systems exploit their knowledge of graph access patterns. We present both the PIG methodology and infrastructure and then demonstrate its efficacy by analyzing the popular Neo4j and DEX graph databases.
Parallel Algorithms for Graph Problems
Abstract
This work demonstrates good speedups for the Scalable Synthetic Compact Application 2 (SSCA2) and algebraic connectivity for small irregular graphs on the Explicit Multi-Threading (XMT) manycore architecture. Previous studies of these algorithms have focused on using high-performance computing architectures to solve large instances of the problems, but little work has been done that establishes the performance of these parallel algorithms on problem sizes solvable on a smaller-scale system. Additionally, analysis of the algorithms is presented that shows how the ease-of-programming approach of XMT creates a clear path for developers to utilize multicore resources without having to completely recreate the algorithm for the specific architecture, by using the well-studied Parallel Random Access Machine (PRAM) model of algorithmic thinking. The compiler provided by the UMD XMT research team further makes it possible to trivially parallelize many segments of code and achieve acceptable speedup for the developer time invested. Future innovations from the UMD XMT research team will result in this process becoming even simpler, with larger performance gains.
A Study of Parallel Betweenness Centrality Algorithm on a Manycore Architecture
2007
Abstract
Large-scale graph analysis algorithms, such as those in the SSCA2 benchmarks studied in this paper, play an increasingly important role in high-performance computing applications. Unlike most traditional scientific computing applications, graph algorithms often show dynamic and irregular computing behavior. It is difficult to attain good performance on large-scale conventional parallel architectures because these programs exhibit (i) little locality and data reuse, (ii) dynamically non-contiguous memory access patterns that are less amenable to static analysis, and (iii) fine-grain parallelism requiring lock synchronization. With the rapid advance of multicore/manycore chip technology, some new architectural features are emerging: the traditional data cache is being replaced with fast memories (sometimes called scratchpad memories) local to the cores in an explicit (user-visible) memory hierarchy, and a large number of processing cores (sometimes up to hundreds) are becoming available on a single chip. This presents both challenges and opportunities for mapping the graph algorithms studied in this paper. In this paper, a scalable parallel algorithm for computing betweenness centrality in scale-free sparse graphs is proposed and its performance and scalability are investigated.
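The standard serial baseline for the betweenness centrality computation discussed in these two entries is Brandes' algorithm: a BFS from each source counts shortest paths, then a reverse sweep accumulates dependencies. A minimal sketch (illustrative, not the papers' parallel code):

```python
from collections import deque

def betweenness(adj):
    """Brandes' algorithm for unweighted, directed betweenness centrality.
    adj maps vertex -> list of out-neighbors."""
    bc = {v: 0.0 for v in adj}
    for s in adj:
        order, preds = [], {v: [] for v in adj}
        sigma = {v: 0 for v in adj}          # shortest-path counts
        sigma[s] = 1
        dist = {s: 0}
        q = deque([s])
        while q:                             # forward BFS phase
            u = q.popleft()
            order.append(u)
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
                if dist[v] == dist[u] + 1:   # v continues a shortest path
                    sigma[v] += sigma[u]
                    preds[v].append(u)
        delta = {v: 0.0 for v in adj}
        for u in reversed(order):            # dependency accumulation phase
            for p in preds[u]:
                delta[p] += sigma[p] / sigma[u] * (1 + delta[u])
            if u != s:
                bc[u] += delta[u]
    return bc
```

On the path graph 0–1–2 (edges in both directions), only the middle vertex lies on any shortest path between distinct endpoints, so bc = {0: 0.0, 1: 2.0, 2: 0.0}.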
Analysis and Performance Results of Computing Betweenness Centrality
2009
Abstract
This paper presents a joint study of application and architecture to improve the performance and scalability of an irregular application, computing betweenness centrality, on the IBM Cyclops64 manycore architecture. The characteristics of unstructured parallelism, dynamically non-contiguous memory access and low arithmetic intensity in betweenness centrality pose an obstacle to an efficient mapping of parallel algorithms onto such manycore architectures. By identifying several key architectural features, we propose and evaluate efficient strategies for achieving scalability on a massively multithreaded manycore architecture. We demonstrate several optimization strategies, including multi-grain parallelism, just-in-time locality with an explicit memory hierarchy and non-preemptive thread execution, and fine-grain data synchronization. Compared with a conventional parallel algorithm, we obtain a 4X–50X improvement in performance and a 16X improvement in scalability on a 128-core IBM Cyclops64 simulator.
AN INTERACTIVE ENVIRONMENT TO MANIPULATE LARGE GRAPHS
Abstract
Interactive environments such as MATLAB and Star-P have made numerical computing tremendously accessible to engineers and scientists. They allow people who are not well-versed in the art of numerical computing to nonetheless reap its benefits. The same is not true in general for combinatorial computing. Often, many interesting problems require a mix of numerical and combinatorial computing. Tools developed for numerical computing, such as sparse matrix algorithms, can also be used to develop a comprehensive infrastructure for graph algorithms. We describe the current status of our effort to build a comprehensive infrastructure for operations on large graphs in an interactive parallel environment such as Star-P.