## The Combinatorial BLAS: Design, Implementation, and Applications (2010)

Citations: 23 (9 self)

### BibTeX

@MISC{Buluç10thecombinatorial,
  author = {Aydın Buluç and John R. Gilbert},
  title = {The Combinatorial BLAS: Design, Implementation, and Applications},
  year = {2010}
}

### Abstract

This paper presents a scalable high-performance software library to be used for graph analysis and data mining. Large combinatorial graphs appear in many applications of high-performance computing, including computational biology, informatics, analytics, web search, dynamical systems, and sparse matrix methods. Graph computations are difficult to parallelize using traditional approaches due to their irregular nature and low operational intensity. Many graph computations, however, contain sufficient coarse-grained parallelism for thousands of processors, which can be uncovered by using the right primitives. We describe the Parallel Combinatorial BLAS, which consists of a small but powerful set of linear algebra primitives specifically targeting graph and data mining applications. We provide an extendible library interface and some guiding principles for future development. The library is evaluated using two important graph algorithms, in terms of both performance and ease-of-use. The scalability and raw performance of the example applications, using the Combinatorial BLAS, are unprecedented on distributed memory clusters.

### Citations

2023 | MapReduce: simplified data processing on large clusters
- Dean, Ghemawat
- 2008
Citation Context ...ximal independent set, and (bi)connected components problems efficiently. On the other hand, it is possible to implement some clustering and connected components algorithms using the MapReduce model (Dean and Ghemawat 2008), but the approaches are quite unintuitive and the performance is unknown (Cohen 2009). Our work fills a crucial gap by providing primitives that can be used for traversing graphs. The goal of having... |

1673 |
Iterative Methods for Sparse Linear Systems
- SAAD
- 2003
Citation Context ...t is different from the SpDistMat, which distributes the storage of its sparse matrices. Almost all popular sparse matrix storage formats are internally composed of a number of arrays (Dongarra 2000; Saad 2003; Buluç et al. 2009), since arrays are cache friendlier than pointer-based data structures. Following this observation, the parallel classes handle object creation and communication through what we cal... |

1184 |
A bridging model for parallel computation
- Valiant
- 1990
Citation Context ... (Malewicz et al. 2010), which is a vertex-centric message passing system targeting distributed memory. It is intended for programs that can be described in the Bulk Synchronous Parallel (BSP) model (Valiant 1990). In Pregel, edges are not first-class citizens and they cannot impose computation. By contrast, the Combinatorial BLAS is edge-based; each element of the sparse matrix represents an edge and the und... |

798 | On understanding types, data abstraction, and polymorphism
- Cardelli, Wegner
- 1985
Citation Context ...never necessary. The software architecture for matrices is illustrated in Figure 1. Although the inheritance relationships are shown in the traditional way (via inclusion polymorphism as described by Cardelli and Wegner 1985), the class hierarchies are static, obtained by parameterizing the base class with its subclasses as explained below. Figure ... |

575 |
Basic linear algebra subprograms for FORTRAN usage
- Lawson, Hanson, et al.
- 1979
Citation Context ...lication of implementation efforts. Primitives have been successfully used in the past to enable many computing applications. The Basic Linear Algebra Subroutines (BLAS) for numerical linear algebra (Lawson et al. 1979) are probably the canonical example of a successful primitives package. The BLAS became widely popular following the success of LAPACK (Anderson et al. 1992). LINPACK’s use of the BLAS encouraged exp... |

505 |
Introduction to Parallel Computing: Design and Analysis of Algorithms
- Kumar, Grama, et al.
- 1994
Citation Context ...with all p processors. The partitioning of distributed matrices (sparse and dense) follows this processor grid organization, using a 2D block decomposition, also called the checkerboard partitioning (Grama et al. 2003). Figure 3 shows this for the sparse case. [Figure 3: Distributed sparse matrix class and storage] Portions of dense matrices are stored locally as two dimensional dense... |

446 |
LAPACK Users’ Guide
- Anderson, Bai, et al.
- 1995
Citation Context ...outines (BLAS) for numerical linear algebra (Lawson et al. 1979) are probably the canonical example of a successful primitives package. The BLAS became widely popular following the success of LAPACK (Anderson et al. 1992). LINPACK’s use of the BLAS encouraged experts (preferably the hardware vendors themselves) to implement its vector operations for optimal performance. In addition to efficiency benefits, BLAS offere... |

366 | The Landscape of Parallel Computing Research: A View from Berkeley
- Asanovic, Bodik, et al.
- 2006
Citation Context ...ing the level of abstraction of parallel computing by identifying the algorithmic commonalities across applications is becoming a widely accepted path to solution for the parallel software challenge (Asanovic et al. 2006; Brodman et al. 2009). Primitives both allow algorithm designers to think on a higher level of abstraction, and help to avoid duplication of implementation efforts. Primitives have been successfully ... |

334 | A Faster Algorithm for Betweenness Centrality - Brandes - 2001 |

320 |
A set of measures of centrality based on betweenness
- Freeman
- 1977
Citation Context ... for these applications, along with an alpha release of the complete library, can be freely obtained from http://gauss.cs.ucsb.edu/code/index.shtml. 5.1 Betweenness Centrality Betweenness centrality (Freeman 1977), a centrality metric based on shortest paths, is the main computation on which we evaluate the performance of our proof-of-concept implementation of the Combinatorial BLAS. There are two reasons ... |

270 |
Vector Models for Data-Parallel Computing
- BLELLOCH
- 1990
Citation Context ... software stack that eases the application programmer’s job does not exist for computations on graphs. Some existing primitives can be used to implement a number of graph algorithms. Scan primitives (Blelloch 1990) are used for solving the maximum flow, minimum spanning tree, maximal independent set, and (bi)connected components problems efficiently. On the other hand, it is possible to implement some cluste... |

207 | Pregel: a system for large-scale graph processing
- Malewicz, Austern, et al.
- 2010
Citation Context ...ms Limited GAPDT (Gilbert et al. 2008) Distributed Sparse Matrix Both Limited MTGL (Berry et al. 2007) Shared Visitor Algorithms Unknown SNAP (Bader and Madduri 2008) Shared Various Both High Pregel (Malewicz et al. 2010) Distributed Vertex-centric None Preliminary Combinatorial BLAS Distributed Sparse Matrix Kernels High The Parallel Boost Graph Library (PBGL) by Gregor and Lumsdaine (2005) is a parallel library for... |

149 | The Case of the Missing Supercomputer Performance: Achieving Optimal Performance on the 8192
- Petrini, Kerbyson, et al.
- 2003
Citation Context ...sults on more than 500 processors are not smooth, but the overall upward trend is clear. Run time variability of large-scale parallel codes, which can be due to various factors such as the OS jitter (Petrini et al. 2003), is widely reported in the literature (Van Straalen et al. 2009). The expensive computation prohibited us from running more experiments, which would have smoothed out the results by averaging. The best re... |

116 |
The Boost Graph Library - User Guide and Reference Manual. C++ in-depth series
- Siek, Lee, et al.
- 2002
Citation Context ...hs. It is a significant step towards facilitating rapid development of high performance applications that use distributed graphs as their main data structure. Like the sequential Boost Graph Library (Siek et al. 2001), it has a dual focus on efficiency and flexibility. It relies heavily on generic programming through C++ templates. Lumsdaine et al. (2007) observed poor scaling of PBGL for some large graph problem... |

86 |
GASNet Specification, v1.1
- Bonachea
- 2002
Citation Context ... the SpGEMM function by prefetching the internal arrays through one-sided communication. Alternatively, another SpDistMat class that uses a completely different communication library, such as GASNet (Bonachea 2002) or ARMCI (Nieplocha et al. 2005), can be implemented without requiring any changes to the sequential SpMat object. Most combinatorial operations use more than the traditional floating-point arithmet... |

84 | An overview of the Trilinos project
- Heroux, Bartlett, et al.
Citation Context ...ys, much of which is directed at numerical sparse matrix computation rather than graph computation. Many libraries exist for solving sparse linear system and eigenvalue problems; some, like Trilinos (Heroux et al. 2005), include significant combinatorial capabilities. The Sparse BLAS (Duff et al. 2002) is a standard API for numerical matrix- and vector-level primitives; its focus is infrastructure for iterative lin... |

67 | A multigrid tutorial - Second Edition - Briggs, Henson, et al. - 2000 |

56 | Designing multithreaded algorithms for breadth-first search and st-connectivity on the cray mta-2 - Bader, Madduri - 2006 |

41 | The parallel bgl: A generic library for distributed graph computations. Parallel Object-Oriented Scientific Computing (POOSC
- GREGOR, LUMSDAINE
- 2005
Citation Context ...ftware for graph computations is summarized in Table 1. Table 1: High-performance libraries and toolkits for parallel graph analysis Library/Toolkit Parallelism Abstraction Offering Scalability PBGL (Gregor and Lumsdaine 2005) Distributed Visitor Algorithms Limited GAPDT (Gilbert et al. 2008) Distributed Sparse Matrix Both Limited MTGL (Berry et al. 2007) Shared Visitor Algorithms Unknown SNAP (Bader and Madduri 2008) Sha... |

33 | Parallel algorithms for evaluating centrality indices in real-world networks - Bader, Madduri - 2006 |

33 | Curiously Recurring Template Patterns - Coplien - 1996 |

30 | A faster parallel algorithm and efficient multithreaded implementations for evaluating betweenness centrality on massive datasets
- Madduri, Ediger, et al.
- 2009
Citation Context ...ble Graph Analysis Benchmarks (formerly known as the HPCS Scalable Synthetic Compact Applications #2 (Bader et al.)) and various implementations on different platforms exist (Bader and Madduri 2006a; Madduri et al. 2009; Tan et al. 2009) for comparison. 5.2 BC Algorithm and Experimental Setup We compute betweenness centrality using the algorithm of Brandes (2001). It computes single source shortest paths from eac... |

30 | Graph clustering via a discrete uncoupling process - Dongen - 2008 |

26 |
An overview of the sparse basic linear algebra subprograms: The new standard from the BLAS technical forum
- Duff, Heroux, et al.
- 2002
Citation Context ... computation. Many libraries exist for solving sparse linear system and eigenvalue problems; some, like Trilinos (Heroux et al. 2005), include significant combinatorial capabilities. The Sparse BLAS (Duff et al. 2002) is a standard API for numerical matrix- and vector-level primitives; its focus is infrastructure for iterative linear system solvers, and therefore it does not include such primitives as sparse matr... |

25 | Approximating betweenness centrality - Bader, Kintali, et al. |

23 |
Scientific and Engineering C++, an Introduction with Advanced Techniques
- Barton, Nackman
- 1994
Citation Context ...erformance due to copying of such big objects. The template mechanism of C++ provided a neat solution to the mixed mode arithmetic problem by providing automatic type promotion through trait classes (Barton and Nackman 1994). Arbitrary semiring support for matrix-matrix and matrix-vector products is allowed by passing a class (with static add and multiply functions) as a template parameter to corresponding SpGEMM and SpMV... |

22 |
Graph twiddling in a mapreduce world
- Cohen
Citation Context ... possible to implement some clustering and connected components algorithms using the MapReduce model (Dean and Ghemawat 2008), but the approaches are quite unintuitive and the performance is unknown (Cohen 2009). Our work fills a crucial gap by providing primitives that can be used for traversing graphs. The goal of having a BLAS-like library for graph computation is to support rapid implementation of graph... |

20 | Advances, applications and performance of the global arrays shared memory programming toolkit
- Nieplocha, Palmer, et al.
Citation Context ...nfrastructure for iterative linear system solvers, and therefore it does not include such primitives as sparse matrix-matrix multiplication (SpGEMM) and sparse matrix indexing (SpRef). Global Arrays (Nieplocha et al. 2006) is a parallel dense and sparse array library that uses a one-sided communication infrastructure portable to message-passing, NUMA, and shared-memory machines. Star-P (Shah and Gilbert 2004) and pMat... |

17 | Software and Algorithms for Graph Queries on Multithreaded Architectures
- Berry, Hendrickson, et al.
- 2007
Citation Context ...y/Toolkit Parallelism Abstraction Offering Scalability PBGL (Gregor and Lumsdaine 2005) Distributed Visitor Algorithms Limited GAPDT (Gilbert et al. 2008) Distributed Sparse Matrix Both Limited MTGL (Berry et al. 2007) Shared Visitor Algorithms Unknown SNAP (Bader and Madduri 2008) Shared Various Both High Combinatorial BLAS Distributed Sparse Matrix Kernels High The Parallel Boost Graph Library (PBGL) by Gregor... |

16 | High Performance Remote Memory Access Comunications: The ARMCI Approach
- Nieplocha, Tipparaju, et al.
- 2006
Citation Context ...refetching the internal arrays through one-sided communication. Alternatively, another SpDistMat class that uses a completely different communication library, such as GASNet (Bonachea 2002) or ARMCI (Nieplocha et al. 2005), can be implemented without requiring any changes to the sequential SpMat object. Most combinatorial operations use more than the traditional floating-point arithmetic, with integer and boolean oper... |

13 | Challenges and advances in parallel sparse matrix-matrix multiplication - Buluç, Gilbert |

13 | Concept-Controlled Polymorphism
- Jarvi, Willcock, et al.
- 2003
Citation Context ...ad to a run-time error or an exception. Static OOP catches any such incompatibilities at compile time. An equally expressive alternative to CRTP would be to use enable_if-based function overloads (Järvi et al. 2003). The SpMat object is local to a node but it need not be sequential. It can be implemented as a shared-memory data structure, amenable to thread-level parallelization. This flexibility will allow fut... |

11 |
small-world network analysis and partitioning: An open-source parallel graph framework for the exploration of large-scale networks
- Bader, Madduri, et al.
- 2008
Citation Context ... (Gregor and Lumsdaine 2005) Distributed Visitor Algorithms Limited GAPDT (Gilbert et al. 2008) Distributed Sparse Matrix Both Limited MTGL (Berry et al. 2007) Shared Visitor Algorithms Unknown SNAP (Bader and Madduri 2008) Shared Various Both High Combinatorial BLAS Distributed Sparse Matrix Kernels High The Parallel Boost Graph Library (PBGL) by Gregor and Lumsdaine (2005) is a parallel library for distributed memo... |

11 | On the representation and multiplication of hypersparse matrices - Buluç, Gilbert - 2008 |

11 |
Parallel MATLAB for Multicore and Multinode
- Kepner
- 2009
Citation Context ...a parallel dense and sparse array library that uses a one-sided communication infrastructure portable to message-passing, NUMA, and shared-memory machines. Star-P (Shah and Gilbert 2004) and pMatlab (Kepner 2009) are parallel dialects of Matlab that run on distributed-memory message-passing machines; both include parallel sparse distributed array infrastructures. 3 Design Philosophy 3.1 Overall Design The f... |

10 | Improved external memory BFS implementations
- Ajwani, Meyer, et al.
- 2007
Citation Context ...mory. Thus, MTGL or SNAP will likely find limited use in commodity architectures without either distributed memory or out-of-core support. Experimental studies show that an out-of-core approach (Ajwani et al. 2007) is two orders of magnitude slower than an MTA-2 implementation for parallel breadth-first search (Bader and Madduri 2006b). Given that many graph algorithms, such as clustering and betweenness centr... |

8 | A static C++ object-oriented programming (SCOOP) paradigm mixing benefits of traditional OOP and generic programming
- Burrus, Duret-Lutz, et al.
- 2003
Citation Context ... all types of objects derive from their corresponding base classes. The base classes only serve to dictate the interface. This is achieved through static object oriented programming (OOP) techniques (Burrus et al. 2003) rather than expensive dynamic dispatch. A trick known as the Curiously Recurring Template Pattern (CRTP), a term coined by Coplien (1995), emulates dynamic dispatch statically, with some limitations... |

8 | Sparse matrices in Matlab*P: Design and implementation
- Shah, Gilbert
- 2004
Citation Context ...ox (GAPDT, later renamed KDT) (Gilbert et al. 2008) provides both combinatorial and numerical tools to manipulate large graphs interactively. KDT runs sequentially on Matlab or in parallel on Star-P (Shah and Gilbert 2004), a parallel dialect of Matlab. Although KDT focuses on algorithms, the underlying sparse matrix infrastructure also exposes linear algebraic kernels. KDT, like PBGL, targets distributed-memory mac... |

7 | Parallel sparse matrix-matrix multiplication and indexing: implementation and experiments
- Buluç, Gilbert
Citation Context ...se matrices (SpDistMat objects), on the other hand, have many possible representations, and the right representation depends on the particular setting or the application. We (Buluç and Gilbert 2008b; Buluç and Gilbert 2010) previously reported the problems associated with using the popular compressed sparse rows (CSR) or compressed sparse columns (CSC) representations in a 2D block decomposition. The triples format doe... |

7 |
A unified framework for numerical and combinatorial computing
- Gilbert, Reinhardt, et al.
Citation Context ...formance libraries and toolkits for parallel graph analysis Library/Toolkit Parallelism Abstraction Offering Scalability PBGL (Gregor and Lumsdaine 2005) Distributed Visitor Algorithms Limited GAPDT (Gilbert et al. 2008) Distributed Sparse Matrix Both Limited MTGL (Berry et al. 2007) Shared Visitor Algorithms Unknown SNAP (Bader and Madduri 2008) Shared Various Both High Combinatorial BLAS Distributed Sparse Matrix ... |

7 | Snap: small-world network analysis and partitioning
- Bader, Madduri
- 2010
Citation Context ... (Gregor and Lumsdaine 2005) Distributed Visitor Algorithms Limited GAPDT (Gilbert et al. 2008) Distributed Sparse Matrix Both Limited MTGL (Berry et al. 2007) Shared Visitor Algorithms Unknown SNAP (Bader and Madduri 2008) Shared Various Both High Pregel (Malewicz et al. 2010) Distributed Vertex-centric None Preliminary Combinatorial BLAS Distributed Sparse Matrix Kernels High The Parallel Boost Graph Library (PBGL) b... |

5 |
Sparse matrix storage formats
- Dongarra
Citation Context ... this regard, it is different from the SpDistMat, which distributes the storage of its sparse matrices. Almost all popular sparse matrix storage formats are internally composed of a number of arrays (Dongarra 2000; Saad 2003; Buluç et al. 2009), since arrays are cache friendlier than pointer-based data structures. Following this observation, the parallel classes handle object creation and communication through ... |

5 | A Space-Efficient Parallel Algorithm for Computing Betweenness Centrality in Distributed Memory - Edmonds, Hoefler, et al. |

4 |
Linear Algebraic Primitives for Parallel Computing on Large Graphs
- Buluç
- 2010
Citation Context ...ying step is also implemented as an SpGEMM operation. For the performance results presented in this section, we use a synchronous implementation of the Sparse SUMMA algorithm (Buluç and Gilbert 2008a; Buluç 2010), because it is the most portable SpGEMM implementation and relies only on simple MPI-1 features. The other Combinatorial BLAS primitives that are used for implementing the betweenness centrality alg... |

4 |
Analysis and performance results of computing betweenness centrality on IBM Cyclops64
- Tan, Sreedhar, et al.
- 2011
Citation Context ...nchmarks (formerly known as the HPCS Scalable Synthetic Compact Applications #2 (Bader et al.)) and various implementations on different platforms exist (Bader and Madduri 2006a; Madduri et al. 2009; Tan et al. 2009) for comparison. 5.2 BC Algorithm and Experimental Setup We compute betweenness centrality using the algorithm of Brandes (2001). It computes single source shortest paths from each node in the net... |

3 |
Implementing a portable multi-threaded graph library: The MTGL on Qthreads
- Barrett, Berry, et al.
- 2009
Citation Context ...) was originally designed for development of graph applications on massively multithreaded machines, namely Cray MTA-2 and XMT. It was later extended to run on mainstream shared-memory architectures (Barrett et al. 2009). MTGL is a significant step towards an extendible and generic parallel graph library. As of now, only preliminary performance results are published for MTGL. The Graph Algorithm and Pattern Discover... |

2 | Efficient Management of Parallelism in Object Oriented Numerical Software Libraries - McInnes - 1997 |

2 | New abstractions for data parallel programming
- Brodman, Fraguela, et al.
- 2009
Citation Context ...action of parallel computing by identifying the algorithmic commonalities across applications is becoming a widely accepted path to solution for the parallel software challenge (Asanovic et al. 2006; Brodman et al. 2009). Primitives both allow algorithm designers to think on a higher level of abstraction, and help to avoid duplication of implementation efforts. Primitives have been successfully used in the past to e... |

2 | Scal: Non-linearizable computing breaks the scalability barrier
- Kirsch, Payer, et al.
- 2010
Citation Context ...iority queues that can operate efficiently on distributed sparse matrices. One promising approach involves reformulating such algorithms so that they work on relaxed non-linearizable data structures (Kirsch et al. 2010) that expose more parallelism. Visitor-based search patterns of the Boost Graph Library (Siek et al. 2001) and its relatives (Gregor and Lumsdaine 2005; Berry et al. 2007) are powerful methods to exp... |

1 | Leiserson (2009). Parallel sparse matrix-vector and matrix-transpose-vector multiplication using compressed sparse blocks - Buluç, Fineman, et al. |