A faster parallel algorithm and efficient multithreaded implementations for evaluating betweenness centrality on massive datasets
, 2009
Cited by 11 (5 self)
We present a new lock-free parallel algorithm for computing betweenness centrality of massive complex networks that achieves better spatial locality than previous approaches. Betweenness centrality is a key kernel in analyzing the importance of vertices (or edges) in applications ranging from social networks to power grids to the influence of jazz musicians, and it is also incorporated into the DARPA HPCS SSCA#2, a benchmark extensively used to evaluate the performance of emerging high-performance computing architectures for graph analytics. We design an optimized implementation of betweenness centrality for the massively multithreaded Cray XMT system with the Threadstorm processor. For a small-world network of 268 million vertices and 2.147 billion edges, the 16-processor XMT system achieves a TEPS rate (an algorithmic performance count for the number of edges traversed per second) of 160 million per second, which corresponds to more than a 2× performance improvement over the previous parallel implementation. We demonstrate the applicability of our implementation to analyze massive real-world datasets by computing approximate betweenness centrality for the large IMDb movie-actor network.
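For readers unfamiliar with the kernel, the following is a minimal sequential sketch of betweenness centrality on an unweighted graph, using the well-known Brandes formulation (a BFS that counts shortest paths, followed by a reverse-order dependency accumulation). It is not the lock-free parallel algorithm described in the abstract; the function name and the adjacency-dictionary input format are assumptions made for illustration.

```python
from collections import deque


def betweenness_centrality(adj):
    """Sequential Brandes-style betweenness for an unweighted graph.

    adj: dict mapping each vertex to a list of neighbors (hypothetical
    input format; every neighbor must also appear as a key).
    Scores are unnormalized; on an undirected graph each pair of
    endpoints is counted twice.
    """
    bc = {v: 0.0 for v in adj}
    for s in adj:
        # Phase 1: BFS from s, counting shortest paths (sigma)
        # and recording shortest-path predecessors.
        dist = {s: 0}
        sigma = {v: 0 for v in adj}
        sigma[s] = 1
        preds = {v: [] for v in adj}
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Phase 2: accumulate dependencies in reverse BFS order.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                bc[w] += delta[w]
    return bc


# Path graph a - b - c: only the middle vertex lies between other vertices.
path = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
scores = betweenness_centrality(path)
```

Each source vertex triggers one full traversal, which is why throughput for this kernel is naturally reported in traversed edges per second.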
Multithreaded Asynchronous Graph Traversal for In-Memory and Semi-External Memory
Cited by 5 (0 self)
Processing large graphs is becoming increasingly important for many computational domains. Unfortunately, many algorithms and implementations do not scale with the demand for increasing graph sizes. As a result, researchers have attempted to meet the growing data demands using parallel and external memory techniques. Our work, targeted to chip multi-processors, takes a highly parallel asynchronous approach to hide the high data latency due to both poor locality and delays in the underlying graph data storage. We present a novel asynchronous approach to compute Breadth First Search (BFS), Single Source Shortest Path (SSSP), and Connected Components (CC) for large graphs in shared memory. We present an experimental study applying our technique to both In-Memory (IM) and Semi-External Memory (SEM) graphs utilizing multi-core processors and solid-state memory devices. Our experiments using both synthetic and real-world datasets show that our asynchronous approach is able to overcome data latencies and provide significant speedup over alternative approaches.
On the Representation and Multiplication of Hypersparse Matrices
, 2008
Cited by 4 (4 self)
Multicore processors are marking the beginning of a new era of computing where massive parallelism is available and necessary. Slightly slower but easy to parallelize kernels are becoming more valuable than sequentially faster kernels that are unscalable when parallelized. In this paper, we focus on the multiplication of sparse matrices (SpGEMM). We first present the issues with existing sparse matrix representations and multiplication algorithms that make them unscalable to thousands of processors. Then, we develop and analyze two new algorithms that overcome these limitations. We consider our algorithms first as the sequential kernel of a scalable parallel sparse matrix multiplication algorithm and second as part of a polyalgorithm for SpGEMM that would execute different kernels depending on the sparsity of the input matrices. Such a sequential kernel requires a new data structure that exploits the hypersparsity of the individual submatrices owned by a single processor after the 2D partitioning. We experimentally evaluate the performance and characteristics of our algorithms and show that they scale significantly better than existing kernels.
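To make the SpGEMM operation concrete, here is a minimal sketch of a row-by-row sparse matrix product that stores only nonzero entries and only nonempty rows, so memory depends on the number of nonzeros rather than on the matrix dimension. This is a generic textbook-style illustration, not either of the algorithms or the data structure developed in the paper; the nested-dictionary layout and the name `spgemm` are assumptions.

```python
def spgemm(a, b):
    """Multiply two sparse matrices stored as {row: {col: value}}.

    Only nonzero entries and nonempty rows are stored (hypothetical
    layout chosen for illustration), so a matrix with far fewer
    nonzeros than rows costs memory proportional to its nonzeros.
    """
    c = {}
    for i, a_row in a.items():
        acc = {}  # sparse accumulator for row i of the product
        for k, a_ik in a_row.items():
            # Row k of b contributes only if it is nonempty.
            for j, b_kj in b.get(k, {}).items():
                acc[j] = acc.get(j, 0) + a_ik * b_kj
        if acc:
            c[i] = acc
    return c


# Two matrices with only a handful of nonzeros each.
a = {0: {1: 2}, 5: {0: 3}}
b = {1: {4: 5}, 0: {0: 1, 2: 7}}
product = spgemm(a, b)
```

Rows of `a` that are absent from the dictionary are never visited, which is the property that matters when most rows of a submatrix are empty.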
Scalable Communication Protocols for Dynamic Sparse Data Exchange
Cited by 4 (3 self)
Many large-scale parallel programs follow a bulk synchronous parallel (BSP) structure with distinct computation and communication phases. Although the communication phase in such programs may involve all (or large numbers) of the participating processes, the actual communication operations are usually sparse in nature. As a result, communication phases are typically expressed explicitly using point-to-point communication operations or collective operations. We define the dynamic sparse data-exchange (DSDE) problem and derive bounds in the well-known LogGP model. While current approaches work well with static applications, they run into limitations as modern applications grow in scale, and as the problems that are being solved become increasingly irregular and dynamic. To enable the compact and efficient expression of the communication phase, we develop suitable sparse communication protocols for irregular applications at large scale. We discuss different irregular applications and show the sparsity in the communication for real-world input data. We discuss the time and memory complexity of commonly used protocols for the DSDE problem and develop NBX, a novel fast algorithm with constant memory overhead for solving it. Algorithm NBX improves the runtime of a sparse data exchange among 8,192 processors on BlueGene/P by a factor of 5.6. In an application study, we show improvements of up to a factor of 28.9 for a parallel breadth-first search on 8,192 BlueGene/P processors.
A High-Level Framework for Distributed Processing of Large-Scale Graphs
Cited by 3 (1 self)
Distributed processing of real-world graphs is challenging due to their size and the inherent irregular structure of graph computations. We present HIPG, a distributed framework that facilitates high-level programming of parallel graph algorithms by expressing them as a hierarchy of distributed computations executed independently and managed by the user. HIPG programs are in general short and elegant; they achieve good portability, memory utilization and performance.
Efficient Parallel Graph Exploration for Multi-Core CPU and GPU
- In IEEE PACT
, 2011
Cited by 3 (1 self)
Graphs are a fundamental data representation that has been used extensively in various domains. In graph-based applications, a systematic exploration of the graph such as a breadth-first search (BFS) often serves as a key component in the processing of their massive data sets. In this paper, we present a new method for implementing the parallel BFS algorithm on multi-core CPUs which exploits a fundamental property of randomly shaped real-world graph instances. By utilizing memory bandwidth more efficiently, our method shows improved performance over the current state-of-the-art implementation and increases its advantage as the size of the graph increases. We then propose a hybrid method which, for each level of the BFS algorithm, dynamically chooses the best implementation from: a sequential execution, two different methods of multi-core execution, and a GPU execution. Such a hybrid approach provides the best performance for each graph size while avoiding poor worst-case performance on high-diameter graphs. Finally, we study the effects of the underlying architecture on BFS performance by comparing multiple CPU and GPU systems; a high-end GPU system performed as well as a quad-socket high-end CPU system.
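The per-level structure that such a hybrid method relies on can be sketched as a level-synchronous BFS that inspects the frontier before expanding it. In the sketch below the selection rule (a frontier-size cutoff called `parallel_threshold`) and the two strategy labels are hypothetical placeholders; the abstract does not state the paper's actual selection criterion, and here every level is expanded by the same sequential loop, with the chosen label only recorded.

```python
def bfs_levels(adj, source, parallel_threshold=64):
    """Level-synchronous BFS that records a per-level strategy choice.

    adj: dict mapping each vertex to a list of neighbors.
    parallel_threshold: hypothetical frontier-size cutoff, used only
    to illustrate where a per-level decision would be made.
    Returns (level of each reached vertex, strategy label per level).
    """
    level = {source: 0}
    frontier = [source]
    strategy_log = []
    depth = 0
    while frontier:
        # Decision point: a real hybrid would dispatch to a different
        # implementation here; this sketch only records the label.
        if len(frontier) < parallel_threshold:
            strategy_log.append("sequential")
        else:
            strategy_log.append("parallel")
        next_frontier = []
        for v in frontier:
            for w in adj[v]:
                if w not in level:
                    level[w] = depth + 1
                    next_frontier.append(w)
        frontier = next_frontier
        depth += 1
    return level, strategy_log


# Diamond-shaped graph: 0 -> {1, 2} -> 3.
diamond = {0: [1, 2], 1: [3], 2: [3], 3: []}
levels, strategies = bfs_levels(diamond, 0, parallel_threshold=2)
```

Small frontiers, as found at the start of a traversal or throughout a high-diameter graph, fall on the sequential side of the cutoff, which is the situation in which parallel overhead would otherwise dominate.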
HipG: Parallel Processing of Large-Scale Graphs
Cited by 2 (0 self)
Distributed processing of real-world graphs is challenging due to their size and the inherent irregular structure of graph computations. We present HipG, a distributed framework that facilitates programming parallel graph algorithms by composing the parallel application automatically from the user-defined pieces of sequential work on graph nodes. To make the user code high-level, the framework provides a unified interface to executing methods on local and non-local graph nodes and an abstraction of exclusive execution. The graph computations are managed by logical objects called synchronizers, which we used, for example, to implement distributed divide-and-conquer decomposition into strongly connected components. The code written in HipG is independent of a particular graph representation, to the point that the graph can be created on-the-fly, i.e., by the algorithm that computes on this graph, which we used to implement a distributed model checker. HipG programs are in general short and elegant; they achieve good portability, memory utilization, and performance.
Highly Parallel Sparse Matrix-Matrix Multiplication
Cited by 1 (1 self)
Generalized sparse matrix-matrix multiplication is a key primitive for many high performance graph algorithms as well as some linear solvers such as multigrid. We present the first parallel algorithms that achieve increasing speedups for an unbounded number of processors. Our algorithms are based on two-dimensional block distribution of sparse matrices where serial sections use a novel hypersparse kernel for scalability. We give a state-of-the-art MPI implementation of one of our algorithms. Our experiments show scaling up to thousands of processors on a variety of test scenarios.
and Oracle Labs
The increasing importance of graph-data based applications is fueling the need for highly efficient and parallel implementations of graph analysis software. In this paper we describe Green-Marl, a domain-specific language (DSL) whose high-level language constructs allow developers to describe their graph analysis algorithms intuitively, but expose the data-level parallelism inherent in the algorithms. We also present our Green-Marl compiler, which translates a high-level algorithmic description written in Green-Marl into an efficient C++ implementation by exploiting this exposed data-level parallelism. Furthermore, our Green-Marl compiler applies a set of optimizations that take advantage of the high-level semantic knowledge encoded in the Green-Marl DSL. We demonstrate through some examples that graph analysis algorithms can be written very intuitively with Green-Marl, and our experimental results show that the compiler-generated implementation of such descriptions performs as well as or better than highly tuned hand-coded implementations.
An Early Evaluation of the Scalability of Graph Algorithms on the Intel MIC Architecture
Graph algorithms are notorious for not getting good speedup on parallel architectures. These algorithms tend to suffer from irregular dependencies and a high synchronization cost that prevent an efficient execution on distributed memory machines. Hence such algorithms are mostly parallelized on shared memory machines. However, current commodity shared memory machines do not typically offer enough parallelism to process these problems. In this paper, we present an early investigation of the scalability of such algorithms on Intel's upcoming Many Integrated Core (Intel MIC) architecture which, when it is released in 2012, is expected to provide more than 50 physical cores with SMT capability. The Intel MIC architecture can be programmed through many programming models; here we investigate the three most popular of these models, namely OpenMP, Cilk Plus and Intel's TBB. We present scalability results of a parallel graph coloring algorithm, three variations of a breadth-first search algorithm and a microbenchmark for irregular computations using these three programming models. Our results on a prototype board show that the multi-threaded architecture of Intel MIC can be effectively used for hiding latencies in irregular applications to achieve almost perfect speedup. Keywords: graph algorithm; unstructured irregular computation; scalability; multi-threaded architectures; graph coloring; breadth-first search.
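As background for the graph coloring kernel mentioned above, here is a minimal sequential sketch of greedy (first-fit) coloring, which assigns each vertex the smallest color not used by an already-colored neighbor. The abstract does not say which coloring algorithm the paper parallelizes, so this is only a common baseline given under that assumption; the function name and input format are hypothetical.

```python
def greedy_coloring(adj):
    """First-fit greedy coloring of an undirected graph.

    adj: dict mapping each vertex to a list of neighbors (hypothetical
    input format). Vertices are visited in dictionary order; colors
    are the integers 0, 1, 2, ...
    """
    color = {}
    for v in adj:
        # Colors already taken by neighbors that have been colored.
        used = {color[w] for w in adj[v] if w in color}
        c = 0
        while c in used:
            c += 1
        color[v] = c
    return color


# Triangle 0-1-2 with a pendant vertex 3 attached to vertex 2.
graph = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
coloring = greedy_coloring(graph)
```

The irregularity the abstract refers to is visible even here: the work per vertex depends on its degree, and each decision reads the colors of arbitrary neighbors.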

