Results 1 - 10 of 14
Hardware/software vectorization for closeness centrality on multi-/many-core architectures
- In 28th International Parallel and Distributed Processing Symposium Workshops, Workshop on Multithreaded Architectures and Applications (MTAAP), 2014
"... Abstract—Centrality metrics have shown to be highly corre-lated with the importance and loads of the nodes in a network. Given the scale of today’s social networks, it is essential to use efficient algorithms and high performance computing techniques for their fast computation. In this work, we expl ..."
Abstract
-
Cited by 2 (0 self)
- Add to MetaCart
(Show Context)
Centrality metrics have been shown to be highly correlated with the importance and loads of the nodes in a network. Given the scale of today's social networks, it is essential to use efficient algorithms and high performance computing techniques for their fast computation. In this work, we exploit hardware and software vectorization in combination with fine-grain parallelization to compute closeness centrality values. The proposed vectorization approach enables concurrent breadth-first search operations and significantly increases performance. We compare different vectorization schemes and experimentally evaluate our contributions against existing parallel CPU-based solutions on cutting-edge hardware. Our implementations are 11 times faster than the state-of-the-art implementation for a graph with 234 million edges. The proposed techniques show how vectorization can be efficiently utilized to execute other graph kernels that require multiple traversals over a large-scale network on cutting-edge architectures. Keywords: centrality, closeness centrality, vectorization, breadth-first search, Intel Xeon Phi.
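The bit-parallel idea behind such vectorized concurrent BFS can be sketched without SIMD intrinsics: pack one BFS per bit, so a single sweep over the edges advances all searches at once. Below is a minimal Python illustration of that technique; the function and variable names are our own, and the paper's implementations map these masks onto vector registers rather than scalar integers.

```python
def closeness_bitparallel(adj, sources):
    """Run len(sources) BFSs concurrently: bit k of a vertex's mask says
    whether the BFS rooted at sources[k] has reached that vertex."""
    n = len(adj)
    visited = [0] * n                 # per-vertex mask of BFSs that arrived
    frontier = [0] * n                # per-vertex mask of active frontiers
    dist_sum = [0] * len(sources)     # sum of shortest-path distances per source
    for k, s in enumerate(sources):
        visited[s] |= 1 << k
        frontier[s] |= 1 << k
    level = 0
    while any(frontier):
        level += 1
        nxt = [0] * n
        for v in range(n):            # one sweep over the edges advances all BFSs
            if frontier[v]:
                for u in adj[v]:
                    nxt[u] |= frontier[v]
        for u in range(n):
            new = nxt[u] & ~visited[u]         # BFSs reaching u for the first time
            visited[u] |= new
            nxt[u] = new
            while new:                          # credit `level` to each arriving BFS
                k = (new & -new).bit_length() - 1
                dist_sum[k] += level
                new &= new - 1
        frontier = nxt
    return [(n - 1) / d if d else 0.0 for d in dist_sum]

adj = [[1], [0, 2], [1, 3], [2]]      # path graph 0-1-2-3
print(closeness_bitparallel(adj, [0, 1, 2, 3]))   # [0.5, 0.75, 0.75, 0.5]
```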
Sparse Matrix Multiplication on an Associative Processor
"... Abstract—Sparse matrix multiplication is an important component of linear algebra computations. Implementing sparse matrix multiplication on an associative processor (AP) enables high level of parallelism, where a row of one matrix is multiplied in parallel with the entire second matrix, and where t ..."
Abstract
-
Cited by 1 (1 self)
- Add to MetaCart
(Show Context)
Sparse matrix multiplication is an important component of linear algebra computations. Implementing sparse matrix multiplication on an associative processor (AP) enables a high level of parallelism, where a row of one matrix is multiplied in parallel with the entire second matrix, and where the AP execution time of a vector dot product does not depend on the vector size. Four sparse matrix multiplication algorithms, combining AP and CPU processing to varying degrees, are explored in this paper. They are evaluated by simulation on a large set of sparse matrices. The computational complexity of sparse matrix multiplication on the AP is shown to be O(M), where M is the number of nonzero elements. The AP is found to be especially efficient in binary sparse matrix multiplication, and it outperforms conventional solutions in power efficiency.
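The O(M) claim is easiest to appreciate against a plain CPU reference: Gustavson-style row-by-row multiplication does work proportional to the number of nonzero multiplications. The sketch below is our own baseline illustration (representation and names are assumptions), not the AP algorithms from the paper.

```python
def spgemm(A, B):
    """Reference sparse matrix product (Gustavson's algorithm).
    A, B: dicts mapping row index -> {col: value}. The total work is
    proportional to the nonzero multiplications performed, which is the
    quantity the paper's O(M) complexity result is stated in terms of."""
    C = {}
    for i, row in A.items():
        acc = {}
        for k, a_ik in row.items():            # for each nonzero A[i,k] ...
            for j, b_kj in B.get(k, {}).items():   # ... scan row k of B
                acc[j] = acc.get(j, 0) + a_ik * b_kj
        if acc:
            C[i] = acc
    return C

A = {0: {1: 2.0}, 1: {0: 3.0, 1: 1.0}}
B = {0: {0: 1.0}, 1: {0: 4.0, 1: 5.0}}
print(spgemm(A, B))   # {0: {0: 8.0, 1: 10.0}, 1: {0: 7.0, 1: 5.0}}
```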
Incremental closeness centrality in distributed memory
, 2015
"... a b s t r a c t Networks are commonly used to model traffic patterns, social interactions, or web pages. The vertices in a network do not possess the same characteristics: some vertices are naturally more connected and some vertices can be more important. Closeness centrality (CC) is a global metri ..."
Abstract
- Add to MetaCart
(Show Context)
Networks are commonly used to model traffic patterns, social interactions, or web pages. The vertices in a network do not possess the same characteristics: some vertices are naturally more connected, and some vertices can be more important. Closeness centrality (CC) is a global metric that quantifies how important a given vertex is in the network. When the network is dynamic and keeps changing, the relative importance of the vertices also changes. The cost of the best known algorithm to compute the CC scores makes it impractical to recompute them from scratch after each modification. In this paper, we propose STREAMER, a distributed-memory framework for incrementally maintaining the closeness centrality scores of a network upon changes. It leverages pipelined, replicated parallelism and SpMM-based BFSs, and it takes NUMA effects into account. It makes maintaining the closeness centrality values of real-life networks with millions of interactions significantly faster and obtains almost linear speedups on a cluster of 64 nodes with 8 threads per node.
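The SpMM-based BFS formulation named in the abstract replaces many queue-based traversals with linear algebra: each column of a sparse frontier matrix is one BFS, and a single sparse matrix-matrix product advances all of them by one level. The SciPy sketch below is our own rendering of that general formulation (A is assumed to be an n×n SciPy sparse adjacency matrix), not STREAMER's code.

```python
import numpy as np
import scipy.sparse as sp

def multi_source_bfs(A, sources):
    """Level-synchronous multi-source BFS via SpMM: column k of the
    frontier matrix F is the BFS rooted at sources[k], and one product
    A @ F expands every frontier simultaneously."""
    n, b = A.shape[0], len(sources)
    F = sp.csr_matrix((np.ones(b), (sources, np.arange(b))), shape=(n, b))
    levels = np.full((n, b), -1, dtype=np.int64)
    levels[np.asarray(sources), np.arange(b)] = 0
    d = 0
    while F.nnz:
        d += 1
        F = sp.csr_matrix((A @ F).multiply(levels < 0))  # expand, drop visited
        F.eliminate_zeros()
        F.data[:] = 1.0                                  # keep the pattern only
        r, c = F.nonzero()
        levels[r, c] = d
    return levels   # levels[v, k]: distance from sources[k] to v, -1 if unreached
```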
A Study of SpMV Implementation using MPI and OpenMP on Intel Many-Core Architecture
"... Abstract. The Sparse Matrix-Vector Multiplication is the key operation in many iterative methods. The widely used CSR (Compressed Sparse Row) ..."
Abstract
- Add to MetaCart
(Show Context)
Sparse matrix-vector multiplication is the key operation in many iterative methods. The widely used CSR (Compressed Sparse Row) ...
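For readers unfamiliar with CSR, the kernel in question is compact: three arrays encode the matrix, and each output row is an independent dot product, which is what makes an MPI/OpenMP decomposition over rows natural. The reference version below is our own sketch, not the paper's implementation.

```python
import numpy as np

def spmv_csr(values, col_idx, row_ptr, x):
    """y = A @ x with A in CSR: `values`/`col_idx` hold the nonzeros row
    by row, and row_ptr[i]:row_ptr[i+1] delimits row i. Rows are fully
    independent, so this loop is the natural unit for thread parallelism."""
    n = len(row_ptr) - 1
    y = np.zeros(n)
    for i in range(n):
        start, end = row_ptr[i], row_ptr[i + 1]
        y[i] = np.dot(values[start:end], x[col_idx[start:end]])
    return y

# A = [[4, 0, 1], [0, 2, 0], [3, 0, 5]] in CSR form
values  = np.array([4.0, 1.0, 2.0, 3.0, 5.0])
col_idx = np.array([0, 2, 1, 0, 2])
row_ptr = np.array([0, 2, 3, 5])
print(spmv_csr(values, col_idx, row_ptr, np.ones(3)))  # [5. 2. 8.]
```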
Porting FEASTFLOW to the Intel Xeon Phi: Lessons Learned
- Partnership for Advanced Computing in Europe (PRACE)
"... In this paper we report our experiences in porting the FEASTFLOW software infrastructure to the Intel Xeon Phi coprocessor. Our efforts involved both the evaluation of programming models including OpenCL, POSIX threads and OpenMP and typical optimization strategies like parallelization and vectoriza ..."
Abstract
- Add to MetaCart
(Show Context)
In this paper we report our experiences in porting the FEASTFLOW software infrastructure to the Intel Xeon Phi coprocessor. Our efforts involved both the evaluation of programming models, including OpenCL, POSIX threads, and OpenMP, and typical optimization strategies such as parallelization and vectorization. Since the straightforward porting of the existing OpenCL version of the code ran into performance problems that require further analysis, we focused our efforts on the implementation and optimization of two core building-block kernels for FEASTFLOW: an axpy vector operation and a sparse matrix-vector multiplication (spmv). Our experimental results on these building blocks indicate that the Xeon Phi can serve as a promising accelerator for our software infrastructure.
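As a point of reference for the first of those kernels: axpy is the BLAS level-1 update y ← a·x + y, a purely streaming, memory-bound operation, which is what makes it a useful probe of an accelerator's bandwidth. The one-liner below is our own NumPy sketch, not the FEASTFLOW port.

```python
import numpy as np

def axpy(a, x, y):
    """y <- a*x + y, the BLAS level-1 kernel named in the abstract.
    NumPy vectorizes the update internally, loosely analogous to the
    wide SIMD lanes an accelerator port would target."""
    y += a * x
    return y

x = np.arange(4, dtype=np.float64)     # [0, 1, 2, 3]
y = np.ones(4)
print(axpy(2.0, x, y))                 # [1. 3. 5. 7.]
```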
Evaluating the capabilities of the Xeon Phi
"... (will be inserted by the editor) ..."
(Show Context)
Heterogeneous computing architecture for fast detection of SNP-SNP interactions
Regularizing Graph Centrality Computations
, 2014
"... Centrality metrics such as betweenness and closeness have been used to identify important nodes in a network. However, it takes days to months on a high-end workstation to com-pute the centrality of today’s networks. The main reasons are the size and the irregular structure of these networks. While ..."
Abstract
- Add to MetaCart
(Show Context)
Centrality metrics such as betweenness and closeness have been used to identify important nodes in a network. However, it takes days to months on a high-end workstation to compute the centrality of today's networks. The main reasons are the size and the irregular structure of these networks. While today's computing units excel at processing dense and regular data, their performance is questionable when the data is sparse. In this work, we show how centrality computations can be regularized to reach higher performance. For betweenness centrality, we deviate from the traditional fine-grain approach by allowing a GPU to execute multiple BFSs at the same time. Furthermore, we exploit hardware and software vectorization to compute closeness centrality values on CPUs, GPUs, and the Intel Xeon Phi. Experiments show that, only by reengineering the algorithms and without using additional hardware, the proposed techniques can speed up centrality computations significantly: by a factor of 5.9 on CPU architectures, 70.4 on GPU architectures, and 21.0 on the Intel Xeon Phi.
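The coarse-grain approach described for betweenness assigns each concurrent unit a whole source-rooted pass of Brandes' algorithm rather than splitting a single BFS across threads. The scalar Python sketch below shows one such pass for an undirected, unweighted graph; it is our own illustration of the standard algorithm, while the paper's GPU kernels batch many of these passes.

```python
from collections import deque

def brandes_pass(adj, s):
    """One source-rooted pass of Brandes' betweenness algorithm:
    forward BFS counts shortest paths, a backward sweep accumulates
    dependencies. Summing the returned delta over all sources s (and
    excluding each source's own entry) yields betweenness centrality."""
    n = len(adj)
    dist = [-1] * n
    sigma = [0] * n
    delta = [0.0] * n
    dist[s], sigma[s] = 0, 1
    order, q = [], deque([s])
    while q:                                  # forward BFS phase
        v = q.popleft()
        order.append(v)
        for w in adj[v]:
            if dist[w] < 0:
                dist[w] = dist[v] + 1
                q.append(w)
            if dist[w] == dist[v] + 1:        # v is a predecessor of w
                sigma[w] += sigma[v]
    for w in reversed(order):                 # backward dependency sweep
        for v in adj[w]:
            if dist[v] == dist[w] - 1:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
    return delta
```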
A Unified Sparse Matrix Data Format for Efficient General Sparse Matrix-Vector Multiply on Modern Processors with Wide SIMD Units
"... Abstract. Sparse matrix-vector multiplication (spMVM) is the most time-consuming kernel in many numerical algorithms and has been studied extensively on all modern processor and accelerator architectures. However, the optimal sparse matrix data storage format is highly hardware-specific, which could ..."
Abstract
- Add to MetaCart
(Show Context)
Sparse matrix-vector multiplication (spMVM) is the most time-consuming kernel in many numerical algorithms and has been studied extensively on all modern processor and accelerator architectures. However, the optimal sparse matrix data storage format is highly hardware-specific, which could become an obstacle when using heterogeneous systems. Also, it is as yet unclear how the wide single instruction multiple data (SIMD) units in current multi- and many-core processors should be used most efficiently if there is no structure in the sparsity pattern of the matrix. We suggest SELL-C-σ, a variant of Sliced ELLPACK, as a SIMD-friendly data format which combines long-standing ideas from general-purpose graphics processing units (GPGPUs) and vector computer programming. We discuss the advantages of SELL-C-σ compared to established formats like Compressed Row Storage (CRS) and ELLPACK and show its suitability on a variety of hardware platforms (Intel Sandy Bridge, Intel Xeon Phi, and Nvidia Tesla K20) for a wide range of test matrices from different application areas. Using appropriate performance models, we develop deep insight into the data transfer properties of the SELL-C-σ spMVM kernel. SELL-C-σ comes with two tuning parameters whose performance impact across the range of test matrices is studied and for which reasonable choices are proposed. This leads to a hardware-independent ("catch-all") sparse matrix format which achieves very high efficiency for all test matrices across all hardware platforms.
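To make the format concrete, here is a small Python sketch of how a matrix can be packed into SELL-C-σ using the two tuning parameters the abstract mentions: C, the chunk height (matched to the SIMD width), and σ, the sorting window that limits padding. The input representation and names below are our own assumptions for illustration.

```python
import numpy as np

def to_sell_c_sigma(rows, C=4, sigma=16):
    """Pack a sparse matrix, given as a list of per-row [(col, val), ...]
    entries, into SELL-C-sigma: rows are sorted by length (descending)
    inside windows of `sigma` rows, then grouped into chunks of C rows,
    each padded to its longest row and stored column-major so that one
    SIMD lane handles one row."""
    n = len(rows)
    perm = np.arange(n)
    for w in range(0, n, sigma):              # sort within each sigma-window
        sl = slice(w, min(w + sigma, n))
        perm[sl] = perm[sl][np.argsort([-len(rows[i]) for i in perm[sl]])]
    chunks = []
    for c in range(0, n, C):
        ids = perm[c:c + C]
        width = max(len(rows[i]) for i in ids)        # chunk padding width
        vals = np.zeros((width, len(ids)))            # column-major chunk storage
        cols = np.zeros((width, len(ids)), dtype=np.int64)
        for lane, i in enumerate(ids):
            for j, (col, val) in enumerate(rows[i]):
                vals[j, lane], cols[j, lane] = val, col
        chunks.append((ids, vals, cols))
    return chunks

rows = [[(0, 1.0)], [(0, 2.0), (1, 3.0)], [(2, 4.0)], [(1, 5.0), (2, 6.0), (3, 7.0)]]
chunks = to_sell_c_sigma(rows, C=2, sigma=4)   # 2 chunks, widths 3 and 1
```

Sorting only within σ-row windows, rather than globally, is what keeps the row permutation local and cache-friendly while still shrinking the zero padding that a plain Sliced ELLPACK chunk would incur.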