Fast Parallel Algorithms for Short-Range Molecular Dynamics
- Journal of Computational Physics
, 1995
Cited by 128 (6 self)
Three parallel algorithms for classical molecular dynamics are presented. The first assigns each processor a fixed subset of atoms; the second assigns each a fixed subset of inter-atomic forces to compute; the third assigns each a fixed spatial region. The algorithms are suitable for molecular dynamics models which can be difficult to parallelize efficiently: those with short-range forces where the neighbors of each atom change rapidly. They can be implemented on any distributed-memory parallel machine which allows for message-passing of data between independently executing processors. The algorithms are tested on a standard Lennard-Jones benchmark problem for system sizes ranging from 500 to 100,000,000 atoms on several parallel supercomputers: the nCUBE 2, Intel iPSC/860 and Paragon, and Cray T3D. Comparing the results to the fastest reported vectorized Cray Y-MP and C90 algorithm shows that the current generation of parallel machines is competitive with conventi...
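As a rough illustration of two of the three assignments named in the abstract (fixed subsets of atoms and fixed spatial regions), here is a minimal Python sketch. The function names, the contiguous-block rule, and the cubic box split into equal sub-boxes are assumptions made for illustration; they are not taken from the paper.

```python
def atom_decomposition(n_atoms, n_procs):
    """Assign each processor a fixed, contiguous block of atom indices.

    Returns a list where owners[a] is the rank that owns atom a.
    """
    base, extra = divmod(n_atoms, n_procs)
    owners = []
    for rank in range(n_procs):
        # The first `extra` ranks take one additional atom.
        count = base + (1 if rank < extra else 0)
        owners.extend([rank] * count)
    return owners


def spatial_decomposition(positions, box_length, procs_per_dim):
    """Assign each atom to the processor owning the cubic sub-box it lies in.

    Assumes a cubic box [0, box_length)^3 split into procs_per_dim^3 cells.
    """
    cell = box_length / procs_per_dim
    owners = []
    for x, y, z in positions:
        # Clamp so an atom exactly on the upper face stays inside the grid.
        ix = min(int(x / cell), procs_per_dim - 1)
        iy = min(int(y / cell), procs_per_dim - 1)
        iz = min(int(z / cell), procs_per_dim - 1)
        owners.append((ix * procs_per_dim + iy) * procs_per_dim + iz)
    return owners
```

In the first scheme ownership never changes as atoms move; in the second, an atom's owner would have to be recomputed whenever it crosses a sub-box boundary.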
Hypergraph-Partitioning Based Decomposition for Parallel Sparse-Matrix Vector Multiplication
- IEEE Trans. on Parallel and Distributed Systems
Cited by 49 (26 self)
In this work, we show that the standard graph-partitioning based decomposition of sparse matrices does not reflect the actual communication volume requirement for parallel matrix-vector multiplication. We propose two computational hypergraph models which avoid this crucial deficiency of the graph model. The proposed models reduce the decomposition problem to the well-known hypergraph partitioning problem. The recently proposed successful multilevel framework is exploited to develop a multilevel hypergraph partitioning tool PaToH for the experimental verification of our proposed hypergraph models. Experimental results on a wide range of realistic sparse test matrices confirm the validity of the proposed hypergraph models. In the decomposition of the test matrices, the hypergraph models using PaToH and hMeTiS result in up to 63% less communication volume (30%-38% less on average) than the graph model using MeTiS, while PaToH is only 1.3-2.3 times slower than MeTiS on average. ...
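The gap between a graph cut and the true communication volume can be seen on a tiny example. The Python sketch below assumes a row-wise decomposition of y = Ax in which the owner of row i also owns x[i]. The two function names are hypothetical, and `graph_cut_estimate` is a simplified stand-in for a graph-model cut metric, not the exact formulation used in the paper.

```python
def communication_volume(nonzeros, part):
    """Words of x actually sent for y = A x under a row-wise partition.

    nonzeros: iterable of (i, j) index pairs with a_ij != 0.
    part[i]:  processor owning row i and vector entry x[i].
    x[j] is sent once to every *other* processor that has a nonzero in column j.
    """
    parts_needing = {}
    for i, j in nonzeros:
        parts_needing.setdefault(j, set()).add(part[i])
    return sum(len(parts | {part[j]}) - 1 for j, parts in parts_needing.items())


def graph_cut_estimate(nonzeros, part):
    """Simplified graph-style estimate: one word per cut off-diagonal nonzero."""
    return sum(1 for i, j in nonzeros if i != j and part[i] != part[j])


# 4 x 4 symmetric pattern: full diagonal plus couplings between row 0 and rows 2, 3.
pattern = [(0, 0), (1, 1), (2, 2), (3, 3), (2, 0), (0, 2), (3, 0), (0, 3)]
partition = [0, 0, 1, 1]  # rows 0-1 on processor 0, rows 2-3 on processor 1

true_volume = communication_volume(pattern, partition)
cut_estimate = graph_cut_estimate(pattern, partition)
```

Here x[0] is needed by rows 2 and 3, which live on the same processor, so it is sent once; the cut-based estimate counts it twice. Counting distinct receiving processors per column is the kind of quantity a hypergraph net can capture.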
A Two-Dimensional Data Distribution Method For Parallel Sparse Matrix-Vector Multiplication
- SIAM REVIEW
Cited by 37 (3 self)
A new method is presented for distributing data in sparse matrix-vector multiplication. The method is two-dimensional, tries to minimise the true communication volume, and also tries to spread the computation and communication work evenly over the processors. The method starts with a recursive bipartitioning of the sparse matrix, each time splitting a rectangular matrix into two parts with a nearly equal number of nonzeros. The communication volume caused by the split is minimised. After the matrix partitioning, the input and output vectors are partitioned with the objective of minimising the maximum communication volume per processor. Experimental results of our implementation, Mondriaan, for a set of sparse test matrices show a reduction in communication compared to one-dimensional methods, and in general a good balance in the communication work.
An Efficient Parallel Algorithm for Matrix-Vector Multiplication
- International Journal of High Speed Computing
, 1995
Cited by 32 (4 self)
The multiplication of a vector by a matrix is the kernel operation in many algorithms used in scientific computation. A fast and efficient parallel algorithm for this calculation is therefore desirable. This paper describes a parallel matrix-vector multiplication algorithm which is particularly well suited to dense matrices or matrices with an irregular sparsity pattern. Such matrices can arise from discretizing partial differential equations on irregular grids or from problems exhibiting nearly random connectivity between data structures. The communication cost of the algorithm is independent of the matrix sparsity pattern and is shown to scale as $O(n/\sqrt{p} + \log p)$ for an $n \times n$ matrix on $p$ processors. The algorithm's performance is demonstrated by using it within the well-known NAS conjugate gradient benchmark. This resulted in the fastest run times achieved to date on both the 1024-node nCUBE 2 and the 128-node Intel iPSC/860. Additional improvements to the algorithm whic...
A New Parallel Method for Molecular Dynamics Simulation of Macromolecular Systems
, 1994
Cited by 29 (3 self)
Short-range molecular dynamics simulations of molecular systems are commonly parallelized by replicated-data methods, where each processor stores a copy of all atom positions. This enables computation of bonded 2-, 3-, and 4-body forces within the molecular topology to be partitioned among processors straightforwardly. A drawback to such methods is that the inter-processor communication scales as N, the number of atoms, independent of P, the number of processors. Thus, their parallel efficiency falls off rapidly when large numbers of processors are used. In this article a new parallel method for simulating macromolecular or small-molecule systems is presented, called force-decomposition. Its memory and communication costs scale as $N/\sqrt{P}$, allowing larger problems to be run faster on greater numbers of processors. Like replicated-data techniques, and in contrast to spatial-decomposition approaches, the new method can be simply load-balanced and performs well eve...
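One way to see where an $N/\sqrt{P}$ cost can come from is to split the N x N matrix of pairwise forces into a $\sqrt{P} \times \sqrt{P}$ grid of blocks, one per processor. The Python sketch below assumes that P is a perfect square and that $\sqrt{P}$ divides N; the function names and the row-major rank layout are illustrative assumptions, and the paper's actual layout may differ.

```python
import math


def force_block(rank, n_atoms, n_procs):
    """Return (row_atoms, col_atoms) index ranges for the block owned by `rank`.

    Assumes n_procs is a perfect square and sqrt(n_procs) divides n_atoms.
    """
    side = math.isqrt(n_procs)
    if side * side != n_procs or n_atoms % side != 0:
        raise ValueError("sketch requires square P and sqrt(P) dividing N")
    block = n_atoms // side
    row, col = divmod(rank, side)  # row-major placement on the processor grid
    return (range(row * block, (row + 1) * block),
            range(col * block, (col + 1) * block))


def positions_held(rank, n_atoms, n_procs):
    """Number of distinct atom positions this rank needs for its block."""
    rows, cols = force_block(rank, n_atoms, n_procs)
    return len(set(rows) | set(cols))
```

With N = 16 and P = 16, an off-diagonal rank needs 8 positions (2N/sqrt(P)), whereas a replicated-data scheme would store all 16 on every processor.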
Implementation of NAS Parallel Benchmarks in High Performance Fortran
Cited by 25 (4 self)
We present an HPF implementation of BT, SP, LU, FT, CG and MG of the NPB2.3-serial benchmark set. The implementation is based on an HPF performance model of the benchmark-specific primitive operations with distributed arrays. We present profiling and performance data on the SGI Origin 2000 and compare the results with NPB2.3. We discuss advantages and limitations of HPF and the pghpf compiler.
A Scalable Distributed Parallel Breadth-First Search Algorithm on BlueGene/L
- In SC ’05: Proceedings of the 2005 ACM/IEEE conference on Supercomputing
, 2005
Cited by 21 (1 self)
Many emerging large-scale data science applications require searching large graphs distributed across multiple memories and processors. This paper presents a distributed breadth-first search (BFS) scheme that scales for random graphs with up to three billion vertices and 30 billion edges. Scalability was tested on IBM BlueGene/L with 32,768 nodes at the Lawrence Livermore National Laboratory. Scalability was obtained through a series of optimizations, in particular, those that ensure scalable use of memory. We use 2D (edge) partitioning of the graph instead of conventional 1D (vertex) partitioning to reduce communication overhead. For Poisson random graphs, we show that the expected size of the messages is scalable for both 2D and 1D partitionings. Finally, we have developed efficient collective communication functions for the 3D torus architecture of BlueGene/L that also take advantage of the structure in the problem. The performance and characteristics of the algorithm are measured and reported.
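The difference between 1D (vertex) and 2D (edge) partitioning can be sketched as two ownership rules on the adjacency matrix. The Python below is a simplified block layout written for illustration; the function names are hypothetical and the paper's exact mapping onto BlueGene/L is not reproduced here.

```python
def _ceil_div(a, b):
    """Integer ceiling of a / b for positive integers."""
    return -(-a // b)


def owner_1d(u, n_vertices, n_procs):
    """1D: the owner of vertex u stores every edge (u, v) leaving u."""
    return u // _ceil_div(n_vertices, n_procs)


def owner_2d(u, v, n_vertices, rows, cols):
    """2D: edge (u, v) goes to the mesh cell (block row of u, block column of v)."""
    return (u // _ceil_div(n_vertices, rows), v // _ceil_div(n_vertices, cols))


def partners_1d(n_procs):
    """Worst case: a processor may exchange frontier data with all others."""
    return n_procs - 1


def partners_2d(rows, cols):
    """Exchanges stay within the processor's own mesh row and mesh column."""
    return (rows - 1) + (cols - 1)
```

Under these assumptions a 2 x 2 mesh gives each processor 2 communication partners instead of 3, and the gap widens as the mesh grows.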
Partitioning Rectangular And Structurally Nonsymmetric Sparse Matrices For Parallel Processing
- SIAM J. Sci. Comput
, 1998
Cited by 15 (0 self)
A common operation in scientific computing is the multiplication of a sparse, rectangular or structurally nonsymmetric matrix and a vector. In many applications the matrix-transpose-vector product is also required. This paper addresses the efficient parallelization of these operations. We show that the problem can be expressed in terms of partitioning bipartite graphs. We then introduce several algorithms for this partitioning problem and compare their performance on a set of test matrices. Key words. matrix partitioning, iterative method, parallel computing, rectangular matrix, structurally nonsymmetric matrix, bipartite graph AMS subject classifications. 05C50, 65F10, 65F50, 65Y05 1. Introduction. Matrix-vector and matrix-transpose-vector products that repeatedly involve the same large, sparse, structurally nonsymmetric or rectangular matrix arise in many iterative algorithms. Examples include algorithms for solving linear systems, least squares problems, and linear programs. To e...
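A bipartite graph is a natural fit for a rectangular matrix because rows and columns become two separate vertex sets. The Python sketch below builds such a graph from a list of nonzero positions; the vertex labels ('r', i) and ('c', j) and the function name are assumptions for illustration, and the partitioning algorithms from the paper are not shown.

```python
def bipartite_graph(nonzeros):
    """Build an adjacency map with one vertex per row and one per column.

    nonzeros: iterable of (i, j) pairs with a_ij != 0.
    Row i becomes ('r', i), column j becomes ('c', j), and each nonzero
    contributes one undirected edge between them.
    """
    adjacency = {}
    for i, j in nonzeros:
        adjacency.setdefault(("r", i), set()).add(("c", j))
        adjacency.setdefault(("c", j), set()).add(("r", i))
    return adjacency


# 2 x 3 rectangular pattern with nonzeros at (0,0), (0,2), (1,2).
graph = bipartite_graph([(0, 0), (0, 2), (1, 2)])
```

Because the edges are undirected, the same structure describes the data dependencies of both the product with the matrix and the product with its transpose.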
Partitioning rectangular and structurally unsymmetric sparse matrices for parallel processing
- SIAM J. Sci. Comput
Cited by 9 (0 self)
A common operation in scientific computing is the multiplication of a sparse, rectangular, or structurally unsymmetric matrix and a vector. In many applications the matrix-transpose-vector product is also required. This paper addresses the efficient parallelization of these operations. We show that the problem can be expressed in terms of partitioning bipartite graphs. We then introduce several algorithms for this partitioning problem and compare their performance on a set of test matrices.

