Results 1-10 of 30
Communication-optimal parallel algorithm for Strassen’s matrix multiplication
 In Proceedings of the 24th ACM Symposium on Parallelism in Algorithms and Architectures, SPAA ’12
, 2012
Abstract

Cited by 32 (21 self)
Parallel matrix multiplication is one of the most studied fundamental problems in distributed and high-performance computing. We obtain a new parallel algorithm that is based on Strassen’s fast matrix multiplication and minimizes communication. The algorithm outperforms all known parallel matrix multiplication algorithms, classical and Strassen-based, both asymptotically and in practice. A critical bottleneck in parallelizing Strassen’s algorithm is the communication between the processors. Ballard, Demmel, Holtz, and Schwartz (SPAA ’11) prove lower bounds on these communication costs, using expansion properties of the underlying computation graph. Our algorithm matches these lower bounds, and so is communication-optimal. It exhibits perfect strong scaling within the maximum possible range.
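The abstract above builds on Strassen’s fast matrix multiplication. As a reminder of the arithmetic scheme involved, here is a minimal serial sketch of the 7-multiplication recursion (this illustrates only the classical Strassen recursion, not the paper’s communication-optimal parallel schedule; it assumes square matrices whose dimension is a power of 2):

```python
# Serial sketch of Strassen's recursion: 7 recursive multiplies per level
# instead of 8, giving O(n^log2(7)) arithmetic. Matrices are lists of lists;
# n is assumed to be a power of 2.

def add(A, B):
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def sub(A, B):
    return [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def strassen(A, B):
    n = len(A)
    if n == 1:
        return [[A[0][0] * B[0][0]]]
    h = n // 2
    # Split each matrix into four h x h quadrants.
    def quad(M, i, j):
        return [row[j*h:(j+1)*h] for row in M[i*h:(i+1)*h]]
    A11, A12, A21, A22 = quad(A, 0, 0), quad(A, 0, 1), quad(A, 1, 0), quad(A, 1, 1)
    B11, B12, B21, B22 = quad(B, 0, 0), quad(B, 0, 1), quad(B, 1, 0), quad(B, 1, 1)
    # The seven recursive products.
    M1 = strassen(add(A11, A22), add(B11, B22))
    M2 = strassen(add(A21, A22), B11)
    M3 = strassen(A11, sub(B12, B22))
    M4 = strassen(A22, sub(B21, B11))
    M5 = strassen(add(A11, A12), B22)
    M6 = strassen(sub(A21, A11), add(B11, B12))
    M7 = strassen(sub(A12, A22), add(B21, B22))
    # Recombine into the quadrants of C.
    C11 = add(sub(add(M1, M4), M5), M7)
    C12 = add(M3, M5)
    C21 = add(M2, M4)
    C22 = add(sub(add(M1, M3), M2), M6)
    top = [r1 + r2 for r1, r2 in zip(C11, C12)]
    bot = [r1 + r2 for r1, r2 in zip(C21, C22)]
    return top + bot
```

In the parallel setting, the communication cost of distributing these seven subproblems across processors is exactly the bottleneck the paper’s algorithm minimizes.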
Matrix multiplication on the OTIS-Mesh optoelectronic computer
 IEEE Trans. Computers
, 2001
Highly Parallel Sparse Matrix-Matrix Multiplication
, 2010
Abstract

Cited by 18 (5 self)
Generalized sparse matrix-matrix multiplication is a key primitive for many high-performance graph algorithms as well as some linear solvers such as multigrid. We present the first parallel algorithms that achieve increasing speedups for an unbounded number of processors. Our algorithms are based on two-dimensional block distribution of sparse matrices where serial sections use a novel hypersparse kernel for scalability. We give a state-of-the-art MPI implementation of one of our algorithms. Our experiments show scaling up to thousands of processors on a variety of test scenarios.
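The two-dimensional block distribution the abstract refers to can be sketched serially: both operands are partitioned into a grid of sparse blocks, and each output block accumulates a sum of sparse block products. The sketch below uses simple `{(i, j): value}` dicts and hypothetical helper names; the paper’s hypersparse kernels and MPI layout are far more refined.

```python
# Serial sketch of SpGEMM over a 2D block distribution. Matrices are stored
# as {(i, j): value} dicts, cut into a grid x grid layout of sparse blocks,
# and C_block(bi, bj) += A_block(bi, bk) * B_block(bk, bj) over bk.

from collections import defaultdict

def sp_block(entries, bi, bj, bs):
    """Extract block (bi, bj) of block size bs from a {(i, j): v} dict."""
    return {(i - bi*bs, j - bj*bs): v for (i, j), v in entries.items()
            if bi*bs <= i < (bi+1)*bs and bj*bs <= j < (bj+1)*bs}

def sp_multiply(A, B):
    """Sparse product of two {(i, j): v} dicts."""
    # Index B by row so matching inner indices are found without scanning.
    brows = defaultdict(list)
    for (k, j), v in B.items():
        brows[k].append((j, v))
    C = defaultdict(float)
    for (i, k), a in A.items():
        for j, b in brows.get(k, ()):
            C[(i, j)] += a * b
    return dict(C)

def spgemm_2d(A, B, n, grid):
    """C = A @ B with an n x n logical shape and a grid x grid block layout."""
    bs = n // grid
    C = defaultdict(float)
    for bi in range(grid):
        for bj in range(grid):
            for bk in range(grid):  # in the parallel version, the blocks of
                # this inner loop are what processors exchange.
                prod = sp_multiply(sp_block(A, bi, bk, bs),
                                   sp_block(B, bk, bj, bs))
                for (i, j), v in prod.items():
                    C[(bi*bs + i, bj*bs + j)] += v
    return {k: v for k, v in C.items() if v != 0}
```

In a distributed run, each processor owns one block row/column pair and only communicates blocks along its grid row and column, which is where the scalability claimed in the abstract comes from.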
Communication-Optimal Parallel Recursive Rectangular Matrix Multiplication
, 2012
Host-Assisted Zero-Copy Remote Memory Access Communication on InfiniBand
Abstract

Cited by 10 (5 self)
This paper describes how RMA can be implemented efficiently over InfiniBand. The capabilities not offered directly by the InfiniBand verb layer can be implemented efficiently using the novel host-assisted approach while achieving zero-copy communication and supporting an excellent overlap of computation with communication. For contiguous data we are able to achieve a small-message latency of 6 µs and a peak bandwidth of 830 MB/s for 'put', and a small-message latency of 12 µs and a peak bandwidth of 765 MB/s for 'get'. These numbers are almost as good as the performance of the native VAPI layer. For non-contiguous data, the host-assisted approach can deliver bandwidth close to that for contiguous data. We also demonstrate the superior tolerance of host-assisted data-transfer operations to CPU-intensive tasks, due to the minimal host involvement in our approach as compared to the traditional host-based approach. Our implementation also supports a very high degree of overlap of computation and communication: 99% overlap for contiguous data and up to 95% for non-contiguous data was achieved for large message sizes. The NAS MG and matrix multiplication benchmarks were used to validate the effectiveness of our approach, and demonstrated excellent overall performance.
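The overlap the abstract quantifies comes from nonblocking one-sided transfers: the host issues a 'put', keeps computing, and only later waits for completion. The toy sketch below (entirely hypothetical names, and a Python thread standing in for hardware RDMA) illustrates only the programming pattern, not the zero-copy mechanism itself:

```python
# Illustrative pattern of overlapping computation with a nonblocking
# RMA-style 'put'. In real host-assisted zero-copy RMA, the transfer runs
# on the NIC; here a background thread merely simulates that.
import threading

class NonblockingPut:
    def __init__(self, src, dst, offset):
        # Launch the "transfer" in the background and return immediately.
        self._t = threading.Thread(target=self._copy, args=(src, dst, offset))
        self._t.start()

    def _copy(self, src, dst, offset):
        dst[offset:offset + len(src)] = src  # stands in for the RDMA write

    def wait(self):
        self._t.join()  # like waiting on an RMA completion handle

remote = bytearray(16)                       # stand-in for remote memory
handle = NonblockingPut(b"zero-copy", remote, 0)
acc = sum(i * i for i in range(10000))       # computation overlapped with transfer
handle.wait()                                # completion before using the data
```

The degree of overlap reported in the paper is essentially how much of the transfer time can be hidden behind the computation between issue and wait.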
Communication-Avoiding Parallel Strassen: Implementation and Performance
Abstract

Cited by 8 (5 self)
Matrix multiplication is a fundamental kernel of many high-performance and scientific computing applications. Most parallel implementations use classical O(n³) matrix multiplication, even though there exist Strassen-like matrix multiplication algorithms with lower arithmetic complexity, because the classical ones perform better in practice. We recently obtained a new parallel algorithm based on Strassen’s fast matrix multiplication (SPAA ’12) that minimizes communication: it communicates asymptotically less than all classical and all previous Strassen-based algorithms, and it attains the corresponding lower bounds. It is also the first parallel Strassen-based algorithm that exhibits perfect strong scaling. In this paper, we show that the new algorithm is also faster in practice. We benchmark and compare the performance of our new algorithm to previous algorithms on Franklin (Cray XT4), Hopper (Cray XE6), and Intrepid (IBM BG/P). We demonstrate significant speedups over previous algorithms both for large matrices and for small matrices on large numbers of processors. We model and analyze the performance of the algorithm, and predict its performance on future exascale platforms.
Distributed Computational Electromagnetics Systems
Abstract

Cited by 7 (7 self)
We describe our development of a "real world" electromagnetic application on distributed computing systems. A computational electromagnetics (CEM) simulation for radar cross-section (RCS) modeling of full-scale airborne systems has been ported to three networked workstation cluster systems: an IBM RS/6000 cluster with Ethernet connection; a DEC Alpha farm connected by an FDDI-based Gigaswitch; and an ATM-connected SUN IPX testbed. We used the ScaLAPACK LU solver from Oak Ridge National Laboratory/University of Tennessee in our parallel implementation for solving the dense matrix which forms the computationally intensive kernel of this application, and we have adopted BLACS as the message-passing interface in all of our code development to achieve high portability across the three configurations. The performance data from this work are reported, together with timing data from other MPP systems on which we have implemented this application, including an Intel iPSC/860 and a CM-5, which we include for comparison.
Parallel Multidimensional Scaling Performance on Multicore Systems
 In Workshop on Advances in High-Performance E-Science Middleware and Applications, Proceedings of eScience 2008, Indianapolis, IN, December 7-12, 2008. http://grids.ucs.indiana.edu/ptliupages/publications/eScience 2008_bae3.pdf
Abstract

Cited by 6 (2 self)
Multidimensional scaling constructs a configuration of points in the target low-dimensional space such that the inter-point distances approximate the corresponding known dissimilarity values as closely as possible. The SMACOF algorithm is an elegant gradient-descent approach to solving the multidimensional scaling problem. We design a parallel SMACOF program using parallel matrix multiplication to run on a multicore machine. We also propose a block decomposition algorithm based on the number of threads for the purpose of keeping good load balance. The proposed block decomposition algorithm works very well if the number of row blocks is at least half the number of threads. In this paper, we investigate performance results of the implemented parallel SMACOF in terms of the block size, data size, and the number of threads. The speedup factor is almost 7.7 with 2048-point data over 8 running threads. In addition, a performance comparison between the jagged array and the two-dimensional array in the C# language is carried out. The jagged array data structure performs at least 40% better than the two-dimensional array structure.
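The SMACOF iteration the abstract parallelizes is the Guttman transform, whose dominant cost is a matrix multiply. A minimal serial sketch (pure Python, no parallelism or block decomposition, which are the paper’s contributions) is:

```python
# Serial sketch of SMACOF: repeat the Guttman transform X <- (1/n) * B(X) @ X,
# where B(X) is built from the target dissimilarities delta and the current
# pairwise distances. The final matrix multiply is what the paper
# parallelizes across threads with a block decomposition.
import math

def smacof(delta, X, iters=50):
    """delta: n x n dissimilarities; X: n x dim initial configuration."""
    n, dim = len(X), len(X[0])
    for _ in range(iters):
        # Pairwise distances of the current configuration.
        d = [[math.dist(X[i], X[j]) for j in range(n)] for i in range(n)]
        # B(X) matrix of the Guttman transform.
        B = [[0.0] * n for _ in range(n)]
        for i in range(n):
            for j in range(n):
                if i != j and d[i][j] > 0:
                    B[i][j] = -delta[i][j] / d[i][j]
        for i in range(n):
            B[i][i] = -sum(B[i][j] for j in range(n) if j != i)
        # X <- (1/n) * B @ X  -- the matrix multiply that dominates the cost.
        X = [[sum(B[i][k] * X[k][c] for k in range(n)) / n
              for c in range(dim)] for i in range(n)]
    return X
```

Each pass monotonically decreases the stress (the weighted squared error between configured distances and dissimilarities), which is what makes SMACOF a well-behaved majorization method rather than a plain gradient step.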