Matrix Multiplication on the OTIS-Mesh Optoelectronic Computer
 In Proceedings of the Sixth International Conference on Massively Parallel Processing using Optical Interconnections (MPPOI’99)
, 2001
Cited by 23 (2 self)
We develop algorithms to multiply two vectors, a vector and a matrix, and two matrices on an OTIS-Mesh optoelectronic computer. Two mappings, group row and group submesh [25], of a matrix onto an OTIS-Mesh are considered and the relative merits of each compared. We show that our algorithms to multiply a column and row vector use an optimal number of data moves for both the group row and group submesh mappings; our algorithm to multiply a row vector and a column vector is optimal for the group row mapping; and our algorithm to multiply a matrix by a column vector is optimal for the group row mapping.
Communication-optimal parallel algorithm for Strassen’s matrix multiplication
 In Proceedings of the 24th ACM Symposium on Parallelism in Algorithms and Architectures, SPAA ’12
, 2012
Cited by 16 (13 self)
Parallel matrix multiplication is one of the most studied fundamental problems in distributed and high performance computing. We obtain a new parallel algorithm that is based on Strassen’s fast matrix multiplication and minimizes communication. The algorithm outperforms all known parallel matrix multiplication algorithms, classical and Strassen-based, both asymptotically and in practice. A critical bottleneck in parallelizing Strassen’s algorithm is the communication between the processors. Ballard, Demmel, Holtz, and Schwartz (SPAA’11) prove lower bounds on these communication costs, using expansion properties of the underlying computation graph. Our algorithm matches these lower bounds, and so is communication-optimal. It exhibits perfect strong scaling within the maximum possible range.
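For readers unfamiliar with the building block named in this abstract, the following is a minimal sequential sketch of Strassen's recursion, which replaces the eight half-size products of the classical block algorithm with seven. It is not the parallel, communication-optimal algorithm of the paper. The helper names (`mat_add`, `mat_sub`, `split`, `strassen`) are illustrative, and the sketch assumes square matrices whose size is a power of two.

```python
def mat_add(a, b):
    """Element-wise sum of two equally sized matrices (lists of rows)."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def mat_sub(a, b):
    """Element-wise difference of two equally sized matrices."""
    return [[x - y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def split(m):
    """Split a square matrix of even size into four quadrants."""
    h = len(m) // 2
    m11 = [row[:h] for row in m[:h]]
    m12 = [row[h:] for row in m[:h]]
    m21 = [row[:h] for row in m[h:]]
    m22 = [row[h:] for row in m[h:]]
    return m11, m12, m21, m22


def strassen(a, b):
    """Multiply two n x n matrices, n a power of two (assumption of this sketch)."""
    n = len(a)
    if n == 1:
        return [[a[0][0] * b[0][0]]]

    a11, a12, a21, a22 = split(a)
    b11, b12, b21, b22 = split(b)

    # Seven recursive products instead of the classical eight.
    m1 = strassen(mat_add(a11, a22), mat_add(b11, b22))
    m2 = strassen(mat_add(a21, a22), b11)
    m3 = strassen(a11, mat_sub(b12, b22))
    m4 = strassen(a22, mat_sub(b21, b11))
    m5 = strassen(mat_add(a11, a12), b22)
    m6 = strassen(mat_sub(a21, a11), mat_add(b11, b12))
    m7 = strassen(mat_sub(a12, a22), mat_add(b21, b22))

    # Combine the products into the four quadrants of the result.
    c11 = mat_add(mat_sub(mat_add(m1, m4), m5), m7)
    c12 = mat_add(m3, m5)
    c21 = mat_add(m2, m4)
    c22 = mat_add(mat_sub(mat_add(m1, m3), m2), m6)

    top = [r1 + r2 for r1, r2 in zip(c11, c12)]
    bottom = [r1 + r2 for r1, r2 in zip(c21, c22)]
    return top + bottom
```

A practical implementation would switch to the classical algorithm below some cutoff size rather than recursing down to 1 x 1 blocks; that cutoff is omitted here for brevity.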
Host-Assisted Zero-Copy Remote Memory Access Communication on InfiniBand
Cited by 9 (4 self)
This paper describes how RMA can be implemented efficiently over InfiniBand. The capabilities not offered directly by the InfiniBand verb layer can be implemented efficiently using the novel host-assisted approach while achieving zero-copy communication and supporting an excellent overlap of computation with communication. For contiguous data we are able to achieve a small message latency of 6 µs and a peak bandwidth of 830 MB/s for 'put' and a small message latency of 12 µs and a peak bandwidth of 765 MB/s for 'get'. These numbers are almost as good as the performance of the native VAPI layer. For noncontiguous data, the host-assisted approach can deliver bandwidth close to that for contiguous data. We also demonstrate the superior tolerance of host-assisted data-transfer operations to CPU-intensive tasks due to minimum host involvement in our approach as compared to the traditional host-based approach. Our implementation also supports a very high degree of overlap of computation and communication: 99% overlap for contiguous data and up to 95% for noncontiguous data were achieved for large message sizes. The NAS MG and matrix multiplication benchmarks were used to validate the effectiveness of our approach, and demonstrated excellent overall performance.
Distributed Computational Electromagnetics Systems
Cited by 7 (7 self)
We describe our development of a "real world" electromagnetic application on distributed computing systems. A computational electromagnetics (CEM) simulation for radar cross-section (RCS) modeling of full-scale airborne systems has been ported to three networked workstation cluster systems: an IBM RS/6000 cluster with Ethernet connection; a DEC Alpha farm connected by an FDDI-based Gigaswitch; and an ATM-connected SUN IPXs testbed. We used the ScaLAPACK LU solver from Oak Ridge National Laboratory/University of Tennessee in our parallel implementation for solving the dense matrix which forms the computationally intensive kernel of this application, and we have adopted BLACS as the message passing interface in all of our code development to achieve high portability across the three configurations. The performance data from this work is reported, together with timing data from other MPP systems on which we have implemented this application including an Intel iPSC/860 and a CM-5, and which we include for comparison.
Highly Parallel Sparse Matrix-Matrix Multiplication
, 2010
Cited by 6 (3 self)
Generalized sparse matrix-matrix multiplication is a key primitive for many high performance graph algorithms as well as some linear solvers such as multigrid. We present the first parallel algorithms that achieve increasing speedups for an unbounded number of processors. Our algorithms are based on two-dimensional block distribution of sparse matrices where serial sections use a novel hypersparse kernel for scalability. We give a state-of-the-art MPI implementation of one of our algorithms. Our experiments show scaling up to thousands of processors on a variety of test scenarios.
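To illustrate what a two-dimensional block distribution of a sparse matrix means, the sketch below assigns each nonzero, stored as a coordinate triple, to one cell of a processor grid. This is a generic illustration rather than the data layout or hypersparse kernel of the paper. The function name `distribute_2d` and its parameters are hypothetical, and the sketch assumes equally sized blocks obtained by ceiling division.

```python
from collections import defaultdict


def distribute_2d(nonzeros, n_rows, n_cols, grid_rows, grid_cols):
    """Group (row, col, value) triples by the grid cell that owns them.

    Assumption of this sketch: the matrix is cut into grid_rows x grid_cols
    rectangular blocks of (nearly) equal size, and processor (p, q) owns
    the block in block-row p and block-column q.
    """
    block_h = -(-n_rows // grid_rows)  # ceiling division: rows per block
    block_w = -(-n_cols // grid_cols)  # ceiling division: columns per block

    blocks = defaultdict(list)
    for i, j, value in nonzeros:
        owner = (i // block_h, j // block_w)
        blocks[owner].append((i, j, value))
    return dict(blocks)


# Usage: a 4 x 4 matrix with four nonzeros spread over a 2 x 2 grid.
example = [(0, 0, 1.0), (0, 3, 2.0), (2, 1, 3.0), (3, 3, 4.0)]
layout = distribute_2d(example, n_rows=4, n_cols=4, grid_rows=2, grid_cols=2)
```

With many processors each block holds very few nonzeros relative to its dimensions, which is the regime the abstract calls hypersparse.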
Parallel Multidimensional Scaling Performance on Multicore Systems
 Workshop on Advances in High-Performance E-Science Middleware and Applications, in Proceedings of eScience 2008, Indianapolis IN, December 7-12, 2008. http://grids.ucs.indiana.edu/ptliupages/publications/eScience 2008_bae3.pdf
Cited by 6 (3 self)
Multidimensional scaling constructs a configuration of points in the target low-dimensional space such that the interpoint distances approximate the corresponding known dissimilarity values as closely as possible. The SMACOF algorithm is an elegant gradient descent approach to solving the multidimensional scaling problem. We design a parallel SMACOF program that uses parallel matrix multiplication to run on a multicore machine. We also propose a block decomposition algorithm based on the number of threads in order to maintain good load balance. The proposed block decomposition algorithm works very well if the number of row blocks is at least half the number of threads. In this paper, we investigate performance results of the implemented parallel SMACOF in terms of the block size, data size, and the number of threads. The speedup factor is almost 7.7 with 2048 data points over 8 running threads. In addition, a performance comparison between jagged arrays and two-dimensional arrays in the C# language is carried out. The jagged array data structure performs at least 40% better than the two-dimensional array structure.
A Scalable Paradigm for Effectively-Dense Matrix Formulated Applications
 Proc. of the European Conference on HighPerformance Computing and Networking
, 1994
Cited by 4 (4 self)
There is a class of problems in computational science and engineering which require formulation in full matrix form and which are generally solved as dense matrices either because they are dense or because the sparsity cannot be easily exploited. Problems such as those posed by computational electromagnetics, computational chemistry and some quantum physics applications frequently fall into this class. It is not sufficient just to solve the matrix problem for these applications, as other components of the calculation are usually of equal computational load on current computer systems, and these components are consequently of equal importance to the end user of the application. We describe a general method for programming such applications using a combination of distributed computing systems and of more powerful backend compute resources to schedule the components of such applications. We show how this not only improves computational performance but by making more memory available, all...
EDONIO: Extended distributed object network I/O library
, 1995
Cited by 4 (1 self)
This report has been reproduced directly from the best available copy. Available to DOE and DOE contractors from the Office of Scientific and Techni
Exploiting nonblocking remote memory access communication in scientific benchmarks
 In High Performance Computing - HiPC
, 2003
Cited by 4 (2 self)
In the last decade, message passing has become the predominant programming model for scientific applications. The current paper attempts to answer the question of to what degree the performance of well-tuned application benchmarks coded in MPI can be improved by using another related programming model, remote memory access (RMA) communication. In the past,
CRPC Research into Linear Algebra Software for High Performance Computers
, 1994
Cited by 4 (2 self)
In this paper we look at a number of approaches being investigated in the Center for Research on Parallel Computation (CRPC) to develop linear algebra software for high-performance computers. These approaches are exemplified by the LAPACK, templates, and ARPACK projects. LAPACK is a software library for performing dense and banded linear algebra computations, and was designed to run efficiently on high-performance computers. We focus on the design of the distributed memory version of LAPACK, and on an object-oriented interface to LAPACK. The templates project aims at making the task of developing sparse linear algebra software simpler and easier. Reusable software templates are provided that the user can then customize to modify and optimize a particular algorithm, and hence build more complex applications. ARPACK is a software package for solving large-scale eigenvalue problems, and is based on an implicitly restarted variant of the Arnoldi scheme. The paper focuses on issues impact...