Matrix Multiplication On The OTIS-Mesh Optoelectronic Computer
- In Proceedings of the Sixth International Conference on Massively Parallel Processing using Optical Interconnections (MPPOI’99)
, 2001
Cited by 14 (2 self)
We develop algorithms to multiply two vectors, a vector and a matrix, and two matrices on an OTIS-Mesh optoelectronic computer. Two mappings, group row and group submesh [25], of a matrix onto an OTIS-Mesh are considered and the relative merits of each compared. We show that our algorithms to multiply a column and row vector use an optimal number of data moves for both the group row and group submesh mappings; our algorithm to multiply a row vector and a column vector is optimal for the group row mapping; and our algorithm to multiply a matrix by a column vector is optimal for the group row mapping.
Distributed Computational Electromagnetics Systems
Cited by 8 (8 self)
We describe our development of a "real world" electromagnetic application on distributed computing systems. A computational electromagnetics (CEM) simulation for radar cross-section (RCS) modeling of full-scale airborne systems has been ported to three networked workstation cluster systems: an IBM RS/6000 cluster with Ethernet connection; a DEC Alpha farm connected by an FDDI-based Gigaswitch; and an ATM-connected SUN IPXs testbed. We used the ScaLAPACK LU solver from Oak Ridge National Laboratory/University of Tennessee in our parallel implementation to solve the dense matrix system that forms the computationally intensive kernel of this application, and we adopted BLACS as the message-passing interface in all of our code development to achieve high portability across the three configurations. We report the performance data from this work together with timing data, included for comparison, from other MPP systems on which we have implemented this application, including an Intel iPSC/860 and a CM-5.
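The dense solve at the heart of the abstract above can be illustrated with a minimal serial sketch of LU factorization with partial pivoting. This is not ScaLAPACK and not the authors' code: the function name `lu_solve` is hypothetical, the matrix is a plain list of lists, and the block-cyclic data distribution and message passing that a distributed solver relies on are omitted entirely.

```python
def lu_solve(a, b):
    """Solve a x = b by LU factorization with partial pivoting (dense, serial sketch)."""
    n = len(a)
    lu = [row[:] for row in a]  # work on copies so the caller's data is untouched
    x = list(b)
    for k in range(n):
        # Partial pivoting: choose the row with the largest magnitude in column k.
        p = max(range(k, n), key=lambda i: abs(lu[i][k]))
        if lu[p][k] == 0:
            raise ValueError("matrix is singular")
        lu[k], lu[p] = lu[p], lu[k]
        x[k], x[p] = x[p], x[k]
        for i in range(k + 1, n):
            m = lu[i][k] / lu[k][k]
            lu[i][k] = m  # keep the multiplier (entry of L) in place
            for j in range(k + 1, n):
                lu[i][j] -= m * lu[k][j]
            x[i] -= m * x[k]  # apply the same elimination step to the right-hand side
    # Back substitution with the upper-triangular factor U.
    for i in range(n - 1, -1, -1):
        s = x[i] - sum(lu[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / lu[i][i]
    return x
```

The elimination loop costs on the order of n^3 operations for an n-by-n system, which is why the abstract calls the dense solve the computationally intensive kernel.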
Host-Assisted Zero-Copy Remote Memory Access Communication on InfiniBand
Cited by 8 (4 self)
This paper describes how RMA can be implemented efficiently over InfiniBand. The capabilities not offered directly by the InfiniBand verb layer can be implemented efficiently using the novel host-assisted approach while achieving zero-copy communication and supporting an excellent overlap of computation with communication. For contiguous data we are able to achieve a small-message latency of 6 µs and a peak bandwidth of 830 MB/s for 'put', and a small-message latency of 12 µs and a peak bandwidth of 765 MB/s for 'get'. These numbers are almost as good as the performance of the native VAPI layer. For noncontiguous data, the host-assisted approach can deliver bandwidth close to that for contiguous data. We also demonstrate the superior tolerance of host-assisted data-transfer operations to CPU-intensive tasks, due to the minimal host involvement in our approach compared with the traditional host-based approach. Our implementation also supports a very high degree of overlap of computation and communication: we achieved 99% overlap for contiguous data and up to 95% for noncontiguous data with large message sizes. The NAS MG and matrix multiplication benchmarks were used to validate the effectiveness of our approach, and demonstrated excellent overall performance.
Parallel Multidimensional Scaling Performance on Multicore Systems
- Workshop on Advances in High-Performance E-Science Middleware and Applications, in Proceedings of eScience 2008, Indianapolis, IN, December 7-12, 2008. http://grids.ucs.indiana.edu/ptliupages/publications/eScience 2008_bae3.pdf
Cited by 5 (3 self)
Multidimensional scaling constructs a configuration of points in a target low-dimensional space such that the interpoint distances approximate the corresponding known dissimilarity values as closely as possible. The SMACOF algorithm is an elegant gradient descent approach to solving the multidimensional scaling problem. We design a parallel SMACOF program that uses parallel matrix multiplication to run on a multicore machine. We also propose a block decomposition algorithm based on the number of threads in order to maintain good load balance. The proposed block decomposition algorithm works very well if the number of row blocks is at least half the number of threads. In this paper, we investigate the performance of the implemented parallel SMACOF in terms of block size, data size, and number of threads. The speedup factor is almost 7.7 for 2048-point data on 8 running threads. In addition, we compare the performance of jagged arrays and two-dimensional arrays in C#. The jagged array data structure performs at least 40% better than the two-dimensional array structure.
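The idea of decomposing a matrix multiplication into row blocks according to the thread count can be sketched as below. This is a generic even split, not the block decomposition algorithm proposed in the paper. The names `row_blocks`, `multiply_block`, and `parallel_matmul` are hypothetical, and the sketch is in Python rather than the paper's C#. In CPython the global interpreter lock prevents a real speedup for this pure-Python arithmetic, so the code only illustrates how the work is partitioned.

```python
from concurrent.futures import ThreadPoolExecutor


def row_blocks(n_rows, n_threads):
    """Split range(n_rows) into n_threads contiguous blocks whose sizes differ by at most 1."""
    base, extra = divmod(n_rows, n_threads)
    blocks, start = [], 0
    for t in range(n_threads):
        size = base + (1 if t < extra else 0)  # the first `extra` blocks get one more row
        blocks.append((start, start + size))
        start += size
    return blocks


def multiply_block(a, b, lo, hi):
    """Compute rows lo..hi-1 of the product a @ b."""
    cols = list(zip(*b))  # columns of b, so each entry is a row-by-column dot product
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a[lo:hi]]


def parallel_matmul(a, b, n_threads):
    """Multiply a @ b with one row block of the result assigned to each thread."""
    blocks = row_blocks(len(a), n_threads)
    result = []
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        # pool.map yields results in block order, so rows come back in order.
        for part in pool.map(lambda blk: multiply_block(a, b, *blk), blocks):
            result.extend(part)
    return result
```

For example, `row_blocks(10, 4)` gives blocks of 3, 3, 2, and 2 rows, so no thread receives more than one row beyond any other.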
A Scalable Paradigm for Effectively-Dense Matrix Formulated Applications
- Proc. of the European Conference on High-Performance Computing and Networking
, 1994
Cited by 4 (4 self)
There is a class of problems in computational science and engineering which require formulation in full matrix form and which are generally solved as dense matrices, either because they are dense or because the sparsity cannot be easily exploited. Problems such as those posed by computational electromagnetics, computational chemistry and some quantum physics applications frequently fall into this class. It is not sufficient just to solve the matrix problem for these applications, as other components of the calculation are usually of equal computational load on current computer systems, and these components are consequently of equal importance to the end user of the application. We describe a general method for programming such applications using a combination of distributed computing systems and of more powerful back-end compute resources to schedule the components of such applications. We show how this not only improves computational performance but by making more memory available, all...
EDONIO: Extended distributed object network I/O library
, 1995
Cited by 4 (1 self)
This report has been reproduced directly from the best available copy. Available to DOE and DOE contractors from the Office of Scientific and Techni-
Exploiting non-blocking remote memory access communication in scientific benchmarks
- In High Performance Computing - HiPC
, 2003
Cited by 4 (2 self)
In the last decade message passing has become the predominant programming model for scientific applications. The current paper attempts to answer the question to what degree performance of well tuned application benchmarks coded in MPI can be improved by using another related programming model, remote memory access (RMA) communication. In the past,
CRPC Research into Linear Algebra Software for High Performance Computers
, 1994
Cited by 4 (2 self)
In this paper we look at a number of approaches being investigated in the Center for Research on Parallel Computation (CRPC) to develop linear algebra software for high-performance computers. These approaches are exemplified by the LAPACK, templates, and ARPACK projects. LAPACK is a software library for performing dense and banded linear algebra computations, and was designed to run efficiently on high-performance computers. We focus on the design of the distributed memory version of LAPACK, and on an object-oriented interface to LAPACK. The templates project aims at making the task of developing sparse linear algebra software simpler and easier. Reusable software templates are provided that the user can then customize to modify and optimize a particular algorithm, and hence build more complex applications. ARPACK is a software package for solving large-scale eigenvalue problems, and is based on an implicitly restarted variant of the Arnoldi scheme. The paper focuses on issues impact...
Formal Methods For High-Performance Linear Algebra
, 2000
Cited by 3 (0 self)
The core curriculum of any first-rate undergraduate Computer Science department includes at least one course that focuses on the formal derivation and verification of algorithms [6]. Many of us in scientific...
Exploiting Non-blocking Remote Memory Access Communication in Scientific Benchmarks
- In High Performance Computing - HiPC
Cited by 1 (0 self)
This paper describes a comparative performance study of the MPI and Remote Memory Access (RMA) communication models in the context of four scientific benchmarks: NAS MG, NAS CG, SUMMA matrix multiplication, and Lennard-Jones molecular dynamics on clusters with the Myrinet network. It is shown that RMA communication delivers a consistent performance advantage over MPI. In some cases an improvement of as much as 50% was achieved. Benefits of using non-blocking RMA for overlapping computation and communication are discussed.
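For readers unfamiliar with SUMMA, one of the benchmarks named above, its data flow can be sketched as a serial simulation on a q-by-q grid of simulated processes. This is an illustration under simplifying assumptions, not the benchmark code from the paper: the matrices are square, the dimension is divisible by q, each step handles one whole block rather than a narrow panel, and the row and column broadcasts are modeled as plain dictionary lookups instead of MPI or RMA calls. The function names are hypothetical.

```python
def matmul(x, y):
    """Plain dense product of two list-of-lists matrices."""
    cols = list(zip(*y))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in x]


def get_block(m, i, j, bs):
    """Extract the bs-by-bs block at block coordinates (i, j)."""
    return [row[j * bs:(j + 1) * bs] for row in m[i * bs:(i + 1) * bs]]


def summa(a, b, q):
    """Simulate SUMMA for square a, b on a q-by-q process grid (serial sketch)."""
    n = len(a)
    assert n % q == 0, "sketch assumes the dimension is divisible by q"
    bs = n // q
    grid = [(i, j) for i in range(q) for j in range(q)]
    # Each simulated process (i, j) owns one block of A, one of B, and one of C.
    a_blk = {(i, j): get_block(a, i, j, bs) for i, j in grid}
    b_blk = {(i, j): get_block(b, i, j, bs) for i, j in grid}
    c_blk = {(i, j): [[0] * bs for _ in range(bs)] for i, j in grid}
    for k in range(q):
        for i, j in grid:
            # Step k: process (i, j) receives A(i, k) via a row broadcast and
            # B(k, j) via a column broadcast, then updates its local C block.
            update = matmul(a_blk[i, k], b_blk[k, j])
            for c_row, u_row in zip(c_blk[i, j], update):
                for col, value in enumerate(u_row):
                    c_row[col] += value
    # Gather the distributed C blocks back into one matrix.
    c = []
    for i in range(q):
        for r in range(bs):
            row = []
            for j in range(q):
                row.extend(c_blk[i, j][r])
            c.append(row)
    return c
```

Because each step needs only one block of A and one block of B per process, the transfer for step k+1 can in principle be started while the local multiply for step k is still running, which is the kind of computation and communication overlap that non-blocking communication makes possible.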

