Results 1 - 8 of 8
Applied Numerical Linear Algebra
 Society for Industrial and Applied Mathematics
, 1997
Abstract

Cited by 532 (26 self)
We survey general techniques and open problems in numerical linear algebra on parallel architectures. We first discuss basic principles of parallel processing, describing the costs of basic operations on parallel machines, including general principles for constructing efficient algorithms. We illustrate these principles using current architectures and software systems, and by showing how one would implement matrix multiplication. Then, we present direct and iterative algorithms for solving linear systems of equations, linear least squares problems, the symmetric eigenvalue problem, the nonsymmetric eigenvalue problem, and the singular value decomposition. We consider dense, band and sparse matrices.
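As a rough illustration of the matrix-multiplication example the survey mentions, the following is a minimal sketch of a row-block distribution: each simulated processor owns a contiguous block of rows of A, multiplies it by a replicated B, and the partial results concatenate into C. All names (`row_block_matmul`, `matmul`) are invented for illustration; this is not code from the survey.

```python
# Hypothetical sketch: row-block parallel matrix multiply, simulated serially.
def matmul(A, B):
    """Plain serial multiply of a row block by the replicated matrix B."""
    n, m, p = len(A), len(B), len(B[0])
    return [[sum(A[i][k] * B[k][j] for k in range(m)) for j in range(p)]
            for i in range(n)]

def row_block_matmul(A, B, nprocs):
    n = len(A)
    # split rows of A into contiguous blocks, one per simulated processor
    bounds = [n * r // nprocs for r in range(nprocs + 1)]
    blocks = [A[bounds[r]:bounds[r + 1]] for r in range(nprocs)]
    # each processor multiplies its row block by the replicated B
    partial = [matmul(blk, B) for blk in blocks if blk]
    # concatenating the partial results reassembles C
    return [row for part in partial for row in part]

A = [[1, 2], [3, 4], [5, 6]]
B = [[1, 0], [0, 1]]
assert row_block_matmul(A, B, 2) == A  # B is the identity
```

On a real machine each block would live on a separate node and B (or blocks of it) would be communicated; the serial simulation only shows the data decomposition.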
Fast Parallel Algorithms for Short-Range Molecular Dynamics
 JOURNAL OF COMPUTATIONAL PHYSICS
, 1995
Abstract

Cited by 184 (6 self)
Three parallel algorithms for classical molecular dynamics are presented. The first assigns each processor a fixed subset of atoms; the second assigns each a fixed subset of interatomic forces to compute; the third assigns each a fixed spatial region. The algorithms are suitable for molecular dynamics models which can be difficult to parallelize efficiently: those with short-range forces where the neighbors of each atom change rapidly. They can be implemented on any distributed-memory parallel machine which allows for message-passing of data between independently executing processors. The algorithms are tested on a standard Lennard-Jones benchmark problem for system sizes ranging from 500 to 100,000,000 atoms on several parallel supercomputers: the nCUBE 2, Intel iPSC/860 and Paragon, and Cray T3D. Comparing the results to the fastest reported vectorized Cray Y-MP and C90 algorithm shows that the current generation of parallel machines is competitive with conventi...
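The spatial-decomposition idea in the third scheme rests on cell lists: if cells are no smaller than the force cutoff, each atom only needs to examine its own and neighboring cells. A minimal 2-D sketch under that assumption, with invented names (`build_cells`, `neighbor_pairs`) and a periodic square box; it is not the paper's implementation.

```python
# Hedged sketch of cell-list neighbor finding for short-range forces.
from itertools import product

def build_cells(positions, box, cutoff):
    """Bin 2-D positions into square cells of side >= cutoff."""
    ncell = max(1, int(box // cutoff))
    size = box / ncell
    cells = {}
    for i, (x, y) in enumerate(positions):
        key = (int(x // size) % ncell, int(y // size) % ncell)
        cells.setdefault(key, []).append(i)
    return cells, ncell

def neighbor_pairs(positions, box, cutoff):
    """All atom pairs within cutoff, scanning only adjacent cells."""
    cells, ncell = build_cells(positions, box, cutoff)
    pairs = set()
    for (cx, cy), atoms in cells.items():
        for dx, dy in product((-1, 0, 1), repeat=2):
            other = cells.get(((cx + dx) % ncell, (cy + dy) % ncell), [])
            for i in atoms:
                for j in other:
                    if i < j:
                        xi, yi = positions[i]; xj, yj = positions[j]
                        # minimum-image displacement in a periodic box
                        rx = (xi - xj + box / 2) % box - box / 2
                        ry = (yi - yj + box / 2) % box - box / 2
                        if rx * rx + ry * ry < cutoff * cutoff:
                            pairs.add((i, j))
    return pairs
```

In the parallel setting each processor would own the atoms in its spatial region and exchange only the boundary cells with neighboring processors, which is what keeps communication local as neighbors change.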
Communication Primitives for Unstructured Finite Element Simulations on Data Parallel Architectures
 Computing Systems in Engineering, 3(1-4):63-72
, 1992
Abstract

Cited by 9 (8 self)
Efficient data motion is critical for high performance computing on distributed memory architectures. The value of some techniques for efficient data motion is illustrated by identifying generic communication primitives. Further, the efficiency of these primitives is demonstrated on three different applications using the finite element method for unstructured grids and sparse solvers with different communication requirements. For the applications presented, the techniques advocated reduced the communication times by a factor of between 1.5 and 3.

1 Introduction

The finite element method is a popular technique for solving boundary and initial value problems. Moderate sized engineering problems have been successfully simulated using this technique. The primary bottleneck for the simulation of large problems has been available computational resources. With the advent of massively parallel architectures, simulating significantly larger ...
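A generic communication primitive of the kind the abstract alludes to is a gather of off-partition node values into local ghost storage, driven by a precomputed schedule. The sketch below simulates that pattern serially; all names (`build_gather_schedule`, `gather`, `needed_by`) are invented for illustration and do not come from the paper.

```python
# Hedged sketch of a gather primitive for unstructured-grid data.
def build_gather_schedule(owner, needed_by):
    """schedule[src][dst] = global node ids that src must send to dst."""
    schedule = {}
    for dst, ids in needed_by.items():
        for g in ids:
            schedule.setdefault(owner[g], {}).setdefault(dst, []).append(g)
    return schedule

def gather(values, owner, needed_by):
    """Fill each partition's ghost buffer with the off-partition values it reads."""
    schedule = build_gather_schedule(owner, needed_by)
    ghost = {p: {} for p in needed_by}
    for src, dsts in schedule.items():
        for dst, ids in dsts.items():
            # on a real machine this inner loop is one message from src to dst
            for g in ids:
                ghost[dst][g] = values[g]
    return ghost
```

Precomputing the schedule once and reusing it every iteration is what makes the primitive cheap for iterative solvers, where the grid connectivity does not change between sparse matrix-vector products.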
All-to-all Broadcast and Applications on the Connection Machine
, 1991
Abstract

Cited by 1 (0 self)
An all-to-all broadcast algorithm that exploits concurrent communication on all channels of the Connection Machine system CM-200 binary cube network is described. Issues in integrating a physical all-to-all broadcast between processing nodes into a language environment using a global address space are discussed. Timings for the physical broadcast between nodes, and the virtual broadcast, are given for the Connection Machine system CM-200. The peak data transfer rate for the physical broadcast on a CM-200 is 5.9 Gbytes/sec, and the peak rate for the virtual broadcast is 31 Gbytes/sec. Array reshaping is an effective performance optimization technique. An example is given where reshaping improved performance by a factor of seven by reducing the amount of local data motion. We also show how to exploit symmetry for computation of an interaction matrix using the all-to-all broadcast function. Further optimizations are suggested for N-body type calculations. Using the all-to-all broa...
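The general all-to-all broadcast pattern can be simulated over a single ring (a Hamiltonian cycle embedded in the cube): in each of p-1 steps every node forwards the block it most recently received to its ring successor, so after p-1 steps every node holds all p blocks. This is a sketch of the pattern only, not the CM-200 implementation, and the function name is invented.

```python
# Minimal serial simulation of all-to-all broadcast over a ring.
def all_to_all_broadcast(local):
    p = len(local)                      # one data block per node
    have = [[blk] for blk in local]     # blocks accumulated at each node
    last = list(local)                  # block each node forwards next
    for _ in range(p - 1):
        # every node receives from its left neighbor on the ring
        incoming = [last[(i - 1) % p] for i in range(p)]
        for i in range(p):
            have[i].append(incoming[i])
        last = incoming
    return have

result = all_to_all_broadcast(["a", "b", "c", "d"])
# every node ends up holding all four blocks
assert all(sorted(h) == ["a", "b", "c", "d"] for h in result)
```

Using several edge-disjoint Hamiltonian cycles at once, as the paper's title suggests for the binary cube, lets the same schedule run concurrently on all channels and multiplies the effective bandwidth.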
All-to-all Communication Algorithms for Distributed BLAS
 Harvard University
, 1993
Abstract

Cited by 1 (1 self)
This article illustrates the use of the all-to-all broadcast and reduce primitives for dense distributed basic linear algebra operations (DBLAS) such as matrix-vector and vector-matrix multiply, and rank-1 updates. These applications require not only the data values but their indices as well. Detailed schedules for all-to-all broadcast and reduction are described for the data motion of arrays mapped to the processing nodes of binary cube networks using binary encoding and binary-reflected Gray encoding. These algorithms compute the indices for the communicated data locally. Thus, no communication bandwidth is consumed for moving data array indices around. Algorithms for all-to-all broadcast and reduction based on single and multiple Hamiltonian cycles in binary d-cubes, and their implementation on a Connection Machine system CM-200, are described. The performance of different implementations of the Hamiltonian cycle based algorithms is compared with the performance of all-to-all algorithms based on edge-disjoint, multiple spanning trees of minimum height, and the performance of butterfly network based algorithms. These all-to-all algorithms have been incorporated in the distributed matrix-vector (DGEMV) and vector-matrix multiplication (DGEMV with TRANS) and rank-1 (DGER) update functions available in the Connection Machine Scientific Software Library, CMSSL [7], Version 3.0. A summary of the performance of the matrix-vector (DGEMV) and vector-matrix (DGEMV with TRANS) routines is given in Table 1 and in Figure 2.

[Figure 1: All-to-all reduction on a four processing node system.]
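The role of all-to-all reduction in a distributed matrix-vector product can be sketched as follows: with A split into column blocks and x distributed to match, each node forms a full-length partial product, and an element-wise reduction scattered across nodes both sums the partials and leaves each node with its block of y. This is an illustrative serial simulation, not CMSSL's DGEMV; the function name and the even-divisibility assumption (n divisible by p) are mine.

```python
# Hedged sketch: column-block matrix-vector multiply via all-to-all reduction.
def matvec_column_blocks(A_blocks, x_blocks):
    p = len(A_blocks)          # number of simulated nodes
    n = len(A_blocks[0])       # number of rows of A (assumed divisible by p)
    # local phase: node r computes A_blocks[r] @ x_blocks[r], a length-n vector
    partial = []
    for r in range(p):
        partial.append([sum(A_blocks[r][i][k] * x_blocks[r][k]
                            for k in range(len(x_blocks[r])))
                        for i in range(n)])
    # all-to-all reduction: element-wise sum of the partials, scattered so
    # that node r keeps only the r-th block of y
    y = [sum(partial[r][i] for r in range(p)) for i in range(n)]
    blk = n // p
    return [y[r * blk:(r + 1) * blk] for r in range(p)]

# columns of [[1, 2], [3, 4]] split across two nodes; x = [1, 1]
A_blocks = [[[1], [3]], [[2], [4]]]
x_blocks = [[1], [1]]
assert matvec_column_blocks(A_blocks, x_blocks) == [[3], [7]]
```

The transposed product (vector-matrix) follows the dual pattern, with the broadcast and reduction roles exchanged, which is why the article treats the two primitives together.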
Language and compiler issues in scalable high performance scientific libraries
 PROCEEDINGS OF THE THIRD WORKSHOP ON COMPILERS FOR PARALLEL COMPUTERS
, 1992
Abstract

Cited by 1 (0 self)
Library functions for scalable architectures must be designed to correctly and efficiently support any distributed data structure that can be created with the supported languages and associated compiler directives. Libraries must also be designed to support concurrency in each function evaluation, as well as the concurrent application of functions to disjoint array segments, known as multiple-instance computation. Control over the data distribution is often critical for locality of reference, and so is control over the interprocessor data motion. Scalability, while preserving efficiency, implies that the data distribution, the data motion, and the scheduling are adapted to the object shapes, the machine configuration, and the size of the objects relative to the machine size. The Connection Machine Scientific Software Library is a scalable library for distributed data structures. The library is designed for languages with an array syntax. It is accessible from all supported languages (Lisp, C, CM Fortran, and Paris (PARallel Instruction Set) in combination with Lisp, C, and Fortran 77). Single library calls can manage both concurrent application of a function to disjoint array segments, as well as concurrency in ...
Massively Parallel Computing: Mathematics and communications libraries
, 1993
Abstract
Massively parallel computing holds the promise of extreme performance. The utility of these systems will depend heavily upon the availability of libraries until compilation and runtime system technology is developed to a level comparable to what today is common on most uniprocessor systems. Critical for performance is the ability to exploit locality of reference and effective management of the communication resources. We discuss some techniques for preserving locality of reference in distributed memory architectures. In particular, we discuss the benefits of multidimensional address spaces instead of the conventional linearized address spaces, partitioning of irregular grids, and placement of partitions among nodes. Some of these techniques are supported as language directives, others as runtime system functions, and others still are part of the Connection Machine Scientific Software Library, CMSSL. We briefly discuss some of the unique design issues in this library for distribute...
PRELIMINARY DOCUMENTATION
, 1993
Abstract
The information in this document is subject to change without notice and should not be construed as a commitment by Thinking Machines Corporation. Thinking Machines assumes no liability for errors in this document. This document does not describe any product that is currently available from Thinking Machines Corporation, and Thinking Machines does not commit to implement the contents of this document in any product. Connection Machine® is a registered trademark of Thinking Machines Corporation. CM, CM-2, CM-200, CM-5, CM-5 Scale 3, and DataVault are trademarks of Thinking Machines Corporation. CMOST, CMAX, and Prism are trademarks of Thinking Machines Corporation. C*® is a registered trademark of Thinking Machines Corporation. FastGraph is a trademark of Thinking Machines Corporation. Paris, *Lisp, and CM Fortran are trademarks of Thinking Machines Corporation. CMMD, CMSSL, and CMX11 are trademarks of Thinking Machines Corporation. Scalable Computing (SC) is a trademark of Thinking Machines Corporation.