Results 1–8 of 8
Transporting Distributed BLAS to the Fujitsu AP3000 and VPP300, 1998
Abstract

Cited by 7 (5 self)
The DBLAS Distributed BLAS Library is a portable version of parallel BLAS that has been highly tuned for the Fujitsu AP1000 and AP+. In this paper, we describe performance enhancements made for two very different high-performance distributed memory platforms, the Fujitsu AP3000 and the Fujitsu VPP300. Even with the provision of highly tuned (vendor-supplied) serial BLAS implementations, attention must be given to cell computation speed issues, since serial BLAS supplies neither a local matrix transpose routine (which is needed in many places) nor routines to adequately handle the triangular matrices which arise in the parallel context. We will describe the differing principles used on the UltraSPARC and VPP300 nodes to optimise memory access patterns for the local matrix transpose operation and the large matrix multiply. The former uses partitioning methods which can yield a factor of 3–4 improvement over naive methods. The latter simultaneously optimizes usage of two leve...
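The abstract's partitioned (cache-blocked) local transpose can be illustrated with a minimal sketch; the tile size and flat row-major storage here are illustrative assumptions, not the paper's actual implementation:

```python
def blocked_transpose(a, n, bs=64):
    """Transpose an n-by-n matrix stored row-major in the flat list a,
    visiting it in bs-by-bs tiles so that both the read stream and the
    write stream stay within a cache-sized working set (a hypothetical
    stand-in for the paper's partitioning methods)."""
    t = [0.0] * (n * n)
    for ii in range(0, n, bs):
        for jj in range(0, n, bs):
            for i in range(ii, min(ii + bs, n)):
                for j in range(jj, min(jj + bs, n)):
                    t[j * n + i] = a[i * n + j]
    return t
```

A naive element-by-element transpose strides through one of the two arrays with stride n, defeating the cache for large n; tiling bounds both strides to the block size.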
A Dense Complex Symmetric Indefinite Solver for the Fujitsu AP3000, 1999
Abstract

Cited by 4 (1 self)
This paper describes the design, implementation and performance of a parallel direct dense symmetric-indefinite solver routine. Such a solver is required for the large complex systems arising from electromagnetic field analysis, such as are generated from the AccuField application. The primary target architecture for the solver is the Fujitsu AP3000, a distributed memory machine based on the UltraSPARC processor. The routine is written entirely in terms of the DBLAS Distributed Library, recently extended for complex precision. It uses the Bunch-Kaufman diagonal pivoting method and is based on the LAPACK algorithm, with several modifications required for efficient parallel implementation and one modification to reduce the amount of symmetric pivoting. Currently the routine uses a standard BLAS computational interface and can use either the MPI, BLACS or VPPLib communication interfaces (the latter is only available under the AP runtime V2.0 system for the AP3000). The routine outperforms its e...
OPTIMAL LOAD BALANCING TECHNIQUES FOR BLOCK-CYCLIC DECOMPOSITIONS FOR MATRIX FACTORIZATION
Abstract

Cited by 3 (1 self)
In this paper, we present a new load balancing technique, called panel scattering, which is generally applicable for parallel block-partitioned dense linear algebra algorithms, such as matrix factorization. Here, the panels formed in such computation are divided across their length and evenly (re)distributed among all processors. It is shown how this technique can be efficiently implemented for the general block-cyclic matrix distribution, requiring only the collective communication primitives that are required for block-cyclic parallel BLAS. In most situations, panel scattering yields optimal load balance and cell computation speed across all stages of the computation. It also has the advantage of naturally yielding good memory access patterns. Compared with traditional methods which minimize communication costs at the expense of load balance, it has a small (in some situations negative) increase in communication volume costs. However, it incurs extra communication startup costs, but only by a factor not exceeding 2. To maximize load balance, storage block sizes should be kept small; furthermore, in many situations of interest, there will be a small or even negative communication penalty for doing so. Results will be given on the Fujitsu AP+ parallel computer, which compare the performance of panel scattering with previously established methods for LU, LLT and QR factorization. These are consistent with a detailed performance model for LU factorization.
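The block-cyclic distribution that panel scattering operates over can be sketched as a simple index map; this 1-D version (the 2-D case applies it independently to rows and columns) is a generic illustration, not code from the paper:

```python
def block_cyclic_owner(global_idx, block_size, num_procs):
    """Map a global row/column index to (owning process, local index)
    under a 1-D block-cyclic distribution: indices are grouped into
    blocks of block_size, and the blocks are dealt out to processes
    round-robin, as in block-cyclic parallel BLAS layouts."""
    block = global_idx // block_size       # which block the index falls in
    proc = block % num_procs               # blocks are dealt out cyclically
    local_block = block // num_procs       # position of that block locally
    local_idx = local_block * block_size + global_idx % block_size
    return proc, local_idx
```

Small block sizes spread consecutive panels over more processes, which is why the abstract notes that keeping storage block sizes small maximizes load balance.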
AN EFFICIENT AND STABLE METHOD FOR PARALLEL FACTORIZATION OF DENSE SYMMETRIC INDEFINITE MATRICES
Abstract

Cited by 2 (0 self)
This paper investigates the efficient parallelization of algorithms with strong stability guarantees to factor dense symmetric indefinite matrices. It shows how the bounded Bunch-Kaufman algorithm may be efficiently parallelized, and then how its performance can be enhanced by using exhaustive block searching techniques, which are effective in keeping most symmetric interchanges within the current elimination block. This can avoid the wasted computation and communication normally involved in parallel symmetric interchanges, but requires considerable effort to reduce its introduced overheads. It also has great potential for out-of-core algorithms. Results on a 16-node Fujitsu AP3000 multicomputer showed that the block search increased performance over the (plain) bounded Bunch-Kaufman algorithm by 5–14% on strongly indefinite matrices, in some cases outperforming the well-known Bunch-Kaufman algorithm (which is without strong stability guarantees).
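For readers unfamiliar with the pivoting method these abstracts build on, the following is a sketch of the classical (unbounded) Bunch-Kaufman pivot-selection test at one elimination step; the bounded variant the paper parallelizes adds further searching, which this sketch does not attempt to reproduce:

```python
from math import sqrt

ALPHA = (1 + sqrt(17)) / 8  # classical Bunch-Kaufman growth-control constant

def choose_pivot(A, k):
    """Decide the pivot for step k of Bunch-Kaufman diagonal pivoting on a
    symmetric matrix A (list of lists). Returns ('1x1', r) or ('2x2', r),
    where r is the row to interchange with (r == k means no interchange)."""
    n = len(A)
    # lambda: largest off-diagonal magnitude in column k, and its row r
    lam, r = max(((abs(A[i][k]), i) for i in range(k + 1, n)),
                 default=(0.0, k))
    if lam == 0.0 or abs(A[k][k]) >= ALPHA * lam:
        return ('1x1', k)                 # diagonal entry is safe as-is
    # sigma: largest off-diagonal magnitude in column/row r
    sigma = max(abs(A[i][r]) for i in range(k, n) if i != r)
    if abs(A[k][k]) * sigma >= ALPHA * lam * lam:
        return ('1x1', k)                 # still safe without interchange
    if abs(A[r][r]) >= ALPHA * sigma:
        return ('1x1', r)                 # interchange k <-> r, then 1x1 pivot
    return ('2x2', r)                     # 2x2 pivot on rows/columns k and r
```

The interchanges chosen here are exactly the "symmetric interchanges" whose communication cost the paper's block searching tries to confine within the current elimination block.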
Development of a Sparse Direct LLT Solver for the Fujitsu AP3000
Abstract
Efficient parallel algorithms for Sparse Direct LLT Solvers are complex and involve many tradeoffs. In this paper, we compare the performance of two sparse matrix solvers, a parallel one called PSPASES and a sequential one called DRSPSL, on a (virtual) AP3000, which is UltraSPARC-based. An overview of the algorithms and software structure of the two solvers will be given, together with an analysis of the suitability of both for machines such as the AP3000. DRSPSL's performance compares favorably on a 2-node virtual AP3000. An outline of how to parallelise DRSPSL will then be presented.
1 Introduction
Parallel sparse direct LLT solvers are used to solve a system of linear equations AX = B where A is a sparse symmetric positive definite matrix. There are a variety of parallel algorithms available, but finding an efficient algorithm is complex and involves many tradeoffs. In this paper we compare two solvers on a (virtual) AP3000: PSPASES (Parallel SPArse Symmetric dirEct Solver), developed at...
Development of a Sparse Direct LLT Solver . . .
Abstract
Efficient parallel algorithms for Sparse Direct LLT Solvers are complex and involve many tradeoffs. In this paper, we compare two solvers called PSPASES and VPPdLLT on a (virtual) AP3000, which is UltraSPARC-based. Both of these involve four stages: a reordering stage to minimize fill-in, a symbolic factorization stage to determine the nonzero structure of the factored matrix, a numerical factorization stage to compute the values of the nonzeros, and a triangular systems solution stage. An overview of the algorithms and software structure of the two solvers will be given, together with an analysis of the suitability of both for machines such as the AP3000. VPPdLLT's performance compares favorably on a 2-node virtual AP3000. An outline of how to tune and extend VPPdLLT to larger processor configurations will then be presented.
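A central data structure behind the symbolic factorization stage named in this abstract is the elimination tree; the following compact sketch (Liu's path-compression algorithm, with an assumed row-pattern input format, not code from PSPASES or VPPdLLT) shows how it is computed from the nonzero pattern alone:

```python
def elimination_tree(n, lower_rows):
    """Compute the elimination tree of an n-by-n sparse symmetric matrix.
    lower_rows[k] lists the column indices j < k with a_kj nonzero (the
    strict lower-triangle pattern of row k).  Returns parent[], where
    parent[j] is j's parent column in the tree, or -1 at a root."""
    parent = [-1] * n
    ancestor = [-1] * n                  # path-compressed ancestor pointers
    for k in range(n):
        for j in lower_rows[k]:
            # climb from j to the current root of its subtree
            while ancestor[j] != -1 and ancestor[j] != k:
                nxt = ancestor[j]
                ancestor[j] = k          # path compression for later climbs
                j = nxt
            if ancestor[j] == -1:        # j's subtree becomes a child of k
                ancestor[j] = k
                parent[j] = k
    return parent
```

The tree's structure determines both the fill-in pattern computed in the symbolic stage and the task dependencies exploited by parallel numerical factorization.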