Transporting Distributed BLAS to the Fujitsu AP3000 and VPP300
, 1998
Cited by 7 (5 self)
The DBLAS Distributed BLAS Library is a portable version of parallel BLAS that has been highly tuned for the Fujitsu AP1000 and AP+. In this paper, we describe performance enhancements made for two very different high-performance distributed memory platforms, the Fujitsu AP3000 and the Fujitsu VPP300. Even with the provision of highly tuned (vendor-supplied) serial BLAS implementations, attention must be given to cell computation speed issues, since serial BLAS does not supply a local matrix transpose routine (which is needed in many places), nor does it supply routines to adequately handle the triangular matrices which arise in the parallel context. We will describe the differing principles used on the UltraSPARC and VPP300 nodes to optimise memory access patterns for the local matrix transpose operation and the large matrix multiply. The former uses partitioning methods which can yield a factor of 3-4 improvement over naive methods. The latter simultaneously optimizes usage of two leve...
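The partitioned local transpose mentioned in this abstract can be illustrated with a minimal sketch. This is not DBLAS code: the function name `transpose_blocked`, the flat row-major storage, and the default block size of 32 are all assumptions made for illustration. The idea is only that working on one small tile at a time keeps both the source rows and the destination rows in cache, where a naive loop would stride through the whole destination array.

```python
def transpose_blocked(a, rows, cols, block=32):
    """Transpose a rows x cols row-major matrix held in a flat list.

    The matrix is processed tile by tile (block x block); `block` is a
    hypothetical tuning parameter, not a value taken from the paper.
    """
    out = [0] * (rows * cols)
    for i0 in range(0, rows, block):          # tile row start
        for j0 in range(0, cols, block):      # tile column start
            for i in range(i0, min(i0 + block, rows)):
                for j in range(j0, min(j0 + block, cols)):
                    # element (i, j) of the input becomes (j, i) of the output
                    out[j * rows + i] = a[i * cols + j]
    return out
```

In pure Python the tiling will not show the speedup reported for the UltraSPARC; the sketch only shows the loop structure, which would normally be written in C or Fortran.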
A Dense Complex Symmetric Indefinite Solver for the Fujitsu AP3000
, 1999
Cited by 4 (1 self)
This paper describes the design, implementation and performance of a parallel direct dense symmetric-indefinite solver routine. Such a solver is required for the large complex systems arising from electromagnetic field analysis, such as are generated from the AccuField application. The primary target architecture for the solver is the Fujitsu AP3000, a distributed memory machine based on the UltraSPARC processor. The routine is written entirely in terms of the DBLAS Distributed Library, recently extended for complex precision. It uses the Bunch-Kaufman diagonal pivoting method and is based on the LAPACK algorithm, with several modifications required for efficient parallel implementation and one modification to reduce the amount of symmetric pivoting. Currently the routine uses a standard BLAS computational interface and can use either the MPI, BLACS or VPPLib communication interfaces (the latter is only available under the APruntime V2.0 system for the AP3000). The routine outperforms its e...
OPTIMAL LOAD BALANCING TECHNIQUES FOR BLOCK-CYCLIC DECOMPOSITIONS FOR MATRIX FACTORIZATION
Cited by 3 (1 self)
In this paper, we present a new load balancing technique, called panel scattering, which is generally applicable to parallel block-partitioned dense linear algebra algorithms, such as matrix factorization. Here, the panels formed in such computations are divided across their length and evenly (re)distributed among all processors. It is shown how this technique can be efficiently implemented for the general block-cyclic matrix distribution, requiring only the collective communication primitives that are required for block-cyclic parallel BLAS. In most situations, panel scattering yields optimal load balance and cell computation speed across all stages of the computation. It also has advantages in naturally yielding good memory access patterns. Compared with traditional methods, which minimize communication costs at the expense of load balance, it has a small (in some situations negative) increase in communication volume costs. It does incur extra communication startup costs, but only by a factor not exceeding 2. To maximize load balance, storage block sizes should be kept small; furthermore, in many situations of interest, there will be a small or even negative communication penalty for doing so. Results will be given on the Fujitsu AP+ parallel computer, comparing the performance of panel scattering with previously established methods for LU, LL^T and QR factorization. These are consistent with a detailed performance model for LU factorization.
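The block-cyclic distribution that this abstract builds on can be sketched in one dimension as follows. This illustrates the standard index mapping only, not panel scattering itself. The function names, and the assumption that block 0 lives on process 0, are hypothetical choices made for the example.

```python
def block_cyclic_owner(i, nb, p):
    """Map global index i to (owning process, local index).

    nb is the storage block size and p the number of processes;
    block 0 is assumed to be stored on process 0.
    """
    block = i // nb                      # global block number
    owner = block % p                    # blocks are dealt out round-robin
    local = (block // p) * nb + i % nb   # position within the owner's storage
    return owner, local


def remaining_load(n, nb, p, k):
    """Count, per process, the indices in [k, n) that are still active."""
    counts = [0] * p
    for i in range(k, n):
        counts[block_cyclic_owner(i, nb, p)[0]] += 1
    return counts
```

For example, with 16 indices on 4 processes and the first 8 already eliminated, `remaining_load(16, 1, 4, 8)` gives `[2, 2, 2, 2]` while `remaining_load(16, 4, 4, 8)` gives `[0, 0, 4, 4]`. This is the load imbalance that small storage block sizes help to avoid.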
AN EFFICIENT AND STABLE METHOD FOR PARALLEL FACTORIZATION OF DENSE SYMMETRIC INDEFINITE MATRICES
Cited by 2 (0 self)
This paper investigates the efficient parallelization of algorithms with strong stability guarantees to factor dense symmetric indefinite matrices. It shows how the bounded Bunch-Kaufman algorithm may be efficiently parallelized, and then how its performance can be enhanced by using exhaustive block searching techniques, which are effective in keeping most symmetric interchanges within the current elimination block. This can avoid wasted computation and the communication normally involved in parallel symmetric interchanges, but requires considerable effort to reduce the overheads it introduces. It also has great potential for out-of-core algorithms. Results on a 16-node Fujitsu AP3000 multicomputer showed that the block search increased performance over the (plain) bounded Bunch-Kaufman algorithm by 5–14% on strongly indefinite matrices, and in some cases outperformed the well-known Bunch-Kaufman algorithm (which is without strong stability guarantees).
Development of a Sparse Direct LLT Solver . . .
Efficient parallel algorithms for Sparse Direct LL^T Solvers are complex and involve many tradeoffs. In this paper, we compare two solvers called PSPASES and VPPdLLT on a (virtual) AP3000, which is UltraSPARC-based. Both of these involve 4 stages: a reordering stage to minimize fill-in, a symbolic factorization stage to determine the nonzero structure of the factored matrix, a numerical factorization stage to compute the values of the nonzeroes, and a triangular systems solution stage. An overview of the algorithms and software structure of the two solvers will be given, together with an analysis of the suitability of both for machines such as the AP3000. VPPdLLT's performance compares favorably on a 2-node virtual AP3000. An outline of how to tune and extend VPPdLLT to larger processor configurations will then be presented.
Software Overhead and Blocking Issues in Parallel BLAS
, 1998
ScaLAPACK [3, 5, 1] is a parallel dense linear algebra library based on block-partitioned algorithms, with essentially all computations (and most communication) being expressed in terms of the Level 1, 2 and 3 Parallel Basic Linear Algebra Subroutines (PBLAS).
Technology Development Group, Fujitsu Ltd.
, 2000
ACCUFIELD is a commercial application for electromagnetic field analysis currently being developed by Fujitsu Ltd. It can be used to simulate the emissions for computer hardware subsystems such as a printed circuit board, a combination of circuit boards and wire connecting boards, or even a cabinet. The ACCUFIELD software consists of a preprocessor, a solver and a postprocessor. The preprocessor is used to input or modify a model of the subsystem to be analysed. It converts the model to a dense complex symmetric indefinite linear system, which can then be solved for various EM frequencies by the solver. These systems are often `weakly indefinite', that is, most diagonal elements are considerably larger than the off-diagonal elements in the same column. The postprocessor displays the resultant electromagnetic fields. ACCUFIELD also has a highly developed user interface, enabling many convenient model editing functions and graphical displays of the resulting analysis.
Frequency Interpolation Methods for Accelerating Parallel EMC Analysis
Electromagnetic field analysis applications based on the Method of Moments can be used to simulate the emissions for electrical devices such as a printed circuit board, a combination of circuit boards and wire connecting boards, or even a cabinet. At the heart of such applications is a solver, which solves a symmetric indefinite dense linear system of size N assembled from the model of the electrical device. The main computational challenges lie in the solver stage, where an O(N^3) computation is required for the direct solution of the linear system about a central frequency. For the direct solution we use a general symmetric matrix factorization algorithm, requiring O(N^3) FLOPs. This algorithm’s efficiency is demonstrated by its parallel speedup of 5 for moderate-sized matrices on an 8-node AP3000. Some of this cost can be amortized using the fast frequency stepping method, where the system can be solved for nearby frequencies by O(N^2) iterative methods, using the solution at the central frequency as a preconditioner. Due to the high parallel efficiency of the direct method, the frequency stepping method reduced parallel solution time by a factor of 2 for moderate-sized matrices, with larger improvements expected for large matrices.
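The frequency stepping idea can be sketched as follows. The abstract does not name the iterative method, so this example assumes a simple preconditioned stationary (Richardson) iteration. All function names and the small test matrices are hypothetical. The matrix `a0` stands for the system at the central frequency and `a1` for the system at a nearby frequency.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]  # augmented copy
    for k in range(n):
        p = max(range(k, n), key=lambda r: abs(m[r][k]))
        m[k], m[p] = m[p], m[k]
        for r in range(k + 1, n):
            f = m[r][k] / m[k][k]
            for c in range(k, n + 1):
                m[r][c] -= f * m[k][c]
    x = [0] * n
    for k in reversed(range(n)):
        s = sum(m[k][c] * x[c] for c in range(k + 1, n))
        x[k] = (m[k][n] - s) / m[k][k]
    return x


def matvec(a, x):
    """Dense matrix-vector product, O(N^2) work."""
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in a]


def frequency_step(a0, a1, b, tol=1e-12, max_iter=50):
    """Solve a1 x = b iteratively, using a0 as the preconditioner.

    For brevity each call to solve() refactors a0; a real code would
    factor a0 once and reuse the factors, so that each iteration costs
    only O(N^2).
    """
    x = solve(a0, b)                       # central-frequency solution as first guess
    for it in range(max_iter):
        r = [bi - yi for bi, yi in zip(b, matvec(a1, x))]
        if max(abs(ri) for ri in r) <= tol:
            return x, it
        dx = solve(a0, r)                  # apply the preconditioner
        x = [xi + di for xi, di in zip(x, dx)]
    return x, max_iter
```

This iteration converges only when `a1` is close enough to `a0`, which is the sense in which the method applies to nearby frequencies.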