Results 1 - 10 of 59
A Scalable Linear Algebra Library for Distributed Memory Concurrent Computers
, 1992
Cited by 161 (33 self)
This paper describes ScaLAPACK, a distributed memory version of the LAPACK software package for dense and banded matrix computations. Key design features are the use of distributed versions of the Level 3 BLAS as building blocks, and an object-based interface to the library routines. The square block scattered decomposition is described. The implementation of a distributed memory version of the right-looking LU factorization algorithm on the Intel Delta multicomputer is discussed, and performance results are presented that demonstrate the scalability of the algorithm.
Scalable Load Balancing Techniques for Parallel Computers
, 1994
Cited by 100 (16 self)
In this paper we analyze the scalability of a number of load balancing algorithms which can be applied to problems that have the following characteristics: the work done by a processor can be partitioned into independent work pieces; the work pieces are of highly variable sizes; and it is not possible (or very difficult) to estimate the size of total work at a given processor. Such problems require a load balancing scheme that distributes the work dynamically among different processors. Our goal here is to determine the most scalable load balancing schemes for different architectures such as hypercube, mesh and network of workstations. For each of these architectures, we establish lower bounds on the scalability of any possible load balancing scheme. We present the scalability analysis of a number of load balancing schemes that have not been analyzed before. This gives us valuable insights into their relative performance for different problem and architectural characteristi...
Analyzing Scalability of Parallel Algorithms and Architectures
 Journal of Parallel and Distributed Computing
, 1994
Cited by 90 (18 self)
The scalability of a parallel algorithm on a parallel architecture is a measure of its capacity to effectively utilize an increasing number of processors. Scalability analysis may be used to select the best algorithm-architecture combination for a problem under different constraints on the growth of the problem size and the number of processors. It may be used to predict the performance of a parallel algorithm and a parallel architecture for a large number of processors from the known performance on fewer processors. For a fixed problem size, it may be used to determine the optimal number of processors to be used and the maximum possible speedup that can be obtained. The objective of this paper is to critically assess the state of the art in the theory of scalability analysis, and motivate further research on the development of new and more comprehensive analytical tools to study the scalability of parallel algorithms and architectures. We survey a number of techniques and formalisms t...
Software libraries for linear algebra computations on high performance computers
 SIAM REVIEW
, 1995
Cited by 68 (17 self)
This paper discusses the design of linear algebra libraries for high performance computers. Particular emphasis is placed on the development of scalable algorithms for MIMD distributed memory concurrent computers. A brief description of the EISPACK, LINPACK, and LAPACK libraries is given, followed by an outline of ScaLAPACK, which is a distributed memory version of LAPACK currently under development. The importance of block-partitioned algorithms in reducing the frequency of data movement between different levels of hierarchical memory is stressed. The use of such algorithms helps reduce the message startup costs on distributed memory concurrent computers. Other key ideas in our approach are the use of distributed versions of the Level 3 Basic Linear Algebra Subprograms (BLAS) as computational building blocks, and the use of Basic Linear Algebra Communication Subprograms (BLACS) as communication building blocks. Together the distributed BLAS and the BLACS can be used to construct highe...
A Parallel Implementation of the Nonsymmetric QR Algorithm for Distributed Memory Architectures
 SIAM J. SCI. COMPUT
, 2002
Cited by 36 (3 self)
One approach to solving the nonsymmetric eigenvalue problem in parallel is to parallelize the QR algorithm. Not long ago, this was widely considered to be a hopeless task. Recent efforts have led to significant advances, although the methods proposed up to now have suffered from scalability problems. This paper discusses an approach to parallelizing the QR algorithm that greatly improves scalability. A theoretical analysis indicates that the algorithm is ultimately not scalable, but the nonscalability does not become evident until the matrix dimension is enormous. Experiments on the Intel Paragon system, the IBM SP2 supercomputer, the SGI Origin 2000, and the Intel ASCI Option Red supercomputer are reported.
Unstructured Tree Search on SIMD Parallel Computers
 IEEE Transactions on Parallel and Distributed Systems
, 1994
Cited by 35 (14 self)
In this paper, we present new methods for load balancing of unstructured tree computations on large-scale SIMD machines, and analyze the scalability of these and other existing schemes. An efficient formulation of tree search on a SIMD machine comprises two major components: (i) a triggering mechanism, which determines when the search space redistribution must occur to balance search space over processors; and (ii) a scheme to redistribute the search space. We have devised a new redistribution mechanism and a new triggering mechanism. Either of these can be used in conjunction with triggering and redistribution mechanisms developed by other researchers. We analyze the scalability of these mechanisms, and verify the results experimentally. The analysis and experiments show that our new load balancing methods are highly scalable on SIMD architectures. Their scalability is shown to be no worse than that of the best load balancing schemes on MIMD architectures. We verify our theoretical...
Parallel Algorithms For The Spectral Transform Method
, 1994
Cited by 32 (13 self)
The spectral transform method is a standard numerical technique for solving partial differential equations on a sphere and is widely used in atmospheric circulation models. Recent research has identified several promising algorithms for implementing this method on massively parallel computers; however, no detailed comparison of the different algorithms has previously been attempted. In this paper, we describe these different parallel algorithms and report on computational experiments that we have conducted to evaluate their efficiency on parallel computers. The experiments used a testbed code that solves the nonlinear shallow water equations on a sphere; considerable care was taken to ensure that the experiments provide a fair comparison of the different algorithms and that the results are relevant to global models. We focus on hypercube and mesh-connected multicomputers with cut-through routing, such as the Intel iPSC/860, DELTA, and Paragon, and the nCUBE/2, but also indicate how th...
Scalability of parallel algorithms for the all-pairs shortest path problem
 in the Proceedings of the International Conference on Parallel Processing
, 1991
Cited by 32 (13 self)
This paper uses the isoefficiency metric to analyze the scalability of several parallel algorithms for finding shortest paths between all pairs of nodes in a densely connected graph. Parallel algorithms analyzed in this paper have either been previously presented elsewhere or are small variations of them. Scalability is analyzed with respect to mesh, hypercube and shared-memory architectures. We demonstrate that isoefficiency functions are a compact and useful predictor of performance. In fact, previous comparative predictions of some of the algorithms based on experimental results are shown to be incorrect whereas isoefficiency functions predict correctly. We find the classic tradeoffs of hardware cost vs. time and memory vs. time to be represented here as tradeoffs of hardware cost vs. scalability and memory vs. scalability.
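As a rough illustration of the isoefficiency metric named in this abstract, the sketch below computes how much work W is needed to hold parallel efficiency E = W / (W + T_o) at a fixed level as the processor count p grows. The function names are illustrative, and the overhead function in the usage line (2 p log2 p) is a hypothetical stand-in, not a result taken from the paper. The bisection assumes efficiency does not decrease as W grows.

```python
import math

def efficiency(work, p, overhead):
    """Parallel efficiency E = W / (W + T_o(W, p)), all in the same time units."""
    return work / (work + overhead(work, p))

def isoefficiency_work(p, target, overhead, tol=1e-9):
    """Approximate the smallest W with efficiency(W, p) >= target.

    Assumes efficiency is non-decreasing in W, which holds when the
    overhead T_o grows more slowly than W itself.
    """
    lo, hi = 0.0, 1.0
    # Grow the upper bound until the target efficiency is reached.
    while efficiency(hi, p, overhead) < target:
        hi *= 2.0
    # Bisect between a too-small and a large-enough amount of work.
    while hi - lo > tol * max(hi, 1.0):
        mid = (lo + hi) / 2.0
        if efficiency(mid, p, overhead) >= target:
            hi = mid
        else:
            lo = mid
    return hi

# Hypothetical overhead model (an assumption for illustration only).
def example_overhead(work, p):
    return 2.0 * p * math.log2(p)

if __name__ == "__main__":
    for p in (8, 64, 512):
        print(p, round(isoefficiency_work(p, 0.8, example_overhead), 3))
```

Under this assumed overhead, the required work grows in proportion to p log2 p, which is what an isoefficiency function expresses: the rate at which the problem must grow with p to keep efficiency constant.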
Performance and scalability of preconditioned conjugate gradient methods on parallel computers
 Department of Computer Science, University of Minnesota
, 1995
Scalability Issues Affecting the Design of a Dense Linear Algebra Library
 JOURNAL OF PARALLEL AND DISTRIBUTED COMPUTING
, 1994
Cited by 23 (12 self)
This paper discusses the scalability of Cholesky, LU, and QR factorization routines on MIMD distributed memory concurrent computers. These routines form part of the ScaLAPACK mathematical software library that extends the widely used LAPACK library to run efficiently on scalable concurrent computers. To ensure good scalability and performance, the ScaLAPACK routines are based on block-partitioned algorithms that reduce the frequency of data movement between different levels of the memory hierarchy, and particularly between processors. The block cyclic data distribution, which is used in all three factorization algorithms, is described. An outline of the sequential and parallel block-partitioned algorithms is given. Approximate models of algorithms' performance are presented to indicate which factors in the design of the algorithm have an impact upon scalability. These models are compared with timing results on a 128-node Intel iPSC/860 hypercube. It is shown that the routines are highl...
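The block cyclic data distribution mentioned in this abstract can be sketched as an index mapping. The code below is a minimal illustration of the standard mapping, assuming 0-based indices and that the first block lives on process 0; the function names and parameters are illustrative and are not ScaLAPACK's API.

```python
def block_cyclic_owner(g, nb, nprocs):
    """Map a 0-based global index g to (process, local index).

    Indices are grouped into blocks of size nb, and the blocks are dealt
    out to processes 0..nprocs-1 in round-robin order.
    """
    block = g // nb                 # which global block holds g
    proc = block % nprocs           # blocks are assigned cyclically
    local_block = block // nprocs   # position of that block on its owner
    return proc, local_block * nb + g % nb

def owner_2d(i, j, mb, nb, prow, pcol):
    """Apply the 1-D mapping independently to rows and columns of a
    prow x pcol process grid (mb x nb blocks)."""
    pr, li = block_cyclic_owner(i, mb, prow)
    pc, lj = block_cyclic_owner(j, nb, pcol)
    return (pr, pc), (li, lj)

if __name__ == "__main__":
    # Show which process owns each of the first 12 indices (nb=2, 3 processes).
    print([block_cyclic_owner(g, 2, 3)[0] for g in range(12)])
```

Because consecutive blocks land on different processes, each process keeps a share of the remaining submatrix as a factorization proceeds, which is the usual load balancing argument for this layout.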