Results 1-10 of 78
Improving Memory-System Performance of Sparse Matrix-Vector Multiplication
 IBM Journal of Research and Development
, 1997
Abstract
Cited by 72 (0 self)
Sparse matrix-vector multiplication is an important kernel that often runs inefficiently on superscalar RISC processors. This paper describes techniques that increase instruction-level parallelism and improve performance. The techniques include reordering to reduce cache misses (originally due to Das et al.), blocking to reduce load instructions, and prefetching to prevent multiple load-store units from stalling simultaneously. The techniques improve performance from about 40 Mflops (on a well-ordered matrix) to over 100 Mflops on a 266-Mflops machine. The techniques are applicable to other superscalar RISC processors as well and have improved performance on a Sun UltraSPARC I workstation, for example. 1 Introduction Sparse matrix-vector multiplication is an important computational kernel in many iterative linear solvers (see [5], for example). Unfortunately, on many computers this kernel runs slowly relative to other numerical codes, such as dense matrix computations. This paper propos...
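To make the memory-system bottleneck concrete, here is a minimal sketch (ours, not the paper's code) of the baseline kernel in compressed sparse row (CSR) format; the indirect access `x[col[j]]` is the cache-unfriendly load that reordering, blocking, and prefetching target:

```python
def spmv_csr(row_ptr, col, val, x):
    """y = A @ x for a matrix stored in CSR arrays (row_ptr, col, val)."""
    n = len(row_ptr) - 1
    y = [0.0] * n
    for i in range(n):
        acc = 0.0
        for j in range(row_ptr[i], row_ptr[i + 1]):
            # Indirect gather through col[j]: the memory-system bottleneck.
            acc += val[j] * x[col[j]]
        y[i] = acc
    return y

# 2x2 example, A = [[2, 0], [1, 3]]:
row_ptr = [0, 1, 3]
col = [0, 0, 1]
val = [2.0, 1.0, 3.0]
print(spmv_csr(row_ptr, col, val, [1.0, 1.0]))  # [2.0, 4.0]
```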
An Overview of the Trilinos Project
 ACM Transactions on Mathematical Software
Abstract
Cited by 72 (9 self)
The Trilinos Project is an effort to facilitate the design, development, integration, and ongoing support of mathematical software libraries within an object-oriented framework for the solution of large-scale, complex multiphysics engineering and scientific problems. Trilinos addresses two fundamental issues of developing software for these problems: (i) providing a streamlined process and set of tools for the development of new algorithmic implementations and (ii) promoting interoperability of independently developed software. Trilinos uses a two-level software structure designed around collections of packages. A Trilinos package is an integral unit usually developed by a small team of experts in a particular algorithms area, such as algebraic preconditioners, nonlinear solvers, etc. Packages exist underneath the Trilinos top level, which provides a common look-and-feel, including configuration, documentation, licensing, and bug tracking. Here we present the overall Trilinos design, describing our use of abstract interfaces and default concrete implementations. We discuss the services that Trilinos provides to a prospective package and how these services are used by various packages. We also illustrate how packages can be combined to rapidly develop new algorithms. Finally, we discuss how Trilinos facilitates high-quality software engineering practices that are increasingly required of simulation software. Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy under Contract DE-AC04-94AL85000.
Optimizing the Performance of Sparse Matrix-Vector Multiplication
, 2000
"... Copyright 2000 by Eun-Jin Im ..."
Optimizing Sparse Matrix Computations for Register Reuse in SPARSITY
 In Proceedings of the International Conference on Computational Science, volume 2073 of LNCS
Abstract
Cited by 40 (6 self)
Sparse matrix-vector multiplication is an important computational kernel that tends to perform poorly on modern processors, largely because of its high ratio of memory operations to arithmetic operations. Optimizing this algorithm is difficult, both because of the complexity of memory systems and because the performance is highly dependent on the nonzero structure of the matrix. The Sparsity system is designed to address these problems by allowing users to automatically build sparse matrix kernels that are tuned to their matrices and machines. The most difficult aspect of optimizing these algorithms is selecting among a large set of possible transformations and choosing parameters, such as block size. In this paper we discuss the optimization of two operations: a sparse matrix times a dense vector and a sparse matrix times a set of dense vectors. Our experience indicates that for matrices arising in scientific simulations, register-level optimizations are critical, and we focus here on the optimizations and parameter selection techniques used in Sparsity for register-level optimizations. We demonstrate speedups of up to 2x for the single-vector case and 5x for the multiple-vector case.
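A hypothetical sketch of the register-blocking idea behind these optimizations (our names and layout, not the Sparsity API): storing the matrix as dense 2x2 blocks (BCSR) needs one column index per four nonzeros and reuses two x entries from registers, cutting index loads per flop relative to scalar CSR:

```python
def spmv_bcsr_2x2(brow_ptr, bcol, bval, x):
    """y = A @ x where A is stored as 2x2 dense blocks.

    bval holds each block row-major: [b00, b01, b10, b11, ...].
    """
    nbrows = len(brow_ptr) - 1
    y = [0.0] * (2 * nbrows)
    for bi in range(nbrows):
        y0 = y1 = 0.0  # accumulators kept in "registers" for the block row
        for k in range(brow_ptr[bi], brow_ptr[bi + 1]):
            c = 2 * bcol[k]
            x0, x1 = x[c], x[c + 1]      # two x loads serve four multiplies
            b = bval[4 * k: 4 * k + 4]
            y0 += b[0] * x0 + b[1] * x1
            y1 += b[2] * x0 + b[3] * x1
        y[2 * bi], y[2 * bi + 1] = y0, y1
    return y

# One 2x2 block, A = [[1, 2], [3, 4]]:
print(spmv_bcsr_2x2([0, 1], [0], [1.0, 2.0, 3.0, 4.0], [1.0, 1.0]))  # [3.0, 7.0]
```

The trade-off the paper's parameter selection addresses: larger blocks improve register reuse but pad in explicit zeros when the nonzero pattern does not align with the block grid.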
Globalized Newton-Krylov-Schwarz Algorithms and Software for Parallel Implicit CFD
 Int. J. High Performance Computing Applications
, 1998
Abstract
Cited by 36 (14 self)
Key words. Newton-Krylov-Schwarz algorithms, parallel CFD, implicit methods. Abstract. Implicit solution methods are important in applications modeled by PDEs with disparate temporal and spatial scales. Because such applications require high resolution with reasonable turnaround, parallelization is essential. The pseudo-transient matrix-free Newton-Krylov-Schwarz (ΨNKS) algorithmic framework is presented as a widely applicable answer. This article shows that, for the classical problem of three-dimensional transonic Euler flow about an M6 wing, ΨNKS can simultaneously deliver: globalized, asymptotically rapid convergence through adaptive pseudo-transient continuation and Newton's method; reasonable parallelizability for an implicit method through deferred synchronization and favorable communication-to-computation scaling in the Krylov linear solver; and high per-processor performance through attention to distributed memory and cache locality, especially through the Schwarz preconditioner. Two discouraging features of ΨNKS methods are their sensitivity to the coding of the underlying PDE discretization and the large number of parameters that must be selected to govern convergence. We therefore distill several recommendations from our experience and from our reading of the literature on various algorithmic components of ΨNKS, and we describe a freely available, MPI-based portable parallel software implementation of the solver employed here.
PLAPACK: Parallel Linear Algebra Package
, 1997
Abstract
Cited by 34 (9 self)
The PLAPACK project represents an effort to provide an infrastructure for implementing application-friendly, high-performance linear algebra algorithms. The package uses a more application-centric data distribution, which we call Physically Based Matrix Distribution, as well as an object-based (MPI-like) style of programming. It is this style of programming that allows for highly compact codes, written in C but usable from FORTRAN, that more closely reflect the underlying blocked algorithms. We show that this can be attained without sacrificing high performance. 1 Introduction Parallel implementation of most dense linear algebra operations is a relatively well-understood process. Nonetheless, availability of general-purpose, high-performance parallel dense linear algebra libraries is severely hampered by the fact that translating the sequential algorithms, which typically can be described without filling up more than half a chalkboard, to a parallel code requires careful manipulation ...
Distributed Schur Complement Techniques for General Sparse Linear Systems
 SIAM J. SCI. COMPUT
, 1997
Abstract
Cited by 33 (13 self)
This paper presents a few preconditioning techniques for solving general sparse linear systems in distributed-memory environments. These techniques utilize the Schur complement system for deriving the preconditioning matrix in a number of ways. Two of these preconditioners consist of an approximate solution process for the global system that exploits approximate LU factorizations for diagonal blocks of the Schur complement. Another preconditioner uses a sparse approximate-inverse technique to obtain certain local approximations of the Schur complement. Comparisons are reported for systems of varying difficulty.
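As a minimal numerical sketch of the object being approximated (a 2x2 block partition we chose for illustration, not the paper's distributed construction): the Schur complement S = A22 - A21 inv(A11) A12, whose approximations drive the preconditioners described above:

```python
import numpy as np

# Hypothetical 2x2 block partition of a small system.
A11 = np.array([[4.0, 1.0],
                [1.0, 3.0]])
A12 = np.array([[1.0],
                [0.0]])
A21 = np.array([[0.0, 2.0]])
A22 = np.array([[5.0]])

# Exact Schur complement; the paper replaces the exact inv(A11) action
# with approximate (e.g. incomplete LU) counterparts.
S = A22 - A21 @ np.linalg.solve(A11, A12)
```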
Efficient Management of Parallelism in Object-Oriented Numerical Software Libraries
 Modern Software Tools in Scientific Computing
, 1997
Abstract
Cited by 33 (0 self)
Parallel numerical software based on the message-passing model is enormously complicated. This paper introduces a set of techniques to manage the complexity, while maintaining high efficiency and ease of use. The PETSc 2.0 package uses object-oriented programming to conceal the details of the message passing, without concealing the parallelism, in a high-quality set of numerical software libraries. In fact, the programming model used by PETSc is also the most appropriate for NUMA shared-memory machines, since they require the same careful attention to memory hierarchies as do distributed-memory machines. Thus, the concepts discussed are appropriate for all scalable computing systems. The PETSc libraries provide many of the data structures and numerical kernels required for the scalable solution of PDEs, offering performance portability. 1 Introduction Currently the only general-purpose, efficient, scalable approach to programming distributed-memory parallel systems is the message-pass...
PETSc users manual
 Tech. Rep. ANL-95/11, Revision 2.1.5, Argonne National Laboratory
, 2004
Abstract
Cited by 26 (8 self)
This work was supported by the Mathematical, Information, and Computational Sciences
Parallel Sparse Matrix-Vector Multiply Software for Matrices with Data Locality
, 1995
Abstract
Cited by 24 (3 self)
In this paper we describe general software utilities for performing unstructured sparse matrix-vector multiplications on distributed-memory message-passing computers. The matrix-vector multiply is an important kernel in the solution of large sparse linear systems by iterative methods. Our focus is to present the data structures and communication parameters necessary for these utilities for general sparse unstructured matrices with data locality. These types of matrices are commonly produced by finite difference and finite element approximations to systems of partial differential equations. In this discussion we also present representative examples and timings which demonstrate the utility and performance of the software.
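A conceptual sketch of the row-partitioned scheme such utilities implement (no real message passing here; the data layout and names are our assumptions): each process owns a contiguous block of rows, determines which off-processor x entries its columns reference (the communication parameters), and then runs a purely local multiply once those ghost values arrive:

```python
def ghost_columns(owned_rows, owned_range):
    """Column indices referenced locally but owned by other processes.

    For matrices with data locality this set is small, so little
    communication is needed per multiply.
    """
    lo, hi = owned_range
    needed = {c for cols, _vals in owned_rows for c in cols}
    return sorted(c for c in needed if not lo <= c < hi)

def local_spmv(owned_rows, x):
    """Local piece of y = A @ x; in a real code the ghost entries of x
    would have been received via message passing before this loop."""
    return [sum(v * x[c] for c, v in zip(cols, vals))
            for cols, vals in owned_rows]

# Process owning rows 0-1 of a 4-column matrix:
rows = [([0, 1], [1.0, 1.0]),   # row 0: nonzeros in columns 0, 1
        ([1, 2], [1.0, 1.0])]   # row 1: nonzeros in columns 1, 2
print(ghost_columns(rows, (0, 2)))  # [2]: only column 2 must be fetched
```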