Improving Memory-System Performance of Sparse Matrix-Vector Multiplication
IBM Journal of Research and Development, 1997
Cited by 61 (0 self)
Sparse Matrix-Vector Multiplication is an important kernel that often runs inefficiently on superscalar RISC processors. This paper describes techniques that increase instruction-level parallelism and improve performance. The techniques include reordering to reduce cache misses (originally due to Das et al.), blocking to reduce load instructions, and prefetching to prevent multiple load-store units from stalling simultaneously. The techniques improve performance from about 40 Mflops (on a well-ordered matrix) to over 100 Mflops on a 266 Mflops machine. The techniques are applicable to other superscalar RISC processors as well and have improved performance on a Sun UltraSparc I workstation, for example.
1 Introduction
Sparse matrix-vector multiplication is an important computational kernel in many iterative linear solvers (see [5], for example). Unfortunately, on many computers this kernel runs slowly relative to other numerical codes, such as dense matrix computations. This paper propos...
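For orientation, the kernel this abstract refers to can be sketched as follows. This is a minimal illustration, not the paper's code: it assumes the common compressed sparse row (CSR) layout, and the names `csr_spmv`, `row_ptr`, `col_idx`, and `values` are chosen here for the example. The indirect read of `x` through `col_idx` is the access whose cache behavior depends on the matrix ordering.

```python
def csr_spmv(row_ptr, col_idx, values, x):
    """Compute y = A @ x for a matrix A stored in compressed sparse row form."""
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        acc = 0.0
        # Nonzeros of row i occupy positions row_ptr[i] .. row_ptr[i + 1] - 1.
        for k in range(row_ptr[i], row_ptr[i + 1]):
            # x is read indirectly through col_idx; the locality of this
            # load depends on how the rows and columns are ordered.
            acc += values[k] * x[col_idx[k]]
        y[i] = acc
    return y


# 3x3 example matrix: [[4, 0, 1], [0, 3, 0], [2, 0, 5]]
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
values = [4.0, 1.0, 3.0, 2.0, 5.0]
y = csr_spmv(row_ptr, col_idx, values, [1.0, 2.0, 3.0])
```

Each nonzero costs one load of a value, one load of a column index, and one indirect load from `x` for a single multiply-add, which is one plausible reading of why the kernel is load-bound.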
An overview of the Trilinos project
ACM Transactions on Mathematical Software
Cited by 48 (7 self)
The Trilinos Project is an effort to facilitate the design, development, integration and ongoing support of mathematical software libraries within an object-oriented framework for the solution of large-scale, complex multi-physics engineering and scientific problems. Trilinos addresses two fundamental issues of developing software for these problems: (i) providing a streamlined process and set of tools for development of new algorithmic implementations and (ii) promoting interoperability of independently developed software. Trilinos uses a two-level software structure designed around collections of packages. A Trilinos package is an integral unit usually developed by a small team of experts in a particular algorithms area such as algebraic preconditioners, nonlinear solvers, etc. Packages exist underneath the Trilinos top level, which provides a common look-and-feel, including configuration, documentation, licensing, and bug-tracking. Here we present the overall Trilinos design, describing our use of abstract interfaces and default concrete implementations. We discuss the services that Trilinos provides to a prospective package and how these services are used by various packages. We also illustrate how packages can be combined to rapidly develop new algorithms. Finally, we discuss how Trilinos facilitates high-quality software engineering practices that are increasingly required from simulation software. Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed-Martin Company, for the United States Department of Energy under Contract DE-AC04-94AL85000.
Optimizing the performance of sparse matrix-vector multiplication
2000
Copyright 2000 by Eun-Jin Im
Distributed Schur Complement Techniques for General Sparse Linear Systems
SIAM J. Sci. Comput., 1997
Cited by 32 (12 self)
This paper presents a few preconditioning techniques for solving general sparse linear systems on distributed memory environments. These techniques utilize the Schur complement system for deriving the preconditioning matrix in a number of ways. Two of these preconditioners consist of an approximate solution process for the global system, which exploit approximate LU factorizations for diagonal blocks of the Schur complement. Another preconditioner uses a sparse approximate-inverse technique to obtain certain local approximations of the Schur complement. Comparisons are reported for systems of varying difficulty.
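As a reminder of the object these preconditioners approximate, the Schur complement system can be written out for a generic 2x2 block partitioning. The block names $B$, $F$, $E$, $C$ and the vectors $u$, $y$, $f$, $g$ are notation assumed here for illustration (with $B$ taken to be nonsingular) and are not taken from the abstract.

```latex
\begin{pmatrix} B & F \\ E & C \end{pmatrix}
\begin{pmatrix} u \\ y \end{pmatrix}
=
\begin{pmatrix} f \\ g \end{pmatrix}
\quad\Longrightarrow\quad
S\,y = g - E B^{-1} f,
\qquad S = C - E B^{-1} F,
\qquad u = B^{-1}\,(f - F y).
```

Eliminating $u$ from the first block row and substituting into the second gives the reduced system in $y$; once $y$ is known, $u$ follows from a solve with $B$.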
PLAPACK: Parallel Linear Algebra Package
1997
Cited by 32 (9 self)
The PLAPACK project represents an effort to provide an infrastructure for implementing application-friendly, high-performance linear algebra algorithms. The package uses a more application-centric data distribution, which we call Physically Based Matrix Distribution, as well as an object-based (MPI-like) style of programming. It is this style of programming that allows for highly compact codes, written in C but useable from FORTRAN, that more closely reflect the underlying blocked algorithms. We show that this can be attained without sacrificing high performance.
1 Introduction
Parallel implementation of most dense linear algebra operations is a relatively well understood process. Nonetheless, availability of general purpose, high performance parallel dense linear algebra libraries is severely hampered by the fact that translating the sequential algorithms, which typically can be described without filling up more than half a chalkboard, to a parallel code requires careful manipulation ...
Optimizing Sparse Matrix Computations for Register Reuse in SPARSITY
In Proceedings of the International Conference on Computational Science, volume 2073 of LNCS
Cited by 32 (5 self)
Sparse matrix-vector multiplication is an important computational kernel that tends to perform poorly on modern processors, largely because of its high ratio of memory operations to arithmetic operations. Optimizing this algorithm is difficult, both because of the complexity of memory systems and because the performance is highly dependent on the nonzero structure of the matrix. The Sparsity system is designed to address these problems by allowing users to automatically build sparse matrix kernels that are tuned to their matrices and machines. The most difficult aspect of optimizing these algorithms is selecting among a large set of possible transformations and choosing parameters, such as block size. In this paper we discuss the optimization of two operations: a sparse matrix times a dense vector and a sparse matrix times a set of dense vectors. Our experience indicates that for matrices arising in scientific simulations, register level optimizations are critical, and we focus here on the optimizations and parameter selection techniques used in Sparsity for register-level optimizations. We demonstrate speedups of up to 2 for the single vector case and 5 for the multiple vector case.
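A rough sketch of what a register-level blocked kernel can look like is given below. It is an illustration under stated assumptions, not the Sparsity implementation: the block size is fixed at 2x2 (the abstract says block size is a tuned parameter), the storage layout and the names `bcsr_2x2_spmv`, `block_row_ptr`, `block_col_idx`, and `blocks` are hypothetical, and partially filled blocks are padded with explicit zeros.

```python
def bcsr_2x2_spmv(block_row_ptr, block_col_idx, blocks, x):
    """Compute y = A @ x with A stored as dense 2x2 blocks (row-major per block)."""
    n_block_rows = len(block_row_ptr) - 1
    y = [0.0] * (2 * n_block_rows)
    for bi in range(n_block_rows):
        # Two accumulators stay live across the whole block row,
        # mimicking values held in registers.
        y0 = 0.0
        y1 = 0.0
        for k in range(block_row_ptr[bi], block_row_ptr[bi + 1]):
            j = 2 * block_col_idx[k]
            # One column index per block: x[j] and x[j + 1] are each
            # loaded once and reused for both rows of the block.
            x0, x1 = x[j], x[j + 1]
            a00, a01, a10, a11 = blocks[k]
            y0 += a00 * x0 + a01 * x1
            y1 += a10 * x0 + a11 * x1
        y[2 * bi] = y0
        y[2 * bi + 1] = y1
    return y


# 4x4 example, stored as three 2x2 blocks with explicit zero fill:
# A = [[1, 2, 0, 0],
#      [0, 3, 0, 0],
#      [0, 0, 4, 0],
#      [5, 0, 0, 6]]
block_row_ptr = [0, 1, 3]
block_col_idx = [0, 0, 1]
blocks = [
    (1.0, 2.0, 0.0, 3.0),
    (0.0, 0.0, 5.0, 0.0),
    (4.0, 0.0, 0.0, 6.0),
]
y = bcsr_2x2_spmv(block_row_ptr, block_col_idx, blocks, [1.0, 2.0, 3.0, 4.0])
```

The trade-off visible even in this toy case is that blocking stores and multiplies the padded zeros, so whether a given block size pays off depends on the nonzero structure of the matrix.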
Globalized Newton-Krylov-Schwarz algorithms and software for parallel implicit CFD
Int. J. High Performance Computing Applications, 1998
Cited by 29 (12 self)
Key words. Newton-Krylov-Schwarz algorithms, parallel CFD, implicit methods
Abstract. Implicit solution methods are important in applications modeled by PDEs with disparate temporal and spatial scales. Because such applications require high resolution with reasonable turnaround, parallelization is essential. The pseudo-transient matrix-free Newton-Krylov-Schwarz (ΨNKS) algorithmic framework is presented as a widely applicable answer. This article shows that, for the classical problem of three-dimensional transonic Euler flow about an M6 wing, ΨNKS can simultaneously deliver
• globalized, asymptotically rapid convergence through adaptive pseudo-transient continuation and Newton’s method;
• reasonable parallelizability for an implicit method through deferred synchronization and favorable communication-to-computation scaling in the Krylov linear solver; and
• high per-processor performance through attention to distributed memory and cache locality, especially through the Schwarz preconditioner.
Two discouraging features of ΨNKS methods are their sensitivity to the coding of the underlying PDE discretization and the large number of parameters that must be selected to govern convergence. We therefore distill several recommendations from our experience and from our reading of the literature on various algorithmic components of ΨNKS, and we describe a freely available, MPI-based portable parallel software implementation of the solver employed here.
1. Introduction. Disparate
Efficient Management of Parallelism in Object-Oriented Numerical Software Libraries
Modern Software Tools in Scientific Computing, 1997
Cited by 29 (0 self)
Parallel numerical software based on the message-passing model is enormously complicated. This paper introduces a set of techniques to manage the complexity, while maintaining high efficiency and ease of use. The PETSc 2.0 package uses object-oriented programming to conceal the details of the message passing, without concealing the parallelism, in a high-quality set of numerical software libraries. In fact, the programming model used by PETSc is also the most appropriate for NUMA shared-memory machines, since they require the same careful attention to memory hierarchies as do distributed-memory machines. Thus, the concepts discussed are appropriate for all scalable computing systems. The PETSc libraries provide many of the data structures and numerical kernels required for the scalable solution of PDEs, offering performance portability.
1 Introduction
Currently the only general-purpose, efficient, scalable approach to programming distributed-memory parallel systems is the message-pass...
Parallel Sparse Matrix-Vector Multiply Software for Matrices with Data Locality
1995
Cited by 23 (3 self)
In this paper we describe general software utilities for performing unstructured sparse matrix-vector multiplications on distributed-memory message-passing computers. The matrix-vector multiply comprises an important kernel in the solution of large sparse linear systems by iterative methods. Our focus is to present the data structures and communication parameters necessary for these utilities for general sparse unstructured matrices with data locality. These types of matrices are commonly produced by finite difference and finite element approximations to systems of partial differential equations. In this discussion we also present representative examples and timings which demonstrate the utility and performance of the software.
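One communication parameter a distributed multiply plausibly needs is, for each process, the list of vector entries it must receive from other processes before it can multiply its own rows. The sketch below computes that list under assumptions made here for illustration: a one-dimensional row partition in which the owner of row i also owns x[i], CSR storage, and the hypothetical names `plan_receives` and `owner`. It does not reproduce the data structures of the paper.

```python
def plan_receives(row_ptr, col_idx, owner, rank):
    """Map each neighbor rank to the sorted x indices that `rank` must receive.

    owner[i] is the rank owning row i and vector entry x[i]
    (an assumed one-dimensional row partition).
    """
    needed = {}
    for i in range(len(row_ptr) - 1):
        if owner[i] != rank:
            continue  # only rows stored on this rank generate requests
        for k in range(row_ptr[i], row_ptr[i + 1]):
            j = col_idx[k]
            if owner[j] != rank:
                # Column j refers to an x entry held by another rank.
                needed.setdefault(owner[j], set()).add(j)
    return {src: sorted(cols) for src, cols in needed.items()}


# 4x4 tridiagonal matrix: rows 0-1 on rank 0, rows 2-3 on rank 1.
row_ptr = [0, 2, 5, 8, 10]
col_idx = [0, 1, 0, 1, 2, 1, 2, 3, 2, 3]
owner = [0, 0, 1, 1]
recv0 = plan_receives(row_ptr, col_idx, owner, 0)
recv1 = plan_receives(row_ptr, col_idx, owner, 1)
```

In this example each rank needs a single entry from its neighbor. That is the sense in which data locality helps: most column indices refer to locally owned entries, so the receive lists stay short relative to the local work.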
Domain Decomposition and Multi-Level Type Techniques for General Sparse Linear Systems
1998
Cited by 17 (16 self)
Domain-decomposition and multi-level techniques are often formulated for linear systems that arise from the solution of elliptic-type Partial Differential Equations. In this paper, generalizations of these techniques for irregularly structured sparse linear systems are considered. An interesting common approach used to derive successful preconditioners is to resort to Schur complements. In particular, we discuss a multi-level domain decomposition-type algorithm for iterative solution of large sparse linear systems based on independent subsets of nodes. We also discuss a Schur complement technique that utilizes incomplete LU factorizations of local matrices.
Key words: Schur complement techniques; Incomplete LU factorization; Schwarz iterations; Multielimination; Multi-level ILU preconditioners; Krylov subspace methods.
1 Introduction
A recent trend in parallel preconditioning techniques for general sparse linear systems is to exploit ideas from domain decomposition concepts an...

