Results 1–10 of 106
An overview of the Trilinos project
 ACM Transactions on Mathematical Software
Abstract

Cited by 143 (17 self)
The Trilinos Project is an effort to facilitate the design, development, integration, and ongoing support of mathematical software libraries within an object-oriented framework for the solution of large-scale, complex multiphysics engineering and scientific problems. Trilinos addresses two fundamental issues of developing software for these problems: (i) providing a streamlined process and set of tools for development of new algorithmic implementations and (ii) promoting interoperability of independently developed software. Trilinos uses a two-level software structure designed around collections of packages. A Trilinos package is an integral unit usually developed by a small team of experts in a particular algorithms area such as algebraic preconditioners, nonlinear solvers, etc. Packages exist underneath the Trilinos top level, which provides a common look-and-feel, including configuration, documentation, licensing, and bug tracking. Here we present the overall Trilinos design, describing our use of abstract interfaces and default concrete implementations. We discuss the services that Trilinos provides to a prospective package and how these services are used by various packages. We also illustrate how packages can be combined to rapidly develop new algorithms. Finally, we discuss how Trilinos facilitates high-quality software engineering practices that are increasingly required from simulation software. Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy under Contract DE-AC04-94AL85000.
Improving Memory-System Performance of Sparse Matrix-Vector Multiplication
 IBM Journal of Research and Development
, 1997
Abstract

Cited by 93 (0 self)
Sparse matrix-vector multiplication is an important kernel that often runs inefficiently on superscalar RISC processors. This paper describes techniques that increase instruction-level parallelism and improve performance. The techniques include reordering to reduce cache misses, originally due to Das et al., blocking to reduce load instructions, and prefetching to prevent multiple load-store units from stalling simultaneously. The techniques improve performance from about 40 Mflops (on a well-ordered matrix) to over 100 Mflops on a 266-Mflops machine. The techniques are applicable to other superscalar RISC processors as well and have improved performance on a Sun UltraSparc I workstation, for example.
1 Introduction
Sparse matrix-vector multiplication is an important computational kernel in many iterative linear solvers (see [5], for example). Unfortunately, on many computers this kernel runs slowly relative to other numerical codes, such as dense matrix computations. This paper propos...
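The kernel this paper optimizes can be illustrated with a minimal compressed sparse row (CSR) multiply. This is an illustrative sketch only (the paper's tuned variants add reordering, blocking, and prefetching on top of it), and the function name is ours, not the paper's:

```python
import numpy as np

def csr_spmv(values, col_idx, row_ptr, x):
    """y = A @ x for a matrix A stored in Compressed Sparse Row form.

    Each nonzero costs one indexed load of x (x[col_idx[k]]) per two
    flops, which is why the kernel is memory-bound on superscalar
    processors and why blocking to reduce load instructions pays off.
    """
    n_rows = len(row_ptr) - 1
    y = np.zeros(n_rows)
    for i in range(n_rows):
        acc = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[k] * x[col_idx[k]]  # indirect load of x
        y[i] = acc
    return y
```

The irregular access pattern of `x[col_idx[k]]` is exactly what the reordering technique attacks: permuting rows and columns so consecutive nonzeros hit nearby cache lines.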
Optimizing the performance of sparse matrix-vector multiplication
, 2000
"... Copyright 2000 by Eun-Jin Im ..."
Optimizing Sparse Matrix Computations for Register Reuse in SPARSITY
 In Proceedings of the International Conference on Computational Science, volume 2073 of LNCS
Abstract

Cited by 55 (6 self)
Sparse matrix-vector multiplication is an important computational kernel that tends to perform poorly on modern processors, largely because of its high ratio of memory operations to arithmetic operations. Optimizing this algorithm is difficult, both because of the complexity of memory systems and because the performance is highly dependent on the nonzero structure of the matrix. The Sparsity system is designed to address these problems by allowing users to automatically build sparse matrix kernels that are tuned to their matrices and machines. The most difficult aspect of optimizing these algorithms is selecting among a large set of possible transformations and choosing parameters, such as block size. In this paper we discuss the optimization of two operations: a sparse matrix times a dense vector and a sparse matrix times a set of dense vectors. Our experience indicates that for matrices arising in scientific simulations, register-level optimizations are critical, and we focus here on the optimizations and parameter selection techniques used in Sparsity for register-level optimizations. We demonstrate speedups of up to 2x for the single-vector case and 5x for the multiple-vector case.
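Register blocking, the register-level optimization this abstract centers on, can be sketched with a 2x2 block-CSR (BCSR) multiply. The layout below is one common convention and the code is our illustration, not Sparsity's generated kernels; Sparsity's contribution is choosing the block size automatically per matrix and machine:

```python
import numpy as np

def bcsr_spmv_2x2(blocks, bcol_idx, brow_ptr, x):
    """y = A @ x with A stored as dense 2x2 blocks (BCSR).

    Inside each block, the two entries of x and the two running sums
    for y stay in registers, so one column index amortizes over four
    multiply-adds -- the register reuse Sparsity tunes for.
    """
    n_brows = len(brow_ptr) - 1
    y = np.zeros(2 * n_brows)
    for bi in range(n_brows):
        y0 = 0.0
        y1 = 0.0
        for k in range(brow_ptr[bi], brow_ptr[bi + 1]):
            j = 2 * bcol_idx[k]
            b = blocks[k]                 # one dense 2x2 block
            x0, x1 = x[j], x[j + 1]
            y0 += b[0, 0] * x0 + b[0, 1] * x1
            y1 += b[1, 0] * x0 + b[1, 1] * x1
        y[2 * bi] = y0
        y[2 * bi + 1] = y1
    return y
```

The trade-off the paper's parameter selection addresses: larger blocks mean more register reuse but also more explicitly stored zeros when the nonzero pattern does not fill the blocks.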
Efficient Management of Parallelism in Object-Oriented Numerical Software Libraries
 Modern Software Tools in Scientific Computing
, 1997
Abstract

Cited by 49 (0 self)
Parallel numerical software based on the message-passing model is enormously complicated. This paper introduces a set of techniques to manage the complexity, while maintaining high efficiency and ease of use. The PETSc 2.0 package uses object-oriented programming to conceal the details of the message passing, without concealing the parallelism, in a high-quality set of numerical software libraries. In fact, the programming model used by PETSc is also the most appropriate for NUMA shared-memory machines, since they require the same careful attention to memory hierarchies as do distributed-memory machines. Thus, the concepts discussed are appropriate for all scalable computing systems. The PETSc libraries provide many of the data structures and numerical kernels required for the scalable solution of PDEs, offering performance portability.
1 Introduction
Currently the only general-purpose, efficient, scalable approach to programming distributed-memory parallel systems is the message-pass...
A high-order 3D boundary integral equation solver for elliptic PDEs in smooth domains
 Journal of Computational Physics
, 2005
Abstract

Cited by 45 (7 self)
We present a high-order boundary integral equation solver for 3D elliptic boundary value problems on domains with smooth boundaries. We use Nyström’s method for discretization and we combine it with special quadrature rules for the singular kernels that appear in the boundary integrals. The overall asymptotic complexity of our method is O(N^{3/2}), where N is the number of discretization points on the boundary of the domain, and corresponds to linear complexity in the number of uniformly sampled evaluation points. A kernel-independent fast summation algorithm is used to accelerate the evaluation of the discretized integral operators. We describe a high-order accurate method for evaluating the solution at arbitrary points inside the domain, including points close to the domain boundary. We demonstrate how our solver, combined with a regular-grid spectral solver, can be applied to problems with distributed sources. We present numerical results for the Stokes, Navier, and Poisson problems.
Globalized Newton–Krylov–Schwarz algorithms and software for parallel implicit CFD
 Int. J. High Perform. Comput. Appl
Abstract

Cited by 44 (17 self)
Implicit solution methods are important in applications modeled by PDEs with disparate temporal and spatial scales. Because such applications require high resolution with reasonable turnaround, parallelization is essential. The pseudo-transient matrix-free Newton-Krylov-Schwarz (ΨNKS) algorithmic framework is presented as a widely applicable answer. This article shows that, for the classical problem of three-dimensional transonic Euler flow about an M6 wing, ΨNKS can simultaneously deliver globalized, asymptotically rapid convergence through adaptive pseudo-transient continuation and Newton’s method; reasonable parallelizability for an implicit method through deferred synchronization and favorable communication-to-computation scaling in the Krylov linear solver; and high per-processor performance through attention to distributed memory and cache locality, especially through the Schwarz preconditioner. Two discouraging features of ΨNKS methods are their sensitivity to the coding of the underlying PDE discretization and the large number of parameters that must be selected to govern convergence. The authors therefore distill several recommendations from their experience and reading of the literature on various algorithmic components of ΨNKS, and they describe a freely available MPI-based portable parallel software implementation of the solver employed here.
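Pseudo-transient continuation, the globalization ingredient of ΨNKS, can be sketched on a small dense problem. Everything below is an illustrative assumption on our part: the article's solver is matrix-free with a preconditioned Krylov linear solve, not the direct solve used here, and the switched evolution relaxation (SER) step rule is one common choice for the adaptive time step:

```python
import numpy as np

def psitc_newton(F, J, u0, dt0=1e-2, tol=1e-10, max_it=100):
    """Solve F(u) = 0 by pseudo-transient continuation.

    Each step solves (I/dt + J(u)) du = -F(u). The SER rule grows dt
    in inverse proportion to the residual norm, so early iterations
    behave like implicit time stepping (globalization) and late
    iterations approach plain Newton (rapid convergence).
    """
    u = np.asarray(u0, dtype=float)
    r0 = np.linalg.norm(F(u))
    for _ in range(max_it):
        r = np.linalg.norm(F(u))
        if r < tol:
            break
        dt = dt0 * r0 / r                    # SER time-step update
        A = np.eye(u.size) / dt + J(u)       # shifted Jacobian
        u = u + np.linalg.solve(A, -F(u))
    return u
```

The parameters `dt0`, `tol`, and the SER growth law itself are instances of the "large number of parameters that must be selected to govern convergence" the abstract warns about.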
PLAPACK: Parallel Linear Algebra Package
, 1997
Abstract

Cited by 43 (10 self)
The PLAPACK project represents an effort to provide an infrastructure for implementing application-friendly high-performance linear algebra algorithms. The package uses a more application-centric data distribution, which we call Physically Based Matrix Distribution, as well as an object-based (MPI-like) style of programming. It is this style of programming that allows for highly compact codes, written in C but usable from FORTRAN, that more closely reflect the underlying blocked algorithms. We show that this can be attained without sacrificing high performance.
1 Introduction
Parallel implementation of most dense linear algebra operations is a relatively well understood process. Nonetheless, availability of general-purpose, high-performance parallel dense linear algebra libraries is severely hampered by the fact that translating the sequential algorithms, which typically can be described without filling up more than half a chalkboard, to a parallel code requires careful manipulation ...
Distributed Schur Complement Techniques for General Sparse Linear Systems
 SIAM J. SCI. COMPUT
, 1997
Abstract

Cited by 40 (14 self)
This paper presents a few preconditioning techniques for solving general sparse linear systems in distributed memory environments. These techniques utilize the Schur complement system for deriving the preconditioning matrix in a number of ways. Two of these preconditioners consist of an approximate solution process for the global system, which exploits approximate LU factorizations for diagonal blocks of the Schur complement. Another preconditioner uses a sparse approximate-inverse technique to obtain certain local approximations of the Schur complement. Comparisons are reported for systems of varying difficulty.
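The Schur complement reduction these preconditioners build on can be sketched for a two-by-two block system. This sketch uses exact solves and dense NumPy arrays where the paper substitutes approximate LU factorizations and approximate inverses on distributed sparse blocks; the function name is ours:

```python
import numpy as np

def schur_solve(B, E, F, C, f, g):
    """Solve [[B, E], [F, C]] [u; y] = [f; g] by eliminating u.

    In the distributed setting, B holds the interior (per-processor)
    unknowns and S = C - F B^{-1} E is the Schur complement coupling
    the interface unknowns y. The paper's preconditioners approximate
    the solves with B and S rather than performing them exactly.
    """
    Binv_E = np.linalg.solve(B, E)
    Binv_f = np.linalg.solve(B, f)
    S = C - F @ Binv_E                      # Schur complement (interface system)
    y = np.linalg.solve(S, g - F @ Binv_f)  # solve the reduced system
    u = Binv_f - Binv_E @ y                 # back-substitute for interior unknowns
    return u, y
```

The point of the reduction is that only the (much smaller) interface system S couples processors, so a good approximation to S yields an effective global preconditioner from mostly local work.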
PETSc users manual
 Tech. Rep. ANL-95/11 Revision 2.1.5, Argonne National Laboratory
, 2004
Abstract

Cited by 29 (8 self)
This work was supported by the Mathematical, Information, and Computational Sciences