Self-adapting linear algebra algorithms and software
, 2004
One of the main obstacles to the efficient solution of scientific problems is the problem of tuning software, both to the available architecture and to the user problem at hand. We describe approaches for obtaining tuned high-performance kernels, and for automatically choosing suitable algorithms. Specifically, we describe the generation of dense and sparse BLAS kernels, and the selection of linear solver algorithms. However, the ideas presented here extend beyond these areas, which can be considered proof of concept.
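The abstract above does not say how the tuned kernels are obtained. One common form of tuning to the available architecture is empirical search: time several variants of a kernel on the target machine and keep the fastest. The sketch below assumes that approach for a blocked matrix multiply; `blocked_matmul`, `pick_block_size`, and the candidate block sizes are hypothetical names and values chosen for illustration, not the paper's method.

```python
import time


def blocked_matmul(a, b, n, bs):
    """Multiply two n x n matrices (lists of lists) using bs x bs blocks."""
    c = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, bs):
        for kk in range(0, n, bs):
            for jj in range(0, n, bs):
                # Multiply one block of a by one block of b.
                for i in range(ii, min(ii + bs, n)):
                    row_c, row_a = c[i], a[i]
                    for k in range(kk, min(kk + bs, n)):
                        a_ik, row_b = row_a[k], b[k]
                        for j in range(jj, min(jj + bs, n)):
                            row_c[j] += a_ik * row_b[j]
    return c


def pick_block_size(n, candidates, repeats=3):
    """Time each candidate block size and return (fastest, all timings)."""
    a = [[float(i + j) for j in range(n)] for i in range(n)]
    b = [[float(i - j) for j in range(n)] for i in range(n)]
    timings = {}
    for bs in candidates:
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            blocked_matmul(a, b, n, bs)
            best = min(best, time.perf_counter() - start)
        timings[bs] = best  # keep the best of several runs to reduce noise
    return min(timings, key=timings.get), timings
```

Calling `pick_block_size(64, [4, 8, 16, 32])` returns the block size that ran fastest on the current machine together with the measured times; the winner can differ between machines and between runs, which is the reason for measuring rather than fixing a value in advance.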
Dome: Parallel programming in a heterogeneous multiuser environment
, 1995
Writing parallel programs for distributed multiuser computing environments is a difficult task. The Distributed Object Migration Environment (Dome) addresses three major issues of parallel computing in an architecture-independent manner: ease of programming, dynamic load balancing, and fault tolerance. Dome programmers, with modest effort, can write parallel programs that are automatically distributed over a heterogeneous network, dynamically load balanced as the program runs, and able to survive compute node and network failures. This paper provides the motivation for and an overview of Dome, including a preliminary performance evaluation of dynamic load balancing for distributed vectors. Dome programs are shorter and easier to write than the equivalent programs written with message passing primitives. The performance overhead of Dome is characterized, and it is shown that this overhead can be recouped by dynamic load balancing in imbalanced systems. Finally, we show that a parallel ...
Software libraries for linear algebra computations on high performance computers
 SIAM REVIEW
, 1995
This paper discusses the design of linear algebra libraries for high performance computers. Particular emphasis is placed on the development of scalable algorithms for MIMD distributed memory concurrent computers. A brief description of the EISPACK, LINPACK, and LAPACK libraries is given, followed by an outline of ScaLAPACK, which is a distributed memory version of LAPACK currently under development. The importance of block-partitioned algorithms in reducing the frequency of data movement between different levels of hierarchical memory is stressed. The use of such algorithms helps reduce the message startup costs on distributed memory concurrent computers. Other key ideas in our approach are the use of distributed versions of the Level 3 Basic Linear Algebra Subprograms (BLAS) as computational building blocks, and the use of Basic Linear Algebra Communication Subprograms (BLACS) as communication building blocks. Together the distributed BLAS and the BLACS can be used to construct highe...
Implementing a Parallel C++ Runtime System for Scalable Parallel Systems
 In Proceedings of Supercomputing '93
, 1993
pC++ is a language extension to C++ designed to allow programmers to compose "concurrent aggregate" collection classes which can be aligned and distributed over the memory hierarchy of a parallel machine in a manner modeled on the High Performance Fortran Forum (HPFF) directives for Fortran 90. pC++ allows the user to write portable and efficient code which will run on a wide range of scalable parallel computer systems. The first version of the compiler is a preprocessor which generates Single Program Multiple Data (SPMD) C++ code. Currently, it runs on the Thinking Machines CM-5, the Intel Paragon, the BBN TC2000, the Kendall Square Research KSR1, and the Sequent Symmetry. In this paper we describe the implementation of the runtime system, which provides the concurrency and communication primitives between objects in a distributed collection. To illustrate the behavior of the runtime system we include a description and performance results on four benchmark programs. 1 Introduction ...
LAPACK++: A Design Overview of Object-Oriented Extensions for High Performance Linear Algebra
, 1993
LAPACK++ is an object-oriented C++ extension of the LAPACK (Linear Algebra PACKage) library for solving the common problems of numerical linear algebra: linear systems, linear least squares, and eigenvalue problems on high-performance computer architectures. The advantages of an object-oriented approach include the ability to encapsulate various matrix representations, hide their implementation details, reduce the number of subroutines, simplify their calling sequences, and provide an extendible software framework that can incorporate future extensions of LAPACK, such as ScaLAPACK++ for distributed memory architectures. We present an overview of the object-oriented design of the matrix and decomposition classes in C++ and discuss its impact on elegance, generality, and performance. 1 Introduction LAPACK++ is an object-oriented C++ extension to the Fortran LAPACK [1] library for numerical linear algebra. This package includes state-of-the-art numerical algorithms for the more common l...
Sage++: An Object-Oriented Toolkit and Class Library for Building Fortran and C++ Restructuring Tools
 In The second annual object-oriented numerics conference (OONSKI
, 1994
Sage++ is an object-oriented toolkit for building program transformation and preprocessing tools. It contains parsers for Fortran 77 with many Fortran 90 extensions, C, and C++, integrated with a C++ class library. The library provides a means to access and restructure the program tree, symbol and type tables, and source-level programmer annotations. Sage++ provides an underlying infrastructure on which all types of program preprocessors can be built, including parallelizing compilers, performance analysis tools, and source code optimizers. 1 Introduction Designing and building a source-to-source translation system is a very time-consuming task. However, such systems are often a prerequisite for many compiler and language extension research projects. Sage++ was designed to be a toolkit for such projects. It provides an integrated set of parsers that transform the source program into an internal representation, which we call a program tree; a library of object-oriented methods that man...
Scalability Issues Affecting the Design of a Dense Linear Algebra Library
 JOURNAL OF PARALLEL AND DISTRIBUTED COMPUTING
, 1994
This paper discusses the scalability of Cholesky, LU, and QR factorization routines on MIMD distributed memory concurrent computers. These routines form part of the ScaLAPACK mathematical software library that extends the widely used LAPACK library to run efficiently on scalable concurrent computers. To ensure good scalability and performance, the ScaLAPACK routines are based on block-partitioned algorithms that reduce the frequency of data movement between different levels of the memory hierarchy, and particularly between processors. The block cyclic data distribution, which is used in all three factorization algorithms, is described. An outline of the sequential and parallel block-partitioned algorithms is given. Approximate models of algorithms' performance are presented to indicate which factors in the design of the algorithm have an impact upon scalability. These models are compared with timing results on a 128-node Intel iPSC/860 hypercube. It is shown that the routines are highl...
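The block cyclic data distribution mentioned in this abstract can be illustrated with a short index-mapping sketch. It assumes the usual textbook form of the distribution, with blocks dealt out round-robin starting at process 0; the function names and parameters are hypothetical and are not taken from the ScaLAPACK interface.

```python
def owner_and_local(g, nb, nprocs):
    """Map global index g to (owning process, local index).

    Indices are cut into blocks of nb consecutive entries, and the blocks
    are dealt to processes 0, 1, ..., nprocs - 1 in round-robin order.
    """
    block = g // nb                    # which global block holds g
    owner = block % nprocs             # blocks go round-robin over processes
    local_block = block // nprocs      # position of that block on its owner
    return owner, local_block * nb + g % nb


def owner_2d(i, j, mb, nb, prow, pcol):
    """2-D block cyclic mapping: rows and columns are distributed independently."""
    pr, li = owner_and_local(i, mb, prow)
    pc, lj = owner_and_local(j, nb, pcol)
    return (pr, pc), (li, lj)


# Eight indices, block size 2, two processes:
print([owner_and_local(g, 2, 2)[0] for g in range(8)])
# → [0, 0, 1, 1, 0, 0, 1, 1]
```

Because the blocks cycle over the processes, every process keeps a share of the trailing submatrix as a factorization proceeds, while the block size controls how much contiguous data each process works on at a time.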
An Object-Oriented Framework for Block Preconditioning
, 1998
General software for preconditioning the iterative solution of linear systems is greatly lagging behind the literature. This is partly because specific problems need specific matrix and preconditioner data structures in order to be solved efficiently; i.e., multiple implementations of a preconditioner with specialized data structures are required. This article presents a framework to support preconditioning with various, possibly user-defined, data structures for matrices that are partitioned into blocks. The main idea is to define data structures for the blocks, and an upper layer of software which uses these blocks without depending on their data structure. This transparency can be accomplished by using an object-oriented language. Thus various preconditioners, such as block relaxations and block incomplete factorizations, only need to be defined once, and will work with any block type. In addition, it is possible to transparently interchange various approximate or exact techniques for inverting pivot blocks, or solving systems whose coefficient matrices are diagonal blocks. This leads to a rich variety of preconditioners that can be selected. Operations with the blocks are performed with optimized libraries or fundamental data types. Comparisons with an optimized Fortran 77 code on both workstations and Cray supercomputers show that this framework can approach the efficiency of Fortran 77, as long as suitable block sizes and block types are chosen.
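The idea of defining a preconditioner once and letting it work with any block type can be sketched as follows. This is a minimal illustration of a block Jacobi (block relaxation) step, assuming only that each block type offers a `solve` method; the class names and that interface are hypothetical and do not reproduce the framework's actual API.

```python
class DiagonalBlock:
    """Block stored as its diagonal entries only."""

    def __init__(self, diag):
        self.diag = list(diag)

    def solve(self, rhs):
        return [r / d for r, d in zip(rhs, self.diag)]


class DenseBlock:
    """Block stored as a full matrix; solved by Gaussian elimination."""

    def __init__(self, rows):
        self.rows = [list(r) for r in rows]

    def solve(self, rhs):
        n = len(rhs)
        # Work on an augmented copy so the stored block is left unchanged.
        a = [row[:] + [b] for row, b in zip(self.rows, rhs)]
        for k in range(n):
            # Partial pivoting: bring the largest entry of column k to row k.
            p = max(range(k, n), key=lambda r: abs(a[r][k]))
            a[k], a[p] = a[p], a[k]
            for r in range(k + 1, n):
                f = a[r][k] / a[k][k]
                for c in range(k, n + 1):
                    a[r][c] -= f * a[k][c]
        x = [0.0] * n
        for k in reversed(range(n)):
            s = sum(a[k][c] * x[c] for c in range(k + 1, n))
            x[k] = (a[k][n] - s) / a[k][k]
        return x


class BlockJacobi:
    """Written once; works with any block object that has solve(rhs)."""

    def __init__(self, diagonal_blocks, sizes):
        self.blocks = list(diagonal_blocks)
        self.sizes = list(sizes)

    def apply(self, r):
        z, start = [], 0
        for block, size in zip(self.blocks, self.sizes):
            z.extend(block.solve(r[start:start + size]))
            start += size
        return z
```

For example, `BlockJacobi([DiagonalBlock([2.0, 4.0]), DenseBlock([[2.0, 1.0], [1.0, 3.0]])], [2, 2]).apply([2.0, 4.0, 3.0, 4.0])` mixes two block types in one preconditioner. Swapping an exact block solve for an approximate one only requires another class with the same `solve` method.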
Dome: Distributed object migration environment
, 1994
Dome is an object-based parallel programming environment for heterogeneous distributed networks of machines. This paper gives a brief overview of Dome. We show that Dome programs are easy to write. A description of the load balancing performed in Dome is presented along with performance measurements on a cluster of DEC Alpha workstations connected by a DEC Gigaswitch. A Dome program is compared with a sequential version and one written in PVM. We also present an overview of architecture-independent checkpoint and restart in Dome. This research was sponsored by the National Science Foundation and the Defense Advanced Research Projects Agency under Cooperative Agreement NCR8919038 with the Corporation for National Research Initiatives. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of NSF, CNRI, ARPA, or the U.S. Government. Keywords: Heterogeneous parallel ...